If you want text without tags, sometimes it's easier to use a text-based web
browser, e.g.:

#!/bin/sh
# for mutt to view html e-mails
# where html2txt is a shell script that performs the conversion, e.g. by
# calling
#
#   links -html-numbered-links 1 -html-images 1 -dump "file://$@"
#
# or
#
#   lynx -force_html -dump "$@"
#
# or
#
#   w3m -T text/html -F -dump "$@"
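If you'd rather stay in Python for the conversion, the standard library's
html.parser module can strip tags more robustly than scanning for '<' and
'>' by hand (it copes with attributes, comments, and entity references).
A minimal sketch; the TagStripper and strip_tags names below are my own,
not anything from the original post:

from html.parser import HTMLParser

class TagStripper(HTMLParser):
    # Collect only the text content; tags and their attributes are skipped.
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def get_text(self):
        return ''.join(self.chunks)

def strip_tags(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return parser.get_text()

print(strip_tags('<p>Hello <b>world</b>!</p>'))  # prints: Hello world!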
On Tue, Mar 9, 2021 at 1:26 PM S Monzur <sb.mon...@gmail.com> wrote:

> Dear List,
>
> Newbie here. I am trying to loop over a text file to remove html tags,
> punctuation marks, and stopwords. I have already used Beautiful Soup
> (Python v 3.8.3) to scrape the text (newspaper articles) from the site.
> It returns a list that I saved as a file. However, I am not sure how to
> use a loop to process all the items in the text file.
>
> In the code below I have used listfilereduced.txt (containing data from
> one news article, link to listfilereduced.txt here
> <https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing>),
> however I would like to run this code on listfile.txt (containing data
> from multiple articles, link to listfile.txt
> <https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing>).
>
> Any help would be greatly appreciated!
>
> P.S. The text is in a non-English script, but the tags are all in English.
>
> #The code below is for a textfile containing just one item. I am not
> #sure how to tweak this to make it run for listfile.txt (which contains
> #raw data from multiple articles)
> with open('listfilereduced.txt', 'r', encoding='utf8') as my_file:
>     rawData = my_file.read()
> print(rawData)
>
> #Separating body text from other data
> articleStart = rawData.find("<div class=\"story-element story-element-text\">")
> articleData = rawData[:articleStart]
> articleBody = rawData[articleStart:]
> print(articleData)
> print("*******")
> print(articleBody)
> print("*******")
>
> #First, I define a function to strip tags from the body text
> def stripTags(pageContents):
>     insideTag = 0
>     text = ''
>     for char in pageContents:
>         if char == '<':
>             insideTag = 1
>         elif (insideTag == 1 and char == '>'):
>             insideTag = 0
>         elif insideTag == 1:
>             continue
>         else:
>             text += char
>     return text
>
> #Calling the function
> articleBodyText = stripTags(articleBody)
> print(articleBodyText)
>
> ##Isolating article title and publication date
> TitleEndLoc = articleData.find("</h1>")
> dateStartLoc = articleData.find("<div class=\"storyPageMetaData-m__publish-time__19bdV\">")
> dateEndLoc = articleData.find("<div class=\"meta-data-icons storyPageMetaDataIcons-m__icons__3E4Xg\">")
> titleString = articleData[:TitleEndLoc]
> dateString = articleData[dateStartLoc:dateEndLoc]
>
> ##Call stripTags to clean
> articleTitle = stripTags(titleString)
> articleDate = stripTags(dateString)
> print(articleTitle)
> print(articleDate)
>
> #Cleaning the date a bit more
> startLocDate = articleDate.find(":")
> endLocDate = articleDate.find(",")
> articleDateClean = articleDate[startLocDate+2:endLocDate]
> print(articleDateClean)
>
> #save all this data to a dictionary that saves the title, date and the
> #body text
> PAloTextDict = {"Title": articleTitle, "Date": articleDateClean,
>                 "Text": articleBodyText}
> print(PAloTextDict)
>
> #Normalize text by:
> #1. Splitting paragraphs of text into lists of words
> articleBodyWordList = articleBodyText.split()
> print(articleBodyWordList)
>
> #2. Removing punctuation and stopwords
> from bnlp.corpus import stopwords, punctuations
>
> #A. Remove punctuation first
> listNoPunct = []
> for word in articleBodyWordList:
>     for mark in punctuations:
>         word = word.replace(mark, '')
>     listNoPunct.append(word)
> print(listNoPunct)
> #B. Removing stopwords
> banglastopwords = stopwords()
> print(banglastopwords)
>
> cleanList = []
> for word in listNoPunct:
>     if word in banglastopwords:
>         continue
>     else:
>         cleanList.append(word)
> print(cleanList)
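To the original question (running the same steps over every article in
listfile.txt rather than just one), the usual pattern is to bundle the
single-article steps into a function and call it once per record. A sketch
under one stated assumption: I don't know how the articles in listfile.txt
are separated, so ARTICLE_DELIMITER below is a hypothetical placeholder;
substitute whatever string actually sits between records in that file. The
process_article name is mine, and stripTags is repeated from the post so
the sketch runs on its own:

def stripTags(pageContents):
    # Same character-scanning tag stripper as in the post.
    insideTag = 0
    text = ''
    for char in pageContents:
        if char == '<':
            insideTag = 1
        elif insideTag == 1 and char == '>':
            insideTag = 0
        elif insideTag == 1:
            continue
        else:
            text += char
    return text

# Hypothetical: replace with the real separator between articles.
ARTICLE_DELIMITER = '\n=====\n'

def process_article(rawData):
    # The single-article steps from the post, bundled into one function.
    articleStart = rawData.find('<div class="story-element story-element-text">')
    articleData = rawData[:articleStart]
    articleBody = rawData[articleStart:]
    title = stripTags(articleData[:articleData.find('</h1>')])
    return {'Title': title, 'Text': stripTags(articleBody)}

with open('listfile.txt', 'r', encoding='utf8') as my_file:
    rawData = my_file.read()

allArticles = [process_article(chunk)
               for chunk in rawData.split(ARTICLE_DELIMITER)
               if chunk.strip()]
print(len(allArticles), 'articles processed')

Each element of allArticles is then a dictionary like the PAloTextDict in
the post, and the date cleanup, punctuation, and stopword steps can run
inside process_article in exactly the same way.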
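One more note: since Beautiful Soup is already used upstream for the
scraping, it can also do the tag stripping and spare you the hand-written
HTML parsing entirely. A short example with bs4's get_text():

from bs4 import BeautifulSoup

html = '<h1>Title</h1><div class="story-element story-element-text">Body text</div>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.get_text(separator=' ', strip=True))  # prints: Title Body text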