Thank you and apologies! I did not realize how jumbled it was at the receiver's end.
The code is now on Pastebin: https://pastebin.com/wSi2xzBh

I'm basically trying to do a few things with my code:

1. Extract 3 strings from the text: title, date and main text.
2. Remove all tags afterwards.
3. Save them in a dictionary with three keys: title, date and bodytext.
4. Remove punctuation and stopwords (I've used a user-defined function for
   that).

I've been able to do all of these steps for the file ListFileReduced
<https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing>,
as shown in the code (although it's clunky). However, I would like to be able
to do it for the other text file, ListFile
<https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing>,
which has more articles.

I used BeautifulSoup to scrape the data from the website, and then generated
a list that I saved as a text file.

Best,
Monzur

On Wed, Mar 10, 2021 at 4:00 AM Dan Ciprus (dciprus) <dcip...@cisco.com> wrote:

> If you could utilize pastebin or a similar site to show your code, it would
> help tremendously since it's an unindented mess now and cannot be read
> easily.
>
> On Wed, Mar 10, 2021 at 03:07:14AM +0600, S Monzur wrote:
> >Dear List,
> >
> >Newbie here. I am trying to loop over a text file to remove html tags,
> >punctuation marks, stopwords. I have already used Beautiful Soup (Python v
> >3.8.3) to scrape the text (newspaper articles) from the site. It returns a
> >list that I saved as a file. However, I am not sure how to use a loop in
> >order to process all the items in the text file.
> >
> >In the code below I have used listfilereduced.text (containing data from one
> >news article, link to listfilereduced.txt here
> ><https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing>),
> >however I would like to run this code on listfile.text (containing data from
> >multiple articles, link to listfile.text
> ><https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing>).
> >
> >Any help would be greatly appreciated!
> >
> >P.S. The text is in a Non-English script, but the tags are all in English.
> >
> >#The code below is for a textfile containing just one item. I am not sure
> >#how to tweak this to make it run for listfile.text (which contains raw data
> >#from multiple articles)
> >with open('listfilereduced.txt', 'r', encoding='utf8') as my_file:
> >    rawData = my_file.read()
> >print(rawData)
> >
> >#Separating body text from other data
> >articleStart = rawData.find("<div class=\"story-element story-element-text\">")
> >articleData = rawData[:articleStart]
> >articleBody = rawData[articleStart:]
> >print(articleData)
> >print("*******")
> >print(articleBody)
> >print("*******")
> >
> >#First, I define a function to strip tags from the body text
> >def stripTags(pageContents):
> >    insideTag = 0
> >    text = ''
> >    for char in pageContents:
> >        if char == '<':
> >            insideTag = 1
> >        elif (insideTag == 1 and char == '>'):
> >            insideTag = 0
> >        elif insideTag == 1:
> >            continue
> >        else:
> >            text += char
> >    return text
> >
> >#Calling the function
> >articleBodyText = stripTags(articleBody)
> >print(articleBodyText)
> >
> >##Isolating article title and publication date
> >TitleEndLoc = articleData.find("</h1>")
> >dateStartLoc = articleData.find("<div class=\"storyPageMetaData-m__publish-time__19bdV\">")
> >dateEndLoc = articleData.find("<div class=\"meta-data-icons storyPageMetaDataIcons-m__icons__3E4Xg\">")
> >titleString = articleData[:TitleEndLoc]
> >dateString = articleData[dateStartLoc:dateEndLoc]
> >
> >##Call stripTags to clean
> >articleTitle = stripTags(titleString)
> >articleDate = stripTags(dateString)
> >print(articleTitle)
> >print(articleDate)
> >
> >#Cleaning the date a bit more
> >startLocDate = articleDate.find(":")
> >endLocDate = articleDate.find(",")
> >articleDateClean = articleDate[startLocDate+2:endLocDate]
> >print(articleDateClean)
> >
> >#save all this data to a dictionary that saves the title, data and the body text
> >PAloTextDict = {"Title": articleTitle, "Date": articleDateClean, "Text": articleBodyText}
> >print(PAloTextDict)
> >
> >#Normalize text by:
> >#1. Splitting paragraphs of text into lists of words
> >articleBodyWordList = articleBodyText.split()
> >print(articleBodyWordList)
> >
> >#2. Removing punctuation and stopwords
> >from bnlp.corpus import stopwords, punctuations
> >
> >#A. Remove punctuation first
> >listNoPunct = []
> >for word in articleBodyWordList:
> >    for mark in punctuations:
> >        word = word.replace(mark, '')
> >    listNoPunct.append(word)
> >print(listNoPunct)
> >
> >#B. removing stopwords
> >banglastopwords = stopwords()
> >print(banglastopwords)
> >cleanList = []
> >for word in listNoPunct:
> >    if word in banglastopwords:
> >        continue
> >    else:
> >        cleanList.append(word)
> >print(cleanList)
> >--
> >https://mail.python.org/mailman/listinfo/python-list
>
> --
> Daniel Ciprus .:|:.:|:.
> CONSULTING ENGINEER.CUSTOMER DELIVERY Cisco Systems Inc.
>
> dcip...@cisco.com
>
> tel: +1 703 484 0205
> mob: +1 540 223 7098
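
For reference, here is a rough sketch of how the four steps described at the top of this message (extract title, date and body; strip tags; store them in a dictionary; remove punctuation and stopwords) might be wrapped in functions and run once per article. It is only an illustration under several assumptions that are not confirmed by the files: that every article in the larger file starts with an `<h1` tag, so the raw text can be split there; that the same marker strings as in the single-article code apply to every article; and that `string.punctuation` can stand in for the `punctuations` list from bnlp. The names `strip_tags`, `parse_article`, `clean_words`, `parse_all` and the extra "Words" key are hypothetical, and the additional date clean-up from the original code is left out.

```python
import re
import string

# Marker strings copied from the single-article code.
BODY_MARKER = '<div class="story-element story-element-text">'
DATE_MARKER = '<div class="storyPageMetaData-m__publish-time__19bdV">'
ICONS_MARKER = '<div class="meta-data-icons storyPageMetaDataIcons-m__icons__3E4Xg">'


def strip_tags(html):
    """Remove anything between '<' and '>' (roughly what stripTags does)."""
    return re.sub(r'<[^>]*>', '', html)


def parse_article(raw):
    """Return a dict with Title, Date and Text, or None if no body is found."""
    body_start = raw.find(BODY_MARKER)
    if body_start == -1:
        return None
    meta, body = raw[:body_start], raw[body_start:]

    title_end = meta.find('</h1>')
    title = strip_tags(meta[:title_end]) if title_end != -1 else ''

    date = ''
    date_start = meta.find(DATE_MARKER)
    if date_start != -1:
        date_end = meta.find(ICONS_MARKER)
        if date_end == -1:
            date_end = len(meta)
        date = strip_tags(meta[date_start:date_end])

    return {"Title": title.strip(), "Date": date.strip(),
            "Text": strip_tags(body).strip()}


def clean_words(text, stopwords=(), punctuation=string.punctuation):
    """Split text into words, drop punctuation marks, then drop stopwords."""
    words = []
    for word in text.split():
        for mark in punctuation:
            word = word.replace(mark, '')
        if word and word not in stopwords:
            words.append(word)
    return words


def parse_all(raw, stopwords=()):
    """Split the raw text before each '<h1' (an assumption) and parse each piece."""
    articles = []
    for chunk in re.split(r'(?=<h1)', raw):
        article = parse_article(chunk)
        if article is not None:
            # "Words" is an extra, hypothetical key holding the cleaned word list.
            article["Words"] = clean_words(article["Text"], stopwords)
            articles.append(article)
    return articles


# Possible usage (file name and bnlp calls as in the original code, untested):
#   from bnlp.corpus import stopwords, punctuations
#   with open('listfile.txt', 'r', encoding='utf8') as my_file:
#       articles = parse_all(my_file.read(), stopwords=set(stopwords()))
```

If the articles in the saved file are separated in some other way (for example, as items of a Python list written out as text), only the splitting line in `parse_all` would need to change; the per-article function stays the same.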