No problem, the list just converts everything into plain text, which is GREAT! :-)
So, without digging deeply into what you need to do: I am assuming that your input contains HTML tags. Why not use a library such as Beautiful Soup (https://pypi.org/project/beautifulsoup4/) instead of committing harakiri by parsing the data by hand, without even using regex? Just a hint ..
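To make the hint a bit more concrete, here is a minimal sketch of the loop over several articles. It uses only the standard library's html.parser so that it runs without installing anything; with Beautiful Soup, get_text() would take the place of the TagStripper class. I have not looked at your file, so treat these as assumptions: that every article starts with an "<h1" title tag, and that the div markers copied from your code appear in every article. The names TagStripper, between, split_articles and parse_article are made up for illustration.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text that sits between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return html with all tags removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()


# Assumption: each article begins with its <h1 ...> title tag.
ARTICLE_START = "<h1"
# Markers taken from the code in your mail.
BODY_START = '<div class="story-element story-element-text">'
DATE_START = '<div class="storyPageMetaData-m__publish-time__19bdV">'
DATE_END = '<div class="meta-data-icons storyPageMetaDataIcons-m__icons__3E4Xg">'


def between(text, start_marker, end_marker):
    """Return text from start_marker up to end_marker, or '' if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return ""
    end = text.find(end_marker, start + 1)
    if end == -1:
        return ""
    return text[start:end]


def split_articles(raw):
    """Cut the raw file content into one chunk per article."""
    chunks = raw.split(ARTICLE_START)
    # chunks[0] is whatever precedes the first title, so it is skipped.
    return [ARTICLE_START + chunk for chunk in chunks[1:]]


def parse_article(chunk):
    """Build the Title/Date/Text dictionary for a single article chunk."""
    body_pos = chunk.find(BODY_START)
    body = chunk[body_pos:] if body_pos != -1 else ""
    return {
        "Title": strip_tags(between(chunk, ARTICLE_START, "</h1>")),
        "Date": strip_tags(between(chunk, DATE_START, DATE_END)),
        "Text": strip_tags(body),
    }


# Usage with your file (not run here):
# with open('listfile.txt', 'r', encoding='utf8') as my_file:
#     articles = [parse_article(c) for c in split_articles(my_file.read())]
```

Each dictionary in the resulting list can then go through your existing date clean-up, punctuation and stopword steps inside an ordinary for loop. If the articles in your file are separated differently, only split_articles needs to change.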
On Wed, Mar 10, 2021 at 04:22:19AM +0600, S Monzur wrote:
Thank you and apologies! I did not realize how jumbled it was at the receiver's end. The code is now at this site: [1]https://pastebin.com/wSi2xzBh

I'm basically trying to do a few things with my code:

1. Extract 3 strings from the text: title, date and main text
2. Remove all tags afterwards
3. Save in a dictionary, with three keys: title, date and bodytext.
4. Remove punctuation and stopwords (I've used a user generated function for that).

I've been able to do all of these steps for the file [2]ListFileReduced, as shown in the code (although it's clunky). But, I would like to be able to do it for the other text file: [3]ListFile which has more articles. I used BeautifulSoup to scrape the data from the website, and then generated a list that I saved as a text file.

Best,
Monzur

On Wed, Mar 10, 2021 at 4:00 AM Dan Ciprus (dciprus) <[4]dcip...@cisco.com> wrote:

If you could utilize pastebin or similar site to show your code, it would help tremendously since it's an unindented mess now and cannot be read easily.

On Wed, Mar 10, 2021 at 03:07:14AM +0600, S Monzur wrote:
>Dear List,
>
>Newbie here. I am trying to loop over a text file to remove html tags,
>punctuation marks, stopwords. I have already used Beautiful Soup (Python v
>3.8.3) to scrape the text (newspaper articles) from the site. It returns a
>list that I saved as a file. However, I am not sure how to use a loop in
>order to process all the items in the text file.
>
>In the code below I have used listfilereduced.text(containing data from one
>news article, link to listfilereduced.txt here
><[5]https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing>),
>however I would like to run this code on listfile.text(containing data from
>multiple articles, link to listfile.text
><[6]https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing>
>).
>
>
>Any help would be greatly appreciated!
>
>P.S. The text is in a Non-English script, but the tags are all in English.
>
>
>#The code below is for a textfile containing just one item. I am not sure
>#how to tweak this to make it run for listfile.text (which contains raw data
>#from multiple articles)
>with open('listfilereduced.txt', 'r', encoding='utf8') as my_file:
>    rawData = my_file.read()
>print(rawData)
>
>#Separating body text from other data
>articleStart = rawData.find("<div class=\"story-element story-element-text\">")
>articleData = rawData[:articleStart]
>articleBody = rawData[articleStart:]
>print(articleData)
>print("*******")
>print(articleBody)
>print("*******")
>
>#First, I define a function to strip tags from the body text
>def stripTags(pageContents):
>    insideTag = 0
>    text = ''
>    for char in pageContents:
>        if char == '<':
>            insideTag = 1
>        elif (insideTag == 1 and char == '>'):
>            insideTag = 0
>        elif insideTag == 1:
>            continue
>        else:
>            text += char
>    return text
>
>#Calling the function
>articleBodyText = stripTags(articleBody)
>print(articleBodyText)
>
>##Isolating article title and publication date
>TitleEndLoc = articleData.find("</h1>")
>dateStartLoc = articleData.find("<div class=\"storyPageMetaData-m__publish-time__19bdV\">")
>dateEndLoc=articleData.find("<div class=\"meta-data-icons storyPageMetaDataIcons-m__icons__3E4Xg\">")
>titleString = articleData[:TitleEndLoc]
>dateString = articleData[dateStartLoc:dateEndLoc]
>
>##Call stripTags to clean
>articleTitle= stripTags(titleString)
>articleDate = stripTags(dateString)
>print(articleTitle)
>print(articleDate)
>
>#Cleaning the date a bit more
>startLocDate = articleDate.find(":")
>endLocDate = articleDate.find(",")
>articleDateClean = articleDate[startLocDate+2:endLocDate]
>print(articleDateClean)
>
>#save all this data to a dictionary that saves the title, data and the body text
>PAloTextDict = {"Title": articleTitle, "Date": articleDateClean, "Text": articleBodyText}
>print(PAloTextDict)
>
>#Normalize text by:
>#1. Splitting paragraphs of text into lists of words
>articleBodyWordList = articleBodyText.split()
>print(articleBodyWordList)
>
>#2.Removing punctuation and stopwords
>from bnlp.corpus import stopwords, punctuations
>
>#A. Remove punctuation first
>listNoPunct = []
>for word in articleBodyWordList:
>    for mark in punctuations:
>        word=word.replace(mark, '')
>    listNoPunct.append(word)
>print(listNoPunct)
>
>#B. removing stopwords
>banglastopwords = stopwords()
>print(banglastopwords)
>cleanList=[]
>for word in listNoPunct:
>    if word in banglastopwords:
>        continue
>    else:
>        cleanList.append(word)
>print(cleanList)
>--
>[7]https://mail.python.org/mailman/listinfo/python-list

--
Daniel Ciprus .:|:.:|:.
CONSULTING ENGINEER.CUSTOMER DELIVERY Cisco Systems Inc.
[8]dcip...@cisco.com
tel: +1 703 484 0205
mob: +1 540 223 7098

References

Visible links
1. https://pastebin.com/wSi2xzBh
2. https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing
3. https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing
4. mailto:dcip...@cisco.com
5. https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing
6. https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing
7. https://mail.python.org/mailman/listinfo/python-list
8. mailto:dcip...@cisco.com
-- https://mail.python.org/mailman/listinfo/python-list