If you could use pastebin or a similar site to show your code, it would help
tremendously; as posted it is an unindented mess and cannot be read easily.

On Wed, Mar 10, 2021 at 03:07:14AM +0600, S Monzur wrote:
Dear List,

Newbie here. I am trying to loop over a text file to remove HTML tags,
punctuation marks, and stopwords. I have already used Beautiful Soup
(Python 3.8.3) to scrape the text (newspaper articles) from the site; it
returns a list that I saved to a file. However, I am not sure how to use a
loop to process all the items in the text file.

In the code below I have used listfilereduced.txt (containing data from one
news article, link to listfilereduced.txt here
<https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing>);
however, I would like to run this code on listfile.txt (containing data from
multiple articles, link to listfile.txt here
<https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing>).
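
(For illustration only, not code from this thread: since Beautiful Soup
returned the articles as a Python list, one option is to save that list in a
format that preserves the list structure, so it can be re-loaded and looped
over directly. In the sketch below, articleList and listfile.json are
hypothetical names.)

import json

# Hypothetical list of raw article strings, standing in for the scraped data.
articleList = ["<html>...article 1...</html>", "<html>...article 2...</html>"]

# Save the list; ensure_ascii=False keeps the non-English script readable.
with open('listfile.json', 'w', encoding='utf8') as f:
    json.dump(articleList, f, ensure_ascii=False)

# Re-load it later as a list and loop over one article at a time.
with open('listfile.json', 'r', encoding='utf8') as f:
    articles = json.load(f)

for rawData in articles:
    print(rawData[:80])  # each iteration sees one article's raw HTML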


Any help would be greatly appreciated!

P.S. The text is in a non-English script, but the tags are all in English.


# The code below is for a text file containing just one item. I am not sure how
# to tweak this to make it run for listfile.txt (raw data from multiple articles).
with open('listfilereduced.txt', 'r', encoding='utf8') as my_file:
    rawData = my_file.read()
print(rawData)
# Separating body text from other data
articleStart = rawData.find("<div class=\"story-element story-element-text\">")
articleData = rawData[:articleStart]
articleBody = rawData[articleStart:]
print(articleData)
print("*******")
print(articleBody)
print("*******")
# First, I define a function to strip tags from the body text
def stripTags(pageContents):
    insideTag = 0
    text = ''
    for char in pageContents:
        if char == '<':
            insideTag = 1
        elif insideTag == 1 and char == '>':
            insideTag = 0
        elif insideTag == 1:
            continue
        else:
            text += char
    return text
# Calling the function
articleBodyText = stripTags(articleBody)
print(articleBodyText)
## Isolating article title and publication date
TitleEndLoc = articleData.find("</h1>")
dateStartLoc = articleData.find("<div class=\"storyPageMetaData-m__publish-time__19bdV\">")
dateEndLoc = articleData.find("<div class=\"meta-data-icons storyPageMetaDataIcons-m__icons__3E4Xg\">")
titleString = articleData[:TitleEndLoc]
dateString = articleData[dateStartLoc:dateEndLoc]
## Call stripTags to clean
articleTitle = stripTags(titleString)
articleDate = stripTags(dateString)
print(articleTitle)
print(articleDate)
# Cleaning the date a bit more
startLocDate = articleDate.find(":")
endLocDate = articleDate.find(",")
articleDateClean = articleDate[startLocDate + 2:endLocDate]
print(articleDateClean)
# Save all this data to a dictionary that holds the title, date and the body text
PAloTextDict = {"Title": articleTitle, "Date": articleDateClean, "Text": articleBodyText}
print(PAloTextDict)
# Normalize text by:
# 1. Splitting paragraphs of text into lists of words
articleBodyWordList = articleBodyText.split()
print(articleBodyWordList)
# 2. Removing punctuation and stopwords
from bnlp.corpus import stopwords, punctuations
# A. Remove punctuation first
listNoPunct = []
for word in articleBodyWordList:
    for mark in punctuations:
        word = word.replace(mark, '')
    listNoPunct.append(word)
print(listNoPunct)
# B. Removing stopwords
banglastopwords = stopwords()
print(banglastopwords)
cleanList = []
for word in listNoPunct:
    if word in banglastopwords:
        continue
    else:
        cleanList.append(word)
print(cleanList)
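
(One possible way to adapt this for listfile.txt with multiple articles, as a
minimal sketch rather than a tested solution: wrap the per-article steps above
in a function and call it once per article. processArticle is a hypothetical
name, stripTags is the function defined above, and articles is assumed to be a
list of raw article strings, e.g. re-loaded as sketched earlier in the thread.)

def processArticle(rawData):
    # Same splitting logic as above, applied to one article's raw text.
    articleStart = rawData.find("<div class=\"story-element story-element-text\">")
    articleData = rawData[:articleStart]
    articleBody = rawData[articleStart:]
    title = stripTags(articleData[:articleData.find("</h1>")])
    return {"Title": title, "Text": stripTags(articleBody)}

# Build one dictionary per article; the punctuation and stopword steps can then
# be applied to each item of allArticles in a single loop.
allArticles = [processArticle(rawData) for rawData in articles]
for item in allArticles:
    print(item["Title"])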

--

Daniel Ciprus                              .:|:.:|:.
CONSULTING ENGINEER.CUSTOMER DELIVERY   Cisco Systems Inc.

dcip...@cisco.com

tel: +1 703 484 0205
mob: +1 540 223 7098


-- 
https://mail.python.org/mailman/listinfo/python-list
