Re: How to loop over a text file (to remove tags and normalize) using Python

Dan Ciprus (dciprus) via Python-list Tue, 09 Mar 2021 14:36:28 -0800

No problem, the list just converts everything into plain text, which is GREAT! :-)

So without digging deeply into what you need to do: I am assuming that your input contains HTML tags. Why don't you use a library like https://pypi.org/project/beautifulsoup4/ instead of doing harakiri by parsing the data by hand, without even using regex? Just a hint ...
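
To make the hint concrete, here is a minimal sketch of letting a real HTML parser drop the tags instead of walking the string character by character. It uses the standard library's html.parser as a stand-in so it runs without installing anything; it is not BeautifulSoup itself, and the names TextExtractor and strip_tags are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every run of text outside a tag.
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Invented sample input, only for demonstration.
sample = '<div class="story-element story-element-text"><p>Hello, <b>world</b> &amp; all</p></div>'
print(strip_tags(sample))  # Hello, world & all
```

With BeautifulSoup installed, BeautifulSoup(html, "html.parser").get_text() should give a comparable result in a single call.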


On Wed, Mar 10, 2021 at 04:22:19AM +0600, S Monzur wrote:

  Thank you and apologies! I did not realize how jumbled it was at the
  receiver's end.
  The code is now at this site: [1]https://pastebin.com/wSi2xzBh
  I'm basically trying to do a few things with my code:

   1. Extract 3 strings from the text- title, date and main text

   2. Remove all tags afterwards

   3. Save in a dictionary, with three keys- title, date and bodytext.

   4. Remove punctuation and stopwords (I've used a user-defined function
      for that).
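
Steps 1 to 3 can be sketched as a single function that takes the raw text of one article and returns the dictionary. The marker strings are the ones used in the code quoted further down; the sample article, the regex-based strip_tags and the names parse_article and find_marker are assumptions made for illustration, not the code from the pastebin.

```python
import re

# Marker strings taken from the quoted code in this thread.
BODY_MARK = '<div class="story-element story-element-text">'
DATE_MARK = '<div class="storyPageMetaData-m__publish-time__19bdV">'
ICONS_MARK = '<div class="meta-data-icons storyPageMetaDataIcons-m__icons__3E4Xg">'


def strip_tags(fragment):
    """Remove anything that looks like <...> and trim surrounding whitespace."""
    return re.sub(r"<[^>]*>", "", fragment).strip()


def find_marker(text, marker):
    """Like str.find, but fail loudly instead of returning -1."""
    pos = text.find(marker)
    if pos == -1:
        raise ValueError(f"marker not found: {marker!r}")
    return pos


def parse_article(raw):
    """Return {'Title', 'Date', 'Text'} for the raw HTML of one article."""
    body_start = find_marker(raw, BODY_MARK)
    meta, body = raw[:body_start], raw[body_start:]
    title = meta[:find_marker(meta, "</h1>")]
    date = meta[find_marker(meta, DATE_MARK):find_marker(meta, ICONS_MARK)]
    return {
        "Title": strip_tags(title),
        "Date": strip_tags(date),
        "Text": strip_tags(body),
    }


# Invented sample article, only for demonstration.
sample = (
    "<h1>Sample title</h1>"
    + DATE_MARK + "Published: 09 Mar 2021, 14:36</div>"
    + ICONS_MARK + "</div>"
    + BODY_MARK + "<p>First paragraph.</p></div>"
)
print(parse_article(sample))
```

Raising on a missing marker is a deliberate choice here: str.find returns -1 when nothing matches, and slicing with -1 would silently cut off the last character instead of failing.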

  I've been able to do all of these steps for the file [2]ListFileReduced,
  as shown in the code (although it's clunky).
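
Step 4 is one place where the clunkiness could probably be reduced. A minimal sketch, assuming the punctuation marks are single characters and the stopwords are available as a plain collection of strings: the function name normalize and the English sample data are hypothetical, and with bnlp the stopword and punctuation lists from that library would be passed in instead.

```python
import string


def normalize(text, stopwords, punctuation=string.punctuation):
    """Split text into words, strip punctuation, drop stopwords.

    ASSUMPTION: punctuation is a string of single characters; each one is deleted.
    """
    # Translation table that deletes every punctuation character in one pass.
    table = str.maketrans("", "", punctuation)
    stop = set(stopwords)  # set membership tests are fast
    cleaned = (word.translate(table) for word in text.split())
    # Words that consisted only of punctuation become empty and are skipped too.
    return [word for word in cleaned if word and word not in stop]


# Invented English sample; the real data is in a non-English script.
print(normalize("a cat, a dog; and a bird!", {"a", "and"}))  # ['cat', 'dog', 'bird']
```

The comparison is case-sensitive, as in the quoted code, so stopwords have to match the casing of the text.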

  But, I would like to be able to do it for the other text file: [3]ListFile
  which has more articles. I used BeautifulSoup to scrape the data from the
  website, and then generated a list that I saved as a text file. 
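
Going from one article to many mostly means wrapping the per-article work in a function and calling it in a loop. The sketch below assumes that every article in the file starts with an "<h1" tag; that delimiter is a guess about the data, not something confirmed in this thread, and the names split_articles, process_all and title_only are hypothetical.

```python
import re


def split_articles(raw, start_marker="<h1"):
    """Split raw text into one string per article.

    ASSUMPTION: every article starts with start_marker; adjust to the real data.
    """
    # The lookahead splits just before each marker, so the marker stays in its piece.
    pieces = re.split("(?=" + re.escape(start_marker) + ")", raw)
    # Anything before the first marker (or an empty piece) is discarded.
    return [piece for piece in pieces if piece.startswith(start_marker)]


def process_all(raw, process_one):
    """Run process_one on every article and collect the results in a list."""
    return [process_one(article) for article in split_articles(raw)]


def title_only(article):
    # Stand-in for the real per-article work (extract, strip tags, normalize).
    end = article.find("</h1>")
    return {"Title": re.sub(r"<[^>]*>", "", article[:end])}


# Invented two-article sample, only for demonstration.
raw = "<h1>First</h1><div>body one</div><h1>Second</h1><div>body two</div>"
for record in process_all(raw, title_only):
    print(record)
```

For the real file, raw would come from open('listfile.txt', encoding='utf8').read(), and title_only would be replaced by whatever function already works on the single-article file.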

  Best,
  Monzur
  On Wed, Mar 10, 2021 at 4:00 AM Dan Ciprus (dciprus)
  <[4]dcip...@cisco.com> wrote:

    If you could use pastebin or a similar site to show your code, it
    would help tremendously, since it's an unindented mess now and cannot
    be read easily.

    On Wed, Mar 10, 2021 at 03:07:14AM +0600, S Monzur wrote:
    >Dear List,
    >
    >Newbie here. I am trying to loop over a text file to remove html tags,
    >punctuation marks, stopwords. I have already used Beautiful Soup (Python
    >v3.8.3) to scrape the text (newspaper articles) from the site. It returns
    >a list that I saved as a file. However, I am not sure how to use a loop in
    >order to process all the items in the text file.
    >
    >In the code below I have used listfilereduced.text (containing data from
    >one news article, link to listfilereduced.txt here
    ><[5]https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing>),
    >however I would like to run this code on listfile.text (containing data
    >from multiple articles, link to listfile.text
    ><[6]https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing>).
    >
    >Any help would be greatly appreciated!
    >
    >P.S. The text is in a Non-English script, but the tags are all in English.
    >
    >#The code below is for a textfile containing just one item. I am not sure
    >#how to tweak this to make it run for listfile.text (which contains raw
    >#data from multiple articles)
    >with open('listfilereduced.txt', 'r', encoding='utf8') as my_file:
    >    rawData = my_file.read()
    >print(rawData)
    >
    >#Separating body text from other data
    >articleStart = rawData.find("<div class=\"story-element story-element-text\">")
    >articleData = rawData[:articleStart]
    >articleBody = rawData[articleStart:]
    >print(articleData)
    >print("*******")
    >print(articleBody)
    >print("*******")
    >
    >#First, I define a function to strip tags from the body text
    >def stripTags(pageContents):
    >    insideTag = 0
    >    text = ''
    >    for char in pageContents:
    >        if char == '<':
    >            insideTag = 1
    >        elif (insideTag == 1 and char == '>'):
    >            insideTag = 0
    >        elif insideTag == 1:
    >            continue
    >        else:
    >            text += char
    >    return text
    >
    >#Calling the function
    >articleBodyText = stripTags(articleBody)
    >print(articleBodyText)
    >
    >##Isolating article title and publication date
    >TitleEndLoc = articleData.find("</h1>")
    >dateStartLoc = articleData.find("<div class=\"storyPageMetaData-m__publish-time__19bdV\">")
    >dateEndLoc = articleData.find("<div class=\"meta-data-icons storyPageMetaDataIcons-m__icons__3E4Xg\">")
    >titleString = articleData[:TitleEndLoc]
    >dateString = articleData[dateStartLoc:dateEndLoc]
    >
    >##Call stripTags to clean
    >articleTitle = stripTags(titleString)
    >articleDate = stripTags(dateString)
    >print(articleTitle)
    >print(articleDate)
    >
    >#Cleaning the date a bit more
    >startLocDate = articleDate.find(":")
    >endLocDate = articleDate.find(",")
    >articleDateClean = articleDate[startLocDate+2:endLocDate]
    >print(articleDateClean)
    >
    >#save all this data to a dictionary that saves the title, data and the body text
    >PAloTextDict = {"Title": articleTitle, "Date": articleDateClean, "Text": articleBodyText}
    >print(PAloTextDict)
    >
    >#Normalize text by:
    >#1. Splitting paragraphs of text into lists of words
    >articleBodyWordList = articleBodyText.split()
    >print(articleBodyWordList)
    >
    >#2.Removing punctuation and stopwords
    >from bnlp.corpus import stopwords, punctuations
    >
    >#A. Remove punctuation first
    >listNoPunct = []
    >for word in articleBodyWordList:
    >    for mark in punctuations:
    >        word = word.replace(mark, '')
    >    listNoPunct.append(word)
    >print(listNoPunct)
    >
    >#B. removing stopwords
    >banglastopwords = stopwords()
    >print(banglastopwords)
    >cleanList = []
    >for word in listNoPunct:
    >    if word in banglastopwords:
    >        continue
    >    else:
    >        cleanList.append(word)
    >print(cleanList)
    >--
    >[7]https://mail.python.org/mailman/listinfo/python-list

    --

    Daniel Ciprus                              .:|:.:|:.
    CONSULTING ENGINEER.CUSTOMER DELIVERY   Cisco Systems Inc.

    [8]dcip...@cisco.com

    tel: +1 703 484 0205
    mob: +1 540 223 7098

References

  Visible links
  1. https://pastebin.com/wSi2xzBh
  2. https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing
  3. https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing
  4. mailto:dcip...@cisco.com
  5. https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing
  6. https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing
  7. https://mail.python.org/mailman/listinfo/python-list
  8. mailto:dcip...@cisco.com


