I initially scraped the links using Beautiful Soup, then downloaded from those links the specific content of the articles I was interested in (titles, dates, contributor names, main text) and stored that information in a list. I then saved the list to a text file: https://pastebin.com/8BMi9qjW . I am now trying to remove the HTML tags from this text file and am running into the issues mentioned in my previous post.

On Wed, Mar 10, 2021 at 3:46 PM Peter Otten <__pete...@web.de> wrote:

> On 10/03/2021 04:35, S Monzur wrote:
> > Thanks! I ended up using beautiful soup to remove the html tags and create
> > three lists (titles of article, publications dates, main body) but am still
> > facing a problem where the list is not properly storing the main body.
> > There is something wrong with my code for that section, and any comment
> > would be really helpful!
> >
> > ListFile Text
> > <https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing>
>
> How did you create that file?
>
> > BeautifulSoup code for removing tags <https://pastebin.com/qvbVMUGD>
> >
> > print(bodytext[0]) # so here, I'm only getting the first paragraph of
> > the body of the first article, not all of the first article
> >
> > print(bodytext[1]) # here, I'm getting the second paragraph of the first
> > article, and not the second article
>
> It may help if you process the individual articles with beautiful soup,
> not the whole list at once.
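
A minimal sketch of the suggestion to process each article separately, assuming the HTML of every article is available as its own string. The names `articles_html` and `body_text` and the sample markup are hypothetical and are not taken from the linked pastebin code; the real pages may need a more specific selector than a plain `<p>` search.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data: one HTML string per article.
articles_html = [
    "<div><h1>First title</h1><p>First paragraph.</p><p>Second paragraph.</p></div>",
    "<div><h1>Second title</h1><p>Only paragraph.</p></div>",
]


def body_text(article_html):
    """Return the text of all <p> tags in one article as a single string."""
    soup = BeautifulSoup(article_html, "html.parser")
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    # Join the paragraphs so that each article yields exactly one list entry.
    return "\n".join(paragraphs)


# One entry per article, rather than one entry per paragraph.
bodytext = [body_text(html) for html in articles_html]

print(bodytext[0])  # all paragraphs of the first article
print(bodytext[1])  # the second article
```

Parsing one article at a time and joining its paragraphs before appending keeps `bodytext[i]` aligned with the i-th title and date, which is probably what the three lists are meant to guarantee.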