On 29Jul2016 18:42, Gordon Levi <gordon@address.invalid> wrote:
c...@zip.com.au wrote:

On 28Jul2016 19:28, Gordon Levi <gordon@address.invalid> wrote:
Arshpreet Singh <arsh...@gmail.com> wrote:
I am writing an IMDb scraper and getting the available list of titles from
the IMDB website, which provides a txt file in a very raw format. Here is one
part of the file (http://pastebin.com/fpMgBAjc). As the file provides tags
like Distribution, Votes, Rank, Title, I want to parse the title names. I
tried the readlines() method, but it returns only a list which is quite
heterogeneous. Is it possible to parse each value that comes under the title
section?

Beautiful Soup will make your task much easier
<https://www.crummy.com/software/BeautifulSoup/>.

Did you look at his sample data?

No. I read he was "writing an IMDB scraper, and getting the available
list of titles from the IMDB web site". It's here
<http://www.imdb.com/>.

Plain text, not HTML or XML. Beautiful Soup is
not what he needs here.
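
For plain text like that, ordinary string splitting is probably enough. A minimal sketch, assuming each data line holds four whitespace-separated columns (distribution, votes, rank, title) as the tags in the question suggest. The sample lines and the function names are made up for illustration and are not taken from the pastebin:

```python
def parse_rating_line(line):
    """Return (distribution, votes, rank, title) for a data line, else None."""
    # Split on runs of whitespace at most 3 times, so the title keeps
    # its internal spaces as the fourth field.
    parts = line.split(None, 3)
    if len(parts) != 4:
        return None  # blank or short line
    distribution, votes, rank, title = parts
    try:
        return distribution, int(votes), float(rank), title.strip()
    except ValueError:
        return None  # header or other non-data line


def parse_titles(lines):
    """Collect the title column from every line that parses as data."""
    records = (parse_rating_line(line) for line in lines)
    return [record[3] for record in records if record is not None]


# Hypothetical sample in the assumed layout; real data may differ.
sample = [
    "Distribution  Votes  Rank  Title\n",
    "0000000125  1000000   9.2  Example Movie One (1994)\n",
    "0000001222   500000   8.7  Another Example: Part II (2001)\n",
    "\n",
]

titles = parse_titles(sample)
```

With a real file, `parse_titles(open(path))` would work the same way, since a file object yields its lines one at a time. Lines in any other layout are simply skipped.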

Fortunately the OP told us his application rather than just telling us
his current problem. His life would be much easier if he ignored the
plain text he has obtained so far and started again using a Beautiful
Soup tutorial.

Or bypass IMDB's computer unfriendliness and go straight to http://omdbapi.com/

You can get JSON directly from it and avoid BS entirely. BS is an amazing library, but it is essentially a workaround for computer-hostile websites: sites that provide no clean machine-readable data, only unstable, mutable HTML output.
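
A rough sketch of what that might look like using only the standard library. The query parameters `t` (title) and `apikey`, and the `Response` field in the reply, are assumptions based on my understanding of the OMDb API, so check the site's own documentation before relying on them. The function names are made up for illustration:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OMDB_URL = "http://www.omdbapi.com/"


def build_query(title, api_key):
    """Build a title-lookup URL (parameter names are assumptions)."""
    return OMDB_URL + "?" + urlencode({"t": title, "apikey": api_key})


def parse_response(text):
    """Decode the JSON body; return None if the service reports a failure."""
    data = json.loads(text)
    # Assumed convention: failed lookups carry "Response": "False".
    if data.get("Response") == "False":
        return None
    return data


def lookup(title, api_key):
    """Fetch one title record as a dict, or None if it was not found."""
    with urlopen(build_query(title, api_key)) as resp:
        return parse_response(resp.read().decode("utf-8"))
```

Something like `lookup("The Matrix", "YOUR_KEY")` should then give you a plain dict to work with, with no HTML parsing involved.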

Cheers,
Cameron Simpson <c...@zip.com.au>
--
https://mail.python.org/mailman/listinfo/python-list
