On 27Jul2016 22:12, Arshpreet Singh <arsh...@gmail.com> wrote:
I am writing an IMDb scraper and getting the available list of titles from the IMDB website, which provides a txt file in a very raw format. Here is one part of the file (http://pastebin.com/fpMgBAjc). As the file provides tags like Distribution Votes, Rank, Title, I want to parse the title names. I tried the readlines() method, but it returns only a list, which is quite heterogeneous. Is it possible to parse each value that comes under the Title section?

Just for etiquette: please post text snippets like that inline in your message. Some people don't like fetching random URLs, and some of us are not always online when reading and replying to email. Either way, having the text in the message, especially when it is small, is preferable.

To your question:

Your sample text looks like this:

   New  Distribution  Votes  Rank  Title
     0000000125  1680661   9.2  The Shawshank Redemption (1994)
     0000000125  1149871   9.2  The Godfather (1972)
     0000000124  786433   9.0  The Godfather: Part II (1974)
     0000000124  1665643   8.9  The Dark Knight (2008)
     0000000133  860145   8.9  Schindler's List (1993)
     0000000133  444718   8.9  12 Angry Men (1957)
     0000000123  1317267   8.9  Pulp Fiction (1994)
     0000000124  1209275   8.9  The Lord of the Rings: The Return of the King (2003)
     0000000123  500803   8.9  Il buono, il brutto, il cattivo (1966)
     0000000133  1339500   8.8  Fight Club (1999)
     0000000123  1232468   8.8  The Lord of the Rings: The Fellowship of the Ring (2001)
     0000000223  832726   8.7  Star Wars: Episode V - The Empire Strikes Back (1980)
     0000000233  1243066   8.7  Forrest Gump (1994)
     0000000123  1459168   8.7  Inception (2010)
     0000000223  1094504   8.7  The Lord of the Rings: The Two Towers (2002)
     0000000232  676479   8.7  One Flew Over the Cuckoo's Nest (1975)
     0000000232  724590   8.7  Goodfellas (1990)
     0000000233  1211152   8.7  The Matrix (1999)

Firstly, I would suggest you not use readlines(); it pulls all the text into memory. For small text like this it is ok, but some files can be arbitrarily large, so it is something to avoid when convenient. Normally you can just iterate over a file and get lines.
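
As a small sketch of what iterating over a file looks like: the loop below pulls one line at a time instead of building a list of every line first. The name `count_lines` is made up for illustration, and `io.StringIO` merely stands in for a real open file so the snippet runs on its own.

```python
import io

def count_lines(fp):
    """Count lines by iterating over the file object directly."""
    total = 0
    for line in fp:      # each pass fetches just the next line
        total += 1
    return total

# io.StringIO behaves like a text file opened for reading.
sample = io.StringIO("header\nfirst row\nsecond row\n")
print(count_lines(sample))
```

With a real file you would write `with open(filename) as fp:` and run the same loop inside it.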

You want "text under the Title." Looking at it, I would be inclined to say that the first line is a header and the rest consist of 4 columns: a number (distribution?), a vote count, a rank and the rest (title plus year).

You can parse data of that shape like this (untested):

 # presumes `fp` is reading from the text
 for n, line in enumerate(fp):
   if n == 0:
     # heading, skip it
     continue
   # split on runs of whitespace at most 3 times, so the title keeps its spaces
   distnum, nvotes, rank, etc = line.strip().split(None, 3)
   ... do stuff with the various fields ...
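
For something you can run directly, here is a slightly fuller sketch of the same idea. It assumes every line after the header has the four whitespace-separated columns shown in the sample; the names `parse_ratings` and `SAMPLE` are invented for this example, and the conversions to int and float are my guess at what you would want. The distribution field is left as a string because of its leading zeros.

```python
import io

# A few lines copied from the sample above, standing in for the real file.
SAMPLE = """\
   New  Distribution  Votes  Rank  Title
     0000000125  1680661   9.2  The Shawshank Redemption (1994)
     0000000124  786433   9.0  The Godfather: Part II (1974)
"""

def parse_ratings(fp):
    """Yield (distribution, votes, rank, title) for each data line."""
    for n, line in enumerate(fp):
        if n == 0:
            continue                 # heading, skip it
        line = line.strip()
        if not line:
            continue                 # ignore blank lines
        # maxsplit=3 leaves the whole title (spaces included) in the last field
        distnum, nvotes, rank, title = line.split(None, 3)
        yield distnum, int(nvotes), float(rank), title

rows = list(parse_ratings(io.StringIO(SAMPLE)))
for row in rows:
    print(row[3])
```

If you only care about the titles, `row[3]` is all you need; splitting the year off the end of the title would be a separate step.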

I hope that gets you going. If not, return with what code you have, what happened, and what you actually wanted to happen and we may help further.

Cheers,
Cameron Simpson <c...@zip.com.au>
--
https://mail.python.org/mailman/listinfo/python-list
