Re: Python text file fetch specific part of line

cs Thu, 28 Jul 2016 00:33:29 -0700

On 27Jul2016 22:12, Arshpreet Singh <[email protected]> wrote:

I am writing Imdb scrapper, and getting available list of titles from IMDB website which provide txt file in very raw format, Here is the one part of file (http://pastebin.com/fpMgBAjc) as the file provides tags like Distribution Votes,Rank,Title I want to parse title names, I tried with readlines() method but it returns only list which is quite heterogeneous, is it possible that I can parse each value comes under title section?

Just for etiquette: please just post text snippets like that inline in your text. Some people don't like fetching random URLs, and some of us are not always online when reading and replying to email. Either way, having the text in the message, especially when it is small, is preferable.


To your question:

Your sample text looks like this:

   New  Distribution  Votes  Rank  Title
     0000000125  1680661   9.2  The Shawshank Redemption (1994)
     0000000125  1149871   9.2  The Godfather (1972)
     0000000124  786433   9.0  The Godfather: Part II (1974)
     0000000124  1665643   8.9  The Dark Knight (2008)
     0000000133  860145   8.9  Schindler's List (1993)
     0000000133  444718   8.9  12 Angry Men (1957)
     0000000123  1317267   8.9  Pulp Fiction (1994)
     0000000124  1209275   8.9  The Lord of the Rings: The Return of the King (2003)
     0000000123  500803   8.9  Il buono, il brutto, il cattivo (1966)
     0000000133  1339500   8.8  Fight Club (1999)
     0000000123  1232468   8.8  The Lord of the Rings: The Fellowship of the Ring (2001)
     0000000223  832726   8.7  Star Wars: Episode V - The Empire Strikes Back (1980)
     0000000233  1243066   8.7  Forrest Gump (1994)
     0000000123  1459168   8.7  Inception (2010)
     0000000223  1094504   8.7  The Lord of the Rings: The Two Towers (2002)
     0000000232  676479   8.7  One Flew Over the Cuckoo's Nest (1975)
     0000000232  724590   8.7  Goodfellas (1990)
     0000000233  1211152   8.7  The Matrix (1999)

Firstly, I would suggest you not use readlines(): it pulls all the text into memory. For small text like this it is ok, but some things can be arbitrarily large, so it is something to avoid if convenient. Normally you can just iterate over a file and get lines.
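
As a small sketch of that iteration (the sample text and the io.StringIO stand-in for a real open file are my assumptions, purely for illustration):

```python
import io

# Stand-in for an open text file; with a real file you would
# use something like `with open(path) as fp:` instead.
sample = (
    "   New  Distribution  Votes  Rank  Title\n"
    "     0000000125  1680661   9.2  The Shawshank Redemption (1994)\n"
    "     0000000125  1149871   9.2  The Godfather (1972)\n"
)
fp = io.StringIO(sample)

# Iterating over the file object yields one line at a time,
# so only the current line needs to be held in memory.
line_count = 0
for line in fp:
    line_count += 1

print(line_count)
```

The loop body is where the per-line parsing would go; nothing here builds a list of every line.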

You want "text under the Title." Looking at it, I would be inclined to say that the first line is a header and the rest consist of 4 columns: a number (distribution?), a vote count, a rank and the rest (title plus year).


You can parse data like that like this (untested):

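 # presumes `fp` is an open text file reading from the data
 for n, line in enumerate(fp):
   if n == 0:
     # heading, skip it
     continue
   # split on whitespace at most 3 times, so the 4th field
   # keeps the spaces inside the title
   distnum, nvotes, rank, etc = line.split(None, 3)
   etc = etc.rstrip()  # drop the trailing newline
   # ... do stuff with the various fields ...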
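
A slightly fuller sketch along the same lines, in case it helps. The function name iter_titles, the inline sample and the choice to skip blank or short lines are all my assumptions; it has only been tried against a few lines shaped like your sample, not the real IMDB file:

```python
import io

def iter_titles(fp):
    """Yield the "title plus year" field from each data line of `fp`.

    `iter_titles` is a hypothetical helper name, not part of any library.
    """
    for n, line in enumerate(fp):
        if n == 0:
            # heading, skip it
            continue
        # At most 3 splits on whitespace: the 4th field is the rest of the line.
        fields = line.split(None, 3)
        if len(fields) < 4:
            # blank or short line, skip it (an assumption about the data)
            continue
        distnum, nvotes, rank, etc = fields
        yield etc.rstrip()  # drop the trailing newline

# Stand-in for the real file, including one blank line.
sample = (
    "   New  Distribution  Votes  Rank  Title\n"
    "     0000000125  1680661   9.2  The Shawshank Redemption (1994)\n"
    "\n"
    "     0000000133  444718   8.9  12 Angry Men (1957)\n"
)

titles = list(iter_titles(io.StringIO(sample)))
print(titles)
```

Making it a generator keeps the one-line-at-a-time behaviour: the caller decides whether to collect the titles into a list or process them as they arrive.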

I hope that gets you going. If not, return with what code you have, what happened, and what you actually wanted to happen and we may help further.


Cheers,
Cameron Simpson <[email protected]>
--
https://mail.python.org/mailman/listinfo/python-list
