On 02/26/2015 10:53 PM, memilanuk wrote:
So... okay.  I've got a bunch of PDFs of tournament reports that I want
to sift thru for information.  Ended up using 'pdftotext -layout
file.pdf file.txt' to extract the text from the PDF.  Still have a few
little glitches to iron out there, but I'm getting decent enough results
for the moment to move on.

I've got my script to where it opens the file, ignores the header lines
at the top, then goes through the rest of the file line by line,
skipping lines if they don't match (don't need the separator lines) and
adding them to a list if they do (and stripping whitespace off the right
side along the way).  So far, so good.

#  rstatPDF2csv.py

import sys
import re


def convert(file):
    lines = []

    # Open with a context manager so the file is closed when we're done
    with open(file) as data:

        # Skip first n lines of headers (stop quietly if the file is short)
        for i in range(9):
            next(data, None)

        # Read remaining lines one at a time
        for line in data:

            # If the line begins with a capital letter...
            if re.match(r'^[A-Z]', line):

                # Strip any trailing whitespace and then add to the list
                lines.append(line.rstrip())

    return lines


if __name__ == '__main__':
    print(convert(sys.argv[1]))



What I'm ending up with is a list full of strings that look something
like this:

['JOHN DOE                    C   T   HM   445-20*MW*   199-11*MW* 194-5
1HM     393-16*MW*   198-9 1HM    198-11*MW*    396-20*MW*
789-36*MW*     1234-56 *MW*',

Basically... a certain number of characters allotted for competitor
name, then four or five 1-2 char columns for things like classification,
age group, special categories, etc., then a score ('445-20'), then up to
4 char for award (if any), then another score, another award, etc. etc.
etc.

Right now (in the PDF) the scores are batched by one criterion, then
sorted within those groups.  Makes life easier for the person giving out
awards at the end of the tournament, not so much for someone trying to
see how their individual score ranks against the whole field, not just
their group or sub-group.  I want to be able to pull all the scores out
and then re-sort based on score - mainly the final aggregate score, but
potentially also on stage or daily scores.  Eventually I'd like to be
able to calculate standardized z-scores so as to be able to compare
scores from one event/location against another.

So back to the lines of text I have stored as strings in a list.  I
think I want to convert that to a list of lists, i.e. split each line
up, store that info in another list and ditch the whitespace.  Or would
I be better off using dicts?  Originally I was thinking of how to
process each line and split it up based on what information was
where - some sort of nested for/if mess.  Now I'm starting to think that
the lines of text are pretty uniform in structure i.e. the same field is
always in the same location, and that list slicing might be the way to
go, if a bit tedious to set up initially...?

Any thoughts or suggestions from people who've gone down this particular
path would be greatly appreciated.  I think I have a general
idea/direction, but I'm open to other ideas if the path I'm on is just
blatantly wrong.


Maintaining a list of lists is a big pain. If the data is truly very uniform, you might want to do it, but I'd find it much more reasonable to have names for the fields of each line. You can do that either with a namedtuple or with instances of a custom class of your own.

See https://docs.python.org/3.4/library/collections.html#namedtuple-factory-function-for-tuples-with-named-fields

You read a line, do some sanity checking on it, and construct an object. Go to the next line, do the same, another object. Those objects are stored in a list.
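
As a rough sketch of that read/check/construct step, something like the following might do. The field names, the NAME_WIDTH value and the parse_line helper are all made up for illustration; I'm also guessing that the last score on a line is the aggregate and that the number after the dash is some kind of tie-breaker, so adjust to whatever the report actually means.

```python
import re
from collections import namedtuple

# Hypothetical field names; the real report has more columns than this.
Competitor = namedtuple("Competitor", "name classification scores aggregate")

NAME_WIDTH = 28  # assumed width of the name column; measure it in your files
SCORE_RE = re.compile(r"(\d+)-(\d+)")  # matches pairs such as '445-20'


def parse_line(line):
    """Build a Competitor from one report line, or return None if it looks wrong."""
    # Slice off the fixed-width name column
    name = line[:NAME_WIDTH].strip()
    rest = line[NAME_WIDTH:]

    # Each score becomes a (points, tiebreak) tuple of ints
    scores = [(int(p), int(t)) for p, t in SCORE_RE.findall(rest)]

    # Sanity check: a usable line has a name and at least one score
    if not name or not scores:
        return None

    # Assumes the classification column is never blank; if it can be,
    # slice it by position instead of splitting on whitespace.
    classification = rest.split()[0]

    # Assumption: the last score on the line is the final aggregate
    return Competitor(name, classification, scores, scores[-1])
```

You would then build the list with something like `mylist = [c for c in map(parse_line, lines) if c is not None]`. It may be worth printing the lines that come back as None instead of dropping them silently, since they may point at an extraction glitch.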

Everything else accesses the fields of the object something like:


for row in mylist:
    print(row.name, row.classification, row.age)
    if row.name == "Doe":
        ...
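
Once the fields have names, the re-sorting and z-score ideas from the original question get fairly short. This is only a sketch: the Row layout is hypothetical, the three rows are invented data, and I've assumed the population standard deviation (statistics.pstdev) is the one wanted.

```python
from collections import namedtuple
from statistics import mean, pstdev

# Hypothetical record layout and made-up scores, for illustration only
Row = namedtuple("Row", "name classification aggregate")

mylist = [
    Row("JOHN DOE", "C", (1234, 56)),
    Row("JANE ROE", "B", (1240, 61)),
    Row("RICHARD MILES", "C", (1234, 60)),
]

# Highest aggregate first. Tuples compare element by element, so the
# second number breaks ties on the first.
ranked = sorted(mylist, key=lambda row: row.aggregate, reverse=True)

# z-score of each competitor's points relative to the whole field
points = [row.aggregate[0] for row in mylist]
mu = mean(points)
sigma = pstdev(points)  # would be 0 if every score were identical
z_scores = {row.name: (row.aggregate[0] - mu) / sigma for row in mylist}

for place, row in enumerate(ranked, start=1):
    print(place, row.name, row.aggregate, round(z_scores[row.name], 2))
```

A real script would need to guard against sigma being zero, and against two competitors sharing a name if the dict is keyed that way.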




--
DaveA