From time to time I need to parse the rankings PDF files from the Women's Tennis Association. I would do it by copying all the text, pasting it into a text file, and then using a Perl script to take each line, split it on spaces into an array, and manipulate those arrays as needed. (Some tennis players have multiple spaces in their names, such as the Spanish players who use their matronymics, but reversing the array gets around that fairly easily.) That meant getting rid of fields I don't need (e.g. a player's nationality), putting in tabs, and printing the modified list to a different text file so that I could open it in a spreadsheet program as a tab-delimited file.
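A minimal sketch of that split-and-reverse approach, not the actual script: the helper name `parse_ranking_line` and the choice of output columns are assumptions made for illustration. It relies on every numeric field being present, which is what the old rankings text guaranteed.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one copied line of the old-style rankings
# (all ten fields present, zeros included) into a tab-delimited row.
sub parse_ranking_line {
    my ($line) = @_;

    my @fields = split ' ', $line;    # split on runs of whitespace
    my @rev    = reverse @fields;     # the fixed-count fields now come first

    # Reversed order: 17th-best, 16th-best, points coming off, points earned
    # last week, events played, total points, nationality, then the rest.
    my ($r17, $r16, $off, $last, $events, $points, $nat, @rest) = @rev;

    # @rest holds the name words (reversed), then last week's rank,
    # then this week's rank, however many words the name has.
    my $rank      = pop @rest;
    my $prev_rank = pop @rest;
    $prev_rank =~ tr/()//d;           # "(1)" becomes "1"
    my $name = join ' ', reverse @rest;

    # Nationality ($nat) is deliberately dropped from the output.
    return join "\t", $rank, $prev_rank, $name,
                      $points, $events, $last, $off, $r16, $r17;
}

print parse_ranking_line(
    '1 (1) WOZNIACKI, CAROLINE DEN 9930 23 470 200 280 200'
), "\n";
```

Because the numeric fields are peeled off from the end of the line, a name with any number of words ends up intact in `$name`.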
However, the WTA has changed its PDFs. There used to be numerical fields where some of the values would be 0. The WTA has replaced these 0s with *empty fields*. Consider a player who has values for every field:

    1 (1) WOZNIACKI, CAROLINE DEN 9930 23 470 200 280 200

The fields are, in order: this week's ranking, last week's ranking, name, nationality, total ranking points, events played, points earned last week, points coming off the rankings next week, points earned in the 16th-best result, and points earned in the 17th-best result.

Now it's fairly obvious that not everybody plays every week, so there are going to be players with 0s in either the "points earned last week" field or the "points coming off the rankings" field (the rankings are a rolling 52-week system, so the latter is the number of points the player earned in the same calendar week last year). Also, there are players who haven't played 16 events and so have a 0 for the 16th or 17th result. In the past, such fields would have an actual 0, which made parsing the PDF easy. Now, those are blank fields.

In the PDF file, it's visually obvious which fields are empty:

<http://www.wtatennis.com/SEWTATour-Archive/Rankings_Stats/Singles_Numeric.pdf>

However, copying to text gets rid of these empty fields, which makes my old Perl script useless. If you download the ~140KB PDF, you'll see that #3 Zvonareva and #11 Peer each have one field (a different one for each player) that would have had a 0, but when copied to a text file, you get these results:

    3 (3) ZVONAREVA, VERA RUS 7815 20 320 125 60
    11 (11) PEER, SHAHAR ISR 3030 22 60 60 60

You can't tell which field would have had the 0. (And if you look far enough down the rankings, wait until you get to #172 Sloane Stephens, who has exactly 16 events, which means she's got a 0 for the 17th tournament, and who this particular week also has an event added but none coming off the rankings.)

Any good ideas on how to get around this?
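To make the ambiguity concrete, here is a minimal sketch that counts how many numeric fields are missing from a copied line. The helper name `missing_field_count` is made up for illustration, and it assumes the nationality is always a three-letter uppercase code followed by up to six numbers. It can say *how many* fields went missing, but not *which* ones, and that is the whole problem.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many of the six numeric fields that
# follow the three-letter nationality code are missing from a line.
sub missing_field_count {
    my ($line) = @_;

    # Capture the run of numbers that ends the line, right after the
    # nationality code. Bail out if the line is not a ranking row.
    my ($tail) = $line =~ /\b[A-Z]{3}((?:\s+\d+)+)\s*$/
        or return;

    my @numbers = split ' ', $tail;
    return 6 - @numbers;    # @numbers in numeric context is its length
}

for my $line (
    '1 (1) WOZNIACKI, CAROLINE DEN 9930 23 470 200 280 200',
    '3 (3) ZVONAREVA, VERA RUS 7815 20 320 125 60',
    '11 (11) PEER, SHAHAR ISR 3030 22 60 60 60',
) {
    printf "%d missing: %s\n", missing_field_count($line), $line;
}
```

For the three sample lines this reports 0, 1 and 1 missing fields. The Zvonareva and Peer lines look identical in shape even though a different column is blank in each.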
I presume there's a PDF-parsing module, but I don't do all that much Perl programming: I limit myself to text parsing, regexes and a bit more, so I'm not very good with modules.

(Many years ago, the WTA rankings used to be fixed-width text files. Boy, do I miss those days.)

-- 
Ted S.
fedya at hughes dot net
Now blogging at http://justacineast.blogspot.com

_______________________________________________
Perl-Win32-Users mailing list
Perl-Win32-Users@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs