On May 7, 2009, Dan Lynch wrote:
> I think Gordon's original problem is that BeautifulSoup doesn't
> work properly for him on Debian (I think it was Debian). He's
> tried the Lastscrape Python script already if I read the
> original comment correctly. He would prefer something written
> in Perl which is why he's doing this. That's how I understood
> it but obviously I can't speak for Gordon, I'm sure he'll reply
> soon. Just wanted to explain the situation.

After spending 4 days changing a water pump (domestic water supply at the farm), I'm back working on things.

Yes, I could install a deprecated version of BeautifulSoup and scrape my information off Last.fm. If I had much experience with Python, I could fix the scraper so that it would work with the up-to-date BeautifulSoup. I can certainly understand why something like a scraper is sensitive to things which are HTML, or look like it. I've written programs before to clean HTML up, and it isn't nice. The best program I've seen for cleaning HTML is HTML Tidy from W3C.

Anyway, while the CPAN module WebService::LastFM has a function to get a tracklist, what it returns is a random sample of 10 tracks you have listened to. After spending a couple of days looking at the code, I see no easy way to get it to produce the entire tracklist you want.

On the Perl side, XML::Simple couldn't parse a sample download of 1 page of my own data. XML::Twig also can't parse it, but at least it gives me messages as to where the problem is. So, I tried the incremental edit method. There seem to be several different styles of subtrees present in any given page. Some of the trees are parsed without problems, some have easy-to-fix problems, and some are more difficult. Things like escaping double quotation marks and escaping forward slashes were causing problems. I eventually ran into a problem near the bottom of the page which would require a bit of work to fix, and thought about other means.

HTML Tidy is freely available for many platforms, and is fast (I think it is written in C). If I run Tidy to indent and clean, and to assume UTF-8 input and generate UTF-8 output, I almost get a file which XML::Twig will process. There are a couple of attributes of elements in the page which are empty (such as alt=""), which XML::Twig thinks are duplicates. And XML::Twig doesn't understand &nbsp; for some reason.

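A minimal sketch of the Tidy step just described, written in Python purely for illustration (the message itself works in Perl). The function names `tidy_command` and `run_tidy` and the file names are hypothetical. The `-asxml` flag is an assumption: the message mentions indenting, cleaning and UTF-8, but does not say which output format was requested.

```python
import subprocess


def tidy_command(src, dest):
    """Build an HTML Tidy command line for cleaning one saved page."""
    return [
        "tidy",
        "-quiet",    # suppress non-essential output
        "-indent",   # indent element content
        "-clean",    # replace presentational tags with CSS
        "-utf8",     # treat both input and output as UTF-8
        "-asxml",    # assumption: ask for well-formed XHTML output
        "-output", dest,
        src,
    ]


def run_tidy(src, dest):
    """Run Tidy on src, writing the cleaned page to dest."""
    result = subprocess.run(tidy_command(src, dest))
    # Tidy exits with 1 when it only issued warnings and 2 when it hit
    # errors, so only a status of 2 or more is treated as a failure.
    if result.returncode >= 2:
        raise RuntimeError(f"tidy failed on {src}")
    return dest
```

Calling `run_tidy("page1.html", "page1.xml")` would need the `tidy` binary on the PATH; the same options can of course be typed directly at a shell prompt.
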
Deleting the empty attributes from the text of the page, and changing &nbsp; into a space, are enough to get XML::Twig to parse the file.

Twig is meant to process big files. Having an almost 2000-line HTML file to describe 51 lines of data is excessive, and so Twig makes sense. It is easy to have Twig delete comments, script, noscript and form elements (and all their children) while it is parsing the file. The last thing was to get the ISO timestamp out of the title attribute of the abbr element in the table; with Twig one can provide a subroutine which overwrites the text content of the abbr element with the value in the title attribute. Twig understands XPath, and now all of the data you want in your file is easily found using XPath. So, I am left with writing the XPath part, and putting the whole thing in a loop so that I can download the 5629 pages of data that I have at Last.fm. (Well, you can use XPath to extract the data from wherever it is, but I was looking at having all of the data of interest in the text part of elements, which is not strictly needed.)

In terms of instructing people interested in libre.fm about how to get their data, sure, you can keep your current method. And maybe sometime the current version of BeautifulSoup will actually work. And then something else happens.

What seems better, and I don't care if you like Python, Perl, Ruby or ksh, is to have HTML Tidy clean up the page, then process it with whatever XML methods you like. The cleanup by Tidy removes about 20% of the file that I was working with. And you are probably left with a file that doesn't require you to recommend that people install deprecated parsers to work with it. That's about as bad as telling someone today that they need PC-DOS-3.09 to run some program.

Gord

_______________________________________________
Libre-fm mailing list
http://lists.autonomo.us/mailman/listinfo/libre-fm
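
As a postscript to the cleanup steps in the message above, here is a minimal sketch of the same idea using Python's standard library rather than XML::Twig. It is not the author's script: the names `clean_page`, `timestamps` and `local`, the regular expression, and the sample markup are all made up for illustration, and real Last.fm pages will differ.

```python
import re
import xml.etree.ElementTree as ET

DROP = {"script", "noscript", "form"}


def local(tag):
    """Return a tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def clean_page(xhtml):
    """Apply the text fixes, parse, prune, and rewrite abbr elements."""
    # Text-level fixes made before parsing.
    xhtml = re.sub(r'\s+[\w:-]+=""', "", xhtml)  # drop empty attributes
    xhtml = xhtml.replace("&nbsp;", " ")         # not predefined in XML
    # The default ElementTree parser already discards comments.
    root = ET.fromstring(xhtml)
    # Remove unwanted elements together with all their children.
    # (Any text trailing a removed element is dropped with it.)
    for parent in list(root.iter()):
        for child in list(parent):
            if local(child.tag) in DROP:
                parent.remove(child)
    # Overwrite the text of each abbr with its title attribute.
    for elem in root.iter():
        if local(elem.tag) == "abbr" and elem.get("title"):
            elem.text = elem.get("title")
    return root


def timestamps(root):
    """Collect the text of every abbr element in document order."""
    return [e.text for e in root.iter() if local(e.tag) == "abbr"]
```

For example, on a tiny hand-written page:

```python
sample = (
    '<html><body><!-- banner --><script>x</script>'
    '<form><input name="q" value=""/></form>'
    '<table><tr><td>Artist&nbsp;Name</td>'
    '<td><abbr title="2009-05-07T12:00:00Z" class="">7 May</abbr></td>'
    '</tr></table></body></html>'
)
root = clean_page(sample)
print(timestamps(root))
```

The tag matching goes through `local()` because Tidy's XHTML output puts every element in the XHTML namespace, which would otherwise have to be spelled out in each path expression.
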
