OK, here's my question about JPluck, or rather its JXL files.
How do they work exactly? And are there any tips on (re)writing them to get better versions of pages?
More specifically, it's about the New York Times. I downloaded a set of files a while ago which transformed some NYT sections quite nicely, stripping out the nav etc.
Then they stopped working so well. The NYT must have somehow changed their format, because real estate ads and the like started intruding into the Plucker files, and I needed eight or ten page-downs to get to the content.
I recently downloaded a different nytimes.jxl, which takes its links from the Userland RSS feeds that point to the Print-Friendly versions, but it's still not as good as it used to be -- there's still lots of nav and ad content before the real content.
So, any clues on how to write my own NYT-parsing code? I'm assuming the change to the Print-Friendly version was made because otherwise we'd forever be playing catch-up with the changing NYT format? I'm prepared to play catch-up!
Oh yes, and in terms of pure theory, how do XSLT transformations work on non-XML content? I've spent ages learning XML and had it drummed into me over and over again that everything's got to be valid or such transformations won't work, and the NYT's code is nothing like valid XML.
_______________________________________________
plucker-list mailing list
http://lists.rubberchicken.org/mailman/listinfo/plucker-list

