On Thursday 22 July 2004 23:54, Lambert, Mark wrote:
> > Ooh, now that sounds more promising. Do you have the
> > particular expressions &c that I could try out? Are jEdit's
> > regexps perlable? (or even sedable?)
>
> The only problem is I find I have to tweak them for different books.

All right, then... what about trying some of the heuristic algorithms other people have been cooking up? GutenMark in particular, while not perfect, usually does quite a good job. Try it out on any ebook. To pick one at random (I just typed in an ID and saw what I got; try a few others for a representative sample), here's "A Golden Book of Venice":

http://underdog.arsc.alaska.edu:8080/ETTS/ConversionOptions?id=10455&format=html

To try other books, just change the id in that URL to the relevant ebook number. Because the system goes for maximum quality, if there is a better way of obtaining an HTML version (say, by extracting a ZIP or, on certain documents, performing one of our experimental XML conversions), it will use that rather than GutenMark, so do check the beginning of the page for the GutenMark disclaimer.

Anyway, do you reckon you'd be able to handle that HTML in an automated fashion?

Meredydd

_______________________________________________
plucker-list mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-list

