Re: Project Gutenberg

Meredydd Fri, 23 Jul 2004 09:03:23 -0700

On Thursday 22 July 2004 23:54, Lambert, Mark wrote:
> > Ooh, now that sounds more promising. Do you have the
> > particular expressions &c that I could try out? Are jEdit's
> > regexps perlable? (or even sedable?)
>
> The only problem is I find I have to tweak them for different books.


All right, then... what about trying some of the heuristic algorithms
other people have been cooking up? GutenMark, in particular, while not
perfect, usually does quite a good job. Try it out on any ebook. To
pick one at random (I just typed in an ID and saw what I got), here's
"A Golden Book of Venice":

http://underdog.arsc.alaska.edu:8080/ETTS/ConversionOptions?id=10455&format=html

Try a few other books - just change the id in that URL to the relevant
ebook number. Because the system tries to go for maximum quality, if
there is a better way of obtaining an HTML version (say, by extracting
a ZIP or, on certain documents, performing one of our experimental XML
conversions), it will use that rather than GutenMark, so do check the
beginning of the page for the GutenMark disclaimer.
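
If you'd rather script that than edit the URL by hand, something along
these lines might do as a rough sketch. The function names are made up
for illustration, and the disclaimer check is only a guess: it assumes
the disclaimer mentions the word "GutenMark" somewhere in the first
2000 characters of the page, which I haven't pinned down.

```python
from urllib.parse import urlencode

# Base address taken from the example URL above.
BASE = "http://underdog.arsc.alaska.edu:8080/ETTS/ConversionOptions"


def conversion_url(ebook_id, fmt="html"):
    """Build the conversion URL for one ebook number (hypothetical helper)."""
    return BASE + "?" + urlencode({"id": ebook_id, "format": fmt})


def looks_like_gutenmark(page_text, marker="GutenMark", head=2000):
    """Guess whether a page came from GutenMark.

    Assumption: the disclaimer contains `marker` and sits within the
    first `head` characters. Both values are placeholders to adjust.
    """
    return marker.lower() in page_text[:head].lower()


if __name__ == "__main__":
    # Print URLs for a handful of ebook numbers to try in a browser.
    for ebook_id in (10455, 10456, 10457):
        print(conversion_url(ebook_id))
```

Fetching each URL and passing the text to looks_like_gutenmark() would
then tell you which books are worth including in a sample.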

Anyway, do you reckon you'd be able to handle that HTML in an automated 
fashion?

Meredydd
_______________________________________________
plucker-list mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-list
