Joel,
    Here's the URL for the ancient Greek texts:
http://www.perseus.tufts.edu/cgi-bin/perscoll?collection=Greco-Roman&type=text&lang=greek
Take your pick. If you're fond of Rome, go to the texts and translations page
and you'll find a lot of Latin texts as well.

The texts will initially display in transliterated Greek, with a hypertext
link from every word to the Liddell-Scott lexicon. If you go to the Display
Configuration Menu you can switch the output to UTF-8 and drop the morphology
links to get something a little easier to handle. If you're downloading a few
pages, you'll probably want to cut and paste the cookie you get back from the
config menu into your requests so the settings carry over.
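
A rough sketch of reusing that cookie when fetching several pages, in Python
rather than REBOL and purely for illustration. The cookie string is whatever
the config menu hands back; the helper names `build_request` and `fetch` are
made up here, and nothing below is specific to how Perseus names its cookies.

```python
import urllib.request

# The collection URL from above.
BASE_URL = ("http://www.perseus.tufts.edu/cgi-bin/perscoll"
            "?collection=Greco-Roman&type=text&lang=greek")


def build_request(url, cookie):
    """Build a request that sends the pasted cookie string along with it."""
    return urllib.request.Request(url, headers={"Cookie": cookie})


def fetch(url, cookie):
    """Download one page, decoding as UTF-8 and replacing undecodable bytes."""
    with urllib.request.urlopen(build_request(url, cookie)) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `fetch(BASE_URL, pasted_cookie)` for each page would then send the
same display settings every time, assuming the site keys them off the cookie.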

It's a very impressive site - though probably a little too cluttered for my
aesthetic - and closer to the universal library some of us were hoping for
out of the Internet (until the World Wide Web turned up and turned it into a
zillion gigabytes of shallow advertising :-} ).

The main parsing problems I've had are fragments of non-HTML in the pages,
poorly nested tags, and occasional missing elements. Because HTML is
style-based rather than structure-based, I've had to make some guesses about
structure. So far I'm close to parsing about 90% of pages correctly - even
automating the construction of a reasonably correct TEI header. But you're
right, it was a little too ambitious - my code's a mess and I've backed myself
into some ugly corners. Still, I think of it as a draft - get through it
messily once and then create something a little more elegant (probably
wishful thinking).
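
To show what "poorly nested tags" looks like to a program, here is a minimal
sketch in Python (not the REBOL code described above) that walks a page with
the standard `html.parser` module and reports tags closed out of order. The
class name `NestingChecker`, the function `check`, and the short `VOID` list
are assumptions made for this example only.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag; an assumed, incomplete list.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class NestingChecker(HTMLParser):
    """Track open tags on a stack and record every nesting problem seen."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if tag not in self.stack:
            self.problems.append(f"stray </{tag}>")
            return
        # Pop everything opened after the matching start tag.
        while self.stack:
            open_tag = self.stack.pop()
            if open_tag == tag:
                break
            self.problems.append(f"<{open_tag}> not closed before </{tag}>")


def check(html):
    """Return a list of nesting problems found in an HTML string."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.problems + [f"<{t}> never closed" for t in checker.stack]
```

Older HTML legitimately leaves tags such as `<p>` and `<li>` unclosed, so a
real pass would need to treat those reports as hints rather than errors.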

I'll send you an example XML page when I've got something reasonable.

Have you any idea of the best way to set up a good guess under REBOL? I've
also started a set of Gutenberg text-to-XML scripts, and was curious how you
would do something like -

    if you find the word "Contents" on a short line, followed closely by a
    series of short lines:
        guess a contents list and tag accordingly.
        if you find a match between elements in one of these lines and
        possible chapter header candidates:
            make a link.

Some of REBOL's great parsing abilities make me think something like this is
possible - but I don't quite know how you would put it together.
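
For the sake of having something concrete to argue with, here is the guess
above sketched in Python rather than REBOL. Everything specific in it is an
assumption: the 40-character cutoff for a "short" line, the allowance of up
to three blank lines after "Contents", the minimum of three entries, and the
function names. Matching an entry to a chapter header is done by crude
equality after stripping case and punctuation.

```python
import re

SHORT = 40        # assumed cutoff for a "short" line, in characters
MAX_GAP = 3       # assumed number of blank lines allowed after "Contents"
MIN_ENTRIES = 3   # assumed minimum run of short lines to call it a list


def guess_contents(lines):
    """Return (entries, index of the line after the list), or None."""
    for i, line in enumerate(lines):
        text = line.strip()
        if len(text) > SHORT or not re.search(r"\bcontents\b", text, re.I):
            continue
        j = i + 1
        # Skip a few blank lines between the heading and the list.
        while j < len(lines) and j - i <= MAX_GAP and not lines[j].strip():
            j += 1
        entries = []
        # Collect the run of consecutive non-blank short lines.
        while j < len(lines) and 0 < len(lines[j].strip()) <= SHORT:
            entries.append(lines[j].strip())
            j += 1
        if len(entries) >= MIN_ENTRIES:
            return entries, j
    return None


def normalise(text):
    """Lower-case and collapse punctuation so headers compare loosely."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def link_entries(lines):
    """Map each guessed contents entry to the line index of its header."""
    guess = guess_contents(lines)
    if guess is None:
        return {}
    entries, body_start = guess
    links = {}
    for entry in entries:
        key = normalise(entry)
        for n in range(body_start, len(lines)):
            if key and normalise(lines[n]) == key:
                links[entry] = n
                break
    return links
```

The result of `link_entries` is just a table of entry text to line number;
turning that into XML tags and links would be a separate step.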

Thanks 
    Gary
