According to Torsten Neuer:
> Geoff Hutchison wrote:
> > On Mon, 5 Feb 2001, Greg Lepore wrote:
> > >       Have searched the site and the faq with no results.  Is there
> > > any way for HTDIG to re-create words that are broken across two lines
> > > with a hyphen?
> > 
> > I suspect you're talking about external documents as I've never seen
> > hyphenation in HTML documents (and rarely seen it in text documents).
> > 
> > You'd probably have to tackle this on the converter or parser level and I
> > don't know if this can happen at the moment. Of course if you give us more
> > detail (like the file types you're considering), someone might be able to
> > come up with a solution for you.
> > 
> > You could undoubtedly do it in the source itself by keeping track of the
> > last word requested if it ends in a hyphen. But this hasn't been requested
> > before. Test documents would be quite welcome.
> 
> I think that this feature would significantly increase the usability of
> Ht://Dig on PDF and other "pre-print" document types.  However,
> reconstruction of hyphenated words would need an additional database,
> probably something similar to the TeX hyphenation database, and would
> slow down the indexing process for those documents.

The conv_doc.pl and doc2html.pl scripts feature a very simple-minded
dehyphenation algorithm that's applied only to PDF files.  It doesn't
make use of any database.  It only looks for letter-hyphen-newline-letter
sequences and strips out the hyphen and newline.  It allows any space
characters before or after the newline as well.  Simple, but it seems
to do the job quite nicely on the hyphenated PDF documents I indexed.
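
For illustration only, here is a rough sketch of that rule in Python.  It is
not the code from conv_doc.pl or doc2html.pl; the function name dehyphenate
is made up for the example, and treating "letter" as ASCII A-Z/a-z and
"space characters" as spaces and tabs are assumptions on my part.

```python
import re

# A hyphen that sits between two letters and is followed by a line break,
# with optional spaces or tabs on either side of the newline.
# Assumption: "letter" means ASCII A-Z/a-z; the real scripts may differ.
_HYPHEN_BREAK = re.compile(r"(?<=[A-Za-z])-[ \t]*\r?\n[ \t]*(?=[A-Za-z])")


def dehyphenate(text: str) -> str:
    """Strip the hyphen and line break from letter-hyphen-newline-letter runs."""
    return _HYPHEN_BREAK.sub("", text)


if __name__ == "__main__":
    sample = "recons-\ntruction of hy- \n  phenated words"
    print(dehyphenate(sample))
```

Note that a rule this simple can't tell a line-break hyphen from a real one,
so a compound like "well-" / "known" split at its own hyphen would come out
as "wellknown".  That is the trade-off for not using any database.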

Dealing with HTML files is a problem, because you don't want to start
messing with external parsers for HTML.  However, if this simple algorithm
is adequate, it should be quite easy to add to htdig/HTML.cc's HTML::parse()
code.
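
As a sketch of what a parser-level version might look like, here is the
"keep track of the last word if it ends in a hyphen" idea applied to a
stream of words.  It is written in Python purely for illustration and is not
based on the actual HTML::parse() code; join_hyphenated is a hypothetical
name, and it assumes the tokenizer only hands over a trailing-hyphen word
when that word ended a line.

```python
from typing import Iterable, Iterator


def join_hyphenated(words: Iterable[str]) -> Iterator[str]:
    """Merge a word ending in letter+hyphen with the word that follows it."""
    pending = ""  # fragment held back because it ended in a hyphen
    for word in words:
        if pending:
            if word[:1].isalpha():
                # Next word starts with a letter: glue the halves together.
                word = pending + word
            else:
                # Not a continuation: emit the fragment as it was seen.
                yield pending + "-"
            pending = ""
        if len(word) > 1 and word.endswith("-") and word[-2].isalpha():
            pending = word[:-1]  # hold back, minus the hyphen
        else:
            yield word
    if pending:
        # Input ended on a hyphenated fragment; emit it unchanged.
        yield pending + "-"


if __name__ == "__main__":
    print(list(join_hyphenated(["recons-", "truction", "of", "words"])))
```

Holding back only one fragment keeps the extra state to a single string, so
the cost during indexing should be negligible.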

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
Information: http://lists.sourceforge.net/lists/listinfo/htdig-general
FAQ: http://htdig.sourceforge.net/FAQ.html
