Re: [htdig] de-hyphenation

Torsten Neuer Tue, 06 Feb 2001 01:29:13 -0800
Geoff Hutchison wrote:
> 
> On Mon, 5 Feb 2001, Greg Lepore wrote:
> 
> >       Have searched the site and the faq with no results.  Is there
> > any way for HTDIG to re-create words that are broken across two lines
> > with a hyphen?
> 
> I suspect you're talking about external documents as I've never seen
> hyphenation in HTML documents (and rarely seen it in text documents).
> 
> You'd probably have to tackle this on the converter or parser level and I
> don't know if this can happen at the moment. Of course if you give us more
> detail (like the file types you're considering), someone might be able to
> come up with a solution for you.
> 
> You could undoubtedly do it in the source itself by keeping track of the
> last word requested if it ends in a hyphen. But this hasn't been requested
> before. Test documents would be quite welcome.

I think that this feature would significantly increase the usability of
Ht://Dig on PDF and other "pre-print" document types.  However,
reconstruction of hyphenated words would need an additional database -
probably something similar to the TeX hyphenation database - and would
slow down the indexing process for those documents.

If the TeX hyphenation databases could be transformed into a pattern
recognition database for hyphenated words, the slow-down of the indexer
process would not hurt too much - after all, only words ending with "-"
would be considered for lookup in the de-hyphenation database.  If such
a word produces a hit, the next portion of the document could be checked
against the value parts of the pattern database.

E.g. the TeX patterns "hy\-per hy\-phe\-na\-tion" could be transformed
into the following key/value pairs:
        "hy-"      -> ( "per" "phenation" )
        "hyphe-"   -> ( "nation" )
        "hyphena-" -> ( "tion" )
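
The transformation above can be sketched in a few lines.  This is only
an illustration in Python, not Ht://Dig code; the function name
build_table and the whitespace-separated input format are assumptions
made for the example.

```python
from collections import defaultdict

def build_table(patterns):
    """Map each 'prefix-' key to the remainders that may follow it.

    `patterns` is a whitespace-separated string of words whose break
    points are marked with a backslash-hyphen, e.g. "hy\\-phe\\-na\\-tion".
    """
    table = defaultdict(list)
    for pattern in patterns.split():
        # "hy\-phe\-na\-tion" -> ["hy", "phe", "na", "tion"]
        parts = pattern.split("\\-")
        # Every break point yields one key/value pair.
        for i in range(1, len(parts)):
            key = "".join(parts[:i]) + "-"
            rest = "".join(parts[i:])
            if rest not in table[key]:
                table[key].append(rest)
    return dict(table)

table = build_table(r"hy\-per hy\-phe\-na\-tion")
for key, values in table.items():
    print(key, "->", values)
```

Each word contributes one entry per break point, which is why the table
grows quickly with the size of the pattern list.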

This is quite a simple approach and does not take multiply hyphenated
words into account, but it might work for most cases where hyphenation
occurs in PDF or Postscript documents.  It also requires quite some
storage space for the de-hyphenation lookup tables, so maybe there is a
somewhat nicer approach?
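
To make the lookup step concrete, here is a minimal sketch of how an
indexer might use such a table while walking through the words of a
document.  It is Python for illustration only; the names dehyphenate
and TABLE are hypothetical, the table is hard-coded from the example
above, and the sketch handles only a single break per word.

```python
# Hard-coded copy of the example table; a real one would be generated
# from a hyphenation pattern list.
TABLE = {
    "hy-": ["per", "phenation"],
    "hyphe-": ["nation"],
    "hyphena-": ["tion"],
}

def dehyphenate(text, table):
    """Return the words of `text`, rejoining known hyphenated pairs."""
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        word = words[i]
        # Only words ending in "-" trigger a lookup at all.
        if word.endswith("-") and i + 1 < len(words):
            nxt = words[i + 1]
            core = nxt.rstrip(".,;:!?")  # ignore trailing punctuation
            if core.lower() in table.get(word.lower(), ()):
                out.append(word[:-1] + nxt)  # drop the hyphen and join
                i += 2
                continue
        out.append(word)
        i += 1
    return out

sample = "TeX hyphena-\ntion is pattern based - see hy-\nper links"
print(dehyphenate(sample, TABLE))
```

Words that end in "-" but produce no hit (such as the lone dash in the
sample) are passed through unchanged, so a miss costs one failed lookup
and nothing more.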

HTML documents *could* (in theory) be hyphenated as well - there is a
special entity (soft hyphen, "&shy;") which could be used to
automagically hyphenate documents in the web client.  It should be no
problem to make the Ht://Dig indexer recognize this special entity (by
simply skipping over it instead of translating it to "-").  However,
there are only a few browsers out there which support the "&shy;"
hyphenation feature - AFAIK only Lynx is able to display
"&shy;"-hyphenated documents correctly (all other browsers translate it
to "-" regardless of whether hyphenation is required or not).
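
Skipping over the soft hyphen could look roughly like this.  Again this
is only a Python sketch and not how the Ht://Dig parser is written; the
function name strip_soft_hyphens is hypothetical, and the sketch
decodes all character references in the text, not just "&shy;".

```python
import html

SOFT_HYPHEN = "\u00ad"  # the character that "&shy;" stands for

def strip_soft_hyphens(markup_text):
    """Decode character references, then drop every soft hyphen."""
    decoded = html.unescape(markup_text)  # "&shy;" and "&#173;" -> U+00AD
    return decoded.replace(SOFT_HYPHEN, "")

print(strip_soft_hyphens("hy&shy;phen&shy;ation"))
```

Because the soft hyphen is removed rather than translated to "-", the
word reaches the index in one piece and no lookup table is needed.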


ciao,

  Torsten

-- 
InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14                            Tel: +49-4101-403605
D-25474 Ellerbek                            Fax: +49-4101-403606
E-Mail: [EMAIL PROTECTED]            Internet: http://www.inwise.de

_______________________________________________
htdig-general mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/htdig-general