Greg Lepore wrote:
> 
> Hyphenation example:
> On our site we are doing large-scale conversion of previously published
> material to HTML via OCR.  As we are reproducing format as well as text,
> this results in many hyphenations.  For a page with several examples:
> http://mdsa.net/megafile/msa/speccol/sc2900/sc2908/000001/000138/html/am138--606.html
> The hyphens appear as regular (-).  No special characters are inserted by
> the OCR programs.

For these documents, I would suggest post-processing the HTML
output of the OCR software with a simple filter program that strips
the "-<BR>" sequences from the text, so that the split words are rejoined.

This could be done quite easily using a lex(1) scanner.
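
As a rough illustration of such a filter, here is a minimal sketch in
Python rather than lex(1). It assumes the OCR output writes the hyphen
directly after a letter and immediately before the <BR> tag; the names
dehyphenate and HYPHEN_BREAK are made up for this example.

```python
import re
import sys

# Assumed pattern: a hyphen directly after a letter, then a <BR> tag
# (any case, optional "/"), then any whitespace such as the newline
# that follows the tag in the OCR output.
HYPHEN_BREAK = re.compile(r"(?<=[A-Za-z])-[ \t]*<br\s*/?>\s*", re.IGNORECASE)


def dehyphenate(html: str) -> str:
    """Rejoin words that were split as 'word-<BR>rest' (hypothetical helper)."""
    return HYPHEN_BREAK.sub("", html)


if __name__ == "__main__":
    # Works as a plain filter: HTML on stdin, cleaned HTML on stdout.
    sys.stdout.write(dehyphenate(sys.stdin.read()))
```

Note that a filter like this cannot tell a line-break hyphen from a
genuine compound that happens to be split at the line end, so a word
such as "well-<BR>known" would come out as "wellknown".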


cheers,

  Torsten

-- 
InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14                            Tel: +49-4101-403605
D-25474 Ellerbek                            Fax: +49-4101-403606
E-Mail: [EMAIL PROTECTED]            Internet: http://www.inwise.de

_______________________________________________
htdig-general mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/htdig-general