> 
> Hi,
> 
> I want to index PDF files containing German umlauts (ä, ö, ü, ß). Some tests have
> shown me that htdig (v. 3.1.5) and xpdf (v. 0.91) work pretty well with German
> umlauts, but the external parser parse_doc.pl has problems with them. It splits
> words containing an umlaut into two words without the umlaut.
> For example:
> 
> w       beim    41      0
> w       diesj   45      0
> w       hrigen  50      0
> w       den     58      0
> w       Platz   62      0
> 
> In this case the German word "diesjährigen" is split into "diesj" and "hrigen", and I
> can find both with htsearch.
> 
> Does anyone know how to solve this problem, for example with a modified version of
> parse_doc.pl?
> 
> Thanks,
> 
> Christian Huhn
> 

You could try the doc2html parser.  I think that the latest version,
available from the Ht://Dig web site, will not split words this way, but
I have not tested it thoroughly. 

If doc2html does not parse your .PDF files properly, then email an
example to me personally, and I'll make sure that the next version of
doc2html works correctly. 
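
For what it's worth, the symptom you describe is what you would expect
from a word pattern that only accepts ASCII letters. The sketch below is
in Python and is not taken from parse_doc.pl or doc2html; the pattern
names and the split_words helper are my own, hypothetical illustrations.
It assumes the parser extracts words with a regular expression, and
shows how widening the character class to the Latin-1 letters keeps
"diesjährigen" in one piece.

```python
import re

# Assumption: a tokenizer that treats only ASCII letters as word characters.
ASCII_WORD = re.compile(r"[A-Za-z]+")

# Same idea, but also accepting the Latin-1 accented letters
# (U+00C0-U+00FF, minus the multiplication and division signs),
# which covers the German umlauts and the sharp s.
LATIN1_WORD = re.compile(r"[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+")


def split_words(text, pattern):
    """Return every run of characters in text that matches pattern."""
    return pattern.findall(text)


text = "beim diesjährigen den Platz"

ascii_words = split_words(text, ASCII_WORD)    # breaks the word at the umlaut
latin1_words = split_words(text, LATIN1_WORD)  # keeps the word whole

print(ascii_words)
print(latin1_words)
```

If the parser you end up using has a similar pattern, extending its
character class in the same way may be all that is needed.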

-- 
 
David J Adams
<[EMAIL PROTECTED]>
Computing Services
University of Southampton
