pdf indexing problem

Siegfried Puchbauer Wed, 01 Sep 2004 07:50:04 -0700

Hi,


I have a PDF Parser which uses PDFBox libary to parse PDF documents into
plain text.

I have tried this parser by sending the output directly to the
commandline and it works, I 
get the plain text, like I get it with my HTMLParser.

But there is a problem with the indexing, I think:

I can only find the document with lucene with words which occur before
the first dot. With
a word which occurs after the first "." Lucene doesn't find the related
document. So I think
these words are not indexed. I do not use any special analyzer and I
cannot understand
why it does not work. 

 

The indexing with html files work, there I can find words after the
first ".".

Its also possible that the missing indexing is not related to the "."
Character, but to

The first newline (\n). I don't know.

 

Have you got any ideas for making pdf-index work?

 

Siegfried Puchbauer

pdf indexing problem

Reply via email to