Hi,
I have a PDF Parser which uses PDFBox libary to parse PDF documents into plain text. I have tried this parser by sending the output directly to the commandline and it works, I get the plain text, like I get it with my HTMLParser. But there is a problem with the indexing, I think: I can only find the document with lucene with words which occur before the first dot. With a word which occurs after the first "." Lucene doesn't find the related document. So I think these words are not indexed. I do not use any special analyzer and I cannot understand why it does not work. The indexing with html files work, there I can find words after the first ".". Its also possible that the missing indexing is not related to the "." Character, but to The first newline (\n). I don't know. Have you got any ideas for making pdf-index work? Siegfried Puchbauer
