Thank you, Yonik, for this hint. (Again, I wasn't aware that Solr evidently offers useful extensions for the Lucene indexing process, and I wonder why they haven't been added to Lucene itself.)

Anyway, since the HyphenatedWordsFilter needs newlines in its input, I will have to use a Tokenizer other than StandardTokenizer. If I simply take the WhitespaceTokenizerFactory (as suggested by HyphenatedWordsFilterFactory), I will lose the punctuation handling done by StandardTokenizer, right? What would I have to borrow to get that back? Or do I have to extend StandardTokenizerImpl.jflex?

Wulf


On 01.04.2011 18:23, Yonik Seeley wrote:
Solr has a hyphenated word filter you could copy.
http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenatedWordsFilterFactory.html

On trunk, this has been folded into the analysis module.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco

On Fri, Apr 1, 2011 at 11:50 AM, Wulf Berschin<bersc...@dosco.de>  wrote:
Hi,

for indexing PDF files we have to undo word hyphenation. The basic idea is
simply to remove the hyphen when it is followed by a line break and a
lowercase letter. Of course this approach isn't 100% foolproof, but checking
against a dictionary wouldn't be either...
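
The rule described above can be sketched outside of Lucene as a plain text pre-processing step. This is only an illustration using java.util.regex, not the Solr filter or a change to the tokenizer. The class name Dehyphenator and the method dehyphenate are made up for this example, and requiring a letter before the hyphen is an extra assumption added to avoid touching free-standing dashes.

```java
import java.util.regex.Pattern;

public class Dehyphenator {
    // Hypothetical helper, not part of Lucene or Solr.
    // Matches: a hyphen directly after a letter, optional trailing blanks,
    // a line break, optional indentation, and then (as lookahead only)
    // a lowercase letter on the next line.
    private static final Pattern BROKEN_WORD =
            Pattern.compile("(?<=\\p{L})-[ \\t]*\\r?\\n[ \\t]*(?=\\p{Ll})");

    // Removes the hyphen and the line break so the two word halves are joined.
    public static String dehyphenate(String text) {
        return BROKEN_WORD.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "The index-\ning step keeps Lucene-\nSolr intact.";
        // "index-\ning" is joined; "Lucene-\nSolr" is left alone because
        // the next line starts with an uppercase letter.
        System.out.println(dehyphenate(page));
    }
}
```

As noted, the heuristic will also join genuinely hyphenated compounds that happen to break at the hyphen before a lowercase letter, so it trades some precision for simplicity.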

Since we face this problem too when highlighting using HTMLCharStripper
(yes, we do have hyphenation in our HTML docs...) it seems to me I have to
adjust the JFlex generated StandardTokenizerImpl.

Is this the right approach, and how would I have to modify this script?

Thanks
Wulf


PS: I see that changes were made in the brand-new 3.1.0 version (we are
using 3.0.3), but as far as I understand there are no relevant changes in
this respect.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


