Solr has a hyphenated word filter you could copy. http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenatedWordsFilterFactory.html
On trunk, this has been folded into the analysis module. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco On Fri, Apr 1, 2011 at 11:50 AM, Wulf Berschin <bersc...@dosco.de> wrote: > Hi, > > for indexing PDF files we have to undo word hyphenation. The basic idea is > simply to remove the hyphen when a new line and a small letter follows. Of > course this approach isnt 100%-foolproofed but checking against a dictionary > wouldnt be as well... > > Since we face this problem too when highlighting using HTMLCharStripper > (yes, we do have hyphenation in our HTML docs...) it seems to me I have to > adjust the JFlex generated StandardTokenizerImpl. > > Is this the right approach and hwo would I have to modify this script? > > Thanks > Wulf > > > PS: I see that there are changes made in the brand new 3.1.0 version we are > using 3.0.3, but as far I understand no relevant changes in this respect. > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org