HyphenationCompoundWordTokenFilter does not work correctly with the German word Brustamputation
-----------------------------------------------------------------------------------------------

                 Key: LUCENE-3047
                 URL: https://issues.apache.org/jira/browse/LUCENE-3047
             Project: Lucene - Java
          Issue Type: Bug
          Components: contrib/analyzers
    Affects Versions: 3.1
         Environment: Linux 2.6.32-31-generic
                      java version "1.6.0_21"
                      Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
                      Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)
            Reporter: Lars Feistner
            Priority: Minor


The following test fails:

    @Test
    public void testBrustamputation() throws IOException {
        Analyzer compoundAnalyzer = new Analyzer() {
            @Override
            public TokenStream tokenStream( String fieldName, Reader reader ) {
                // Load the German hyphenation grammar from the classpath.
                InputStream in = this.getClass().getResourceAsStream( "/de_DR.xml" );
                final InputSource inputSource = new InputSource( in );
                inputSource.setEncoding( "iso-8859-1" );
                HyphenationTree hyphenator = null;
                try {
                    hyphenator = HyphenationCompoundWordTokenFilter.getHyphenationTree( inputSource );
                } catch ( Exception ex ) {
                    Assert.fail( "", ex );
                }
                // Dictionary containing both expected subwords.
                HashSet dict = new HashSet( Arrays.asList( new String[]{"brust", "amputation"} ) );
                // minSubwordSize is 4 here; onlyLongestMatch is false.
                return new HyphenationCompoundWordTokenFilter(
                    Version.LUCENE_31,
                    new WhitespaceTokenizer( Version.LUCENE_31, reader ),
                    hyphenator,
                    dict,
                    CompoundWordTokenFilterBase.DEFAULT_MIN_WORD_SIZE,
                    4,
                    CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE,
                    false );
            }
        };

        TokenStream tokenStream = compoundAnalyzer.tokenStream( "Kurztext", new StringReader( "brustamputation" ) );
        CharTermAttribute t = tokenStream.addAttribute( CharTermAttribute.class );

        // Collect every token emitted for the compound.
        Set<String> tokenSet = new HashSet<String>();
        while ( tokenStream.incrementToken() ) {
            tokenSet.add( t.toString() );
            System.out.println( t );
        }

        Assert.assertTrue( tokenSet.contains( "brust" ), "brust" );
        Assert.assertTrue( tokenSet.contains( "brustamputation" ), "brustamputation" );
        Assert.assertTrue( tokenSet.contains( "amputation" ), "amputation" );
    }
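
For reference, the expectation encoded in the test can be sketched without Lucene. The snippet below is only an illustration under stated assumptions, not the filter's actual implementation: the class DecompoundSketch and its decompose method are hypothetical, and the hyphenation points (brust-am-pu-ta-ti-on) are assumed by hand rather than taken from de_DR.xml. It emits the original token and then every dictionary entry that spans a run of consecutive hyphenation points and respects the subword size limits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: a simplified dictionary-based
// decompounder driven by a precomputed list of hyphenation points.
class DecompoundSketch {

    // points must be sorted ascending and include 0 and token.length().
    public static List<String> decompose(String token, int[] points, Set<String> dict,
                                         int minSubwordSize, int maxSubwordSize) {
        List<String> out = new ArrayList<String>();
        out.add(token); // the original token is always kept
        String lower = token.toLowerCase(Locale.ROOT);
        for (int i = 0; i < points.length; i++) {
            for (int j = i + 1; j < points.length; j++) {
                int len = points[j] - points[i];
                if (len > maxSubwordSize) {
                    break; // later end points only make the candidate longer
                }
                if (len < minSubwordSize) {
                    continue; // too short, try the next end point
                }
                String part = lower.substring(points[i], points[j]);
                if (dict.contains(part)) {
                    out.add(part);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<String>(Arrays.asList("brust", "amputation"));
        // Assumed hyphenation: brust-am-pu-ta-ti-on
        int[] points = {0, 5, 7, 9, 11, 13, 15};
        // prints [brustamputation, brust, amputation]
        System.out.println(decompose("brustamputation", points, dict, 4, 15));
    }
}
```

With a minimum subword size of 4 and a maximum of 15 (the values assumed here), both "brust" and "amputation" qualify, which is the token set the test asserts.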