[
https://issues.apache.org/jira/browse/LUCENE-6991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115414#comment-15115414
]
Pawel Rog edited comment on LUCENE-6991 at 1/25/16 3:41 PM:
------------------------------------------------------------
Thanks for the suggestion. When I changed whitespace tokenizer to keyword
tokenizer the test passes. Nevertheless I think the problem stays in
WordDelimiterFilter. Right?
was (Author: prog):
Thanks for the suggestion. When I changed whitespace tokenizer to keyword
tokenizer the test passes.
> WordDelimiterFilter bug
> -----------------------
>
> Key: LUCENE-6991
> URL: https://issues.apache.org/jira/browse/LUCENE-6991
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 4.10.4, 5.3.1
> Reporter: Pawel Rog
> Priority: Minor
>
> I was preparing analyzer which contains WordDelimiterFilter and I realized it
> sometimes gives results different then expected.
> I prepared a short test which shows the problem. I haven't used Lucene tests
> for this but this doesn't matter for showing the bug.
> {code}
> String urlIndexed = "144.214.37.14 - - [05/Jun/2013:08:39:27 +0000] \"GET
> /products/key-phrase-extractor/ HTTP/1.1\"" +
> " 200 3437 http://www.google.com/url?sa=t&rct=j&q=&esrc=s&" +
>
> "source=web&cd=15&cad=rja&ved=0CEgQFjAEOAo&url=http%3A%2F%2Fwww.sematext.com%2Fproducts%2Fkey-"
> +
>
> "phrase-extractor%2F&ei=TPOuUbaWM-OKiQfGxIGYDw&usg=AFQjCNGwYAFYg_M3EZnp2eEWJzdvRrVPrg&sig2"
> +
> "=oYitONI2EIZ0CQar7Ej8HA&bvm=bv.47380653,d.aGc\" \"Mozilla/5.0
> (X11; Ubuntu; Linux i686; rv:20.0) " +
> "Gecko/20100101 Firefox/20.0\"";
> List<String> tokens1 = new ArrayList<String>();
> List<String> tokens2 = new ArrayList<String>();
> WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
> TokenStream tokenStream = analyzer.tokenStream("test", urlIndexed);
> tokenStream = new WordDelimiterFilter(tokenStream,
> WordDelimiterFilter.GENERATE_WORD_PARTS |
> WordDelimiterFilter.CATENATE_WORDS |
> WordDelimiterFilter.SPLIT_ON_CASE_CHANGE,
> null);
> CharTermAttribute charAttrib =
> tokenStream.addAttribute(CharTermAttribute.class);
> tokenStream.reset();
> while(tokenStream.incrementToken()) {
> tokens1.add(charAttrib.toString());
> System.out.println(charAttrib.toString());
> }
> tokenStream.end();
> tokenStream.close();
> urlIndexed = "144.214.37.14 - - [05/Jun/2013:08:39:27 +0000] \"GET
> /products/key-phrase-extractor/ HTTP/1.1\"" +
> " 200 3437 \"http://www.google.com/url?sa=t&rct=j&q=&esrc=s&" +
>
> "source=web&cd=15&cad=rja&ved=0CEgQFjAEOAo&url=http%3A%2F%2Fwww.sematext.com%2Fproducts%2Fkey-"
> +
>
> "phrase-extractor%2F&ei=TPOuUbaWM-OKiQfGxIGYDw&usg=AFQjCNGwYAFYg_M3EZnp2eEWJzdvRrVPrg&sig2"
> +
> "=oYitONI2EIZ0CQar7Ej8HA&bvm=bv.47380653,d.aGc\" \"Mozilla/5.0 (X11;
> Ubuntu; Linux i686; rv:20.0) " +
> "Gecko/20100101 Firefox/20.0\"";
> System.out.println("\n\n====\n\n");
> tokenStream = analyzer.tokenStream("test", urlIndexed);
> tokenStream = new WordDelimiterFilter(tokenStream,
> WordDelimiterFilter.GENERATE_WORD_PARTS |
> WordDelimiterFilter.CATENATE_WORDS |
> WordDelimiterFilter.SPLIT_ON_CASE_CHANGE,
> null);
> charAttrib = tokenStream.addAttribute(CharTermAttribute.class);
> tokenStream.reset();
> while(tokenStream.incrementToken()) {
> tokens2.add(charAttrib.toString());
> System.out.println(charAttrib.toString());
> }
> tokenStream.end();
> tokenStream.close();
> assertEquals(Joiner.on(",").join(tokens1), Joiner.on(",").join(tokens2));
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]