[jira] [Comment Edited] (LUCENE-6991) WordDelimiterFilter bug

Pawel Rog (JIRA) Mon, 25 Jan 2016 07:41:49 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-6991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115414#comment-15115414 ]


Pawel Rog edited comment on LUCENE-6991 at 1/25/16 3:41 PM:
------------------------------------------------------------

Thanks for the suggestion. When I changed the whitespace tokenizer to the 
keyword tokenizer, the test passed. Nevertheless, I think the problem remains 
in WordDelimiterFilter. Right?
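
To see where the two inputs already diverge before any filter runs, here is a minimal plain-Java sketch. It is not Lucene code: `String.split` only stands in for the whitespace tokenizer, so it ignores any Lucene-specific behavior such as token length limits. The class name `TokenDiff`, the method `differingTokens`, and the shortened sample strings are hypothetical and used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class TokenDiff {

    /**
     * Splits both inputs on runs of whitespace and returns one line per
     * position where the tokens differ, giving the index and both lengths.
     */
    static List<String> differingTokens(String first, String second) {
        String[] a = first.trim().split("\\s+");
        String[] b = second.trim().split("\\s+");
        List<String> diffs = new ArrayList<>();
        int n = Math.max(a.length, b.length);
        for (int i = 0; i < n; i++) {
            // A missing token on one side is treated as an empty string.
            String x = i < a.length ? a[i] : "";
            String y = i < b.length ? b[i] : "";
            if (!x.equals(y)) {
                diffs.add("token " + i + ": length " + x.length() + " vs " + y.length());
            }
        }
        return diffs;
    }

    public static void main(String[] args) {
        // Shortened stand-ins for the two log lines (assumption); paste the
        // full strings from the test to check the real token lengths.
        String first = "200 3437 http://www.google.com/url?sa=t \"Mozilla/5.0";
        String second = "200 3437 \"http://www.google.com/url?sa=t \"Mozilla/5.0";
        for (String line : differingTokens(first, second)) {
            System.out.println(line);
        }
    }
}
```

Running this on the full strings from the test should show whether the only differing whitespace-delimited token is the referrer URL, and how long that token is in each input.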


was (Author: prog):
Thanks for the suggestion. When I changed whitespace tokenizer to keyword 
tokenizer the test passes.

> WordDelimiterFilter bug
> -----------------------
>
>                 Key: LUCENE-6991
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6991
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.10.4, 5.3.1
>            Reporter: Pawel Rog
>            Priority: Minor
>
> I was preparing an analyzer that contains WordDelimiterFilter and I realized 
> it sometimes gives results different than expected.
> I prepared a short test that shows the problem. I haven't used Lucene tests 
> for this, but that doesn't matter for showing the bug.
> {code}
>     // First input: the referrer URL has no opening quote before http://
>     String urlIndexed = "144.214.37.14 - - [05/Jun/2013:08:39:27 +0000] \"GET /products/key-phrase-extractor/ HTTP/1.1\"" +
>             " 200 3437 http://www.google.com/url?sa=t&rct=j&q=&esrc=s&" +
>             "source=web&cd=15&cad=rja&ved=0CEgQFjAEOAo&url=http%3A%2F%2Fwww.sematext.com%2Fproducts%2Fkey-" +
>             "phrase-extractor%2F&ei=TPOuUbaWM-OKiQfGxIGYDw&usg=AFQjCNGwYAFYg_M3EZnp2eEWJzdvRrVPrg&sig2" +
>             "=oYitONI2EIZ0CQar7Ej8HA&bvm=bv.47380653,d.aGc\" \"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) " +
>             "Gecko/20100101 Firefox/20.0\"";
>     List<String> tokens1 = new ArrayList<String>();
>     List<String> tokens2 = new ArrayList<String>();
>     WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
>     TokenStream tokenStream = analyzer.tokenStream("test", urlIndexed);
>     tokenStream = new WordDelimiterFilter(tokenStream,
>             WordDelimiterFilter.GENERATE_WORD_PARTS |
>             WordDelimiterFilter.CATENATE_WORDS |
>             WordDelimiterFilter.SPLIT_ON_CASE_CHANGE,
>         null);
>     CharTermAttribute charAttrib = tokenStream.addAttribute(CharTermAttribute.class);
>     tokenStream.reset();
>     while(tokenStream.incrementToken()) {
>       tokens1.add(charAttrib.toString());
>       System.out.println(charAttrib.toString());
>     }
>     tokenStream.end();
>     tokenStream.close();
>     // Second input: the same log line, but with an opening \" before http://
>     urlIndexed = "144.214.37.14 - - [05/Jun/2013:08:39:27 +0000] \"GET /products/key-phrase-extractor/ HTTP/1.1\"" +
>         " 200 3437 \"http://www.google.com/url?sa=t&rct=j&q=&esrc=s&" +
>         "source=web&cd=15&cad=rja&ved=0CEgQFjAEOAo&url=http%3A%2F%2Fwww.sematext.com%2Fproducts%2Fkey-" +
>         "phrase-extractor%2F&ei=TPOuUbaWM-OKiQfGxIGYDw&usg=AFQjCNGwYAFYg_M3EZnp2eEWJzdvRrVPrg&sig2" +
>         "=oYitONI2EIZ0CQar7Ej8HA&bvm=bv.47380653,d.aGc\" \"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) " +
>         "Gecko/20100101 Firefox/20.0\"";
>     System.out.println("\n\n====\n\n");
>     tokenStream = analyzer.tokenStream("test", urlIndexed);
>     tokenStream = new WordDelimiterFilter(tokenStream,
>             WordDelimiterFilter.GENERATE_WORD_PARTS |
>             WordDelimiterFilter.CATENATE_WORDS |
>             WordDelimiterFilter.SPLIT_ON_CASE_CHANGE,
>         null);
>     charAttrib = tokenStream.addAttribute(CharTermAttribute.class);
>     tokenStream.reset();
>     while(tokenStream.incrementToken()) {
>       tokens2.add(charAttrib.toString());
>       System.out.println(charAttrib.toString());
>     }
>     tokenStream.end();
>     tokenStream.close();
>     assertEquals(Joiner.on(",").join(tokens1), Joiner.on(",").join(tokens2));
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
