[jira] [Commented] (OAK-3276) Make StandardAnalyzer as the default search analyzer

Vikas Saurabh (JIRA) Tue, 10 Nov 2015 13:38:32 -0800

    [ 
https://issues.apache.org/jira/browse/OAK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999387#comment-14999387
 ]


Vikas Saurabh commented on OAK-3276:
------------------------------------

I get the following test failures in oak-lucene after a trivial switch to 
{{StandardAnalyzer}}:
{noformat}
Failed tests:
  testFulltext(org.apache.jackrabbit.oak.jcr.query.QueryFulltextTest): expected:<[]> but was:<[/testroot/node3]>
  testSpellcheckMultipleWords(org.apache.jackrabbit.oak.jcr.query.SpellcheckTest): expected:<[[voting in ontario]]> but was:<[[]]>
  testSpellcheckSql(org.apache.jackrabbit.oak.jcr.query.SpellcheckTest): expected:<[hello[, hold]]> but was:<[hello[]]>
  testSpellcheckXPath(org.apache.jackrabbit.oak.jcr.query.SpellcheckTest): expected:<[hello[, hold]]> but was:<[hello[]]>
  containsPathStrict(org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexQueryTest): Expected path /match_on_path not found, got []
  containsPathStrictNum(org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexQueryTest): Expected path /match_on_path1234 not found, got []
  analyzerWithStopWords(org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexTest): Result set size is different (..)
  testTokens(org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexTest): expected:<[first, second]> but was:<[first_second]>

Tests in error:
  sql1(org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexQueryTest): Results in target\oajopi.lucene.LuceneIndexQueryTest_sql1.txt don't match expected results in C:\Users\vsaurabh\Documents\Projects\CQ-misc\jackrabbit-oak\oak-lucene\target\test-classes\org\apache\jackrabbit\oak\query\sql1.txt; compare the files for details; got=(..)
{noformat}

The failures fall into 2 categories:
* Stop words are removed by {{StandardAnalyzer}} but not by {{OakAnalyzer}}.
* {{OakAnalyzer}} uses a {{WordDelimiterFilter}}, which splits on {{:}}, {{_}} 
and {{.}}, while {{StandardAnalyzer}} (internally {{StandardTokenizer}}) 
doesn't. The set of affected delimiters is possibly bigger than what the tests 
here cover.
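
To make the two categories concrete, here is a minimal plain-Java model of the behaviour. It is not Lucene code and does not reproduce either analyzer: the class name, the {{tokenize}} helper and the three-word stop list are assumptions made up for illustration, and the real {{WordDelimiterFilter}} does more than split on these three characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy model of the two failure categories; not Lucene code.
class AnalyzerDifferenceSketch {

    // Assumed stop list: only the three words that show up in the failing tests.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("or", "in", "was"));

    /**
     * Lower-cases the text and splits it on whitespace. When splitOnDelimiters
     * is true it also splits on ':', '_' and '.'. Tokens contained in
     * stopWords are dropped.
     */
    static List<String> tokenize(String text, boolean splitOnDelimiters, Set<String> stopWords) {
        String separators = splitOnDelimiters ? "[\\s:_.]+" : "\\s+";
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split(separators)) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> noStopWords = new HashSet<>();

        // Category 1: a stop list removes 'in', an empty stop list keeps it.
        System.out.println(tokenize("voting in ontario", false, STOP_WORDS));  // [voting, ontario]
        System.out.println(tokenize("voting in ontario", false, noStopWords)); // [voting, in, ontario]

        // Category 2: without delimiter splitting the compound stays one token.
        System.out.println(tokenize("first_second", false, noStopWords)); // [first_second]
        System.out.println(tokenize("first_second", true, noStopWords));  // [first, second]
        System.out.println(tokenize("rep:hold", true, noStopWords));      // [rep, hold]
    }
}
```

In this model, a stop list with no delimiter splitting corresponds to the failing configuration, and an empty stop list with delimiter splitting corresponds to what the existing tests expect.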

Stop word failures
||Test case||Error||Comment||
|QueryFulltextTest#testFulltext|expected:<\[]> but was:<\[/testroot/node3]>|stop word 'or'|
|SpellcheckTest#testSpellcheckMultipleWords|expected:<\[\[voting in ontario]]> but was:<\[\[]]>|stop word 'in'|
|LuceneIndexTest#analyzerWithStopWords|Result set size is different (..)|stop word 'was'|

WordDelimiterFilter failures
||Test case||Error||Comment||
|SpellcheckTest#testSpellcheckSql|expected:<\[hello\[, hold]]> but was:<\[hello\[]]>|delimiter {{:}}. 'hold' is suggested because of 'rep:hold'|
|SpellcheckTest#testSpellcheckXPath|expected:<\[hello\[, hold]]> but was:<\[hello\[]]>|delimiter {{:}}. 'hold' is suggested because of 'rep:hold'|
|LuceneIndexQueryTest#containsPathStrict|Expected path /match_on_path not found, got \[]|delimiter {{_}}|
|LuceneIndexQueryTest#containsPathStrictNum|Expected path /match_on_path1234 not found, got \[]|delimiter {{_}}|
|LuceneIndexTest#testTokens|expected:<\[first, second]> but was:<\[first_second]>|delimiter {{_}}|
|LuceneIndexQueryTest#sql1| |delimiter {{.}}. 'jackrabbit' is picked as a spellcheck option because jackrabbit.apache.org is present on the namespaces node.|


The stop word failures can be fixed trivially by passing an EMPTY stop word 
set to the {{StandardAnalyzer}} constructor (I'm assuming that the explicit 
test case against stop words implies that OOTB we are not supposed to have any 
stop words).
For the word delimiting issues, I couldn't find a way to make 
{{StandardAnalyzer}} use {{WordDelimiterFilter}}. Also, since we have explicit 
test cases for delimiting on ':' and '_', I'm assuming that we are required to 
delimit on those. '.' is undefined from a test case perspective, but it feels 
like it should be used as a delimiter. [~teofili], how can we do this?

> Make StandardAnalyzer as the default search analyzer
> ----------------------------------------------------
>
>                 Key: OAK-3276
>                 URL: https://issues.apache.org/jira/browse/OAK-3276
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Satya Deep Maheshwari
>            Assignee: Tommaso Teofili
>             Fix For: 1.4
>
>
> Please vote on RTC for making 
> org.apache.lucene.analysis.standard.StandardAnalyzer
>  as the default OOTB analyzer.  This analyzer is capable of handling 
> surrogate characters unlike the current default analyzer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
