[jira] [Comment Edited] (OAK-4042) Full text search doesn't work for prefix text containing GB-18030 characters

Vikas Saurabh (JIRA) Thu, 25 Feb 2016 22:58:23 -0800

    [ 
https://issues.apache.org/jira/browse/OAK-4042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15168547#comment-15168547
 ]


Vikas Saurabh edited comment on OAK-4042 at 2/26/16 6:47 AM:
-------------------------------------------------------------

[~chetanm] pointed out offline that the issue isn't really about analysis of 
GB-18030 chars but that wildcard queries generally [don't get 
analyzed|https://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F].
 If we want to run the analyzer over query text that contains a wildcard, then 
we need to use 
[AnalyzingQueryParser|https://lucene.apache.org/core/4_7_1/queryparser/org/apache/lucene/queryparser/analyzing/AnalyzingQueryParser.html]
 instead. That comes with a few caveats; quoting from the javadoc:
{quote}
Warning: This class should only be used with analyzers that do not use 
stopwords or that add tokens. Also, several stemming analyzers are 
inappropriate: for example, GermanAnalyzer will turn Häuser into hau, but 
H?user will become h?user when using this parser and thus no match would be 
found (i.e. using this parser will be no improvement over QueryParser in such 
cases).
{quote}

Btw, to show that the issue isn't specific to GB-18030: currently, querying 
{{192.168.1*}} won't find text containing {{192.168.1.1}}. That's a trivial 
example, of course; the takeaway is the discrepancy between the query text, 
which is broken only on whitespace, and the analyzer, which breaks terms on 
more than just whitespace.
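
To make the mismatch concrete, here is a minimal, self-contained sketch. It does not use Lucene or OakAnalyzer: {{analyze}} is a toy stand-in that lower-cases and splits on anything that isn't a letter or digit (an assumption chosen only to mimic "breaks on more than whitespace"), and {{anyTermStartsWith}} is a hypothetical helper that compares the un-analyzed prefix against the indexed terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PrefixMismatchDemo {

    // Toy stand-in for an index-time analyzer (NOT OakAnalyzer): lower-cases
    // the text and splits it on every run of non-letter, non-digit characters.
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Hypothetical prefix match: the prefix is used as typed, without being
    // run through analyze(), and compared against each indexed term.
    public static boolean anyTermStartsWith(List<String> indexedTerms, String prefix) {
        for (String term : indexedTerms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> indexed = analyze("192.168.1.1");
        System.out.println(indexed);                                 // [192, 168, 1, 1]
        System.out.println(anyTermStartsWith(indexed, "192.168.1")); // false
        System.out.println(anyTermStartsWith(indexed, "192"));       // true
    }
}
```

No single indexed term starts with {{192.168.1}}, so the un-analyzed prefix finds nothing even though the original text obviously contains it.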

So, here's my suggestion: since OakAnalyzer is a fairly simple analyzer, we 
should use AnalyzingQueryParser when OakAnalyzer is in play. On the other hand, 
we could expose a boolean prop on {{analyzers}} config (default=false) which 
allows custom configured analyzers to use AnalyzingQueryParser if necessary.
([~chetanm], [~teofili]... thoughts?)
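
Roughly, the selection rule proposed above could look like the sketch below. This is only an illustration of the decision logic: the enum {{ParserKind}}, the method {{choose}} and both parameter names are hypothetical, and no actual Oak or Lucene classes are involved.

```java
public class ParserSelectionSketch {

    // Hypothetical labels for the two parser choices under discussion.
    public enum ParserKind { CLASSIC, ANALYZING }

    // usesOakAnalyzer: true when the default OakAnalyzer is in play.
    // analyzingParserFlag: the proposed boolean prop on the analyzers config
    // (default false), only consulted for custom configured analyzers.
    public static ParserKind choose(boolean usesOakAnalyzer, boolean analyzingParserFlag) {
        if (usesOakAnalyzer) {
            return ParserKind.ANALYZING;
        }
        return analyzingParserFlag ? ParserKind.ANALYZING : ParserKind.CLASSIC;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, false));  // ANALYZING
        System.out.println(choose(false, false)); // CLASSIC
        System.out.println(choose(false, true));  // ANALYZING
    }
}
```

With the flag defaulting to false, custom analyzers (for example stemming ones covered by the javadoc warning) keep the current behaviour unless they opt in.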

As a side note, I tried using 
{{lucene-analyzers-smartcn->SmartChineseAnalyzer}}, which analyzes 
{{中文标题suffix}} as {{\[中文], \[标题], \[suffix]}}, so the test case 
still won't work.


> Full text search doesn't work for prefix text containing GB-18030 characters
> ----------------------------------------------------------------------------
>
>                 Key: OAK-4042
>                 URL: https://issues.apache.org/jira/browse/OAK-4042
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Vikas Saurabh
>            Assignee: Vikas Saurabh
>             Fix For: 1.6
>
>
> For a full text indexed field {{text}} and a node having
> {{/a/b/@text="some text normaltextsuffix and 中文标题suffix."}}, this node should 
> be returned for:
> {{SELECT * from \[nt:base] WHERE CONTAINS([text], '中文标题*')}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
