[
https://issues.apache.org/jira/browse/OAK-4042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15168547#comment-15168547
]
Vikas Saurabh edited comment on OAK-4042 at 2/26/16 6:47 AM:
-------------------------------------------------------------
[~chetanm] pointed out offline that the issue isn't really about analysis of
GB-18030 chars but that wildcard queries generally [don't get
analyzed|https://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F].
If we want to run the analyzer over query text that contains wildcards, then we
need to use
[AnalyzingQueryParser|https://lucene.apache.org/core/4_7_1/queryparser/org/apache/lucene/queryparser/analyzing/AnalyzingQueryParser.html]
instead. That comes with a few caveats; quoting from the javadoc:
{quote}
Warning: This class should only be used with analyzers that do not use
stopwords or that add tokens. Also, several stemming analyzers are
inappropriate: for example, GermanAnalyzer will turn Häuser into hau, but
H?user will become h?user when using this parser and thus no match would be
found (i.e. using this parser will be no improvement over QueryParser in such
cases).
{quote}
Btw, about the issue not being specific to GB-18030: currently, querying
{{192.168.1*}} won't find text containing {{192.168.1.1}}. That's a trivial
example, of course; the takeaway is the discrepancy between the query text
being split only on whitespace and the analyzer splitting terms on more than
just whitespace.
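To make that discrepancy concrete, here's a rough, self-contained sketch. It
doesn't use Lucene at all: {{analyze}} is a hypothetical stand-in that
lowercases and splits on anything that isn't a letter or digit, which is only
an assumption about how an analyzer might break up {{192.168.1.1}}, not
OakAnalyzer's actual token stream. {{rawPrefixMatch}} is likewise a made-up
helper that mimics comparing an un-analyzed prefix against the indexed tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WildcardAnalysisSketch {

    // Toy stand-in for an analyzer (assumption, not OakAnalyzer): lowercase the
    // text and split it on every run of characters that are not letters or digits.
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Mimics an un-analyzed prefix query: the raw prefix (the query text with the
    // trailing '*' removed) is compared as-is against each indexed token.
    public static boolean rawPrefixMatch(String indexedText, String rawPrefix) {
        for (String token : analyze(indexedText)) {
            if (token.startsWith(rawPrefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The indexed side is split on the dots...
        System.out.println(analyze("192.168.1.1"));                     // [192, 168, 1, 1]
        // ...so the raw prefix, which still contains dots, matches no token.
        System.out.println(rawPrefixMatch("192.168.1.1", "192.168.1")); // false
        // A prefix that stays inside a single token does match.
        System.out.println(rawPrefixMatch("192.168.1.1", "192"));       // true
    }
}
```

In this toy model the prefix only matches while it stays inside a single
indexed token; as soon as it spans a boundary the analyzer would have split
on, nothing in the index can start with it.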
So, here's my suggestion: since OakAnalyzer is a fairly simple analyzer, we
should use AnalyzingQueryParser when OakAnalyzer is in play. For custom
configured analyzers, on the other hand, we'd expose a boolean prop on the
{{analyzers}} config (default=false) to opt in to AnalyzingQueryParser if
necessary.
([~chetanm], [~teofili]... thoughts?)
As a side note, I tried using
{{lucene-analyzers-smartcn->SmartChineseAnalyzer}}, which analyzes
{{中文标题suffix}} as {{\[中文], \[标题], \[suffix]}}, so the test case still won't
work.
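A quick sketch of why the test case still fails with that tokenization. It
only uses the three tokens reported above; {{anyTokenStartsWith}} is a
hypothetical helper standing in for an un-analyzed prefix lookup, not a Lucene
API.

```java
import java.util.Arrays;
import java.util.List;

public class SmartcnPrefixSketch {

    // Tokens as reported above for the text 中文标题suffix.
    static final List<String> TOKENS = Arrays.asList("中文", "标题", "suffix");

    // Hypothetical helper: true if any indexed token starts with the raw prefix.
    public static boolean anyTokenStartsWith(String rawPrefix) {
        for (String token : TOKENS) {
            if (token.startsWith(rawPrefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The queried prefix spans two tokens, so no single token starts with it.
        System.out.println(anyTokenStartsWith("中文标题")); // false
        // A prefix confined to the first token would match.
        System.out.println(anyTokenStartsWith("中文"));     // true
    }
}
```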
was (Author: catholicon):
[~chetanm] pointed out offline that the issue isn't really about analysis of
GB-18030 chars but that queries generally [don't get
analyzed|https://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F].
If we want to run the analyzer over query text, then we need to use
[AnalyzingQueryParser|https://lucene.apache.org/core/4_7_1/queryparser/org/apache/lucene/queryparser/analyzing/AnalyzingQueryParser.html]
instead. That comes with a few caveats; quoting from the javadoc:
{quote}
Warning: This class should only be used with analyzers that do not use
stopwords or that add tokens. Also, several stemming analyzers are
inappropriate: for example, GermanAnalyzer will turn Häuser into hau, but
H?user will become h?user when using this parser and thus no match would be
found (i.e. using this parser will be no improvement over QueryParser in such
cases).
{quote}
Btw, about the issue not being specific to GB-18030: currently, querying
{{192.168.1*}} won't find text containing {{192.168.1.1}}. That's a trivial
example, of course; the takeaway is the discrepancy between the query text
being split only on whitespace and the analyzer splitting terms on more than
just whitespace.
So, here's my suggestion: since OakAnalyzer is a fairly simple analyzer, we
should use AnalyzingQueryParser when OakAnalyzer is in play. For custom
configured analyzers, on the other hand, we'd expose a boolean prop on the
{{analyzers}} config (default=false) to opt in to AnalyzingQueryParser if
necessary.
([~chetanm], [~teofili]... thoughts?)
As a side note, I tried using
{{lucene-analyzers-smartcn->SmartChineseAnalyzer}}, which analyzes
{{中文标题suffix}} as {{\[中文], \[标题], \[suffix]}}, so the test case still won't
work.
> Full text search doesn't work for prefix text containing GB-18030 characters
> ----------------------------------------------------------------------------
>
> Key: OAK-4042
> URL: https://issues.apache.org/jira/browse/OAK-4042
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: lucene
> Reporter: Vikas Saurabh
> Assignee: Vikas Saurabh
> Fix For: 1.6
>
>
> For a full text indexed field {{text}} and a node having
> {{/a/b/@text="some text normaltextsuffix and 中文标题suffix."}}, this node should
> be returned for:
> {{SELECT * from \[nt:base] WHERE CONTAINS([text], '中文标题*')}}.