[jira] [Commented] (LUCENE-7639) Use Suffix Arrays for fast search with leading asterisks

Michael McCandless (JIRA) Fri, 03 Feb 2017 03:43:09 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851384#comment-15851384 ]


Michael McCandless commented on LUCENE-7639:
--------------------------------------------

It would be nice to have a simple option to make heavy (infix, prefix) wildcard 
queries fast.

I think a custom codec, and likely a custom {{WildcardQuery}} impl that "sees" 
it is working with the custom codec and taps into the suffix array, is a good 
way to implement this.  It should maybe be a straightforward conversion of the 
current patch into a custom codec, i.e. your codec's postings implementation 
would wrap the default codec and hold onto the {{WildcardHelper}} instance.

Separately, I am curious how [~dawid.weiss]'s idea (also index the reversed 
form of the field, then do two prefix searches and intersect the resulting 
terms) compares in performance (index time, index size, search heap, query 
cost) to the suffix array.
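
To make that comparison concrete, here is a minimal sketch of the reversed-field idea using plain sorted sets in place of a real terms dictionary. The class and method names ({{ReversedTermsSketch}}, {{match}}) are hypothetical, nothing here uses Lucene's API, and it assumes terms never contain U+FFFF (used as the upper bound of a prefix range). It only covers {{prefix*suffix}} and {{*suffix}} shapes, not infix queries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy term dictionary: sorted terms plus a sorted set of the same terms reversed. */
class ReversedTermsSketch {
    private final NavigableSet<String> forward = new TreeSet<>();
    private final NavigableSet<String> reversed = new TreeSet<>();

    void add(String term) {
        forward.add(term);
        reversed.add(new StringBuilder(term).reverse().toString());
    }

    /** All entries of a sorted set that start with the given prefix. */
    private static SortedSet<String> prefixRange(NavigableSet<String> set, String prefix) {
        if (prefix.isEmpty()) {
            return set;
        }
        // Assumption: no term contains U+FFFF, so this bounds the prefix range.
        return set.subSet(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    /** Terms matching prefix*suffix; either part may be empty. */
    List<String> match(String prefix, String suffix) {
        String reversedSuffix = new StringBuilder(suffix).reverse().toString();
        // Second prefix search runs over the reversed terms; un-reverse its hits.
        TreeSet<String> endsWithSuffix = new TreeSet<>();
        for (String r : prefixRange(reversed, reversedSuffix)) {
            endsWithSuffix.add(new StringBuilder(r).reverse().toString());
        }
        List<String> result = new ArrayList<>();
        int minLength = prefix.length() + suffix.length();
        for (String term : prefixRange(forward, prefix)) {
            // The length check rejects terms where prefix and suffix would overlap.
            if (term.length() >= minLength && endsWithSuffix.contains(term)) {
                result.add(term);
            }
        }
        return result;
    }
}
```

The intersection step is where the cost would show up: each side of the query may enumerate many terms before the intersection narrows them down, which is one of the things worth measuring against the suffix array.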

The patch falls back to Java's {{Pattern}} for checking each term in the more 
complex cases, but couldn't you just use the {{CompiledAutomaton}}'s {{run}} 
method to check instead?

I ran Lucene's tests with this patch, after first 1) hard-wiring the property 
check to {{true}} and 2) making the init synchronous (not using the thread 
pool). Lucene's {{WildcardQuery}} tests hit some failures, e.g.:

{noformat}
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestTerms -Dtests.method=testTermMinMaxRandom -Dtests.seed=11FDF2AE5B77A883 -Dtests.locale=es-GT -Dtests.timezone=CET -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] FAILURE 0.04s J0 | TestTerms.testTermMinMaxRandom <<<
   [junit4]    > Throwable #1: java.lang.AssertionError
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([11FDF2AE5B77A883:5D8E7DDA37EB73E6]:0)
   [junit4]    >        at org.apache.lucene.util.UnicodeUtil.UTF8toUTF16(UnicodeUtil.java:593)
   [junit4]    >        at org.apache.lucene.util.BytesRef.utf8ToString(BytesRef.java:152)
   [junit4]    >        at org.apache.lucene.codecs.blocktree.WildcardHelper.<init>(WildcardHelper.java:106)
   [junit4]    >        at org.apache.lucene.codecs.blocktree.FieldReader.<init>(FieldReader.java:106)
   [junit4]    >        at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:234)
   [junit4]    >        at org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat.fieldsProducer(Lucene50PostingsFormat.java:445)
   [junit4]    >        at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:109)
   [junit4]    >        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:74)
   [junit4]    >        at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:143)
   [junit4]    >        at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:195)
   [junit4]    >        at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:103)
   [junit4]    >        at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:473)
   [junit4]    >        at org.apache.lucene.index.RandomIndexWriter.getReader(RandomIndexWriter.java:376)
   [junit4]    >        at org.apache.lucene.index.RandomIndexWriter.getReader(RandomIndexWriter.java:313)
   [junit4]    >        at org.apache.lucene.index.TestTerms.testTermMinMaxRandom(TestTerms.java:76)
   [junit4]    >        at java.lang.Thread.run(Thread.java:745)
{noformat}

But those failures are possibly harmless, because those tests are sending 
non-UTF8 data into Lucene, whereas this change (the property) would only be 
enabled on fields that are UTF8.
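
If the helper could ever see arbitrary binary terms, one option would be to validate the bytes before decoding rather than relying on the assertion inside {{UnicodeUtil}}. The following is only a sketch using the JDK's strict UTF-8 decoder. {{Utf8TermCheck}} and {{decodeOrNull}} are hypothetical names, not part of the patch or of Lucene.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8TermCheck {
    /** Returns the decoded term, or null if the bytes are not well-formed UTF-8. */
    static String decodeOrNull(byte[] bytes, int offset, int length) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes, offset, length)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }
}
```

A caller could then skip such a term, or skip building the suffix array for the whole field, when {{decodeOrNull}} returns {{null}}.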

It also hit a stack overflow w/ a long term:

{noformat}
   [junit4] Suite: org.apache.lucene.index.TestIndexWriter
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexWriter -Dtests.method=testWickedLongTerm -Dtests.seed=524558B2F180613F -Dtests.locale=ru-RU -Dtests.timezone=Atlantic/Faroe -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR   2.47s J1 | TestIndexWriter.testWickedLongTerm <<<
   [junit4]    > Throwable #1: java.lang.StackOverflowError
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([524558B2F180613F:120594CA57A7BADF]:0)
   [junit4]    >        at org.apache.lucene.codecs.blocktree.SuffixArrayBytes.sort(SuffixArrayBytes.java:82)
   [junit4]    >        at org.apache.lucene.codecs.blocktree.SuffixArrayBytes.sort(SuffixArrayBytes.java:84)
   [junit4]    >        at org.apache.lucene.codecs.blocktree.SuffixArrayBytes.sort(SuffixArrayBytes.java:84)
   [junit4]    >        at org.apache.lucene.codecs.blocktree.SuffixArrayBytes.sort(SuffixArrayBytes.java:84)
{noformat}

I guess your radix sort implementation is consuming one Java stack frame per 
character in the term.  Maybe in practice this is OK too.
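
If it does turn out to matter, the recursion could be replaced by an explicit work stack on the heap. Below is a rough sketch of an MSD radix sort over the suffixes of a byte array that never recurses. It is not the patch's {{SuffixArrayBytes}} code, and {{IterativeSuffixSort}} is a hypothetical name. It is still quadratic on long runs of one byte, but its Java stack depth stays constant.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

class IterativeSuffixSort {
    /** Bucket 0 means the suffix has ended; otherwise the unsigned byte value + 1. */
    private static int bucket(byte[] text, int suffix, int depth) {
        int pos = suffix + depth;
        return pos < text.length ? (text[pos] & 0xff) + 1 : 0;
    }

    /** Returns suffix start offsets of text in lexicographic (unsigned byte) order. */
    static int[] sortSuffixes(byte[] text) {
        int n = text.length;
        int[] sa = new int[n];
        for (int i = 0; i < n; i++) {
            sa[i] = i;
        }
        int[] tmp = new int[n];
        int[] count = new int[258];
        // Each work item is {lo, hi, depth}: sort sa[lo..hi) by the byte at depth.
        Deque<int[]> work = new ArrayDeque<>();
        work.push(new int[] {0, n, 0});
        while (!work.isEmpty()) {
            int[] item = work.pop();
            int lo = item[0], hi = item[1], depth = item[2];
            if (hi - lo < 2) {
                continue;
            }
            // Counting sort on the byte at this depth.
            Arrays.fill(count, 0);
            for (int i = lo; i < hi; i++) {
                count[bucket(text, sa[i], depth) + 1]++;
            }
            for (int b = 0; b < 257; b++) {
                count[b + 1] += count[b];
            }
            for (int i = lo; i < hi; i++) {
                tmp[lo + count[bucket(text, sa[i], depth)]++] = sa[i];
            }
            System.arraycopy(tmp, lo, sa, lo, hi - lo);
            // After distribution count[b] is the end of bucket b. Bucket 0 holds
            // at most one (ended) suffix, so only buckets 1..256 need more work.
            for (int b = 1; b <= 256; b++) {
                int start = lo + count[b - 1];
                int end = lo + count[b];
                if (end - start > 1) {
                    work.push(new int[] {start, end, depth + 1});
                }
            }
        }
        return sa;
    }
}
```

For example, {{sortSuffixes("banana".getBytes(StandardCharsets.US_ASCII))}} yields the offsets 5, 3, 1, 0, 4, 2.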


> Use Suffix Arrays for fast search with leading asterisks
> --------------------------------------------------------
>
>                 Key: LUCENE-7639
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7639
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Yakov Sirotkin
>         Attachments: suffix-array.patch
>
>
> If a query term starts with an asterisk, the FST checks every word in the 
> dictionary, so request processing slows down. This problem can be solved with 
> the Suffix Array approach. Luckily, a Suffix Array can be constructed after 
> Lucene starts, from the existing index. Unfortunately, Suffix Arrays require a 
> lot of RAM, so we can use them only when a special flag is set:
> -Dsolr.suffixArray.enable=true
> It is possible to speed up Suffix Array initialization by using several 
> threads, so we can control the number of threads with 
> -Dsolr.suffixArray.initialization_treads_count=5
> This system property can be omitted; the default value is 5.
> The attached patch is the suggested implementation of Suffix Array support. It 
> works for all terms starting with asterisks that have at least 3 consecutive 
> non-wildcard characters. This patch does not change search results and affects 
> only performance.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

