[jira] [Comment Edited] (LUCENE-7639) Use Suffix Arrays for fast search with leading asterisks

Uwe Schindler (JIRA) Wed, 18 Jan 2017 03:21:07 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827886#comment-15827886
 ]


Uwe Schindler edited comment on LUCENE-7639 at 1/18/17 11:20 AM:
-----------------------------------------------------------------

Hi,
I am fine with adding those optimizations to the index file format (one could 
save the suffix array as part of the BlockTreeTermsdict?). As this is not 
needed by all users, the right way to do this would be to use another codec 
that adds this to the terms dict.

What is not possible in Lucene (which is a library and is not used only by 
Solr):
- Don't spawn thread pools or threads anywhere inside Lucene. That has to be 
done by the calling code (e.g., by passing an Executor instance, see 
IndexSearcher). Of course this does not work with static initializers, and an 
Executor also cannot be passed to codecs. We have a forbiddenapis rule that 
forbids that. In addition, you are using a thread pool without proper thread 
names.
- Lucene does not read system properties anywhere, especially not ones 
starting with "solr.". The right way to configure this is to customize the 
codec you pass to IndexWriter/IndexReader.
- We also do not want to do stuff like this when the IndexReader initializes or 
the first query executes. For that reason "FieldCache" was removed, which was 
invented as a workaround in Lucene initially (uncontrollable memory usage, 
delays on searches,...), and it took a long time to get rid of it! This patch 
looks just like it: it is a "cache" which is generated on the fly. The 
message here is: if something like a suffix tree is useful for searching, then 
it must be persisted to disk during indexing (see the proposal above to have a 
BlockTreeTerms variant as a separate codec).
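
To illustrate the first two points, here is a minimal sketch in plain JDK Java 
(no Lucene classes). All names in it (SuffixStructureBuilder, 
NamedThreadFactory, minSuffixLength) are hypothetical and are not part of 
Lucene or of the attached patch. It only shows the general shape: the library 
class receives its executor and its settings from the caller, and the caller 
owns a pool whose threads have proper names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical library-side class: it never creates threads and never reads
// system properties. Everything it needs arrives through the constructor.
final class SuffixStructureBuilder {
    private final ExecutorService executor; // created and owned by the caller
    private final int minSuffixLength;      // explicit setting, not a -D flag

    SuffixStructureBuilder(ExecutorService executor, int minSuffixLength) {
        this.executor = executor;
        this.minSuffixLength = minSuffixLength;
    }

    // Counts the suffixes of at least minSuffixLength characters over all
    // terms, using one task per term on the caller-supplied executor.
    int countSuffixes(List<String> terms) {
        List<Future<Integer>> futures = new ArrayList<>();
        for (String term : terms) {
            Callable<Integer> task =
                () -> Math.max(0, term.length() - minSuffixLength + 1);
            futures.add(executor.submit(task));
        }
        int total = 0;
        try {
            for (Future<Integer> future : futures) {
                total += future.get();
            }
        } catch (Exception e) {
            throw new IllegalStateException("suffix counting failed", e);
        }
        return total;
    }
}

// Caller-side helper: gives every pool thread a recognizable name.
final class NamedThreadFactory implements ThreadFactory {
    private final String prefix;
    private final AtomicInteger counter = new AtomicInteger();

    NamedThreadFactory(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public Thread newThread(Runnable r) {
        Thread t = new Thread(r, prefix + "-" + counter.incrementAndGet());
        t.setDaemon(true);
        return t;
    }
}

// The application, not the library, decides pool size and lifetime.
final class CallerDemo {
    public static void main(String[] args) {
        ExecutorService pool =
            Executors.newFixedThreadPool(2, new NamedThreadFactory("suffix-build"));
        try {
            SuffixStructureBuilder builder = new SuffixStructureBuilder(pool, 3);
            System.out.println(builder.countSuffixes(Arrays.asList("lucene", "solr", "ab")));
        } finally {
            pool.shutdown();
        }
    }
}
```

The sketch does not cover the third point: in a real codec the structure would 
be written to disk during indexing rather than built when a reader opens.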



> Use Suffix Arrays for fast search with leading asterisks
> --------------------------------------------------------
>
>                 Key: LUCENE-7639
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7639
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Yakov Sirotkin
>         Attachments: suffix-array.patch
>
>
> If a query term starts with an asterisk, the FST checks all words in the 
> dictionary, so request processing slows down. This problem can be solved 
> with the Suffix Array approach. Luckily, a Suffix Array can be constructed 
> from the existing index after Lucene starts. Unfortunately, Suffix Arrays 
> require a lot of RAM, so we can use them only when a special flag is set:
> -Dsolr.suffixArray.enable=true
> It is possible to speed up Suffix Array initialization by using several 
> threads, so we can control the number of threads with 
> -Dsolr.suffixArray.initialization_treads_count=5
> This system property can be omitted; the default value is 5.
> The attached patch is the suggested implementation of SuffixArray support. It 
> works for all terms that start with an asterisk and contain at least 3 
> consecutive non-wildcard characters. This patch does not change search 
> results and affects only performance.
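
For readers unfamiliar with the technique described in the issue, the following 
is a deliberately naive sketch of a suffix array over a small term dictionary, 
in plain JDK Java. It is not the code from suffix-array.patch, and the class 
and method names (TermSuffixArray, termsContaining) are made up for this 
illustration. Every suffix of every term is sorted once; a binary search then 
finds the terms containing a given run of characters without scanning the whole 
dictionary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Naive illustration only: materializing substrings during sorting is slow,
// and keeping one entry per suffix is what makes the structure RAM-hungry.
final class TermSuffixArray {
    private final List<String> terms;
    // Each entry is {termIndex, offset}; entries are sorted by suffix text.
    private final int[][] suffixes;

    TermSuffixArray(List<String> terms) {
        this.terms = new ArrayList<>(terms);
        List<int[]> all = new ArrayList<>();
        for (int t = 0; t < this.terms.size(); t++) {
            for (int off = 0; off < this.terms.get(t).length(); off++) {
                all.add(new int[] {t, off});
            }
        }
        all.sort((a, b) -> suffix(a).compareTo(suffix(b)));
        this.suffixes = all.toArray(new int[0][]);
    }

    private String suffix(int[] entry) {
        return terms.get(entry[0]).substring(entry[1]);
    }

    // Returns the distinct terms that contain the given character run.
    TreeSet<String> termsContaining(String infix) {
        // Binary search for the first suffix that is >= infix.
        int lo = 0;
        int hi = suffixes.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (suffix(suffixes[mid]).compareTo(infix) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        // All suffixes that start with infix are adjacent from that position.
        TreeSet<String> result = new TreeSet<>();
        for (int i = lo; i < suffixes.length && suffix(suffixes[i]).startsWith(infix); i++) {
            result.add(terms.get(suffixes[i][0]));
        }
        return result;
    }
}

final class SuffixArrayDemo {
    public static void main(String[] args) {
        TermSuffixArray index = new TermSuffixArray(
            Arrays.asList("lucene", "solr", "elasticsearch", "search"));
        // A query such as *arc* reduces to a lookup of the run "arc".
        System.out.println(index.termsContaining("arc"));
    }
}
```

For a pattern with more than one literal run, the terms returned for one run 
would still have to be checked against the full wildcard pattern.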


