[jira] [Commented] (SOLR-6810) Faster searching limited but high rows across many shards all with many hits

Per Steffensen (JIRA) Sat, 27 Dec 2014 10:31:22 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259445#comment-14259445
 ]


Per Steffensen commented on SOLR-6810:
--------------------------------------

bq.  IMO, one shouldn't have to look at the patch to figure out what it's 
trying to do.

Seems reasonable. How things change is, IMHO, fairly well documented in the 
JavaDocs of ShardParams.DQA, so I will just quote from there:
* Old DQA (FIND_ID_RELEVANCE_FETCH_BY_IDS)
{code}
   /**
    * Algorithm
    * - Shard-queries 1) Ask, by forwarding the outer query, each shard for id
    *   and relevance of the (up to) #rows most relevant matching documents
    * - Find among those id/relevances the #rows ids with the highest global
    *   relevances (let's call this set of ids X)
    * - Shard-queries 2) Ask, by sending ids, each shard to return the
    *   documents from set X that it holds
    * - Return the fetched documents to the client
    */
...
       // Default do not force skip get-ids phase
{code}
* New DQA (FIND_RELEVANCE_FIND_IDS_LIMITED_ROWS_FETCH_BY_IDS)
{code}
   /**
    * Algorithm
    * - Shard-queries 1) Ask, by forwarding the outer query, each shard for
    *   relevance of the (up to) #rows most relevant matching documents
    * - Find among those relevances the #rows highest global relevances.
    *   Note for each shard (S) how many entries (docs_among_most_relevant(S))
    *   it has among the #rows globally highest relevances
    * - Shard-queries 2) Ask, by forwarding the outer query, each shard S for
    *   id and relevance of the (up to) #docs_among_most_relevant(S) most
    *   relevant matching documents
    * - Find among those id/relevances the #rows ids with the highest global
    *   relevances (let's call this set of ids X)
    * - Shard-queries 3) Ask, by sending ids, each shard to return the
    *   documents from set X that it holds
    * - Return the fetched documents to the client
    *
    * Advantages
    * Asking for data from the store (the id in shard-queries 1) of
    * FIND_ID_RELEVANCE_FETCH_BY_IDS) can be expensive, so sometimes you want
    * to ask for data from as few documents as possible.
    * The main purpose of this algorithm is to limit the rows asked for in
    * shard-queries 2) compared to shard-queries 1) of
    * FIND_ID_RELEVANCE_FETCH_BY_IDS.
    * Let's call the number of rows asked for by the outer request
    * "outer-rows".
    * Shard-queries 2) will never ask for data from more than "outer-rows"
    * documents in total across all involved shards. Shard-queries 1) of
    * FIND_ID_RELEVANCE_FETCH_BY_IDS will ask each shard for data from
    * "outer-rows" documents, and in the worst case, if each shard contains
    * at least "outer-rows" matching documents, you will fetch data for
    * "number of shards involved" * "outer-rows" documents.
    * Using FIND_RELEVANCE_FIND_IDS_LIMITED_ROWS_FETCH_BY_IDS becomes more
    * beneficial the more
    * - shards are involved
    * - and/or matching documents each shard holds
    */
...
    // Default force skip get-ids phase. In this algorithm there is really
    // never any reason not to skip it
{code}
* dqa.forceSkipGetIds
{code}
   /**
    * Request parameter to force skipping the get-ids phase of the distributed
    * query. Value: true or false
    * Even if you do not force it, the system might choose to do it anyway
    * Skipping the get-ids phase
    * - FIND_ID_RELEVANCE_FETCH_BY_IDS: Fetch entire documents in
    *   Shard-queries 1) and skip Shard-queries 2)
    * - FIND_RELEVANCE_FIND_IDS_LIMITED_ROWS_FETCH_BY_IDS: Fetch entire
    *   documents in Shard-queries 2) and skip Shard-queries 3)
    */
{code}
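
To illustrate the step between shard-queries 1) and 2) of the new DQA, here is 
a minimal sketch of how docs_among_most_relevant(S) could be computed from the 
relevance-only responses. It is not taken from the patch: the class 
LimitedRowsSketch, the method rowsPerShard and the map-of-scores input are 
made-up names for illustration only, and ties between equal scores are simply 
resolved in input order.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the SOLR-6810 patch. */
final class LimitedRowsSketch {

    /**
     * Input: for each shard, the scores returned by shard-queries 1)
     * (relevance only, no ids read from the store).
     * Output: for each shard S, how many of its scores are among the
     * `rows` globally highest scores, i.e. docs_among_most_relevant(S).
     */
    static Map<String, Integer> rowsPerShard(Map<String, List<Float>> scoresByShard, int rows) {
        List<Map.Entry<String, Float>> all = new ArrayList<>();
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, List<Float>> e : scoresByShard.entrySet()) {
            counts.put(e.getKey(), 0); // shards with no top entries stay at 0
            for (Float score : e.getValue()) {
                all.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), score));
            }
        }
        // Highest score first. List.sort is stable, so equal scores keep input order.
        all.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        // Count, per shard, the entries that made it into the global top `rows`.
        for (Map.Entry<String, Float> e : all.subList(0, Math.min(rows, all.size()))) {
            counts.merge(e.getKey(), 1, Integer::sum);
        }
        return counts;
    }
}
```

The returned counts sum to at most rows, which is why shard-queries 2) never 
ask for data from more than "outer-rows" documents in total. A shard with a 
count of 0 would not need to be asked at all in this sketch.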

> Faster searching limited but high rows across many shards all with many hits
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-6810
>                 URL: https://issues.apache.org/jira/browse/SOLR-6810
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>            Reporter: Per Steffensen
>            Assignee: Shalin Shekhar Mangar
>              Labels: distributed_search, performance
>         Attachments: branch_5x_rev1642874.patch, branch_5x_rev1642874.patch, 
> branch_5x_rev1645549.patch
>
>
> Searching "limited but high rows across many shards all with many hits" is 
> slow.
> E.g.
> * Query from outside client: q=something&rows=1000
> * Resulting in sub-requests to each shard, something like this:
> ** 1) q=something&rows=1000&fl=id,score
> ** 2) Request the full documents with ids in the global top-1000 found among 
> the top-1000 from each shard
> What does the subject mean?
> * "limited but high rows" means 1000 in the example above
> * "many shards" means 200-1000 in our case
> * "all with many hits" means that each of the shards has a significant 
> number of hits on the query
> The problem grows with all three factors above.
> Doing such a query on our system takes between 5 minutes and 1 hour, 
> depending on a lot of things. It ought to be much faster, so let's make it so.
> Profiling shows that the problem is that it takes a lot of time to access the 
> store to get ids for (up to) 1000 docs (the value of the rows parameter) per 
> shard. With 1000 shards, that is up to 1 million ids that have to be fetched. 
> There is really no good reason to ever read information from the store for 
> more than the overall top-1000 documents that have to be returned to the 
> client.
> For further detail, see the mail thread "Slow searching limited but high rows 
> across many shards all with high hits" started 13/11-2014 on 
> dev@lucene.apache.org
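
As a rough check of the figures in the description above, the worst-case number 
of documents whose stored data is read can be written down directly. This is 
only a back-of-the-envelope sketch: StoreReadEstimate and its two methods are 
hypothetical names, not anything in Solr or in the patch.

```java
/** Hypothetical back-of-the-envelope estimate; not part of Solr. */
final class StoreReadEstimate {

    /**
     * FIND_ID_RELEVANCE_FETCH_BY_IDS, worst case: every shard has at least
     * `outerRows` hits, so each shard reads ids for `outerRows` documents.
     */
    static long oldWorstCase(long shards, long outerRows) {
        return shards * outerRows;
    }

    /**
     * FIND_RELEVANCE_FIND_IDS_LIMITED_ROWS_FETCH_BY_IDS: shard-queries 2)
     * ask for at most `outerRows` documents in total across all shards.
     */
    static long newUpperBound(long outerRows) {
        return outerRows;
    }
}
```

With 1000 shards and rows=1000, oldWorstCase gives 1,000,000 documents read 
from the store, while newUpperBound gives 1000.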



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

