[ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399933#comment-13399933 ]

Andy Laird edited comment on SOLR-2592 at 6/23/12 1:01 PM:
-----------------------------------------------------------

I've been using some code very similar to Michael's latest patch for a few 
weeks now and am liking it less and less for our use case.  As I described 
above, we are using this patch to ensure that all docs with the same value for 
a specific field end up on the same shard -- this is so that field-collapse 
counting works for distributed searches; otherwise the returned counts are 
only an upper bound.
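
For context, the kind of grouped request we run looks roughly like this SolrJ 
sketch (the field name "xyz" is a placeholder for our collapse field):
{code:java}
import org.apache.solr.client.solrj.SolrQuery;

public class GroupedQueryExample {
    public static SolrQuery buildQuery() {
        // group.ngroups asks Solr for the total group count; across shards
        // it is only exact when each group lives entirely on one shard.
        SolrQuery q = new SolrQuery("*:*");
        q.set("group", true);
        q.set("group.field", "xyz");
        q.set("group.ngroups", true);
        return q;
    }
}
{code}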

The problems we've encountered stem entirely from our need to update the 
value of the field we collapse on.  Our approach -- 
conceptually similar to the CompositeIdShardKeyParserFactory in Michael's 
latest patch -- involved creating a new schema field, indexId, that was a 
combination of what used to be our uniqueKey plus the field that we collapse on:

*Original schema*
{code:xml}
<field name="id" type="string" indexed="true" stored="true" multiValued="false" 
required="true" />
<field name="xyz" type="string" indexed="true" stored="true" 
multiValued="false" required="true" />
...
<uniqueKey>id</uniqueKey>
{code}

*Modified schema*
{code:xml}
<field name="id" type="string" indexed="true" stored="true" multiValued="false" 
required="true" />
<field name="xyz" type="string" indexed="true" stored="true" 
multiValued="false" required="true" />
<field name="indexId" type="string" indexed="true" stored="true" 
multiValued="false" required="true" />
...
<uniqueKey>indexId</uniqueKey>
{code}

During indexing we populate the extra "indexId" field with data in the form 
"id:xyz".  Our custom ShardKeyParser extracts the xyz portion of the uniqueKey 
and returns that as the hash value for shard selection.  Everything works 
great in terms of field collapse, counts, etc.
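
To make that concrete, here is a minimal sketch of the parsing logic (the real 
interface is defined in the attached patch; the class and method names below 
are illustrative only):
{code:java}
// Illustrative only -- the actual ShardKeyParser API lives in the attached
// patch.  This just shows the "hash the xyz portion" logic we rely on.
public class CollapseFieldShardKeyParser {

    // The uniqueKey arrives as "id:xyz"; hash only the xyz part so that
    // all docs sharing an xyz value map to the same shard.
    public int shardHash(String uniqueKey) {
        int sep = uniqueKey.indexOf(':');
        String xyz = (sep >= 0) ? uniqueKey.substring(sep + 1) : uniqueKey;
        return xyz.hashCode();
    }
}
{code}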

The problems begin when we need to change the value of the field xyz.  
Suppose our document starts out with these values for the 3 fields above:
{quote}
id=123
xyz=456
indexId=123:456
{quote}

We then want to change xyz to the value 789, say.  In other words, we want to 
end up...
{quote}
id=123
xyz=789
indexId=123:789
{quote}

...so that the doc lives on the same shard along with other docs that have 
xyz=789 (so that field collapse counts are correct since we use that to drive 
paging).

Before any of this we would simply pass in a new document and all would be 
good, since we weren't changing the uniqueKey.  Now, however, we need to 
delete the old document (with the old uniqueKey) or we'll end up with 
duplicates.  We don't know whether a given update changes the value of xyz, 
and we don't know what the old value of xyz was (without doing an additional 
lookup), so we must include an extra delete along with every change:

*Before*
{code:xml}
<add>
  <doc>
    <field name="id">123</field>
    <field name="xyz">789</field>
  </doc>
</add>
{code}

*Now*
{code:xml}
<delete>
  <query>id:123 AND NOT xyz:789</query>
</delete>
<add>
  <doc>
    <field name="id">123</field>
    <field name="xyz">789</field>
    <field name="clusterId">123:789</field>    <-- old clusterId was 123:456
  <doc>
</add>
{code}
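
The same paired update via SolrJ, as a hedged sketch (the URL and values are 
placeholders; assumes the Solr 4.x SolrJ API):
{code:java}
import java.io.IOException;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class UpdateWithIndexIdChange {
    public static void main(String[] args) throws SolrServerException, IOException {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");  // placeholder URL

        // Delete any stale copy stored under a different indexId
        // (i.e. a doc whose xyz value has since changed).
        server.deleteByQuery("id:123 AND NOT xyz:789");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "123");
        doc.addField("xyz", "789");
        doc.addField("indexId", "123:789");  // old indexId was 123:456
        server.add(doc);
        server.commit();
    }
}
{code}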

So in addition to the "unsavory coupling" between id and xyz, there is a 
significant performance hit with this approach (we're doing this in an NRT 
context).  The fundamental issue, of course, is that the first phase of a 
distributed search gives us only the uniqueKey value (id) and the score -- we 
really need the other field that we are using for shard ownership, too.

One idea is to have another standard schema field similar to uniqueKey that is 
used for the purposes of shard distribution:
{code:xml}
<uniqueKey>id</uniqueKey>
<shardKey>xyz</shardKey>
{code}
Then, as standard procedure, the first phase of distributed search would ask 
for uniqueKey, shardKey and score.  Perhaps the ShardKeyParser gets both the 
uniqueKey and shardKey data for maximum flexibility.  Besides solving our 
issue with field-collapse counts, this would enable date-based sharding: set 
the shardKey to a date field and do the appropriate slicing in the 
ShardKeyParser.
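
For instance, a hypothetical date-slicing parser could look like this (all 
names here are illustrative, since shardKey is only a proposal):
{code:java}
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.concurrent.TimeUnit;

// Illustrative only -- assumes the proposed shardKey value is handed to the
// parser as a string.  One shard "slice" per 30-day window.
public class DateSliceShardKeyParser {
    private static final long SLICE_MILLIS = TimeUnit.DAYS.toMillis(30);

    public int shardHash(String shardKeyValue) throws ParseException {
        long millis = new SimpleDateFormat("yyyy-MM-dd").parse(shardKeyValue).getTime();
        return (int) (millis / SLICE_MILLIS);  // docs in the same window co-locate
    }
}
{code}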

                
> Pluggable shard lookup mechanism for SolrCloud
> ----------------------------------------------
>
>                 Key: SOLR-2592
>                 URL: https://issues.apache.org/jira/browse/SOLR-2592
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>    Affects Versions: 4.0
>            Reporter: Noble Paul
>         Attachments: SOLR-2592.patch, dbq_fix.patch, 
> pluggable_sharding.patch, pluggable_sharding_V2.patch
>
>
> If the data in a cloud can be partitioned on some criteria (say range, hash, 
> attribute value, etc.), it will be easy to narrow a search down to a smaller 
> subset of shards and in effect achieve more efficient search.
