[ https://issues.apache.org/jira/browse/SOLR-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658100#comment-16658100 ]

mosh edited comment on SOLR-12638 at 10/21/18 10:35 AM:
--------------------------------------------------------

We have been testing this feature in-house, and have come across a problem regarding sharding when the document being updated is indexed inside a block and the collection has more than a single shard.
 Right now, when updating a document, an id has to be provided for the document, in addition to the field being updated.
 When the document being updated is inside a block, the update can be routed to the wrong shard: the shard holding the block was calculated from the root document's id, while the update is routed by the id of the document being updated. For example, when this document:
{code:javascript}
{"id": "1", "children": [{"id": "20", "string_s": "ex"}]}{code}
is updated with:
{code:javascript}
{"id": "20", "grand_children": {"add": [{"id": "21", "string_s": "ex"}]}}{code}
The update can be routed to a shard on which the block does not exist, causing the updated document to be indexed there and splitting our block into two pieces that live on two separate shards.
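
To make the mismatch concrete, here is a minimal sketch of why the two ids can hash into different shard ranges. It assumes the collection uses the default compositeId router, which picks a shard by murmur3-hashing the route key via org.apache.solr.common.util.Hash; the class name and the specific ids are purely illustrative:
{code:java}
import org.apache.solr.common.util.Hash;

// Illustration only: the default compositeId router derives a hash from the
// id. "1" (the root) and "20" (the child) will generally fall into different
// shard hash ranges, so an update addressed by the child id can be routed
// away from the shard that actually holds the block.
public class RoutingHashDemo {
  public static void main(String[] args) {
    String rootId = "1";
    String childId = "20";
    int rootHash = Hash.murmurhash3_x86_32(rootId, 0, rootId.length(), 0);
    int childHash = Hash.murmurhash3_x86_32(childId, 0, childId.length(), 0);
    System.out.printf("hash(%s) = %08x%n", rootId, rootHash);
    System.out.printf("hash(%s) = %08x%n", childId, childHash);
  }
}
{code}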

Skimming through DistributedUpdateProcessor, I can suggest three different solutions:
 # If the schema is nested, the routing method (in DistributedUpdateProcessor#setupRequest) can check whether the document exists in any shard (lookup by id), find out whether it is inside a block (via _root_), and route the update using the hash of _root_ (sketched below, after this list).
 # Very similar to the previous method, except the _root_ lookup is only done when the document being updated is not found in the shard the update was routed to; the other shards are then asked whether the document exists inside a block, and the update command is re-routed accordingly.
 # The user provides _root_, which is not ideal in terms of user friendliness. This approach is very similar to [Elasticsearch|https://www.elastic.co/guide/en/elasticsearch/guide/current/grandparents.html#CO285-1], which uses the *routing* parameter to route all children to the same shard.
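
To sketch the first option: the lookup-then-route idea could look roughly like the following, expressed here as a client-side helper rather than the actual change inside DistributedUpdateProcessor#setupRequest. It assumes _root_ is retrievable via real-time get (i.e. stored or in docValues), which may not hold for every schema, and routingId is a hypothetical name:
{code:java}
import java.io.IOException;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.common.SolrDocument;

public class RootAwareRouting {
  /**
   * Returns the id whose hash should pick the shard for an atomic update:
   * the document's _root_ if it lives inside a block, otherwise its own id.
   */
  static String routingId(SolrClient client, String collection, String docId)
      throws SolrServerException, IOException {
    SolrDocument doc = client.getById(collection, docId); // real-time get
    Object root = (doc == null) ? null : doc.getFieldValue("_root_");
    // Not found, or a standalone document: fall back to the doc's own id.
    return (root == null) ? docId : root.toString();
  }
}
{code}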

IMO the third option is inferior to the first two, since it is the least user friendly of the three.
 My only concern regarding the first two options is the performance hit they might cause.
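
For completeness, here is roughly what the third option would look like from a SolrJ client: the caller supplies _root_ alongside the atomic update, analogous to Elasticsearch's routing parameter. This is hypothetical, assuming the update chain would honor a client-supplied _root_ field; the URL and collection name are placeholders, and the field names follow the example above:
{code:java}
import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class RootSuppliedUpdate {
  public static void main(String[] args) throws Exception {
    try (SolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      SolrInputDocument grandChild = new SolrInputDocument();
      grandChild.setField("id", "21");
      grandChild.setField("string_s", "ex");

      SolrInputDocument update = new SolrInputDocument();
      update.setField("id", "20");
      update.setField("_root_", "1"); // caller-supplied routing key (option 3)
      update.setField("grand_children", Collections.singletonMap("add", grandChild));

      client.add("collection", update);
      client.commit("collection");
    }
  }
}
{code}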

Another concern, which David has discussed, is the implications for the update log.
 Would ensuring DistributedUpdateProcessor runs before RunUpdateProcessor be of any help?
 I must admit I am not very familiar with these parts of Solr.

WDYT [~dsmiley], [~caomanhdat]?



> Support atomic updates of nested/child documents for nested-enabled schema
> --------------------------------------------------------------------------
>
>                 Key: SOLR-12638
>                 URL: https://issues.apache.org/jira/browse/SOLR-12638
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: mosh
>            Priority: Major
>         Attachments: SOLR-12638-delete-old-block-no-commit.patch, 
> SOLR-12638-nocommit.patch
>
>          Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> I have been toying with the thought of using this transformer in conjunction 
> with NestedUpdateProcessor and AtomicUpdate to allow SOLR to completely 
> re-index the entire nested structure. This is just a thought, I am still 
> thinking about implementation details. Hopefully I will be able to post a 
> more concrete proposal soon.


