[ 
https://issues.apache.org/jira/browse/SOLR-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-4016:
----------------------------------------

    Attachment: SOLR-4016.patch

This patch expands the document before computing the signature.

I'm not convinced that this is the right solution. The DUPF fetches the updated 
document inside a synchronized (bucket) block, which this patch does not do. We 
could set the original document back (after adding the signature to it) and let 
DUPF do its thing, but that could lead to race conditions.

Perhaps we should decouple the document expansion for partial updates from 
DistributedUpdateRequestProcessor and apply it at the start of the request so 
that all UpdateRequestProcessors can work on the full document.
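
As a rough illustration of the idea above (expand first, then let later 
processors see the full document), here is a minimal Python sketch. It is not 
Solr code: expand_partial_update and compute_signature are hypothetical names, 
only the "set" and "add" modifiers with single values are handled, a truncated 
MD5 stands in for TextProfileSignature, and the synchronization question raised 
above is ignored entirely.

```python
import hashlib


def expand_partial_update(stored, update):
    """Merge an atomic-update document into the stored one (hypothetical helper)."""
    full = dict(stored)
    for field, value in update.items():
        if not isinstance(value, dict):
            # Plain value (for example the id): overwrite as in a normal add.
            full[field] = value
            continue
        for op, operand in value.items():
            if op == "set":
                if operand is None:
                    full.pop(field, None)  # set-to-null removes the field
                else:
                    full[field] = operand
            elif op == "add":
                existing = full.get(field, [])
                if not isinstance(existing, list):
                    existing = [existing]
                full[field] = existing + [operand]
            else:
                raise ValueError("unsupported modifier in this sketch: " + op)
    return full


def compute_signature(doc, fields):
    """Stand-in signature: truncated MD5 over the configured fields that are present."""
    digest = hashlib.md5()
    for name in fields:
        if name in doc:
            digest.update(str(doc[name]).encode("utf-8"))
    return digest.hexdigest()[:16]


# Assumed state: the index already holds this document.
stored = {"id": "abcde", "text": "hello world"}
# Partial update that touches a field outside the signature fields.
update = {"id": "abcde", "title": {"set": "new title"}}

full = expand_partial_update(stored, update)
full["text_hash"] = compute_signature(full, ["text"])
```

In this sketch the signature computed on the expanded document equals the one 
computed on the stored document, because "text" is carried over before hashing.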

I don't fully understand the race conditions that may occur, so I'll let 
someone more knowledgeable about this code comment before proceeding any 
further.
                
> Deduplication is broken by partial update
> -----------------------------------------
>
>                 Key: SOLR-4016
>                 URL: https://issues.apache.org/jira/browse/SOLR-4016
>             Project: Solr
>          Issue Type: Bug
>          Components: update
>    Affects Versions: 4.0
>         Environment: Tomcat6 / Catalina on Ubuntu 12.04 LTS
>            Reporter: Joel Nothman
>            Assignee: Shalin Shekhar Mangar
>              Labels: 4.0.1_Candidate
>             Fix For: 4.1, 5.0
>
>         Attachments: SOLR-4016.patch
>
>
> The SignatureUpdateProcessorFactory used (primarily?) for deduplication does 
> not consider partial update semantics.
> The examples below use the following solrconfig.xml excerpt:
> {noformat}
>      <updateRequestProcessorChain name="text_hash">
>        <processor class="solr.processor.SignatureUpdateProcessorFactory">
>          <bool name="enabled">true</bool>
>          <str name="signatureField">text_hash</str>
>          <bool name="overwriteDupes">false</bool>
>          <str name="fields">text</str>
>          <str name="signatureClass">solr.processor.TextProfileSignature</str>
>        </processor>
>        <processor class="solr.LogUpdateProcessorFactory" />
>        <processor class="solr.RunUpdateProcessorFactory" />
>      </updateRequestProcessorChain>
> {noformat}
> Firstly, the processor treats {noformat}{"set": "value"}{noformat} as a 
> string and hashes it, instead of the value alone:
> {noformat}
> $ curl "$URL/update?commit=true" -H 'Content-type:application/json' -d \
> '{"add":{"doc":{"id": "abcde", "text": {"set": "hello world"}}}}' && curl \
> "$URL/select?q=id:abcde"
> {"responseHeader":{"status":0,"QTime":30}}
> <?xml version="1.0" encoding="UTF-8"?><response><lst 
> name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst 
> name="params"><str name="q">id:abcde</str></lst></lst><result name="response" 
> numFound="1" start="0"><doc><str name="id">abcde</str><str name="text">hello 
> world</str><str name="text_hash">ad48c7ad60ac22cc</str><long 
> name="_version_">1417247434224959488</long></doc></result>
> </response>
> $
> $ curl "$URL/update?commit=true" -H 'Content-type:application/json' -d \
> '{"add":{"doc":{"id": "abcde", "text": "hello world"}}}' && curl \
> "$URL/select?q=id:abcde"
> {"responseHeader":{"status":0,"QTime":27}}
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader"><int name="status">0</int><int 
> name="QTime">1</int><lst name="params"><str 
> name="q">id:abcde</str></lst></lst><result name="response" numFound="1" 
> start="0"><doc><str name="id">abcde</str><str name="text">hello 
> world</str><str name="text_hash">b169c743d220da8d</str><long 
> name="_version_">1417248022215000064</long></doc></result>
> </response>
> {noformat}
> Note the different text_hash value.
> Secondly, when updating a field other than those used to create the signature 
> (which I imagine is a more common use-case), the signature is recalculated 
> from no values:
> {noformat}
> $ curl "$URL/update?commit=true" -H 'Content-type:application/json' -d \
> '{"add":{"doc":{"id": "abcde", "title": {"set": "new title"}}}}' && curl \
> "$URL/select?q=id:abcde"
> {"responseHeader":{"status":0,"QTime":39}}
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader"><int name="status">0</int><int 
> name="QTime">1</int><lst name="params"><str 
> name="q">id:abcde</str></lst></lst><result name="response" numFound="1" 
> start="0"><doc><str name="id">abcde</str><str name="text">hello 
> world</str><str name="text_hash">0000000000000000</str><str name="title">new 
> title</str><long name="_version_">1417248120480202752</long></doc></result>
> </response>
> {noformat}
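
The two symptoms reported above can be reproduced outside Solr with a small 
Python sketch. It rests on assumptions: naive_signature is a hypothetical 
stand-in that hashes whatever value it finds in the incoming document, and a 
truncated MD5 replaces TextProfileSignature, so the hash values will not match 
the ones in the responses above.

```python
import hashlib


def naive_signature(doc, fields):
    """Hash the raw incoming values, with no awareness of update modifiers."""
    digest = hashlib.md5()
    for name in fields:
        if name in doc:
            digest.update(str(doc[name]).encode("utf-8"))
    return digest.hexdigest()[:16]


full_add = {"id": "abcde", "text": "hello world"}
set_update = {"id": "abcde", "text": {"set": "hello world"}}
title_update = {"id": "abcde", "title": {"set": "new title"}}

# Case 1: the modifier map is stringified and hashed instead of "hello world".
sig_full = naive_signature(full_add, ["text"])
sig_set = naive_signature(set_update, ["text"])

# Case 2: the signature field is absent from the update, so nothing is hashed.
sig_title = naive_signature(title_update, ["text"])
sig_empty = naive_signature({}, ["text"])
```

Here sig_set differs from sig_full even though the text is the same, and 
sig_title equals the signature of an empty document, which mirrors the 
text_hash changes shown in the three responses above.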

