[ 
https://issues.apache.org/jira/browse/SOLR-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266272#comment-13266272
 ] 

Hoss Man commented on SOLR-2822:
--------------------------------

Yonik suggested that this issue might be a good place for me to start learning 
more about the internals of Solr Cloud.  First some background on what i 
started thinking about (archived for posterity) but have since dismissed as a 
bad idea...

{panel:title=My Initial Bad Idea That I Think Sucks| borderStyle=dashed| 
borderColor=#ccc| titleBGColor=#F7D6C1| bgColor=#F7E6D1} 

my first impression when yonik mentioned the problem was along the lines of the 
initial bug summary...

bq. perhaps by using a processor chain dedicated to sub-updates

...at first glance, this seemed like it would be an easy way to deal with the 
problem in a way that would be very easy to understand when looking at your 
chains in the solrconfig.xml, and wouldn't involve any sort of "magic" that 
might be hard for users to understand about how distributed updates work...

* some chain(s) (probably the default) would end with
DistributedUpdateProcessorFactory, which would be configured with a
"{{distrib.chain}}" init param having some value {{$foo}}
* some other chain would be named {{$foo}}, and would end with 
RunUpdateProcessorFactory
 * when DistributedUpdateProcessorFactory was triggered, it would forward the
request to all of the nodes (including itself) using the configured
{{distrib.chain}} (see the sketch below)
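
To make the shape of that concrete, here's a minimal Java sketch of how such
a factory might read its {{distrib.chain}} init param.  The class name, the
param, and the forwarding behavior are all invented for illustration of this
(dismissed) idea; none of it exists:

{code:java}
// Hypothetical sketch of the dismissed two-chain idea.  The class name and
// the "distrib.chain" init param are made up for illustration only.
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class HypotheticalDistribFactory extends UpdateRequestProcessorFactory {
  private String distribChain; // name of the $foo chain remote nodes would run

  @Override
  public void init(NamedList args) {
    distribChain = (String) args.remove("distrib.chain");
  }

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    // a real implementation would return a processor that forwards each
    // command to every node, asking the remote side to run the chain named
    // by distribChain (e.g. by sending update.chain=<distribChain>);
    // the forwarding itself is elided here
    return next;
  }
}
{code}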

But as i started looking at the code a bit more, i realized that was a 
trivially naive understanding of what's needed here -- mainly because i 
completely forgot about the fact that there are two types of forwarding:

 * some arbitrary node gets the initial request from a client, and forwards to 
the leader
 * the leader does version checking/adding on the command/docs, and forwards to 
*all* the nodes

This is when i started thinking that maybe there should be _three_ named 
chains, and started understanding Jan's comment above about choosing which 
logic should happen "on the leader node".

But the more i thought about it, the less sense it made to let users specify
that certain logic must run on the leader node.  I can think of lots of
reasons why you would care whether an UpdateProcessor runs on one node or on
N nodes (mainly the trade-off between an expensive computation you might only
want to do once, vs processors that add a lot of fields to the request, which
you might want to defer until after distribution to minimize data over the
wire), but i couldn't come up with any compelling reason why, if you have a
processor you only want to execute _once_, you would care whether it happens
on the first node randomly selected by the client or specifically on the
leader -- particularly since leader election/promotion can happen to any node
at any time, so you couldn't even use the leader logic to take advantage of
local resources on a particular machine.

But even if we go back to the "two chain" model i suggested above, because of 
the leader logic, the second chain would need to start with some 
UpdateProcessor that handled the "version" & "forward to all nodes" work 
currently handled by DistributedUpdateProcessor when "isLeader" is true.  Which 
means people editing their configs would need to not only remember to put some 
sort of "SendDistributedUpdateProcessorFactory" at the end of the chains they 
expect clients to use, but also some sort of 
"RecieveDistributedUpdateProcessorFactory" at the beginning of the chain that 
"SendDistributedUpdateProcessorFactory" would be pointing at (or maybe both are 
just instances of DistributedUpdateProcessorFactory and we make it smart enough 
to know when it's the start vs end of the chain) ... which starts sounding 
really tedious and easy to screw up -- not to mention it doesn't play very 
nicely with the idea of a hardcoded default chain users get for free if they 
have none in their config; and it thoroughly complicates the story of someone 
who starts with the example (no declared chains) in cloud mode and then wants 
to customize a little, say by adding SignatureUpdateProcessor -- they wouldn't 
be able to just configure a chain starting with SignatureUpdateProcessor, 
they'd need to add two, and be careful about how they start and end.

{panel}

i'm all for transparency and eliminating "magic" by having things spelled out 
in the configs, but even for me this idea was starting to seem like a serious 
pain in the ass.

Which leads me to the idea i do think has merit...

{panel:title=My Current Idea That I Don't Think Sucks| borderStyle=dashed| 
borderColor=#ccc| titleBGColor=#D6F7C1| bgColor=#E6F7D1} 

I think in an ideal world... 
 * Solr Cloud users should be able to just declare a straightforward chain of
processors and not *need* to worry about which nodes they run on (ie: any chain 
they used pre-cloud should still work)
 * if a user *wants* to care what parts of the chain run once vs on every node, 
it should be simple to indicate that decision by putting some "marker" 
processor (ie: DistributedUpdateProcessorFactory) in the chain denoting when 
distribution should happen.
 * If you are using cloud mode, and don't specify where the distributed update 
logic happens, then Solr should pick for you -- either "first" before any other 
processors, or "last" just before RunUpdateProcessor ... I don't have an 
opinion which is better, i'm sure other people who have been experimenting with 
Solr Cloud for a while can tell me.

(I'm usually the person opposed to doing things magically behind the scenes 
like this, but "Solr Cloud" is becoming a central enough concept in Solr, with 
many components doing special things if "zkEnabled", that I think moving 
forward it's "ok" to treat distributed updating as the norm in cloud mode and 
optimize for it.)

The most straightforward way i can think of to do this would be:
 * if Solr is in "cloud mode" then when SolrCore is initializing the chains 
from solrconfig.xml, any chain that doesn't include 
DistributedUpdateProcessorFactory should have it added automatically 
(alternatively: maybe we only add it if RunUpdateProcessor is in the chain, so 
anyone using chains w/o RunUpdateProcessor -- for some weird, bizarre purpose we
can't imagine -- won't be surprised by DistributedUpdateProcessorFactory 
getting added magically)
 * DistributedUpdateProcessor should add some param when forwarding to any 
other node indicating that the request is being distributed
 * when an update request is received, if it has this special param in it, then
any processor in the chain prior to DistributedUpdateProcessorFactory is
skipped (see the sketch below)
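
To make that concrete, here's a rough Java sketch of the two chain-side
changes.  It leans on the {{update.distrib}} param proposed further below;
everything else about the wiring is an assumption, not existing code:

{code:java}
// Rough sketch of the proposed UpdateRequestProcessorChain changes; this is
// a guess at the wiring, not actual Solr code.  "update.distrib" is the
// param proposed later in this comment.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.update.processor.DistributedUpdateProcessorFactory;
import org.apache.solr.update.processor.RunUpdateProcessorFactory;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

class ChainSketch {

  // (1) at SolrCore init, in cloud mode: inject the distributed processor
  // just before RunUpdateProcessorFactory if the config omitted it
  static UpdateRequestProcessorFactory[] maybeInject(
      UpdateRequestProcessorFactory[] factories, boolean zkEnabled) {
    if (!zkEnabled) return factories;
    for (UpdateRequestProcessorFactory f : factories) {
      if (f instanceof DistributedUpdateProcessorFactory) return factories;
    }
    List<UpdateRequestProcessorFactory> out =
        new ArrayList<UpdateRequestProcessorFactory>(Arrays.asList(factories));
    int idx = out.size(); // if no RunUpdateProcessorFactory, append at the end
    for (int i = 0; i < out.size(); i++) {
      if (out.get(i) instanceof RunUpdateProcessorFactory) { idx = i; break; }
    }
    out.add(idx, new DistributedUpdateProcessorFactory());
    return out.toArray(new UpdateRequestProcessorFactory[0]);
  }

  // (2) at request time: if this request was forwarded by another node,
  // skip every factory before the distributed one so nothing runs twice
  static UpdateRequestProcessorFactory[] skipAhead(
      UpdateRequestProcessorFactory[] factories, SolrQueryRequest req) {
    if (req.getParams().get("update.distrib") == null) return factories;
    for (int i = 0; i < factories.length; i++) {
      if (factories[i] instanceof DistributedUpdateProcessorFactory) {
        return Arrays.copyOfRange(factories, i, factories.length);
      }
    }
    return factories;
  }
}
{code}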

This idea is very similar to part of what Jan suggested in his comment above...

bq. The Distrib processor could set some state info on the request to the next 
node so that chain processing could continue where it left off. E.g. 
&update.chain.nextproc=<name/id-of-next-proc>. This would require introduction 
of named processor instances.

...but what i'm thinking of wouldn't require named processors and would be
specific to distributed updates (but wouldn't *preclude* named processors and
more enhanced logic down the road if someone wanted it).

I think this would be fairly feasible just by making some small modifications 
to DistributedUpdateProcessor (to add the new special param when forwarding) 
and UpdateRequestProcessorChain (to inject the 
DistributedUpdateProcessorFactory if cloud mode, and to skip up to the 
DistributedUpdateProcessorFactory if the param is set).  I do however still 
think we should generalize somewhat:

 * DistributedUpdateProcessorFactory should be made to implement some marker 
interface with no methods (ie: DistributedUpdateMarker)
 * UpdateRequestProcessorChain.init should scan for instances of 
DistributedUpdateMarker in the chain (instead of looking explicitly for 
DistributedUpdateProcessorFactory) when deciding whether to inject a new 
DistributedUpdateProcessorFactory into the chain
 * UpdateRequestProcessorChain.createProcessor should scan for instances of 
DistributedUpdateMarker in the chain (instead of looking explicitly for 
DistributedUpdateProcessorFactory) when "skipping ahead" if the special param 
is found in the request

...that way advanced users can implement their own distributed update
processor that implements the interface and register it explicitly in their
chain if they are so inclined, or implement a NoOp update processor with the
marker interface if they want to bypass the magic completely.
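
In code, the marker could be as small as this.  The interface name comes from
the list above; the no-op factory is a hypothetical example of a user opting
out, not a class i'm proposing we ship:

{code:java}
// Sketch of the marker-interface generalization.  DistributedUpdateMarker is
// the name proposed above; the no-op factory is a hypothetical user class.
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

// no methods: purely a tag that the chain code scans for via instanceof
// (i.e. the checks above become "factory instanceof DistributedUpdateMarker")
interface DistributedUpdateMarker {}

// a user-supplied marker that does nothing: the chain sees the marker and
// injects no DistributedUpdateProcessorFactory, so no distribution happens
class NoOpDistribUpdateProcessorFactory
    extends UpdateRequestProcessorFactory implements DistributedUpdateMarker {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return next; // pass every command straight through
  }
}
{code}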

As a possible optimization/simplification of what gets sent over the wire,
the new param that UpdateRequestProcessorChain would start looking for to
"skip ahead" in the chain could replace the existing "leader" boolean param
DistributedUpdateProcessor currently uses (aka: {{SEEN_LEADER}}) with an
enum-style param (perhaps called "{{update.distrib}}"?)...

 * {{none}} - default if unset, means no distribution has happened
 * {{toleader}} - means the request is being sent to the leader
 * {{fromleader}} - means the leader is sending the request to all nodes

UpdateRequestProcessorChain would only care if the value is not "{{none}}", in 
which case it would skip ahead to the DistributedUpdateMarker in the chain.  
DistributedUpdateProcessorFactory would care if the value is "{{toleader}}"
or "{{fromleader}}", in which case its logic would toggle in the same way it
does currently for SEEN_LEADER.
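
A sketch of what that enum-style param might look like; the enum and its
parse helper are invented names, and only the three values come from the list
above:

{code:java}
// Sketch of the proposed "update.distrib" values.  The enum and its parse
// helper are made-up names; the three values are the ones proposed above.
import java.util.Locale;

public enum DistribPhase {
  NONE,        // default if unset: no distribution has happened
  TOLEADER,    // the request is being sent to the leader
  FROMLEADER;  // the leader is sending the request to all nodes

  public static DistribPhase parse(String param) {
    return param == null ? NONE : valueOf(param.toUpperCase(Locale.ROOT));
  }
}

// usage: UpdateRequestProcessorChain skips ahead whenever the phase is not
// NONE; DistributedUpdateProcessor toggles on TOLEADER vs FROMLEADER just
// as it does for SEEN_LEADER today.
//   DistribPhase phase = DistribPhase.parse(params.get("update.distrib"));
//   if (phase != DistribPhase.NONE) { /* skip to the marker in the chain */ }
{code}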

{panel}

So that's my idea.  As i mentioned, this is really the first time i've looked 
into any of the "Solr Cloud" internals, and it's very possible i may still be 
missing something major. So if you spot any holes in this idea, please let me 
know -- otherwise i'll try to take a stab at a patch in the next few days.

                
> don't run update processors twice
> ---------------------------------
>
>                 Key: SOLR-2822
>                 URL: https://issues.apache.org/jira/browse/SOLR-2822
>             Project: Solr
>          Issue Type: Sub-task
>          Components: SolrCloud, update
>            Reporter: Yonik Seeley
>             Fix For: 4.0
>
>
> An update will first go through processors until it gets to the point where 
> it is forwarded to the leader (or forwarded to replicas if already on the 
> leader).
> We need a way to skip over the processors that were already run (perhaps by
> using a processor chain dedicated to sub-updates?)
