[ 
https://issues.apache.org/jira/browse/SOLR-5474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Potter updated SOLR-5474:
---------------------------------

    Attachment: SOLR-5474.patch

Here's a new patch that takes a slightly different approach to the previous. 
The previous did the following:
 
1 – Client application creates a request targeted to external collection "foo"
 
2 - CloudSolrServer (on the client side) doesn't know about "foo", so it 
fetches a one-time snapshot of foo's state from ZK using lazy loading. It 
caches that state and keeps track of the state version, e.g. 1
 
3 - CloudSolrServer sends the request to one of the nodes servicing “foo” based 
on the state information it retrieved from ZK. If the request is an update 
request, it will go to the leader, if it is a query, the request will go to one 
of the replicas using LBSolrServer. Every request contains the _stateVer_ 
parameter, e.g. _stateVer_=foo:1
 
4 - Server-side compares the _stateVer_ it receives in the request from the 
client with its _stateVer_ for foo and generates an INVALID_STATE error if they 
don't match. The server does have a watcher for foo’s state in each replica.
 
There are some subtle issues with this:
 
1) If a new replica is added (or recovers) in "foo", then the state of "foo" on 
the server changes and the request fails with an INVALID_STATE even though it 
probably shouldn't fail, but that's the only way now to tell the client its 
state is stale.
 
There is retry logic in the client and the retry may work, but it might not 
because there's nothing that prevents the state from changing again in between 
the client receiving the INVALID_STATE response, re-fetching state from ZK, and 
re-issuing the request. Also, failing a request when a positive state change 
occurs (e.g. adding a replica) just to invalidate cache seems problematic to 
me. In other words, the state of a collection has changed, but in a positive 
way that shouldn’t lead to a request failing. Of course with the correct amount 
of retries, the request will likely work in the end but one can envision a 
number of network round-trips between the client and server just to respond to 
potentially benign state changes.
 
2) Since the client-side is not "watching" any znodes, it runs the risk of 
trying to send a request to a node that is no longer live. Currently, the 
CloudSolrServer consults /live_nodes to make sure a node is "live" before it 
attempts to send a request to it. Without watchers, the client side has no way 
of knowing a node isn't "live" until an error occurs. So now it has to wait for 
some time for the request to timeout and then refresh /live_nodes from 
ZooKeeper.
 
3) Aliases – what happens if a collection is added to an alias? Without 
watchers, the client won’t know the alias changed. I’m sure we could implement 
a similar _stateVer_ solution for aliases but that seems less elegant than just 
watching the znode.
 
4) Queries that span multiple collections … I think problems #1 and 2 mentioned 
above just get worse when dealing with queries that span multiple collections.
 
So based on my discussions with Noble, the new patch takes the following 
approach:

1) No more LazyCloudSolrServer; just adding support for external collections in 
CloudSolrServer

2) Still watch shared znodes, such as /aliases and /live_nodes

3) State for external collections loaded on demand and cached

As it stands now, the CloudSolrServer does not watch external collections when 
running on the client side. The idea there being there may be too many external 
collections to setup watchers for. Thus, state is requested on demand and 
cached. This opens the door for the cached state to go stale, leading to an 
INVALID_STATE error.

However, this presents the need for a new public method on ZkStateReader 
(currently named refreshAllCollectionsOnClusterStateChange), which refreshes 
the internal allCollections set containing the names of internal (those in 
/clusterstate.json and external). While this approach works, it seems like an 
external object telling an internal object to fix itself, which is somewhat 
anti-OO. One improvement would be to dynamically update allCollections when a 
new external collection is discovered. Please advise.

External collections get watched on the server-side only, which gets setup by 
the ZkController. So client-side uses of CloudSolrServer will not have watchers 
setup for external collections.

The remaining issue with this patch is how to handle requests that span 
multiple external collections as the _stateVer_ parameter only supports a 
single collection at this time. A simple comma-delimited list of collection:ver 
pairs could be passed and the server could check each one. However, the test 
case for multiple collections is not passing and is commented out currently. 
Next patch will address that issue.


> Have a new mode for SolrJ to not watch any ZKNode
> -------------------------------------------------
>
>                 Key: SOLR-5474
>                 URL: https://issues.apache.org/jira/browse/SOLR-5474
>             Project: Solr
>          Issue Type: Sub-task
>          Components: SolrCloud
>            Reporter: Noble Paul
>         Attachments: SOLR-5474.patch, SOLR-5474.patch
>
>
> In this mode SolrJ would not watch any ZK node
> It fetches the state  on demand and cache the most recently used n 
> collections in memory.
> SolrJ would not listen to any ZK node. When a request comes for a collection 
> ‘xcoll’
> it would first check if such a collection exists
> If yes it first looks up the details in the local cache for that collection
> If not found in cache , it fetches the node /collections/xcoll/state.json and 
> caches the information
> Any query/update will be sent with extra query param specifying the 
> collection name , shard name, Role (Leader/Replica), and range (example 
> \_target_=xcoll:shard1:L:80000000-b332ffff) . A node would throw an error 
> (INVALID_NODE) if it does not the serve the collection/shard/Role/range combo.
> If SolrJ gets INVALID_NODE error it would invalidate the cache and fetch 
> fresh state information for that collection (and caches it again)
> If there is a connection timeout, SolrJ assumes the node is down and re-fetch 
> the state for the collection and try again



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to