[jira] [Commented] (SOLR-11487) Collection Alias metadata for time partitioned collections

Gus Heck (JIRA) Mon, 23 Oct 2017 20:04:23 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-11487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216223#comment-16216223
 ]


Gus Heck commented on SOLR-11487:
---------------------------------

Thx for the review Dave. 

I'll start in on some of the fix-ups, here's some of the whys for what I did, I 
can probably be talked out of anything here, but this is what I was thinking... 
Let me know what you think.

*Fix List*
* Will add test for removal of alias case and fix it. 
* docs: yes of course good point :) 
* names: sure sounds fine :)
* I had initially started supporting lists and then decided to axe that until 
discussion. I will move back to <String,String> from <String,Object>.. 

----

*Warnings*: this gets difficult because we don't really *have* type safety 
anymore... 

We have a Map<String,X> with two keys "collection" and "collection_metadata" 
the values for these two keys don't match. The former is Map<String,String> and 
the later is a Map<String,Map<String,String>> String and Map are not 
convertible types so one use case or the other wont compile... unless you back 
off from type safety. To achieve type safety we either need to keep two 
separate maps or we need to be serializing an actual object hierarchy rather 
than collection classes.

----

*The ZkStateReader field* is necessary because we need to get the metadata back 
to zookeeper. Based on your earlier response we won't have a collections API 
call to set metadata, so we need to have a ZkStateReader somehow or nothing 
gets written to zookeeper. In the case of cloneWithCollectionAlias, yeah that 
can be eliminated there, good catch. I can add an overload for the signature 
you suggest as well for the case where both are to be updated at the same time, 
but WRT removing ZkStateReader entirely, see comments about immutability 
below... 

----

*Immutability* is nice of course, and great for things that are immutable, or 
only held for a short duration, but once you have a long held reference and the 
underlying data is actually mutable, it gets difficult to be sure nobody is 
retaining a reference a stale copy every time a change is made... A little 
digging reveals that CoreContainer has a ZkContainer which has a ZkController, 
which has a ZkStateReader, which therefore holds onto an immutable copy of 
Aliases for a difficult to determine time frame....

The existing cloning/immutability scheme therefore worries me. It seems like it 
would make more sense for Aliases to function as a wrapper around the 
(fundamentally mutable) json in zookeeper. If we never want to know if there 
was a change after we get our initial copy, and we never give away a reference 
to the copy we got, and we never retain that copy after we make an update we 
could have immutable copies... hard to make those stick however. It might be 
that we want a snapshot at the start of a request that doesn't change for the 
duration request (I can imagine that getting funky fast), but the long held 
versions need to be mutable I think... maybe a mutable super class and an 
immutable subclass that throws UnsupportedOperation on mutators?

----

The *Collections.EMPTY_MAP* came up because I can't put anything into an empty 
map, and need to test if that's what we currently have or if we have a regular 
map that I can add stuff to. Collections.emptyMap() is not required to return 
the same instance, or any particular implementation class, so in order to test 
for it and not be subject to breakage in odd VM's or future versions (or past 
versions?) I have to use the single unique instance in Collections.EMPTY_MAP. 
I've grown slightly unsure as to whether that if block is still necessary, a 
possible hold over from early versions of this code, so I might give a shot at 
eliminating it and go back to Collections.emptyMap().

----

I *moved CRUD stuff* to ZkStateReader so I didn't have to duplicate it in 
Aliases to get metadata written back to zookeeper. Also it feels reasonable to 
have a ZK class doing the Zk CRUD rather than having that code live in a 
command class that grabs ZkClient from ZkStateReader and writes the data 
directly itself... (law of Demeter etc). This way, the command does command 
type stuff like identify the data to be written and validation to be sure we 
really do want to write the data and then hands the data off to the thing that 
knows how to work with zookeeper data so it can do the actual writing... 
Controller & service/DAO stuff. The present code seems like the 
controller/action in a web app firing up a JDBC connection directly...

----

*TimeOut*... initially I was going to copy it over verbatim since it was in 
core and core is not available in solrj (when I moved the CRUD to 
ZkStateReader) and then I realized it could be improved so I improved it. I 
think perhaps this timeout and the one in Core could be reconciled and moved to 
a commonly available location to facilitate re-use, but that seems like a 
separate endeavor not part of this ticket. The previous checkForAlias() method 
was running a tight loop with no sleep inside the loop for up to 30 seconds 
(evidently waiting for another thread to call updateAliases() to replace our 
"immutable" aliases with a copy that had the alias that we were meant to add... 
This probably soaks up an entire processor core while we wait. Since the use 
case is to block the current thread until a condition is observed, an api that 
forces you to set a sleep delay seems beneficial.

I could possibly also just write the poling loop directly without the timeout 
class... but I was just following from the code I had brought over from the 
command class in this case, perhaps a bit too blindly.

> Collection Alias metadata for time partitioned collections
> ----------------------------------------------------------
>
>                 Key: SOLR-11487
>                 URL: https://issues.apache.org/jira/browse/SOLR-11487
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: David Smiley
>         Attachments: SOLR_11487.patch, SOLR_11487.patch
>
>
> SOLR-11299 outlines an approach to using a collection Alias to refer to a 
> series of collections of a time series. We'll need to store some metadata 
> about these time series collections, such as which field of the document 
> contains the timestamp to route on.
> The current {{/aliases.json}} is a Map with a key {{collection}} which is in 
> turn a Map of alias name strings to a comma delimited list of the collections.
> _If we change the comma delimited list to be another Map to hold the existing 
> list and more stuff, older CloudSolrClient (configured to talk to ZooKeeper) 
> will break_.  Although if it's configured with an HTTP Solr URL then it would 
> not break.  There's also some read/write hassle to worry about -- we may need 
> to continue to read an aliases.json in the older format.
> Alternatively, we could add a new map entry to aliases.json, say, 
> {{collection_metadata}} keyed by alias name?
> Perhaps another very different approach is to attach metadata to the 
> configset in use?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SOLR-11487) Collection Alias metadata for time partitioned collections

Reply via email to