[ 
https://issues.apache.org/jira/browse/SOLR-13131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758853#comment-16758853
 ] 

Gus Heck commented on SOLR-13131:
---------------------------------

This feature starts from the position that you have a use case where you want 
to accept a heterogeneous stream of data and segregate it into various 
collections. If you don't have a reason to separate the data into distinct 
collections, or the data flows generating documents are separate and not easily 
merged, there would be little or no call for using a CRA.

The key benefit is that it's data driven and doesn't require human 
intervention or downtime for configuration/devops/programming/etc. to begin 
accepting a new type. This could be important if one is feeding a continuous 
stream of IoT sensor data (for example) and new sensor 
types/brands/locations/etc. may come online and be added without notice.

Automated collection creation from outside Solr based on data values in the 
documents doesn't have a smooth, easy solution that I can see. One obviously 
can't run a check for the existence of a collection for every document via 
the Collections API. That would be insanely slow. Parsing exception messages to 
know when you need to create a new collection also seems very ugly. A workable 
solution likely involves tracking Solr's list of collections separately, but 
that will have obvious concurrency pitfalls. One could possibly build indexing 
infrastructure that monitors ZooKeeper directly, similar to what Solr does, but 
that's complex and requires skill with ZooKeeper. Also, I'm not sure I like 
that idea since it turns ZooKeeper's organization and details into a public API.
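
To make the tracking approach concrete, here is a minimal client-side sketch 
in Python. It is only an illustration: CollectionTracker, list_collections and 
create_collection are hypothetical names, and the two callables stand in for 
whatever admin calls the indexing client would really make.

```python
import threading


class CollectionTracker:
    """Client-side cache of collection names (hypothetical helper)."""

    def __init__(self, list_collections, create_collection):
        # Both callables are assumptions standing in for real admin calls.
        self._known = set(list_collections())
        self._create = create_collection
        self._lock = threading.Lock()

    def ensure(self, name):
        """Create the collection once per process if it is not known yet."""
        if name in self._known:  # fast path: no round trip per document
            return
        with self._lock:
            if name not in self._known:  # re-check while holding the lock
                self._create(name)
                self._known.add(name)
```

The lock only serializes threads inside one process. Two indexer processes 
can still both decide that a collection is missing, and the cache goes stale 
if a collection is removed elsewhere, which is the kind of concurrency pitfall 
meant above.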

By way of contrast, Solr is already well positioned to know its own state, 
handle concurrency and react to document values.

Another benefit is sheer convenience and reduction of client side (indexing) 
complexity when segregating based on a field value. One doesn't have to build 
and maintain infrastructure to map categories to collections, which would 
be required when building URLs to send the data to specific collections or 
setting collections on each client. And if you're handling a mixed stream, 
you have to batch each type independently because the batches will be headed 
for different URLs or handled by separate SolrJ clients.
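
As a rough illustration of that client-side burden, the sketch below groups a 
mixed stream into one batch per target collection. The field name 
sensor_type, the collection names and the CATEGORY_TO_COLLECTION mapping are 
all made up for the example; nothing here talks to Solr.

```python
from collections import defaultdict


def batch_by_category(docs, field, collection_for):
    """Group a mixed stream of documents into one batch per collection."""
    batches = defaultdict(list)
    for doc in docs:
        batches[collection_for(doc[field])].append(doc)
    return dict(batches)


# Hypothetical mapping that the client has to build and keep current.
CATEGORY_TO_COLLECTION = {
    "thermostat": "sensors_thermostat",
    "camera": "sensors_camera",
}

docs = [
    {"id": "1", "sensor_type": "thermostat"},
    {"id": "2", "sensor_type": "camera"},
    {"id": "3", "sensor_type": "thermostat"},
]

# Each resulting batch would then go to its own URL or client.
batches = batch_by_category(
    docs, "sensor_type", CATEGORY_TO_COLLECTION.__getitem__
)
```

A document with a sensor_type that is missing from the mapping raises 
KeyError here, so a new category means a client change before any data can 
flow.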

I can also imagine CRAs greatly easing construction of systems with a 
collection per tenant pattern. The indexing infrastructure would always stamp 
the tenant's data with their customer_id, and so long as that happens you can 
be sure that Solr will route to separate collections on customer_id. The front 
end can build its queries knowing the customer id and setting the appropriate 
collection. Leaks between customers become impossible, and there is absolutely 
no need to change infrastructure to add a customer (other than adding nodes for 
capacity every N customers of course). There also would be no need to write 
code that has to run admin level commands. Admin command access could possibly 
be removed from the application entirely. Running reports across tenants 
(querying via the alias in a back end application) would "just work", again 
with no special programming. Moving big or noisy tenants to preferred hardware 
would not require software/config changes either, just admin commands or 
auto-scaling labels, and wouldn't disrupt any of the foregoing.

Much like TRAs, there are ways to do any/all of this with custom code or 
alternate infrastructure; the goal is to make it easier and more hands off.

> Category Routed Aliases
> -----------------------
>
>                 Key: SOLR-13131
>                 URL: https://issues.apache.org/jira/browse/SOLR-13131
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: master (9.0)
>            Reporter: Gus Heck
>            Assignee: Gus Heck
>            Priority: Major
>
> This ticket is to add a second type of routed alias in addition to the 
> current time routed aliases. The new type of alias will allow data driven 
> creation of collections based on the values of a field and automated 
> organization of these collections under an alias that allows the collections 
> to also be searched as a whole.
> The use case in mind at present is IoT device type segregation, but I 
> could also see this leading to the ability to direct updates to tenant 
> specific hardware (in cooperation with autoscaling). 
> This ticket also looks forward to (but does not include) the creation of a 
> Dimensionally Routed Alias which would allow organizing time routed data also 
> segregated by device.
> Further design details to be added in comments.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
