[ https://issues.apache.org/jira/browse/SOLR-11299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147868#comment-16147868 ]

David Smiley commented on SOLR-11299:
-------------------------------------

Approach: I see two conceptual approaches. One is to add a new type of 
{{DocRouter}} to Solr's existing shard-management capabilities within one 
collection (plus other work); see SOLR-9690.  The other is to manage a series 
of collections behind an alias plus some other components, as implied by 
SOLR-9562.  There is some discussion in SOLR-9562 on the merits of both 
approaches.  At first I felt strongly about shard-based partitions because 
they fit well into Solr's existing code, but I've come to realize they have 
scalability challenges, so I no longer have a preference.  _I'm going to 
proceed with collection-based partitions._

What follows are some implementation tasks (which might not necessarily map to 
JIRA issues):
* Partition name pattern: encode/decode between a collection name suffix and a 
date.  Truncate zero-aligned trailing parts to keep names short where possible, 
while ensuring they still sort chronologically.  Consumes the timezone param.  
(See the codec sketch after this list.)
* Special time-partitioned collection alias -- this is a special alias with 
additional settings.  Need C.R.U.D. support for the following settings:
    ** earliest date.  This might be computed from the 1st partition name 
instead of persisted to collection alias metadata. On creation this might be 
more conveniently set via date-math syntax like in facet.range.start.
    ** max time.  Specify via facet.range.gap syntax?  Support an implicit 
"+1" prefix if it's not there?  (See the gap sketch after this list.)
    ** max docs.  A soft maximum that will often overflow somewhat -- perhaps 
by about one minute's worth of data at the current indexing rate.
    ** timezone.  Read-only; changing it would break the interpretation of 
existing partition names.
    ** max retention age  (date math syntax)
    ** collection creation metadata: nodeset, numShards, replicationFactor, etc.
    ** time field.  Read-only.  The field in the Solr document holding the time.
    ** preemptive create.  Boolean; whether to create new partitions in 
advance or otherwise block until one is created.
* Index to a collection alias.   Work for add/delete/commit.
    ** Route to SolrCore in HttpSolrCall?
    ** Add a new URP or enhance DistributedURP with support for time 
partitions.  DURP is unwieldy; if we use DURP, try to separate most aspects 
out into a new helper class.  Could a new URP be dynamically added at chain 
construction time if the core is referenced by a time-partitioned collection 
alias?  (See the URP sketch after this list.)
* Management. Use the daemon mechanism in SOLR-11066 where appropriate.  (See 
the maintenance sketch after this list.)
    ** create new collections
        *** from maxTime exceeded  (+ pre-emptive)
        *** from maxDocs exceeded
    ** delete old partitions
    ** optimize (segment merge) old partitions
* Search optimization: minimize queried collections somehow -- see SOLR-9562
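
To make some of this more concrete, here are a few rough sketches.  All class, 
method, and parameter names below are placeholders, not final designs.

First, the name codec, assuming underscore separators and truncation of 
zero-aligned trailing parts (only the JDK is involved here):

{code:java}
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

/** Encodes/decodes a partition start time to/from a collection name suffix. */
public class PartitionNameCodec {
  private static final String TEMPLATE = "0000-01-01_00_00";
  private static final DateTimeFormatter FULL =
      DateTimeFormatter.ofPattern("yyyy-MM-dd_HH_mm");

  private final ZoneId tz; // the alias's timezone param

  public PartitionNameCodec(ZoneId tz) {
    this.tz = tz;
  }

  /** Truncates zero-aligned trailing parts; results still sort chronologically. */
  public String encode(Instant start) {
    ZonedDateTime t = start.atZone(tz);
    String full = t.format(FULL);           // e.g. 2017-08-29_00_00
    if (t.getMinute() == 0) {
      if (t.getHour() == 0) {
        if (t.getDayOfMonth() == 1) {
          if (t.getMonthValue() == 1) {
            return full.substring(0, 4);    // 2017
          }
          return full.substring(0, 7);      // 2017-08
        }
        return full.substring(0, 10);       // 2017-08-29
      }
      return full.substring(0, 13);         // 2017-08-29_13
    }
    return full;                            // 2017-08-29_13_30
  }

  /** Accepts any truncated form by padding it back out to the full pattern. */
  public Instant decode(String suffix) {
    String full = suffix + TEMPLATE.substring(suffix.length());
    return LocalDateTime.parse(full, FULL).atZone(tz).toInstant();
  }
}
{code}

For example, {{new PartitionNameCodec(ZoneId.of("UTC")).encode(Instant.parse("2017-08-01T00:00:00Z"))}} 
yields {{2017-08}}, which sorts before any daily or hourly name later in that 
month.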
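For the maxTime gap, one possibility, assuming we lean on {{DateMathParser}}'s 
existing instance API (setNow/parseMath) and normalize a missing "+1" prefix:

{code:java}
import java.text.ParseException;
import java.time.Instant;
import java.util.Date;

import org.apache.solr.util.DateMathParser;

public class GapMath {
  /** Tolerates a missing leading "+1", e.g. "DAY" or "1DAY" become "+1DAY". */
  static String normalizeGap(String gap) {
    if (gap.startsWith("+") || gap.startsWith("-")) {
      return gap;
    }
    return Character.isDigit(gap.charAt(0)) ? "+" + gap : "+1" + gap;
  }

  /** Next partition start = current partition start plus the gap. */
  static Instant addGap(Instant partitionStart, String gap) throws ParseException {
    DateMathParser p = new DateMathParser(); // alias timezone handling omitted here
    p.setNow(Date.from(partitionStart));
    return p.parseMath(normalizeGap(gap)).toInstant();
  }
}
{code}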
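On the URP question, the skeleton might look like this; the hard parts 
(partition lookup, pre-emptive creation, forwarding to the target collection) 
are only noted in comments:

{code:java}
import java.io.IOException;

import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

/** Placeholder name; would sit in the chain of cores behind the alias. */
public class TimePartitionRoutingProcessor extends UpdateRequestProcessor {
  private final String timeField; // the alias's configured time field

  public TimePartitionRoutingProcessor(String timeField, UpdateRequestProcessor next) {
    super(next);
    this.timeField = timeField;
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    Object time = doc.getFieldValue(timeField);
    if (time == null) {
      throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
          "Missing required time field: " + timeField);
    }
    // Elided: look up the newest partition whose start time is <= the doc's
    // time from the alias metadata, create it (pre-emptively or blocking) if
    // it doesn't exist yet, and forward the add to that collection.
    super.processAdd(cmd);
  }
}
{code}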
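And for the management piece, the collection-level operations already exist in 
SolrJ; the daemon mostly glues them to the alias settings.  A rough sketch, 
reusing the PartitionNameCodec placeholder (numShards/replicationFactor would 
really come from the alias's collection-creation metadata):

{code:java}
import java.io.IOException;
import java.time.Instant;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

/** Placeholder for the maintenance logic a daemon would run. */
public class PartitionMaintenance {

  /** Creates the next partition and re-points the alias at the updated list. */
  static void addPartition(SolrClient client, String alias, String configSet,
                           Instant newStart, PartitionNameCodec codec,
                           List<String> partitions)
      throws SolrServerException, IOException {
    String coll = alias + "_" + codec.encode(newStart);
    CollectionAdminRequest.createCollection(coll, configSet, 1, 2).process(client);
    partitions.add(coll);
    CollectionAdminRequest.createAlias(alias, String.join(",", partitions))
        .process(client);
  }

  /** Deletes partitions that lie entirely before the retention boundary. */
  static void deleteOldPartitions(SolrClient client, String alias,
                                  PartitionNameCodec codec,
                                  List<String> partitions, Instant keepAfter)
      throws SolrServerException, IOException {
    // A partition is safe to drop when the *next* partition's start time is
    // still before the retention boundary.
    for (int i = 0; i + 1 < partitions.size(); i++) {
      Instant nextStart =
          codec.decode(partitions.get(i + 1).substring(alias.length() + 1));
      if (nextStart.isBefore(keepAfter)) {
        CollectionAdminRequest.deleteCollection(partitions.get(i)).process(client);
        // (the alias also needs to be re-created without the dropped collection)
      }
    }
  }
}
{code}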

> Time partitioned collections (umbrella issue)
> ---------------------------------------------
>
>                 Key: SOLR-11299
>                 URL: https://issues.apache.org/jira/browse/SOLR-11299
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: David Smiley
>            Assignee: David Smiley
>
> Solr ought to have the ability to manage large-scale time-series data (think 
> logs or sensor data / IoT) itself without a lot of manual/external work.  The 
> most naive and painless approach today is to create a collection with a high 
> numShards and hash routing, but this isn't as good as partitioning the 
> underlying indexes by time, for these reasons:
> * Easy to scale up/down horizontally as data/requirements change.  (No need 
> to over-provision, use shard splitting, or re-index with different config)
> * Faster queries: 
>     ** can search fewer shards, reducing overall load
>     ** realtime search is more tractable (since most shards are stable -- 
> good caches)
>     ** "recent" shards (that might be queried more) can be allocated to 
> faster hardware
>     ** aged-out data is simply removed, not marked as deleted.  (Deleted 
> docs still have search overhead.)
> * Outages of a shard result in a degraded but often still useful system 
> (compare to a random subset of the data missing)
> Ideally you could set this up once and then simply work with a collection 
> (potentially actually an alias) in a normal way (search or update), letting 
> Solr handle the addition of new partitions, removal of old ones, and 
> appropriate routing of requests depending on their nature.
> This issue is an umbrella issue for the particular tasks that will make it 
> all happen -- either subtasks or issue linking.


