[
https://issues.apache.org/jira/browse/SOLR-11299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147868#comment-16147868
]
David Smiley commented on SOLR-11299:
-------------------------------------
Approach: Conceptually, I see two approaches. This could be done by adding a new
type of {{DocRouter}} to the existing shard management capabilities within
one collection (and other stuff!); see SOLR-9690. It could also be done by
managing a series of collections with an alias plus other components. This is
implied by SOLR-9562. There is some discussion in SOLR-9562 on the merits of
both approaches. At first I felt very strongly about shard based partitions
because it fits well into Solr's existing code but I've come to realize it's
got scalability challenges, so I no longer have any preference. _I'm going to
proceed with collection based partitions._
What follows are some implementation tasks (which might not necessarily map to
JIRA issues):
* Partition name pattern encode/decode to date. Truncate off '0' aligned parts
to keep names short where possible (just ensure they sort properly). Consumes
timezone param.
* Special time-partitioned collection alias -- this is a special alias with
additional settings. Need C.R.U.D. with various settings:
** earliest date. This might be computed from the 1st partition name
instead of persisted to collection alias metadata. On creation this might be
more conveniently set via date-math syntax like in facet.range.start.
** max time. Specify via facet.range.gap syntax? Support implicit "+1"
prefix if not there?
** max docs. A soft maximum that will often overflow somewhat -- perhaps
at the indexing rate of 1 minute worth of data.
** timezone. Read-only; otherwise would break how to interpret existing
partition names.
** max retention age (date math syntax)
** collection creation metadata: nodeset, numShards, replicationFactor, etc.
** time field. Read-only. Field in Solr doc with a time.
** preemptive create. Boolean; whether to create shards in advance or
otherwise block until created.
* Index to a collection alias. Work for add/delete/commit.
** Route to SolrCore in HttpSolrCall?
** Add a new URP or enhance DistributedURP with support for time
partitions. DURP is unwieldy; if DURP is used, try to separate most aspects
out into a new helper class. Could a new URP be dynamically added at chain
construction time if the core is referred to by a time-partitioned collection
alias?
* Management. Use daemon mechanism in SOLR-11066 where appropriate.
** create new collections
*** from maxTime exceeded (+ pre-emptive)
*** from maxDocs exceeded
** delete old partitions
** optimize (segment merge) old partitions
* Search optimization: minimize queried collections somehow -- see SOLR-9562
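
To make the first task above (partition name encode/decode) concrete, here is a minimal sketch. The class name {{PartitionNames}}, the {{<alias>_yyyy-MM-dd_HH_mm}} name format, and the minute granularity are all assumptions for illustration, not a committed design. It drops trailing zero-valued parts and relies on a truncated name being a prefix of the longer ones, so plain string order should match time order.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper; the name format is an assumption, not an existing Solr API. */
final class PartitionNames {
  // Full-precision stamp; literal underscores separate date, hour and minute.
  private static final DateTimeFormatter FULL =
      DateTimeFormatter.ofPattern("uuuu-MM-dd_HH_mm");

  /** Encodes a partition start time, interpreted in the alias's (read-only) timezone. */
  static String encode(String alias, Instant start, ZoneId tz) {
    // Seconds and finer are dropped: this sketch assumes minute-aligned partitions.
    String stamp = FULL.format(LocalDateTime.ofInstant(start, tz));
    // Truncate zero-aligned trailing parts: first minutes, then hours.
    while (stamp.endsWith("_00")) {
      stamp = stamp.substring(0, stamp.length() - 3);
    }
    return alias + "_" + stamp;
  }

  /** Decodes a name produced by encode() back to the partition start time. */
  static Instant decode(String alias, String name, ZoneId tz) {
    String stamp = name.substring(alias.length() + 1);
    // Restore whatever zero parts were truncated, then parse at full precision.
    String full = (stamp + "_00_00").substring(0, 16);
    return LocalDateTime.parse(full, FULL).atZone(tz).toInstant();
  }
}
```

With this format, a partition starting at local midnight gets a day-only name, and one starting mid-day gets an hour or minute suffix. The timezone has to stay fixed for the life of the alias, otherwise existing names would decode to different instants.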
> Time partitioned collections (umbrella issue)
> ---------------------------------------------
>
> Key: SOLR-11299
> URL: https://issues.apache.org/jira/browse/SOLR-11299
> Project: Solr
> Issue Type: New Feature
> Security Level: Public(Default Security Level. Issues are Public)
> Components: SolrCloud
> Reporter: David Smiley
> Assignee: David Smiley
>
> Solr ought to have the ability to manage large-scale time-series data (think
> logs or sensor data / IOT) itself without a lot of manual/external work. The
> most naive and painless approach today is to create a collection with a high
> numShards with hash routing but this isn't as good as partitioning the
> underlying indexes by time for these reasons:
> * Easy to scale up/down horizontally as data/requirements change. (No need
> to over-provision, use shard splitting, or re-index with different config)
> * Faster queries:
> ** can search fewer shards, reducing overall load
> ** realtime search is more tractable (since most shards are stable --
> good caches)
> ** "recent" shards (that might be queried more) can be allocated to
> faster hardware
> ** aged out data is simply removed, not marked as deleted. Deleted docs
> still have search overhead.
> * Outage of a shard results in a degraded but sometimes still useful system
> (compare to a random subset missing)
> Ideally you could set this up once and then simply work with a collection
> (potentially actually an alias) in a normal way (search or update), letting
> Solr handle the addition of new partitions, removing of old ones, and
> appropriate routing of requests depending on their nature.
> This issue is an umbrella issue for the particular tasks that will make it
> all happen -- either subtasks or issue linking.
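
As a rough illustration of how Solr might decide when to add a new partition and when to remove old ones, here is a small sketch. {{PartitionPolicy}} and its fields are hypothetical names, not an existing Solr API, and it uses a fixed {{Duration}} where a real implementation would more likely use calendar-aware date math.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical policy object for illustration; not an existing Solr class. */
final class PartitionPolicy {
  final Duration maxTime;      // longest time span one partition may cover
  final long maxDocs;          // soft document cap per partition
  final Duration maxRetention; // data older than this may be dropped

  PartitionPolicy(Duration maxTime, long maxDocs, Duration maxRetention) {
    this.maxTime = maxTime;
    this.maxDocs = maxDocs;
    this.maxRetention = maxRetention;
  }

  /** True if the newest partition has run out of time span or reached the doc cap. */
  boolean needsNewPartition(Instant newestStart, long newestDocCount, Instant now) {
    boolean timeExceeded = !now.isBefore(newestStart.plus(maxTime));
    boolean docsExceeded = newestDocCount >= maxDocs;
    return timeExceeded || docsExceeded;
  }

  /**
   * Given partition start times sorted ascending, returns those whose whole
   * time range lies at or before the retention cutoff. A partition is assumed
   * to end where the next one starts, so the newest is never returned.
   */
  List<Instant> expired(List<Instant> sortedStarts, Instant now) {
    Instant cutoff = now.minus(maxRetention);
    List<Instant> result = new ArrayList<>();
    for (int i = 0; i + 1 < sortedStarts.size(); i++) {
      Instant end = sortedStarts.get(i + 1);
      if (!end.isAfter(cutoff)) {
        result.add(sortedStarts.get(i));
      }
    }
    return result;
  }
}
```

Dropping a whole expired partition is what removes aged-out data outright instead of leaving deleted documents behind in the index.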