[
https://issues.apache.org/jira/browse/SOLR-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Smiley reopened SOLR-9562:
--------------------------------
tl;dr: I'm re-opening for further discussion on the merits of collection based
time series. I'm leaning towards this solution now.
I did look at the patch closer. It solves one aspect of time series, namely
what the title says: "minimize queried collections for time series alias" (an
optimization); and that's it. That is okay as a first step, I guess. It's still
on the client to route writes to the appropriate collection, to eventually
delete the oldest collections, and to create new collections (and perhaps add
them to the alias while it's at it). So yeah, most of these things are for
other issues :-)
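To make the write-routing burden concrete, here is a rough sketch of what a client currently has to do per document (a hypothetical Python helper; the strftime pattern "%Y%m%d" stands in for the patch's yyyyMMdd dateTimeFormat, and the collection naming follows the patch's `<alias>_<suffix>` convention):

```python
from datetime import datetime, timezone

def target_collection(alias, doc_time, date_time_format="%Y%m%d"):
    """Pick the collection a document belongs to, based on its timestamp.

    Mirrors the suffix convention from the patch: each collection is named
    <alias>_<suffix>, where the suffix is the doc's timestamp rendered in
    dateTimeFormat.
    """
    return alias + "_" + doc_time.strftime(date_time_format)

# A log event from 2016-09-27 routes to col_20160927.
event_time = datetime(2016, 9, 27, 14, 30, tzinfo=timezone.utc)
print(target_collection("col", event_time))  # col_20160927
```

Built-in support would mean Solr doing this lookup (plus creating the collection when the suffix rolls over) instead of every client reimplementing it.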
Elsewhere, [~janhoy] said:
bq. Question: Is shard-level the best abstraction here or could time-based use
cases just as well be solved on the collection level? Create a write-alias
pointing to the newest collection, and read aliases pointing to all or some
other subset of collections. In this setup, newer collections could have larger
replicationFactor to support more queries. And you could reduce #shards for
older collections, merge collections and define the oldest collections as
"archive" which are loaded lazily on demand only etc... People do this already
and one could imagine built-in support for all the collection creation and
alias housekeeping.
Sounds good but I have some questions about some of the potential bonus
features you mentioned. For example, what is the methodology for reducing the
numShards of a collection while keeping the overall data set searchable, with
no oddities like temporarily searching/counting copies of the same document?
And likewise for merging collections?
A disadvantage to the shard/DocRouter approach (SOLR-9690) is that the
numShards & replicationFactor, and node placement rules etc. are fixed and
governed at the collection level, not per-shard. But if shard==time-slice then
there's a good chance we want to make different choices for the most recent
shard. And there are still scalability issues (Overseer related) with very
large numbers of shards that are not present if done at the collection level
(will be solved eventually I'm sure). I think for this feature we ought to
cater to very high scalability in diverse use cases, and that probably means
collection based time slices.
Maybe there's room for a hybrid where shard based time series is used for all
data, but it is augmented by an additional "realtime" collection (optional
feature) for the most recent data that can of course have its own
configuration catering to both realtime search and high write volume. Then we
devise a way to move the data from the RT collection to the archive collection.
Perhaps a big optimize and then copy the segment(s) across the RT shards over
to one new index involving the MERGEINDEXES admin command. I spoke about this
hybrid as a small piece of a larger talk at last year's L/S Revolution but
didn't ultimately have time to implement this strategy. I did at least get to
much of the shard based time series portion.
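The RT-to-archive move could lean on the CoreAdmin MERGEINDEXES command. As a sketch (hypothetical Python and made-up core names; this only builds the request URL and ignores the commit/read-only coordination a real move would need):

```python
from urllib.parse import urlencode

def merge_indexes_url(solr_base, target_core, src_cores):
    """Build a CoreAdmin MERGEINDEXES request that would fold the
    realtime collection's cores into an archive core.

    MERGEINDEXES merges the indexes of the repeated srcCore parameters
    into the core named by `core`.
    """
    params = [("action", "MERGEINDEXES"), ("core", target_core)]
    params += [("srcCore", c) for c in src_cores]
    return solr_base + "/admin/cores?" + urlencode(params)

# Merge both RT shards into one new archive core (names are illustrative).
print(merge_indexes_url(
    "http://localhost:8983/solr",
    "archive_20160927_shard1_replica1",
    ["rt_shard1_replica1", "rt_shard2_replica1"],
))
```

An optimize on the RT cores first (as suggested above) would keep the segment count of the merged archive index down.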
[~ehatcher] I wonder if LucidWorks might be interested in open-sourcing
Fusion's time series capability, assuming it's in a suitable shape to be
donated (e.g. written in Java, etc.)? I've seen it but not tried it; I don't
have insight into the particulars of its approach. Regardless, I've set aside
my time to improve Solr to help get something committed so that Solr has this
capability (be it collection based time slices or shard based time slices).
> Minimize queried collections for time series alias
> --------------------------------------------------
>
> Key: SOLR-9562
> URL: https://issues.apache.org/jira/browse/SOLR-9562
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Eungsop Yoo
> Priority: Minor
> Attachments: SOLR-9562.patch, SOLR-9562-v2.patch
>
>
> For indexing time series data (such as large log data), we can create a new
> collection regularly (hourly, daily, etc.) with a write alias, and create a
> read alias for all of those collections. But all of the collections of the
> read alias are queried even if we search over a very narrow time window. In
> this case, the matching docs may be stored in only a small portion of the
> collections, so we don't need to query all of them.
> This patch minimizes the collections queried through a read alias. Three
> parameters are added to the CREATEALIAS action.
> || Key || Type || Required || Default || Description ||
> | timeField | string | No | | The time field name for the time series data. It should be a date field. |
> | dateTimeFormat | string | No | | The timestamp format used for collection creation. Every collection should have a suffix (starting with "_") in this format. Ex. dateTimeFormat: yyyyMMdd, collectionName: col_20160927. See [DateTimeFormatter|https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html]. |
> | timeZone | string | No | | The time zone for the dateTimeFormat parameter. Ex. GMT+9. See [DateTimeFormatter|https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html]. |
> Then, when we query with a filter query like "timeField:\[fromTime TO
> toTime\]", only the collections that can contain docs in the given time range
> will be queried.
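The collection-selection behavior described above could be sketched like this (hypothetical Python, assuming the `_yyyyMMdd` suffix convention; each collection's time slice runs from its parsed suffix up to the next collection's suffix):

```python
from datetime import datetime

def collections_for_range(collections, from_time, to_time,
                          date_time_format="%Y%m%d"):
    """Return only the collections whose time slice overlaps
    [from_time, to_time].

    Each collection name ends in _<suffix> per the patch's convention
    (e.g. col_20160927). A collection's slice starts at its parsed suffix
    and ends just before the next collection's suffix.
    """
    # Sort by parsed suffix so each slice's end is the next slice's start.
    parsed = sorted(
        (datetime.strptime(name.rsplit("_", 1)[1], date_time_format), name)
        for name in collections
    )
    kept = []
    for i, (start, name) in enumerate(parsed):
        end = parsed[i + 1][0] if i + 1 < len(parsed) else datetime.max
        if start <= to_time and end > from_time:
            kept.append(name)
    return kept

cols = ["col_20160925", "col_20160926", "col_20160927", "col_20160928"]
print(collections_for_range(
    cols,
    datetime(2016, 9, 26, 6, 0),
    datetime(2016, 9, 27, 3, 0),
))  # ['col_20160926', 'col_20160927']
```

A range touching parts of two days queries two collections instead of all four, which is exactly the optimization the issue title describes.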
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]