[ 
https://issues.apache.org/jira/browse/SOLR-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16205860#comment-16205860
 ] 

Radu Gheorghe commented on SOLR-9562:
-------------------------------------

My two cents:
* if data is relatively low-velocity, merging shards of an existing collection 
that's already "done" (e.g. yesterday's collection) by way of a pure merge 
should help with scaling the cluster. Here's how Elasticsearch does it: 
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-shrink-index.html
* if data is high-velocity, one will likely have to live with the trade-off 
between many collections (i.e. rotating them more frequently, which makes 
writes faster because of less merging, and reads faster because collections 
are "done" sooner => better caching for those "wrapped up" collections) and 
fewer collections (which implies fewer shards). I'm saying this because the 
benefits of merging shards may not be worth the overhead.

That said, loading/unloading shards might help reduce the overhead of many 
shards, assuming that old data is rarely touched. I'm probably getting way 
ahead of myself here, but a read alias that would automatically load shards 
(that would be closed from a cronjob looking at activity) would be pretty 
awesome (especially if we think about them in the context of AutoScaling and 
shared file systems).

> Minimize queried collections for time series alias
> --------------------------------------------------
>
>                 Key: SOLR-9562
>                 URL: https://issues.apache.org/jira/browse/SOLR-9562
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Eungsop Yoo
>            Priority: Minor
>         Attachments: SOLR-9562-v2.patch, SOLR-9562.patch
>
>
> For indexing time series data (such as large log data), we can create a new 
> collection regularly (hourly, daily, etc.) with a write alias and create a 
> read alias for all of those collections. But all of the collections of the 
> read alias are queried even if we search over a very narrow time window. In 
> this case, the docs to be queried may be stored in a very small portion of 
> the collections, so we don't need to query all of them.
> I suggest this patch for the read alias to minimize queried collections. 
> Three parameters are added to the CREATEALIAS action.
> || Key || Type || Required || Default || Description ||
> | timeField | string | No | | The time field name for time series data. It should be a date type. |
> | dateTimeFormat | string | No | | The timestamp format used in collection names. Every collection name should have a suffix (starting with "_") in this format. Ex. dateTimeFormat: yyyyMMdd, collectionName: col_20160927. See [DateTimeFormatter|https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html]. |
> | timeZone | string | No | | The time zone for the dateTimeFormat parameter. Ex. GMT+9. See [DateTimeFormatter|https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html]. |
> Then, when we query with a filter query like "timeField:\[fromTime TO 
> toTime\]", only the collections that have docs for the given time range will 
> be queried.
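
To make the selection step in the description concrete, here is a rough sketch of how the collection list might be narrowed. It is not taken from the attached patches: the class name TimeSeriesAliasFilter, the select method, and the assumption of daily collections (so each suffix covers exactly one day) are all hypothetical. Collections whose suffix cannot be parsed are kept, on the assumption that querying too much is safer than missing docs.

```java
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Solr or of the SOLR-9562 patches.
class TimeSeriesAliasFilter {

    // Keeps the collections whose day overlaps the range [from, to].
    // Assumes daily collections named <prefix>_<suffix>, where the suffix
    // is a date formatted with dateTimeFormat (e.g. yyyyMMdd).
    static List<String> select(List<String> collections, String dateTimeFormat,
                               String timeZone, ZonedDateTime from, ZonedDateTime to) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern(dateTimeFormat);
        ZoneId zone = ZoneId.of(timeZone);
        List<String> selected = new ArrayList<>();
        for (String name : collections) {
            int idx = name.lastIndexOf('_');
            if (idx < 0) {
                selected.add(name); // no suffix: keep it to be safe
                continue;
            }
            try {
                LocalDate day = LocalDate.parse(name.substring(idx + 1), fmt);
                ZonedDateTime start = day.atStartOfDay(zone);
                ZonedDateTime end = start.plusDays(1); // exclusive end of the day
                // Overlap test: the collection starts no later than "to"
                // and ends after "from".
                if (!start.isAfter(to) && end.isAfter(from)) {
                    selected.add(name);
                }
            } catch (DateTimeParseException e) {
                selected.add(name); // unparseable suffix: keep it to be safe
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> cols = Arrays.asList("col_20160926", "col_20160927", "col_20160928");
        ZonedDateTime from = ZonedDateTime.parse("2016-09-27T10:00:00+09:00");
        ZonedDateTime to = ZonedDateTime.parse("2016-09-27T12:00:00+09:00");
        // Prints [col_20160927]
        System.out.println(select(cols, "yyyyMMdd", "GMT+9", from, to));
    }
}
```

For hourly or other rotation periods, the one-day window would have to be derived from the format or passed in separately; the sketch deliberately ignores that.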



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
