[jira] [Commented] (SOLR-11653) create next time collection based on a fixed time gap

2018-01-11 Thread Gus Heck (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16323486#comment-16323486
 ] 

Gus Heck commented on SOLR-11653:
-

Just thought of something I missed in my review:

It looks like we would be creating collection names that look like this in the 
case of a +1HOUR interval:

alias_2014-01-14_22
alias_2014-01-14_23
alias_2014-01-15
alias_2014-01-15_01

or, for a +30MINUTE interval:

alias_2014-01-14_22
alias_2014-01-14_22_30
alias_2014-01-14_23
alias_2014-01-14_23_30
alias_2014-01-15
alias_2014-01-15_01

I think that's not very nice, since the lengths are inconsistent and the names 
would be hard to parse or generate for users who won't have our fancy formatter 
definition on hand. Maybe we should be creating names like these:

alias_2014-01-14_22
alias_2014-01-14_23
alias_2014-01-15_00
alias_2014-01-15_01

and

alias_2014-01-14_22_00
alias_2014-01-14_22_30
alias_2014-01-14_23_00
alias_2014-01-14_23_30
alias_2014-01-15_00_00
alias_2014-01-15_01_00

To do that we probably have to analyze the interval and decide what the 
smallest unit in the date math is and then record that format (or a value that 
maps to the right format) in metadata.

If someone specifies +60MINUTES however I'd say they just get the extra _00 on 
everything vs +1HOUR ...that actually could be viewed as a feature.
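
A minimal sketch of the fixed-width naming idea above, assuming the interval is 
a simple date-math string such as +1HOUR or +30MINUTE and that names are 
rendered in UTC. The class and method names (FixedWidthCollectionNames, 
patternFor, nameFor) are hypothetical and not taken from the patch, and the 
sketch recomputes the format on each call rather than recording it in alias 
metadata.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: derive a fixed-width name format from the interval. */
final class FixedWidthCollectionNames {

    // Captures the unit of each date-math term, e.g. "HOUR" in "+1HOUR".
    private static final Pattern UNIT = Pattern.compile("\\d+([A-Z]+)");

    /** Returns a date pattern whose precision matches the smallest unit in the interval. */
    static String patternFor(String interval) {
        int finest = 0; // 0 = day or coarser, 1 = hour, 2 = minute, 3 = second
        Matcher m = UNIT.matcher(interval);
        while (m.find()) {
            String unit = m.group(1);
            if (unit.endsWith("S")) {
                unit = unit.substring(0, unit.length() - 1); // MINUTES -> MINUTE
            }
            switch (unit) {
                case "HOUR":   finest = Math.max(finest, 1); break;
                case "MINUTE": finest = Math.max(finest, 2); break;
                case "SECOND": finest = Math.max(finest, 3); break;
                default: break; // DAY, MONTH, YEAR and anything else: date only
            }
        }
        switch (finest) {
            case 1:  return "yyyy-MM-dd_HH";
            case 2:  return "yyyy-MM-dd_HH_mm";
            case 3:  return "yyyy-MM-dd_HH_mm_ss";
            default: return "yyyy-MM-dd";
        }
    }

    /** Builds alias_<timestamp> with every field down to the smallest unit always present. */
    static String nameFor(String alias, Instant start, String interval) {
        DateTimeFormatter fmt = DateTimeFormatter
                .ofPattern(patternFor(interval), Locale.ROOT)
                .withZone(ZoneOffset.UTC);
        return alias + "_" + fmt.format(start);
    }

    public static void main(String[] args) {
        Instant midnight = Instant.parse("2014-01-15T00:00:00Z");
        System.out.println(nameFor("alias", midnight, "+1HOUR"));    // alias_2014-01-15_00
        System.out.println(nameFor("alias", midnight, "+30MINUTE")); // alias_2014-01-15_00_00
    }
}
```

With this approach +60MINUTES yields the minute-level format (the extra _00) 
while +1HOUR yields the hour-level one, matching the behavior described above.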

> create next time collection based on a fixed time gap
> -
>
> Key: SOLR-11653
> URL: https://issues.apache.org/jira/browse/SOLR-11653
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: David Smiley
>Assignee: David Smiley
> Fix For: 7.3
>
> Attachments: SOLR-11653.patch, SOLR-11653.patch, SOLR-11653.patch
>
>
> For time series collections (as part of a collection Alias with certain 
> metadata), we want to automatically add new collections. In this issue, this 
> is about creating the next collection based on a configurable fixed time gap. 
>  And we will also add this collection synchronously once a document flowing 
> through the URP chain exceeds the gap, as opposed to asynchronously in 
> advance.  There will be some Alias metadata to define in this issue.  The 
> preponderance of the implementation will be in TimePartitionedUpdateProcessor 
> or perhaps a helper to this URP.
> note: other issues will implement pre-emptive creation and capping 
> collections by size.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11653) create next time collection based on a fixed time gap

2018-01-05 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313669#comment-16313669
 ] 

ASF subversion and git services commented on SOLR-11653:


Commit c59db0c33778bac7430aa4c2dfd0eb39ef60e205 in lucene-solr's branch 
refs/heads/branch_7x from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c59db0c ]

SOLR-11653: TimeRoutedAlias URP now auto-creates collections using new 
RoutedAliasCreateCollectionCmd

(cherry picked from commit 925733d)








[jira] [Commented] (SOLR-11653) create next time collection based on a fixed time gap

2018-01-05 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313666#comment-16313666
 ] 

ASF subversion and git services commented on SOLR-11653:


Commit 925733d1ef3ac6fbabc450804511c65a4c6424ac in lucene-solr's branch 
refs/heads/master from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=925733d ]

SOLR-11653: TimeRoutedAlias URP now auto-creates collections using new 
RoutedAliasCreateCollectionCmd








[jira] [Commented] (SOLR-11653) create next time collection based on a fixed time gap

2018-01-02 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308618#comment-16308618
 ] 

David Smiley commented on SOLR-11653:
-

* I don't believe this patch exposes ROUTEDALIAS_CREATECOLL through v1 or v2; 
it takes internal code to invoke it.  Notice there is no reference to it in 
CollectionsHandler.  Eventually I do think it will be a useful command, but I 
don't want to lengthen this issue with documenting it, ensuring v1 & v2 
support, and thinking about its API, which might need work.  The first patch 
iteration exposed it, but the 2nd patch removed it from CollectionsHandler for 
the above reasons.
* RE Why the "extra layer":  Very good question; I should add some explanatory 
docs. I think you are wondering why RoutedAliasCreateCollectionCmd exists 
as such when our URP could do the same actions. In my work for the Harvard BOP 
project, I approached it that way in fact.  The reason is that by adding an 
Overseer command, I can get code to operate in a mutex/lock by the alias name, 
thus ensuring that the choice of the next collection name, its creation, and 
its addition to the alias happen atomically.  This isn't critical at the moment 
because the next collection name is deterministic, and thus could be handled at 
the URP with retries.  But eventually we'd like to have it be more dynamic, like 
when a size threshold is reached, or simply because the user wants to (calls an 
API to make it happen on-demand).  Without a lock, I think it's impossible to 
support that.
** It does seem to be a shame that I need to create an Overseer command just to 
get a cluster lock on the alias name... not that it's *that* big a deal. I 
suppose we could use ZooKeeper directly (or, probably better, Curator), but 
unless other parts of Solr are doing this (I don't think so?), I don't want 
time routed aliases to be the first to break the mold.
** BTW I think it's silly that all the alias operations are Overseer commands 
since they merely do atomic operations against ZooKeeper (that compare the 
version) so what's the point?
* RE "+1SECOND" sure that's perhaps not realistic but I'm not sure we want to 
insist you can't do it.  We already round away unnecessary _00 suffixes of 
seconds, minutes, and hours.
* RE create collection loop: What is not clear in the patch is that 
parsedCollectionAliases is going to be updated with every new collection (since 
it gets prepended to the alias).  I want to improve the clarity of the logic to 
instead have it examine the head collection name to see that it's different.  
And maybe we don't need 5 retries; maybe none or make it configurable?
* Yes, in SOLR-11722 please add maxFutureMs.  But I don't think that issue 
should create more than the initial collection.
* In a couple of cases you've mentioned creating the next collection in advance 
of it being needed.  Yes, absolutely; LucidWorks' Fusion appropriately calls 
this "preemptive" creation, BTW. But I want to make that a separate feature we 
can work on later; the issues open now have enough to do without worrying about 
that :-)
* Ah, I really like your suggestion of "most recent" naming... thus I'll do 
some renames even if it's more wordy.
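
The locking argument in the "extra layer" point above can be illustrated with 
an in-process analogue. This is only a sketch under loud assumptions: it uses a 
JVM-local ReentrantLock per alias name, whereas the Overseer provides a 
cluster-wide lock, and the class AliasLockSketch and its methods are 
hypothetical names that do not appear in the patch. It shows why choosing the 
next name and prepending it to the alias need to happen under one lock once the 
next name is no longer deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

/** Hypothetical, JVM-local stand-in for "lock by alias name". */
final class AliasLockSketch {
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();
    private final Map<String, List<String>> aliases = new ConcurrentHashMap<>();

    private ReentrantLock lockFor(String alias) {
        return locks.computeIfAbsent(alias, k -> new ReentrantLock());
    }

    /**
     * Chooses the next collection name from the current head (most recent)
     * collection and prepends it to the alias, all while holding the alias lock.
     * nextName receives null when the alias has no collections yet.
     */
    String createNext(String alias, Function<String, String> nextName) {
        ReentrantLock lock = lockFor(alias);
        lock.lock();
        try {
            List<String> collections = aliases.computeIfAbsent(alias, k -> new ArrayList<>());
            String head = collections.isEmpty() ? null : collections.get(0);
            String next = nextName.apply(head); // the choice depends on the head we just read
            collections.add(0, next);           // the new collection becomes the head
            return next;
        } finally {
            lock.unlock();
        }
    }

    /** Returns a snapshot of the alias's collections, most recent first. */
    List<String> collections(String alias) {
        ReentrantLock lock = lockFor(alias);
        lock.lock();
        try {
            return new ArrayList<>(aliases.getOrDefault(alias, new ArrayList<>()));
        } finally {
            lock.unlock();
        }
    }
}
```

Two callers racing on the same alias are serialized, so the second one always 
sees the head that the first one created; callers on different aliases do not 
block each other.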







[jira] [Commented] (SOLR-11653) create next time collection based on a fixed time gap

2018-01-01 Thread Gus Heck (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307584#comment-16307584
 ] 

Gus Heck commented on SOLR-11653:
-

Some thoughts: 

It seems that you've added ROUTEDALIAS_CREATECOLL as a user-accessible command 
(the v1 API will respond to it, I think) but it is not reflected in the v2 API 
JSON files. I think this is because this command is probably not meant for API 
invocation in the first place, so it kind of looks out of place as an 
undocumented API. I kind of wonder why it was done as an admin command. I'm 
probably missing something, but it seems like this means we have:
# doc arrives for which it is appropriate to create a new collection.
# issue admin command ROUTEDALIAS_CREATECOLL and wait for it
# inside ROUTEDALIAS_CREATECOLL issue 
CollectionsHandler.CollectionOperation.CREATE_OP.execute(... and wait for that
# finally process the update

I'm not sure why we want the extra layer. Is there actually a use case for 
manual creation of the next partition? To me it seems as if this operation is 
internal to TimeRoutedAliasUpdateProcessor and should be there. I can possibly 
imagine that if we were proactively keeping ahead of things by one collection 
this structure could allow fast processing of the update by giving it an async 
id, but it looks like the code is set up to add a collection the first time it 
gets a document that requires that collection (up to maxFutureMs), and delay 
the update until that succeeds.

This makes me wonder if we should even be supporting +1SECOND. If this command 
is not sub-second, we fall behind... That is probably also a really bad idea 
because of the sheer number of collections. As a side benefit, we could 
shorten our collection names too...

I think your loop for creating collections should depend on the value of 
maxFutureMs. Let's say now() is 01:22, the head presently ends in _01_00 and 
accepts 01:00 hrs to 02:00 hrs, maxFutureMs=72*60*60*1000, and we get a 
document for 60 hours in the future. It looks like we loop up to 5 times and 
successfully create the _02_00, _03_00, _04_00, _05_00, _06_00 partitions and 
then error out unless we get an update to our parsedCollectionAliases in the 
meantime. If any 5 creations are faster than the updates to 
parsedCollectionAliases, we wrongly error out. The error message will be 
misleading too, since the attempts to create the collections all succeeded but 
we quit due to some sort of communications lag. If parsedCollectionAliases 
gets updated in time we will reset our counter and keep going, but why not 
directly calculate the appropriate minimum number of attempts as 
(DocTimestampMs - HeadTimestampMs) divided by the length of one interval, 
i.e. (HeadTimestampMs + one interval) - HeadTimestampMs?  A constant can be 
added if we want some slack for retrying failed calls.
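
The calculation suggested above might look like the following sketch. It 
assumes the interval has a fixed length in milliseconds (true for +1HOUR, not 
for calendar units such as +1MONTH) and that the head collection accepts 
timestamps in [head, head + interval). The class and method names are 
hypothetical, chosen for illustration only.

```java
/** Hypothetical helper: how many collections must be created to reach a document. */
final class CollectionsNeeded {

    /**
     * Returns the minimum number of new collections required so that a document
     * with the given timestamp falls inside the newest one. Returns 0 when the
     * document already fits in the head collection (or is older than it).
     */
    static long collectionsToCreate(long docTimestampMs, long headTimestampMs, long intervalMs) {
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("intervalMs must be positive");
        }
        if (docTimestampMs < headTimestampMs + intervalMs) {
            return 0; // the head collection (or an earlier one) covers this document
        }
        // Each whole interval past the head's start needs one more collection.
        return Math.floorDiv(docTimestampMs - headTimestampMs, intervalMs);
    }

    public static void main(String[] args) {
        long hour = 60L * 60 * 1000;
        long minute = 60L * 1000;
        long head = hour;                         // head covers 01:00 to 02:00
        long doc = hour + 22 * minute + 60 * hour; // now() = 01:22, plus 60 hours
        System.out.println(collectionsToCreate(doc, head, hour));
    }
}
```

For the scenario described above (head at 01:00, a document 60 hours after 
01:22, one-hour interval) this gives 60 creations, well beyond a fixed limit 
of 5 retries.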

Perhaps my work in SOLR-11722 should be filling out all prospective next 
collections up to maxFutureMs (which I need to add to metadata)? Then in this 
code, rather than testing the current document's timestamp to see if we create 
a new collection, test whether or not the next increment falls within 
maxFutureMs. In the case that it's time to create a new collection, we can then 
check the document value and create the next one asynchronously unless the 
current document would fall in the yet-to-be-created collection (which should 
normally be a rare case). That would go from one guaranteed slow update every 
hour (for +1HOUR routed aliases) to only having a slow update if a document at 
the very beginning of an hour happens to fall *just* inside maxFutureMs. We 
would need to track that the collection was in progress for creation, of 
course, to avoid spamming the overseer each hour.

It seems easier and more robust to never need to create more than one 
collection during an update. In that case your present loop is just fine.

In the code, you have variable names and comments talking about "head", but I 
had to dig a little to confirm that this was actually "the most recent", not 
"the first". It would be nice if the comments made this clear.

I agree with your desire to rename handleResponse() ... confused me at first.
