[ 
https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394141#comment-14394141
 ] 

Florian Verhein commented on SPARK-6664:
----------------------------------------

Thanks [~sowen]. I disagree :-) 

If you think there's non-stationarity, you most certainly want to see how 
well a model trained in the past holds up in the future (possibly with more 
than one out-of-time sample, e.g. if one is used for pruning), and you can do 
this for temporal data by adjusting the way you do cross-validation. In fact, 
the exact method you describe is one common approach for time series data, 
e.g. see 
http://stats.stackexchange.com/questions/14099/using-k-fold-cross-validation-for-time-series-model-selection
Doing this multiple times does exactly what it does for normal 
cross-validation: it gives you a distribution of your error estimate (a 
sample of it), rather than a single value. So it's quite important. The size 
of the data isn't really relevant to this argument (also consider that I 
might want to employ larger datasets to reduce the risk of overfitting a more 
complex but better-fitting model, rather than to improve my error estimates). 
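For concreteness, the forward-chaining scheme described in that link can be 
sketched without Spark (plain Python; the function name and the 
test-on-one-point simplification are mine):

```python
def forward_chaining_folds(n, min_train=1):
    """Yield (train_indices, test_indices) pairs for ordered data:
    fold k trains on points [0, k) and tests on the single point k."""
    for k in range(min_train, n):
        yield list(range(k)), [k]

folds = list(forward_chaining_folds(5, min_train=2))
# each fold trains only on the past and tests strictly in the future,
# and the collection of folds gives a sample of the error distribution
```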

Note that this proposal doesn't define how the split RDDs are used (i.e. 
unioned) to create training sets and test sets. So the test set can be a single 
RDD, or multiple ones. It's entirely up to the user.
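As a minimal illustration of part 1 plus that union step (a Spark-free Python 
sketch over a sorted key sequence; the convention that a boundary key falls 
into the right-hand piece is my assumption):

```python
import bisect

def split_by_boundaries(sorted_keys, boundaries):
    """Split a sorted key sequence at n boundary keys, returning n+1
    pieces; a boundary key itself lands in the piece to its right."""
    cuts = [bisect.bisect_left(sorted_keys, b) for b in boundaries]
    edges = [0] + cuts + [len(sorted_keys)]
    return [sorted_keys[edges[i]:edges[i + 1]] for i in range(len(edges) - 1)]

parts = split_by_boundaries([1, 2, 5, 7, 9], [3, 8])
train = parts[0] + parts[1]   # union the earlier pieces into a training set
test = parts[2]               # the later piece is the out-of-time test set
```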

Allowing overlapping partitions (i.e. part 2) is a little different, because 
you probably wouldn't union the resulting RDDs due to duplication. It would be 
more useful as a primitive for bootstrapping the performance measures of 
streaming models or simulations (so you're not resampling records, but 
resampling subsequences). 
Alternatively, if you have big data but a class imbalance problem, you might 
need to resort to overlaps in the training sets to get multiple test sets with 
enough examples of your minority class.
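A rough sketch of part 2 on plain Python lists (interval endpoints are 
assumed inclusive, which is my choice; in Spark the analogous operation would 
be one range filter per interval):

```python
def overlapping_windows(sorted_keys, intervals):
    """Return one sub-sequence per (start, end) interval; because the
    intervals may overlap, an element can appear in several outputs."""
    return [[k for k in sorted_keys if start <= k <= end]
            for start, end in intervals]

windows = overlapping_windows([1, 2, 5, 7, 9], [(1, 5), (2, 9)])
# two overlapping subsequences, each of which could drive one bootstrap run
```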

From what I understand, MLUtils.kFold is standard randomised k-fold 
cross-validation *but without shuffling* (from a cursory look at the code, it 
looks like ordering will always be maintained... which should probably be 
documented if that is the case, because it can lead to bad things... and adds 
another argument for #6665). Either way, since elements of its splits are 
non-consecutive, it's not applicable to time series. 

Do you know how the performance of filterByRange would compare? It should be 
pretty performant if and only if the data is RangePartitioned, right? 
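I'd expect so: with a range partitioner, a range filter only needs to touch 
partitions whose key range intersects the query range. A toy sketch of that 
pruning (plain Python; representing each partition by its upper key bound is 
my simplification, not Spark's actual bookkeeping):

```python
def pruned_partitions(partition_bounds, lower, upper):
    """partition_bounds[i] is the largest key partition i can hold
    (partitions are sorted and contiguous). Return the indices of
    partitions whose key range intersects [lower, upper]."""
    keep = []
    lo_key = float("-inf")
    for i, hi_key in enumerate(partition_bounds):
        if hi_key >= lower and lo_key < upper:
            keep.append(i)
        lo_key = hi_key
    return keep

kept = pruned_partitions([10, 20, 30], 12, 25)
# only the partitions that can contain keys in [12, 25] need to be scanned
```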


> Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
> ----------------------------------------------------------------------
>
>                 Key: SPARK-6664
>                 URL: https://issues.apache.org/jira/browse/SPARK-6664
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Florian Verhein
>
> I can't find this functionality (if I missed something, apologies!), but it 
> would be very useful for evaluating ml models.  
> *Use case example* 
> suppose you have pre-processed web logs for a few months, and now want to 
> split it into a training set (where you train a model to predict some aspect 
> of site accesses, perhaps per user) and an out of time test set (where you 
> evaluate how well your model performs in the future). This example has just a 
> single split, but in general you could want more for cross validation. You 
> may also want to have multiple overlapping intervals.   
> *Specification* 
> 1. Given an Ordered RDD and an ordered sequence of n boundaries (i.e. keys), 
> return n+1 RDDs such that the values in the ith RDD fall between the 
> (i-1)th and ith boundaries.
> 2. More complex alternative (but similar under the hood): provide a sequence 
> of possibly overlapping intervals (ordered by the start key of the interval), 
> and return the RDDs containing values within those intervals. 
> *Implementation ideas / notes for 1*
> - The ordered RDDs are likely RangePartitioned (or there should be a simple 
> way to find ranges from partitions in an ordered RDD)
> - Find the partitions containing the boundary, and split them in two.  
> - Construct the new RDDs from the original partitions (and any split ones)
> I suspect this could be done by launching only a few jobs to split the 
> partitions containing the boundaries. 
> Alternatively, it might be possible to decorate these partitions and use them 
> in more than one RDD. I.e. let one of these partitions (for boundary i) be p. 
> Apply two decorators p' and p'', where p' masks out values above the ith 
> boundary, and p'' masks out values below the ith boundary. Any operations on 
> these partitions apply only to values not masked out. Then assign p' to the 
> ith output RDD and p'' to the (i+1)th output RDD.
> If I understand Spark correctly, this should not require any jobs. Not sure 
> whether it's worth trying this optimisation.
> *Implementation ideas / notes for 2*
> This is very similar, except that we have to handle entire (or parts) of 
> partitions belonging to more than one output RDD, since they are no longer 
> mutually exclusive. But since RDDs are immutable(??), the decorator idea 
> should still work?
> Thoughts?
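For what it's worth, the decorator idea in the quoted description might look 
roughly like this in miniature (plain Python, not Spark internals; the 
MaskedPartition name and the choice of which side gets the boundary key are 
illustrative assumptions):

```python
class MaskedPartition:
    """Decorator over an (immutable) partition: iteration only yields
    keys inside [lo, hi], so the same underlying data can back pieces
    of two different output RDDs without being copied or re-shuffled."""
    def __init__(self, data, lo=float("-inf"), hi=float("inf")):
        self.data, self.lo, self.hi = data, lo, hi
    def __iter__(self):
        return (k for k in self.data if self.lo <= k <= self.hi)

part = [3, 5, 8, 11]                 # partition straddling boundary 8
below = MaskedPartition(part, hi=7)  # p': assigned to output RDD i
above = MaskedPartition(part, lo=8)  # p'': assigned to output RDD i+1
```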



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
