[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

MLnick Wed, 23 Aug 2017 00:21:49 -0700

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18313
  
    The idea with best/all/k is to allow use cases with fairly large models 
(say, large enough that collecting all 100 or 1000 or however many param 
combinations to the driver is not feasible) to still store more than just the 
best model.
    
    So it's a way to cover the small-to-medium use case of storing "all" and 
the default use case of "best", and to offer a partial solution to the 
large-model use case via "k". The idea with the "k" version is to not do a full 
"collect then top k" but instead keep a running top-k (a PriorityQueue, perhaps) 
in order to limit memory consumption.
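
    As a rough sketch of the running top-k idea (not Spark's API, and written 
in Python with heapq as the priority queue purely for illustration): the class 
name RunningTopK, the offer/best methods, and the convention that a larger 
metric is better are all assumptions made for this example. The point is only 
that at most k models are held at any time, rather than collecting everything 
and sorting afterwards.

```python
import heapq
from itertools import count
from typing import Any, List, Tuple


class RunningTopK:
    """Hypothetical helper: keep the k best (metric, model) pairs seen so far.

    Assumes a larger metric is better. Memory use is bounded by k entries.
    """

    def __init__(self, k: int) -> None:
        if k <= 0:
            raise ValueError("k must be positive")
        self.k = k
        self._heap: list = []      # min-heap: worst retained entry sits at index 0
        self._tiebreak = count()   # unique counter so model objects are never compared

    def offer(self, metric: float, model: Any) -> None:
        entry = (metric, next(self._tiebreak), model)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
        elif metric > self._heap[0][0]:
            # New candidate beats the current worst: evict it in one step.
            heapq.heapreplace(self._heap, entry)
        # Otherwise the candidate is dropped and can be garbage-collected.

    def best(self) -> List[Tuple[float, Any]]:
        # Return retained entries from best to worst.
        return [(m, model) for m, _, model in sorted(self._heap, reverse=True)]


# Usage: strings stand in for fitted models.
top = RunningTopK(k=2)
for metric, name in [(0.71, "m0"), (0.88, "m1"), (0.64, "m2"), (0.93, "m3")]:
    top.offer(metric, name)
print(top.best())  # [(0.93, 'm3'), (0.88, 'm1')]
```

    Each offer costs O(log k), and a model that does not make the cut is never 
retained, which is what keeps memory bounded compared with "collect then top k".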
    
    But I agree that if we have both solutions (memory-based and file-based), 
then the "k" option is not necessary (though doing a top-k in the file-based 
scenario would be quite clunky). So if that is the end goal, then let's do the 
2-step process suggested above.



