[ https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987450#comment-13987450 ]

Sebastian Schelter commented on MAHOUT-1529:
--------------------------------------------

The problem with having a no-arg materialize operator is that our optimizer 
would have to decide how to materialize the data for Spark (in memory, in 
memory in serialized form, or on disk). I don't think we can or should make 
that decision ourselves. If people run into OOMs with Spark, we have to give 
them a way to work around that (e.g. allow them to tell the system to use a 
different storage level).

What do you think about keeping those storage levels but interpreting them as 
hints to the underlying system? That would signal to the user that the system 
might make a (hopefully) smarter decision, e.g. something like

{code}
drm.cache(CacheHint.IN_MEMORY)
{code}
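For illustration, here is a minimal sketch of what such a hint-based API could look like, assuming a backend-neutral CacheHint enumeration that the Spark bindings translate to a StorageLevel. The names and the mapping below are illustrative assumptions, not a committed design:

{code}
// Backend-neutral cache hints; algorithm-facing code never touches Spark types.
object CacheHint extends Enumeration {
  type CacheHint = Value
  val NONE, IN_MEMORY, IN_MEMORY_SER, DISK_ONLY = Value
}

// Hypothetical translation inside the Spark bindings. The backend is free
// to treat the hint as advisory and pick a different level if it knows better.
import org.apache.spark.storage.StorageLevel

def hintToStorageLevel(hint: CacheHint.CacheHint): StorageLevel = hint match {
  case CacheHint.NONE          => StorageLevel.NONE
  case CacheHint.IN_MEMORY     => StorageLevel.MEMORY_ONLY
  case CacheHint.IN_MEMORY_SER => StorageLevel.MEMORY_ONLY_SER
  case CacheHint.DISK_ONLY     => StorageLevel.DISK_ONLY
}
{code}

With this shape, drm.cache(CacheHint.IN_MEMORY) compiles against any backend, and only the bindings module depends on Spark.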



> Finalize abstraction of distributed logical plans from backend operations
> -------------------------------------------------------------------------
>
>                 Key: MAHOUT-1529
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1529
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Dmitriy Lyubimov
>             Fix For: 1.0
>
>
> We have a few situations where the algorithm-facing API has Spark dependencies 
> creeping in.
> In particular, we know of the following cases:
> (1) checkpoint() accepts the Spark StorageLevel constant directly;
> (2) certain things in CheckpointedDRM;
> (3) drmParallelize etc. routines in the "drm" and "sparkbindings" packages;
> (4) drmBroadcast returns a Spark-specific Broadcast object (see the sketch after this list).
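
Regarding item (4), one backend-agnostic shape for drmBroadcast would be a thin wrapper trait around the broadcast handle. The trait name BCast and the Spark-side adapter below are illustrative assumptions, not an agreed design:

{code}
// Backend-agnostic broadcast handle exposed to algorithm code.
trait BCast[T] {
  def value: T
}

// Hypothetical adapter in the Spark bindings, wrapping Spark's Broadcast.
import org.apache.spark.broadcast.Broadcast

class SparkBCast[T](delegate: Broadcast[T]) extends BCast[T] {
  def value: T = delegate.value
}
{code}

This keeps drmBroadcast's return type free of Spark imports; algorithm code calls .value on the handle regardless of which backend produced it.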



--
This message was sent by Atlassian JIRA
(v6.2#6252)
