[ https://issues.apache.org/jira/browse/SPARK-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395773#comment-14395773 ]

Sean Owen commented on SPARK-6695:
----------------------------------

The problem with this in the general case is that it raises questions like: where 
are you allowed to spill, and how much? The spilled data has to be cleaned up, but 
how do you know when it can be, since evaluation happens at some future point? It 
also gets much slower, which may not solve much.

(As an aside, for this particular function, it makes me think that your settings 
aren't causing it to do much sampling at all. I think the partial products this 
returns for each row are intended to be pretty sparse. If you're running out of 
memory, that is likely the real problem.)

I think that, in general, a function that uses a large amount of interim memory is 
going to get into trouble in Spark, and a bunch of I/O just pushes the problem 
around. For example, it might be possible here to decompose the overall flatMap 
over an iterator of rows into a flatMap of a flatMapping of each row, where the 
inner flatMap emits the partial products from just one element of the row at a 
time. I think you'd get a much lower maximum memory usage for free, but I haven't 
thought it through 100%.
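
Very roughly, the shape I mean is something like this (a sketch only, assuming 
rdd is an RDD[Array[Double]]; partialProductsFor is a hypothetical stand-in for 
whatever per-element computation the real function does):

{code}
// Sketch only. partialProductsFor is a placeholder for the real
// per-element computation; the point is the nesting, not the body.
def partialProductsFor(elem: Double): Iterator[Double] =
  Iterator(elem, elem * elem) // placeholder computation

// assuming rdd: RDD[Array[Double]]
val result = rdd.mapPartitions { (rows: Iterator[Array[Double]]) =>
  // The outer flatMap consumes rows lazily; the inner flatMap consumes
  // one row's elements lazily, so only one element's partial products
  // need to be materialized at any moment.
  rows.flatMap { row =>
    row.iterator.flatMap(partialProductsFor)
  }
}
{code}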

> Add an external iterator: a hadoop-like output collector
> --------------------------------------------------------
>
>                 Key: SPARK-6695
>                 URL: https://issues.apache.org/jira/browse/SPARK-6695
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: uncleGen
>
> In practical use, we often need to create a big iterator: one that is either too 
> big in `memory usage` or too long in `array size`. On the one hand, it leads to 
> too much memory consumption. On the other hand, a single `Array` may not be able 
> to hold all the elements, since Java array indices are of type `int` (4 bytes, 
> or 32 bits). So, IMHO, we could provide a `collector` that has a buffer (100 MB, 
> or some other configurable size) and can spill data to disk. The use case may 
> look like:
> {code:borderStyle=solid}
> rdd.mapPartitions { it =>
>   ...
>   val collector = new ExternalCollector()
>   collector.collect(a)
>   ...
>   collector.iterator
> }
> {code}
> I have done some related work already, and I would like your opinions. Thanks!
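> A rough sketch of the idea follows (illustrative only: the spill threshold, 
> plain Java serialization, and temp-file cleanup are placeholders for whatever 
> Spark's serializer and task-cleanup hooks would really provide):
> {code:borderStyle=solid}
> import java.io._
> import scala.collection.mutable.ArrayBuffer
>
> // Sketch only: buffers elements in memory, spills each full buffer to a
> // temp file, and replays spilled files followed by the in-memory tail.
> class ExternalCollector[T](spillThreshold: Int = 10000) {
>   private val buffer = new ArrayBuffer[T]
>   private val spills = new ArrayBuffer[File]
>
>   def collect(elem: T): Unit = {
>     buffer += elem
>     if (buffer.size >= spillThreshold) spill()
>   }
>
>   private def spill(): Unit = {
>     val file = File.createTempFile("external-collector", ".bin")
>     file.deleteOnExit() // a real version would hook into task completion
>     val out = new ObjectOutputStream(new FileOutputStream(file))
>     try {
>       out.writeInt(buffer.size)
>       buffer.foreach(e => out.writeObject(e))
>     } finally out.close()
>     spills += file
>     buffer.clear()
>   }
>
>   def iterator: Iterator[T] = {
>     // Replay each spill file lazily, then the current in-memory buffer.
>     val spilled = spills.iterator.flatMap { file =>
>       val in = new ObjectInputStream(new FileInputStream(file))
>       val n = in.readInt()
>       Iterator.tabulate(n) { i =>
>         val e = in.readObject().asInstanceOf[T]
>         if (i == n - 1) in.close() // close after the last element is read
>         e
>       }
>     }
>     spilled ++ buffer.iterator
>   }
> }
> {code}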



