groupByKey() and keys with many values

2015-09-07 Thread kaklakariada
Hi,

I already posted this question on the users mailing list
(http://apache-spark-user-list.1001560.n3.nabble.com/Using-groupByKey-with-many-values-per-key-td24538.html)
but did not get a reply. Maybe this is the correct forum to ask.

My problem is that groupByKey().mapToPair() loads all values for a key into
memory, which is a problem when the values don't fit. This was not an issue with
Hadoop MapReduce, because the Iterable passed to the reducer reads from disk.

In Spark, the Iterable passed to mapToPair() is backed by a CompactBuffer
containing all values.

Is it possible to change this behavior without modifying Spark, or is there
a plan to change this?

Thank you very much for your help!
Christoph.






Re: groupByKey() and keys with many values

2015-09-07 Thread Sean Owen
That's how it's intended to work; if it's a problem, you probably need
to re-design your computation to not use groupByKey. Usually you can
do so.
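
A minimal sketch of the kind of redesign meant here (not from the original
thread; the word-count logic and input path are made up): a per-key sum done
with reduceByKey lets Spark combine values incrementally and map-side, so no
single key's values are ever materialized together:

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("GroupByKeyRedesign"))
  val pairs = sc.textFile("hdfs:///some/input")   // hypothetical input path
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1L))

  // Instead of pairs.groupByKey().mapValues(_.sum), which buffers every value
  // for a key in one CompactBuffer, combine as you go:
  val counts = pairs.reduceByKey(_ + _)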

On Mon, Sep 7, 2015 at 9:02 AM, kaklakariada  wrote:




Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-07 Thread james
add a critical bug https://issues.apache.org/jira/browse/SPARK-10474
(Aggregation failed with unable to acquire memory)






Fast Iteration while developing

2015-09-07 Thread Justin Uang
Hi,

What is the normal workflow for the core devs?

- Do we need to build the assembly jar to be able to run it from the Spark
repo?
- Do you use sbt or Maven to do the build?
- Is zinc only usable with Maven?

I'm asking because my current process is to do a full sbt build, which leaves me
stuck with roughly a 3-5 minute iteration cycle.

Thanks!

Justin


Re: Fast Iteration while developing

2015-09-07 Thread Reynold Xin
I usually write a test case for what I want to test, and then run

sbt/sbt "~module/test:test-only *MyTestSuite"
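
If you work through Maven instead, something along these lines should run a
single Scala suite (a sketch from memory, with a placeholder suite name; check
the building-spark docs for the exact flags):

  build/mvn test -DwildcardSuites=org.apache.spark.MyTestSuite -Dtest=none

build/mvn should also start a zinc server for you, if I remember correctly,
which keeps recompiles fast.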



On Mon, Sep 7, 2015 at 6:02 PM, Justin Uang  wrote:



Re: groupByKey() and keys with many values

2015-09-07 Thread Antonio Piccolboni
To expand on what Sean said, I would look into replacing groupByKey with
reduceByKey; also take a look at this doc.
I happen to have designed a library that drew the same criticism, when compared
with the Java MapReduce API, over its use of iterables, but neither we nor the
critics could ever find a natural example of a computation that can be expressed
as a single pass through each group in constant memory and yet cannot be
converted to use a combiner (MapReduce jargon; called a reduce in Spark and in
most functional circles). If you have found such an example, it would be of some
interest to know what it is, even though it is an obstacle for you.
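
As a concrete (made-up) illustration of the combiner point: a per-key mean
expressed with aggregateByKey keeps only a running (sum, count) per key instead
of buffering every value, which is exactly the shape of computation that
converts cleanly:

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("PerKeyMean"))
  val values = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))

  // The accumulator is (sum, count); both merge functions are associative, so
  // Spark can pre-combine partial results map-side, like a Hadoop combiner.
  val sumCounts = values.aggregateByKey((0.0, 0L))(
    (acc, v) => (acc._1 + v, acc._2 + 1),
    (a, b) => (a._1 + b._1, a._2 + b._2))
  val means = sumCounts.mapValues { case (sum, count) => sum / count }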


On Mon, Sep 7, 2015 at 1:31 AM Sean Owen  wrote:



Re: Code generation for GPU

2015-09-07 Thread lonikar
Hi Reynold,

Thanks for responding. I was waiting for a reply on the Spark user list and at my
own email address, since I had not posted this on the dev list. I just saw your
reply.

1. I figured out that the various code generation classes have either an *apply*
or an *eval* method, depending on whether they compute a value or evaluate an
expression as a filter, and that the code which executes the generated code is
in sql.execution.basicOperators.scala.

2. If vectorization is difficult or a major effort, I am not sure how I can
implement even a glimpse of the changes I would like to, and I think I will have
to settle for a partial effort. Batching rows defeats the purpose, as I have
found that it consumes a considerable amount of CPU cycles, and producing one
row at a time also takes away the performance benefit. What's really required is
to read a large partition and produce the result partition in one shot.

In that case I think I will have to severely limit the scope of my talk, or
reorient it to propose the changes instead of presenting the results of
execution on a GPU. Please advise, since you seem to have selected the talk.

3. I agree, it's pretty fast-paced development. I have started working on the
1.5.1 snapshot.

4. How do I tune the batch size (number of rows in the ByteBuffer)? Is it
through the property spark.sql.inMemoryColumnarStorage.batchSize?
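
For what it's worth, the way I would expect to set it if that is the right
property (just a guess on my part, with an arbitrary value):

  sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "20000")

or --conf spark.sql.inMemoryColumnarStorage.batchSize=20000 on spark-submit.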

-Kiran


