Re: [Spark 2.x Core] .collect() size limit

2018-04-30 Thread Irving Duran
I don't think there is a magic number, so I would say that it will depend
on how big your dataset is and on the size of your worker(s).

Thank You,

Irving Duran


On Sat, Apr 28, 2018 at 10:41 AM klrmowse  wrote:

> I am currently trying to find a workaround for the Spark application I am
> working on so that it does not have to use .collect()
>
> but, for now, it is going to have to use .collect()
>
> what is the size limit (memory for the driver) of an RDD that .collect()
> can work with?
>
> I've been scouring Google, S.O., blogs, etc., and everyone cautions about
> .collect() but never specifies how huge is huge... are we talking about a
> few gigabytes? terabytes? petabytes?
>
>
>
> thank you
>
>
>
>


Re: [Spark 2.x Core] .collect() size limit

2018-04-30 Thread Vadim Semenov
`.collect` returns an Array, and arrays can't have more than Int.MaxValue
elements; in most JVMs the practical limit is even lower (`Int.MaxValue - 8`).
So that puts an upper limit on a single collect. However, you can just
create an Array of Arrays, and so on, making it basically limitless, albeit
with some gymnastics.
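
For anyone who wants the gymnastics spelled out, here is a minimal Scala
sketch (the dataset and partition count are made up for illustration):
`glom()` keeps each partition as its own array, so only one partition at a
time has to stay under the single-array limit, and `toLocalIterator` avoids
building a giant array at all.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("collect-demo")
      .getOrCreate()
    val rdd = spark.sparkContext.parallelize(1L to 10000000L, numSlices = 64)

    // Plain collect() builds a single Array[Long], bounded by the JVM
    // array limit (~Int.MaxValue - 8 elements) on top of driver memory.
    // val flat: Array[Long] = rdd.collect()

    // "Array of Arrays": glom() turns each partition into its own array,
    // so collect() returns Array[Array[Long]] and only each partition,
    // not the whole dataset, must fit under the single-array limit.
    val nested: Array[Array[Long]] = rdd.glom().collect()

    // Or sidestep the giant array entirely and stream partition by
    // partition to the driver.
    rdd.toLocalIterator.take(5).foreach(println)

    spark.stop()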


Re: [Spark 2.x Core] .collect() size limit

2018-04-30 Thread Deepak Goel
Could you please help us by pointing to the source for that general
guideline (80-85%)?

Even if there is a general guideline, it probably exists to keep the
performance of a Spark application high (and to *distinguish* it from
Hadoop). But if you are not too concerned about the *performance* hit of
spilling from memory to disk, then you could use virtual memory to your
advantage. In fact, I think the OS could do a pretty good job of data
management by keeping only the necessary data in RAM while imposing no hard
limit (it would be great to have benchmarks if anyone has run such a test
before).

Also, we should *tread* carefully when applying general guidelines to
problems. They might not be *relevant* at all.

Deepak
"Please stop cruelty to Animals, help by becoming a Vegan"
+91 73500 12833
deic...@gmail.com

Facebook: https://www.facebook.com/deicool
LinkedIn: www.linkedin.com/in/deicool

"Plant a Tree, Go Green"

Made In India : http://www.makeinindia.com/home

On Mon, Apr 30, 2018 at 9:06 PM, Lalwani, Jayesh <
jayesh.lalw...@capitalone.com> wrote:

> Although there is such a thing as virtualization of memory done at the OS
> layer, the JVM imposes its own limit, controlled by the
> *spark.executor.memory* and *spark.driver.memory* configurations. The
> amount of memory allocated by the JVM is governed by those parameters.
> General guidelines say that executor and driver memory should be kept at
> 80-85% of available RAM. So, if the general guidelines are followed,
> *virtual memory* is moot.
>
> From: Deepak Goel <deic...@gmail.com>
> Date: Saturday, April 28, 2018 at 12:58 PM
> To: Stephen Boesch <java...@gmail.com>
> Cc: klrmowse <klrmo...@gmail.com>, "user @spark" <user@spark.apache.org>
> Subject: Re: [Spark 2.x Core] .collect() size limit
>
>
>
> I believe the virtualization of memory happens at the OS layer, hiding it
> completely from the application layer.
>
>
>
> On Sat, 28 Apr 2018, 22:22 Stephen Boesch, <java...@gmail.com> wrote:
>
> While it is certainly possible to use VM, I have seen warnings in a number
> of places that collect() results must fit in memory. I'm not sure if that
> applies to *all* Spark calculations, but at the very least each of the
> specific collect()s that are performed would need to be verified.
>
>
>
> And maybe *all* collects do require sufficient memory; would you like to
> check the source code to see whether disk-backed collects actually happen
> in some cases?
>
>
>
> 2018-04-28 9:48 GMT-07:00 Deepak Goel <deic...@gmail.com>:
>
> There is such a thing as *virtual memory*
>
>
>
> On Sat, 28 Apr 2018, 21:19 Stephen Boesch, <java...@gmail.com> wrote:
>
> Do you have a machine with terabytes of RAM? AFAIK collect() requires
> RAM, so that would be your limiting factor.
>
>
>
> 2018-04-28 8:41 GMT-07:00 klrmowse <klrmo...@gmail.com>:
>
> I am currently trying to find a workaround for the Spark application I am
> working on so that it does not have to use .collect()
>
> but, for now, it is going to have to use .collect()
>
> what is the size limit (memory for the driver) of an RDD that .collect()
> can work with?
>
> I've been scouring Google, S.O., blogs, etc., and everyone cautions about
> .collect() but never specifies how huge is huge... are we talking about a
> few gigabytes? terabytes? petabytes?
>
>
>
> thank you
>
>
>


Re: [Spark 2.x Core] .collect() size limit

2018-04-30 Thread Lalwani, Jayesh
Although there is such a thing as virtualization of memory done at the OS
layer, the JVM imposes its own limit, controlled by the
spark.executor.memory and spark.driver.memory configurations. The amount of
memory allocated by the JVM is governed by those parameters. General
guidelines say that executor and driver memory should be kept at 80-85% of
available RAM. So, if the general guidelines are followed, *virtual memory*
is moot.
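
For illustration, a minimal sketch of those knobs (the sizes are made-up
examples, not recommendations). Note that in client mode the driver JVM is
already running by the time user code executes, so spark.driver.memory is
normally passed at launch (e.g. spark-submit --driver-memory 8g) rather
than set programmatically.

    import org.apache.spark.SparkConf

    // Example values only. spark.executor.memory sizes each executor's
    // JVM heap; spark.driver.memory sizes the driver's heap and, in
    // client mode, must be supplied at launch because the driver JVM
    // has already started.
    val conf = new SparkConf()
      .set("spark.executor.memory", "8g")
      .set("spark.driver.memory", "8g")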
From: Deepak Goel <deic...@gmail.com>
Date: Saturday, April 28, 2018 at 12:58 PM
To: Stephen Boesch <java...@gmail.com>
Cc: klrmowse <klrmo...@gmail.com>, "user @spark" <user@spark.apache.org>
Subject: Re: [Spark 2.x Core] .collect() size limit

I believe the virtualization of memory happens at the OS layer, hiding it
completely from the application layer.

On Sat, 28 Apr 2018, 22:22 Stephen Boesch, <java...@gmail.com> wrote:
While it is certainly possible to use VM, I have seen warnings in a number
of places that collect() results must fit in memory. I'm not sure if that
applies to *all* Spark calculations, but at the very least each of the
specific collect()s that are performed would need to be verified.

And maybe *all* collects do require sufficient memory; would you like to
check the source code to see whether disk-backed collects actually happen
in some cases?

2018-04-28 9:48 GMT-07:00 Deepak Goel <deic...@gmail.com>:
There is such a thing as *virtual memory*

On Sat, 28 Apr 2018, 21:19 Stephen Boesch, <java...@gmail.com> wrote:
Do you have a machine with terabytes of RAM? AFAIK collect() requires RAM,
so that would be your limiting factor.

2018-04-28 8:41 GMT-07:00 klrmowse <klrmo...@gmail.com>:
I am currently trying to find a workaround for the Spark application I am
working on so that it does not have to use .collect()

but, for now, it is going to have to use .collect()

what is the size limit (memory for the driver) of an RDD that .collect()
can work with?

I've been scouring Google, S.O., blogs, etc., and everyone cautions about
.collect() but never specifies how huge is huge... are we talking about a
few gigabytes? terabytes? petabytes?



thank you





Re: [Spark 2.x Core] .collect() size limit

2018-04-28 Thread Mark Hamstra
spark.driver.maxResultSize

http://spark.apache.org/docs/latest/configuration.html
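
A minimal sketch of raising that cap when building a session (4g is just an
example value; the Spark 2.x default is 1g, and 0 removes the check, though
the collected data still has to fit in driver memory):

    import org.apache.spark.sql.SparkSession

    // spark.driver.maxResultSize caps the total serialized size of
    // results that actions like collect() may send back to the driver.
    val spark = SparkSession.builder()
      .appName("collect-limit-demo")
      .config("spark.driver.maxResultSize", "4g") // example; default 1g
      .getOrCreate()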

On Sat, Apr 28, 2018 at 8:41 AM, klrmowse  wrote:

> I am currently trying to find a workaround for the Spark application I am
> working on so that it does not have to use .collect()
>
> but, for now, it is going to have to use .collect()
>
> what is the size limit (memory for the driver) of an RDD that .collect()
> can work with?
>
> I've been scouring Google, S.O., blogs, etc., and everyone cautions about
> .collect() but never specifies how huge is huge... are we talking about a
> few gigabytes? terabytes? petabytes?
>
>
>
> thank you
>
>
>
>


Re: [Spark 2.x Core] .collect() size limit

2018-04-28 Thread Deepak Goel
I believe the virtualization of memory happens at the OS layer, hiding it
completely from the application layer.

On Sat, 28 Apr 2018, 22:22 Stephen Boesch,  wrote:

> While it is certainly possible to use VM, I have seen warnings in a number
> of places that collect() results must fit in memory. I'm not sure if that
> applies to *all* Spark calculations, but at the very least each of the
> specific collect()s that are performed would need to be verified.
>
> And maybe *all* collects do require sufficient memory; would you like to
> check the source code to see whether disk-backed collects actually happen
> in some cases?
>
> 2018-04-28 9:48 GMT-07:00 Deepak Goel :
>
>> There is such a thing as *virtual memory*
>>
>> On Sat, 28 Apr 2018, 21:19 Stephen Boesch,  wrote:
>>
>>> Do you have a machine with terabytes of RAM? AFAIK collect() requires
>>> RAM, so that would be your limiting factor.
>>>
>>> 2018-04-28 8:41 GMT-07:00 klrmowse :
>>>
 I am currently trying to find a workaround for the Spark application I am
 working on so that it does not have to use .collect()

 but, for now, it is going to have to use .collect()

 what is the size limit (memory for the driver) of an RDD that .collect()
 can work with?

 I've been scouring Google, S.O., blogs, etc., and everyone cautions about
 .collect() but never specifies how huge is huge... are we talking about a
 few gigabytes? terabytes? petabytes?



 thank you





Re: [Spark 2.x Core] .collect() size limit

2018-04-28 Thread Stephen Boesch
While it is certainly possible to use VM, I have seen warnings in a number
of places that collect() results must fit in memory. I'm not sure if that
applies to *all* Spark calculations, but at the very least each of the
specific collect()s that are performed would need to be verified.

And maybe *all* collects do require sufficient memory; would you like to
check the source code to see whether disk-backed collects actually happen
in some cases?

2018-04-28 9:48 GMT-07:00 Deepak Goel :

> There is such a thing as *virtual memory*
>
> On Sat, 28 Apr 2018, 21:19 Stephen Boesch,  wrote:
>
>> Do you have a machine with terabytes of RAM? AFAIK collect() requires
>> RAM, so that would be your limiting factor.
>>
>> 2018-04-28 8:41 GMT-07:00 klrmowse :
>>
>>> I am currently trying to find a workaround for the Spark application I am
>>> working on so that it does not have to use .collect()
>>>
>>> but, for now, it is going to have to use .collect()
>>>
>>> what is the size limit (memory for the driver) of an RDD that .collect()
>>> can work with?
>>>
>>> I've been scouring Google, S.O., blogs, etc., and everyone cautions about
>>> .collect() but never specifies how huge is huge... are we talking about a
>>> few gigabytes? terabytes? petabytes?
>>>
>>>
>>>
>>> thank you
>>>
>>>
>>>


Re: [Spark 2.x Core] .collect() size limit

2018-04-28 Thread Deepak Goel
There is such a thing as *virtual memory*

On Sat, 28 Apr 2018, 21:19 Stephen Boesch,  wrote:

> Do you have a machine with terabytes of RAM? AFAIK collect() requires
> RAM, so that would be your limiting factor.
>
> 2018-04-28 8:41 GMT-07:00 klrmowse :
>
>> I am currently trying to find a workaround for the Spark application I am
>> working on so that it does not have to use .collect()
>>
>> but, for now, it is going to have to use .collect()
>>
>> what is the size limit (memory for the driver) of an RDD that .collect()
>> can work with?
>>
>> I've been scouring Google, S.O., blogs, etc., and everyone cautions about
>> .collect() but never specifies how huge is huge... are we talking about a
>> few gigabytes? terabytes? petabytes?
>>
>>
>>
>> thank you
>>
>>
>>


Re: [Spark 2.x Core] .collect() size limit

2018-04-28 Thread Stephen Boesch
Do you have a machine with terabytes of RAM? AFAIK collect() requires RAM,
so that would be your limiting factor.
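
One rough way to check that per collect(), as a sketch (the helper name and
the sample size of 100 are arbitrary choices, not a standard API): sample a
few rows, estimate their in-memory size with Spark's SizeEstimator, and
extrapolate to the full row count before attempting the real collect().

    import org.apache.spark.rdd.RDD
    import org.apache.spark.util.SizeEstimator

    // Hypothetical helper: extrapolates the driver memory a collect()
    // would need from a small sample. Ballpark only.
    def estimatedCollectBytes[T](rdd: RDD[T]): Long = {
      val sample = rdd.take(100)
      if (sample.isEmpty) 0L
      else (SizeEstimator.estimate(sample) / sample.length) * rdd.count()
    }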

2018-04-28 8:41 GMT-07:00 klrmowse :

> I am currently trying to find a workaround for the Spark application I am
> working on so that it does not have to use .collect()
>
> but, for now, it is going to have to use .collect()
>
> what is the size limit (memory for the driver) of an RDD that .collect()
> can work with?
>
> I've been scouring Google, S.O., blogs, etc., and everyone cautions about
> .collect() but never specifies how huge is huge... are we talking about a
> few gigabytes? terabytes? petabytes?
>
>
>
> thank you
>
>
>