Hi Roger,

You should be able to sort within each partition using the rdd.mapPartitions()
method, and that shouldn't require holding all of the data in memory at once.
It does require holding each entire partition in memory, though. Do you need
to avoid ever holding a full partition in memory at once?
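
If it helps, here's a rough sketch of what I mean (assuming an RDD of
key/value pairs with an Ordering on the key; the names are just illustrative):

    val sortedWithinPartitions = rdd.mapPartitions { iter =>
      // toSeq materializes the whole partition before sorting, which is
      // why the per-partition memory requirement still applies.
      iter.toSeq.sortBy(_._1).iterator
    }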

As far as the work that Aaron mentioned is happening, I think he might be
referring to the discussion and code surrounding
https://issues.apache.org/jira/browse/SPARK-983

Cheers!
Andrew


On Thu, Jun 5, 2014 at 5:16 PM, Roger Hoover <roger.hoo...@gmail.com> wrote:

> I think it would be very handy to be able to specify that you want sorting
> during a partitioning stage.
>
>
> On Thu, Jun 5, 2014 at 4:42 PM, Roger Hoover <roger.hoo...@gmail.com>
> wrote:
>
>> Hi Aaron,
>>
>> When you say that sorting is being worked on, can you elaborate a little
>> more please?
>>
>> In particular, I want to sort the items within each partition (not
>> globally) without necessarily bringing them all into memory at once.
>>
>> Thanks,
>>
>> Roger
>>
>>
>> On Sat, May 31, 2014 at 11:10 PM, Aaron Davidson <ilike...@gmail.com>
>> wrote:
>>
>>> There is no fundamental issue if you're running on data that is larger
>>> than cluster memory size. Many operations can stream data through, and thus
>>> memory usage is independent of input data size. Certain operations require
>>> an entire *partition* (not dataset) to fit in memory, but there are not
>>> many instances of this left (sorting comes to mind, and this is being
>>> worked on).
>>>
>>> In general, one problem with Spark today is that you *can* OOM under
>>> certain configurations, and it's possible you'll need to change from the
>>> default configuration if you're doing very memory-intensive jobs.
>>> However, there are very few cases where Spark would simply fail as a
>>> matter of course; for instance, you can always increase the number of
>>> partitions to decrease the size of any given one, or repartition data to
>>> eliminate skew.
>>>
>>> Regarding impact on performance, as Mayur said, there may absolutely be
>>> an impact depending on your jobs. If you're doing a join on a very large
>>> amount of data with few partitions, then we'll have to spill to disk. If
>>> you can't cache your working set of data in memory, you will also see a
>>> performance degradation. Spark enables the use of memory to make things
>>> fast, but if you just don't have enough memory, it won't be terribly fast.
>>>
>>>
>>> On Sat, May 31, 2014 at 12:14 AM, Mayur Rustagi <mayur.rust...@gmail.com
>>> > wrote:
>>>
>>>> Clearly there will be an impact on performance, but frankly it depends
>>>> on what you are trying to achieve with the dataset.
>>>>
>>>> Mayur Rustagi
>>>> Ph: +1 (760) 203 3257
>>>> http://www.sigmoidanalytics.com
>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>
>>>>
>>>>
>>>> On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga <vibhorba...@gmail.com>
>>>> wrote:
>>>>
>>>>> Some inputs will be really helpful.
>>>>>
>>>>> Thanks,
>>>>> -Vibhor
>>>>>
>>>>>
>>>>> On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga <vibhorba...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I am planning to use Spark with HBase, where I generate an RDD by
>>>>>> reading data from an HBase table.
>>>>>>
>>>>>> I want to know: in the case where the HBase table grows larger than
>>>>>> the amount of RAM available in the cluster, will the application
>>>>>> fail, or will there only be an impact on performance?
>>>>>>
>>>>>> Any thoughts in this direction will be helpful and are welcome.
>>>>>>
>>>>>> Thanks,
>>>>>> -Vibhor
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Vibhor Banga
>>>>> Software Development Engineer
>>>>> Flipkart Internet Pvt. Ltd., Bangalore
>>>>>
>>>>>
>>>>
>>>
>>
>
