Re: Quick question on spark performance

2016-05-20 Thread Yash Sharma
I am going with the default Java opts for EMR:
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70
-XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled
-XX:OnOutOfMemoryError='kill -9 %p'
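
Those are the CMS defaults. For comparison, a G1-based alternative often comes up in Spark GC tuning discussions; the sketch below is illustrative only (the flag values are not EMR defaults and were not tested on this job):

```shell
# Illustrative only: swap CMS for G1 on the executors.
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC \
-XX:InitiatingHeapOccupancyPercent=35 \
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps" \
  ...
```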

The data is not partitioned. It's 6 TB of data in gzip files of around 400 MB
each. The workload is a scan/filter/reduceBy that needs to scan the entire dataset.
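
A side note on those 400 MB .gz files: gzip is not splittable, so each file is read by exactly one task. A rough sketch of the implied parallelism, using only numbers from this thread (so ballpark at best):

```python
# gzip is not splittable: one .gz file -> one input partition -> one task.
total_bytes = 6 * 1024**4        # ~6 TB of input
file_bytes = 400 * 1024**2       # ~400 MB per gz file
num_tasks = total_bytes // file_bytes
print(num_tasks)                 # 15728 single-file scan tasks

total_cores = 200 * 6            # ~200 executors x 6 cores (from this thread)
print(round(num_tasks / total_cores, 1))  # ~13.1 waves of tasks
```

Repartitioning after the initial read (or producing splittable input) is the usual way around the one-task-per-file constraint.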



On Sat, May 21, 2016 at 11:07 AM, Yash Sharma  wrote:

> The median GC time is 1.3 mins for a median duration of 41 mins. What
> parameters can I tune to control GC?
>
> Other details: median peak execution memory of 13 GB and input records of
> 2.3 GB.
> 180-200 executors launched.
>
> - Thanks, via mobile,  excuse brevity.


Re: Quick question on spark performance

2016-05-20 Thread Yash Sharma
The median GC time is 1.3 mins for a median duration of 41 mins. What
parameters can I tune to control GC?

Other details: median peak execution memory of 13 GB and input records of
2.3 GB.
180-200 executors launched.

- Thanks, via mobile,  excuse brevity.
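
For reference, those medians put GC at only a few percent of task time (quick arithmetic, assuming both medians describe the same set of tasks):

```python
median_gc_min = 1.3        # median GC time reported above
median_duration_min = 41   # median task duration reported above
gc_fraction = median_gc_min / median_duration_min
print(f"GC overhead: {gc_fraction:.1%}")  # GC overhead: 3.2%
```

A ~3% median GC overhead is usually not alarming by itself, though the tail (maximum GC time per task) can still hurt stragglers.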
On May 21, 2016 10:59 AM, "Reynold Xin"  wrote:

> It's probably due to GC.


Re: Quick question on spark performance

2016-05-20 Thread Ted Yu
Yash:
Can you share the JVM parameters you used?

How many partitions are there in your data set?

Thanks

On Fri, May 20, 2016 at 5:59 PM, Reynold Xin  wrote:

> It's probably due to GC.


Re: Quick question on spark performance

2016-05-20 Thread Reynold Xin
It's probably due to GC.

On Fri, May 20, 2016 at 5:54 PM, Yash Sharma  wrote:

> Hi All,
> I am here to get some expert advice on a use case I am working on.
>
> Cluster & job details below -
>
> Data - 6 TB
> Cluster - EMR - 15 Nodes C3-8xLarge (shared by other MR apps)
>
> Parameters-
> --executor-memory 10G \
> --executor-cores 6 \
> --conf spark.dynamicAllocation.enabled=true \
> --conf spark.dynamicAllocation.initialExecutors=15 \
>
> Runtime : 3 Hrs
>
> Monitoring the metrics, I noticed that 10G per executor is not required
> (since I don't have a lot of groupings).
>
> After reducing to --executor-memory 3G, the runtime dropped to 2 hrs.
>
> Question:
> Adding more nodes now has absolutely no effect on the runtime. Is there
> anything I can tune/change/experiment with to make the job faster?
>
> Workload: Mostly reduceBy's and scans.
>
> Would appreciate any insights and thoughts. Best Regards
>
>
>


Quick question on spark performance

2016-05-20 Thread Yash Sharma
Hi All,
I am here to get some expert advice on a use case I am working on.

Cluster & job details below -

Data - 6 TB
Cluster - EMR - 15 Nodes C3-8xLarge (shared by other MR apps)

Parameters-
--executor-memory 10G \
--executor-cores 6 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.initialExecutors=15 \

Runtime : 3 Hrs

Monitoring the metrics, I noticed that 10G per executor is not required
(since I don't have a lot of groupings).

After reducing to --executor-memory 3G, the runtime dropped to 2 hrs.
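
For what it's worth, a back-of-envelope on why 3G executors may have helped, assuming c3.8xlarge specs of 32 vCPUs and ~60 GB RAM per node (usable YARN capacity will be somewhat lower) and Spark-on-YARN's default memory overhead:

```python
# Rough executor-packing math for c3.8xlarge nodes (assumed specs:
# 32 vCPUs, ~60 GB RAM each; real YARN capacity is somewhat lower).
vcpus, mem_gb, nodes = 32, 60, 15

def executors_per_node(exec_mem_gb):
    # Spark-on-YARN default overhead: max(384 MB, 10% of executor memory)
    overhead_gb = max(0.384, 0.10 * exec_mem_gb)
    return int(mem_gb // (exec_mem_gb + overhead_gb))

# YARN's default capacity scheduler packs containers by memory alone:
print(nodes * executors_per_node(10))  # 75 executors cluster-wide at 10G
print(nodes * executors_per_node(3))   # 255 executors cluster-wide at 3G
```

The 180-200 executors reported elsewhere in this thread sit in the same ballpark as the 3G figure on a shared cluster, which suggests the 10G to 3G change helped mainly by letting more executors fit per node.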

Question:
Adding more nodes now has absolutely no effect on the runtime. Is there
anything I can tune/change/experiment with to make the job faster?

Workload: Mostly reduceBy's and scans.

Would appreciate any insights and thoughts. Best Regards
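
One knob relevant to the "more nodes, no speed-up" question above: with unsplittable .gz inputs the scan stage is fixed at one task per file, but the reduce side can still be widened. A hypothetical submit (the class and jar names, and the parallelism value, are illustrative only):

```shell
# Hypothetical: widen the shuffle/reduce side; the scan stage stays
# at one task per .gz file regardless of cluster size.
spark-submit \
  --class com.example.Job \
  --executor-memory 3G \
  --executor-cores 6 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.default.parallelism=2000 \
  your_job.jar
```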