Re: Quick question on spark performance
Am going with the default Java opts for EMR:

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'

The data is not partitioned. It is 6 TB of data in around 400 MB gz files. The workload is a scan/filter/reduceBy which needs to scan the entire data.

On Sat, May 21, 2016 at 11:07 AM, Yash Sharma wrote:
> The median GC time is 1.3 mins for a median duration of 41 mins. What
> parameters can I tune for controlling GC?
>
> Other details: median peak execution memory of 13 G and input records of
> 2.3 gigs. 180-200 executors launched.
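Since the pauses above come from the default CMS collector, a common experiment (not suggested anywhere in this thread, and with purely illustrative flag values) is to switch the executors to G1, which often gives shorter pauses on multi-gigabyte heaps. Note the G1 flags must replace the CMS ones, because -XX:+UseG1GC and -XX:+UseConcMarkSweepGC are mutually exclusive:

```shell
# Sketch only: swap the EMR default CMS flags for G1 on the executors.
# The IHOP value is a placeholder, not tuned for this workload.
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC \
    -XX:InitiatingHeapOccupancyPercent=35 \
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps" \
  ...
```

Separately, .gz files are not splittable, so each ~400 MB file is decompressed by a single task; if per-task memory pressure is the issue, repartitioning right after the scan is the usual workaround.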
Re: Quick question on spark performance
The median GC time is 1.3 mins for a median duration of 41 mins. What parameters can I tune for controlling GC?

Other details: median peak execution memory of 13 G and input records of 2.3 gigs. 180-200 executors launched.

- Thanks, via mobile, excuse brevity.

On May 21, 2016 10:59 AM, "Reynold Xin" wrote:
> It's probably due to GC.
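For what it's worth, 1.3 min of GC against a 41 min median duration is only about 3% of task time, so GC alone may not explain the runtime. For a reduceBy-heavy RDD job, one knob worth trying (a guess on my part, and the value below is a placeholder, not derived from this thread) is the default shuffle parallelism, so each reduce task holds less state:

```shell
# Sketch only: raise the default partition count used by reduceByKey when
# no explicit partitioner is given; smaller tasks usually mean less GC.
spark-submit \
  --conf spark.default.parallelism=2000 \
  ...
```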
Re: Quick question on spark performance
Yash:
Can you share the JVM parameters you used?

How many partitions are there in your data set?

Thanks

On Fri, May 20, 2016 at 5:59 PM, Reynold Xin wrote:
> It's probably due to GC.
Re: Quick question on spark performance
It's probably due to GC.

On Fri, May 20, 2016 at 5:54 PM, Yash Sharma wrote:
> Hi All,
> I am here to get some expert advice on a use case I am working on.
Quick question on spark performance
Hi All,
I am here to get some expert advice on a use case I am working on.

Cluster & job details below:

Data - 6 TB
Cluster - EMR - 15 nodes, c3.8xlarge (shared by other MR apps)

Parameters:
--executor-memory 10G \
--executor-cores 6 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.initialExecutors=15 \

Runtime: 3 hrs

On monitoring the metrics I noticed 10G for executors is not required (since I don't have a lot of groupings).

Reducing to --executor-memory 3G, runtime reduced to 2 hrs.

Question:
Adding more nodes now has absolutely no effect on the runtime. Is there anything I can tune/change/experiment with to make the job faster?

Workload: mostly reduceBys and scans.

Would appreciate any insights and thoughts.

Best Regards
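A back-of-envelope check of the sizing above. This assumes c3.8xlarge = 32 vCPUs per node (the AWS spec, not stated in the thread) and ignores memory limits and the other MR apps sharing the cluster:

```shell
# How many 6-core executors fit on 15 nodes, counting cores only.
nodes=15
vcpus_per_node=32          # AWS c3.8xlarge spec (assumption)
executor_cores=6
executors_per_node=$(( vcpus_per_node / executor_cores ))  # 5 per node
max_executors=$(( nodes * executors_per_node ))            # 75 total
echo "by cores alone, about ${max_executors} executors fit"
```

If roughly 75 six-core executors already saturate the CPUs, that could be one reason extra nodes stop helping once the input no longer supplies more concurrent tasks, though YARN by default schedules on memory rather than vcores, which may be why 180-200 executors were reported.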