Shuffle performance tuning. How to tune Netty?

2015-11-19 Thread t3l
I am facing a very tricky issue here. I have a treeReduce task. The reduce function returns a very large object: in fact, it is a Map[Int, Array[Double]]. Each reduce task inserts values into the map and/or updates the arrays. My problem is that this Map can become very large. Currently, …
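
The preview is cut off, so the poster's actual code and tuning are not shown here. Below is a minimal sketch of the kind of treeReduce described, assuming an element-wise merge of the arrays; the merge logic, slice counts, and config values are illustrative assumptions, not the poster's setup or recommended settings. The config keys (spark.driver.maxResultSize, spark.reducer.maxSizeInFlight, spark.shuffle.io.numConnectionsPerPeer) are standard Spark properties that are commonly looked at when a reduce result gets large.

  import org.apache.spark.{SparkConf, SparkContext}

  object TreeReduceSketch {
    // Merge two partial results: sum arrays element-wise for keys present in both maps.
    def merge(a: Map[Int, Array[Double]],
              b: Map[Int, Array[Double]]): Map[Int, Array[Double]] =
      (a.keySet ++ b.keySet).map { k =>
        val v = (a.get(k), b.get(k)) match {
          case (Some(x), Some(y)) => x.zip(y).map { case (u, w) => u + w }
          case (Some(x), None)    => x
          case (None, Some(y))    => y
          case _                  => Array.empty[Double]
        }
        k -> v
      }.toMap

    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setAppName("treeReduce-sketch")
        // Knobs commonly adjusted when the reduce result gets large;
        // the values here are placeholders, not recommendations.
        .set("spark.driver.maxResultSize", "4g")
        .set("spark.reducer.maxSizeInFlight", "96m")
        .set("spark.shuffle.io.numConnectionsPerPeer", "2")
      val sc = new SparkContext(conf)

      val result = sc.parallelize(1 to 1000000, 200)
        .map(i => Map(i % 100 -> Array.fill(8)(i.toDouble)))
        .treeReduce(merge, depth = 3)

      println(s"result keys: ${result.size}")
      sc.stop()
    }
  }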

Prevent partitions from moving

2015-10-28 Thread t3l
… that guy has cores waiting for work). Am I hallucinating, or is that really what is happening? Is there any way I can prevent this from happening? Greetings, T3L
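
The preview is truncated, so the remedy discussed in the thread is not shown here. One knob that is commonly adjusted when tasks get scheduled away from the node holding their data is the locality wait; a minimal sketch, assuming the goal is to make the scheduler hold out longer for a data-local slot (the values are illustrative):

  import org.apache.spark.SparkConf

  // Make the scheduler wait longer for a data-local executor before falling
  // back to a less local one. Waiting too long can leave cores idle, which is
  // the other side of the trade-off described above.
  val conf = new SparkConf()
    .setAppName("locality-sketch")
    .set("spark.locality.wait", "30s")         // default is 3s
    .set("spark.locality.wait.node", "30s")    // optional per-level overrides
    .set("spark.locality.wait.process", "30s")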

(SOLVED) Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-22 Thread t3l
I was able to solve this by myself. What I did was change the way Spark computes the partitioning for binary files.
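
The digest does not show what the poster actually changed in Spark's partition computation. As a simpler, hedged alternative (the path and target count below are placeholders), merging the splits right after reading is often enough:

  // Read the binary files and immediately merge the excessive splits.
  // coalesce(n) with shuffle = false (the default) only combines partitions,
  // so no data is moved across the network.
  val rdd = sc.binaryFiles("hdfs:///path_to_directory")
  val compacted = rdd.coalesce(16)
  println(s"partitions after coalesce: ${compacted.partitions.length}")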

Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread t3l
I have a dataset consisting of 5 binary files (each between 500 kB and 2 MB). They are stored in HDFS on a Hadoop cluster. The datanodes of the cluster are also the workers for Spark. I open the files as an RDD using sc.binaryFiles("hdfs:///path_to_directory"). When I run the first action that …
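
For reference, a minimal sketch of the setup described, plus a way to check how many partitions Spark actually created. The path is the one quoted above; note that the minPartitions argument of sc.binaryFiles is only a lower-bound hint:

  // Open the files and inspect the resulting partition count.
  val rdd = sc.binaryFiles("hdfs:///path_to_directory")
  println(s"partitions: ${rdd.partitions.length}")

  // binaryFiles accepts a minPartitions hint, but since it is a lower bound
  // it cannot by itself shrink an excessive partition count.
  val hinted = sc.binaryFiles("hdfs:///path_to_directory", minPartitions = 16)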

Partition for each executor

2015-10-20 Thread t3l
If I have a cluster with 7 nodes, each having an equal number of cores, and I create an RDD with sc.parallelize(), it looks as if Spark always tries to distribute the partitions across the nodes. Questions: (1) Is that something I can rely on? (2) Can I rely on sc.parallelize() assigning partitions to as …
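
The second question is cut off in the preview, so the thread's answer is not reproduced here. A hedged sketch for checking empirically where the partitions from sc.parallelize() end up; none of this is from the thread, and the slice count is an assumption:

  import java.net.InetAddress

  // Record, inside each task, which host it ran on and how many elements it saw.
  val rdd = sc.parallelize(1 to 700, 7)   // e.g. one slice per node
  val placement = rdd.mapPartitionsWithIndex { (idx, it) =>
    Iterator((idx, InetAddress.getLocalHost.getHostName, it.size))
  }.collect()

  placement.foreach { case (idx, host, n) =>
    println(s"partition $idx ran on $host with $n elements")
  }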