On 09/03/2014 04:23 PM, Nicholas Chammas wrote:
> On Wed, Sep 3, 2014 at 3:24 AM, Patrick Wendell <pwend...@gmail.com> wrote:
>
>> == What default changes should I be aware of? ==
>> 1. The default value of "spark.io.compression.codec" is now "snappy"
>>    --> Old behavior can be restored by switching to "lzf"
>>
>> 2. PySpark now performs external spilling during aggregations.
>>    --> Old behavior can be restored by setting "spark.shuffle.spill" to "false".
>>
>> 3. PySpark uses a new heuristic for determining the parallelism of shuffle operations.
>>    --> Old behavior can be restored by setting "spark.default.parallelism" to the number of cores in the cluster.
>
> Will these changes be called out in the release notes or somewhere in the docs?
>
> That last one (which I believe is what we discovered as the result of SPARK-3333 <https://issues.apache.org/jira/browse/SPARK-3333>) could have a large impact on PySpark users.
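
For reference, restoring all three of the old defaults in a PySpark application would look roughly like the sketch below (assuming the standard SparkConf/SparkContext API; the core count is a hypothetical placeholder for your cluster's actual total):

    from pyspark import SparkConf, SparkContext

    # Hypothetical total number of cores in the cluster; adjust to your setup.
    TOTAL_CORES = 16

    conf = (SparkConf()
            .setAppName("restore-pre-1.1-defaults")
            # 1. switch the compression codec back to lzf
            .set("spark.io.compression.codec", "lzf")
            # 2. turn off external spilling during aggregations
            .set("spark.shuffle.spill", "false")
            # 3. pin default shuffle parallelism to the number of cores
            .set("spark.default.parallelism", str(TOTAL_CORES)))

    sc = SparkContext(conf=conf)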
Just wanted to add that this may be related to the same issue or may be something different: there is a regression when using PySpark to read data from HDFS. Its performance during map tasks has dropped to roughly half (approx. 1x -> 0.5x). I tested 1.0.2 and the performance was fine, but the 1.1 release candidate shows this slowdown. To make sure it was not caused by the new defaults, I also tested with the following set on the conf object:

    set("spark.io.compression.codec", "lzf").set("spark.shuffle.spill", "false")

Regards,
Gurvinder

>
> Nick