On 09/03/2014 04:23 PM, Nicholas Chammas wrote:
> On Wed, Sep 3, 2014 at 3:24 AM, Patrick Wendell <pwend...@gmail.com> wrote:
> 
>> == What default changes should I be aware of? ==
>> 1. The default value of "spark.io.compression.codec" is now "snappy"
>> --> Old behavior can be restored by switching to "lzf"
>>
>> 2. PySpark now performs external spilling during aggregations.
>> --> Old behavior can be restored by setting "spark.shuffle.spill" to
>> "false".
>>
>> 3. PySpark uses a new heuristic for determining the parallelism of
>> shuffle operations.
>> --> Old behavior can be restored by setting
>> "spark.default.parallelism" to the number of cores in the cluster.
>>
> 
> Will these changes be called out in the release notes or somewhere in the
> docs?
> 
> That last one (which I believe is what we discovered as the result of
> SPARK-3333 <https://issues.apache.org/jira/browse/SPARK-3333>) could have a
> large impact on PySpark users.

Just wanted to add that this might be related to the same issue, or it
might be something different: there is a regression when using PySpark to
read data from HDFS. Performance during map tasks has dropped to roughly
0.5x of what it was. I tested 1.0.2 and the performance was fine, but the
1.1 release candidate shows this slowdown. To make sure it was not caused
by the new defaults, I set the following properties on the conf object:

set("spark.io.compression.codec", "lzf").set("spark.shuffle.spill", "false")
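
For anyone who wants to reproduce this, a minimal sketch of the kind of
job I mean is below. The HDFS path, app name, and map function are
placeholders rather than the actual workload, and spark.default.parallelism
is shown only to illustrate restoring the third default Patrick listed:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("hdfs-read-regression-check")            # placeholder name
        .set("spark.io.compression.codec", "lzf")            # restore pre-1.1 codec default
        .set("spark.shuffle.spill", "false")                  # disable the new external spilling
        .set("spark.default.parallelism", "64"))              # placeholder: number of cores in the cluster
sc = SparkContext(conf=conf)

# placeholder path and map function, not the actual job
lines = sc.textFile("hdfs:///path/to/data")
total = lines.map(lambda line: len(line)).sum()
print(total)

Running the same script against 1.0.2 and the 1.1 RC with these settings
is how the comparison can be made independent of the new defaults.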

Regards,
Gurvinder
> 
> Nick
> 

