[GitHub] spark pull request: SPARK-12948. [SQL]. Consider reducing size of ...

rajeshbalamohan Sun, 24 Jan 2016 16:08:55 -0800

Github user rajeshbalamohan commented on the pull request:

    https://github.com/apache/spark/pull/10861#issuecomment-174355176
  
    **Usecase**: User tries to map the dataset which is partitioned (e.g TPC-DS 
dataset at 200 GB scale) & runs a query in spark-shell. 
    
    E.g
    ...
    val o_store_sales = 
sqlContext.read.format("orc").load("/tmp/spark_tpcds_bin_partitioned_orc_200/store_sales")
    o_store_sales.registerTempTable("o_store_sales")
    ..
    sqlContext.sql("SELECT..").show();
    ...
    
    
    When this is executed, OrcRelation creates Config objects for every 
partition (Ref: 
[OrcRelation.execute()](https://github.com/apache/spark/blob/e14817b528ccab4b4685b45a95e2325630b5fc53/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala#L295)).
 In the case of TPC-DS, it generates 1826 partitions. This info is broadcasted 
in 
[DAGScheduler#submitMissingTasks()](https://github.com/apache/spark/blob/1b2c2162af4d5d2d950af94571e69273b49bf913/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1010).
  As a part of this, the configurations created for 1826 partitions are also 
streamed through (i.e embedded in HadoopMapParitionsWithSplitRDD -->f()--> 
wrappedConf).  Each of these configuration takes around 251 KB per partition.  
Please refer to the profiler snapshot attached in the JIRA 
([mem_snap_shot](https://issues.apache.org/jira/secure/attachment/12784080/SPARK-12948.mem.prof.snapshot.png)).
 This causes quite a bit of delay in the overall job runtim
 e. 
    
    Patch reuses the already broadcastedconf from SparkContext.  fillObject() 
function is executed later for every partition, which internally sets up any 
additional config details. This drastically reduces the amount of payload that 
is broadcasted and helps in reducing the overall job runtime.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: SPARK-12948. [SQL]. Consider reducing size of ...

Reply via email to