Github user habren commented on a diff in the pull request:
https://github.com/apache/spark/pull/21868#discussion_r210890027
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -459,6 +460,29 @@ object SQLConf {
.intConf
.createWithDefault(4096)
+  val IS_PARQUET_PARTITION_ADAPTIVE_ENABLED =
+    buildConf("spark.sql.parquet.adaptiveFileSplit")
+    .doc("For columnar file format (e.g., Parquet), it's possible that only few (not all) " +
+      "columns are needed. So, it's better to make sure that the total size of the selected " +
+      "columns is about 128 MB "
+    )
+    .booleanConf
+    .createWithDefault(false)
+
+  val PARQUET_STRUCT_LENGTH = buildConf("spark.sql.parquet.struct.length")
+    .doc("Set the default size of struct column")
+    .intConf
+    .createWithDefault(StringType.defaultSize)
+
+  val PARQUET_MAP_LENGTH = buildConf("spark.sql.parquet.map.length")
--- End diff --
@HyukjinKwon @viirya Setting spark.sql.files.maxPartitionBytes explicitly
does work. For you or other advanced users, it's convenient to set a larger
maxPartitionBytes.
But for ad-hoc queries, the selected columns differ from query to query, and
it's inconvenient, or even impossible, for users to set a different
maxPartitionBytes for each query.
And for general users (non-advanced users), it's not easy to calculate a
proper value of maxPartitionBytes.
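Just to make that pain point concrete, here is a rough sketch of what per-query
tuning looks like today (the table path and column counts are made up purely
for illustration; only spark.sql.files.maxPartitionBytes is a real config):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("manual-split-tuning").getOrCreate()

// 128 MB, the default value of spark.sql.files.maxPartitionBytes.
val defaultBytes = 128L * 1024 * 1024

// Query 1 selects 2 of ~20 columns, so the user has to scale the split size
// up ~10x to keep the bytes actually read per task near 128 MB.
spark.conf.set("spark.sql.files.maxPartitionBytes", defaultBytes * 20 / 2)
spark.read.parquet("/warehouse/events").select("user_id", "ts").count()

// Query 2 selects 5 of ~20 columns, so a different value is needed, and a
// general user has no easy way to work out the right factor.
spark.conf.set("spark.sql.files.maxPartitionBytes", defaultBytes * 20 / 5)
spark.read.parquet("/warehouse/events")
  .select("user_id", "ts", "country", "device", "os")
  .count()
```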
In many big companies, there may be only one or a few teams familiar with the
internals of Spark, and they maintain the Spark cluster. The other teams are
general users of Spark who care more about their business, such as building
the data warehouse or recommendation algorithms. This feature tries to handle
the split size dynamically even when the users are not familiar with Spark.
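With the flag proposed in this diff, the intent is that such users flip one
switch and the split size follows the selected columns (a minimal sketch,
assuming the config name above, the same hypothetical table, and a running
`spark` session):

```scala
// Enable the proposed adaptive file split once per session; no per-query
// maxPartitionBytes arithmetic is needed after that.
spark.conf.set("spark.sql.parquet.adaptiveFileSplit", true)
spark.read.parquet("/warehouse/events").select("user_id", "ts").count()
```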
---