Github user habren commented on a diff in the pull request:
https://github.com/apache/spark/pull/21868#discussion_r210890027
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -459,6 +460,29 @@ object SQLConf {
.intConf
.createWithDefault(4096)
+  val IS_PARQUET_PARTITION_ADAPTIVE_ENABLED =
+    buildConf("spark.sql.parquet.adaptiveFileSplit")
+    .doc("For columnar file format (e.g., Parquet), it's possible that only few (not all) " +
+      "columns are needed. So, it's better to make sure that the total size of the selected " +
+      "columns is about 128 MB "
+    )
+    .booleanConf
+    .createWithDefault(false)
+
+  val PARQUET_STRUCT_LENGTH = buildConf("spark.sql.parquet.struct.length")
+    .doc("Set the default size of struct column")
+    .intConf
+    .createWithDefault(StringType.defaultSize)
+
+  val PARQUET_MAP_LENGTH = buildConf("spark.sql.parquet.map.length")
--- End diff --
@HyukjinKwon @viirya Setting spark.sql.files.maxPartitionBytes explicitly
does work. For you or other advanced users, it's convenient to set a larger
maxPartitionBytes.
But for ad-hoc queries, the selected columns differ from query to query, and
it's inconvenient, or even impossible, for users to set a different
maxPartitionBytes for each query.
And for general users (non-advanced users), it's not easy to calculate a
proper value of maxPartitionBytes.
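Just to make that pain point concrete, here is a rough sketch of what per-query
tuning looks like today (the table path and column counts are made up purely
for illustration; only spark.sql.files.maxPartitionBytes is a real config):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("manual-split-tuning").getOrCreate()

// 128 MB, the default value of spark.sql.files.maxPartitionBytes.
val defaultBytes = 128L * 1024 * 1024

// Query 1 selects 2 of ~20 columns, so the user has to scale the split size
// up ~10x to keep the bytes actually read per task near 128 MB.
spark.conf.set("spark.sql.files.maxPartitionBytes", defaultBytes * 20 / 2)
spark.read.parquet("/warehouse/events").select("user_id", "ts").count()

// Query 2 selects 5 of ~20 columns, so a different value is needed, and a
// general user has no easy way to work out the right factor.
spark.conf.set("spark.sql.files.maxPartitionBytes", defaultBytes * 20 / 5)
spark.read.parquet("/warehouse/events")
  .select("user_id", "ts", "country", "device", "os")
  .count()
```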
In many big companies, there may be only one or a few teams familiar with the
internals of Spark, and they maintain the Spark cluster. The other teams are
general users of Spark who care more about their business, such as building
the data warehouse or recommendation algorithms. This feature tries to handle
the split size dynamically even when the users are not familiar with Spark.
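With the flag proposed in this diff, the intent is that such users flip one
switch and the split size follows the selected columns (a minimal sketch,
assuming the config name above, the same hypothetical table, and a running
`spark` session):

```scala
// Enable the proposed adaptive file split once per session; no per-query
// maxPartitionBytes arithmetic is needed after that.
spark.conf.set("spark.sql.parquet.adaptiveFileSplit", true)
spark.read.parquet("/warehouse/events").select("user_id", "ts").count()
```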
---