Repository: spark
Updated Branches:
  refs/heads/master 6c7db7fd1 -> 4193c7623


[SPARK-24626] Add statistics prefix to parallelFileListingInStatsComputation

## What changes were proposed in this pull request?
Rename `spark.sql.parallelFileListingInStatsComputation.enabled` to `spark.sql.statistics.parallelFileListingInStatsComputation.enabled`, to be more consistent with other statistics-based configs.
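
For reference, a minimal usage sketch (not part of this patch; the local `SparkSession` and app name are illustrative) showing code that must move from the old key to the new statistics-prefixed key:

```scala
import org.apache.spark.sql.SparkSession

object RenamedConfDemo extends App {
  // Illustrative local session, only to demonstrate the renamed key.
  val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()

  // Old key (removed by this patch): spark.sql.parallelFileListingInStatsComputation.enabled
  // New key (SPARK-24626): the same flag under the statistics prefix.
  spark.conf.set("spark.sql.statistics.parallelFileListingInStatsComputation.enabled", "false")
  println(spark.conf.get("spark.sql.statistics.parallelFileListingInStatsComputation.enabled"))

  spark.stop()
}
```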

## How was this patch tested?
N/A - straightforward rename of a config option. Used `git grep` to make sure 
there are no remaining mentions of the old name.

Closes #22457 from rxin/SPARK-24626.

Authored-by: Reynold Xin <r...@databricks.com>
Signed-off-by: gatorsmile <gatorsm...@gmail.com>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4193c762
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4193c762
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4193c762

Branch: refs/heads/master
Commit: 4193c7623b92765adaee539e723328ddc9048c09
Parents: 6c7db7f
Author: Reynold Xin <r...@databricks.com>
Authored: Tue Sep 18 22:41:27 2018 -0700
Committer: gatorsmile <gatorsm...@gmail.com>
Committed: Tue Sep 18 22:41:27 2018 -0700

----------------------------------------------------------------------
 docs/sql-programming-guide.md                   |  2 +-
 .../org/apache/spark/sql/internal/SQLConf.scala | 63 ++++++++++----------
 2 files changed, 32 insertions(+), 33 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/4193c762/docs/sql-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index f25415c..2fa29a0 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1896,7 +1896,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
   - In version 2.3 and earlier, Spark converts Parquet Hive tables by default but ignores table properties like `TBLPROPERTIES (parquet.compression 'NONE')`. The same happens for ORC Hive table properties like `TBLPROPERTIES (orc.compress 'NONE')` when `spark.sql.hive.convertMetastoreOrc=true`. Since Spark 2.4, Spark respects Parquet/ORC-specific table properties while converting Parquet/ORC Hive tables. As an example, `CREATE TABLE t(id int) STORED AS PARQUET TBLPROPERTIES (parquet.compression 'NONE')` would generate Snappy Parquet files during insertion in Spark 2.3; in Spark 2.4, the result would be uncompressed Parquet files.
   - Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too, which means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3; in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.
   - In version 2.3 and earlier, CSV rows are considered malformed if at least one column value in the row is malformed. The CSV parser drops such rows in DROPMALFORMED mode or outputs an error in FAILFAST mode. Since Spark 2.4, a CSV row is considered malformed only when it contains malformed column values requested from the CSV datasource; other values can be ignored. As an example, a CSV file contains the "id,name" header and one row "1234". In Spark 2.4, selecting the id column yields a row with the single column value 1234, but in Spark 2.3 and earlier it is empty in DROPMALFORMED mode. To restore the previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to `false`.
-  - Since Spark 2.4, file listing for computing statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `false`.
+  - Since Spark 2.4, file listing for computing statistics is done in parallel by default. This can be disabled by setting `spark.sql.statistics.parallelFileListingInStatsComputation.enabled` to `false`.
   - Since Spark 2.4, metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during statistics computation.
   - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings were treated as `null` values and written with no characters in saved CSV files. For example, the row `"a", null, "", 1` was written as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to an empty (not quoted) string.
   - Since Spark 2.4, the `LOAD DATA` command supports wildcards `?` and `*`, which match any single character and zero or more characters, respectively. Example: `LOAD DATA INPATH '/tmp/folder*/'` or `LOAD DATA INPATH '/tmp/part-?'`. Special characters like `space` now also work in paths. Example: `LOAD DATA INPATH '/tmp/folder name/'`.

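As a hedged illustration of the migration-guide note above (the table name `t` is a placeholder and `spark` an existing `SparkSession`; neither is part of this commit), disabling the parallel listing before a statistics scan would look like:

```scala
// Sketch only: ANALYZE TABLE triggers the file listing that this flag governs.
spark.sql("SET spark.sql.statistics.parallelFileListingInStatsComputation.enabled=false")
spark.sql("ANALYZE TABLE t COMPUTE STATISTICS")
```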
http://git-wip-us.apache.org/repos/asf/spark/blob/4193c762/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
----------------------------------------------------------------------
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index bdc4007..4499a35 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -269,22 +269,6 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
-  val ENABLE_FALL_BACK_TO_HDFS_FOR_STATS =
-    buildConf("spark.sql.statistics.fallBackToHdfs")
-    .doc("If the table statistics are not available from table metadata enable 
fall back to hdfs." +
-      " This is useful in determining if a table is small enough to use auto 
broadcast joins.")
-    .booleanConf
-    .createWithDefault(false)
-
-  val DEFAULT_SIZE_IN_BYTES = buildConf("spark.sql.defaultSizeInBytes")
-    .internal()
-    .doc("The default table size used in query planning. By default, it is set 
to Long.MaxValue " +
-      "which is larger than `spark.sql.autoBroadcastJoinThreshold` to be more 
conservative. " +
-      "That is to say by default the optimizer will not choose to broadcast a 
table unless it " +
-      "knows for sure its size is small enough.")
-    .longConf
-    .createWithDefault(Long.MaxValue)
-
   val SHUFFLE_PARTITIONS = buildConf("spark.sql.shuffle.partitions")
     .doc("The default number of partitions to use when shuffling data for 
joins or aggregations.")
     .intConf
@@ -1110,6 +1094,30 @@ object SQLConf {
       .internal()
       .stringConf
 
+  val PARALLEL_FILE_LISTING_IN_STATS_COMPUTATION =
+    buildConf("spark.sql.statistics.parallelFileListingInStatsComputation.enabled")
+      .internal()
+      .doc("When true, SQL commands use parallel file listing, " +
+        "as opposed to single thread listing." +
+        "This usually speeds up commands that need to list many directories.")
+      .booleanConf
+      .createWithDefault(true)
+
+  val ENABLE_FALL_BACK_TO_HDFS_FOR_STATS = buildConf("spark.sql.statistics.fallBackToHdfs")
+    .doc("If the table statistics are not available from table metadata enable fall back to hdfs." +
+      " This is useful in determining if a table is small enough to use auto broadcast joins.")
+    .booleanConf
+    .createWithDefault(false)
+
+  val DEFAULT_SIZE_IN_BYTES = buildConf("spark.sql.defaultSizeInBytes")
+    .internal()
+    .doc("The default table size used in query planning. By default, it is set 
to Long.MaxValue " +
+      "which is larger than `spark.sql.autoBroadcastJoinThreshold` to be more 
conservative. " +
+      "That is to say by default the optimizer will not choose to broadcast a 
table unless it " +
+      "knows for sure its size is small enough.")
+    .longConf
+    .createWithDefault(Long.MaxValue)
+
   val NDV_MAX_ERROR =
     buildConf("spark.sql.statistics.ndv.maxError")
       .internal()
@@ -1553,15 +1561,6 @@ object SQLConf {
         "are performed before any UNION, EXCEPT and MINUS operations.")
       .booleanConf
       .createWithDefault(false)
-
-  val PARALLEL_FILE_LISTING_IN_STATS_COMPUTATION =
-    buildConf("spark.sql.parallelFileListingInStatsComputation.enabled")
-      .internal()
-      .doc("When true, SQL commands use parallel file listing, " +
-        "as opposed to single thread listing." +
-        "This usually speeds up commands that need to list many directories.")
-      .booleanConf
-      .createWithDefault(true)
 }
 
 /**
@@ -1770,14 +1769,10 @@ class SQLConf extends Serializable with Logging {
   def advancedPartitionPredicatePushdownEnabled: Boolean =
     getConf(ADVANCED_PARTITION_PREDICATE_PUSHDOWN)
 
-  def fallBackToHdfsForStatsEnabled: Boolean = getConf(ENABLE_FALL_BACK_TO_HDFS_FOR_STATS)
-
   def preferSortMergeJoin: Boolean = getConf(PREFER_SORTMERGEJOIN)
 
   def enableRadixSort: Boolean = getConf(RADIX_SORT_ENABLED)
 
-  def defaultSizeInBytes: Long = getConf(DEFAULT_SIZE_IN_BYTES)
-
   def isParquetSchemaMergingEnabled: Boolean = getConf(PARQUET_SCHEMA_MERGING_ENABLED)
 
   def isParquetSchemaRespectSummaries: Boolean = getConf(PARQUET_SCHEMA_RESPECT_SUMMARIES)
@@ -1869,6 +1864,13 @@ class SQLConf extends Serializable with Logging {
 
   def sessionLocalTimeZone: String = getConf(SQLConf.SESSION_LOCAL_TIMEZONE)
 
+  def parallelFileListingInStatsComputation: Boolean =
+    getConf(SQLConf.PARALLEL_FILE_LISTING_IN_STATS_COMPUTATION)
+
+  def fallBackToHdfsForStatsEnabled: Boolean = getConf(ENABLE_FALL_BACK_TO_HDFS_FOR_STATS)
+
+  def defaultSizeInBytes: Long = getConf(DEFAULT_SIZE_IN_BYTES)
+
   def ndvMaxError: Double = getConf(NDV_MAX_ERROR)
 
   def histogramEnabled: Boolean = getConf(HISTOGRAM_ENABLED)
@@ -1971,9 +1973,6 @@ class SQLConf extends Serializable with Logging {
 
   def setOpsPrecedenceEnforced: Boolean = getConf(SQLConf.LEGACY_SETOPS_PRECEDENCE_ENABLED)
 
-  def parallelFileListingInStatsComputation: Boolean =
-    getConf(SQLConf.PARALLEL_FILE_LISTING_IN_STATS_COMPUTATION)
-
   /** ********************** SQLConf functionality methods ************ */
 
   /** Set Spark SQL configuration properties. */

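For readers skimming the hunks above, here is a self-contained, simplified sketch of the `buildConf(...)` builder pattern that `SQLConf` uses. Names like `ConfEntry` and `TypedConfBuilder` are invented for this sketch; Spark's real builder (`org.apache.spark.internal.config.ConfigBuilder`) is richer.

```scala
// Hypothetical, simplified re-creation of the config-builder pattern used in
// SQLConf above. Not Spark's actual API; for illustration only.
final case class ConfEntry[T](key: String, default: T, doc: String, internal: Boolean)

final class TypedConfBuilder[T](key: String, doc: String, internal: Boolean) {
  // Terminal step: supply the default value and materialize the entry.
  def createWithDefault(default: T): ConfEntry[T] = ConfEntry(key, default, doc, internal)
}

final class ConfBuilder(key: String) {
  private var isInternal = false
  private var docText = ""

  def internal(): ConfBuilder = { isInternal = true; this }
  def doc(s: String): ConfBuilder = { docText = s; this }

  // Fixing the value type before the default keeps defaults type-checked.
  def booleanConf: TypedConfBuilder[Boolean] = new TypedConfBuilder(key, docText, isInternal)
  def longConf: TypedConfBuilder[Long] = new TypedConfBuilder(key, docText, isInternal)
}

object ConfDemo extends App {
  def buildConf(key: String) = new ConfBuilder(key)

  // Mirrors the entry added in this commit, under its new statistics prefix.
  val parallelListing: ConfEntry[Boolean] =
    buildConf("spark.sql.statistics.parallelFileListingInStatsComputation.enabled")
      .internal()
      .doc("When true, SQL commands use parallel file listing.")
      .booleanConf
      .createWithDefault(true)

  println(parallelListing)
}
```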
