Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8346#discussion_r37606272
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala ---
    @@ -482,11 +490,35 @@ private[sql] object ParquetRelation extends Logging {
       // internally.
       private[sql] val METASTORE_SCHEMA = "metastoreSchema"
     
    +  /**
    +   * If parquet's block size (row group size) setting is larger than the min split size,
    +   * we use parquet's block size setting as the min split size. Otherwise, we will create
    +   * tasks processing nothing (because a split does not cover the starting point of a
    +   * parquet block). See https://issues.apache.org/jira/browse/SPARK-10143 for more information.
    +   */
    +  private def overrideMinSplitSize(parquetBlockSize: Long, conf: Configuration): Unit = {
    +    val minSplitSize =
    +      math.max(
    +        conf.getLong("mapred.min.split.size", 0L),
    +        conf.getLong("mapreduce.input.fileinputformat.split.minsize", 0L))
    +    if (parquetBlockSize > minSplitSize) {
    +      val message =
    +        s"Parquet's block size (row group size) is larger than " +
    +          s"mapred.min.split.size/mapreduce.input.fileinputformat.split.minsize. Setting " +
    +          s"mapred.min.split.size and mapreduce.input.fileinputformat.split.minsize to " +
    +          s"$parquetBlockSize."
    +      logInfo(message)
    --- End diff --
    
    logDebug.
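
A minimal sketch of what the helper looks like with the suggested `logDebug` level. Since the diff is truncated after the log call, the `setLong` calls below are inferred from the doc comment ("we use parquet's block size setting as the min split size"); the `Conf` class is a plain stand-in for Hadoop's `Configuration`, and `println` stands in for Spark's `Logging` trait, both for illustration only:

```scala
object SplitSizeSketch {
  // Stand-in for Hadoop's Configuration, for illustration only.
  final class Conf(private var settings: Map[String, Long]) {
    def getLong(key: String, default: Long): Long = settings.getOrElse(key, default)
    def setLong(key: String, value: Long): Unit = settings += (key -> value)
  }

  // Mirrors the diff's logic: raise the min split size to the Parquet block
  // size when it is smaller, so no split starts past every row group boundary.
  def overrideMinSplitSize(parquetBlockSize: Long, conf: Conf): Unit = {
    val minSplitSize =
      math.max(
        conf.getLong("mapred.min.split.size", 0L),
        conf.getLong("mapreduce.input.fileinputformat.split.minsize", 0L))
    if (parquetBlockSize > minSplitSize) {
      // Per the review comment: log at debug rather than info level.
      println(s"[debug] Setting min split size to $parquetBlockSize.")
      // Inferred from the doc comment; not shown in the truncated diff.
      conf.setLong("mapred.min.split.size", parquetBlockSize)
      conf.setLong("mapreduce.input.fileinputformat.split.minsize", parquetBlockSize)
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new Conf(Map("mapred.min.split.size" -> 64L * 1024 * 1024))
    overrideMinSplitSize(128L * 1024 * 1024, conf)
    assert(conf.getLong("mapred.min.split.size", 0L) == 128L * 1024 * 1024)
    assert(conf.getLong("mapreduce.input.fileinputformat.split.minsize", 0L) == 128L * 1024 * 1024)
  }
}
```

Both Hadoop property names are overridden because either one can impose the effective minimum split size, depending on the MapReduce API in use.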

