GitHub user Parth-Brahmbhatt commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13150#discussion_r63968104
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala ---
    @@ -114,17 +115,27 @@ private[hive] case class MetastoreRelation(
           val rawDataSize = hiveQlTable.getParameters.get(StatsSetupConst.RAW_DATA_SIZE)
           // TODO: check if this estimate is valid for tables after partition pruning.
           // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be
    -      // relatively cheap if parameters for the table are populated into the metastore.  An
    -      // alternative would be going through Hadoop's FileSystem API, which can be expensive if a lot
    -      // of RPCs are involved.  Besides `totalSize`, there are also `numFiles`, `numRows`,
    -      // `rawDataSize` keys (see StatsSetupConst in Hive) that we can look at in the future.
    +      // relatively cheap if parameters for the table are populated into the metastore.
    +      // Besides `totalSize`, there are also `numFiles`, `numRows`, `rawDataSize` keys
    +      // (see StatsSetupConst in Hive) that we can look at in the future.
           BigInt(
             // When table is external, `totalSize` is always zero, which will influence join strategy
             // so when `totalSize` is zero, use `rawDataSize` instead
    -        // if the size is still less than zero, we use default size
    +        // if the size is still less than zero, we try to get the file size from HDFS.
    +        // given this is only needed for optimization, if the HDFS call fails we return the default.
             Option(totalSize).map(_.toLong).filter(_ > 0)
               .getOrElse(Option(rawDataSize).map(_.toLong).filter(_ > 0)
    -            .getOrElse(sparkSession.sessionState.conf.defaultSizeInBytes)))
    +              .getOrElse({
    +                val hadoopConf = sparkSession.sessionState.newHadoopConf()
    +                val fs: FileSystem = hiveQlTable.getPath.getFileSystem(hadoopConf)
    +                try {
    +                  fs.getContentSummary(hiveQlTable.getPath).getLength
    +                } catch {
    +                  case e: Exception =>
    +                    log.warn("Failed to get table size from hdfs.", e)
    --- End diff --
    
    updated.
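
For anyone following along, here is a minimal, self-contained sketch of the fallback chain the diff above implements: prefer the metastore's `totalSize`, then `rawDataSize`, then an HDFS content summary, and finally the configured default. The `TableSizeEstimator` object and its parameter names are hypothetical illustrations, not code from this PR:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import scala.util.control.NonFatal

    object TableSizeEstimator {
      // Prefer metastore stats; fall back to an HDFS content summary; if that
      // fails, return the configured default, since the estimate only feeds
      // the optimizer's join-strategy decision and need not be exact.
      def estimateSizeInBytes(
          totalSize: Option[Long],    // metastore `totalSize` parameter
          rawDataSize: Option[Long],  // metastore `rawDataSize` parameter
          tablePath: Path,
          hadoopConf: Configuration,
          defaultSizeInBytes: Long): BigInt = {
        BigInt(
          totalSize.filter(_ > 0)
            .orElse(rawDataSize.filter(_ > 0))
            .getOrElse {
              try {
                // getContentSummary walks the directory tree, which can be
                // expensive when many RPCs are involved.
                tablePath.getFileSystem(hadoopConf)
                  .getContentSummary(tablePath).getLength
              } catch {
                case NonFatal(_) => defaultSizeInBytes
              }
            })
      }
    }

Since `orElse` and `getOrElse` take by-name arguments, the file system is only touched when both metastore statistics are missing or non-positive.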

