[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....

rdblue Wed, 25 Apr 2018 11:08:48 -0700

GitHub user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21070#discussion_r184156731
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala ---
    @@ -503,7 +503,7 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext {
                 case plan: InMemoryRelation => plan
               }.head
               // InMemoryRelation's stats is file size before the underlying RDD is materialized
    -          assert(inMemoryRelation.computeStats().sizeInBytes === 740)
    +          assert(inMemoryRelation.computeStats().sizeInBytes === 800)
    --- End diff --
    
    The size increase is data-dependent, so it is hard to estimate in general.
    We write the stats for older readers only when the type uses a signed sort
    order, so this is limited to mostly primitive types; the stats won't be
    written for byte arrays or UTF-8 data. That limits the size to 16 bytes plus
    Thrift overhead per page, and you might have about 100 pages per row group.
    So that is roughly 1.5k per 128MB row group, which is about 0.001%.
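
The arithmetic behind that estimate can be sketched as follows. This is only a back-of-the-envelope calculation using the rough figures from the comment; the object and member names are hypothetical, the reading of "16 bytes" as an 8-byte min plus an 8-byte max is an assumption, and Thrift overhead is left out, so the result is a lower bound rather than a measurement.

```scala
// Rough estimate of the extra statistics bytes per row group.
// All constants are the approximate figures quoted in the comment, not measured values.
object StatsOverheadEstimate {
  // Assumed: min + max of an 8-byte primitive, i.e. at most 16 bytes per page.
  val statsBytesPerPage: Long = 16L
  // "about 100 pages per row group"
  val pagesPerRowGroup: Long = 100L
  // 128MB row group
  val rowGroupBytes: Long = 128L * 1024 * 1024

  // Lower bound on the added bytes; Thrift encoding overhead is ignored here.
  val statsBytesPerRowGroup: Long = statsBytesPerPage * pagesPerRowGroup

  // Extra bytes expressed as a percentage of the row group size.
  def overheadPercent(extraBytes: Long): Double =
    100.0 * extraBytes / rowGroupBytes

  def main(args: Array[String]): Unit = {
    val pct = overheadPercent(statsBytesPerRowGroup)
    println(f"extra bytes per row group: $statsBytesPerRowGroup, overhead: $pct%.4f%%")
  }
}
```

With these inputs the added statistics come to about 1.6k per row group, which is on the order of a thousandth of a percent of 128MB and consistent with the figure quoted above.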


