Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21070#discussion_r184139736
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala ---
@@ -503,7 +503,7 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext {
           case plan: InMemoryRelation => plan
         }.head
         // InMemoryRelation's stats is file size before the underlying RDD is materialized
-        assert(inMemoryRelation.computeStats().sizeInBytes === 740)
+        assert(inMemoryRelation.computeStats().sizeInBytes === 800)
--- End diff --
Parquet fixed a problem with the value ordering used for statistics, which required
adding new min and max fields to the metadata. For older readers, Parquet also writes
the old values when it makes sense to. That adds a small amount of overhead, which is
more noticeable when a file contains just a few records.
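If you want to see what statistics a file actually carries, something like the sketch below works in a spark-shell session with parquet-mr on the classpath. The file path is a placeholder, and `readFooter` is deprecated in recent parquet-mr releases but still available; this is just an illustration, not part of the test.

```scala
import scala.collection.JavaConverters._

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

// Hypothetical path to one of the Parquet files written by the test.
val file = new Path("/tmp/parquet-test/part-00000.parquet")

// Read the footer; readFooter(Configuration, Path) is deprecated but works.
val footer = ParquetFileReader.readFooter(new Configuration(), file)

// Print per-column min/max for each row group so the written statistics
// (and their contribution to the footer size) can be eyeballed.
for (block <- footer.getBlocks.asScala; column <- block.getColumns.asScala) {
  val stats = column.getStatistics
  println(s"${column.getPath}: min=${stats.genericGetMin}, max=${stats.genericGetMax}")
}
```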
Don't be alarmed by the percentage difference here; this is just a small file.
Parquet isn't increasing file sizes by 8% in general; that would be silly.
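For reference, a rough spark-shell sketch of what the test does, so the asserted number is easy to reproduce locally (the path and data below are placeholders, not the test's actual values):

```scala
// Cache a Parquet-backed DataFrame and look at the size reported before
// anything is materialized, mirroring the assertion in this test.
import org.apache.spark.sql.execution.columnar.InMemoryRelation

val path = "/tmp/parquet-test"  // placeholder path
spark.range(0, 4).toDF("count").write.mode("overwrite").parquet(path)

val cached = spark.read.parquet(path).cache()
val inMemoryRelation = cached.queryExecution.optimizedPlan.collect {
  case plan: InMemoryRelation => plan
}.head

// Before the underlying RDD is materialized this is the on-disk file size,
// which is why the expected value moves when Parquet writes more footer metadata.
println(inMemoryRelation.computeStats().sizeInBytes)
```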
---