Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21070#discussion_r184139736
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala ---
@@ -503,7 +503,7 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext {
           case plan: InMemoryRelation => plan
         }.head
         // InMemoryRelation's stats is file size before the underlying RDD is materialized
-        assert(inMemoryRelation.computeStats().sizeInBytes === 740)
+        assert(inMemoryRelation.computeStats().sizeInBytes === 800)
--- End diff --
Parquet fixed a problem with the value ordering used for statistics, which required
adding new min and max fields to the metadata. For older readers, Parquet also writes
the old values when it makes sense to. That adds a small amount of overhead, which is
more noticeable when a file contains just a few records.
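If you want to see what statistics a file actually carries, something like the sketch below works in a spark-shell session with parquet-mr on the classpath. The file path is a placeholder, and `readFooter` is deprecated in recent parquet-mr releases but still available; this is just an illustration, not part of the test.

```scala
import scala.collection.JavaConverters._

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

// Hypothetical path to one of the Parquet files written by the test.
val file = new Path("/tmp/parquet-test/part-00000.parquet")

// Read the footer; readFooter(Configuration, Path) is deprecated but works.
val footer = ParquetFileReader.readFooter(new Configuration(), file)

// Print per-column min/max for each row group so the written statistics
// (and their contribution to the footer size) can be eyeballed.
for (block <- footer.getBlocks.asScala; column <- block.getColumns.asScala) {
  val stats = column.getStatistics
  println(s"${column.getPath}: min=${stats.genericGetMin}, max=${stats.genericGetMax}")
}
```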
Don't be alarmed by the percentage difference here; this is just a small file.
Parquet isn't increasing file sizes by 8% in general; that would be silly.
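For reference, a rough spark-shell sketch of what the test does, so the asserted number is easy to reproduce locally (the path and data below are placeholders, not the test's actual values):

```scala
// Cache a Parquet-backed DataFrame and look at the size reported before
// anything is materialized, mirroring the assertion in this test.
import org.apache.spark.sql.execution.columnar.InMemoryRelation

val path = "/tmp/parquet-test"  // placeholder path
spark.range(0, 4).toDF("count").write.mode("overwrite").parquet(path)

val cached = spark.read.parquet(path).cache()
val inMemoryRelation = cached.queryExecution.optimizedPlan.collect {
  case plan: InMemoryRelation => plan
}.head

// Before the underlying RDD is materialized this is the on-disk file size,
// which is why the expected value moves when Parquet writes more footer metadata.
println(inMemoryRelation.computeStats().sizeInBytes)
```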
---