GitHub user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21070#discussion_r184156731
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala
---
@@ -503,7 +503,7 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext {
case plan: InMemoryRelation => plan
}.head
// InMemoryRelation's stats is file size before the underlying RDD is materialized
- assert(inMemoryRelation.computeStats().sizeInBytes === 740)
+ assert(inMemoryRelation.computeStats().sizeInBytes === 800)
--- End diff --
This is data-dependent, so it is hard to estimate. We write the stats for
older readers when the type uses a signed sort order, so they are limited to
mostly primitive types and won't be written for byte arrays or UTF-8 data.
That limits the size to 16 bytes plus Thrift overhead per page, and you might
have about 100 pages per row group. That works out to roughly 1.5 KB per
128 MB, which is about 0.001%.
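
As a rough check of that arithmetic, here is a minimal sketch. The object and value names are made up for illustration, Thrift overhead is left out because the comment gives no figure for it, and 128 MB is taken as 128 * 1024 * 1024 bytes. The inputs are only the approximate numbers quoted above, so the result is an order-of-magnitude estimate, not a measurement.

```scala
// Back-of-the-envelope check of the estimate above.
// All names here are illustrative; none come from Spark or Parquet.
object StatsOverheadEstimate {
  // Assumed inputs, taken from the approximate figures in the comment.
  val statsBytesPerPage: Long = 16L             // excludes Thrift overhead
  val pagesPerRowGroup: Long = 100L             // "about 100 pages per row group"
  val rowGroupBytes: Long = 128L * 1024 * 1024  // 128 MB

  // Extra stats bytes written per row group.
  val statsBytesPerRowGroup: Long = statsBytesPerPage * pagesPerRowGroup

  // The same quantity as a percentage of the row group size.
  val overheadPercent: Double = 100.0 * statsBytesPerRowGroup / rowGroupBytes

  def main(args: Array[String]): Unit = {
    println(s"stats bytes per row group: $statsBytesPerRowGroup")
    println(f"overhead: $overheadPercent%.4f%%")
  }
}
```

With these inputs the stats come to 1600 bytes per row group, a little over 0.001% of 128 MB, which is in line with the figure quoted above.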