Github user rtreffer commented on the pull request:
https://github.com/apache/spark/pull/6796#issuecomment-114835713
@liancheng I'll rebase on your branch. I really like the way you cleaned up
toPrimitiveDataType by using a fluent Types interface. This will make this
patch way easier.
Talking about testing/compatibility/interoperability, I have added a
Hive-generated Parquet file that I'd like to turn into a test case:
https://github.com/rtreffer/spark/tree/spark-4176-store-large-decimal-in-parquet/sql/core/src/test/resources/hive-decimal-parquet
There are some Parquet files attached to tickets in JIRA, too.
Do you plan to convert those into tests?
Regarding FIXED_LEN_BYTE_ARRAY: the relative overhead of BINARY decreases as
the value size grows. BINARY overhead would be <10% from ~DECIMAL(100) and <25%
from ~DECIMAL(40) (pre-compression). I'd expect DECIMAL(40) to use the full
precision only from time to time. But yeah, I overlooked the 4-byte length
prefix described at
https://github.com/Parquet/parquet-format/blob/master/Encodings.md and assumed
it would be less. FIXED_LEN_BYTE_ARRAY should be good for now (until someone
complains).
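
To make the percentages above easier to check, here is a rough sketch of the arithmetic. It assumes the unscaled decimal value is stored as a two's-complement integer, that every value uses the full width for its precision (the worst case for the fixed-length layout), and that plain-encoded BINARY adds a 4-byte length prefix per value. The function names are made up for illustration and are not part of Spark or Parquet.

```python
LENGTH_PREFIX_BYTES = 4  # assumed per-value length prefix for plain-encoded BINARY


def min_signed_bytes(precision: int) -> int:
    """Smallest byte count that holds any signed decimal of the given
    precision as a two's-complement unscaled integer."""
    max_unscaled = 10 ** precision - 1
    bits = max_unscaled.bit_length() + 1  # one extra bit for the sign
    return (bits + 7) // 8  # round up to whole bytes


def binary_overhead(precision: int) -> float:
    """Length-prefix overhead relative to the full-width payload."""
    return LENGTH_PREFIX_BYTES / min_signed_bytes(precision)


if __name__ == "__main__":
    for p in (18, 40, 100):
        print(p, min_signed_bytes(p), f"{binary_overhead(p):.1%}")
```

Under these assumptions DECIMAL(40) needs 17 bytes and DECIMAL(100) needs 42 bytes, so the 4-byte prefix comes to a bit under 25% and a bit under 10% respectively. If most values are much smaller than the maximum, a variable-length layout could still come out ahead despite the prefix.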