Github user joesu commented on the pull request:
https://github.com/apache/spark/pull/1737#issuecomment-53976708
It's not that straightforward to reuse BinaryType for handling Parquet's
binary and fixed_len_byte_array types, because the two types are
incompatible in the Parquet library and we have to specify the data type when
reading data from Parquet files through the library. The Parquet library
refuses to read data if you ask it to read binary-typed data from a
fixed_len_byte_array-typed field.
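To make the incompatibility concrete, here is a minimal sketch (assuming the pre-rename `parquet.schema` package that Spark currently depends on): the two schemas below could describe the same bytes, but the reader's schema check rejects one when you hand it the other.

```scala
import parquet.schema.MessageTypeParser

// Two schemas for the same field name; the Parquet reader treats them as
// incompatible, so a reader expecting `binary` cannot read the second one.
val binarySchema = MessageTypeParser.parseMessageType(
  "message example { required binary hash; }")
val fixedSchema = MessageTypeParser.parseMessageType(
  "message example { required fixed_len_byte_array(16) hash; }")
```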
If we really want to reuse BinaryType, we have to change all the
Catalyst-to-Parquet type conversion functions (e.g. the convertFromAttributes()
function in ParquetTypes.scala) to consider the underlying file schema when
mapping BinaryType to the corresponding Parquet type. Do you have a suggested
way to do this?
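To illustrate what I mean, here is a hypothetical sketch, not the actual convertFromAttributes() signature; the helper name and the Option[MessageType] parameter are made up, just to show the file schema being threaded into the mapping:

```scala
import org.apache.spark.sql.catalyst.types.{BinaryType, DataType}
import parquet.schema.{MessageType, PrimitiveType, Type}
import parquet.schema.PrimitiveType.PrimitiveTypeName

// Hypothetical helper: when the Catalyst type is BinaryType, reuse the
// field's type from the existing file schema (which may be a
// fixed_len_byte_array of some width) instead of always emitting `binary`.
def toParquetType(name: String, dataType: DataType,
    fileSchema: Option[MessageType]): Type = dataType match {
  case BinaryType =>
    fileSchema
      .filter(_.containsField(name))
      .map(_.getType(name))
      .getOrElse(new PrimitiveType(
        Type.Repetition.REQUIRED, PrimitiveTypeName.BINARY, name))
  case _ => ??? // conversion of the other Catalyst types elided
}
```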
In the long run we might want to optimize storage for common fixed-length
values like UUIDs, IPv6 addresses, MD5 hashes, etc. Parquet prepends the data
length to every value in a regular binary-typed field, but it stores the
length only once, in the metadata, for a fixed_len_byte_array-typed field.
That makes fixed-length byte arrays a good fit for storing fixed-length data,
as in the sketch below.
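For example, a schema along these lines (illustrative only) records each field's 16-byte width once in the metadata, so no per-value length prefix is written:

```scala
import parquet.schema.MessageTypeParser

// Illustrative schema: 16 bytes covers a UUID, an IPv6 address, or an
// MD5 hash, and the width lives in the metadata rather than in each value.
val schema = MessageTypeParser.parseMessageType(
  """message fixed_example {
    |  required fixed_len_byte_array(16) uuid;
    |  required fixed_len_byte_array(16) ipv6;
    |  required fixed_len_byte_array(16) md5;
    |}""".stripMargin)
```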