Github user joesu commented on the pull request:

    https://github.com/apache/spark/pull/1737#issuecomment-53976708
  
    It's not that straightforward to reuse BinaryType for handling Parquet's 
binary and fixed_len_byte_array types, because the two types are incompatible 
in the parquet library and we have to specify the data type when reading data 
from Parquet files through the library. The library refuses to read data if 
you ask it to read binary-typed data from a fixed_len_byte_array-typed field.
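
    For example, the two appear as distinct primitive types in a Parquet 
message schema (a minimal sketch, assuming the pre-Apache `parquet.schema` 
package names Spark depends on; the column names are illustrative):

    ```scala
    import scala.collection.JavaConverters._
    import parquet.schema.MessageTypeParser

    // BINARY and FIXED_LEN_BYTE_ARRAY are distinct primitive types in the
    // parquet library; a reader built for one refuses data of the other.
    val schema = MessageTypeParser.parseMessageType(
      """message example {
        |  required binary var_bytes;
        |  required fixed_len_byte_array(16) fixed_bytes;
        |}""".stripMargin)

    schema.getColumns.asScala.foreach { c =>
      // prints: var_bytes -> BINARY, fixed_bytes -> FIXED_LEN_BYTE_ARRAY
      println(c.getPath.mkString(".") + " -> " + c.getType)
    }
    ```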
    
    If we really want to reuse BinaryType, we have to change all the 
Catalyst-to-Parquet type conversion functions (e.g. the convertFromAttributes() 
function in ParquetTypes.scala) to consider the underlying file schema when 
mapping BinaryType to the corresponding Parquet type. Do you have a suggested 
way to do this?
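
    Concretely, something along these lines might work (a hypothetical sketch 
only, not the actual Spark code: `toParquetType` and `fileSchema` are made-up 
names, the repetition is simplified to OPTIONAL, and non-binary Catalyst types 
are elided):

    ```scala
    import org.apache.spark.sql.catalyst.expressions.Attribute
    import org.apache.spark.sql.catalyst.types.BinaryType
    import parquet.schema.{MessageType, PrimitiveType, Type}
    import parquet.schema.PrimitiveType.PrimitiveTypeName

    // Hypothetical schema-aware conversion: `fileSchema` would be the
    // MessageType read from the footer of the existing Parquet file.
    def toParquetType(attr: Attribute, fileSchema: Option[MessageType]): PrimitiveType =
      attr.dataType match {
        case BinaryType =>
          // If the underlying file already stores this field as
          // FIXED_LEN_BYTE_ARRAY, preserve that instead of defaulting to BINARY.
          fileSchema.collect {
            case s if s.containsField(attr.name) => s.getType(attr.name)
          } match {
            case Some(t: PrimitiveType)
                if t.getPrimitiveTypeName == PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY =>
              new PrimitiveType(Type.Repetition.OPTIONAL,
                PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY, t.getTypeLength, attr.name)
            case _ =>
              new PrimitiveType(Type.Repetition.OPTIONAL,
                PrimitiveTypeName.BINARY, attr.name)
          }
        case other =>
          sys.error("sketch only covers BinaryType, got " + other)
      }
    ```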
    
    In the long run we might want to optimize storage for common fixed-length 
values such as UUIDs, IPv6 addresses, and MD5 hashes. Parquet prepends the 
length to every value of a regular binary-typed field, but stores the length 
only once, in the metadata, for a fixed_len_byte_array-typed field. That makes 
fixed_len_byte_array a natural fit for storing fixed-length data.
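
    Back of the envelope (assuming Parquet's PLAIN encoding, where each 
BYTE_ARRAY value carries a 4-byte length prefix; compression and dictionary 
encoding ignored, and the numbers are illustrative):

    ```scala
    // e.g. 100M MD5 hashes, 16 bytes each
    val numValues = 100000000L
    val valueLen  = 16L

    val asBinary = numValues * (4 + valueLen) // per-value length prefix: 2.0 GB
    val asFixed  = numValues * valueLen       // length stored once:      1.6 GB
    println(s"binary: $asBinary B, fixed_len_byte_array: $asFixed B")
    ```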
    