[ https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731621#comment-17731621 ]
ASF GitHub Bot commented on PARQUET-758:
----------------------------------------

JFinis commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1587367949

> > It isn't clear to me if this should be a logical type or a physical type. We would need to understand if there is different handling for forward compatibility purposes (what do we want the desired behavior to be). I think C++ might be lenient here, but don't know about parquet-mr @gszadovszky thoughts?
>
> I think the basic idea behind having physical and logical types is to support forward compatibility since we can always represent (somehow) a long-existing physical type while logical types are getting extended. Parquet-mr should work fine with "unknown" logical types by reading it back as an un-annotated physical value (a `Binary` with two bytes in this case). So, if the community supports having a half-precision floating point type I would vote on specifying it as a logical type.
>
> The tricky thing will be the implementations. Even though parquet-mr does not really care about converting the values according to their logical types we still need to care about the logical types at the ordering (min/max values in the statistics). It would not be too easy to implement the half-precision floating point comparison logic since Java does not have such a primitive type. (BTW the sorting order of floating point numbers is still an open issue: [PARQUET-1222](https://issues.apache.org/jira/browse/PARQUET-1222))

FWIW, I rather think it should be a physical type for the following reasons:

* Encodings are currently only defined on the physical type, not the logical one. So allowing BYTE_STREAM_SPLIT for this type would actually break this if it is a logical type.
* Having this be a logical type while float and double are physical types seems inconsistent.
* There might eventually be hardware support or native language support for this type. In this case, having it as a physical type would make it easier to leverage this hardware / language support, as most libraries instantiate encoders/decoders based on the physical type. Again, having one exception where you would need a decoder based on a *logical* type would break this pattern and require additional effort. If Java and C++ had a float16 type, I guess more people would agree that it should be a physical type. So is the intuition of this being a logical type just based on the still-missing language support?
* IMHO, the basic idea behind physical and logical types is not to support forward compatibility; that is just a byproduct. Otherwise, there would need to be only one or two physical types in the first place (FIXED_LEN_BYTE_ARRAY and BYTE_ARRAY). The basic idea is rather to make a distinction between the physical representation and what the values logically mean. In my mental model it is a layered approach: there are layers that only care about the physical types (e.g., the encoders/decoders) and then further layers that also care about the logical type (e.g., the statistics maintenance code). And here again, making this a logical type would break this layering.

> [Format] HALF precision FLOAT Logical type
> ------------------------------------------
>
>                 Key: PARQUET-758
>                 URL: https://issues.apache.org/jira/browse/PARQUET-758
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Julien Le Dem
>            Priority: Minor
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)