Hi Antoine,

One quick note about this. Parquet min/max statistics need a total ordering for each logical type. Without one, we either fall back to a default based on the primitive type (which might not be suitable for the related extension type) or we don't store min/max statistics for the related values at all. The latter means no min/max stats for the row group or for the page indices. So, I guess, we would need a way to define a total ordering for an extension type. That does not sound like an easy topic.
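To make the ordering concern concrete, here is a minimal Python sketch, not a description of any Parquet implementation. It assumes a hypothetical extension type that stores signed 32-bit integers as 4 big-endian bytes, and it assumes the fallback ordering for the primitive type is an unsigned byte-wise comparison (which is what Python's built-in ordering of bytes objects does). The names encode and decode are made up for the example.

```python
import struct


def encode(value: int) -> bytes:
    # Hypothetical extension-type storage: signed 32-bit integer, big-endian.
    return struct.pack(">i", value)


def decode(raw: bytes) -> int:
    # Inverse of encode(): recover the integer the extension type represents.
    return struct.unpack(">i", raw)[0]


values = [-2, 5, 3]
stored = [encode(v) for v in values]

# Fallback ordering based on the primitive type (assumed here to be
# unsigned byte-wise comparison of the stored bytes).
physical_min = min(stored)
physical_max = max(stored)

# Ordering the extension type actually intends (numeric order of the
# decoded values).
logical_min = min(stored, key=decode)
logical_max = max(stored, key=decode)

print("byte-wise stats:", decode(physical_min), decode(physical_max))
print("intended stats: ", decode(logical_min), decode(logical_max))

# A reader that trusts the byte-wise stats but compares numerically would
# wrongly conclude that 5 cannot be in this row group.
target = 5
wrongly_skipped = not (decode(physical_min) <= target <= decode(physical_max))
print("row group wrongly skipped:", wrongly_skipped)
```

Under these assumptions the byte-wise statistics decode to a "range" of 3 to -2, because the negative value sorts last as unsigned bytes, while the intended range is -2 to 5. Statistics are only safe for pruning when writer and reader agree on one total ordering.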
Cheers,
Gabor

Antoine Pitrou <anto...@python.org> wrote (on Tue, May 28, 2024, 16:45):

> Hello,
>
> (NOTE: this comes in the context of
> https://github.com/apache/parquet-format/pull/240 --
> "PARQUET-2471: Add geometry logical type")
>
> I'd like to launch a discussion about the possible addition of
> extension types in Parquet.
>
> Extension types are a concept borrowed from the Arrow type system [1].
> They provide a standard way of conveying more precise information about
> the intended type and usage of a given column, without requiring the
> metadata format to have a dedicated serialization for that type.
>
> In Arrow, extension types are typically conveyed through two
> string/binary parameters: 1) the extension type name; 2) the
> type-specific serialization. The extension type name unambiguously
> designates the abstract extension type (such as "Tensor"); the
> serialization optionally encodes the extension type's parameters, if
> it has any (such as the dimensionality for a "Tensor" type).
>
> Initially, Arrow extension types tended to be ad hoc and
> application-specific, but there is a growing trend to standardize
> "canonical extension types" to allow for better data interoperability
> across widely-used data types [2].
>
> From my experience as an Arrow PMC member, if Arrow didn't have
> extension types, the barrier to propose and standardize new data types
> would be much higher, especially for complex proposals such as the
> fixed-shape and variable-shape tensor types.
>
> For Parquet, extension types would be an alternative to enshrining
> additional logical types in the Thrift specification. I can see several
> advantages to extension types over additional logical types:
>
> 1) extension types would make it easier to experiment in dedicated
> communities, trying to find out the best possible representation for
> some kinds of data (example: the Geoparquet work)
>
> 2) extension types would allow "soft standardization": an extension type
> could first be formally defined by a dedicated community, then
> optionally find an official place under the Parquet project.
>
> 3) extension types would allow defining complex data representations
> and semantics without imposing a large burden on the developers of
> Parquet implementations, who may not be competent in the target domain.
> This includes non-trivial statistics such as bounding boxes for
> geospatial data.
>
> Technically, I can imagine two possible ways of adding extension types
> to the Parquet format:
>
> 1) as an additional logical type;
> 2) as a separate type determination, in addition to the logical type.
>
> We should also ensure it is possible to express extension-specific
> statistics (such as bounding boxes for geospatial data).
>
> What do you think?
>
> Regards
>
> Antoine.
>
> [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
> [2] https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html
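
For readers unfamiliar with the "name plus serialization" pair described in the quoted message, here is a minimal Python sketch of the idea. It is an illustration only and does not use the Arrow or Parquet APIs: the type name "example.tensor", the JSON payload layout, the REGISTRY dictionary, and the resolve() helper are all hypothetical. It also assumes that a reader which does not recognize a name simply falls back to the plain storage type.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtensionInfo:
    # The two parameters carried in metadata: a name and an opaque payload.
    name: str
    serialized: bytes


@dataclass(frozen=True)
class TensorType:
    # Hypothetical parameterized extension type: a tensor with a fixed shape.
    shape: tuple

    def serialize(self) -> ExtensionInfo:
        # The payload layout is decided by the extension type itself.
        payload = json.dumps({"shape": list(self.shape)}).encode("utf-8")
        return ExtensionInfo("example.tensor", payload)

    @staticmethod
    def deserialize(serialized: bytes) -> "TensorType":
        params = json.loads(serialized.decode("utf-8"))
        return TensorType(tuple(params["shape"]))


# Readers register the extension types they understand, keyed by name.
REGISTRY = {"example.tensor": TensorType.deserialize}


def resolve(info: ExtensionInfo):
    # Return the reconstructed extension type, or None when the name is
    # unknown (assumption: the reader then uses the plain storage type).
    factory = REGISTRY.get(info.name)
    if factory is None:
        return None
    return factory(info.serialized)


info = TensorType((2, 3)).serialize()
restored = resolve(info)
unknown = resolve(ExtensionInfo("example.unknown", b""))
print(info.name, restored, unknown)
```

The point of the design is that the metadata format only needs to carry two generic fields, while the meaning of the payload stays with the community that defines the extension type.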