I think adding extension type support will make it easier to add
tensor or vector types, which is what [1] is targeting.

However, the geometry type does not seem to fit the extension type model
easily. It would be better to define geospatial statistics explicitly in
the spec; otherwise we have to encode them like plain-encoded min/max
values, or even use thrift/protobuf to serialize them as binary data.
That does not make life easier for downstream projects, given that
statistics are the most important and expected feature. That said, I'm
open to any idea if extension statistics can be well supported. Let me
think about it.
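
To make the concern concrete, here is a minimal sketch (Python, standard
library only) of what bounding-box statistics would look like if they had
to travel as an opaque binary blob. The four-double layout and every name
below are assumptions made for illustration; nothing like this is defined
in the Parquet spec or in either PR.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Hypothetical geospatial statistic: an axis-aligned 2D bounding box."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


# Assumed layout: four little-endian doubles (xmin, ymin, xmax, ymax).
# A reader that does not know this convention only sees 32 opaque bytes.
_BBOX_LAYOUT = "<4d"


def encode_bbox_opaque(bbox: BoundingBox) -> bytes:
    """Serialize the box as a binary blob, as an extension statistic might."""
    return struct.pack(_BBOX_LAYOUT, bbox.xmin, bbox.ymin, bbox.xmax, bbox.ymax)


def decode_bbox_opaque(blob: bytes) -> BoundingBox:
    """Deserialize the blob; only possible when the layout is known."""
    return BoundingBox(*struct.unpack(_BBOX_LAYOUT, blob))


def may_intersect(stats: BoundingBox, query: BoundingBox) -> bool:
    """Return False only when the two boxes cannot overlap (safe to skip)."""
    return not (
        stats.xmax < query.xmin
        or stats.xmin > query.xmax
        or stats.ymax < query.ymin
        or stats.ymin > query.ymax
    )


blob = encode_bbox_opaque(BoundingBox(0.0, 0.0, 10.0, 5.0))
stats = decode_bbox_opaque(blob)
```

Every downstream reader would have to agree on the layout out of band
before it could use `may_intersect` for row group or page skipping, which
is the burden that explicit spec-defined fields would avoid.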

> Once the type graduates to full logical type status, the ColumnOrdering
> could be updated if necessary for the new type.

I would expect that an extension type might never have a graduation day.
Otherwise, readers would have to be aware of both the graduated formal
logical type and the legacy extension type at the same time. IMHO, if a
data type is important enough, we'd better consider promoting it formally
from day 1.
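
As a rough sketch of that dual-awareness problem (Python; the extension
name "example.geometry" and the function are hypothetical, and the
"EXTENSION" / "GEOMETRY" strings merely stand in for whatever the spec
would define):

```python
from typing import Optional

# Hypothetical name under which geometry would have lived as an extension type.
LEGACY_GEOMETRY_EXTENSION = "example.geometry"


def resolve_column_type(logical_type: Optional[str],
                        extension_name: Optional[str] = None) -> str:
    """Map column metadata to the type a reader should treat the column as."""
    # Files written after graduation carry the formal logical type.
    if logical_type == "GEOMETRY":
        return "geometry"
    # Files written before graduation still carry the extension type,
    # so the reader has to keep this branch around indefinitely.
    if logical_type == "EXTENSION" and extension_name == LEGACY_GEOMETRY_EXTENSION:
        return "geometry"
    return "unknown"
```

Both branches have to stay in every reader for as long as files written
before the graduation are still in circulation.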

[1] https://github.com/apache/parquet-format/pull/241

Best,
Gang


On Wed, May 29, 2024 at 4:32 AM Ed Seidl <etse...@live.com> wrote:

> I like the idea of an EXTENSION logical type (Antoine's option 1).
> Perhaps the stats ordering could be left as an implementation
> detail...those implementations that understand the new type will
> implicitly know the proper ordering. Once the type graduates to full
> logical type status, the ColumnOrdering could be updated if necessary
> for the new type. Implementations that don't know the type will ignore
> the statistics.
>
> Ed
>
> On 5/28/24 7:58 AM, Antoine Pitrou wrote:
> > Hi Gabor,
> >
> > Perhaps we can eschew this problem by having a separate "extension
> > statistics" field that does not mandate total ordering?
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Tue, 28 May 2024 16:54:49 +0200
> > Gábor Szádovszky <ga...@apache.org> wrote:
> >> Hi Antoine,
> >>
> >> One quick note about this. Parquet min/max statistics need a total
> >> ordering for each logical type. Without that we either use some default
> >> based on the primitive type (that might not be suitable for the related
> >> extension type) or we won't store min/max statistics for the related
> >> values. It means no min/max stats for the row group nor page indices.
> >> So, I guess, we would need a way to define total ordering for an
> >> extension type. Does not sound like an easy topic.
> >>
> >> Cheers,
> >> Gabor
> >>
> >> On Tue, May 28, 2024 at 16:45, Antoine Pitrou <anto...@python.org> wrote:
> >>
> >>> Hello,
> >>>
> >>> (NOTE: this comes in the context of
> >>> https://github.com/apache/parquet-format/pull/240 --
> >>> "PARQUET-2471: Add geometry logical type")
> >>>
> >>> I'd like to launch a discussion about the possible addition of
> >>> extension types in Parquet.
> >>>
> >>> Extension types are a concept borrowed from the Arrow type system [1].
> >>> They provide a standard way of conveying more precise information about
> >>> the intended type and usage of a given column, without requiring the
> >>> metadata format to have a dedicated serialization for that type.
> >>>
> >>> In Arrow, extension types are typically conveyed through two
> >>> string/binary parameters: 1) the extension type name; 2) the
> >>> type-specific serialization. The extension type name unambiguously
> >>> designates the abstract extension type (such as "Tensor"); the
> >>> serialization optionally encodes the extension type's parameters, if
> >>> it has any (such as the dimensionality for a "Tensor" type).
> >>>
> >>> Initially, Arrow extension types tended to be ad hoc and
> >>> application-specific, but there is a growing trend to standardize
> >>> "canonical extension types" to allow for better data interoperability
> >>> across widely-used data types [2].
> >>>
> >>> From my experience as an Arrow PMC member, if Arrow didn't have
> >>> extension types, the barrier to propose and standardize new data types
> >>> would be much higher, especially for complex proposals such as the
> >>> fixed-shape and variable-shape tensor types.
> >>>
> >>>
> >>> For Parquet, extension types would be an alternative to enchristening
> >>> additional logical types in the Thrift specification. I can see several
> >>> advantages to extension types over additional logical types:
> >>>
> >>> 1) extension types would make it easier to experiment in dedicated
> >>> communities, trying to find out the best possible representation for
> >>> some kinds of data (example: the Geoparquet work)
> >>>
> >>> 2) extension types would allow "soft standardization": an extension
> >>> type could first be formally defined by a dedicated community, then
> >>> optionally find an official place under the Parquet project.
> >>>
> >>> 3) extension types would allow defining complex data representations
> >>> and semantics without imposing a large burden on the developers of
> >>> Parquet implementations, who may not be competent in the target domain.
> >>> This includes non-trivial statistics such as bounding boxes for
> >>> geospatial data.
> >>>
> >>>
> >>> Technically, I can imagine two possible ways of adding extension types
> >>> to the Parquet format:
> >>>
> >>> 1) as an additional logical type;
> >>> 2) as a separate type determination, in addition to the logical type.
> >>>
> >>> We should also ensure it is possible to express extension-specific
> >>> statistics (such as bounding boxes for geospatial data).
> >>>
> >>> What do you think?
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>>
> >>>
> >>> [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
> >>>
> >>> [2]
> >>> https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html
> >>>
> >>>
> >>>
> >>>
> >
> >
>
>
