I don't think it makes a lot of sense to add an INT128 physical type
when it could be FLBA(16) instead. New physical types are a
larger implementation burden than new logical types.
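
To make the trade-off concrete, here is a minimal sketch of how a 128-bit
signed nanosecond count could be carried in a 16-byte FLBA. The helper names
are hypothetical, and the big-endian two's complement layout is an assumption
borrowed from how DECIMAL values are stored in FLBA; nothing here is an
agreed-upon encoding.

```python
# Hypothetical helpers for illustration only; not part of any Parquet library.
NANOS_PER_DAY = 86_400 * 1_000_000_000

# Conservative upper bound on a 10,000-year span (366-day years).
MAX_INTERVAL_NANOS = 10_000 * 366 * NANOS_PER_DAY

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def fits_int64(value: int) -> bool:
    """Return True if value is representable as a signed 64-bit integer."""
    return INT64_MIN <= value <= INT64_MAX


def encode_flba16(value: int) -> bytes:
    """Pack a signed integer as 16 bytes, big-endian two's complement.

    The byte order is an assumption (it mirrors DECIMAL-in-FLBA).
    Raises OverflowError if value does not fit in 128 bits.
    """
    return value.to_bytes(16, byteorder="big", signed=True)


def decode_flba16(raw: bytes) -> int:
    """Inverse of encode_flba16."""
    if len(raw) != 16:
        raise ValueError("expected exactly 16 bytes")
    return int.from_bytes(raw, byteorder="big", signed=True)


# INT64 nanoseconds cover roughly +/- 292 years, so the 10k-year bound
# does not fit, while 16 bytes hold it comfortably.
assert not fits_int64(MAX_INTERVAL_NANOS)
assert decode_flba16(encode_flba16(-MAX_INTERVAL_NANOS)) == -MAX_INTERVAL_NANOS
```

Note that with this layout, min/max statistics would need a signed comparison:
a plain unsigned byte-wise comparison would sort negative values after
positive ones.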

Regards

Antoine.



On Mon, 7 Jul 2025 10:24:50 +0300
Alkis Evlogimenos
<alkis.evlogime...@databricks.com.INVALID>
wrote:
> In the 25th June meeting the predominant suggestion was to make INTERVAL a
> logical type with physical type being INT64 nanos. This does not cover the
> full 10k year ANSI SQL range and the rebuttal for that is that we could add
> INT128 physical type that could also be used for that and other things.
> 
> Why wouldn't the above work?
> 
> On Mon, Jul 7, 2025 at 9:37 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> 
> > I think I've lost the thread a little bit on this discussion.  I'd like to
> > try to summarize my understanding of the issues at hand and current
> > proposals.
> >
> > As a short summary, my understanding of the requirements we want for a new
> > parquet day-time interval type are:
> >
> > 1.  Represent +/- 10K years for at least nano-second granularity (IIUC the
> > range is specified by ANSI SQL.  The granularity is picked for convenience
> > that covers most use cases).
> > 2.  Allow pruning via min-max statistics
> >
> > Shortcomings of the existing interval type:
> > 1.  Millisecond granularity.
> > 2.  Only supports positive values (all integers in the 12-byte FLBA are
> > unsigned).
> > 3.  No sort order
> >
> > Proposals:
> > 1.  Fixed width integer duration type (assumes days are always exactly
> > 86400 seconds)
> >     Pros:
> >        - Easy to model.
> >        - Maps easily to ANSI SQL Day Time Interval representations that
> > several engines use.
> >     Cons:
> >        - Doesn't match well with any Arrow type.
> >        - Doesn't capture semantics for engines that treat day as a calendar
> > type.
> >
> > 2.  IIUC, Something based off of Arrow's MonthDayNanos type.  I apologize,
> > I'm not sure if this is what is being proposed but here are the approaches
> > I see with this:
> >    - FLBA 16 byte type exactly like Arrow
> >    - Shredded (month, day and nanos each have a separate field under a
> > struct).  The main complication with this would be statistics, since they
> > are all on the leaf fields, so there is either some information loss OR a
> > convention for recording statistics on one of the leaves needs to be
> > followed.
> >
> >    - Pros:
> >       - Potentially more efficient storage
> >       - Maps directly to one of Arrow's Interval types.
> >    - Cons:
> >       - Comparison in general is challenging. Perhaps this can be solved by
> > a new sort order that clarifies relationships between fields, or by
> > additional metadata on the logical type, e.g. UNORDERED (Arrow semantics),
> > YEAR-MONTH (only the month field is expected to be populated and values are
> > ordered on it), or DAY-TIME ordered (a day means 86400 seconds and
> > nanoseconds are normalized to be less than 24 hours).
> >       - Doesn't conform to the notion of ANSI SQL interval types (which
> > have two separate representations for Year/Month and Day Time).  So some
> > parquet files might not be readable by ANSI SQL.
> >       - For the likely uncommon case (intervals ~+/- 200 years) might
> > require more transformation to an engine's internal representation.
> >       - Only works for the SQL standard if days are assumed to be 86400
> > seconds (right now fields are specified as independent of each other).
> >
> >
> > Did I get the requirements right?  Are there some other options people were
> > thinking about?
> >
> > Thanks,
> > Micah
> >
> >
> >
> > On Fri, Jul 4, 2025 at 6:49 AM Antoine Pitrou 
> > <antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:
> >  
> > > On Thu, 3 Jul 2025 17:22:27 +0100
> > > Raphael Taylor-Davies
> > > <r.taylordav...@googlemail.com.INVALID>
> > > wrote:  
> > > > Hi,
> > > >
> > > > I may be misunderstanding something, but it appears that the motivation
> > > > for this effort is that the current interval type represents a superset
> > > > of the functionality required by ANSI SQL, and therefore can cause
> > > > compatibility problems for some databases that only support the minimum
> > > > required by ANSI SQL?  
> > >
> > > While I understand the desire to be able to represent all values
> > > allowable in ANSI SQL, I really don't understand why our types should
> > > not be allowed to represent any values *outside* of the range allowed
> > > in ANSI SQL.
> > >
> > > Please let's be mindful that Parquet is not useful only for SQL-type
> > > workloads. Besides, ANSI SQL itself might evolve and we don't want to
> > > add another Interval type in a few years because the one we're
> > > currently specifying ends up too tight.
> > >
> > > (if some people want to make sure that values don't fall outside the
> > > ANSI SQL range, they can write a validation pass for it; no need to
> > > burden the Parquet *format* with such constraints)
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >  
> >  
> 
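
As an illustration of the comparison problem raised in Micah's summary quoted
above, the sketch below shows why field-by-field ordering of a
(months, days, nanos) value is only meaningful once a convention is fixed. The
class and function names are hypothetical, and the normalization assumes the
"DAY-TIME ordered" convention from the summary (months unused, a day is
exactly 86400 seconds, nanos kept below 24 hours).

```python
from typing import NamedTuple

NANOS_PER_DAY = 86_400 * 1_000_000_000


class MonthDayNanos(NamedTuple):
    """Hypothetical stand-in for an Arrow-style month/day/nanos interval."""
    months: int
    days: int
    nanos: int


def normalize_day_time(iv: MonthDayNanos) -> MonthDayNanos:
    """Normalize under the assumed DAY-TIME convention.

    After normalization 0 <= nanos < NANOS_PER_DAY, so comparing the tuples
    field by field agrees with comparing the total durations.
    """
    if iv.months != 0:
        raise ValueError("DAY-TIME intervals must not populate months")
    total = iv.days * NANOS_PER_DAY + iv.nanos
    # divmod floors, so nanos is non-negative even for negative totals.
    days, nanos = divmod(total, NANOS_PER_DAY)
    return MonthDayNanos(0, days, nanos)


# 25 hours expressed purely in nanos versus exactly one day.
a = MonthDayNanos(0, 0, 25 * 3_600 * 1_000_000_000)
b = MonthDayNanos(0, 1, 0)

# Unnormalized, field-by-field comparison claims a < b although a is longer.
assert a < b
# After normalization the ordering matches the actual durations.
assert normalize_day_time(a) > normalize_day_time(b)
```

Min/max statistics computed over unnormalized values would inherit the same
misordering, which is why a sort-order convention has to accompany the type.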


