In the 25th June meeting the predominant suggestion was to make INTERVAL a logical type with physical type being INT64 nanos. This does not cover the full 10k year ANSI SQL range and the rebuttal for that is that we could add INT128 physical type that could also be used for that and other things.
Why wouldn't the above work? On Mon, Jul 7, 2025 at 9:37 AM Micah Kornfield <emkornfi...@gmail.com> wrote: > I think I've lost the thread a little bit on this discussion. I'd like to > try to summarize my understanding of the issues at hand and current > proposals. > > As a short summary, my understanding of the requirements we want for a new > parquet day-time interval type are: > > 1. Represent +/- 10K years for at least nano-second granularity (IIUC the > range is specified by ANSI SQL. The granularity is picked for convenience > that covers most use cases). > 2. Allow pruning via min-max statistics > > Shortcomings of the existing interval type: > 1. Millisecond granularity. > 2. Only supports positive values (all integer in FLBA of 12 bytes are > unsigned). > 3. No sort order > > Proposals: > 1. Fixed width integer duration type (assumes days are always exactly > 86400 seconds) > Pros: > - Easy to model. > - Maps easily to ANSI SQL Day Time Interval representations that > several engines use. > Cons: > - Doesn't match well with any Arrow type. > - Doesn't capture semantics for engines that treat day as a calendar > type. > > 2. IIUC, Something based off of Arrow's MonthDayNanos type. I apologize, > I'm not sure if this is what is being proposed but here are the approaches > i see with this: > - FLBA 16 byte type exactly like arrow > - Shredded (month, day and nanos each have a separate field under a > struct). The main complication with this would be statistics since it is > all on the leaf fields, so there is either some information loss OR a > convention for recording statistics on one of the leafs need to followed). > > - Pros: > - Potentially more efficient storage > - Maps directly to one of Arrow's Interval types. > - Cons: > - Comparison in general is challenging. Perhaps this can be solved by > new sort order that clarify relationships between fields or additional > metadata on the logic type. e.g. UNORDERED (Arrow semantics) or YEAR-MONTH > (only month field is expected to be populated and values are ordered on it) > or DAY-TIME ordered (A day means 86400 seconds and nanoseconds normalized > to be less then 24 hours). > - Doesn't conform to the notion of ANSI SQL interval types (which > have two separate representations for Year/Month and Day Time). So some > parquet files might not be readable by ANSI SQL. > - For the likely uncommon case (intervals ~+/- 200 years) might > require more transformation to an engine's internal representation. > - Only works for the SQL standard if days are assumed to be 86400 > seconds (right now fields are specified as independent of each other). > > > Did I get the requirements right? Are there some other options people were > thinking about? > > Thanks, > Micah > > > > On Fri, Jul 4, 2025 at 6:49 AM Antoine Pitrou <anto...@python.org> wrote: > > > On Thu, 3 Jul 2025 17:22:27 +0100 > > Raphael Taylor-Davies > > <r.taylordav...@googlemail.com.INVALID> > > wrote: > > > Hi, > > > > > > I may be misunderstanding something, but it appears that the motivation > > > for this effort is that the current interval type represents a superset > > > of the functionality required by ANSI SQL, and therefore can cause > > > compatibility problems for some databases that only support the minimum > > > required by ANSI SQL? > > > > While I understand the desire to be able to represent all values > > allowable in ANSI SQL, I really don't understand why our types should > > not be allowed to represent any values *outside* of the range allowed > > in ANSI SQL. > > > > Please let's be mindful that Parquet is not useful only for SQL-type > > workloads. Besides, ANSI SQL itself might evolve and we don't want to > > add another Interval type in a few years because the one we're > > current specifying ends up too tight. > > > > (if some people want to make sure that values don't fall outside the > > ANSI SQL range, they can write a validation pass for it; no need to > > burden the Parquet *format* with such contraints) > > > > Regards > > > > Antoine. > > > > > > >