To summarize my understanding of the thread so far, there seems to be
consensus on having a new distinct type for each "large" type.
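For anyone catching up, here is a quick back-of-the-envelope sketch (my own illustration, not from the thread) of why 32-bit offsets become limiting; it mirrors the 1 MB mean-payload example quoted further down:

```python
# Sketch: with signed 32-bit offsets, a variable-length column (binary,
# string, list) can address at most 2**31 - 1 bytes of value data.
max_value_bytes = 2**31 - 1   # int32 offset limit
mean_payload = 1 << 20        # 1 MiB mean element size (example figure)
max_elements = max_value_bytes // mean_payload
print(max_elements)  # 2047, i.e. roughly 2048 elements per batch
```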

There are some reservations around the "large" types being harder to
support in algorithmic implementations.

I'm curious Philipp, was there a concrete use-case that inspired you to
start the PR?

Also, this was brought up on another thread, but utility of the "large"
types might be limited in some languages (e.g. Java) until they support
buffer sizes larger than INT_MAX bytes.  I brought this up on the current
PR to decouple Netty and memory management from ArrowBuf [1], but the
consensus seems to be to handle any modifications in follow-up PRs (if they
are agreed upon).
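To make the Java limitation concrete, a small arithmetic sketch (my own
illustration, not code from the PR): Java's on-heap buffers (byte[],
ByteBuffer) are indexed with int, so a single buffer tops out at
Integer.MAX_VALUE bytes, and a "large" column whose data region exceeds
that cannot live in one such buffer.

```python
# Sketch: Java byte[] / ByteBuffer use int indices, capping a single
# buffer at Integer.MAX_VALUE = 2**31 - 1 bytes.
JAVA_MAX_BUFFER = 2**31 - 1
column_bytes = 3 * 2**30          # hypothetical 3 GiB of string data
# Ceiling division: how many int-indexed buffers would be needed.
buffers_needed = -(-column_bytes // JAVA_MAX_BUFFER)
print(buffers_needed)  # 2
```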

Anything else people want to discuss before a vote on whether to allow the
additional types into the spec?

Thanks,
Micah

[1] https://github.com/apache/arrow/pull/4151
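As an aside on the forward-compatibility concern discussed below for
Option 2 (parameterizing List with an offset width): Flatbuffers readers
substitute the schema default for fields they were compiled without, so an
old reader would silently assume 32-bit offsets rather than fail. A toy
simulation of that behavior (hypothetical field names, not real
flatbuffers code):

```python
# Toy model of flatbuffers default-field semantics (not the real API).
writer_message = {"type": "List", "bitWidth": 64}  # written by a new library

def old_reader_bit_width(message):
    # An old reader predates the bitWidth field; flatbuffers would hand
    # it the schema default (32) instead of signalling an error.
    known_fields = {"type"}
    return message["bitWidth"] if "bitWidth" in known_fields else 32

print(old_reader_bit_width(writer_message))  # 32 -> silent misread of 64-bit data
```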




On Monday, April 15, 2019, Jacques Nadeau <jacq...@apache.org> wrote:

> > I am not Jacques, but I will try to give my own point of view on this.
> >
>
> Thanks for making me laugh :)
>
> > I think that this is unavoidable. Even with batches, taking the example
> > of a binary column where the mean payload size is 1 MB, it limits
> > batches to 2048 elements. This can become annoying pretty quickly.
> >
>
> Good example. I'm not sure the columnar aspect matters, but I find it more
> useful than other examples.
>
> > logical types and physical types
> >
>
> TL;DR: It is painful no matter which model you pick.
>
> I definitely think we worked hard to go a different way on Arrow than Parquet. It
> was something I pushed consciously when we started as I found some of the
> patterns in Parquet to be quite challenging. Unfortunately, we went too far
> in some places in the Java code which tried to parallel the structure of
> the physical types directly (and thus the big refactor we did to reduce
> duplication last year -- props to Sidd, Bryan and the others who worked on
> that). I also think that we probably lost as much as we gained using
> the current model.
>
> I agree with Antoine both in his clean statement of the approaches and that
> sticking to the model we have today makes the most sense.
>
> On Mon, Apr 15, 2019 at 11:05 AM Francois Saint-Jacques <
> fsaintjacq...@gmail.com> wrote:
>
> > Thanks for the clarification Antoine, very insightful.
> >
> > I'd also vote for keeping the existing model for consistency.
> >
> > On Mon, Apr 15, 2019 at 1:40 PM Antoine Pitrou <anto...@python.org>
> > > wrote:
> >
> > >
> > > Hi,
> > >
> > > I am not Jacques, but I will try to give my own point of view on this.
> > >
> > > The distinction between logical and physical types can be modelled in
> > > two different ways:
> > >
> > > 1) a physical type can denote several logical types, but a logical type
> > > can only have a single physical representation.  This is currently the
> > > Arrow model.
> > >
> > > 2) a physical type can denote several logical types, and a logical type
> > > can also be denoted by several physical types.  This is the Parquet
> > > model.
> > >
> > > (Theoretically, there are two other possible models, but they are not
> > > very interesting to consider, since they don't seem to cater to
> > > concrete use cases.)
> > >
> > > Model 1 is obviously more restrictive, while model 2 is more flexible.
> > > Model 2 could be said to be "higher level"; you see something similar
> > > if you compare Python's and C++'s typing systems.  On the other hand,
> > > model 1 provides a potentially simpler programming model for
> > > implementors of low-level kernels, as you can simply query the logical
> > > type of your data and you automatically know its physical type.
> > >
> > > The model chosen for Arrow is ingrained in its API.  If we want to
> > > change the model we'd better do it wholesale (implying probably a large
> > > refactoring and a significant number of unavoidable regressions) to
> > > avoid subjecting users to a confusing middle point.
> > >
> > > Also and as a sidenote, "convertibility" between different types can be
> > > a hairy subject... Having strict boundaries between types avoids being
> > > dragged into it too early.
> > >
> > >
> > > To return to the original subject: IMHO, LargeList (resp. LargeBinary)
> > > should be a distinct logical type from List (resp. Binary), the same
> > > way Int64 is a distinct logical type from Int32.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >
> > > On 15/04/2019 at 18:45, Francois Saint-Jacques wrote:
> > > > Hello,
> > > >
> > > > I would like to understand where we stand on logical types and
> > > > physical types. As I understand it, this proposal is for the
> > > > physical representation.
> > > >
> > > > In the context of an execution engine, the concept of logical types
> > > > becomes more important, as two physical representations might have
> > > > the same semantic values, e.g. LargeList and List where all values
> > > > fit in 32 bits.  A more complex example would be an Integer array
> > > > and a dictionary array where the values are integers.
> > > >
> > > > Is this only relevant for execution engines?  What about the (C++)
> > > > Array.Equals method and related comparison methods?  This also
> > > > touches on the subject of type equality, e.g. dictionaries with
> > > > different but compatible encodings.
> > > >
> > > > Jacques, knowing that you worked on Parquet (which follows this
> > > > model) and Dremio, what is your opinion?
> > > >
> > > > François
> > > >
> > > > Some related tickets:
> > > > - https://jira.apache.org/jira/browse/ARROW-554
> > > > - https://jira.apache.org/jira/browse/ARROW-1741
> > > > - https://jira.apache.org/jira/browse/ARROW-3144
> > > > - https://jira.apache.org/jira/browse/ARROW-4097
> > > > - https://jira.apache.org/jira/browse/ARROW-5052
> > > >
> > > >
> > > >
> > > > On Thu, Apr 11, 2019 at 4:52 AM Micah Kornfield <
> > > > emkornfi...@gmail.com> wrote:
> > > >
> > > >> ARROW-4810 [1] and ARROW-750 [2] discuss adding types with 64-bit
> > > >> offsets to Lists, Strings and binary data types.
> > > >>
> > > >> Philipp started an implementation for the large list type [3] and I
> > > >> hacked together a potentially viable Java implementation [4].
> > > >>
> > > >> I'd like to kick off the discussion for getting these types voted on.
> > > >> I'm coupling them together because I think there are design
> > > >> considerations for how we evolve Schema.fbs.
> > > >>
> > > >> There are two proposed options:
> > > >> 1.  The current PR proposal which adds a new type LargeList:
> > > >>   // List with 64-bit offsets
> > > >>   table LargeList {}
> > > >>
> > > >> 2.  As François suggested, it might be cleaner to parameterize List
> > > >> with an offset width.  I suppose something like:
> > > >>
> > > >> table List {
> > > >>   // Only 32-bit and 64-bit are supported.
> > > >>   bitWidth: int = 32;
> > > >> }
> > > >>
> > > >> I think Option 2 is cleaner and potentially better long-term, but I
> > > >> think it breaks forward compatibility for the existing Arrow
> > > >> libraries.  If we proceed with Option 2, I would advocate making the
> > > >> change to Schema.fbs all at once for all types (assuming we think
> > > >> that 64-bit offsets are desirable for all types), along with forward-
> > > >> compatibility checks, to avoid multiple releases where forward
> > > >> compatibility is broken (by broken I mean the inability to detect
> > > >> that an implementation is receiving data it can't read).  What are
> > > >> people's thoughts on this?
> > > >>
> > > >> Also, any other concern with adding these types?
> > > >>
> > > >> Thanks,
> > > >> Micah
> > > >>
> > > >> [1] https://issues.apache.org/jira/browse/ARROW-4810
> > > >> [2] https://issues.apache.org/jira/browse/ARROW-750
> > > >> [3] https://github.com/apache/arrow/pull/3848
> > > >> [4]
> > > >> https://github.com/apache/arrow/commit/03956cac2202139e43404d7a994508080dc2cdd1
> > > >>
> > > >
> > >
> >
>
