> I am not Jacques, but I will try to give my own point of view on this.
>

Thanks for making me laugh :)

> I think that this is unavoidable. Even with batches, taking an example of a
> binary column where the mean size of the payload is 1mb, it limits to
> batches of 2048 elements. This can become annoying pretty quickly.
>

Good example. I'm not sure the columnar aspect matters, but I find this
example more compelling than the others.
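For readers who want to check the arithmetic behind that quoted example, here is an editor's back-of-the-envelope sketch. The cap comes from 32-bit signed offsets limiting a variable-width column's data buffer to 2**31 - 1 bytes; the 1 MiB mean payload is the figure from the quote above:

```python
# Editor's sketch: why ~1 MiB binary payloads cap a batch near 2048 values.
# With 32-bit signed offsets, the data buffer of a variable-width column
# is limited to 2**31 - 1 bytes (just under 2 GiB).
OFFSET_LIMIT_BYTES = 2**31 - 1
MEAN_PAYLOAD_BYTES = 1024 * 1024  # 1 MiB mean payload, per the example

max_values_per_batch = OFFSET_LIMIT_BYTES // MEAN_PAYLOAD_BYTES
print(max_values_per_batch)  # 2047 -- roughly the 2048-element limit cited
```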

> logical types and physical types
>

TL;DR: it is painful no matter which model you pick.

I definitely think we worked hard to take a different direction with Arrow
than with Parquet. It was something I pushed consciously when we started, as
I found some of the patterns in Parquet to be quite challenging.
Unfortunately, we went too far in some places in the Java code, which tried
to parallel the structure of the physical types directly (hence the big
refactor we did last year to reduce duplication -- props to Sidd, Bryan and
the others who worked on that). I also think we quite possibly lost as much
as we gained with the current model.

I agree with Antoine both in his clean statement of the approaches and that
sticking to the model we have today makes the most sense.
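To make the two models Antoine lays out below concrete, here is a small editor's sketch. The type names and mappings are illustrative only, not Arrow's or Parquet's actual APIs; the point is that under model 1 a logical type resolves to exactly one physical layout, while under model 2 one logical type may be stored under several physical types:

```python
# Editor's sketch of the two logical->physical mapping models.
# The entries below are illustrative, not an actual Arrow or Parquet API.

# Model 1 (Arrow): each logical type has exactly one physical layout,
# so a kernel can resolve the buffer layout from the logical type alone.
ARROW_STYLE = {
    "int32": "4-byte values",
    "string": "32-bit offsets + UTF-8 data buffer",
    "large_string": "64-bit offsets + UTF-8 data buffer",
}

# Model 2 (Parquet): one logical type may be stored under several physical
# types; e.g. Parquet's DECIMAL can be backed by INT32, INT64, or
# FIXED_LEN_BYTE_ARRAY depending on precision.
PARQUET_STYLE = {
    "decimal": ["INT32", "INT64", "FIXED_LEN_BYTE_ARRAY"],
    "string": ["BYTE_ARRAY"],
}

def physical_layout(logical: str) -> str:
    # Under model 1 this lookup is unambiguous.
    return ARROW_STYLE[logical]

def physical_candidates(logical: str) -> list:
    # Under model 2 a reader must also inspect the stored physical type.
    return PARQUET_STYLE[logical]
```

This is the simplicity Antoine points to for low-level kernels: with model 1, knowing the logical type is enough to know the bytes.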

On Mon, Apr 15, 2019 at 11:05 AM Francois Saint-Jacques <
fsaintjacq...@gmail.com> wrote:

> Thanks for the clarification Antoine, very insightful.
>
> I'd also vote for keeping the existing model for consistency.
>
> On Mon, Apr 15, 2019 at 1:40 PM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > Hi,
> >
> > I am not Jacques, but I will try to give my own point of view on this.
> >
> > The distinction between logical and physical types can be modelled in
> > two different ways:
> >
> > 1) a physical type can denote several logical types, but a logical type
> > can only have a single physical representation.  This is currently the
> > Arrow model.
> >
> > 2) a physical type can denote several logical types, and a logical type
> > can also be denoted by several physical types.  This is the Parquet
> > model.
> >
> > (theoretically, there are two other possible models, but they are not
> > very interesting to consider, since they don't seem to cater to concrete
> > use cases)
> >
> > Model 1 is obviously more restrictive, while model 2 is more flexible.
> > Model 2 could be said "higher level"; you see something similar if you
> > compare Python's and C++'s typing systems.  On the other hand, model 1
> > provides a potentially simpler programming model for implementors of
> > low-level kernels, as you can simply query the logical type of your data
> > and you automatically know its physical type.
> >
> > The model chosen for Arrow is ingrained in its API.  If we want to
> > change the model we'd better do it wholesale (implying probably a large
> > refactoring and a significant number of unavoidable regressions) to
> > avoid subjecting users to a confusing middle point.
> >
> > Also and as a sidenote, "convertibility" between different types can be
> > a hairy subject... Having strict boundaries between types avoids being
> > dragged into it too early.
> >
> >
> > To return to the original subject: IMHO, LargeList (resp. LargeBinary)
> > should be a distinct logical type from List (resp. Binary), the same way
> > Int64 is a distinct logical type from Int32.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
> > On 15/04/2019 at 18:45, Francois Saint-Jacques wrote:
> > > Hello,
> > >
> > > I would like to understand where we stand on logical types and
> > > physical types. As I understand it, this proposal is for the physical
> > > representation.
> > >
> > > In the context of an execution engine, the concept of logical types
> > > becomes more important, as two physical representations might have the
> > > same semantic values, e.g. LargeList and List where all values fit in
> > > 32 bits. A more complex example would be an integer array and a
> > > dictionary array whose values are integers.
> > >
> > > Is this only relevant for an execution engine? What about the (C++)
> > > Array.Equals method and related comparison methods? This also touches
> > > the subject of type equality, e.g. dictionaries with different but
> > > compatible encodings.
> > >
> > > Jacques, knowing that you worked on Parquet (which follows this model)
> > > and Dremio, what is your opinion?
> > >
> > > François
> > >
> > > Some related tickets:
> > > - https://jira.apache.org/jira/browse/ARROW-554
> > > - https://jira.apache.org/jira/browse/ARROW-1741
> > > - https://jira.apache.org/jira/browse/ARROW-3144
> > > - https://jira.apache.org/jira/browse/ARROW-4097
> > > - https://jira.apache.org/jira/browse/ARROW-5052
> > >
> > >
> > >
> > > On Thu, Apr 11, 2019 at 4:52 AM Micah Kornfield <emkornfi...@gmail.com>
> > > wrote:
> > >
> > >> ARROW-4810 [1] and ARROW-750 [2] discuss adding types with 64-bit
> > >> offsets to Lists, Strings and binary data types.
> > >>
> > >> Philipp started an implementation for the large list type [3], and I
> > >> hacked together a potentially viable Java implementation [4].
> > >>
> > >> I'd like to kick off the discussion for getting these types voted on.
> > >> I'm coupling them together because I think there are design
> > >> considerations for how we evolve Schema.fbs.
> > >>
> > >> There are two proposed options:
> > >> 1.  The current PR proposal which adds a new type LargeList:
> > >>   // List with 64-bit offsets
> > >>   table LargeList {}
> > >>
> > >> 2.  As François suggested, it might be cleaner to parameterize List with
> > >> offset width.  I suppose something like:
> > >>
> > >> table List {
> > >>   // only 32 bit and 64 bit is supported.
> > >>   bitWidth: int = 32;
> > >> }
> > >>
> > >> I think Option 2 is cleaner and potentially better long-term, but I
> > >> think it breaks forward compatibility of the existing Arrow libraries.
> > >> If we proceed with Option 2, I would advocate making the change to
> > >> Schema.fbs all at once for all types (assuming we think that 64-bit
> > >> offsets are desirable for all types), along with future compatibility
> > >> checks to avoid multiple releases where future compatibility is broken
> > >> (by broken I mean the inability to detect that an implementation is
> > >> receiving data it can't read).  What are people's thoughts on this?
> > >>
> > >> Also, any other concern with adding these types?
> > >>
> > >> Thanks,
> > >> Micah
> > >>
> > >> [1] https://issues.apache.org/jira/browse/ARROW-4810
> > >> [2] https://issues.apache.org/jira/browse/ARROW-750
> > >> [3] https://github.com/apache/arrow/pull/3848
> > >> [4]
> > >> https://github.com/apache/arrow/commit/03956cac2202139e43404d7a994508080dc2cdd1
> > >
> >
>
