Thanks for getting the vote started! If approved, I'm planning to finish the LargeList C++ implementation, yeah! We can discuss how best to move forward (especially w.r.t. reducing code duplication) on the PR.
On Thu, Apr 25, 2019 at 5:53 AM Wes McKinney <wesmck...@gmail.com> wrote:
> +1 to a vote (I can kick it off today). I had planned to respond earlier this week but jet lag and e-mail backlog got the better of me.
>
> On Thu, Apr 25, 2019 at 3:09 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> > Voting sounds like a good idea to me.
> >
> > Regards
> >
> > Antoine.
> >
> > Le 25/04/2019 à 06:16, Micah Kornfield a écrit :
> > > Given that conversation seems to have died down on this, would it make sense to do a vote to allow for large variable-width types to be added? As discussed previously, PRs would need both C++ and Java implementations before being merged.
> > >
> > > Could a PMC member facilitate this?
> > >
> > > Philipp, if approved, do you have bandwidth to finish up the PR for LargeList?
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Mon, Apr 15, 2019 at 11:16 PM Philipp Moritz <pcmor...@gmail.com> wrote:
> > >
> > >> @Micah: I wanted to make it possible to support serializing large objects (existing large pandas dataframes with an "object" column and also large Python types with the pyarrow serialization).
> > >>
> > >> On Mon, Apr 15, 2019 at 8:22 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > >>
> > >>> To summarize my understanding of the thread so far, there seems to be consensus on having a new distinct type for each "large" type.
> > >>>
> > >>> There are some reservations around the "large" types being harder to support in algorithmic implementations.
> > >>>
> > >>> I'm curious, Philipp: was there a concrete use case that inspired you to start the PR?
> > >>>
> > >>> Also, this was brought up on another thread, but the utility of the "large" types might be limited in some languages (e.g. Java) until they support buffer sizes larger than INT_MAX bytes. I brought this up on the current PR to decouple Netty and memory management from ArrowBuf [1], but the consensus seems to be to handle any modifications in follow-up PRs (if they are agreed upon).
> > >>>
> > >>> Anything else people want to discuss before a vote on whether to allow the additional types into the spec?
> > >>>
> > >>> Thanks,
> > >>> Micah
> > >>>
> > >>> [1] https://github.com/apache/arrow/pull/4151
> > >>>
> > >>> On Monday, April 15, 2019, Jacques Nadeau <jacq...@apache.org> wrote:
> > >>>
> > >>>>> I am not Jacques, but I will try to give my own point of view on this.
> > >>>>
> > >>>> Thanks for making me laugh :)
> > >>>>
> > >>>>> I think that this is unavoidable. Even with batches, taking an example of a binary column where the mean size of the payload is 1mb, it limits to batches of 2048 elements. This can become annoying pretty quickly.
> > >>>>
> > >>>> Good example. I'm not sure columnar matters but I find it more useful than others.
> > >>>>
> > >>>>> logical types and physical types
> > >>>>
> > >>>> TL;DR: it is painful no matter which model you pick.
> > >>>>
> > >>>> I definitely think we worked hard to make Arrow different from Parquet. It was something I pushed consciously when we started, as I found some of the patterns in Parquet to be quite challenging. Unfortunately, we went too far in some places in the Java code, which tried to parallel the structure of the physical types directly (and thus the big refactor we did to reduce duplication last year -- props to Sidd, Bryan and the others who worked on that). I also think we probably lost as much as we gained using the current model.
> > >>>>
> > >>>> I agree with Antoine, both in his clean statement of the approaches and that sticking to the model we have today makes the most sense.
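To make the 2048-element figure in the quoted example concrete: 32-bit signed offsets cap a single array's data buffer at INT32_MAX bytes, so values of roughly 1 MiB each leave room for only about 2047 per batch. Back-of-the-envelope, in plain Python:

    # 32-bit signed offsets cap a single variable-width array's data
    # buffer at 2**31 - 1 bytes.
    MAX_OFFSET = 2**31 - 1        # INT32_MAX
    PAYLOAD = 2**20               # assume ~1 MiB mean payload per value
    print(MAX_OFFSET // PAYLOAD)  # -> 2047, i.e. batches of ~2048 elements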
> > >>>> On Mon, Apr 15, 2019 at 11:05 AM Francois Saint-Jacques <fsaintjacq...@gmail.com> wrote:
> > >>>>
> > >>>>> Thanks for the clarification, Antoine, very insightful.
> > >>>>>
> > >>>>> I'd also vote for keeping the existing model for consistency.
> > >>>>>
> > >>>>> On Mon, Apr 15, 2019 at 1:40 PM Antoine Pitrou <anto...@python.org> wrote:
> > >>>>>
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> I am not Jacques, but I will try to give my own point of view on this.
> > >>>>>>
> > >>>>>> The distinction between logical and physical types can be modelled in two different ways:
> > >>>>>>
> > >>>>>> 1) a physical type can denote several logical types, but a logical type can only have a single physical representation. This is currently the Arrow model.
> > >>>>>>
> > >>>>>> 2) a physical type can denote several logical types, and a logical type can also be denoted by several physical types. This is the Parquet model.
> > >>>>>>
> > >>>>>> (theoretically, there are two other possible models, but they are not very interesting to consider, since they don't seem to cater to concrete use cases)
> > >>>>>>
> > >>>>>> Model 1 is obviously more restrictive, while model 2 is more flexible. Model 2 could be said to be "higher level"; you see something similar if you compare Python's and C++'s typing systems. On the other hand, model 1 provides a potentially simpler programming model for implementors of low-level kernels, as you can simply query the logical type of your data and you automatically know its physical type.
> > >>>>>>
> > >>>>>> The model chosen for Arrow is ingrained in its API. If we want to change the model, we'd better do it wholesale (implying probably a large refactoring and a significant number of unavoidable regressions) to avoid subjecting users to a confusing middle point.
> > >>>>>>
> > >>>>>> Also, as a side note, "convertibility" between different types can be a hairy subject... Having strict boundaries between types avoids being dragged into it too early.
> > >>>>>>
> > >>>>>> To return to the original subject: IMHO, LargeList (resp. LargeBinary) should be a distinct logical type from List (resp. Binary), the same way Int64 is a distinct logical type from Int32.
> > >>>>>>
> > >>>>>> Regards
> > >>>>>>
> > >>>>>> Antoine.
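Purely to illustrate Antoine's model 1 (every logical type has exactly one physical layout): if the proposal is adopted, I'd imagine the distinction surfacing in pyarrow roughly like the sketch below. pa.large_binary() is hypothetical here, named by analogy with the existing pa.binary():

    import pyarrow as pa

    # Model 1: the 64-bit-offset variant is a distinct logical type,
    # not a parameter on the existing one -- just like int32 vs int64.
    small = pa.array([b"spam", b"eggs"], type=pa.binary())        # int32 offsets
    large = pa.array([b"spam", b"eggs"], type=pa.large_binary())  # int64 offsets (hypothetical)
    print(small.type == large.type)  # False: same values, distinct logical types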
> > >>>>>> Le 15/04/2019 à 18:45, Francois Saint-Jacques a écrit :
> > >>>>>>> Hello,
> > >>>>>>>
> > >>>>>>> I would like to understand where we stand on logical types and physical types. As I understand it, this proposal is about the physical representation.
> > >>>>>>>
> > >>>>>>> In the context of an execution engine, the concept of logical types becomes more important, as two physical representations might carry the same semantic values, e.g. LargeList and List where all values fit in 32 bits. A more complex example would be an integer array and a dictionary array whose values are integers.
> > >>>>>>>
> > >>>>>>> Is this only relevant for execution engines? What about the (C++) Array.Equals method and related comparison methods? This also touches on the subject of type equality, e.g. dictionaries with different but compatible encodings.
> > >>>>>>>
> > >>>>>>> Jacques, knowing that you worked on Parquet (which follows this model) and Dremio, what is your opinion?
> > >>>>>>>
> > >>>>>>> François
> > >>>>>>>
> > >>>>>>> Some related tickets:
> > >>>>>>> - https://jira.apache.org/jira/browse/ARROW-554
> > >>>>>>> - https://jira.apache.org/jira/browse/ARROW-1741
> > >>>>>>> - https://jira.apache.org/jira/browse/ARROW-3144
> > >>>>>>> - https://jira.apache.org/jira/browse/ARROW-4097
> > >>>>>>> - https://jira.apache.org/jira/browse/ARROW-5052
> > >>>>>>>
> > >>>>>>> On Thu, Apr 11, 2019 at 4:52 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>>> ARROW-4810 [1] and ARROW-750 [2] discuss adding types with 64-bit offsets to Lists, Strings and binary data types.
> > >>>>>>>>
> > >>>>>>>> Philipp started an implementation for the large list type [3] and I hacked together a potentially viable Java implementation [4].
> > >>>>>>>>
> > >>>>>>>> I'd like to kick off the discussion for getting these types voted on. I'm coupling them together because I think there are design considerations for how we evolve Schema.fbs.
> > >>>>>>>>
> > >>>>>>>> There are two proposed options:
> > >>>>>>>>
> > >>>>>>>> 1. The current PR proposal, which adds a new type LargeList:
> > >>>>>>>>
> > >>>>>>>> // List with 64-bit offsets
> > >>>>>>>> table LargeList {}
> > >>>>>>>>
> > >>>>>>>> 2. As François suggested, it might be cleaner to parameterize List with the offset width. I suppose something like:
> > >>>>>>>>
> > >>>>>>>> table List {
> > >>>>>>>>   // only 32 bit and 64 bit are supported.
> > >>>>>>>>   bitWidth: int = 32;
> > >>>>>>>> }
> > >>>>>>>>
> > >>>>>>>> I think Option 2 is cleaner and potentially better long-term, but I think it breaks forward compatibility of the existing Arrow libraries. If we proceed with Option 2, I would advocate making the change to Schema.fbs all at once for all types (assuming we think that 64-bit offsets are desirable for all types), along with future-compatibility checks, to avoid multiple releases where future compatibility is broken (by broken I mean the inability to detect that an implementation is receiving data it can't read). What are people's thoughts on this?
> > >>>>>>>> Also, any other concerns with adding these types?
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> Micah
> > >>>>>>>>
> > >>>>>>>> [1] https://issues.apache.org/jira/browse/ARROW-4810
> > >>>>>>>> [2] https://issues.apache.org/jira/browse/ARROW-750
> > >>>>>>>> [3] https://github.com/apache/arrow/pull/3848
> > >>>>>>>> [4] https://github.com/apache/arrow/commit/03956cac2202139e43404d7a994508080dc2cdd1
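One way to see the forward-compatibility hazard Micah describes for Option 2: an old reader that predates the bitWidth field would get the FlatBuffers default (32) and silently reinterpret an int64 offsets buffer as int32. A minimal sketch of the misread, plain Python with struct only:

    import struct

    # Writer emits int64 offsets [0, 4, 8]; an old reader assumes int32
    # and reinterprets the same 24 bytes as six int32 values.
    offsets64 = struct.pack("<3q", 0, 4, 8)
    misread = struct.unpack("<6i", offsets64)
    print(misread)  # (0, 0, 4, 0, 8, 0) -- silently garbled offsets

Option 1 instead surfaces as an unknown member of the Type union, which an old implementation can at least detect and reject rather than misread.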