Thanks for getting the vote started! If approved, I'm planning to finish the LargeList C++ implementation, yeah! We can discuss how best to move forward (especially w.r.t. reducing code duplication) on the PR.
On Thu, Apr 25, 2019 at 5:53 AM Wes McKinney <wesmck...@gmail.com> wrote:
> +1 to a vote (I can kick it off today). I had planned to respond earlier this week but jet lag and e-mail backlog got the better of me.
>
> On Thu, Apr 25, 2019 at 3:09 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> > Voting sounds like a good idea to me.
> >
> > Regards
> >
> > Antoine.
> >
> > Le 25/04/2019 à 06:16, Micah Kornfield a écrit :
> > > Given that conversation seems to have died down on this, would it make sense to do a vote to allow for large variable-width types to be added? As discussed previously, PRs would need both C++ and Java implementations before being merged.
> > >
> > > Could a PMC member facilitate this?
> > >
> > > Philipp, if approved, do you have bandwidth to finish up the PR for LargeList?
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Mon, Apr 15, 2019 at 11:16 PM Philipp Moritz <pcmor...@gmail.com> wrote:
> > >
> > >> @Micah: I wanted to make it possible to support serializing large objects (existing large pandas dataframes with an "object" column and also large Python types with the pyarrow serialization).
> > >>
> > >> On Mon, Apr 15, 2019 at 8:22 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > >>
> > >>> To summarize my understanding of the thread so far, there seems to be consensus on having a new distinct type for each "large" type.
> > >>>
> > >>> There are some reservations around the "large" types being harder to support in algorithmic implementations.
> > >>>
> > >>> I'm curious, Philipp: was there a concrete use case that inspired you to start the PR?
> > >>>
> > >>> Also, this was brought up on another thread, but the utility of the "large" types might be limited in some languages (e.g. Java) until they support buffer sizes larger than INT_MAX bytes. I brought this up on the current PR to decouple Netty and memory management from ArrowBuf [1], but the consensus seems to be to handle any modifications in follow-up PRs (if they are agreed upon).
> > >>>
> > >>> Anything else people want to discuss before a vote on whether to allow the additional types into the spec?
> > >>>
> > >>> Thanks,
> > >>> Micah
> > >>>
> > >>> [1] https://github.com/apache/arrow/pull/4151
> > >>>
> > >>> On Monday, April 15, 2019, Jacques Nadeau <jacq...@apache.org> wrote:
> > >>>
> > >>>>> I am not Jacques, but I will try to give my own point of view on this.
> > >>>>
> > >>>> Thanks for making me laugh :)
> > >>>>
> > >>>>> I think that this is unavoidable. Even with batches, taking an example of a binary column where the mean size of the payload is 1mb, it limits to batches of 2048 elements. This can become annoying pretty quickly.
> > >>>>
> > >>>> Good example. I'm not sure columnar matters but I find it more useful than others.
> > >>>>
> > >>>>> logical types and physical types
> > >>>>
> > >>>> TL;DR: it is painful no matter which model you pick.
> > >>>>
> > >>>> I definitely think we worked hard to make Arrow different from Parquet. It was something I pushed consciously when we started, as I found some of the patterns in Parquet to be quite challenging. Unfortunately, we went too far in some places in the Java code, which tried to parallel the structure of the physical types directly (and thus the big refactor we did to reduce duplication last year -- props to Sidd, Bryan and the others who worked on that). I also think we probably lost as much as we gained using the current model.
> > >>>>
> > >>>> I agree with Antoine, both in his clean statement of the approaches and that sticking to the model we have today makes the most sense.
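To make the 2048-element figure in the quoted example concrete: 32-bit signed offsets cap a single array's data buffer at INT32_MAX bytes, so values of roughly 1 MiB each leave room for only about 2047 per batch. Back-of-the-envelope, in plain Python:

    # 32-bit signed offsets cap a single variable-width array's data
    # buffer at 2**31 - 1 bytes.
    MAX_OFFSET = 2**31 - 1        # INT32_MAX
    PAYLOAD = 2**20               # assume ~1 MiB mean payload per value
    print(MAX_OFFSET // PAYLOAD)  # -> 2047, i.e. batches of ~2048 elements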
> > >>>> On Mon, Apr 15, 2019 at 11:05 AM Francois Saint-Jacques <fsaintjacq...@gmail.com> wrote:
> > >>>>
> > >>>>> Thanks for the clarification, Antoine, very insightful.
> > >>>>>
> > >>>>> I'd also vote for keeping the existing model for consistency.
> > >>>>>
> > >>>>> On Mon, Apr 15, 2019 at 1:40 PM Antoine Pitrou <anto...@python.org> wrote:
> > >>>>>
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> I am not Jacques, but I will try to give my own point of view on this.
> > >>>>>>
> > >>>>>> The distinction between logical and physical types can be modelled in two different ways:
> > >>>>>>
> > >>>>>> 1) a physical type can denote several logical types, but a logical type can only have a single physical representation. This is currently the Arrow model.
> > >>>>>>
> > >>>>>> 2) a physical type can denote several logical types, and a logical type can also be denoted by several physical types. This is the Parquet model.
> > >>>>>>
> > >>>>>> (theoretically, there are two other possible models, but they are not very interesting to consider, since they don't seem to cater to concrete use cases)
> > >>>>>>
> > >>>>>> Model 1 is obviously more restrictive, while model 2 is more flexible. Model 2 could be said to be "higher level"; you see something similar if you compare Python's and C++'s typing systems. On the other hand, model 1 provides a potentially simpler programming model for implementors of low-level kernels, as you can simply query the logical type of your data and you automatically know its physical type.
> > >>>>>>
> > >>>>>> The model chosen for Arrow is ingrained in its API. If we want to change the model, we'd better do it wholesale (implying probably a large refactoring and a significant number of unavoidable regressions) to avoid subjecting users to a confusing middle point.
> > >>>>>>
> > >>>>>> Also, as a side note, "convertibility" between different types can be a hairy subject... Having strict boundaries between types avoids being dragged into it too early.
> > >>>>>>
> > >>>>>> To return to the original subject: IMHO, LargeList (resp. LargeBinary) should be a distinct logical type from List (resp. Binary), the same way Int64 is a distinct logical type from Int32.
> > >>>>>>
> > >>>>>> Regards
> > >>>>>>
> > >>>>>> Antoine.
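Purely to illustrate Antoine's model 1 (every logical type has exactly one physical layout): if the proposal is adopted, I'd imagine the distinction surfacing in pyarrow roughly like the sketch below. pa.large_binary() is hypothetical here, named by analogy with the existing pa.binary():

    import pyarrow as pa

    # Model 1: the 64-bit-offset variant is a distinct logical type,
    # not a parameter on the existing one -- just like int32 vs int64.
    small = pa.array([b"spam", b"eggs"], type=pa.binary())        # int32 offsets
    large = pa.array([b"spam", b"eggs"], type=pa.large_binary())  # int64 offsets (hypothetical)
    print(small.type == large.type)  # False: same values, distinct logical types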
> > >>>>>> Le 15/04/2019 à 18:45, Francois Saint-Jacques a écrit :
> > >>>>>>> Hello,
> > >>>>>>>
> > >>>>>>> I would like to understand where we stand on logical types and physical types. As I understand it, this proposal is about the physical representation.
> > >>>>>>>
> > >>>>>>> In the context of an execution engine, the concept of logical types becomes more important, as two physical representations might carry the same semantic values, e.g. LargeList and List where all values fit in 32 bits. A more complex example would be an integer array and a dictionary array whose values are integers.
> > >>>>>>>
> > >>>>>>> Is this only relevant for execution engines? What about the (C++) Array.Equals method and related comparison methods? This also touches on the subject of type equality, e.g. dictionaries with different but compatible encodings.
> > >>>>>>>
> > >>>>>>> Jacques, knowing that you worked on Parquet (which follows this model) and Dremio, what is your opinion?
> > >>>>>>>
> > >>>>>>> François
> > >>>>>>>
> > >>>>>>> Some related tickets:
> > >>>>>>> - https://jira.apache.org/jira/browse/ARROW-554
> > >>>>>>> - https://jira.apache.org/jira/browse/ARROW-1741
> > >>>>>>> - https://jira.apache.org/jira/browse/ARROW-3144
> > >>>>>>> - https://jira.apache.org/jira/browse/ARROW-4097
> > >>>>>>> - https://jira.apache.org/jira/browse/ARROW-5052
> > >>>>>>>
> > >>>>>>> On Thu, Apr 11, 2019 at 4:52 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>>> ARROW-4810 [1] and ARROW-750 [2] discuss adding types with 64-bit offsets to Lists, Strings and binary data types.
> > >>>>>>>>
> > >>>>>>>> Philipp started an implementation for the large list type [3] and I hacked together a potentially viable Java implementation [4].
> > >>>>>>>>
> > >>>>>>>> I'd like to kick off the discussion for getting these types voted on. I'm coupling them together because I think there are design considerations for how we evolve Schema.fbs.
> > >>>>>>>>
> > >>>>>>>> There are two proposed options:
> > >>>>>>>>
> > >>>>>>>> 1. The current PR proposal, which adds a new type LargeList:
> > >>>>>>>>
> > >>>>>>>> // List with 64-bit offsets
> > >>>>>>>> table LargeList {}
> > >>>>>>>>
> > >>>>>>>> 2. As François suggested, it might be cleaner to parameterize List with the offset width. I suppose something like:
> > >>>>>>>>
> > >>>>>>>> table List {
> > >>>>>>>>   // only 32 bit and 64 bit are supported.
> > >>>>>>>>   bitWidth: int = 32;
> > >>>>>>>> }
> > >>>>>>>>
> > >>>>>>>> I think Option 2 is cleaner and potentially better long-term, but I think it breaks forward compatibility of the existing Arrow libraries. If we proceed with Option 2, I would advocate making the change to Schema.fbs all at once for all types (assuming we think that 64-bit offsets are desirable for all types), along with future-compatibility checks, to avoid multiple releases where future compatibility is broken (by broken I mean the inability to detect that an implementation is receiving data it can't read). What are people's thoughts on this?
> > >>>>>>>> Also, any other concerns with adding these types?
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> Micah
> > >>>>>>>>
> > >>>>>>>> [1] https://issues.apache.org/jira/browse/ARROW-4810
> > >>>>>>>> [2] https://issues.apache.org/jira/browse/ARROW-750
> > >>>>>>>> [3] https://github.com/apache/arrow/pull/3848
> > >>>>>>>> [4] https://github.com/apache/arrow/commit/03956cac2202139e43404d7a994508080dc2cdd1
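One way to see the forward-compatibility hazard Micah describes for Option 2: an old reader that predates the bitWidth field would get the FlatBuffers default (32) and silently reinterpret an int64 offsets buffer as int32. A minimal sketch of the misread, plain Python with struct only:

    import struct

    # Writer emits int64 offsets [0, 4, 8]; an old reader assumes int32
    # and reinterprets the same 24 bytes as six int32 values.
    offsets64 = struct.pack("<3q", 0, 4, 8)
    misread = struct.unpack("<6i", offsets64)
    print(misread)  # (0, 0, 4, 0, 8, 0) -- silently garbled offsets

Option 1 instead surfaces as an unknown member of the Type union, which an old implementation can at least detect and reject rather than misread.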