Voting sounds like a good idea to me.
Regards

Antoine.


On 25/04/2019 at 06:16, Micah Kornfield wrote:
> Given that conversation seems to have died down on this, would it make
> sense to hold a vote to allow the large variable-width types to be added?
> As discussed previously, PRs would need both a C++ and a Java
> implementation before being merged.
>
> Could a PMC member facilitate this?
>
> Philipp, if approved, do you have bandwidth to finish up the PR for
> LargeList?
>
> Thanks,
> Micah
>
> On Mon, Apr 15, 2019 at 11:16 PM Philipp Moritz <[email protected]> wrote:
>
>> @Micah: I wanted to make it possible to serialize large objects
>> (existing large pandas dataframes with an "object" column, and also
>> large Python types with the pyarrow serialization).
>>
>> On Mon, Apr 15, 2019 at 8:22 PM Micah Kornfield <[email protected]>
>> wrote:
>>
>>> To summarize my understanding of the thread so far, there seems to be
>>> consensus on having a new distinct type for each "large" type.
>>>
>>> There are some reservations about the "large" types being harder to
>>> support in algorithmic implementations.
>>>
>>> I'm curious, Philipp: was there a concrete use case that inspired you
>>> to start the PR?
>>>
>>> Also, this was brought up on another thread, but the utility of the
>>> "large" types might be limited in some languages (e.g. Java) until they
>>> support buffer sizes larger than INT_MAX bytes. I brought this up on
>>> the current PR to decouple Netty and memory management from ArrowBuf
>>> [1], but the consensus seems to be to handle any modifications in
>>> follow-up PRs (if they are agreed upon).
>>>
>>> Anything else people want to discuss before a vote on whether to allow
>>> the additional types into the spec?
>>>
>>> Thanks,
>>> Micah
>>>
>>> [1] https://github.com/apache/arrow/pull/4151
>>>
>>> On Monday, April 15, 2019, Jacques Nadeau <[email protected]> wrote:
>>>
>>>>> I am not Jacques, but I will try to give my own point of view on this.
>>>>
>>>> Thanks for making me laugh :)
>>>>
>>>>> I think that this is unavoidable. Even with batches, take the example
>>>>> of a binary column where the mean payload size is 1 MB: it limits
>>>>> batches to 2048 elements. This can become annoying pretty quickly.
>>>>
>>>> Good example. I'm not sure columnar matters, but I find it more useful
>>>> than others.
>>>>
>>>>> logical types and physical types
>>>>
>>>> TL;DR: it is painful no matter which model you pick.
>>>>
>>>> I definitely think we worked hard to make Arrow different from
>>>> Parquet. It was something I pushed consciously when we started, as I
>>>> found some of the patterns in Parquet to be quite challenging.
>>>> Unfortunately, we went too far in some places in the Java code, which
>>>> tried to parallel the structure of the physical types directly (hence
>>>> the big refactor we did to reduce duplication last year -- props to
>>>> Sidd, Bryan and the others who worked on that). I also think that we
>>>> probably lost as much as we gained with the current model.
>>>>
>>>> I agree with Antoine, both in his clean statement of the approaches
>>>> and in that sticking to the model we have today makes the most sense.
>>>>
>>>> On Mon, Apr 15, 2019 at 11:05 AM Francois Saint-Jacques
>>>> <[email protected]> wrote:
>>>>
>>>>> Thanks for the clarification Antoine, very insightful.
>>>>>
>>>>> I'd also vote for keeping the existing model, for consistency.
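To make the 1 MB binary-column example quoted above concrete, here is a
back-of-the-envelope sketch (illustrative Python, not code from any of the
PRs): a Binary array keeps one running int32 offset per value, so its data
buffer is capped at 2**31 - 1 bytes.

    # Why 32-bit offsets cap batch sizes for large payloads.
    MAX_OFFSET = 2**31 - 1   # largest value an int32 offset can represent
    mean_payload = 1024**2   # 1 MiB per binary value, as in the example

    max_elements = MAX_OFFSET // mean_payload
    print(max_elements)      # 2047, i.e. batches of roughly 2048 elements

With 64-bit offsets the per-buffer cap moves from 2**31 - 1 to 2**63 - 1
bytes.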
>>>>> On Mon, Apr 15, 2019 at 1:40 PM Antoine Pitrou <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am not Jacques, but I will try to give my own point of view on this.
>>>>>>
>>>>>> The distinction between logical and physical types can be modelled
>>>>>> in two different ways:
>>>>>>
>>>>>> 1) A physical type can denote several logical types, but a logical
>>>>>> type can only have a single physical representation. This is
>>>>>> currently the Arrow model.
>>>>>>
>>>>>> 2) A physical type can denote several logical types, and a logical
>>>>>> type can also be denoted by several physical types. This is the
>>>>>> Parquet model.
>>>>>>
>>>>>> (Theoretically, there are two other possible models, but they are
>>>>>> not very interesting to consider, since they don't seem to cater to
>>>>>> concrete use cases.)
>>>>>>
>>>>>> Model 1 is obviously more restrictive, while model 2 is more
>>>>>> flexible. Model 2 could be called "higher level"; you see something
>>>>>> similar if you compare Python's and C++'s typing systems. On the
>>>>>> other hand, model 1 provides a potentially simpler programming model
>>>>>> for implementors of low-level kernels, as you can simply query the
>>>>>> logical type of your data and automatically know its physical type.
>>>>>>
>>>>>> The model chosen for Arrow is ingrained in its API. If we want to
>>>>>> change the model, we had better do it wholesale (implying probably a
>>>>>> large refactoring and a significant number of unavoidable
>>>>>> regressions) to avoid subjecting users to a confusing middle point.
>>>>>>
>>>>>> As a side note, "convertibility" between different types can be a
>>>>>> hairy subject... Having strict boundaries between types avoids being
>>>>>> dragged into it too early.
>>>>>>
>>>>>> To return to the original subject: IMHO, LargeList (resp.
>>>>>> LargeBinary) should be a distinct logical type from List (resp.
>>>>>> Binary), the same way Int64 is a distinct logical type from Int32.
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Antoine.
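Antoine's model 1 can be made concrete with a small pyarrow sketch. This is
illustrative only: it assumes the proposed large types are exposed as
pa.large_list alongside pa.list_, which was still just a pull request at
the time of this thread.

    import pyarrow as pa

    # Under model 1, each logical type has exactly one physical layout,
    # so List and LargeList must be distinct logical types, just as
    # Int32 and Int64 are.
    small = pa.array([[1, 2], [3]], type=pa.list_(pa.int64()))
    large = pa.array([[1, 2], [3]], type=pa.large_list(pa.int64()))

    print(small.type == large.type)  # False: two distinct logical types
    print(small.offsets.type)        # int32 offsets buffer
    print(large.offsets.type)        # int64 offsets buffer

    # A low-level kernel can therefore dispatch on the logical type alone
    # and immediately know the physical offset width.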
>>>>>> On 15/04/2019 at 18:45, Francois Saint-Jacques wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I would like to understand where we stand on logical types and
>>>>>>> physical types. As I understand it, this proposal is for the
>>>>>>> physical representation.
>>>>>>>
>>>>>>> In the context of an execution engine, the concept of logical types
>>>>>>> becomes more important, as two physical representations might carry
>>>>>>> the same semantic values, e.g. LargeList and List where all offsets
>>>>>>> fit in 32 bits. A more complex example would be an integer array and
>>>>>>> a dictionary array whose values are integers.
>>>>>>>
>>>>>>> Is this only relevant for execution engines? What about the (C++)
>>>>>>> Array.Equals method and related comparison methods? This also
>>>>>>> touches on the subject of type equality, e.g. dictionaries with
>>>>>>> different but compatible encodings.
>>>>>>>
>>>>>>> Jacques, knowing that you worked on Parquet (which follows this
>>>>>>> model) and Dremio, what is your opinion?
>>>>>>>
>>>>>>> François
>>>>>>>
>>>>>>> Some related tickets:
>>>>>>> - https://jira.apache.org/jira/browse/ARROW-554
>>>>>>> - https://jira.apache.org/jira/browse/ARROW-1741
>>>>>>> - https://jira.apache.org/jira/browse/ARROW-3144
>>>>>>> - https://jira.apache.org/jira/browse/ARROW-4097
>>>>>>> - https://jira.apache.org/jira/browse/ARROW-5052
>>>>>>>
>>>>>>> On Thu, Apr 11, 2019 at 4:52 AM Micah Kornfield
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> ARROW-4810 [1] and ARROW-750 [2] discuss adding types with 64-bit
>>>>>>>> offsets to Lists, Strings and binary data types.
>>>>>>>>
>>>>>>>> Philipp started an implementation for the large list type [3], and
>>>>>>>> I hacked together a potentially viable Java implementation [4].
>>>>>>>>
>>>>>>>> I'd like to kick off the discussion for getting these types voted
>>>>>>>> on. I'm coupling them together because I think there are design
>>>>>>>> considerations for how we evolve Schema.fbs.
>>>>>>>>
>>>>>>>> There are two proposed options:
>>>>>>>>
>>>>>>>> 1. The current PR proposal, which adds a new type LargeList:
>>>>>>>>
>>>>>>>>    // List with 64-bit offsets
>>>>>>>>    table LargeList {}
>>>>>>>>
>>>>>>>> 2. As François suggested, it might be cleaner to parameterize List
>>>>>>>> with the offset width. I suppose something like:
>>>>>>>>
>>>>>>>>    table List {
>>>>>>>>      // Only 32-bit and 64-bit offsets are supported.
>>>>>>>>      bitWidth: int = 32;
>>>>>>>>    }
>>>>>>>>
>>>>>>>> I think Option 2 is cleaner and potentially better long-term, but I
>>>>>>>> think it breaks forward compatibility for the existing Arrow
>>>>>>>> libraries. If we proceed with Option 2, I would advocate making the
>>>>>>>> change to Schema.fbs all at once for all types (assuming we think
>>>>>>>> that 64-bit offsets are desirable for all types), along with
>>>>>>>> forward-compatibility checks, to avoid multiple releases where
>>>>>>>> forward compatibility is broken (by "broken" I mean the inability
>>>>>>>> to detect that an implementation is receiving data it can't read).
>>>>>>>> What are people's thoughts on this?
>>>>>>>>
>>>>>>>> Also, any other concerns with adding these types?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Micah
>>>>>>>>
>>>>>>>> [1] https://issues.apache.org/jira/browse/ARROW-4810
>>>>>>>> [2] https://issues.apache.org/jira/browse/ARROW-750
>>>>>>>> [3] https://github.com/apache/arrow/pull/3848
>>>>>>>> [4] https://github.com/apache/arrow/commit/03956cac2202139e43404d7a994508080dc2cdd1
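Micah's forward-compatibility concern with Option 2 can be illustrated with
a small Python sketch. This is a hypothetical simulation, not Arrow or
FlatBuffers code: a reader that predates the bitWidth field would fall back
to the default of 32 and silently misread offsets written at 64-bit width,
whereas Option 1's new union member shows up to an old reader as an unknown
type that it can reject outright.

    import struct

    # Offsets [0, 5, 9] as a new writer would emit them at 64-bit width.
    buf = struct.pack("<3q", 0, 5, 9)

    # An old reader unaware of the bitWidth field assumes the 32-bit
    # default and reinterprets the very same bytes:
    print(struct.unpack("<6i", buf))  # (0, 0, 5, 0, 9, 0) -- silent garbage

    # With a distinct LargeList member in the Type union, the old reader
    # instead sees an enum value it does not recognize and can fail loudly.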
