Voting sounds like a good idea to me.
Regards

Antoine.


On 25/04/2019 at 06:16, Micah Kornfield wrote:
> Given that conversation seems to have died down on this, would it make
> sense to hold a vote to allow the large variable-width types to be added?
> As discussed previously, PRs would need both a C++ and a Java
> implementation before being merged.
>
> Could a PMC member facilitate this?
>
> Philipp, if approved, do you have bandwidth to finish up the PR for
> LargeList?
>
> Thanks,
> Micah
>
> On Mon, Apr 15, 2019 at 11:16 PM Philipp Moritz <[email protected]> wrote:
>
>> @Micah: I wanted to make it possible to serialize large objects
>> (existing large pandas dataframes with an "object" column, and also
>> large Python types with the pyarrow serialization).
>>
>> On Mon, Apr 15, 2019 at 8:22 PM Micah Kornfield <[email protected]>
>> wrote:
>>
>>> To summarize my understanding of the thread so far, there seems to be
>>> consensus on having a new distinct type for each "large" type.
>>>
>>> There are some reservations about the "large" types being harder to
>>> support in algorithmic implementations.
>>>
>>> I'm curious, Philipp: was there a concrete use case that inspired you
>>> to start the PR?
>>>
>>> Also, this was brought up on another thread, but the utility of the
>>> "large" types might be limited in some languages (e.g. Java) until they
>>> support buffer sizes larger than INT_MAX bytes. I brought this up on
>>> the current PR to decouple Netty and memory management from ArrowBuf
>>> [1], but the consensus seems to be to handle any modifications in
>>> follow-up PRs (if they are agreed upon).
>>>
>>> Anything else people want to discuss before a vote on whether to allow
>>> the additional types into the spec?
>>>
>>> Thanks,
>>> Micah
>>>
>>> [1] https://github.com/apache/arrow/pull/4151
>>>
>>> On Monday, April 15, 2019, Jacques Nadeau <[email protected]> wrote:
>>>
>>>>> I am not Jacques, but I will try to give my own point of view on this.
>>>>
>>>> Thanks for making me laugh :)
>>>>
>>>>> I think that this is unavoidable. Even with batches, take the example
>>>>> of a binary column where the mean payload size is 1 MB: it limits
>>>>> batches to 2048 elements. This can become annoying pretty quickly.
>>>>
>>>> Good example. I'm not sure columnar matters, but I find it more useful
>>>> than others.
>>>>
>>>>> logical types and physical types
>>>>
>>>> TL;DR: it is painful no matter which model you pick.
>>>>
>>>> I definitely think we worked hard to make Arrow different from
>>>> Parquet. It was something I pushed consciously when we started, as I
>>>> found some of the patterns in Parquet to be quite challenging.
>>>> Unfortunately, we went too far in some places in the Java code, which
>>>> tried to parallel the structure of the physical types directly (hence
>>>> the big refactor we did to reduce duplication last year -- props to
>>>> Sidd, Bryan and the others who worked on that). I also think that we
>>>> probably lost as much as we gained with the current model.
>>>>
>>>> I agree with Antoine, both in his clean statement of the approaches
>>>> and in that sticking to the model we have today makes the most sense.
>>>>
>>>> On Mon, Apr 15, 2019 at 11:05 AM Francois Saint-Jacques
>>>> <[email protected]> wrote:
>>>>
>>>>> Thanks for the clarification Antoine, very insightful.
>>>>>
>>>>> I'd also vote for keeping the existing model, for consistency.
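To make the 1 MB binary-column example quoted above concrete, here is a
back-of-the-envelope sketch (illustrative Python, not code from any of the
PRs): a Binary array keeps one running int32 offset per value, so its data
buffer is capped at 2**31 - 1 bytes.

    # Why 32-bit offsets cap batch sizes for large payloads.
    MAX_OFFSET = 2**31 - 1   # largest value an int32 offset can represent
    mean_payload = 1024**2   # 1 MiB per binary value, as in the example

    max_elements = MAX_OFFSET // mean_payload
    print(max_elements)      # 2047, i.e. batches of roughly 2048 elements

With 64-bit offsets the per-buffer cap moves from 2**31 - 1 to 2**63 - 1
bytes.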
>>>>> On Mon, Apr 15, 2019 at 1:40 PM Antoine Pitrou <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am not Jacques, but I will try to give my own point of view on this.
>>>>>>
>>>>>> The distinction between logical and physical types can be modelled
>>>>>> in two different ways:
>>>>>>
>>>>>> 1) A physical type can denote several logical types, but a logical
>>>>>> type can only have a single physical representation. This is
>>>>>> currently the Arrow model.
>>>>>>
>>>>>> 2) A physical type can denote several logical types, and a logical
>>>>>> type can also be denoted by several physical types. This is the
>>>>>> Parquet model.
>>>>>>
>>>>>> (Theoretically, there are two other possible models, but they are
>>>>>> not very interesting to consider, since they don't seem to cater to
>>>>>> concrete use cases.)
>>>>>>
>>>>>> Model 1 is obviously more restrictive, while model 2 is more
>>>>>> flexible. Model 2 could be called "higher level"; you see something
>>>>>> similar if you compare Python's and C++'s typing systems. On the
>>>>>> other hand, model 1 provides a potentially simpler programming model
>>>>>> for implementors of low-level kernels, as you can simply query the
>>>>>> logical type of your data and automatically know its physical type.
>>>>>>
>>>>>> The model chosen for Arrow is ingrained in its API. If we want to
>>>>>> change the model, we had better do it wholesale (implying probably a
>>>>>> large refactoring and a significant number of unavoidable
>>>>>> regressions) to avoid subjecting users to a confusing middle point.
>>>>>>
>>>>>> As a side note, "convertibility" between different types can be a
>>>>>> hairy subject... Having strict boundaries between types avoids being
>>>>>> dragged into it too early.
>>>>>>
>>>>>> To return to the original subject: IMHO, LargeList (resp.
>>>>>> LargeBinary) should be a distinct logical type from List (resp.
>>>>>> Binary), the same way Int64 is a distinct logical type from Int32.
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Antoine.
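Antoine's model 1 can be made concrete with a small pyarrow sketch. This is
illustrative only: it assumes the proposed large types are exposed as
pa.large_list alongside pa.list_, which was still just a pull request at
the time of this thread.

    import pyarrow as pa

    # Under model 1, each logical type has exactly one physical layout,
    # so List and LargeList must be distinct logical types, just as
    # Int32 and Int64 are.
    small = pa.array([[1, 2], [3]], type=pa.list_(pa.int64()))
    large = pa.array([[1, 2], [3]], type=pa.large_list(pa.int64()))

    print(small.type == large.type)  # False: two distinct logical types
    print(small.offsets.type)        # int32 offsets buffer
    print(large.offsets.type)        # int64 offsets buffer

    # A low-level kernel can therefore dispatch on the logical type alone
    # and immediately know the physical offset width.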
>>>>>> On 15/04/2019 at 18:45, Francois Saint-Jacques wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I would like to understand where we stand on logical types and
>>>>>>> physical types. As I understand it, this proposal is for the
>>>>>>> physical representation.
>>>>>>>
>>>>>>> In the context of an execution engine, the concept of logical types
>>>>>>> becomes more important, as two physical representations might carry
>>>>>>> the same semantic values, e.g. LargeList and List where all offsets
>>>>>>> fit in 32 bits. A more complex example would be an integer array and
>>>>>>> a dictionary array whose values are integers.
>>>>>>>
>>>>>>> Is this only relevant for execution engines? What about the (C++)
>>>>>>> Array.Equals method and related comparison methods? This also
>>>>>>> touches on the subject of type equality, e.g. dictionaries with
>>>>>>> different but compatible encodings.
>>>>>>>
>>>>>>> Jacques, knowing that you worked on Parquet (which follows this
>>>>>>> model) and Dremio, what is your opinion?
>>>>>>>
>>>>>>> François
>>>>>>>
>>>>>>> Some related tickets:
>>>>>>> - https://jira.apache.org/jira/browse/ARROW-554
>>>>>>> - https://jira.apache.org/jira/browse/ARROW-1741
>>>>>>> - https://jira.apache.org/jira/browse/ARROW-3144
>>>>>>> - https://jira.apache.org/jira/browse/ARROW-4097
>>>>>>> - https://jira.apache.org/jira/browse/ARROW-5052
>>>>>>>
>>>>>>> On Thu, Apr 11, 2019 at 4:52 AM Micah Kornfield
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> ARROW-4810 [1] and ARROW-750 [2] discuss adding types with 64-bit
>>>>>>>> offsets to Lists, Strings and binary data types.
>>>>>>>>
>>>>>>>> Philipp started an implementation for the large list type [3], and
>>>>>>>> I hacked together a potentially viable Java implementation [4].
>>>>>>>>
>>>>>>>> I'd like to kick off the discussion for getting these types voted
>>>>>>>> on. I'm coupling them together because I think there are design
>>>>>>>> considerations for how we evolve Schema.fbs.
>>>>>>>>
>>>>>>>> There are two proposed options:
>>>>>>>>
>>>>>>>> 1. The current PR proposal, which adds a new type LargeList:
>>>>>>>>
>>>>>>>>    // List with 64-bit offsets
>>>>>>>>    table LargeList {}
>>>>>>>>
>>>>>>>> 2. As François suggested, it might be cleaner to parameterize List
>>>>>>>> with the offset width. I suppose something like:
>>>>>>>>
>>>>>>>>    table List {
>>>>>>>>      // Only 32-bit and 64-bit offsets are supported.
>>>>>>>>      bitWidth: int = 32;
>>>>>>>>    }
>>>>>>>>
>>>>>>>> I think Option 2 is cleaner and potentially better long-term, but I
>>>>>>>> think it breaks forward compatibility for the existing Arrow
>>>>>>>> libraries. If we proceed with Option 2, I would advocate making the
>>>>>>>> change to Schema.fbs all at once for all types (assuming we think
>>>>>>>> that 64-bit offsets are desirable for all types), along with
>>>>>>>> forward-compatibility checks, to avoid multiple releases where
>>>>>>>> forward compatibility is broken (by "broken" I mean the inability
>>>>>>>> to detect that an implementation is receiving data it can't read).
>>>>>>>> What are people's thoughts on this?
>>>>>>>>
>>>>>>>> Also, any other concerns with adding these types?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Micah
>>>>>>>>
>>>>>>>> [1] https://issues.apache.org/jira/browse/ARROW-4810
>>>>>>>> [2] https://issues.apache.org/jira/browse/ARROW-750
>>>>>>>> [3] https://github.com/apache/arrow/pull/3848
>>>>>>>> [4] https://github.com/apache/arrow/commit/03956cac2202139e43404d7a994508080dc2cdd1
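Micah's forward-compatibility concern with Option 2 can be illustrated with
a small Python sketch. This is a hypothetical simulation, not Arrow or
FlatBuffers code: a reader that predates the bitWidth field would fall back
to the default of 32 and silently misread offsets written at 64-bit width,
whereas Option 1's new union member shows up to an old reader as an unknown
type that it can reject outright.

    import struct

    # Offsets [0, 5, 9] as a new writer would emit them at 64-bit width.
    buf = struct.pack("<3q", 0, 5, 9)

    # An old reader unaware of the bitWidth field assumes the 32-bit
    # default and reinterprets the very same bytes:
    print(struct.unpack("<6i", buf))  # (0, 0, 5, 0, 9, 0) -- silent garbage

    # With a distinct LargeList member in the Type union, the old reader
    # instead sees an enum value it does not recognize and can fail loudly.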
