Re: [DISCUSS] Dictionary Encoding Clarifications/Future Proofing

Antoine Pitrou Thu, 14 Nov 2019 02:02:52 -0800


Right.  The dictionaries can be found from the file footer, so it seems ok.


Thank you

Regards

Antoine.


On 14/11/2019 at 07:11, Micah Kornfield wrote:
> I'll add, in response to:
> 
>> If so, how does this play with the fact that there potentially are delta
>> dictionaries in the "stream"?
> 
> In this case, the important feature is that the dictionary batches have an
> explicit ordering in the file format based on metadata, so their ordering
> in the "stream" is largely irrelevant.  As Wes pointed out, the most
> convenient implementation for this would load all dictionary batches
> before doing random access (and would be very similar to the stream
> code).
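
To make the "load everything first, then read in any order" idea concrete, here is a minimal Python sketch. It is not the Arrow API: the `file_blocks` and `footer` layout, the offsets, and both helper functions are hypothetical stand-ins for the real file structures, and dictionaries are modelled as plain lists.

```python
# Hypothetical in-memory model of a file: blocks stored at arbitrary offsets,
# plus a footer listing dictionary blocks and record batch blocks in order.
file_blocks = {
    100: ("dictionary", {"id": 0, "is_delta": False, "values": ["a", "b"]}),
    200: ("record_batch", [0, 1, 0]),
    300: ("dictionary", {"id": 0, "is_delta": True, "values": ["c"]}),
    400: ("record_batch", [2, 0]),
}
footer = {"dictionaries": [100, 300], "record_batches": [200, 400]}


def load_dictionaries(blocks, footer):
    """Apply every dictionary batch in footer order, ignoring physical layout."""
    dictionaries = {}
    for offset in footer["dictionaries"]:
        kind, payload = blocks[offset]
        assert kind == "dictionary"
        if payload["is_delta"]:
            # A delta appends to the dictionary loaded so far.
            dictionaries[payload["id"]] = dictionaries[payload["id"]] + payload["values"]
        else:
            dictionaries[payload["id"]] = list(payload["values"])
    return dictionaries


def read_record_batch(blocks, footer, dictionaries, i, dict_id=0):
    """Decode the i-th record batch (a list of indices) using loaded dictionaries."""
    kind, indices = blocks[footer["record_batches"][i]]
    assert kind == "record_batch"
    return [dictionaries[dict_id][j] for j in indices]


dictionaries = load_dictionaries(file_blocks, footer)
# Random access: the second batch is read before the first.
second = read_record_batch(file_blocks, footer, dictionaries, 1)
first = read_record_batch(file_blocks, footer, dictionaries, 0)
```

Because the dictionaries are fully assembled before any record batch is touched, the order in which record batches are read does not matter in this model.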
> 
> Does this make sense?
> 
> 
> On Tue, Nov 12, 2019 at 2:01 PM Wes McKinney <[email protected]> wrote:
> 
>> Hi Antoine,
>>
>> Each *record batch* is intended to be readable in random order. To read any
>> record batch requires loading the dictionaries indicated in the schema, so
>> appending the deltas as part of this process does not seem like it would
>> introduce hardship given that such logic is needed to properly handle the
>> stream format. Dictionary replacements in the file format (at least as
>> currently conceived) do not seem possible.
>>
>>
>> On Tue, Nov 12, 2019, 10:13 AM Antoine Pitrou <[email protected]> wrote:
>>
>>>
>>> Hi,
>>>
>>> Sorry for the delay.
>>>
>>> My high-level question is the following:  is the file format intended to
>>> be readable in random order (rather than having to read through it in
>>> sequence as with the stream format)?  If so, how does this play with the
>>> fact that there potentially are delta dictionaries in the "stream"?
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>> On 30/10/2019 at 21:11, Wes McKinney wrote:
>>>> Returning to this discussion, as there seems to be a lack of consensus
>>>> in the vote thread.
>>>>
>>>> Copying Micah's proposals from the VOTE thread here, I wanted to state
>>>> my opinions so we can discuss further and see where there is potential
>>>> disagreement.
>>>>
>>>> 1.  It is not required that all dictionary batches occur at the beginning
>>>> of the IPC stream format (if the first record batch has an all-null
>>>> dictionary-encoded column, the null column's dictionary might not be sent
>>>> until later in the stream).
>>>>
>>>> This seems preferable to requiring a placeholder empty dictionary
>>>> batch. This does mean more to test, but the integration tests will
>>>> force the issue.
>>>>
>>>> 2.  A second dictionary batch for the same ID that is not a "delta batch"
>>>> in an IPC stream indicates the dictionary should be replaced.
>>>>
>>>> Agree.
>>>>
>>>> 3.  Clarifies that the file format can only contain 1 "NON-delta"
>>>> dictionary batch and multiple "delta" dictionary batches.
>>>>
>>>> Agree -- it is also worth stating explicitly that dictionary
>>>> replacements are not allowed in the file format.
>>>>
>>>> In the file format, all the dictionaries must be "loaded" up front.
>>>> The code path for loading the dictionaries ideally should use nearly
>>>> the same code as the stream-reader code that sees follow-up dictionary
>>>> batches interspersed in the stream. The only downside is that it will
>>>> not be possible to exactly preserve the dictionary "state" as of each
>>>> record batch being written.
>>>>
>>>> So if we had a file containing
>>>>
>>>> DICTIONARY ID=0
>>>> RECORD BATCH
>>>> RECORD BATCH
>>>> DICTIONARY DELTA ID=0
>>>> RECORD BATCH
>>>> RECORD BATCH
>>>>
>>>> Then after processing/loading the dictionaries, the first two record
>>>> batches will have a dictionary that is "larger" (on account of the
>>>> delta) than when they were written. Since dictionaries are
>>>> fundamentally about data representation, they still represent the same
>>>> data so I think this is acceptable.
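
The claim above, that the earlier record batches still represent the same data after the delta is applied, can be checked with a small Python sketch. The values and the `decode` helper are made up for illustration; a dictionary is modelled as a list and a record batch as a list of indices, with `None` standing for a null slot.

```python
def decode(indices, dictionary):
    """Map dictionary indices back to values; None stands for a null slot."""
    return [None if i is None else dictionary[i] for i in indices]


base = ["a", "b"]   # DICTIONARY ID=0
delta = ["c"]       # DICTIONARY DELTA ID=0

early_batch = [0, 1, 0, None]   # written before the delta
late_batch = [2, 0, 2]          # written after the delta

# Dictionary state when the early batches were written ...
dict_when_written = list(base)
# ... versus the state after loading all dictionary batches up front.
dict_fully_loaded = base + delta

early_then = decode(early_batch, dict_when_written)
early_now = decode(early_batch, dict_fully_loaded)
late_now = decode(late_batch, dict_fully_loaded)
```

A delta only appends entries, so every index that was valid when the early batches were written still points at the same value; the dictionary is larger, but the decoded data is identical.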
>>>>
>>>> 4.  Add an enum to dictionary metadata for possible future changes in what
>>>> format dictionary batches can be sent (the most likely would be an array
>>>> Map<Int, Value>).  An enum is needed as a placeholder to allow for forward
>>>> compatibility past the release 1.0.0.
>>>>
>>>> I'm least sure about this but I do not think it is harmful to have a
>>>> forward-compatible "escape hatch" for future evolutions in dictionary
>>>> encoding.
>>>>
>>>> On Wed, Oct 16, 2019 at 2:57 AM Micah Kornfield <[email protected]>
>>>> wrote:
>>>>>
>>>>> I'll plan on starting a vote in the next day or two if there are no further
>>>>> objections/comments.
>>>>>
>>>>> On Sun, Oct 13, 2019 at 11:06 AM Micah Kornfield <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I think the only point raised on the PR that is worth discussing is
>>>>>> assumptions about dictionaries at the beginning of streams.
>>>>>>
>>>>>> There are two options:
>>>>>> 1.  Based on the current wording, it does not seem that all dictionaries
>>>>>> need to be at the beginning of the stream if they aren't made use of in the
>>>>>> first record batch (i.e. a dictionary encoded column is all null in the
>>>>>> first record batch).
>>>>>> 2.  We require a dictionary batch for each dictionary at the beginning of
>>>>>> the stream (and require implementations to send an empty batch if they
>>>>>> don't have the dictionary available).
>>>>>>
>>>>>> The current proposal in the PR is option #1.
>>>>>>
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>> On Sat, Oct 5, 2019 at 4:01 PM Micah Kornfield <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I've opened a pull request [1] to clarify some recent conversations about
>>>>>>> semantics/edge cases for dictionary encoding [2][3] around interleaved
>>>>>>> batches and when isDelta=False.
>>>>>>>
>>>>>>> Specifically, it proposes isDelta=False indicates dictionary
>>>>>>> replacement.  For the file format, only one isDelta=False batch is allowed
>>>>>>> per file and isDelta=true batches are applied in the order supplied in the
>>>>>>> file footer.
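
As a rough illustration of the semantics proposed above, the following Python sketch models how a reader might apply dictionary batches in a stream and how a file could be checked for the single non-delta rule. None of this is the Arrow API: `DictionaryBatch`, `apply_dictionary_batch` and `check_file_dictionaries` are hypothetical names. Two things are assumptions rather than part of the proposal as quoted: that a delta for an ID with no existing dictionary is an error, and that the "only one isDelta=False batch" rule is counted per dictionary ID.

```python
from dataclasses import dataclass


@dataclass
class DictionaryBatch:
    dict_id: int
    values: list
    is_delta: bool = False


def apply_dictionary_batch(state, batch):
    """Stream semantics: a non-delta batch replaces, a delta batch appends."""
    current = state.get(batch.dict_id)
    if batch.is_delta:
        if current is None:
            # Assumption: a delta with nothing to extend is rejected.
            raise ValueError(f"delta for unknown dictionary {batch.dict_id}")
        state[batch.dict_id] = current + list(batch.values)
    else:
        state[batch.dict_id] = list(batch.values)


def check_file_dictionaries(batches):
    """File semantics: at most one non-delta batch (assumed: per dictionary ID)."""
    seen = set()
    for batch in batches:
        if not batch.is_delta:
            if batch.dict_id in seen:
                raise ValueError(
                    f"dictionary {batch.dict_id} is replaced; not allowed in a file"
                )
            seen.add(batch.dict_id)


state = {}
apply_dictionary_batch(state, DictionaryBatch(0, ["a", "b"]))
apply_dictionary_batch(state, DictionaryBatch(0, ["c"], is_delta=True))
after_delta = list(state[0])
apply_dictionary_batch(state, DictionaryBatch(0, ["x"]))   # replacement
after_replacement = list(state[0])
```

The same sequence of batches that is valid in a stream (initial, delta, replacement) would fail `check_file_dictionaries`, since it contains two non-delta batches for ID 0.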
>>>>>>>
>>>>>>> In addition, I've added a new enum to DictionaryEncoding to preserve
>>>>>>> future compatibility in case we want to expand dictionary encoding to be an
>>>>>>> explicit mapping from "ID" to "VALUE" as discussed in [4].
>>>>>>>
>>>>>>> Once people have had a chance to review and come to a consensus, I will
>>>>>>> call a formal vote to approve and commit the change.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Micah
>>>>>>>
>>>>>>> [1] https://github.com/apache/arrow/pull/5585
>>>>>>> [2] https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E
>>>>>>> [3] https://lists.apache.org/thread.html/5c3c9346101df8d758e24664638e8ada0211d310ab756a89cde3786a@%3Cdev.arrow.apache.org%3E
>>>>>>> [4] https://lists.apache.org/thread.html/15a4810589b2eb772bce5b2372970d9d93badbd28999a1bbe2af418a@%3Cdev.arrow.apache.org%3E
>>>>>>>
>>>>>>>
>>>
>>
>
