Right. The dictionaries can be found from the file footer, so it seems ok.
Thank you Regards Antoine. Le 14/11/2019 à 07:11, Micah Kornfield a écrit : > I'll add for: > > If so, how does this play with the fact that there potentially are delta >> dictionaries in the "stream"? > > That in this case the important feature is the dictionary batches have an > explicit ordering in the file format based on metadata. So their ordering > in the "stream" is largely irrelevant. As Wes pointed out the most > convenient implementation for this would have to load all dictionary > batches before doing random access (and would be very similar to the stream > code). > > Does this make sense? > > > On Tue, Nov 12, 2019 at 2:01 PM Wes McKinney <wesmck...@gmail.com> wrote: > >> Hi Antoine, >> >> Each *record batch* is intended to be readable in random order. To read any >> record batch requires loading the dictionaries indicated in the schema, so >> appending the deltas as part of this process does not seem like it would >> introduce hardship given that such logic is needed to properly handle the >> stream format. Dictionary replacements in the file format (at least as >> currently conceived) does not seem possible. >> >> >> On Tue, Nov 12, 2019, 10:13 AM Antoine Pitrou <anto...@python.org> wrote: >> >>> >>> Hi, >>> >>> Sorry for the delay. >>> >>> My high-level question is the following: is the file format intended to >>> be readable in random order (rather than having to read through it in >>> sequence as with the stream format)? If so, how does this play with the >>> fact that there potentially are delta dictionaries in the "stream"? >>> >>> Regards >>> >>> Antoine. >>> >>> >>> Le 30/10/2019 à 21:11, Wes McKinney a écrit : >>>> Returning to this discussion as there seems to lack consensus in the >>> vote thread >>>> >>>> Copying Micah's proposals in the VOTE thread here, I wanted to state >>>> my opinions so we can discuss further and see where there is potential >>>> disagreement >>>> >>>> 1. It is not required that all dictionary batches occur at the >> beginning >>>> of the IPC stream format (if a the first record batch has an all null >>>> dictionary encoded column, the null column's dictionary might not be >> sent >>>> until later in the stream). >>>> >>>> This seems preferable to requiring a placeholder empty dictionary >>>> batch. This does mean more to test but the integration tests will >>>> force the issue >>>> >>>> 2. A second dictionary batch for the same ID that is not a "delta >> batch" >>>> in an IPC stream indicates the dictionary should be replaced. >>>> >>>> Agree. >>>> >>>> 3. Clarifies that the file format, can only contain 1 "NON-delta" >>>> dictionary batch and multiple "delta" dictionary batches. >>>> >>>> Agree -- it is also worth stating explicitly that dictionary >>>> replacements are not allowed in the file format. >>>> >>>> In the file format, all the dictionaries must be "loaded" up front. >>>> The code path for loading the dictionaries ideally should use nearly >>>> the same code as the stream-reader code that sees follow-up dictionary >>>> batches interspersed in the stream. The only downside is that it will >>>> not be possible to exactly preserve the dictionary "state" as of each >>>> record batch being written. >>>> >>>> So if we had a file containing >>>> >>>> DICTIONARY ID=0 >>>> RECORD BATCH >>>> RECORD BATCH >>>> DICTIONARY DELTA ID=0 >>>> RECORD BATCH >>>> RECORD BATCH >>>> >>>> Then after processing/loading the dictionaries, the first two record >>>> batches will have a dictionary that is "larger" (on account of the >>>> delta) than when they were written. Since dictionaries are >>>> fundamentally about data representation, they still represent the same >>>> data so I think this is acceptable. >>>> >>>> 4. Add an enum to dictionary metadata for possible future changes in >>> what >>>> format dictionary batches can be sent. (the most likely would be an >> array >>>> Map<Int, Value>). An enum is needed as a place holder to allow for >>> forward >>>> compatibility past the release 1.0.0. >>>> >>>> I'm least sure about this but I do not think it is harmful to have a >>>> forward-compatible "escape hatch" for future evolutions in dictionary >>>> encoding. >>>> >>>> On Wed, Oct 16, 2019 at 2:57 AM Micah Kornfield <emkornfi...@gmail.com >>> >>> wrote: >>>>> >>>>> I'll plan on starting a vote in the next day or two if there are no >>> further >>>>> objections/comments. >>>>> >>>>> On Sun, Oct 13, 2019 at 11:06 AM Micah Kornfield < >> emkornfi...@gmail.com >>>> >>>>> wrote: >>>>> >>>>>> I think the only point asked on the PR that I think is worth >>> discussing is >>>>>> assumptions about dictionaries at the beginning of streams. >>>>>> >>>>>> There are two options: >>>>>> 1. Based on the current wording, it does not seem that all >>> dictionaries >>>>>> need to be at the beginning of the stream if they aren't made use of >>> in the >>>>>> first record batch (i.e. a dictionary encoded column is all null in >> the >>>>>> first record batch). >>>>>> 2. We require a dictionary batch for each dictionary at the >> beginning >>> of >>>>>> the stream (and require implementations to send an empty batch if >> they >>>>>> don't have the dictionary available). >>>>>> >>>>>> The current proposal in the PR is option #1. >>>>>> >>>>>> Thanks, >>>>>> Micah >>>>>> >>>>>> On Sat, Oct 5, 2019 at 4:01 PM Micah Kornfield < >> emkornfi...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> I've opened a pull request [1] to clarify some recent conversations >>> about >>>>>>> semantics/edge cases for dictionary encoding [2][3] around >> interleaved >>>>>>> batches and when isDelta=False. >>>>>>> >>>>>>> Specifically, it proposes isDelta=False indicates dictionary >>>>>>> replacement. For the file format, only one isDelta=False batch is >>> allowed >>>>>>> per file and isDelta=true batches are applied in the order supplied >>> file >>>>>>> footer. >>>>>>> >>>>>>> In addition, I've added a new enum to DictionaryEncoding to preserve >>>>>>> future compatibility in case we want to expand dictionary encoding >> to >>> be an >>>>>>> explicit mapping from "ID" to "VALUE" as discussed in [4]. >>>>>>> >>>>>>> Once people have had a change to review and come to a consensus. I >>> will >>>>>>> call a formal vote to approve the change commit the change. >>>>>>> >>>>>>> Thanks, >>>>>>> Micah >>>>>>> >>>>>>> [1] https://github.com/apache/arrow/pull/5585 >>>>>>> [2] >>>>>>> >>> >> https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E >>>>>>> [3] >>>>>>> >>> >> https://lists.apache.org/thread.html/5c3c9346101df8d758e24664638e8ada0211d310ab756a89cde3786a@%3Cdev.arrow.apache.org%3E >>>>>>> [4] >>>>>>> >>> >> https://lists.apache.org/thread.html/15a4810589b2eb772bce5b2372970d9d93badbd28999a1bbe2af418a@%3Cdev.arrow.apache.org%3E >>>>>>> >>>>>>> >>> >> >