[ 
https://issues.apache.org/jira/browse/ARROW-10121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204133#comment-17204133
 ] 

Wes McKinney commented on ARROW-10121:
--------------------------------------

I would have thought this would have been addressed by the delta dictionary work, 
but it seems that it only covered the read side, not the write side.

When encountering an unequal dictionary (detected, at minimum, by comparing the 
memory address of the dictionary):

* At minimum we need to determine whether it's a permutation, a delta, or a 
dictionary prefix (see the sketch after this list).
* If it's a permutation (neither a delta nor a subdictionary) and we are writing a 
stream (non-file), then we can either permute the indices (and, if necessary, 
write a delta dictionary if there are new values) or write a dictionary 
replacement. Dictionary replacement is the most "preserving" option.
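
For illustration, here is a rough sketch of that classification step on the write 
side, in Python against the pyarrow Array API. The function name and the returned 
labels are hypothetical; the IPC writer does not expose anything like this, so 
this is only a sketch of the decision, not an implementation:

{code}
import pyarrow as pa

def classify_dictionary(prev, curr):
    # Hypothetical helper: decide how the dictionary of the batch being
    # written (curr) relates to the one previously emitted (prev).
    if prev is curr:
        # Same memory address -- nothing new needs to be written.
        return "identical"
    if len(curr) >= len(prev) and curr.slice(0, len(prev)).equals(prev):
        # The old dictionary is a prefix of the new one: the tail can be
        # emitted as a delta dictionary batch; existing indices stay valid.
        return "delta"
    prev_vals, curr_vals = set(prev.to_pylist()), set(curr.to_pylist())
    if prev_vals == curr_vals:
        # Same values in a different order: indices could be permuted to
        # match the old dictionary without rewriting it.
        return "permutation"
    if prev_vals <= curr_vals:
        # Reordered plus new values: permute the indices and emit a delta
        # for the values not yet seen.
        return "permutation-plus-delta"
    # Otherwise only a full dictionary replacement will do (allowed in the
    # stream format but not in the file format).
    return "replacement"
{code}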

> [C++][Python] Variable dictionaries do not survive roundtrip to IPC stream
> --------------------------------------------------------------------------
>
>                 Key: ARROW-10121
>                 URL: https://issues.apache.org/jira/browse/ARROW-10121
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Wes McKinney
>            Assignee: Antoine Pitrou
>            Priority: Blocker
>             Fix For: 2.0.0
>
>
> Failing test case (from dev@ 
> https://lists.apache.org/thread.html/r338942b4e9f9316b48e87aab41ac49c7ffedd45733d4a6349523b7eb%40%3Cdev.arrow.apache.org%3E)
> {code}
> import pyarrow as pa
> from io import BytesIO
> pa.__version__
> schema = pa.schema([pa.field('foo', pa.int32()), pa.field('bar', 
> pa.dictionary(pa.int32(), pa.string()))] )
> r1 = pa.record_batch(
>     [
>         [1, 2, 3, 4, 5],
>         pa.array(["a", "b", "c", "d", "e"]).dictionary_encode()
>     ],
>     schema
> )
> r1.validate()
> r2 = pa.record_batch(
>     [
>         [1, 2, 3, 4, 5],
>         pa.array(["c", "c", "e", "f", "g"]).dictionary_encode()
>     ],
>     schema
> )
> r2.validate()
> assert r1.column(1).dictionary != r2.column(1).dictionary
> sink = pa.BufferOutputStream()
> writer = pa.RecordBatchStreamWriter(sink, schema)
> writer.write(r1)
> writer.write(r2)
> serialized = BytesIO(sink.getvalue().to_pybytes())
> stream = pa.ipc.open_stream(serialized)
> deserialized = []
> while True:
>     try:
>         deserialized.append(stream.read_next_batch())
>     except StopIteration:
>         break
> assert deserialized[1][1].to_pylist() == r2[1].to_pylist()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
