[ https://issues.apache.org/jira/browse/ARROW-10406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17259261#comment-17259261 ]
Antoine Pitrou edited comment on ARROW-10406 at 1/5/21, 10:40 PM:
------------------------------------------------------------------
Hmm, what does this have to do with the CSV reader? You can perfectly well
write that data to an IPC stream (even one stored as an on-disk file, in the
general sense) or to a Parquet file. You just can't write it to an IPC file
(as in the "IPC file format").
IMO, the issue is with how dictionary mapping is defined in the IPC protocol.
If each dictionary batch carried its own unique id (instead of dictionary ids
being assigned in the schema), supporting replacement would probably be easy.
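A minimal pyarrow sketch of the contrast (the batch contents are made up for illustration; the stream writer accepts a replacement dictionary, while the file writer rejects it):
{code:python}
import pyarrow as pa

# Two batches whose dictionary-encoded column carries *different* dictionaries.
batch1 = pa.record_batch(
    [pa.array(["red", "green"]).dictionary_encode()], names=["color"])
batch2 = pa.record_batch(
    [pa.array(["blue", "yellow"]).dictionary_encode()], names=["color"])

# IPC stream format: dictionary replacement is allowed; the second batch is
# preceded by a replacement dictionary message.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch1.schema) as writer:
    writer.write_batch(batch1)
    writer.write_batch(batch2)

# IPC file format: the same sequence is rejected.
sink = pa.BufferOutputStream()
writer = pa.ipc.new_file(sink, batch1.schema)
writer.write_batch(batch1)
try:
    writer.write_batch(batch2)
    writer.close()
except pa.ArrowInvalid as exc:
    # "Dictionary replacement detected when writing IPC file format. ..."
    print(exc)
{code}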
> [Format] Support dictionary replacement in the IPC file format
> --------------------------------------------------------------
>
> Key: ARROW-10406
> URL: https://issues.apache.org/jira/browse/ARROW-10406
> Project: Apache Arrow
> Issue Type: Wish
> Components: Format
> Reporter: Neal Richardson
> Priority: Major
>
> I read a big (taxi) csv file and specified that I wanted to dictionary-encode
> some columns. The resulting Table has ChunkedArrays with 1604 chunks. When I
> go to write this Table to the IPC file format (write_feather), I get an
> error:
> {code}
> Invalid: Dictionary replacement detected when writing IPC file format.
> Arrow IPC files only support a single dictionary for a given field across
> all batches.
> {code}
> I can write this to Parquet and read it back in, and the roundtrip of the
> data is correct. We should be able to do this in IPC too.
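For reference, a minimal pyarrow sketch of the reported roundtrip (file names are hypothetical; at the time of the report the Feather/IPC-file write fails while the Parquet write succeeds):
{code:python}
import pyarrow.csv as csv
import pyarrow.feather as feather
import pyarrow.parquet as pq

# Hypothetical input; any large CSV whose string columns repeat values will
# produce a many-chunk Table with one dictionary per chunk.
convert_options = csv.ConvertOptions(auto_dict_encode=True)
table = csv.read_csv("taxi.csv", convert_options=convert_options)

# Parquet works: dictionary encoding there is per column chunk, not global.
pq.write_table(table, "taxi.parquet")

# Feather V2 is the IPC file format, so this raises:
#   ArrowInvalid: Dictionary replacement detected when writing IPC file format.
feather.write_feather(table, "taxi.feather")
{code}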