Todd West created ARROW-17391:
---------------------------------
Summary: arrow::read_feather() cannot read DictionaryArray written
from C#
Key: ARROW-17391
URL: https://issues.apache.org/jira/browse/ARROW-17391
Project: Apache Arrow
Issue Type: Bug
Components: C#, R
Affects Versions: 9.0.1
Reporter: Todd West
Fix For: 9.0.1
This applies to Arrow 9.0.0, both the C# NuGet package and the R package, but
for some reason 9.0.0 isn't in the issue dropdowns' list of released versions.
It also appears the [implementation status
page|https://arrow.apache.org/docs/status.html#ipc-format] may be stale, as the
C# source contains
[DictionaryArray|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Arrays/DictionaryArray.cs]
and stepping through
[ArrowStreamWriter|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs]
in the debugger confirms that it correctly receives both the dictionary index
and value arrays it's given, and that its flags and data structures update
accordingly, on the code paths which write a [dictionary
batch|https://arrow.apache.org/docs/format/Columnar.html]. However, on the R
side, read_feather() fails with
{{Error: Key error: Dictionary with id 1 not found}}
So it appears most likely that either C# isn't properly emitting the dictionary
batch, despite seeming to have all the code to do so, or something's going
wrong on the reading side, in the C++ layers under R.
Setup on the C# side is simple:
{{ public static DictionaryArray CreateStringTable(Memory<byte> indices, IList<string> values)}}
{{ {}}
{{     StringArray.Builder valueArray = new();}}
{{     for (int valueIndex = 0; valueIndex < values.Count; ++valueIndex)}}
{{     {}}
{{         valueArray.Append(values[valueIndex]);}}
{{     }}}
{{     UInt8Array indexArray = new(ArrowArrayExtensions.WrapInArrayData(UInt8Type.Default, indices, indices.Length));}}
{{     return new DictionaryArray(new(UInt8Type.Default, StringType.Default, false), indexArray, valueArray.Build());}}
{{ }}}
as is the R:
{{ library(arrow)}}
{{ foo = read_feather("test.feather")}}
If I drop the dictionary column, the two Arrow implementations interoperate
without difficulty. The same holds if I write only the indices as a UInt8
column. So the issue here is clearly specific to the use of DictionaryArray.
I've also tried other index sizes, so it doesn't appear specific to UInt8.
I'm therefore left with two questions:
1) Does DictionaryArray have working use cases in 9.0.0?
2) If what I'm doing isn't supposed to work yet, or I'm not setting up the data
structures correctly (there's no C# DictionaryArray example [on
github|https://github.com/apache/arrow/tree/master/csharp/examples]), is there
an array level workaround?
There's only one string table in this schema and it's typically tiny (five
values or fewer), so storing its values in the schema metadata is a viable
workaround, albeit an inelegant one.
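A minimal sketch of that metadata workaround, written in Python purely for illustration. It assumes the values are JSON-encoded under a single metadata key (Arrow schema metadata holds string key/value pairs); the key name {{string_table}} and both helper functions are hypothetical, not part of any Arrow API.

```python
import json

# Hypothetical metadata key; any string key unique within the schema would do.
STRING_TABLE_KEY = "string_table"


def encode_string_table(values):
    """Pack the dictionary values into a string-to-string metadata map."""
    return {STRING_TABLE_KEY: json.dumps(list(values))}


def decode_column(indices, metadata):
    """Map raw UInt8 indices back to strings using the table in the metadata."""
    table = json.loads(metadata[STRING_TABLE_KEY])
    return [table[i] for i in indices]


metadata = encode_string_table(["oak", "pine", "fir"])
print(decode_column(bytes([2, 0, 1, 0]), metadata))
# prints ['fir', 'oak', 'pine', 'oak']
```

The writer would then emit only the plain UInt8 index column, which already round-trips between C# and R, and the reader would apply the lookup after loading.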
I'm not seeing a Feather file viewer available but, if there is one, I'd be
happy to take a closer look. I can also link the sources after they've been
committed and pushed, which should be by the end of the day tomorrow.
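Absent a viewer, one way to check whether a dictionary batch actually made it into the file is to walk the IPC messages directly. Below is a rough sketch in Python (standard library only), assuming the file uses the Arrow IPC file format (Feather V2) as described in the columnar format documentation: an {{ARROW1}} magic, a sequence of encapsulated messages, then a footer. {{list_ipc_messages}} is a hypothetical helper, not part of any Arrow package. It reads only each message's header type and body length from the flatbuffer metadata and does not decode the footer, so it cannot say whether the footer also references the dictionary blocks.

```python
import struct

MAGIC = b"ARROW1"
# MessageHeader union values from Arrow's Message.fbs.
HEADER_NAMES = {0: "NONE", 1: "Schema", 2: "DictionaryBatch",
                3: "RecordBatch", 4: "Tensor", 5: "SparseTensor"}


def _field_offset(buf, table_pos, slot):
    """Offset of a flatbuffer table field relative to table_pos (0 if absent)."""
    vtable_pos = table_pos - struct.unpack_from("<i", buf, table_pos)[0]
    vtable_size = struct.unpack_from("<H", buf, vtable_pos)[0]
    entry = 4 + 2 * slot  # skip the vtable-size and table-size fields
    if entry + 2 > vtable_size:
        return 0
    return struct.unpack_from("<H", buf, vtable_pos + entry)[0]


def list_ipc_messages(data):
    """Return (header type name, body length) for each message in an IPC file."""
    if data[:6] != MAGIC or data[-6:] != MAGIC:
        raise ValueError("not an Arrow IPC file (Feather V2)")
    # Trailer layout: <footer> <int32 footer size> <magic>.
    footer_size = struct.unpack_from("<i", data, len(data) - 10)[0]
    end = len(data) - 10 - footer_size
    pos = 8  # leading magic plus two padding bytes
    messages = []
    while pos + 4 <= end:
        (length,) = struct.unpack_from("<i", data, pos)
        pos += 4
        if length == -1:  # 0xFFFFFFFF continuation marker, then the real length
            (length,) = struct.unpack_from("<i", data, pos)
            pos += 4
        if length == 0:  # end-of-stream marker
            break
        meta = data[pos:pos + length]
        table_pos = struct.unpack_from("<I", meta, 0)[0]
        type_off = _field_offset(meta, table_pos, 1)  # Message.header_type
        body_off = _field_offset(meta, table_pos, 3)  # Message.bodyLength
        header_type = meta[table_pos + type_off] if type_off else 0
        body_len = (struct.unpack_from("<q", meta, table_pos + body_off)[0]
                    if body_off else 0)
        messages.append((HEADER_NAMES.get(header_type, "unknown"), body_len))
        pos += length + body_len  # metadata length already includes its padding
    return messages
```

Passing it the bytes of {{test.feather}} (for example {{open("test.feather", "rb").read()}}) should list a {{DictionaryBatch}} entry ahead of the first {{RecordBatch}} if the C# writer emitted one. If it is present, the footer or the reading side would be the next place to look.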