[
https://issues.apache.org/jira/browse/ARROW-17062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Neal Richardson updated ARROW-17062:
------------------------------------
Summary: [C#] Support compression in IPC format (was: [C#] write_feather()
in R doesn't interop with ArrowFileReader.ReadNextRecordBatch())
> [C#] Support compression in IPC format
> --------------------------------------
>
> Key: ARROW-17062
> URL: https://issues.apache.org/jira/browse/ARROW-17062
> Project: Apache Arrow
> Issue Type: Bug
> Components: C#, R
> Affects Versions: 8.0.0
> Environment: Arrow 8.0.0, R 4.2.1, VS 17.2.4
> Reporter: Todd West
> Priority: Major
> Fix For: 8.0.2
>
>
> A hello-world round trip between write_feather() and
> ArrowFileReader.ReadNextRecordBatch() fails with default settings. The
> failure is specific to compressed files (see workaround below). It looks like
> C# correctly decompresses the batches but hands the caller the compressed
> versions of the data arrays instead of the uncompressed ones. All of the
> various Length properties are set correctly in C#, but the data arrays are
> too short to contain all of the values in the file, their bytes do not match
> what the decompressed bytes should be, and basic data accessors like
> PrimitiveArray<T>.Values can't be used because they throw
> ArgumentOutOfRangeException. Looking through the C# classes in the GitHub
> repo, there doesn't appear to be a way for the caller to request
> decompression, so I'm guessing decompression is supposed to be automatic but,
> for some reason, isn't.
>
> While functionally successful, the workaround of using uncompressed feather
> isn't great as the uncompressed files are bigger than .csv. In my application
> the resulting disk space penalty is hundreds of megabytes compared to the
> footprint of using compressed feather.
>
> Simple single-field reprex:
> In R (arrow 8.0.0):
> {{write_feather(tibble(value = seq(0, 1, length.out = 21)), "test lz4.feather")}}
> In C# (Apache.Arrow 8.0.0):
> {{using Apache.Arrow;}}
> {{using Apache.Arrow.Ipc;}}
> {{using System;}}
> {{using System.IO;}}
> {{using System.Linq;}}
> {{using System.Runtime.InteropServices;}}
> {{ using FileStream stream = new("test lz4.feather", FileMode.Open, FileAccess.Read, FileShare.Read);}}
> {{ using ArrowFileReader arrowFile = new(stream);}}
> {{ for (RecordBatch batch = arrowFile.ReadNextRecordBatch(); batch != null; batch = arrowFile.ReadNextRecordBatch())}}
> {{ {}}
> {{ IArrowArray[] fields = batch.Arrays.ToArray();}}
> {{ // 15 incorrect values instead of 21 correctly incrementing ones (0, 0.05, 0.10, ..., 1)}}
> {{ ReadOnlySpan<double> test = MemoryMarshal.Cast<byte, double>(((DoubleArray)fields[0]).ValueBuffer.Span);}}
> {{ }}}
> Workaround in R:
> {{write_feather(tibble(value = seq(0, 1, length.out = 21)), "test.feather", compression = "uncompressed")}}
>
> Apologies if this is a known issue. I didn't find anything on a Jira search
> and this isn't included in the [known issues list on
> github|http://example.com/].
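
One way to check the symptom described in the report above is sketched here. It is a diagnostic aid under stated assumptions, not a fix and not part of Apache.Arrow. The Arrow IPC format writes each buffer of a compressed record batch with an 8-byte little-endian signed integer in front of it that holds the uncompressed length, with -1 meaning the bytes that follow are stored uncompressed. Assuming the reader passes the raw buffer through untouched, that prefix should still be visible in ValueBuffer. The class and method names (CompressedBufferProbe, ReadUncompressedLength, LooksStillCompressed) are hypothetical and introduced only for illustration.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper, not part of Apache.Arrow.
public static class CompressedBufferProbe
{
    // Reads the 8-byte little-endian prefix that the Arrow IPC format places
    // in front of each buffer of a compressed record batch. A value of -1
    // means the bytes that follow the prefix are stored uncompressed.
    public static long ReadUncompressedLength(ReadOnlySpan<byte> rawBuffer)
    {
        if (rawBuffer.Length < sizeof(long))
        {
            throw new ArgumentException(
                "Buffer is too short to hold a length prefix.", nameof(rawBuffer));
        }
        return BinaryPrimitives.ReadInt64LittleEndian(rawBuffer);
    }

    // Heuristic (an assumption, not a guarantee): the buffer is probably
    // still in its on-disk compressed form if its first 8 bytes equal the
    // byte size the values would occupy once decompressed.
    public static bool LooksStillCompressed(
        ReadOnlySpan<byte> rawBuffer, int valueCount, int bytesPerValue)
    {
        if (rawBuffer.Length < sizeof(long))
        {
            return false;
        }
        long expectedBytes = (long)valueCount * bytesPerValue;
        return ReadUncompressedLength(rawBuffer) == expectedBytes;
    }
}
```

Inside the loop of the reprex, this could be called as CompressedBufferProbe.LooksStillCompressed(((DoubleArray)fields[0]).ValueBuffer.Span, fields[0].Length, sizeof(double)). If the buffer really is being returned without decompression, the prefix would be expected to equal 21 * 8 = 168 for the file above. Genuine double data can coincidentally match, so a true result is only a hint.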
--
This message was sent by Atlassian Jira
(v8.20.10#820010)