[jira] [Updated] (ARROW-17062) [C#] Support compression in IPC format

Neal Richardson (Jira) Wed, 13 Jul 2022 05:39:27 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-17062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Neal Richardson updated ARROW-17062:
------------------------------------
    Summary: [C#] Support compression in IPC format  (was: [C#] write_feather() 
in R doesn't interop with ArrowFileReader.ReadNextRecordBatch())

> [C#] Support compression in IPC format
> --------------------------------------
>
>                 Key: ARROW-17062
>                 URL: https://issues.apache.org/jira/browse/ARROW-17062
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C#, R
>    Affects Versions: 8.0.0
>         Environment: Arrow 8.0.0, R 4.2.1, VS 17.2.4
>            Reporter: Todd West
>            Priority: Major
>             Fix For: 8.0.2
>
>
> A minimal round trip between write_feather() in R and 
> ArrowFileReader.ReadNextRecordBatch() in C# fails with default settings. This 
> is specific to compressed files (see workaround below). It appears that C# 
> reads the batches but hands the caller the still-compressed data arrays 
> instead of the decompressed ones. All of the various Length properties are 
> set correctly in C#, but the data arrays are too short to contain all of the 
> values in the file, the bytes do not match what the decompressed bytes should 
> be, and basic data accessors like PrimitiveArray<T>.Values can't be used 
> because they throw ArgumentOutOfRangeException. Looking through the C# 
> classes in the GitHub repo, there doesn't appear to be a way for the caller 
> to request decompression. So I'm guessing decompression is supposed to be 
> automatic but, for some reason, isn't.
>  
> While functionally successful, the workaround of using uncompressed feather 
> isn't great, as the uncompressed files are bigger than the equivalent .csv. 
> In my application the resulting disk space penalty is hundreds of megabytes 
> compared to the footprint of using compressed feather.
>  
> Simple single-field reprex:
> In R (arrow 8.0.0):
> {{write_feather(tibble(value = seq(0, 1, length.out = 21)), "test lz4.feather")}}
> In C# (Apache.Arrow 8.0.0):
> {{using Apache.Arrow;}}
> {{using Apache.Arrow.Ipc;}}
> {{using System;}}
> {{using System.IO;}}
> {{using System.Linq;}}
> {{using System.Runtime.InteropServices;}}
> {{            using FileStream stream = new("test lz4.feather", FileMode.Open, FileAccess.Read, FileShare.Read);}}
> {{            using ArrowFileReader arrowFile = new(stream);}}
> {{            for (RecordBatch batch = arrowFile.ReadNextRecordBatch(); batch != null; batch = arrowFile.ReadNextRecordBatch())}}
> {{            {}}
> {{                IArrowArray[] fields = batch.Arrays.ToArray();}}
> {{                // 15 incorrect values instead of 21 correctly incrementing ones (0, 0.05, 0.10, ..., 1)}}
> {{                ReadOnlySpan<double> test = MemoryMarshal.Cast<byte, double>(((DoubleArray)fields[0]).ValueBuffer.Span);}}
> {{            }}}
> Workaround in R:
> {{write_feather(tibble(value = seq(0, 1, length.out = 21)), "test.feather", compression = "uncompressed")}}
>  
> Apologies if this is a known issue. I didn't find anything on a Jira search 
> and this isn't included in the [known issues list on 
> github|http://example.com/].
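
The symptom described in the report (correct Length values, but value buffers too short to hold them) can be checked mechanically. In the Arrow IPC format, a compressed body buffer begins with an 8-byte little-endian int64 giving its uncompressed length, followed by the compressed bytes. The sketch below is a hypothetical helper, not part of Apache.Arrow; the class and method names are assumptions made for illustration. It assumes fixed-width values and flags a buffer as probably still compressed when it is too short for the declared value count and its prefix equals the expected uncompressed size.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical diagnostic helper; not an Apache.Arrow API.
public static class IpcBufferCheck
{
    // Returns true when the buffer is too short to hold valueCount values of
    // bytesPerValue bytes each AND its first 8 bytes, read as a little-endian
    // int64, equal the expected uncompressed size. This is a heuristic only.
    public static bool LooksStillCompressed(
        ReadOnlySpan<byte> buffer, int valueCount, int bytesPerValue)
    {
        long expectedBytes = (long)valueCount * bytesPerValue;

        // A buffer large enough for every value is not the reported symptom.
        if (buffer.Length >= expectedBytes)
        {
            return false;
        }

        // Too short to even carry the 8-byte length prefix.
        if (buffer.Length < sizeof(long))
        {
            return false;
        }

        long prefix = BinaryPrimitives.ReadInt64LittleEndian(buffer);
        return prefix == expectedBytes;
    }
}
```

For the reprex in the report, this might be called as `IpcBufferCheck.LooksStillCompressed(((DoubleArray)fields[0]).ValueBuffer.Span, fields[0].Length, sizeof(double))`, where 21 doubles require 21 * 8 = 168 bytes. A true result only suggests that the reader handed back the raw compressed buffer; it does not identify the codec or decompress anything.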



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
