[
https://issues.apache.org/jira/browse/ARROW-12100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320245#comment-17320245
]
Antoine Pitrou commented on ARROW-12100:
----------------------------------------
There is a JSON representation format that the C# implementation needs to
understand. It is described in
https://arrow.apache.org/docs/format/Integration.html , but you may get more
insight by running the current integration tests themselves and looking at the
generated JSON files.
Integration testing uses an internal tool written in Python named Archery (see
here for install instructions:
https://arrow.apache.org/docs/developers/archery.html). You'll find the Archery
bits related to integration testing in the {{dev/archery/archery/integration}}
directory:
https://github.com/apache/arrow/tree/master/dev/archery/archery/integration.
The C# implementation needs to expose endpoints (command line APIs) for four
functionalities:
* JSON to Arrow: read a JSON file and convert it to an Arrow IPC file
* Validate: read both a JSON file and an Arrow IPC file, and check that their
contents are equal
* File to stream: read an Arrow IPC file and convert it to an Arrow IPC stream
* Stream to file: read an Arrow IPC stream and convert it to an Arrow IPC file
You need to add a definition for those endpoints to the Archery file for the C#
implementation (see the various {{tester_*.py}} files in the directory
mentioned earlier).
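As a rough illustration of what such a definition looks like, here is a hedged sketch of a tester class exposing the four endpoints. The real base-class API lives in {{dev/archery/archery/integration/tester.py}}; the class name, mode strings, flag names, and the {{dotnet}} command line below are all illustrative assumptions, not the actual Archery interface -- consult the existing {{tester_*.py}} files for the exact signatures.

```python
import subprocess

class CSharpTester:
    """Illustrative stand-in for an Archery Tester subclass wiring up
    the four integration endpoints. Not the real Archery API."""
    PRODUCER = True   # implements JSON -> Arrow conversion
    CONSUMER = True   # implements validation against a JSON file
    name = 'C#'

    # Hypothetical command line for a C# integration executable
    CLI = ['dotnet', 'run', '--project',
           'csharp/test/Apache.Arrow.IntegrationTest']

    def _run(self, *args):
        # Shell out to the C# executable with the given arguments
        subprocess.check_call(self.CLI + list(args))

    def json_to_file(self, json_path, arrow_path):
        self._run('--mode', 'json-to-arrow', '-j', json_path, '-a', arrow_path)

    def validate(self, json_path, arrow_path):
        self._run('--mode', 'validate', '-j', json_path, '-a', arrow_path)

    def file_to_stream(self, file_path, stream_path):
        self._run('--mode', 'file-to-stream', '-a', file_path, '-o', stream_path)

    def stream_to_file(self, stream_path, file_path):
        self._run('--mode', 'stream-to-file', '-i', stream_path, '-a', file_path)
```

The four methods map one-to-one onto the four functionalities listed above; Archery drives them when pairing the C# implementation with each other implementation.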
Also, unless you're supporting each and every functionality, you'll probably
need to add skips, for example here:
https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/datagen.py#L1512
and there:
https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/runner.py#L129
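The skip mechanism in {{datagen.py}} works by chaining {{skip_category}} calls on each generated test case. The toy {{File}} class below is a simplified stand-in so the snippet runs on its own; only the {{skip_category}} name and the chaining pattern mirror the real file.

```python
# Toy model of datagen.py's skip mechanism: each generated test case
# carries the set of implementation names that should skip it.
class File:
    def __init__(self, name):
        self.name = name
        self.skip = set()

    def skip_category(self, category):
        # Mark this case as skipped for one implementation,
        # returning self so calls can be chained as in datagen.py.
        self.skip.add(category)
        return self

# e.g. skip a case for C# (and JS) until the feature is implemented
case = File('generate_map_case').skip_category('C#').skip_category('JS')
```

Until the C# implementation supports every data type, adding such skips keeps the integration suite green for the cases you do support.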
Feel free to ask any more questions.
> [C#] Cannot round-trip record batch with PyArrow
> ------------------------------------------------
>
> Key: ARROW-12100
> URL: https://issues.apache.org/jira/browse/ARROW-12100
> Project: Apache Arrow
> Issue Type: Bug
> Components: C#, C++, Python
> Affects Versions: 3.0.0
> Reporter: Tanguy Fautre
> Assignee: Antoine Pitrou
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: ArrowSharedMemory_20210326.zip,
> ArrowSharedMemory_20210326_2.zip, ArrowSharedMemory_20210329.zip
>
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> Has anyone ever tried to round-trip a record batch between Arrow C# and
> PyArrow? I can't get PyArrow to read the data correctly.
> For context, I'm trying to do Arrow data-frames inter-process communication
> between C# and Python using shared memory (local TCP/IP is also an
> alternative). Ideally, I wouldn't even have to serialise the data and could
> just share the Arrow in-memory representation directly, but I'm not sure this
> is even possible with Apache Arrow. Full source code as attachment.
> *C#*
> {code:c#}
> using (var stream = sharedMemory.CreateStream(0, 0,
>     MemoryMappedFileAccess.ReadWrite))
> {
>     var recordBatch = /* ... */;
>     using (var writer = new ArrowFileWriter(stream, recordBatch.Schema,
>         leaveOpen: true))
>     {
>         writer.WriteRecordBatch(recordBatch);
>         writer.WriteEnd();
>     }
> }
> {code}
> *Python*
> {code:python}
> shmem = open_shared_memory(args)
> address = get_shared_memory_address(shmem)
> buf = pa.foreign_buffer(address, args.sharedMemorySize)
> stream = pa.input_stream(buf)
> reader = pa.ipc.open_stream(stream)
> {code}
> Unfortunately, it fails with the following error: {{pyarrow.lib.ArrowInvalid:
> Expected to read 1330795073 metadata bytes, but only read 1230}}.
> I can see that the memory content starts with
> {{ARROW1\x00\x00\xff\xff\xff\xff\x08\x01\x00\x00\x10\x00\x00\x00}}. It seems
> that using the API calls above, PyArrow reads "ARRO" as the length of the
> metadata.
> I assume I'm using the API incorrectly. Has anyone got a working example?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)