[
https://issues.apache.org/jira/browse/ARROW-12100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320245#comment-17320245
]
Antoine Pitrou commented on ARROW-12100:
----------------------------------------
There is a JSON representation format that the C# implementation needs to
understand. It is described in
https://arrow.apache.org/docs/format/Integration.html , but you may get more
insight by running the current integration tests themselves and looking at the
generated JSON files.
Integration testing uses an internal tool written in Python named Archery (see
here for install instructions:
https://arrow.apache.org/docs/developers/archery.html). You'll find the Archery
bits related to integration testing in the {{dev/archery/archery/integration}}
directory:
https://github.com/apache/arrow/tree/master/dev/archery/archery/integration.
The C# implementation needs to expose endpoints (command line APIs) for four
functionalities:
* JSON to Arrow: read a JSON file and convert it to an Arrow IPC file
* Validate: read both a JSON file and an Arrow IPC file, and check that their
contents are equal
* File to stream: read an Arrow IPC file and convert it to an Arrow IPC stream
* Stream to file: read an Arrow IPC stream and convert it to an Arrow IPC file
You need to add a definition for those endpoints to the Archery file for the C#
implementation (see the various {{tester_*.py}} files in the directory
mentioned earlier).
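As a rough illustration of what such a definition looks like, here is a hedged sketch of a tester class exposing the four endpoints. The real base-class API lives in {{dev/archery/archery/integration/tester.py}}; the class name, mode strings, flag names, and the {{dotnet}} command line below are all illustrative assumptions, not the actual Archery interface -- consult the existing {{tester_*.py}} files for the exact signatures.

```python
import subprocess

class CSharpTester:
    """Illustrative stand-in for an Archery Tester subclass wiring up
    the four integration endpoints. Not the real Archery API."""
    PRODUCER = True   # implements JSON -> Arrow conversion
    CONSUMER = True   # implements validation against a JSON file
    name = 'C#'

    # Hypothetical command line for a C# integration executable
    CLI = ['dotnet', 'run', '--project',
           'csharp/test/Apache.Arrow.IntegrationTest']

    def _run(self, *args):
        # Shell out to the C# executable with the given arguments
        subprocess.check_call(self.CLI + list(args))

    def json_to_file(self, json_path, arrow_path):
        self._run('--mode', 'json-to-arrow', '-j', json_path, '-a', arrow_path)

    def validate(self, json_path, arrow_path):
        self._run('--mode', 'validate', '-j', json_path, '-a', arrow_path)

    def file_to_stream(self, file_path, stream_path):
        self._run('--mode', 'file-to-stream', '-a', file_path, '-o', stream_path)

    def stream_to_file(self, stream_path, file_path):
        self._run('--mode', 'stream-to-file', '-i', stream_path, '-a', file_path)
```

The four methods map one-to-one onto the four functionalities listed above; Archery drives them when pairing the C# implementation with each other implementation.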
Also, unless you're supporting each and every functionality, you'll probably
need to add skips, for example here:
https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/datagen.py#L1512
and there:
https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/runner.py#L129
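The skip mechanism in {{datagen.py}} works by chaining {{skip_category}} calls on each generated test case. The toy {{File}} class below is a simplified stand-in so the snippet runs on its own; only the {{skip_category}} name and the chaining pattern mirror the real file.

```python
# Toy model of datagen.py's skip mechanism: each generated test case
# carries the set of implementation names that should skip it.
class File:
    def __init__(self, name):
        self.name = name
        self.skip = set()

    def skip_category(self, category):
        # Mark this case as skipped for one implementation,
        # returning self so calls can be chained as in datagen.py.
        self.skip.add(category)
        return self

# e.g. skip a case for C# (and JS) until the feature is implemented
case = File('generate_map_case').skip_category('C#').skip_category('JS')
```

Until the C# implementation supports every data type, adding such skips keeps the integration suite green for the cases you do support.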
Feel free to ask any more questions.
> [C#] Cannot round-trip record batch with PyArrow
> ------------------------------------------------
>
> Key: ARROW-12100
> URL: https://issues.apache.org/jira/browse/ARROW-12100
> Project: Apache Arrow
> Issue Type: Bug
> Components: C#, C++, Python
> Affects Versions: 3.0.0
> Reporter: Tanguy Fautre
> Assignee: Antoine Pitrou
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: ArrowSharedMemory_20210326.zip,
> ArrowSharedMemory_20210326_2.zip, ArrowSharedMemory_20210329.zip
>
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> Has anyone ever tried to round-trip a record batch between Arrow C# and
> PyArrow? I can't get PyArrow to read the data correctly.
> For context, I'm trying to do Arrow data-frames inter-process communication
> between C# and Python using shared memory (local TCP/IP is also an
> alternative). Ideally, I wouldn't even have to serialise the data and could
> just share the Arrow in-memory representation directly, but I'm not sure this
> is even possible with Apache Arrow. Full source code as attachment.
> *C#*
> {code:c#}
> using (var stream = sharedMemory.CreateStream(0, 0,
>     MemoryMappedFileAccess.ReadWrite))
> {
>     var recordBatch = /* ... */;
>     using (var writer = new ArrowFileWriter(stream, recordBatch.Schema,
>         leaveOpen: true))
>     {
>         writer.WriteRecordBatch(recordBatch);
>         writer.WriteEnd();
>     }
> }
> {code}
> *Python*
> {code:python}
> shmem = open_shared_memory(args)
> address = get_shared_memory_address(shmem)
> buf = pa.foreign_buffer(address, args.sharedMemorySize)
> stream = pa.input_stream(buf)
> reader = pa.ipc.open_stream(stream)
> {code}
> Unfortunately, it fails with the following error: {{pyarrow.lib.ArrowInvalid:
> Expected to read 1330795073 metadata bytes, but only read 1230}}.
> I can see that the memory content starts with
> {{ARROW1\x00\x00\xff\xff\xff\xff\x08\x01\x00\x00\x10\x00\x00\x00}}. It seems
> that using the API calls above, PyArrow reads "ARRO" as the length of the
> metadata.
> I assume I'm using the API incorrectly. Has anyone got a working example?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)