[ https://issues.apache.org/jira/browse/ARROW-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17505032#comment-17505032 ]
Weston Pace commented on ARROW-15642:
-------------------------------------
That sounds like a good plan. One thing to keep in mind is that the two
formats serve different purposes, so it is entirely possible that arquero (I'm
afraid I don't know much about this library) is intentionally using the
streaming format and that the file format doesn't make sense there.

Streaming format: The recipient can start processing results before it has
received the entire delivery. Typically used when sending results between two
processes.

File format: Allows random access to batches. Typically used when storing
data on disk or some other storage device with random access capabilities.

Most (maybe all?) language implementations can read and write both formats.
> [Python] [JavaScript] Arrow IPC file output by apache-arrow tableToIPC method
> cannot be read by pyarrow
> -------------------------------------------------------------------------------------------------------
>
> Key: ARROW-15642
> URL: https://issues.apache.org/jira/browse/ARROW-15642
> Project: Apache Arrow
> Issue Type: Bug
> Components: JavaScript, Python
> Affects Versions: 7.0.0
> Reporter: Dan Coates
> Assignee: Weston Pace
> Priority: Major
>
> IPC files created by the node library `apache-arrow` don't seem to be
> readable by pyarrow. There is an example of this issue here:
> [https://github.com/dancoates/pyarrow-jsarrow-test|https://github.com/dancoates/pyarrow-jsarrow-test]
>
> writing the arrow file from js
> {code:javascript}
> import {tableToIPC, tableFromArrays} from 'apache-arrow';
> import fs from 'fs';
>
> const LENGTH = 2000;
> const rainAmounts = Float32Array.from(
>     { length: LENGTH },
>     () => Number((Math.random() * 20).toFixed(1)));
> const rainDates = Array.from(
>     { length: LENGTH },
>     (_, i) => new Date(Date.now() - 1000 * 60 * 60 * 24 * i));
> const rainfall = tableFromArrays({
>     precipitation: rainAmounts,
>     date: rainDates
> });
> const outputTable = tableToIPC(rainfall);
> fs.writeFileSync('jsarrow.arrow', outputTable);
> {code}
>
> reading in python
> {code:python}
> import pyarrow as pa
>
> with open('jsarrow.arrow', 'rb') as f:
>     with pa.ipc.open_file(f) as reader:
>         df = reader.read_pandas()
>         print(df.head())
> {code}
>
> produces the error:
> {code:java}
> pyarrow.lib.ArrowInvalid: Not an Arrow file
> {code}
>
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)