[
https://issues.apache.org/jira/browse/ARROW-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518297#comment-17518297
]
Paul Taylor commented on ARROW-15642:
-------------------------------------
[~domoritz] the IPC stream format is the more common use case, at least in
real-time ETL processing. The File format is useful for reading more efficiently
from disk, but it is not suited to inter-process communication.
If a consumer process wanted the advantage of constant-time random batch access
(like the File format provides), they could buffer the stream until it's
finished and write the footer. However, it is not possible to process an
incoming Arrow table (in the IPC File format) in batches as they arrive, as the
IPC File reader blocks until it sees the footer at the end.
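
To make the distinction concrete, here is a rough sketch (not part of pyarrow
or apache-arrow) of how a consumer might tell the two layouts apart from the
raw bytes. It relies on the documented framing: the File format starts and
ends with the magic string "ARROW1", while the stream format (since Arrow
0.15) starts each message with a 0xFFFFFFFF continuation marker. The function
name sniff_arrow_ipc_format and the 12-byte minimum length are assumptions
made for illustration only.

```python
import struct

ARROW_MAGIC = b"ARROW1"       # magic string of the IPC File format
CONTINUATION = 0xFFFFFFFF     # marker that opens each IPC stream message


def sniff_arrow_ipc_format(data: bytes) -> str:
    """Guess the IPC layout of `data`: "file", "stream", or "unknown".

    Hypothetical helper for illustration; it inspects framing bytes only
    and does not validate the payload.
    """
    # File format: magic at the very start and again at the very end,
    # after the footer, which is why a reader needs the complete payload.
    if len(data) >= 12 and data[:6] == ARROW_MAGIC and data[-6:] == ARROW_MAGIC:
        return "file"
    # Stream format: the first message begins with the continuation marker,
    # so a reader can start consuming batches as soon as bytes arrive.
    if len(data) >= 8:
        (marker,) = struct.unpack_from("<I", data, 0)
        if marker == CONTINUATION:
            return "stream"
    # Anything else, including legacy streams written without the marker.
    return "unknown"
```

If this reports "stream" for a payload, pa.ipc.open_stream is probably the
reader to try rather than pa.ipc.open_file.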
> [Python] [JavaScript] Arrow IPC file output by apache-arrow tableToIPC method
> cannot be read by pyarrow
> -------------------------------------------------------------------------------------------------------
>
> Key: ARROW-15642
> URL: https://issues.apache.org/jira/browse/ARROW-15642
> Project: Apache Arrow
> Issue Type: Bug
> Components: JavaScript, Python
> Affects Versions: 7.0.0
> Reporter: Dan Coates
> Assignee: Weston Pace
> Priority: Major
>
> IPC files created by the node library `apache-arrow` don't seem to be able to
> be read by pyarrow. There is an example of this issue here:
> [https://github.com/dancoates/pyarrow-jsarrow-test|https://github.com/dancoates/pyarrow-jsarrow-test]
>
> writing the arrow file from js
> {code:javascript}
> import {tableToIPC, tableFromArrays} from 'apache-arrow';
> import fs from 'fs';
> const LENGTH = 2000;
> const rainAmounts = Float32Array.from(
>   { length: LENGTH },
>   () => Number((Math.random() * 20).toFixed(1)));
> const rainDates = Array.from(
>   { length: LENGTH },
>   (_, i) => new Date(Date.now() - 1000 * 60 * 60 * 24 * i));
> const rainfall = tableFromArrays({
>   precipitation: rainAmounts,
>   date: rainDates
> });
> const outputTable = tableToIPC(rainfall);
> fs.writeFileSync('jsarrow.arrow', outputTable); {code}
>
> reading in python
> {code:python}
> import pyarrow as pa
> with open('jsarrow.arrow', 'rb') as f:
>     with pa.ipc.open_file(f) as reader:
>         df = reader.read_pandas()
>         print(df.head())
> {code}
>
> produces the error:
> {code:java}
> pyarrow.lib.ArrowInvalid: Not an Arrow file {code}
>
>