[jira] [Commented] (ARROW-15642) [Python] [JavaScript] Arrow IPC file output by apache-arrow tableToIPC method cannot be read by pyarrow

Paul Taylor (Jira) Wed, 06 Apr 2022 09:42:07 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518297#comment-17518297
 ]


Paul Taylor commented on ARROW-15642:
-------------------------------------

[~domoritz] the IPC stream format is the more common use case, at least in 
real-time ETL processing. The File format is useful for reading more 
efficiently from disk, but it is not suited to inter-process communication.

If a consumer process wants the constant-time random batch access that the 
File format provides, it can buffer the stream until it finishes and then 
write the footer. However, it is not possible to process an incoming Arrow 
table in the IPC File format batch by batch as the batches arrive, because 
the IPC File reader blocks until it sees the footer at the end.
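
To make the distinction concrete, here is a rough sketch of how a consumer 
could tell the two formats apart from the raw bytes. It assumes the framing 
described in the Arrow columnar format docs: a File starts and ends with the 
"ARROW1" magic, and a stream (since format version 0.15) starts with the 
0xFFFFFFFF continuation marker. The function name, constants, and return 
labels are made up for illustration and are not part of any Arrow library.

```python
# Hypothetical helper, not an Arrow API: guess the IPC flavour from raw bytes.
ARROW_MAGIC = b"ARROW1"             # File format magic (start and end of file)
CONTINUATION = b"\xff\xff\xff\xff"  # stream message marker (assumes >= 0.15)
# Smallest conceivable File: magic + 2 pad bytes + int32 footer size + magic.
MIN_FILE_LEN = len(ARROW_MAGIC) + 2 + 4 + len(ARROW_MAGIC)


def sniff_ipc_format(data: bytes) -> str:
    """Return 'file', 'incomplete-file', 'stream', or 'unknown'."""
    if data.startswith(ARROW_MAGIC):
        # The trailing magic follows the footer, so it is only present once
        # the writer has finished. Until then a File reader cannot proceed.
        if len(data) >= MIN_FILE_LEN and data.endswith(ARROW_MAGIC):
            return "file"
        return "incomplete-file"
    if data[:4] == CONTINUATION:
        # A stream has no footer; messages can be consumed as they arrive.
        return "stream"
    return "unknown"


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as f:
        print(sniff_ipc_format(f.read()))
```

If this reports "stream" for a given file, then pyarrow's 
pa.ipc.open_stream, rather than pa.ipc.open_file, would be the reader to 
try. The check only looks at the first and last few bytes, so it is a 
heuristic and not a validation of the contents.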

> [Python] [JavaScript] Arrow IPC file output by apache-arrow tableToIPC method 
> cannot be read by pyarrow
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-15642
>                 URL: https://issues.apache.org/jira/browse/ARROW-15642
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: JavaScript, Python
>    Affects Versions: 7.0.0
>            Reporter: Dan Coates
>            Assignee: Weston Pace
>            Priority: Major
>
> IPC files created by the node library `apache-arrow` don't seem to be able to 
> be read by pyarrow. There is an example of this issue here: 
> https://github.com/dancoates/pyarrow-jsarrow-test
>  
> writing the arrow file from js
> {code:javascript}
> import {tableToIPC, tableFromArrays} from 'apache-arrow';
> import fs from 'fs';
> const LENGTH = 2000;
> const rainAmounts = Float32Array.from(
>     { length: LENGTH },
>     () => Number((Math.random() * 20).toFixed(1)));
> const rainDates = Array.from(
>     { length: LENGTH },
>     (_, i) => new Date(Date.now() - 1000 * 60 * 60 * 24 * i));
> const rainfall = tableFromArrays({
>     precipitation: rainAmounts,
>     date: rainDates
> });
> const outputTable = tableToIPC(rainfall);
> fs.writeFileSync('jsarrow.arrow', outputTable); {code}
>  
> reading in python
> {code:python}
> import pyarrow as pa
> with open('jsarrow.arrow', 'rb') as f:
>     with pa.ipc.open_file(f) as reader:
>         df = reader.read_pandas()
>         print(df.head())
>  {code}
>  
> produces the error:
> {code:java}
> pyarrow.lib.ArrowInvalid: Not an Arrow file {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
