[ https://issues.apache.org/jira/browse/ARROW-12428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324821#comment-17324821 ]
Joris Van den Bossche commented on ARROW-12428:
-----------------------------------------------
[~lidavidm] a small comment on the benchmark code: for the pyarrow cases, you
need to add a {{.to_pandas()}} call to make them equivalent to the pandas
{{pd.read_parquet}} version (although I would not expect the conversion to be
that significant compared to reading from S3).
(The name {{read_pandas}} is a bit confusing: it still reads into a
{{pyarrow.Table}}. By default it only uses the pandas metadata, e.g. to ensure
that the pandas index column is read as well.)
> [Python] pyarrow.parquet.read_* should use pre_buffer=True
> ----------------------------------------------------------
>
> Key: ARROW-12428
> URL: https://issues.apache.org/jira/browse/ARROW-12428
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: David Li
> Assignee: David Li
> Priority: Major
> Labels: pull-request-available
> Fix For: 5.0.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> If the user is synchronously reading a single file, we should try to read it
> as fast as possible. The one sticking point might be whether it's beneficial
> to enable this no matter the filesystem or whether we should try to only
> enable it on high-latency filesystems.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)