[ 
https://issues.apache.org/jira/browse/ARROW-12428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324043#comment-17324043
 ] 

David Li edited comment on ARROW-12428 at 4/16/21, 7:41 PM:
------------------------------------------------------------

And for local files, to confirm that pre_buffer doesn't hurt performance:
{noformat}
Pandas: 14.584974920144305 seconds
PyArrow: 6.650648137088865 seconds
PyArrow (pre-buffer): 6.587288308190182 seconds
{noformat}
This is on a system with NVMe storage, so results may vary for spinning-rust 
disks or SATA SSDs.

(Updated the results: each file is now read once, unmeasured, before the timed 
read, in case the disk cache is a factor.)
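
For anyone who wants to try a similar comparison, here is a minimal sketch of the measurement approach described above (one unmeasured warm-up read, then one timed read). It is not the script that produced the numbers in this comment. The helper names timed_read and compare_readers are hypothetical, and the choice of pandas.read_parquet and pyarrow.parquet.read_table as the three readers is an assumption based on the labels in the results.

```python
import time


def timed_read(read_fn, path, warmup=True):
    """Return the seconds read_fn(path) takes, after an optional warm-up read."""
    if warmup:
        # Unmeasured read so the OS disk cache is in the same state for every reader.
        read_fn(path)
    start = time.perf_counter()
    read_fn(path)
    return time.perf_counter() - start


def compare_readers(path):
    """Print one timing per reader for an existing local Parquet file."""
    # Imported here so timed_read() stays usable without these packages installed.
    import pandas as pd
    import pyarrow.parquet as pq

    # Assumption: these three calls correspond to the three labels in the results.
    readers = {
        "Pandas": pd.read_parquet,
        "PyArrow": lambda p: pq.read_table(p, pre_buffer=False),
        "PyArrow (pre-buffer)": lambda p: pq.read_table(p, pre_buffer=True),
    }
    for name, read_fn in readers.items():
        print(f"{name}: {timed_read(read_fn, path)} seconds")
```

Calling compare_readers with the path to a local Parquet file prints one line per reader. A single timed read is noisy, so repeating the measurement a few times is probably worthwhile before drawing conclusions.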


was (Author: lidavidm):
And for local files, to confirm that pre_buffer isn't a negative:
{noformat}
Pandas: 14.566267257090658 seconds
PyArrow: 6.649410092970356 seconds
PyArrow (pre-buffer): 6.627140663098544 seconds
{noformat}
This is on a system with NVMe storage, so results may vary for spinning-rust 
disks or SATA SSDs.

> [Python] pyarrow.parquet.read_* should use pre_buffer=True
> ----------------------------------------------------------
>
>                 Key: ARROW-12428
>                 URL: https://issues.apache.org/jira/browse/ARROW-12428
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: David Li
>            Assignee: David Li
>            Priority: Major
>             Fix For: 5.0.0
>
>
> If the user is synchronously reading a single file, we should try to read it 
> as fast as possible. The one sticking point might be whether it's beneficial 
> to enable this no matter the filesystem or whether we should try to only 
> enable it on high-latency filesystems.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
