[ https://issues.apache.org/jira/browse/ARROW-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593996#comment-16593996 ]

Antoine Pitrou commented on ARROW-501:
--------------------------------------

Yes, I understand your idea. There are basically two possible APIs, and we 
could implement both:

1) a {{ReadaheadSpooler}} object that reads up to N fixed-size buffers in 
advance (N and buffer size being fixed in the constructor call)

2) a regular {{InputStream}} implementation that speculatively reads some data 
ahead into an internal buffer or queue of buffers, at the cost of a possible 
additional copy when the user calls {{Read(<some given size>)}}
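
For illustration, here is a minimal sketch of what option 1 could look like 
(the class name, the raw-read callback, and the threading details are all 
hypothetical, not existing Arrow API). A background thread keeps a bounded 
queue topped up with fixed-size buffers; in Arrow the raw read would come from 
an {{arrow::io::InputStream}}:

{code:cpp}
// Sketch only -- not Arrow code. A background thread fills a bounded queue
// with at most `readahead` buffers of `buffer_size` bytes each.
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

class ReadaheadSpooler {
 public:
  // The raw stream is modelled as a blocking read callback returning the
  // number of bytes read (<= 0 meaning end of stream).
  using RawRead = std::function<int64_t(uint8_t* out, int64_t nbytes)>;

  ReadaheadSpooler(RawRead raw, int64_t buffer_size, size_t readahead)
      : raw_(std::move(raw)), buffer_size_(buffer_size), readahead_(readahead),
        worker_([this] { FillLoop(); }) {}

  ~ReadaheadSpooler() {
    // Sketch caveat: a raw read blocked inside FillLoop is not cancelled here.
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopped_ = true;
    }
    not_full_.notify_all();
    worker_.join();
  }

  // Pop the next filled buffer, blocking until one is available.
  // An empty vector signals end of stream.
  std::vector<uint8_t> Read() {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [this] { return !queue_.empty() || eof_; });
    if (queue_.empty()) return {};
    std::vector<uint8_t> buf = std::move(queue_.front());
    queue_.pop_front();
    not_full_.notify_one();
    return buf;
  }

 private:
  void FillLoop() {
    while (true) {
      std::vector<uint8_t> buf(static_cast<size_t>(buffer_size_));
      int64_t n = raw_(buf.data(), buffer_size_);
      std::unique_lock<std::mutex> lock(mutex_);
      if (n <= 0) {  // end of stream (errors ignored in this sketch)
        eof_ = true;
        not_empty_.notify_all();
        return;
      }
      buf.resize(static_cast<size_t>(n));
      not_full_.wait(lock,
                     [this] { return queue_.size() < readahead_ || stopped_; });
      if (stopped_) return;
      queue_.push_back(std::move(buf));
      not_empty_.notify_one();
    }
  }

  RawRead raw_;
  int64_t buffer_size_;
  size_t readahead_;
  std::mutex mutex_;
  std::condition_variable not_empty_, not_full_;
  std::deque<std::vector<uint8_t>> queue_;
  bool eof_ = false;
  bool stopped_ = false;
  std::thread worker_;  // declared last so it starts after the other members
};
{code}

Option 2 would then hide such a spooler behind the regular {{InputStream}} 
interface, serving {{Read(<some given size>)}} calls from the buffered data at 
the cost of that possible extra copy.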

Note that for CSV reading, it would be desirable for the readahead spooler to 
leave some configurable padding at the front of each buffer. That way, if the 
preceding buffer ended with an unfinished line, you can copy that line into 
the front padding of the next buffer instead of copying the (much larger) rest 
of the buffer. Symmetrically, some padding at the end of buffers could be 
useful too (I expect it might speed up some CSV algorithms).
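
A minimal sketch of the front-padding idea (the helper name and buffer layout 
are hypothetical): only the small unfinished tail line is copied into the 
reserved padding just before the next buffer's data, never the bulk of the 
buffer.

{code:cpp}
// Sketch only -- not Arrow code. `data` points at the freshly read block and
// `padding` writable bytes are reserved immediately before it. The previous
// block's unfinished tail line (`tail`, `tail_size` bytes) is copied into
// that padding so the parser sees one contiguous run of complete lines.
#include <cassert>
#include <cstdint>
#include <cstring>

inline const uint8_t* PrependPartialLine(uint8_t* data, int64_t padding,
                                         const uint8_t* tail,
                                         int64_t tail_size) {
  assert(tail_size <= padding);  // the tail must fit in the reserved padding
  uint8_t* start = data - tail_size;
  std::memcpy(start, tail, static_cast<size_t>(tail_size));
  return start;  // parsing starts here instead of at `data`
}
{code}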

> [C++] Implement concurrent / buffering InputStream for streaming data use 
> cases
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-501
>                 URL: https://issues.apache.org/jira/browse/ARROW-501
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>              Labels: csv
>             Fix For: 0.13.0
>
>
> Related to ARROW-500, when processing an input data stream, we may wish to 
> continue buffering input (up to a maximum buffer size) in between 
> synchronous Read calls.


