[jira] [Commented] (ARROW-17313) Add Byte Range to CSV Reader ReadOptions

Weston Pace (Jira) Thu, 04 Aug 2022 15:18:57 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575465#comment-17575465
 ]


Weston Pace commented on ARROW-17313:
-------------------------------------

For additional motivation, this overlaps with the way Substrait expresses 
partitioning information.  Substrait allows any file type to include "start 
byte" and "length" to slice the file.  For file types like Parquet and IPC this 
would involve grabbing all row groups whose first byte falls in that range 
(even though this may mean reading beyond the end of the specified range).  The 
advantage is that there is then a uniform API for partitioning files across 
formats.
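
To make the "first byte falls in that range" rule concrete, here is a rough
sketch.  The function name, the list of offsets, and the half-open range
[start, start + length) are assumptions made for illustration; this is not
Arrow or Substrait API.

```python
from typing import List, Sequence


def select_row_groups(first_byte_offsets: Sequence[int],
                      start: int, length: int) -> List[int]:
    """Return indices of row groups whose first byte lies in the slice.

    Assumption: the slice is the half-open range [start, start + length).
    A selected row group is read whole, even if it extends past the end
    of the slice.
    """
    end = start + length
    return [i for i, offset in enumerate(first_byte_offsets)
            if start <= offset < end]


# Hypothetical file with four row groups starting at these byte offsets.
offsets = [4, 1000, 2500, 4000]
print(select_row_groups(offsets, 0, 2000))     # [0, 1]
print(select_row_groups(offsets, 2000, 3000))  # [2, 3]
```

With adjacent, non-overlapping slices each row group is picked up by exactly
one slice, which is what makes this usable for partitioning.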

Another advantage here is that this would allow us to potentially parallelize 
chunking at a minor cost of overreading a bit for each block.  This overreading 
could be avoided if we knew we were going to read multiple blocks.  For 
example, if we know we want to read blocks 20-30 then we issue reads for blocks 
20-31.  As soon as any two consecutive blocks are loaded we can start parsing 
the lower block of the pair.

So the algorithm for each block boils down to:

Although...now that I type this up...I remember a potential flaw in this logic.
Finding the "first line delimiter" in a block can be an impossible problem if 
newlines are allowed inside of quoted values.  Though maybe we don't need to 
support that case, I don't recall.
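
As a rough illustration of the per-block idea (not the actual Arrow
implementation), the sketch below assumes that newlines never appear inside
quoted values and that a line belongs to the block containing its first byte.
The helper names and the in-memory `bytes` input are hypothetical.

```python
def next_line_start(data: bytes, pos: int) -> int:
    """Offset of the first line that begins at or after `pos`."""
    if pos <= 0:
        return 0
    if pos >= len(data):
        return len(data)
    # Search from pos - 1 so that a line starting exactly at `pos`
    # (i.e. data[pos - 1] is a newline) is not skipped.
    newline = data.find(b"\n", pos - 1)
    return len(data) if newline == -1 else newline + 1


def lines_for_block(data: bytes, start: int, end: int) -> bytes:
    """Complete lines whose first byte lies in [start, end).

    Skips the partial line at the front of the block and reads past
    `end` to finish the last line (the "overreading" mentioned above).
    """
    return data[next_line_start(data, start):next_line_start(data, end)]


data = b"a,b\n1,2\n3,4\n5,6\n"
blocks = [(0, 5), (5, 10), (10, 15), (15, 20)]
parts = [lines_for_block(data, s, e) for s, e in blocks]
print(parts)  # [b'a,b\n1,2\n', b'3,4\n', b'5,6\n', b'']
```

Because each block only needs its own bytes plus the start of the next block,
blocks can be parsed independently, and concatenating the results for adjacent
blocks reproduces the file.  If a quoted value contained a newline, the search
for the first line delimiter could land in the middle of a row, which is the
flaw described above.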

CC [~apitrou] for additional thoughts

> Add Byte Range to CSV Reader ReadOptions
> ----------------------------------------
>
>                 Key: ARROW-17313
>                 URL: https://issues.apache.org/jira/browse/ARROW-17313
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Ziheng Wang
>            Assignee: Ziheng Wang
>            Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries, the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
