[
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575465#comment-17575465
]
Weston Pace commented on ARROW-17313:
-------------------------------------
For additional motivation, this overlaps with the way Substrait expresses
partitioning information. Substrait allows any file type to include "start
byte" and "length" to slice the file. For file types like Parquet & IPC this
would involve grabbing all row groups whose first byte falls in that range
(even though this may mean reading beyond the end of the specified range). The
advantage is that there is then a uniform API for partitioning files across
formats.
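
To make the row group rule concrete, here is a rough sketch in Python. It is not Arrow or Substrait code; the `RowGroup` tuple and `select_row_groups` are hypothetical names made up for illustration. A row group is selected when its first byte falls in [start, start + length), regardless of where it ends.

```python
from typing import List, NamedTuple


class RowGroup(NamedTuple):
    """Hypothetical stand-in for row group metadata (not an Arrow type)."""
    index: int
    start_byte: int  # offset of the row group's first byte in the file
    num_bytes: int   # size of the row group in bytes


def select_row_groups(row_groups: List[RowGroup], start: int, length: int) -> List[int]:
    """Return indices of row groups whose first byte lies in [start, start + length)."""
    end = start + length
    # Only the first byte is tested, so a selected row group may extend
    # past `end` and the reader will read beyond the requested range.
    return [rg.index for rg in row_groups if start <= rg.start_byte < end]
```

Because every row group has exactly one first byte, slicing a file into non-overlapping ranges assigns each row group to exactly one slice.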
Another advantage here is that this would allow us to potentially parallelize
chunking at a minor cost of overreading a bit for each block. This overreading
could be avoided if we knew we were going to read multiple blocks. For
example, if we know we want to read blocks 20-30 then we issue reads for blocks
20-31. As soon as any two consecutive blocks are loaded we can start parsing
the lower block of the pair.
So the algorithm for each block boils down to:
Although...now that I type this up...I remember a potential flaw in this logic.
Finding the "first line delimiter" in a block can be an impossible problem if
newlines are allowed inside of quoted values. Though maybe we don't need to
support that case, I don't recall.
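
For reference, a minimal sketch of the byte range idea in Python, under the assumption that newlines never appear inside values (so it does not address the flaw above). This is not the Arrow CSV reader; `read_lines_in_range` is a hypothetical helper working on an in-memory buffer. It applies the same "first byte falls in the range" rule to lines, which is one possible reading of the proposal: a line belongs to the range containing its first byte, and the last line may run past the end of the range.

```python
from typing import List


def read_lines_in_range(data: bytes, start: int, length: int) -> List[bytes]:
    """Return the lines whose first byte lies in [start, start + length).

    Assumes b"\n" always ends a row (no newlines inside values).
    """
    end = min(start + length, len(data))
    if start == 0:
        pos = 0
    else:
        # Search from start - 1 so that a line beginning exactly at
        # `start` is kept rather than skipped.
        nl = data.find(b"\n", start - 1)
        if nl == -1:
            return []
        pos = nl + 1
    lines = []
    while pos < end:
        nl = data.find(b"\n", pos)
        # Read to the end of the line, even if that is past `end`.
        stop = len(data) if nl == -1 else nl + 1
        lines.append(data[pos:stop])
        pos = stop
    return lines
```

With this rule, adjacent ranges can be processed independently and every line is returned by exactly one of them.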
CC [~apitrou] for additional thoughts
> Add Byte Range to CSV Reader ReadOptions
> ----------------------------------------
>
> Key: ARROW-17313
> URL: https://issues.apache.org/jira/browse/ARROW-17313
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Reporter: Ziheng Wang
> Assignee: Ziheng Wang
> Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do
> that is to pass in a list of byte ranges to CSV read options that specify
> where in the CSV you want to read. These byte ranges don't necessarily have
> to be aligned on line break boundaries; the CSV reader should just read until
> the end of the line, and skip anything before the first line break in a byte
> range.