[
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575489#comment-17575489
]
Ziheng Wang commented on ARROW-17313:
-------------------------------------
My proposal is to allow additional fields in ScanOptions that specify the
byte ranges to read for each fragment.
Those byte ranges will be aligned to line breaks when OpenReaderAsync is
called, potentially in another async function that samples the file around the
byte range boundaries and figures out where the line breaks are.
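The alignment step could be sketched as follows. This is a minimal
illustration in Python rather than Arrow's C++ code; the names
align_to_line_break and align_range are hypothetical, and it assumes newlines
never occur inside quoted CSV fields. Each boundary is moved forward to the
next line start, so adjacent ranges stay non-overlapping and every line is
read exactly once.

```python
import io


def align_to_line_break(f, offset, size, chunk_size=64):
    """Return the first line-start position >= offset, or size if there is none."""
    if offset <= 0:
        return 0
    if offset >= size:
        return size
    # Start one byte early so an offset that already sits at a line start is kept.
    pos = offset - 1
    f.seek(pos)
    while pos < size:
        chunk = f.read(chunk_size)  # sample a small window around the boundary
        if not chunk:
            break
        idx = chunk.find(b"\n")
        if idx != -1:
            return pos + idx + 1  # position just after the line break
        pos += len(chunk)
    return size


def align_range(f, start, end, size):
    """Align both ends of a half-open byte range [start, end) to line starts."""
    return (align_to_line_break(f, start, size),
            align_to_line_break(f, end, size))


data = b"a,b\n1,2\n3,4\n5,6\n"
f = io.BytesIO(data)
aligned = align_range(f, 5, 10, len(data))
```

Because both ends of a range use the same rule, splitting a file at arbitrary
offsets yields aligned ranges that tile the file without gaps or duplicates.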
Then these aligned byte ranges will be used to create a MaskedRandomAccessFile
object, a new object that exposes the full RandomAccessFile interface, except
that it uses seek to skip the bytes it is not supposed to read. The skipped
bytes are never read, whether the file is on disk or on the network.
We pass this MaskedRandomAccessFile object to make a BufferedInputStream and a
StreamingReader without any further change in code. The CSV StreamingReader has
no idea that it is only reading partial chunks in the underlying file.
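To illustrate the idea, here is a rough Python analogue of the masked file.
It is a sketch only: MaskedFile is a hypothetical stand-in for the proposed
MaskedRandomAccessFile, it implements just sequential reads, and it assumes
the ranges are already line-aligned and non-overlapping. The buffered reader
and CSV parser layered on top are standard-library classes used unchanged.

```python
import csv
import io


class MaskedFile(io.RawIOBase):
    """Read-only view exposing only the given [start, end) byte ranges."""

    def __init__(self, raw, ranges):
        super().__init__()
        self._raw = raw                # underlying seekable binary file
        self._ranges = sorted(ranges)  # assumed non-overlapping
        self._index = 0                # range currently being read
        self._pos = self._ranges[0][0] if self._ranges else 0

    def readable(self):
        return True

    def readinto(self, buf):
        # Advance past exhausted ranges; the gap between them is never read.
        while (self._index < len(self._ranges)
               and self._pos >= self._ranges[self._index][1]):
            self._index += 1
            if self._index < len(self._ranges):
                self._pos = self._ranges[self._index][0]
        if self._index == len(self._ranges):
            return 0  # end of the masked view
        end = self._ranges[self._index][1]
        n = min(len(buf), end - self._pos)
        self._raw.seek(self._pos)  # seek instead of reading skipped bytes
        chunk = self._raw.read(n)
        buf[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


data = b"a,b\n1,2\n3,4\n5,6\n7,8\n"
masked = MaskedFile(io.BytesIO(data), [(4, 8), (12, 20)])
# The buffered reader and the CSV parser are unaware of the mask.
text = io.TextIOWrapper(io.BufferedReader(masked), newline="")
rows = list(csv.reader(text))
```

Here rows holds only the records from the two selected ranges; the header and
the "3,4" line are skipped without being read from the underlying file.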
The alternative is to deal with this in the CSV StreamingReaderImpl. However,
this is very complicated, as it can only access a BufferedInputStream, which is
not seekable. Adding seek functionality to InputStream probably doesn't make
sense when the underlying InputStream is not a file.
> Add Byte Range to CSV Reader ReadOptions
> ----------------------------------------
>
> Key: ARROW-17313
> URL: https://issues.apache.org/jira/browse/ARROW-17313
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Reporter: Ziheng Wang
> Assignee: Ziheng Wang
> Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do
> that is to pass in a list of byte ranges to CSV read options that specify
> where in the CSV you want to read. These byte ranges don't necessarily have
> to be aligned on line break boundaries; the CSV reader should just read until
> the end of the line, and skip anything before the first line break in a byte
> range.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)