[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575489#comment-17575489
 ] 

Ziheng Wang edited comment on ARROW-17313 at 8/5/22 12:02 AM:
--------------------------------------------------------------

My proposal is to add fields to ScanOptions that specify the byte ranges to 
read for each fragment. 

Those byte ranges will be aligned to line breaks when OpenReaderAsync is 
called, potentially in another async function that samples the file around the 
byte range boundaries and figures out where the line breaks are. 
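
To make the alignment step concrete, here is a minimal Python sketch (not Arrow 
code). The names next_line_start, align_range and probe_size are hypothetical. 
It assumes a range owns every line that starts after its first line break, 
that a range starting at offset 0 also owns the first line, and that fields 
contain no quoted newlines. 

```python
import io


def next_line_start(f, offset, probe_size=64):
    """Return the offset just past the first b"\n" at or after `offset`.

    Offset 0 is treated as already aligned. If no line break is found,
    the end-of-file offset is returned. Only small probes around the
    boundary are read, never the whole file.
    """
    if offset == 0:
        return 0
    f.seek(offset)
    pos = offset
    while True:
        chunk = f.read(probe_size)
        if not chunk:
            return pos  # reached end of file without finding a line break
        idx = chunk.find(b"\n")
        if idx != -1:
            return pos + idx + 1
        pos += len(chunk)


def align_range(f, start, end, probe_size=64):
    """Move both ends of [start, end) forward to the next line start."""
    return (next_line_start(f, start, probe_size),
            next_line_start(f, end, probe_size))


data = b"a,b\n1,2\n3,4\n5,6\n"
f = io.BytesIO(data)
# Three unaligned ranges that together cover the whole file.
aligned = [align_range(f, s, e) for s, e in [(0, 5), (5, 9), (9, 16)]]
```

Because both ends of a range are moved by the same rule, adjacent input ranges 
stay adjacent after alignment, so no line is read twice or dropped. 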

These aligned byte ranges will then be used to create a MaskedRandomAccessFile 
object, a new object that exposes the same interface as RandomAccessFile, 
except that it uses seek to skip the bytes it is not supposed to read. The 
skipped bytes are never read, whether the file is on disk or on the network. 

We pass this MaskedRandomAccessFile object to make a BufferedInputStream and a 
StreamingReader without any further change in code. The CSV StreamingReader has 
no idea that it is only reading partial chunks in the underlying file.
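
A rough Python sketch of the idea follows, using only the standard library 
rather than Arrow. MaskedFile is a hypothetical stand-in for the proposed 
MaskedRandomAccessFile and only covers sequential reads, not the full 
RandomAccessFile interface. io.BufferedReader and csv.reader play the roles of 
BufferedInputStream and StreamingReader. 

```python
import csv
import io


class MaskedFile(io.RawIOBase):
    """Expose only the given byte ranges of a seekable file as one stream.

    Hypothetical sketch: `ranges` is a list of (start, end) offsets with
    `end` exclusive, assumed non-overlapping and already line-aligned.
    """

    def __init__(self, raw, ranges):
        self._raw = raw
        self._ranges = sorted(ranges)
        self._index = 0
        self._pos = self._ranges[0][0] if self._ranges else 0

    def readable(self):
        return True

    def readinto(self, buf):
        while self._index < len(self._ranges):
            _, end = self._ranges[self._index]
            if self._pos < end:
                # Seek instead of reading, so masked bytes are never fetched.
                self._raw.seek(self._pos)
                data = self._raw.read(min(len(buf), end - self._pos))
                if not data:
                    return 0  # underlying file ended before the range did
                buf[:len(data)] = data
                self._pos += len(data)
                return len(data)
            # Current range is exhausted: jump to the start of the next one.
            self._index += 1
            if self._index < len(self._ranges):
                self._pos = self._ranges[self._index][0]
        return 0  # all ranges consumed, report end of stream


data = b"id,v\n1,a\n2,b\n3,c\n4,d\n"
# Keep the header (bytes 0-4) and the row "3,c" (bytes 13-16).
masked = MaskedFile(io.BytesIO(data), [(0, 5), (13, 17)])
buffered = io.BufferedReader(masked)
rows = list(csv.reader(io.TextIOWrapper(buffered, newline="")))
```

The CSV reader here is unmodified and sees one contiguous stream, which is the 
property the proposal relies on. 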

The alternative is to deal with this in the CSV StreamingReaderImpl. However, 
this is very complicated, as it can only access a BufferedInputStream, which 
is not seekable. Adding seek functionality to InputStream probably doesn't 
make sense when the underlying InputStream is not a file. 


> Add Byte Range to CSV Reader ReadOptions
> ----------------------------------------
>
>                 Key: ARROW-17313
>                 URL: https://issues.apache.org/jira/browse/ARROW-17313
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Ziheng Wang
>            Assignee: Ziheng Wang
>            Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries; the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
