[jira] [Commented] (ARROW-17313) [C++] Add Byte Range to CSV Reader ReadOptions

Ziheng Wang (Jira) Fri, 05 Aug 2022 16:29:10 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576079#comment-17576079
 ]


Ziheng Wang commented on ARROW-17313:
-------------------------------------

Ideally we would update the Dataset Scanner so that it can take a different 
byte range for each fragment. Or is this not required?

A complication is that fragments currently don't seem to have any sort of 
"ID", so it might be hard for a user to specify which fragments should read 
which byte ranges. One way to do this would be to let the user pass a dict in 
the ScanOptions, something like {file_path1: byte_range1, file_path2: 
byte_range2}. I think this would make sense.
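
To make the dict idea concrete, here is a minimal Python sketch of a lookup 
keyed by file path. Everything in it is an assumption for illustration: the 
(start, end) tuple convention, the example paths, and the helper name 
byte_range_for_fragment are hypothetical and are not part of the Arrow API.

```python
from typing import Dict, Optional, Tuple

# Hypothetical convention: a byte range is (start, end), with end exclusive.
ByteRange = Tuple[int, int]


def byte_range_for_fragment(
    path: str, ranges: Dict[str, ByteRange]
) -> Optional[ByteRange]:
    """Return the byte range registered for a fragment's file path.

    None means no range was given, so the whole file would be read.
    """
    return ranges.get(path)


# Example mapping a user might pass in (paths are made up).
byte_ranges = {
    "data/part-0.csv": (0, 1024),
    "data/part-1.csv": (4096, 8192),
}
```

Keying on the file path sidesteps the missing fragment "ID", at the cost of 
assuming each fragment maps to exactly one file.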

Alternatively, if this is not going to be supported, then this option only 
makes sense for a dataset with one fragment. Perhaps I'll just add a check in 
the FragmentsToBatches function or something.
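
The kind of check I have in mind could look roughly like the Python sketch 
below. It only illustrates the validation logic; the function name 
check_single_fragment and its arguments are hypothetical, and the real check 
would live in the C++ scanner code rather than in Python.

```python
from typing import Optional, Sequence, Tuple


def check_single_fragment(
    fragment_paths: Sequence[str], byte_range: Optional[Tuple[int, int]]
) -> None:
    """Reject a byte range when the dataset has more than one fragment.

    Hypothetical helper: a single byte range cannot be attributed to a
    particular file if several fragments are being scanned.
    """
    if byte_range is not None and len(fragment_paths) != 1:
        raise ValueError(
            "a byte range is only supported for a dataset with exactly one "
            f"fragment, got {len(fragment_paths)}"
        )
```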

> [C++] Add Byte Range to CSV Reader ReadOptions
> ----------------------------------------------
>
>                 Key: ARROW-17313
>                 URL: https://issues.apache.org/jira/browse/ARROW-17313
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Ziheng Wang
>            Assignee: Ziheng Wang
>            Priority: Major
>
> Sometimes it's desirable to read just a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries; the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  
> Based on discussion, the scope is going to be reduced here. The first 
> implementation will support a single byte range that is assumed to be 
> already aligned on line break boundaries. 
> It will not handle quotes/returns and other edge cases.
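
For reference, the unaligned behaviour described at the top of the issue 
(before the scope was reduced) can be sketched in plain Python as follows. 
This is not the Arrow implementation. It assumes "\n" line endings and an 
exclusive end offset, treats a line as belonging to the range in which its 
first byte falls, and ignores quoted newlines and header handling. The 
function name read_lines_in_range is hypothetical.

```python
import io
from typing import List


def read_lines_in_range(data: bytes, start: int, end: int) -> List[bytes]:
    """Return the lines whose first byte lies in [start, end).

    A partial line at the beginning of the range is skipped, and the last
    line is read through to its line break even if it extends past `end`.
    """
    stream = io.BytesIO(data)
    if start > 0:
        # Step back one byte and read through the next line break. If the
        # byte before `start` is a newline, this consumes only that byte,
        # so a line beginning exactly at `start` is kept.
        stream.seek(start - 1)
        stream.readline()
    lines = []
    while stream.tell() < end:
        line = stream.readline()
        if not line:  # end of data
            break
        lines.append(line)
    return lines


csv_bytes = b"a,b\n1,2\n3,4\n5,6\n"
first_half = read_lines_in_range(csv_bytes, 0, 6)
second_half = read_lines_in_range(csv_bytes, 6, len(csv_bytes))
```

With this convention, adjacent ranges never return the same line twice and 
never drop one, which is what makes the ranges usable for parallel reads.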



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
