[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576880#comment-17576880
 ] 

Weston Pace commented on ARROW-17313:
-------------------------------------

Would it help to think of these not as byte ranges but as percentages?  I'm 
pretty sure the goal is just to be able to split a scan specification into 
subtasks.  Those subtasks could then be divided amongst processes, divided
amongst servers, or simply run piecemeal so that partial success and retry are
simpler (I think this might be [~marsupialtail]'s end goal).
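
To make the "percentages" idea concrete, here is a minimal sketch of how a file
could be divided into equal fractions that are then expressed as byte ranges.
The function name and the (offset, length) representation are assumptions made
for illustration only; this is not an existing Arrow API.

```python
def split_into_byte_ranges(file_size: int, num_tasks: int) -> list[tuple[int, int]]:
    """Divide [0, file_size) into num_tasks contiguous (offset, length) ranges.

    Hypothetical helper: each range covers roughly 1/num_tasks of the file.
    Integer arithmetic guarantees the ranges are contiguous and cover the
    whole file, even when file_size is not divisible by num_tasks.
    """
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    ranges = []
    for i in range(num_tasks):
        start = file_size * i // num_tasks
        end = file_size * (i + 1) // num_tasks
        ranges.append((start, end - start))
    return ranges
```

Each resulting range could be handed to a separate process or server, or
retried on its own if it fails.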

"Repartition the data into smaller files" should always work but I don't know 
that this is always an acceptable option.

> [C++] Add Byte Range to CSV Reader ReadOptions
> ----------------------------------------------
>
>                 Key: ARROW-17313
>                 URL: https://issues.apache.org/jira/browse/ARROW-17313
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Ziheng Wang
>            Assignee: Ziheng Wang
>            Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries; the CSV reader should just read until 
> the end of the line and skip anything before the first line break in a byte 
> range.  
> Based on discussion, the scope is going to be reduced here. The first 
> implementation will support a single byte range that is already assumed to be 
> aligned on line boundaries. 
> It will not handle quotes/returns and other edge cases.
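
The line-alignment rule in the original description (skip anything before the
first line break in a range, then read through the end of the line that
straddles the end of the range) can be sketched roughly as below. This is only
an illustration on an in-memory buffer, not the Arrow C++ implementation; the
function name is hypothetical, and like the reduced scope above it ignores
quoted newlines. It looks back one byte from the start of the range so that a
range beginning exactly on a line start keeps that line, which means adjacent
ranges yield every line exactly once.

```python
def lines_in_byte_range(data: bytes, start: int, length: int) -> list[bytes]:
    """Return the lines of data whose first byte lies in [start, start + length).

    Hypothetical helper for illustration; assumes "\n" line endings and no
    newlines inside quoted fields.
    """
    end = min(start + length, len(data))
    pos = start
    if start > 0:
        # Skip the partial line at the front of the range. Searching from
        # start - 1 means a range that begins right after a newline does not
        # lose its first line.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return []
        pos = newline + 1
    lines = []
    # A line is emitted if it starts inside the range, even if it ends past it.
    while pos < end:
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline + 1
        lines.append(data[pos:line_end])
        pos = line_end
    return lines
```

With this rule, two unaligned ranges that together cover a file return the
same lines as reading the whole file once, with no duplicates and no gaps.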



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
