[jira] [Commented] (ARROW-17313) [C++] Add Byte Range to CSV Reader ReadOptions

Weston Pace (Jira) Fri, 05 Aug 2022 09:04:40 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575930#comment-17575930
 ]


Weston Pace commented on ARROW-17313:
-------------------------------------

> It's not too late to change the Substrait spec, is it?
> Or we can raise NotImplemented if the offset is ever non-zero.

Raising "not implemented" in this case is fine, I'm sure.  If it can't be done 
then it can't be done.  Perhaps we can avoid most of these cases by also 
reading a little bit (e.g. 32 bytes) before the beginning of the block.
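
A rough sketch of the "read a little bit early" idea, in Python for brevity.  
This is not an existing Arrow API: {{read_with_lookback}} and its parameters 
are hypothetical names, and the source is assumed to be any seekable binary 
file-like object.

```python
import io


def read_with_lookback(source, offset, length, lookback=32):
    """Read `length` bytes at `offset`, plus up to `lookback` bytes before it.

    Returns (prefix, block): `prefix` is context only and must not be parsed
    as rows; `block` is the byte range that was actually requested.
    """
    start = max(0, offset - lookback)  # clamp at the beginning of the file
    source.seek(start)
    buf = source.read((offset - start) + length)
    split = offset - start
    return buf[:split], buf[split:]


data = b"a,b\n1,2\n3,4\n"
prefix, block = read_with_lookback(io.BytesIO(data), offset=6, length=4, lookback=3)
```

The prefix is only there so that the reader can inspect the bytes just before 
the block when deciding where the first complete row starts.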

> The sample block starts inside a "quoted" field

I think this is only a problem if we allow newlines in values.  We should 
reject a partial read if {{newlines_in_values}} is true.

> The first char of a block is "\n" but the last char of previous block is an 
> "escape"

Reading a bit early would help here, as long as it isn't a really long chain 
of escapes, which should be rare and detectable (we could error in this case).
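
A minimal sketch of that check, in Python for brevity.  The function name and 
its arguments are hypothetical, and it assumes escaping is enabled with a 
single-byte escape character.  It counts the run of escape bytes that ends 
right before the block: an odd run means the first byte of the block is 
escaped, and a run that fills the whole lookback window is treated as an error.

```python
def first_byte_is_escaped(prefix, prefix_reaches_file_start, escape=b"\\"):
    """Decide whether the byte right after `prefix` is escaped.

    `prefix` holds the bytes read just before the block.  Pass
    `prefix_reaches_file_start=True` when the prefix begins at byte 0 of the
    file, so that there is nothing earlier that could change the answer.
    """
    # Length of the run of escape bytes at the very end of the prefix.
    run = len(prefix) - len(prefix.rstrip(escape))
    if run == len(prefix) and not prefix_reaches_file_start:
        # The run may continue before the window, so its parity is unknown.
        raise ValueError("escape chain is longer than the lookback window")
    # Escapes pair up (an escaped escape); an odd one out escapes the next byte.
    return run % 2 == 1
```

If the block starts with a newline and this returns true, that newline is part 
of a value and not the end of a row.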

> Sample at middle of "\r\n" may also be confusing

Reading a bit early would help here too.
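
For example, a sketch in Python (hypothetical names, not an Arrow API) of 
picking the first row start when the previous bytes are available.  It assumes 
"\n", "\r" and "\r\n" are all accepted as line endings, and it ignores quoting 
and escaping entirely.

```python
import re

LINE_BREAK = re.compile(rb"\r\n|\r|\n")


def first_row_start(prefix, block):
    """Return the index in `block` where the first row owned by it begins.

    `prefix` holds the bytes just before the block (empty at file start).
    """
    if not prefix or prefix.endswith(b"\n"):
        return 0  # the block already begins on a row boundary
    if prefix.endswith(b"\r"):
        # A leading "\n" is the second half of a "\r\n" that was split by the
        # block boundary; it belongs to the previous line ending.
        return 1 if block.startswith(b"\n") else 0
    # Mid-row: skip everything up to and including the first line break.
    match = LINE_BREAK.search(block)
    return match.end() if match else len(block)
```

Without the prefix, a block starting with "\n" right after a "\r" could be 
mistaken for an empty row.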

> [C++] Add Byte Range to CSV Reader ReadOptions
> ----------------------------------------------
>
>                 Key: ARROW-17313
>                 URL: https://issues.apache.org/jira/browse/ARROW-17313
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Ziheng Wang
>            Assignee: Ziheng Wang
>            Priority: Major
>
> Sometimes it's desirable to read just a portion of a CSV. The best way to do 
> that is to pass a list of byte ranges to the CSV read options, specifying 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries; the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
