[jira] [Created] (ARROW-17529) Clean up how the CSV reader handles the first buffer

Ziheng Wang (Jira) Thu, 25 Aug 2022 10:48:07 -0700

Ziheng Wang created ARROW-17529:
-----------------------------------

             Summary: Clean up how the CSV reader handles the first buffer
                 Key: ARROW-17529
                 URL: https://issues.apache.org/jira/browse/ARROW-17529
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, Python
            Reporter: Ziheng Wang
            Assignee: Ziheng Wang



The way the CSV reader currently handles the first block of a CSV file is 
inefficient.

In fact, I think the first block is read more than once: first by the Peek in 
file_csv.cc, and then again by InitFromBlock in OpenReaderAsync in reader.cc.

This could be problematic if the first block is large, and it also delays the 
synchronous opening of a dataset.

One possible solution is to use a smaller block size for the peek in 
file_csv.cc, since you don't need to read the entire block to 
GetConvertOptions. We could simply add another option to reader_options, 
called first_peek_size or something like that.
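
To make the cost concrete, here is a small stdlib-only Python model of the idea. It is not Arrow code: CountingReader, peek_column_names and open_csv are hypothetical names invented for this sketch, and first_peek_size is the option proposed above, which does not exist yet. The model only assumes that the peek needs the header line and nothing more.

```python
import io


class CountingReader:
    """Hypothetical stand-in for a file: records how many bytes are read."""

    def __init__(self, data: bytes):
        self._stream = io.BytesIO(data)
        self.bytes_read = 0

    def read_at(self, offset: int, nbytes: int) -> bytes:
        self._stream.seek(offset)
        chunk = self._stream.read(nbytes)
        self.bytes_read += len(chunk)
        return chunk


def peek_column_names(reader: CountingReader, peek_size: int) -> list:
    """Read only peek_size bytes and parse the header line from them."""
    chunk = reader.read_at(0, peek_size)
    header, newline, _ = chunk.partition(b"\n")
    if not newline:
        raise ValueError("peek_size is too small to hold the header line")
    return header.decode("utf-8").rstrip("\r").split(",")


def open_csv(reader: CountingReader, block_size: int, first_peek_size=None):
    """Model of the two reads: a peek, then the real first block.

    With first_peek_size=None the peek uses the full block size, which
    mirrors the double read described in this issue.
    """
    peek_size = block_size if first_peek_size is None else first_peek_size
    names = peek_column_names(reader, peek_size)
    first_block = reader.read_at(0, block_size)
    return names, first_block


DATA = b"a,b,c\n" + b"1,2,3\n" * 1000  # 6006 bytes in total

full_peek = CountingReader(DATA)
open_csv(full_peek, block_size=4096)
print(full_peek.bytes_read)  # 8192: the first block is read twice

small_peek = CountingReader(DATA)
open_csv(small_peek, block_size=4096, first_peek_size=64)
print(small_peek.bytes_read)  # 4160: a 64-byte peek plus one full block
```

In this model the small peek must still be large enough to contain the whole header line, so a real option would probably need a fallback (for example, retrying with a larger size) when the header is longer than first_peek_size.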



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
