[GitHub] [arrow] n3world commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

GitBox Wed, 12 May 2021 07:10:30 -0700


n3world commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r631076862




##########
File path: cpp/src/arrow/csv/reader_test.cc
##########
@@ -216,5 +216,83 @@ TEST(StreamingReaderTests, NestedParallelism) {
   TestNestedParallelism(thread_pool, table_factory);
 }
 
+TEST(ReaderOptionsTests, SkipRowsAfterNames) {

Review comment:
       My guess is that it is kept simple because, per the comment, it is 
intended to skip corrupt rows, and for that it probably has to be a bit 
brute-force.
   
   While a more CSV-aware skip does add some complexity, that parsing 
knowledge is already contained in the BoundaryFinder implementations, so 
nothing new has to be written. Also, as a user specifying the number of rows 
to skip, I would expect CSV rows to be skipped, not file lines.
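
   To make the rows-versus-lines distinction concrete, here is a minimal 
sketch. Both helpers are hypothetical and are not Arrow code. It assumes `\n` 
row endings and a double-quote quoting character, and it ignores backslash 
escaping.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not Arrow API): returns the offset just past the first
// `num_lines` newline characters, ignoring CSV quoting entirely.
std::size_t SkipLines(std::string_view data, std::size_t num_lines) {
  std::size_t pos = 0;
  while (num_lines > 0 && pos < data.size()) {
    if (data[pos++] == '\n') --num_lines;
  }
  return pos;
}

// Hypothetical helper (not Arrow API): returns the offset just past the first
// `num_rows` CSV rows. A newline inside a quoted field does not end a row.
std::size_t SkipCsvRows(std::string_view data, std::size_t num_rows,
                        char quote = '"') {
  bool in_quotes = false;
  std::size_t pos = 0;
  while (num_rows > 0 && pos < data.size()) {
    const char c = data[pos++];
    if (c == quote) {
      // A doubled quote ("") toggles twice, so the state is unchanged.
      in_quotes = !in_quotes;
    } else if (c == '\n' && !in_quotes) {
      --num_rows;
    }
  }
  return pos;
}
```

   For the input `1,"x\ny"\n2,z\n`, skipping one line stops in the middle of 
the quoted field, while skipping one CSV row stops at the start of `2,z`.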
   
   If it sways your opinion at all, I did get an implementation working that 
uses the BlockReaders, Chunker and BoundaryFinders to skip over the lines, so 
the parser and everything downstream are unaware of the skip. It can also skip 
lines beyond a single block, satisfying ARROW-8527.
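
   As a rough illustration of skipping past a single block (this is not the 
implementation mentioned above, and `RowSkipper` is a hypothetical name), the 
skipper only needs to carry the remaining row count and the in-quotes state 
from one block to the next. The sketch makes the same simplifying assumptions: 
`\n` row endings, double-quote quoting, no backslash escaping.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical class (not Arrow API): skips a fixed number of CSV rows from a
// stream that arrives as consecutive blocks.
class RowSkipper {
 public:
  explicit RowSkipper(std::size_t rows_to_skip, char quote = '"')
      : remaining_(rows_to_skip), quote_(quote) {}

  // Drops skipped bytes from the front of `block` and returns the rest.
  // The result is empty if the whole block belongs to skipped rows.
  std::string_view Skip(std::string_view block) {
    std::size_t pos = 0;
    while (remaining_ > 0 && pos < block.size()) {
      const char c = block[pos++];
      if (c == quote_) {
        in_quotes_ = !in_quotes_;
      } else if (c == '\n' && !in_quotes_) {
        --remaining_;
      }
    }
    return block.substr(pos);
  }

  // True once all requested rows have been skipped.
  bool done() const { return remaining_ == 0; }

 private:
  std::size_t remaining_;
  char quote_;
  bool in_quotes_ = false;  // carried across block boundaries
};
```

   Each block is passed through `Skip` before being handed on, so whatever 
consumes the returned data never sees the skipped rows.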




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
