n3world commented on a change in pull request #10790:
URL: https://github.com/apache/arrow/pull/10790#discussion_r700563231
##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -324,9 +324,29 @@ class BlockParserImpl {
if (*(end - 1) == '\r') {
--end;
}
- return MismatchingColumns(batch_.num_cols_, num_cols,
- first_row_ < 0 ? -1 : first_row_ + batch_.num_rows_,
- util::string_view(start, end - start));
+ int32_t batch_row = batch_.num_rows_ + batch_.num_skipped_rows();
+ InvalidRow row{batch_.num_cols_, num_cols,
+ first_row_ < 0 ? -1 : first_row_ + batch_row,
+ util::string_view(start, end - start)};
+
+ if (options_.invalid_row_handler) {
+ if (options_.invalid_row_handler.value()(row) == InvalidRowResult::Skip) {
+ values_writer->RollbackLine();
+ parsed_writer->RollbackLine();
+ auto last_skip = batch_.skipped_rows_.rbegin();
+ if (last_skip == batch_.skipped_rows_.rend() ||
+ last_skip->second + 1 != batch_row) {
+ batch_.skipped_rows_.emplace_back(batch_row, batch_row);
+ } else {
+ last_skip->second = batch_row;
+ }
Review comment:
I'm trying to reduce the potential overhead when a lot of rows are skipped.
Recording skipped rows as ranges instead of individually also makes the
row number calculation in DataBatch::VisitColumn a little easier.
If you prefer a plain list of skipped rows, I can change it to that and
change how batch_row is calculated.
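
To make the range idea concrete, here is a minimal standalone sketch of the coalescing step. It is not the code in the PR: `SkippedRowRanges`, `Add`, and `NumSkipped` are hypothetical names used only for illustration, and the sketch assumes skipped rows are reported in strictly increasing order.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for the skipped-row bookkeeping (not Arrow's class).
struct SkippedRowRanges {
  // Each pair is an inclusive [first, last] range of skipped row numbers.
  std::vector<std::pair<int32_t, int32_t>> ranges;

  // Record one skipped row. Assumption: rows arrive in increasing order.
  void Add(int32_t row) {
    assert(ranges.empty() || row > ranges.back().second);
    if (!ranges.empty() && ranges.back().second + 1 == row) {
      // The row directly follows the last range, so extend that range.
      ranges.back().second = row;
    } else {
      // There is a gap (or no range yet), so start a new one-row range.
      ranges.emplace_back(row, row);
    }
  }

  // Total number of skipped rows across all ranges.
  int64_t NumSkipped() const {
    int64_t total = 0;
    for (const auto& range : ranges) {
      total += static_cast<int64_t>(range.second) - range.first + 1;
    }
    return total;
  }
};
```

With this layout a run of consecutive skipped rows costs one entry rather than one entry per row. For example, adding rows 3, 4, 5 and 9 leaves two ranges, [3, 5] and [9, 9].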