Jefffrey commented on code in PR #5679:
URL: https://github.com/apache/arrow-rs/pull/5679#discussion_r1590730608


##########
arrow-csv/src/reader/records.rs:
##########
@@ -172,6 +188,11 @@ impl RecordDecoder {
         self.num_rows = 0;
     }
 
+    /// Sets the decoder to allow rows with less than the expected number 
columns
+    pub fn set_allow_truncated_rows(&mut self, allow: bool) {
+        self.allow_truncated_rows = allow;
+    }

Review Comment:
   Maybe we can put this directly in the `new()`? I don't think `RecordDecoder` 
is a public API so should be good to change `new()` :thinking: 



##########
arrow-csv/src/reader/mod.rs:
##########
@@ -265,6 +266,14 @@ impl Format {
         self
     }
 
+    /// Whether to allow flexible lengths for records.

Review Comment:
   `truncated_rows` does sound clearer. Thought, `with_allow_truncated_rows()` 
does seem a bit unwieldy as a name; do you think `with_truncated_rows()` might 
be sufficient?



##########
arrow-csv/src/reader/records.rs:
##########
@@ -127,10 +134,19 @@ impl RecordDecoder {
                     }
                     ReadRecordResult::Record => {
                         if self.current_field != self.num_columns {
-                            return Err(ArrowError::CsvError(format!(
-                                "incorrect number of fields for line {}, 
expected {} got {}",
-                                self.line_number, self.num_columns, 
self.current_field
-                            )));
+                            if self.allow_truncated_rows && self.current_field 
< self.num_columns {
+                                // If the number of fields is less than 
expected, pad with nulls
+                                let fill_count = self.num_columns - 
self.current_field;
+                                let fill_value = self.offsets[self.offsets_len 
- 1];
+                                
self.offsets[self.offsets_len..self.offsets_len + fill_count]
+                                    .fill(fill_value);
+                                self.offsets_len += fill_count;
+                            } else {
+                                return Err(ArrowError::CsvError(format!(
+                                    "incorrect number of fields for line {}, 
expected {} got {}",
+                                    self.line_number, self.num_columns, 
self.current_field
+                                )));
+                            }

Review Comment:
   Not too familiar with the `RecordDecoder` code but this looks good to me 
:+1: 



##########
arrow-csv/test/data/truncated_rows.csv:
##########
@@ -0,0 +1,7 @@
+Name,Age,Occupation,DOB
+A1,34,Engineer,1985-07-16
+B2,29,Doctor
+,,

Review Comment:
   I'm curious if a completely empty line would introduce just a row of nulls?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to