connec opened a new issue, #18213:
URL: https://github.com/apache/datafusion/issues/18213

   ### Is your feature request related to a problem or challenge?
   
   We're working on the user experience of our tool that lets users upload a 
folder of CSV files (e.g.). DataFusion supports this natively, which is great, 
but if there's a decoding issue in any of the files a single `ArrowError` is 
reported for the entire operation, e.g.
   
   ```
   incorrect number of fields for line 1, expected 55 got 34
   ```
   
   It would be great if we could report in which file the issue was encountered.
   
   ```
   some/location.csv: incorrect number of fields for line 1, expected 55 got 34
   ```
   
   (Note that there's even more context needed to resolve this specific error. 
The resolution is to turn off headers, though I've not dug in to understand why 
that leads to overestimating the number of columns.)
   
   ### Describe the solution you'd like
   
   I first thought about using `Diagnostic` for this. A small patch that 
captures the location for CSV decoding errors is as follows:
   
   ```diff
   diff --git a/datafusion/datasource-csv/src/source.rs 
b/datafusion/datasource-csv/src/source.rs
   index 8b95d9ba9..c2e7788fa 100644
   --- a/datafusion/datasource-csv/src/source.rs
   +++ b/datafusion/datasource-csv/src/source.rs
   @@ -35,7 +35,7 @@ use datafusion_datasource::{
    
    use arrow::csv;
    use arrow::datatypes::SchemaRef;
   -use datafusion_common::{DataFusionError, Result, Statistics};
   +use datafusion_common::{DataFusionError, Diagnostic, Result, Statistics};
    use datafusion_common_runtime::JoinSet;
    use datafusion_datasource::file::FileSource;
    use datafusion_datasource::file_scan_config::FileScanConfig;
   @@ -405,7 +405,14 @@ impl FileOpener for CsvOpener {
                            input,
                            DecoderDeserializer::new(CsvDecoder::new(decoder)),
                        );
   -                    Ok(stream.map_err(Into::into).boxed())
   +                    Ok(stream
   +                        .map_err(move |error| {
   +                            DataFusionError::from(error).context(format!(
   +                                "decoding error in {}",
   +                                file_meta.object_meta.location
   +                            ))
   +                        })
   +                        .boxed())
                    }
                }
            }))
   
   ```
   
   This gives an error such as:
   
   ```
   decoding error in some/location.csv
   caused by
   Arrow error: Csv error: incorrect number of fields for line 1, expected 55 
got 34
   ```
   
   This is OK, but there are other approaches that might make more sense so I 
wanted to open the issue first in case there are opinions:
   
   1. On the basis that most arrow errors will be associated with a particular 
location, `DataFusionError::CsvError` could be converted into a struct variant 
with a new `location: Option<Path>` field.
   
      This would probably be a pretty disruptive change (and an API breaking 
change), and I haven't checked whether it's actually the case that most arrow 
errors are associated with a location.
   
   2. We could introduce a new `DataFusionError` variant specifically for 
decoding errors and use that instead of the generic `ArrowError`, e.g.
   
      ```rust
      DecodingError {
          inner: Box<ArrowError>, // Not sure why `Box`, but copying 
`DataFusionError::ArrowError`
          location: Path,
      },
      ```
   
      Since `DataFusionError` is exhaustive this would still be an API breaking 
change, but it would be much less disruptive to introduce.
   
   3. Something else?
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to