Re: [I] Empty strings not interpreted as null when reading CSV files [arrow-datafusion]

via GitHub Sun, 15 Oct 2023 04:20:10 -0700


haohuaijin commented on issue #7797:
URL: 
https://github.com/apache/arrow-datafusion/issues/7797#issuecomment-1763358333


   I found that arrow-csv has the same problem: it appears that arrow-csv never 
sets a string value to null. See the link below:
   
https://github.com/apache/arrow-rs/blob/bb8e42f6392284f4a7a39d3eec74144a603b481c/arrow-csv/src/reader/mod.rs#L792-L795
   ```rust
   DataType::Utf8 => Ok(Arc::new(
       rows.iter()
           .map(|row| Some(row.get(i)))
           .collect::<StringArray>(),
   ```
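
To make the effect of `Some(row.get(i))` concrete, here is a minimal std-only sketch (no arrow types involved). `empty_as_null` is a hypothetical helper, not part of arrow-csv, and mapping empty fields to null is only one possible policy, shown here as an assumption:

```rust
/// Hypothetical helper (not an arrow-csv API): treat an empty CSV field as null.
fn empty_as_null(field: &str) -> Option<&str> {
    if field.is_empty() {
        None
    } else {
        Some(field)
    }
}

fn main() {
    // Raw field text of the c_string column from the example CSV in this comment.
    let fields = ["", "a", "b", "", ""];

    // Mirrors the quoted snippet: every field is wrapped in Some, so no value is null.
    let current: Vec<Option<&str>> = fields.iter().copied().map(Some).collect();

    // Assumed alternative policy: empty fields become None (null).
    let with_nulls: Vec<Option<&str>> =
        fields.iter().copied().map(empty_as_null).collect();

    println!("{:?}", current); // [Some(""), Some("a"), Some("b"), Some(""), Some("")]
    println!("{:?}", with_nulls); // [None, Some("a"), Some("b"), None, None]
}
```

Collecting an iterator of `Option<&str>` is what decides nullness, so wrapping every field unconditionally in `Some` leaves no way for a string column to contain a null.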
   
   An example that shows this problem:
   
https://github.com/apache/arrow-rs/blob/bb8e42f6392284f4a7a39d3eec74144a603b481c/arrow-csv/src/reader/mod.rs#L1544-L1572
   ```csv
   c_int,c_float,c_string,c_bool,c_null
   ,,,,
   2,2.2,"a",TRUE,
   3,,"b",true,
   4,4.4,,False,
   5,6.6,"",FALSE,
   ```
   
   ```rust
       fn test_init_nulls_with_inference() {
           let format = Format::default().with_header(true).with_delimiter(b',');
   
           let mut file = File::open("test/data/init_null_test.csv").unwrap();
           let (schema, _) = format.infer_schema(&mut file, None).unwrap();
           file.rewind().unwrap();
   
           let mut csv = ReaderBuilder::new(Arc::new(schema))
               .with_format(format)
               .build(file)
               .unwrap();
   
           let batch = csv.next().unwrap().unwrap();
           println!("{:?}",batch);
       }
   ```
   The printed result is:
   ```
   RecordBatch { schema: Schema { fields: [Field { name: "c_int", data_type: 
Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 
Field { name: "c_float", data_type: Float64, nullable: true, dict_id: 0, 
dict_is_ordered: false, metadata: {} }, Field { name: "c_string", data_type: 
Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field 
{ name: "c_bool", data_type: Boolean, nullable: true, dict_id: 0, 
dict_is_ordered: false, metadata: {} }, Field { name: "c_null", data_type: 
Null, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], 
metadata: {} }, columns: [PrimitiveArray<Int64>
   [
     null,
     2,
     3,
     4,
     5,
   ], PrimitiveArray<Float64>
   [
     null,
     2.2,
     null,
     4.4,
     6.6,
   ], StringArray
   [
     "",
     "a",
     "b",
     "",
     "",
   ], BooleanArray
   [
     null,
     true,
     true,
     false,
     false,
   ], NullArray(5)], row_count: 5 }
   ```
   I also found that DataFusion's schema inference differs from arrow-csv's. For the same file, DataFusion infers `c_null` as `Utf8` rather than `Null`:
   ```
   RecordBatch { schema: Schema { fields: [Field { name: "c_int", data_type: 
Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 
Field { name: "c_float", data_type: Float64, nullable: true, dict_id: 0, 
dict_is_ordered: false, metadata: {} }, Field { name: "c_string", data_type: 
Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field 
{ name: "c_bool", data_type: Boolean, nullable: true, dict_id: 0, 
dict_is_ordered: false, metadata: {} }, Field { name: "c_null", data_type: 
Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], 
metadata: {} }, columns: [PrimitiveArray<Int64>
   [
     null,
     2,
     3,
     4,
     5,
   ], PrimitiveArray<Float64>
   [
     null,
     2.2,
     null,
     4.4,
     6.6,
   ], StringArray
   [
     "",
     "a",
     "b",
     "",
     "",
   ], BooleanArray
   [
     null,
     true,
     true,
     false,
     false,
   ], StringArray
   [
     "",
     "",
     "",
     "",
     "",
   ]], row_count: 5 }
   ```
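
The only difference between the two outputs is the all-empty `c_null` column: `Null` / `NullArray(5)` in arrow-csv versus `Utf8` / five empty strings in DataFusion. The std-only sketch below models that observation as a different fallback type for a column with no non-empty values. `Inferred` and `infer_text_column` are hypothetical names, and the sketch says nothing about how either library actually implements inference:

```rust
/// Hypothetical, simplified set of inferred types (not the arrow `DataType` enum).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Inferred {
    Null,
    Utf8,
}

/// Toy inference for a text column: if every field is empty, return the
/// caller-supplied fallback; otherwise call it Utf8. This is an assumption
/// used to illustrate the observed outputs, not either library's algorithm.
fn infer_text_column(fields: &[&str], all_empty_fallback: Inferred) -> Inferred {
    if fields.iter().all(|f| f.is_empty()) {
        all_empty_fallback
    } else {
        Inferred::Utf8
    }
}

fn main() {
    // The c_null column from the example CSV: five empty fields.
    let c_null = ["", "", "", "", ""];

    // Matches the arrow-csv output shown above.
    println!("{:?}", infer_text_column(&c_null, Inferred::Null)); // Null

    // Matches the DataFusion output shown above.
    println!("{:?}", infer_text_column(&c_null, Inferred::Utf8)); // Utf8
}
```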


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

