alamb commented on code in PR #9855:
URL: https://github.com/apache/arrow-rs/pull/9855#discussion_r3173159274
##########
parquet/src/arrow/array_reader/builder.rs:
##########
@@ -422,29 +422,33 @@ impl<'a> ArrayReaderBuilder<'a> {
let page_iterator = self.row_groups.column_chunks(col_idx)?;
let arrow_type = Some(field.arrow_type.clone());
+ // LogicalType::Unknown maps to DataType::Null. In the past it has
been assumed
+ // that only INT32 can have this annotation, but this is not required
by the Parquet
+ // specification. Since this can only annotate an entirely null
column, the data type
+ // used for the NullArrayReader should be irrelevant. It's just needed
to read the
+ // repetition and definition level data.
+ if matches!(arrow_type, Some(DataType::Null)) {
Review Comment:
pulling this up out of INT32 type makes sense to me
##########
parquet/src/arrow/schema/primitive.rs:
##########
@@ -194,8 +199,6 @@ fn from_int32(info: &BasicTypeInfo, scale: i32, precision:
i32) -> Result<DataTy
unit
)),
},
- //
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unknown-always-null
Review Comment:
The reson this is removed is that it is now handled higher up in
`from_parquet` 👍
##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -3686,6 +3686,76 @@ pub(crate) mod tests {
}
}
+ // test that we can handle the UNKNOWN logical type annotation on any
physical type
+ #[test]
+ fn test_unknown_logical_type() {
+ let message_type = "message uk {
+ OPTIONAL INT32 uki32 (UNKNOWN);
+ OPTIONAL INT64 uki64 (UNKNOWN);
+ OPTIONAL INT96 uki96 (UNKNOWN);
+ OPTIONAL BOOLEAN ukbool (UNKNOWN);
+ OPTIONAL FLOAT ukfloat (UNKNOWN);
+ OPTIONAL DOUBLE ukdbl (UNKNOWN);
+ OPTIONAL BYTE_ARRAY ukbytes (UNKNOWN);
+ OPTIONAL FIXED_LEN_BYTE_ARRAY(10) ukflba (UNKNOWN);
+ }";
+
+ let schema = Arc::new(parse_message_type(message_type).unwrap());
+ let file = tempfile::tempfile().unwrap();
+
+ let mut writer =
+ SerializedFileWriter::new(file.try_clone().unwrap(), schema,
Default::default())
+ .unwrap();
+
+ let mut row_group_writer = writer.next_row_group().unwrap();
+
+ fn write_nulls<T: DataType>(row_group_writer: &mut
SerializedRowGroupWriter<'_, File>) {
+ let mut column_writer =
row_group_writer.next_column().unwrap().unwrap();
+ // write out a bunch of nulls
+ column_writer
+ .typed::<T>()
+ .write_batch(&[], Some(&[0, 0, 0, 0]), None)
+ .unwrap();
+ column_writer.close().unwrap();
+ }
+
+ // INT32
+ write_nulls::<Int32Type>(&mut row_group_writer);
+
+ // INT64
+ write_nulls::<Int64Type>(&mut row_group_writer);
+
+ // INT96
+ write_nulls::<Int96Type>(&mut row_group_writer);
+
+ // BOOLEAN
+ write_nulls::<BoolType>(&mut row_group_writer);
+
+ // FLOAT
+ write_nulls::<FloatType>(&mut row_group_writer);
+
+ // DOUBLE
+ write_nulls::<DoubleType>(&mut row_group_writer);
+
+ // BYTE_ARRAY
+ write_nulls::<ByteArrayType>(&mut row_group_writer);
+
+ // FIXED_LEN_BYTE_ARRAY
+ write_nulls::<FixedLenByteArrayType>(&mut row_group_writer);
+
+ row_group_writer.close().unwrap();
+
+ writer.close().unwrap();
+
+ let mut reader = ParquetRecordBatchReader::try_new(file, 4).unwrap();
+ let batch = reader.next().unwrap().unwrap();
+
+ for col in batch.columns() {
+ assert_eq!(col.len(), 4);
Review Comment:
I think we should probably also assert the `DataType` of each column read
in. In this case, I think they should all be `DataType::Null`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]