alamb commented on code in PR #8715:
URL: https://github.com/apache/arrow-rs/pull/8715#discussion_r2500952434
##########
parquet/src/arrow/schema/complex.rs:
##########
@@ -229,6 +244,20 @@ impl Visitor {
}
}
+ if rep_level == 0 && def_level == 0 {
+ for virtual_column in self.virtual_columns {
+ // Ensure this is actually a virtual column
+ assert!(
+ super::virtual_type::is_virtual_column(virtual_column),
+ "Field '{}' is not a virtual column. Virtual columns must
have extension type names starting with 'arrow.virtual.'",
+ virtual_column.name()
+ );
+ child_fields.push(virtual_column.clone());
+ let child = convert_virtual_field(virtual_column, rep_level, def_level)?;
Review Comment:
I think it makes sense to append all virtual fields at the end of the real
schema.
I don't understand the need for passing the virtual columns into the type
visitor, as they are only converted at the top level (why not just append them
to the output schema?)
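A rough sketch of the suggested approach (hypothetical names; a stand-in `Field` struct is used here so the snippet is self-contained, where the real code would use `arrow_schema::Field`): convert the real Parquet schema first, then append the virtual fields to the resulting field list.

```rust
// Hypothetical sketch of appending virtual columns after schema
// conversion, instead of threading them through the type visitor.
// `Field` is a stand-in for arrow_schema::Field.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
}

fn append_virtual_columns(mut converted: Vec<Field>, virtual_columns: &[Field]) -> Vec<Field> {
    // Virtual fields always go at the end of the real schema.
    converted.extend(virtual_columns.iter().cloned());
    converted
}
```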
##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -388,6 +388,8 @@ pub struct ArrowReaderOptions {
/// If encryption is enabled, the file decryption properties can be provided
#[cfg(feature = "encryption")]
pub(crate) file_decryption_properties: Option<Arc<FileDecryptionProperties>>,
+
+ virtual_columns: Vec<Field>
Review Comment:
I think it should be `Vec<FieldRef>` (aka `Vec<Arc<Field>>`) -- making it
`&Field` would require adding a lifetime to the `ArrowReaderOptions` struct,
which I think would be a disruptive API change
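For illustration, a minimal sketch of the suggested shape (with a stand-in `Field` type; in arrow-rs, `FieldRef` is the `Arc<Field>` alias from `arrow_schema`):

```rust
use std::sync::Arc;

// Stand-in for arrow_schema::Field; FieldRef mirrors arrow's alias.
#[derive(Debug, PartialEq)]
struct Field {
    name: String,
}
type FieldRef = Arc<Field>;

#[derive(Default)]
struct ArrowReaderOptions {
    // Arc avoids both deep clones and the lifetime parameter that
    // storing &Field would force onto the struct.
    virtual_columns: Vec<FieldRef>,
}

impl ArrowReaderOptions {
    fn with_virtual_columns(mut self, virtual_columns: Vec<FieldRef>) -> Self {
        self.virtual_columns = virtual_columns;
        self
    }
}
```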
##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -516,6 +518,73 @@ impl ArrowReaderOptions {
}
}
+ /// Include virtual columns in the output.
+ ///
+ /// Virtual columns are columns that are not part of the Parquet schema, but are added to the output by the reader.
+ ///
+ /// # Example
+ /// ```
+ /// # use std::sync::Arc;
+ /// # use arrow_array::{ArrayRef, Int64Array, RecordBatch};
+ /// # use arrow_schema::{DataType, Field, Schema};
+ /// # use parquet::arrow::{ArrowWriter, RowNumber};
+ /// # use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
+ /// # use tempfile::tempfile;
+ /// #
+ /// # fn main() -> Result<(), Box<dyn std::error::Error>> {
+ /// // Create a simple record batch with some data
+ /// let values = Arc::new(Int64Array::from(vec![1, 2, 3])) as ArrayRef;
+ /// let batch = RecordBatch::try_from_iter(vec![("value", values)])?;
+ ///
+ /// // Write the batch to a temporary parquet file
+ /// let file = tempfile()?;
+ /// let mut writer = ArrowWriter::try_new(
+ /// file.try_clone()?,
+ /// batch.schema(),
+ /// None
+ /// )?;
+ /// writer.write(&batch)?;
+ /// writer.close()?;
+ ///
+ /// // Create a virtual column for row numbers
+ /// let row_number_field = Field::new("row_number", DataType::Int64, false)
+ /// .with_extension_type(RowNumber::default());
+ ///
+ /// // Configure options with virtual columns
+ /// let options = ArrowReaderOptions::new()
+ /// .with_virtual_columns(vec![row_number_field]);
+ ///
+ /// // Create a reader with the options
+ /// let mut reader = ParquetRecordBatchReaderBuilder::try_new_with_options(
+ /// file,
+ /// options
+ /// )?
+ /// .build()?;
+ ///
+ /// // Read the batch - it will include both the original column and the virtual row_number column
+ /// let result_batch = reader.next().unwrap()?;
+ /// assert_eq!(result_batch.num_columns(), 2); // "value" + "row_number"
+ /// assert_eq!(result_batch.num_rows(), 3);
+ /// #
+ /// # Ok(())
+ /// # }
+ /// ```
+ pub fn with_virtual_columns(self, virtual_columns: Vec<Field>) -> Self {
Review Comment:
Yes, this is exactly what I wanted to see.
I do think this should probably live elsewhere (it is kind of buried here), but
I couldn't immediately find a better location. I will try to follow up with
some other PRs to improve the docs.
##########
parquet/src/arrow/array_reader/row_number.rs:
##########
@@ -0,0 +1,107 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::arrow::array_reader::ArrayReader;
+use crate::errors::{ParquetError, Result};
+use crate::file::metadata::{ParquetMetaData, RowGroupMetaData};
+use arrow_array::{ArrayRef, Int64Array};
+use arrow_schema::DataType;
+use std::any::Any;
+use std::collections::HashSet;
+use std::sync::Arc;
+
+pub(crate) struct RowNumberReader {
+ buffered_row_numbers: Vec<i64>,
+ remaining_row_numbers: std::iter::Flatten<std::vec::IntoIter<std::ops::Range<i64>>>,
+}
+
+impl RowNumberReader {
+ pub(crate) fn try_new<'a>(
+ parquet_metadata: &'a ParquetMetaData,
+ row_groups: impl Iterator<Item = &'a RowGroupMetaData>,
+ ) -> Result<Self> {
+ // Collect ordinals from the selected row groups
+ let selected_ordinals: HashSet<i16> = row_groups
+ .map(|rg| {
+ rg.ordinal().ok_or_else(|| {
Review Comment:
I don't think the parquet reader needs to error out if the `ordinal` field
wasn't set in the parquet metadata -- as you say, we could pretty easily
assign that value as part of deserialization (thrift parsing).
That being said, we could improve the setting of ordinal positions during
decode as a follow-on issue.
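A sketch of the fallback this describes (hypothetical; the real change would live in the thrift decoding path, and `RowGroupMeta` is a stand-in for `RowGroupMetaData`): when `ordinal` is absent, assign it from the row group's position in the file.

```rust
// Hypothetical stand-in for RowGroupMetaData, keeping only the field
// relevant here: the optional ordinal parsed from the thrift metadata.
struct RowGroupMeta {
    ordinal: Option<i16>,
}

// Backfill missing ordinals from each row group's position so readers
// like RowNumberReader never need to error on a missing ordinal.
fn assign_missing_ordinals(row_groups: &mut [RowGroupMeta]) {
    for (idx, rg) in row_groups.iter_mut().enumerate() {
        if rg.ordinal.is_none() {
            rg.ordinal = Some(idx as i16);
        }
    }
}
```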
##########
parquet/src/arrow/schema/virtual_type.rs:
##########
@@ -0,0 +1,163 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! RowNumber
+//!
+
+use arrow_schema::{ArrowError, DataType, Field, extension::ExtensionType};
+
+/// Prefix for virtual column extension type names.
+const VIRTUAL_PREFIX: &str = "arrow.virtual.";
Review Comment:
We should probably pick something that isn't in the arrow namespace.
Maybe something like this:
```suggestion
const VIRTUAL_PREFIX: &str = "parquet.virtual.";
```
##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -5003,4 +5088,245 @@ mod tests {
assert!(sbbf.check(&"Hello"));
assert!(!sbbf.check(&"Hello_Not_Exists"));
}
+
+ #[test]
+ fn test_read_row_numbers() {
Review Comment:
I personally think we should write some more 'end to end' style tests, e.g.
tests that create the parquet reader and read out the row numbers in different
scenarios -- I will help write some shortly.
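One property such end-to-end tests could assert, sketched here as a self-contained helper (assuming row numbers are positions within the whole file, as the range-based `RowNumberReader` suggests): selecting a subset of row groups should still yield each row's original file-level position.

```rust
// Compute the row numbers an end-to-end test would expect when only
// `selected` row groups are read: each row keeps its global position.
fn expected_row_numbers(row_group_lengths: &[i64], selected: &[usize]) -> Vec<i64> {
    // Prefix sums give the first row number of each row group.
    let mut starts = Vec::with_capacity(row_group_lengths.len());
    let mut acc = 0i64;
    for len in row_group_lengths {
        starts.push(acc);
        acc += len;
    }
    selected
        .iter()
        .flat_map(|&rg| starts[rg]..starts[rg] + row_group_lengths[rg])
        .collect()
}
```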
##########
parquet/src/arrow/schema/virtual_type.rs:
##########
@@ -0,0 +1,163 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! RowNumber
+//!
+
+use arrow_schema::{ArrowError, DataType, Field, extension::ExtensionType};
+
+/// Prefix for virtual column extension type names.
+const VIRTUAL_PREFIX: &str = "arrow.virtual.";
+
+/// Macro to concatenate VIRTUAL_PREFIX with a suffix.
+macro_rules! virtual_name {
+ ($suffix:literal) => {
+ concat!("arrow.virtual.", $suffix)
+ };
+}
+
+/// Constants for virtual column type identifiers.
+mod virtual_column_type {
+ /// Row number virtual column.
+ pub(super) const ROW_NUMBER: u8 = 0;
+}
+
+/// Generic virtual column extension type.
+///
+/// This struct provides a common implementation for all virtual column types.
+///
+/// The storage type of the extension is `Int64`.
+#[derive(Debug, Default, Clone, Copy, PartialEq)]
+pub struct VirtualColumn<const TYPE: u8>;
+
+impl<const TYPE: u8> ExtensionType for VirtualColumn<TYPE> {
+ const NAME: &'static str = match TYPE {
+ virtual_column_type::ROW_NUMBER => virtual_name!("row_number"),
Review Comment:
If we used a distinct Rust type for each virtual column type, I think we
could avoid these panics, which would be nice.
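A sketch of that alternative (hypothetical trait and names; the real code would implement `arrow_schema::extension::ExtensionType`): one unit struct per virtual column makes each `NAME` a plain constant, so there is no `match` over a const generic left to panic.

```rust
// Hypothetical per-column trait mirroring the NAME associated const of
// ExtensionType; each virtual column gets its own zero-sized type.
trait VirtualColumnExt {
    const NAME: &'static str;
}

#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct RowNumber;

impl VirtualColumnExt for RowNumber {
    // Uses the parquet namespace suggested elsewhere in this review.
    const NAME: &'static str = "parquet.virtual.row_number";
}
```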
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]