etseidl commented on code in PR #9450:
URL: https://github.com/apache/arrow-rs/pull/9450#discussion_r2897311237


##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -261,13 +266,27 @@ impl<W: Write + Send> ArrowWriter<W> {
         let row_group_writer_factory =
             ArrowRowGroupWriterFactory::new(&file_writer, arrow_schema.clone());
 
+        let cdc_chunkers = match props_ptr.content_defined_chunking() {
+            Some(opts) => {
+                let chunkers = file_writer
+                    .schema_descr()
+                    .columns()
+                    .iter()
+                    .map(|desc| ContentDefinedChunker::new(desc, opts))
+                    .collect::<Result<Vec<_>>>()?;
+                Some(chunkers)
+            }
+            None => None,
+        };

Review Comment:
   ```suggestion
           let cdc_chunkers = props_ptr.content_defined_chunking().map(|opts| {
               file_writer
                   .schema_descr()
                   .columns()
                   .iter()
                   .map(|desc| ContentDefinedChunker::new(desc, opts))
                   .collect::<Result<Vec<_>>>()
           }).transpose()?;
   ```
   Can simplify this a bit.
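
   For anyone less familiar with the pattern, here is a standalone sketch (with made-up types; `build_chunkers` and the `u32` stand-ins are illustrative, not arrow-rs API) of how `Option::map` plus `Option::transpose` turns an `Option<Result<T>>` into a `Result<Option<T>>` so the trailing `?` can propagate a construction error:

   ```rust
// Standalone sketch of the map + transpose pattern suggested above.
// `build_chunkers` and the u32 stand-ins are illustrative, not arrow-rs API.
fn build_chunkers(opts: Option<u32>) -> Result<Option<Vec<u32>>, String> {
    // `map` yields Option<Result<Vec<u32>, String>>; `transpose` flips it to
    // Result<Option<Vec<u32>>, String> so `?` can surface the first error.
    let chunkers = opts
        .map(|o| {
            (0..3u32)
                .map(|i| {
                    if o == 0 {
                        Err("invalid options".to_string())
                    } else {
                        Ok(i * o)
                    }
                })
                .collect::<Result<Vec<_>, String>>()
        })
        .transpose()?;
    Ok(chunkers)
}

fn main() {
    // No options configured: no chunkers, and no error.
    assert_eq!(build_chunkers(None), Ok(None));
    // Valid options: one chunker per "column".
    assert_eq!(build_chunkers(Some(2)), Ok(Some(vec![0, 2, 4])));
    // A failing constructor propagates through `transpose()?`.
    assert!(build_chunkers(Some(0)).is_err());
}
   ```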



##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -335,10 +354,12 @@ impl<W: Write + Send> ArrowWriter<W> {
 
         let in_progress = match &mut self.in_progress {
             Some(in_progress) => in_progress,
-            x => x.insert(
-                self.row_group_writer_factory
-                    .create_row_group_writer(self.writer.flushed_row_groups().len())?,
-            ),
+            x => {

Review Comment:
   Is this change necessary? Or is it left over from earlier debugging?



##########
parquet/src/column/chunker/mod.rs:
##########
@@ -0,0 +1,40 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Content-defined chunking (CDC) for Parquet data pages.
+//!
+//! CDC creates data page boundaries based on content rather than fixed sizes,
+//! enabling efficient deduplication in content-addressable storage (CAS) systems.
+//! See [`CdcOptions`](crate::file::properties::CdcOptions) for configuration.
+
+mod cdc;
+mod cdc_generated;
+
+pub(crate) use cdc::ContentDefinedChunker;
+
+/// A chunk of data with level and value offsets for record-shredded nested data.
+#[derive(Debug, Clone, Copy)]
+pub(crate) struct Chunk {

Review Comment:
   "chunk" is an overloaded term (I keep thinking column chunk). What do you think of changing this to `CdcChunk`?



##########
parquet/src/arrow/arrow_writer/levels.rs:
##########
@@ -801,11 +802,55 @@ impl ArrayLevels {
     pub fn non_null_indices(&self) -> &[usize] {
         &self.non_null_indices
     }
+
+    /// Create a sliced view of this `ArrayLevels` for a CDC chunk.
+    pub(crate) fn slice_for_chunk(&self, chunk: &Chunk) -> Self {

Review Comment:
   I have trouble with calling this a "view" when it's actually allocating new vectors for the levels and non-null indices. I'm thinking out loud, but I wonder if we could create an actual `ArrayLevelsView` that uses proper slices of the underlying `Vec`s and pass that to `write_internal`.
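
   As a rough illustration of the borrowed-view idea (all names here, including `ArrayLevelsView` and its fields, are assumptions for the sketch rather than the PR's actual types):

   ```rust
use std::ops::Range;

// Sketch of the borrowed-view idea: `ArrayLevelsView` and these fields are
// assumptions for illustration, not the actual arrow-rs types. The view
// borrows sub-slices of the owner's Vecs, so slicing a chunk is O(1) and
// allocation-free, at the cost of a lifetime parameter on the consumer.
struct ArrayLevels {
    def_levels: Vec<i16>,
    non_null_indices: Vec<usize>,
}

struct ArrayLevelsView<'a> {
    def_levels: &'a [i16],
    non_null_indices: &'a [usize],
}

impl ArrayLevels {
    /// Borrow the ranges a CDC chunk covers instead of allocating new Vecs.
    fn view(&self, levels: Range<usize>, values: Range<usize>) -> ArrayLevelsView<'_> {
        ArrayLevelsView {
            def_levels: &self.def_levels[levels],
            non_null_indices: &self.non_null_indices[values],
        }
    }
}

fn main() {
    let levels = ArrayLevels {
        def_levels: vec![1, 0, 1, 1, 0, 1],
        non_null_indices: vec![0, 2, 3, 5],
    };
    let view = levels.view(2..5, 1..3);
    assert_eq!(view.def_levels, &[1, 1, 0]);
    assert_eq!(view.non_null_indices, &[2, 3]);
}
   ```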



##########
parquet/src/arrow/arrow_writer/levels.rs:
##########
@@ -2096,4 +2141,140 @@ mod tests {
         let v = Arc::new(array) as ArrayRef;
         LevelInfoBuilder::try_new(field, Default::default(), &v).unwrap()
     }
+
+    #[test]
+    fn test_slice_for_chunk_flat() {
+        // Required field (no levels): array [1..=6], slice values 2..5

Review Comment:
   ```suggestion
           // Required field (no levels): array [1..=6], slice values 3..=5
   ```
   ? Trying to understand `value_offset == 2`
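
   For what it's worth, the zero-based reading would look like this (the names are illustrative, not the PR's actual fields): with `value_offset == 2` and a length of 3, slicing `[1, 2, 3, 4, 5, 6]` selects the values 3..=5.

   ```rust
// Tiny illustration of zero-based offset semantics: `value_offset` and
// `value_count` are illustrative names, not the PR's actual struct fields.
fn main() {
    let values = [1, 2, 3, 4, 5, 6];
    let value_offset = 2; // zero-based index of the first value in the chunk
    let value_count = 3; // number of values the chunk covers
    let chunk = &values[value_offset..value_offset + value_count];
    assert_eq!(chunk, &[3, 4, 5]);
}
   ```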



##########
parquet/src/schema/types.rs:
##########
@@ -1218,12 +1240,14 @@ fn build_tree<'a>(
         Type::PrimitiveType { .. } => {
             let mut path: Vec<String> = vec![];
             path.extend(path_so_far.iter().copied().map(String::from));
-            leaves.push(Arc::new(ColumnDescriptor::new(
+            let mut desc = ColumnDescriptor::new(

Review Comment:
   You could add a `new_with_repeated_ancestor`, and perhaps change `new` to call that with the default value for `repeated_ancestor_def_level`.
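
   Something along these lines (a hedged sketch; the struct and both constructor names here are stand-ins, not the real `ColumnDescriptor` API):

   ```rust
// Hedged sketch of the layered-constructor suggestion; the struct and both
// constructor names are assumptions, not the real `ColumnDescriptor` API.
struct ColumnDescriptor {
    max_def_level: i16,
    repeated_ancestor_def_level: i16,
}

impl ColumnDescriptor {
    /// Existing callers keep this signature; it delegates with the default.
    fn new(max_def_level: i16) -> Self {
        Self::new_with_repeated_ancestor(max_def_level, 0)
    }

    /// New call sites that know the repeated-ancestor level use this directly.
    fn new_with_repeated_ancestor(
        max_def_level: i16,
        repeated_ancestor_def_level: i16,
    ) -> Self {
        Self {
            max_def_level,
            repeated_ancestor_def_level,
        }
    }
}

fn main() {
    let plain = ColumnDescriptor::new(3);
    assert_eq!(plain.max_def_level, 3);
    assert_eq!(plain.repeated_ancestor_def_level, 0);

    let nested = ColumnDescriptor::new_with_repeated_ancestor(3, 1);
    assert_eq!(nested.repeated_ancestor_def_level, 1);
}
   ```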


