pvary commented on code in PR #12298:
URL: https://github.com/apache/iceberg/pull/12298#discussion_r2549129899
##########
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkFormatModels.java:
##########
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.source;
+
+import org.apache.iceberg.avro.AvroFormatModel;
+import org.apache.iceberg.data.DeleteFilter;
+import org.apache.iceberg.formats.FormatModelRegistry;
+import org.apache.iceberg.orc.ORCFormatModel;
+import org.apache.iceberg.parquet.ParquetFormatModel;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.iceberg.spark.data.SparkAvroWriter;
+import org.apache.iceberg.spark.data.SparkOrcReader;
+import org.apache.iceberg.spark.data.SparkOrcWriter;
+import org.apache.iceberg.spark.data.SparkParquetReaders;
+import org.apache.iceberg.spark.data.SparkParquetWriters;
+import org.apache.iceberg.spark.data.SparkPlannedAvroReader;
+import org.apache.iceberg.spark.data.vectorized.VectorizedSparkOrcReaders;
+import org.apache.iceberg.spark.data.vectorized.VectorizedSparkParquetReaders;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+
+public class SparkFormatModels {
+  public static void register() {
+    FormatModelRegistry.register(
+        new AvroFormatModel<>(
+            InternalRow.class,
+            StructType.class,
+            SparkPlannedAvroReader::create,
+            (avroSchema, inputSchema) -> new SparkAvroWriter(inputSchema)));
+
+    FormatModelRegistry.register(
+        new ParquetFormatModel<InternalRow, StructType, DeleteFilter<InternalRow>>(
+            InternalRow.class,
+            StructType.class,
+            SparkParquetReaders::buildReader,
+            (icebergSchema, messageType, inputType) ->
+                SparkParquetWriters.buildWriter(
+                    SparkSchemaUtil.convert(icebergSchema), messageType)));
+
+    FormatModelRegistry.register(
+        new ParquetFormatModel<ColumnarBatch, StructType, DeleteFilter<InternalRow>>(
+            ColumnarBatch.class, StructType.class, VectorizedSparkParquetReaders::buildReader));
+
+    FormatModelRegistry.register(
+        new ParquetFormatModel<ColumnarBatch, StructType, DeleteFilter<InternalRow>>(
+            VectorizedSparkParquetReaders.CometColumnarBatch.class,
+            StructType.class,
+            VectorizedSparkParquetReaders::buildCometReader));

Review Comment:
   This one is a questionable decision. I didn't intend to put this commit on this branch. I just created it to test how it could work.
Currently (on main) the Comet/Arrow decision is made in `BaseBatchReader`: https://github.com/apache/iceberg/blob/0d4d3a562ffd339faae7e1db41c2068dc77920e8/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/BaseBatchReader.java#L96-L105

My original idea was to add another parameter to the `FormatModelRegistry.readBuilder` method, like `readBuilder(returnType, readerType, inputFile)`, so the `readerType` could choose between the Arrow and the Comet reader. Since this was only used for Spark/Parquet, your suggestion was to hide it behind a config and push the decision down into `VectorizedSparkParquetReaders.buildReader`. That is how I wanted to keep this PR.

I was playing around with how I could change this and use the currently proposed API to move the decision back to the caller. That is why I experimented with a [hacky solution to register the Comet vectorized reader with the File Format API](https://github.com/apache/iceberg/pull/12298/commits/4b6f7a5f577afe0d5f4742e4602cc34c14c0115d). I did not like what I saw, mostly because I had to create a `CometColumnarBatch` class for the sole purpose of differentiating between the Arrow and the Comet reader. So I abandoned the idea but forgot to revert the commit 😢
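For reference, this is roughly the shape I had in mind for the `readBuilder(returnType, readerType, inputFile)` variant. It is only a sketch: the overload, the `ReaderType` selector, and `newBatchReadBuilder` are hypothetical and not part of the API proposed in this PR.

```java
// Hypothetical sketch only: this readBuilder overload and the ReaderType enum do not
// exist in the PR; they illustrate letting the caller pick the Arrow or Comet reader.
import org.apache.iceberg.formats.FormatModelRegistry;
import org.apache.iceberg.io.InputFile;
import org.apache.spark.sql.vectorized.ColumnarBatch;

class ReadBuilderSketch {
  enum ReaderType {
    ARROW,
    COMET
  }

  Object newBatchReadBuilder(InputFile inputFile, boolean useComet) {
    // The caller (e.g. BaseBatchReader) keeps the Arrow/Comet decision and passes it
    // down as a selector, so the registry does not need a marker class like
    // CometColumnarBatch to tell the two ColumnarBatch readers apart.
    return FormatModelRegistry.readBuilder(
        ColumnarBatch.class, // returnType: the engine-facing batch type
        useComet ? ReaderType.COMET : ReaderType.ARROW, // readerType: the selector
        inputFile);
  }
}
```

With something like that, both vectorized readers stay keyed on plain `ColumnarBatch`, which is exactly what the `CometColumnarBatch` marker class in the accidental commit was trying to work around.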
