pvary commented on code in PR #12298:
URL: https://github.com/apache/iceberg/pull/12298#discussion_r2549129899
##########
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkFormatModels.java:
##########
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.source;
+
+import org.apache.iceberg.avro.AvroFormatModel;
+import org.apache.iceberg.data.DeleteFilter;
+import org.apache.iceberg.formats.FormatModelRegistry;
+import org.apache.iceberg.orc.ORCFormatModel;
+import org.apache.iceberg.parquet.ParquetFormatModel;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.iceberg.spark.data.SparkAvroWriter;
+import org.apache.iceberg.spark.data.SparkOrcReader;
+import org.apache.iceberg.spark.data.SparkOrcWriter;
+import org.apache.iceberg.spark.data.SparkParquetReaders;
+import org.apache.iceberg.spark.data.SparkParquetWriters;
+import org.apache.iceberg.spark.data.SparkPlannedAvroReader;
+import org.apache.iceberg.spark.data.vectorized.VectorizedSparkOrcReaders;
+import org.apache.iceberg.spark.data.vectorized.VectorizedSparkParquetReaders;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+
+public class SparkFormatModels {
+  public static void register() {
+    FormatModelRegistry.register(
+        new AvroFormatModel<>(
+            InternalRow.class,
+            StructType.class,
+            SparkPlannedAvroReader::create,
+            (avroSchema, inputSchema) -> new SparkAvroWriter(inputSchema)));
+
+    FormatModelRegistry.register(
+        new ParquetFormatModel<InternalRow, StructType, DeleteFilter<InternalRow>>(
+            InternalRow.class,
+            StructType.class,
+            SparkParquetReaders::buildReader,
+            (icebergSchema, messageType, inputType) ->
+                SparkParquetWriters.buildWriter(
+                    SparkSchemaUtil.convert(icebergSchema), messageType)));
+
+    FormatModelRegistry.register(
+        new ParquetFormatModel<ColumnarBatch, StructType, DeleteFilter<InternalRow>>(
+            ColumnarBatch.class, StructType.class, VectorizedSparkParquetReaders::buildReader));
+
+    FormatModelRegistry.register(
+        new ParquetFormatModel<ColumnarBatch, StructType, DeleteFilter<InternalRow>>(
+            VectorizedSparkParquetReaders.CometColumnarBatch.class,
+            StructType.class,
+            VectorizedSparkParquetReaders::buildCometReader));

Review Comment:
   This one is a questionable decision. I didn't intend to put this commit on this branch. I just created it to test how it could work.
Currently (on main) the Comet/Arrow decision is made in `BaseBatchReader`: https://github.com/apache/iceberg/blob/0d4d3a562ffd339faae7e1db41c2068dc77920e8/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/BaseBatchReader.java#L96-L105

My original idea was to add another parameter to the `FormatModelRegistry.readBuilder` method, like `readBuilder(returnType, readerType, inputFile)`, so the `readerType` could choose between the Arrow and the Comet reader. Since this was only used for Spark/Parquet, your suggestion was to hide it behind a config and push the decision down into `VectorizedSparkParquetReaders.buildReader`. That is how I wanted to keep this PR.

I was playing around with how I could change this and use the currently proposed API to move the decision back to the caller. That is why I experimented with a [hacky solution to register the Comet vectorized reader with the File Format API](https://github.com/apache/iceberg/pull/12298/commits/4b6f7a5f577afe0d5f4742e4602cc34c14c0115d). I did not like what I saw, mostly because I had to create a `CometColumnarBatch` class for the sole purpose of differentiating between the Arrow and the Comet reader. So I abandoned the idea but forgot to revert the commit 😢
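For reference, this is roughly the shape I had in mind for the `readBuilder(returnType, readerType, inputFile)` variant. It is only a sketch: the overload, the `ReaderType` selector, and `newBatchReadBuilder` are hypothetical and not part of the API proposed in this PR.

```java
// Hypothetical sketch only: this readBuilder overload and the ReaderType enum do not
// exist in the PR; they illustrate letting the caller pick the Arrow or Comet reader.
import org.apache.iceberg.formats.FormatModelRegistry;
import org.apache.iceberg.io.InputFile;
import org.apache.spark.sql.vectorized.ColumnarBatch;

class ReadBuilderSketch {
  enum ReaderType {
    ARROW,
    COMET
  }

  Object newBatchReadBuilder(InputFile inputFile, boolean useComet) {
    // The caller (e.g. BaseBatchReader) keeps the Arrow/Comet decision and passes it
    // down as a selector, so the registry does not need a marker class like
    // CometColumnarBatch to tell the two ColumnarBatch readers apart.
    return FormatModelRegistry.readBuilder(
        ColumnarBatch.class, // returnType: the engine-facing batch type
        useComet ? ReaderType.COMET : ReaderType.ARROW, // readerType: the selector
        inputFile);
  }
}
```

With something like that, both vectorized readers stay keyed on plain `ColumnarBatch`, which is exactly what the `CometColumnarBatch` marker class in the accidental commit was trying to work around.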
