parthchandra commented on code in PR #13786: URL: https://github.com/apache/iceberg/pull/13786#discussion_r2462151507
##########
parquet/src/main/java/org/apache/iceberg/parquet/CometVectorizedParquetReader.java:
##########
@@ -0,0 +1,287 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.parquet;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.Map;
+import java.util.NoSuchElementException;
+import java.util.function.Function;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.hadoop.HadoopInputFile;
+import org.apache.iceberg.io.CloseableGroup;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.CloseableIterator;
+import org.apache.iceberg.io.InputFile;
+import org.apache.iceberg.mapping.NameMapping;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.util.ByteBuffers;
+import org.apache.parquet.ParquetReadOptions;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ColumnPath;
+import org.apache.parquet.schema.MessageType;
+
+public class CometVectorizedParquetReader<T> extends CloseableGroup

Review Comment:
   The main reason is to make Comet's Parquet file reader available independent of the engine.

   - `iceberg-parquet` provides Parquet reading.
   - `iceberg-spark` provides the translation from Parquet to Spark's internal representation (`ColumnarBatch` in the case of vectorized readers).

   `CometVectorizedParquetReader` is essentially the first part: it provides a fast parallel reader and is usable by other engines like Flink.

   Because Iceberg separates Parquet file reading from the engine-specific code (a Good Thing, imho), we must create `CometVectorizedParquetReader` in the `iceberg-parquet` module. Classes such as `CometBridge` were created so that `iceberg-parquet` does not need a compile-time dependency on Comet. For `CometVectorizedParquetReader`, this also took care of the issue that, once Parquet classes are shaded in Iceberg, it is no longer possible to call any Comet method that has a Parquet class in its signature. (This is the original problem this PR set out to solve.)

   It looks like it might be possible to add an extra level of indirection and move all these classes to the Spark module, but we would probably lose the ability to use Comet's faster reader from Flink. If we are okay with that, I can try to do it that way.
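
To illustrate the kind of indirection described in the comment, here is a minimal sketch, not the PR's actual `CometBridge` code. It assumes a bridge interface whose signatures use only JDK types (so class relocation by shading cannot change them) and an implementation that is looked up by name at runtime (so the calling module needs no compile-time dependency on it). The names `BridgeSketch`, `FileReaderBridge`, `InMemoryBridge`, and `load` are all hypothetical and exist only for this example.

```java
import java.nio.ByteBuffer;
import java.util.Optional;

public class BridgeSketch {

  /**
   * Hypothetical bridge interface. Every parameter and return type is a JDK
   * type, so relocating third-party packages does not alter these signatures.
   */
  public interface FileReaderBridge {
    ByteBuffer readColumnChunk(String path, long offset, int length);
  }

  /**
   * Stand-in implementation used only to exercise the sketch. A real
   * engine-side implementation would delegate to the native reader instead
   * of synthesizing bytes.
   */
  public static class InMemoryBridge implements FileReaderBridge {
    @Override
    public ByteBuffer readColumnChunk(String path, long offset, int length) {
      ByteBuffer buf = ByteBuffer.allocate(length);
      for (int i = 0; i < length; i++) {
        buf.put((byte) (offset + i)); // fake payload derived from the offset
      }
      buf.flip(); // switch the buffer from writing to reading
      return buf;
    }
  }

  /**
   * Loads an implementation by class name. If the class is missing from the
   * classpath, or is not a FileReaderBridge, the caller gets an empty
   * Optional and can fall back to another reader.
   */
  public static Optional<FileReaderBridge> load(String className) {
    try {
      Class<?> cls = Class.forName(className);
      Object instance = cls.getDeclaredConstructor().newInstance();
      return Optional.of((FileReaderBridge) instance);
    } catch (ReflectiveOperationException | LinkageError | ClassCastException e) {
      return Optional.empty();
    }
  }
}
```

In this sketch, a caller would invoke `BridgeSketch.load(...)` with the implementation's class name and fall back to a different reader when the result is empty. Whether the real bridge classes use reflection, a service lookup, or some other mechanism is not stated in the comment above.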
