Shekharrajak opened a new issue, #3756: URL: https://github.com/apache/datafusion-comet/issues/3756
We should use the matrix below to check for missing implementations that could accelerate Spark Iceberg pipelines using Comet.

### READ PATH

| Feature | Iceberg Java | iceberg-rust | datafusion-comet (via iceberg-rust) |
|---|---|---|---|
| Basic Parquet scan | Yes | Yes | Yes - IcebergScanExec |
| Positional deletes (V2 MoR) | Yes | Yes - ArrowReader + DeleteVector (RoaringTreemap) | Yes - delegates to ArrowReader with row_selection_enabled(true) |
| Equality deletes (V2 MoR) | Yes | Yes - ArrowReader builds equality delete predicates | Yes - delegates to ArrowReader |
| Deletion vectors (V3) | Yes - DVUtil, DVFileWriter, DVIterator | Yes - DeleteVector + Puffin deletion-vector-v1 blob support | Not wired - Comet doesn't pass DV metadata via protobuf |
| Schema evolution | Yes | Yes | Yes - IcebergStreamWrapper adapts batches to target schema |
| Partition pruning (static) | Yes | Yes | Yes - partitions serialized in protobuf |
| Dynamic partition pruning | Yes (Spark) | N/A (engine-level) | Yes - CometIcebergNativeScanExec defers serialization for DPP |
| Row-group filtering (residuals) | Yes | Yes | Yes - residual predicates converted to iceberg::expr::BoundPredicate |
| Identity partition columns | Yes | Yes | Yes |
| Object stores (S3/GCS/OSS) | Yes (Hadoop FS) | Yes (OpenDAL) | Yes (OpenDAL via FileIOBuilder) |
| V1 spec | Yes | Yes | Yes |
| V2 spec | Yes | Yes | Yes |
| V3 spec metadata | Yes | Yes (FormatVersion::V3, next_row_id, row lineage) | Not used - Comet doesn't handle V3-specific metadata |

### WRITE PATH

| Feature | Iceberg Java | iceberg-rust | datafusion-comet |
|---|---|---|---|
| Data file writing | Yes - DataWriter | Yes - DataFileWriter | No - uses raw parquet crate, not iceberg-rust |
| Partitioned writes (sorted) | Yes - ClusteredDataWriter | Yes - ClusteredWriter | No - writes single file per Spark partition |
| Partitioned writes (fanout) | Yes - FanoutDataWriter | Yes - FanoutWriter | No |
| Rolling file writer | Yes | Yes - RollingFileWriter | No |
| Equality delete writer | Yes | Yes - EqualityDeleteWriter | No |
| Position delete writer | Yes | Partial | No |
| Deletion vector writer | Yes - DVFileWriter, PartitioningDVWriter | No explicit DV writer | No |
| AppendFiles / FastAppend | Yes - AppendFiles | Yes - FastAppendAction | No - commit done in Java |
| OverwriteFiles | Yes - OverwriteFiles | Missing | No |
| ReplacePartitions | Yes - ReplacePartitions | Missing | No |
| DeleteFiles | Yes - DeleteFiles | Missing | No |
| RowDelta | Yes - RowDelta | Missing | No |
| RewriteFiles | Yes - RewriteFiles | Missing | No |
| Transaction + commit | Yes - full atomic commit | Yes - Transaction::commit() with retry | No - commit is JVM-side |

### ROW-LEVEL OPERATIONS (DELETE/UPDATE/MERGE)

| Feature | Iceberg Java + Spark | iceberg-rust | datafusion-comet |
|---|---|---|---|
| Copy-on-Write (CoW) scan | Yes - SparkCopyOnWriteScan | No CoW scan | No |
| Copy-on-Write write | Yes - rewrite affected data files | Partial (rewrite manually) | No |
| Merge-on-Read (MoR) scan | Yes - buildMergeOnReadScan() | Yes - ArrowReader applies deletes | Yes (read only) |
| MoR position delta write | Yes - SparkPositionDeltaWrite | No | No |
| DELETE FROM | Yes (CoW or MoR) | No action | No |
| UPDATE | Yes (CoW or MoR) | No action | No |
| MERGE INTO | Yes - SparkRowLevelOperationBuilder | No (issue #2201) | No |
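
One way to use the matrix is to split the gaps into two groups: features iceberg-rust already supports that Comet has not wired up, and features that would first need work in iceberg-rust. The sketch below does this over a subset of rows copied from the tables above. The status strings are simplified to the leading word of each cell, and the names `MATRIX`, `wiring_gaps`, and `upstream_gaps` are hypothetical helpers for illustration, not part of Comet or iceberg-rust.

```python
# (feature, iceberg-rust status, Comet status), simplified from the matrix rows.
# Only a subset of rows is listed here; extend as needed.
MATRIX = [
    ("Basic Parquet scan", "Yes", "Yes"),
    ("Deletion vectors (V3)", "Yes", "Not wired"),
    ("V3 spec metadata", "Yes", "Not used"),
    ("Data file writing", "Yes", "No"),
    ("Partitioned writes (fanout)", "Yes", "No"),
    ("Rolling file writer", "Yes", "No"),
    ("Equality delete writer", "Yes", "No"),
    ("Position delete writer", "Partial", "No"),
    ("AppendFiles / FastAppend", "Yes", "No"),
    ("OverwriteFiles", "Missing", "No"),
    ("RowDelta", "Missing", "No"),
]


def wiring_gaps(matrix):
    """Features iceberg-rust marks as 'Yes' that Comet does not use yet."""
    return [f for f, rust, comet in matrix if rust == "Yes" and comet != "Yes"]


def upstream_gaps(matrix):
    """Features missing or partial in iceberg-rust and absent in Comet."""
    return [f for f, rust, comet in matrix if rust != "Yes" and comet != "Yes"]


if __name__ == "__main__":
    print("Wire into Comet:", wiring_gaps(MATRIX))
    print("Needs iceberg-rust work first:", upstream_gaps(MATRIX))
```

Under this split, rows such as "Deletion vectors (V3)" land in the first group and rows such as "OverwriteFiles" land in the second.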
