mbutrovich commented on PR #4251:
URL: 
https://github.com/apache/datafusion-comet/pull/4251#issuecomment-4402219205

   ## Review notes for #4251
   
   Thanks for the revision, @jordepic. The deletes test is a strong end-to-end 
check (asserting `sumDataFileRecords == 1` after rewrite catches the case where 
positional or equality deletes silently fail to apply).
   
   A few things I'd love your take on.
   
   ### Reflection fall-back in `tasksFromTaskGroups`
   
   In `IcebergReflection.scala`, the new `tasksFromTaskGroups` wraps the outer 
`taskGroups()` lookup in `findMethodInHierarchy`, which returns `None` cleanly. 
But the inner per-group `group.getClass.getMethod("tasks").invoke(group)` isn't 
guarded, so any `NoSuchMethodException` or `InvocationTargetException` there 
would bubble out of `getTasks` instead of returning `None` and letting 
`CometScanRule` fall back to Spark via the existing "Failed to extract Iceberg 
metadata via reflection" path. Every sibling helper in this file (`getTable`, 
`tasksFromTasksAccessor`, `getFilterExpressions`, `getExpectedSchema`) honors 
that contract. Could we wrap the loop body so a future Iceberg reshuffle of 
`BaseScanTaskGroup` falls back rather than crashes the planner?
   
   ### Method caching in the loop
   
   Same function: `group.getClass.getMethod("tasks")` runs once per group, but 
every group is the same concrete class. `validateIcebergFileScanTasks` already 
caches its `Method` handles outside its iteration loop. Worth following the 
same pattern here to avoid N redundant lookups on large rewrite groups?
   
   ### Where do the new class-name constants live?
   
   `SPARK_STAGED_SCAN_CLASS` is a private val in `IcebergReflection`, and 
`ICEBERG_SCAN_CLASSES` plus `isIcebergScanClass` live on `CometScanRule`. The 
existing `IcebergReflection.ClassNames` block is already the home for Iceberg 
FQCNs, and the `SparkBatchQueryScan` literal at `CometScanRule.scala:307` 
(pre-existing) is the same string. Would consolidating all three under 
`ClassNames` (and exposing `isIcebergScanClass` from `IcebergReflection`) help 
avoid drift?
   
   ### The long comment on `ICEBERG_SCAN_CLASSES`
   
   The comment explains that metadata scans are gated because 
`getMetadataLocation` returns `None`, but it's attached to the `Set` in 
`CometScanRule` rather than to `getMetadataLocation` where the gating actually 
happens. As a future reader I think I'd look for it at the gate. Reasonable to 
move it (or shorten it to a one-liner pointing at the real site)?
   
   ### Plan assertions
   
   `assertReadsAreComet` and `assertOperator` substring-match 
`physicalPlanDescription`. Two small concerns: `assertOperator(plans, 
"CometExchange")` will silently match `CometColumnarExchange`, and the captured 
string is a `toString` rather than a real plan tree. Would attaching a 
`QueryExecutionListener` and doing `plan.collect { case _: 
CometIcebergNativeScanExec => }` give us exact-class assertions? Same for 
`"AppendData"`, which is a logical-plan node name appearing in the executedPlan 
toString by convention.
   
   The `RewritePositionDeleteFiles` negative test is great as-is: asserting "no 
Comet operator at all" is a much sharper pin than substring-matching individual 
operators.
   
   ### Spark 4.1 coverage
   
   The deletes test cancels (rather than fails) on the 
iceberg-spark-runtime-4.0 `NoSuchMethodError`. Reasonable given the upstream 
issue, but it does mean delete coverage is effectively spark-3.5 / 4.0 only. 
Worth a line in the PR description so reviewers don't assume the deletes path 
is exercised on every profile?
   
   ### Smaller things
   
   - `tasksFromTaskGroups` returns 
`Some(flat.asInstanceOf[java.util.List[_]])`. `ArrayList[_]` is already a 
`java.util.List[_]`, so the cast looks unneeded. Declaring `flat` as 
`ArrayList[AnyRef]` would also drop the 
`asInstanceOf[java.util.Collection[Any]]` rewrap on the inner `addAll`.
   - `icebergAvailable` and `withTempIcebergDir` are flagged in the diff as 
duplicated from `CometIcebergNativeSuite`. `withTempIcebergDir` is also close 
to `SQLTestUtils.withTempDir` that `CometTestBase` already inherits. Worth 
lifting both into a shared trait?
   - `captureSqlPlans` is generic SparkListener plan recording with no Iceberg 
specifics. There are similar `CometListenerBusUtils.waitUntilEmpty` patterns in 
`CometIcebergNativeSuite`. Promote to `CometTestBase`?
   - `runRewriteAction` and `runRewritePositionDeletes` are near-duplicates. 
One `runActionChain(actionMethod, configure, options)` could cover both.
   - `runRewriteTest` has five callbacks with three defaulting to no-ops. A 
small `RewriteTestCase` case class might read a bit cleaner.
   - Several `Class.forName(...)` calls in the suite could go through 
`IcebergReflection.loadClass`, which already handles the context classloader.
   - `mutable.ArrayBuffer` writes in `captureSqlPlans` are synchronized but the 
`.toSeq` read after `removeSparkListener` isn't. `CopyOnWriteArrayList` or 
wrapping the read would close the gap.
   - `numFiles = 4` and the `>= 4` assertion are unrelated magic numbers. A 
single named constant would link them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to