mbutrovich commented on PR #4251:
URL:
https://github.com/apache/datafusion-comet/pull/4251#issuecomment-4402219205
## Review notes for #4251
Thanks for the revision, @jordepic. The deletes test is a strong end-to-end
check (asserting `sumDataFileRecords == 1` after rewrite catches the case where
positional or equality deletes silently fail to apply).
A few things I'd love your take on.
### Reflection fall-back in `tasksFromTaskGroups`
In `IcebergReflection.scala`, the new `tasksFromTaskGroups` wraps the outer
`taskGroups()` lookup in `findMethodInHierarchy`, which returns `None` cleanly.
But the inner per-group `group.getClass.getMethod("tasks").invoke(group)` isn't
guarded, so any `NoSuchMethodException` or `InvocationTargetException` there
would bubble out of `getTasks` instead of returning `None` and letting
`CometScanRule` fall back to Spark via the existing "Failed to extract Iceberg
metadata via reflection" path. Every sibling helper in this file (`getTable`,
`tasksFromTasksAccessor`, `getFilterExpressions`, `getExpectedSchema`) honors
that contract. Could we wrap the loop body so a future Iceberg reshuffle of
`BaseScanTaskGroup` falls back rather than crashes the planner?
### Method caching in the loop
Same function: `group.getClass.getMethod("tasks")` runs once per group, but
every group is the same concrete class. `validateIcebergFileScanTasks` already
caches its `Method` handles outside its iteration loop. Worth following the
same pattern here to avoid N redundant lookups on large rewrite groups?
### Where do the new class-name constants live?
`SPARK_STAGED_SCAN_CLASS` is a private val in `IcebergReflection`, and
`ICEBERG_SCAN_CLASSES` plus `isIcebergScanClass` live on `CometScanRule`. The
existing `IcebergReflection.ClassNames` block is already the home for Iceberg
FQCNs, and the `SparkBatchQueryScan` literal at `CometScanRule.scala:307`
(pre-existing) is the same string. Would consolidating all three under
`ClassNames` (and exposing `isIcebergScanClass` from `IcebergReflection`) help
avoid drift?
### The long comment on `ICEBERG_SCAN_CLASSES`
The comment explains that metadata scans are gated because
`getMetadataLocation` returns `None`, but it's attached to the `Set` in
`CometScanRule` rather than to `getMetadataLocation` where the gating actually
happens. As a future reader I think I'd look for it at the gate. Reasonable to
move it (or shorten it to a one-liner pointing at the real site)?
### Plan assertions
`assertReadsAreComet` and `assertOperator` substring-match
`physicalPlanDescription`. Two small concerns: `assertOperator(plans,
"CometExchange")` will silently match `CometColumnarExchange`, and the captured
string is a `toString` rather than a real plan tree. Would attaching a
`QueryExecutionListener` and doing `plan.collect { case _:
CometIcebergNativeScanExec => }` give us exact-class assertions? Same for
`"AppendData"`, which is a logical-plan node name appearing in the executedPlan
toString by convention.
The `RewritePositionDeleteFiles` negative test is great as-is: asserting "no
Comet operator at all" is a much sharper pin than substring-matching individual
operators.
### Spark 4.1 coverage
The deletes test cancels (rather than fails) on the
iceberg-spark-runtime-4.0 `NoSuchMethodError`. Reasonable given the upstream
issue, but it does mean delete coverage is effectively spark-3.5 / 4.0 only.
Worth a line in the PR description so reviewers don't assume the deletes path
is exercised on every profile?
### Smaller things
- `tasksFromTaskGroups` returns
`Some(flat.asInstanceOf[java.util.List[_]])`. `ArrayList[_]` is already a
`java.util.List[_]`, so the cast looks unneeded. Declaring `flat` as
`ArrayList[AnyRef]` would also drop the
`asInstanceOf[java.util.Collection[Any]]` rewrap on the inner `addAll`.
- `icebergAvailable` and `withTempIcebergDir` are flagged in the diff as
duplicated from `CometIcebergNativeSuite`. `withTempIcebergDir` is also close
to `SQLTestUtils.withTempDir` that `CometTestBase` already inherits. Worth
lifting both into a shared trait?
- `captureSqlPlans` is generic SparkListener plan recording with no Iceberg
specifics. There are similar `CometListenerBusUtils.waitUntilEmpty` patterns in
`CometIcebergNativeSuite`. Promote to `CometTestBase`?
- `runRewriteAction` and `runRewritePositionDeletes` are near-duplicates.
One `runActionChain(actionMethod, configure, options)` could cover both.
- `runRewriteTest` has five callbacks with three defaulting to no-ops. A
small `RewriteTestCase` case class might read a bit cleaner.
- Several `Class.forName(...)` calls in the suite could go through
`IcebergReflection.loadClass`, which already handles the context classloader.
- `mutable.ArrayBuffer` writes in `captureSqlPlans` are synchronized but the
`.toSeq` read after `removeSparkListener` isn't. `CopyOnWriteArrayList` or
wrapping the read would close the gap.
- `numFiles = 4` and the `>= 4` assertion are unrelated magic numbers. A
single named constant would link them.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]