Re: [PR] chore: add assertion that not using comet scan but using native scan [datafusion-comet]

via GitHub Wed, 28 May 2025 15:51:32 -0700


andygrove commented on PR #1793:
URL: https://github.com/apache/datafusion-comet/pull/1793#issuecomment-2917789948


   > I still think there is a bug here:
   > 
   > For this test (when running on main):
   > 
   > ```scala
   > test("debug datafusion native filter") {
   >   val schema = StructType(
   >     Seq(
   >       StructField("row_idx", IntegerType, nullable = false),
   >       StructField("int", IntegerType, nullable = false)))
   > 
   >   val data = DataGenerator.DEFAULT.generateRows(1000, schema)
   > 
   >   withSQLConf(
   >     CometConf.COMET_EXPLAIN_VERBOSE_ENABLED.key -> "true",
   >     CometConf.COMET_EXPLAIN_NATIVE_ENABLED.key -> "true",
   >     CometConf.COMET_SPARK_TO_ARROW_SUPPORTED_OPERATOR_LIST.key -> "RDDScan") {
   >     val df = spark
   >       .createDataFrame(spark.sparkContext.parallelize(data, 1), schema)
   >       .where(col("row_idx") < 10000 || col("row_idx") > 10010)
   > 
   >     df.explain(true)
   >     df
   >       .show()
   >   }
   > }
   > ```
   > 
   > The spark plan is:
   > 
   > ```
   > == Parsed Logical Plan ==
   > 'Filter (('row_idx < 10000) OR ('row_idx > 10010))
   > +- LogicalRDD [row_idx#2, int#3], false
   > 
   > == Analyzed Logical Plan ==
   > row_idx: int, int: int
   > Filter ((row_idx#2 < 10000) OR (row_idx#2 > 10010))
   > +- LogicalRDD [row_idx#2, int#3], false
   > 
   > == Optimized Logical Plan ==
   > Filter ((row_idx#2 < 10000) OR (row_idx#2 > 10010))
   > +- LogicalRDD [row_idx#2, int#3], false
   > 
   > == Physical Plan ==
   > *(2) CometColumnarToRow
   > +- CometFilter [row_idx#2, int#3], ((row_idx#2 < 10000) OR (row_idx#2 > 10010))
   >    +- CometSparkRowToColumnar
   >       +- *(1) Scan ExistingRDD[row_idx#2,int#3]
   > ```
   > 
   > and the datafusion plan is:
   > 
   > ```
   > 25/05/28 19:17:14 INFO core/src/execution/jni_api.rs: Comet native query plan:
   > FilterExec: col_0@0 < 10000 OR col_0@0 > 10010
   >   ScanExec: source=[CometSparkRowToColumnar (unknown)], schema=[col_0: Int32, col_1: Int32]
   > ```
   > 
   > It is using DataFusion Filter and not CometFilter while it should use comet filter as there is reuse, no?
   
   In this example, Spark (not Comet) is performing the scan. Comet then performs the row-to-columnar conversion. The `native_comet` scan is not being used, so there is no need to use Comet's filter.
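
   The reasoning above can be sketched in plain Scala. This is only an illustrative model, not Comet's code: `PlanNode`, `Scan`, `RowToColumnar`, `Filter`, `usesScan`, and `filterImpl` are hypothetical names, and the sketch assumes the rule is simply "Comet's filter is needed only when a `native_comet` scan feeds the filter".

   ```scala
   object FilterChoiceSketch {
     // Hypothetical, simplified plan model. These are NOT Comet or Spark classes.
     sealed trait PlanNode { def children: Seq[PlanNode] }
     case class Scan(impl: String) extends PlanNode {
       val children: Seq[PlanNode] = Nil
     }
     case class RowToColumnar(child: PlanNode) extends PlanNode {
       val children: Seq[PlanNode] = Seq(child)
     }
     case class Filter(child: PlanNode) extends PlanNode {
       val children: Seq[PlanNode] = Seq(child)
     }

     // True if `node` or any of its descendants is a scan of the given kind.
     def usesScan(node: PlanNode, impl: String): Boolean = node match {
       case Scan(i) => i == impl
       case other   => other.children.exists(usesScan(_, impl))
     }

     // Assumed rule (for illustration only): pick Comet's own filter only
     // when the filter's input comes from the native_comet scan.
     def filterImpl(filter: Filter): String =
       if (usesScan(filter.child, "native_comet")) "comet filter"
       else "datafusion filter"
   }
   ```

   With this model, the plan from the test, `Filter(RowToColumnar(Scan("spark_rdd")))`, selects the DataFusion filter because no `native_comet` scan appears below the filter. The scan label `"spark_rdd"` is likewise just a placeholder for Spark's `Scan ExistingRDD`.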


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

