Re: [PR] docs: Add a plugin overview page to the contributors guide [datafusion-comet]

2024-05-01 Thread via GitHub


parthchandra commented on code in PR #345:
URL: https://github.com/apache/datafusion-comet/pull/345#discussion_r1586638726


##
docs/source/contributor-guide/plugin_overview.md:
##
@@ -0,0 +1,50 @@
+
+
+# Comet Plugin Overview
+
+The entry point to Comet is the `org.apache.comet.CometSparkSessionExtensions` 
class, which can be registered with Spark by adding the following setting to 
the Spark configuration when launching `spark-shell` or `spark-submit`:
+
+```
+--conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions
+```
+
+On initialization, this class registers two physical plan optimization rules 
with Spark: `CometScanRule` and `CometExecRule`. These rules run whenever a 
query stage is being planned.
+
+## CometScanRule
+
+`CometScanRule` replaces any Parquet scans with Comet Parquet scan classes.
+
+When the V1 data source API is being used, `FileSourceScanExec` is replaced 
with `CometScanExec`.
+
+When the V2 data source API is being used, `BatchScanExec` is replaced with 
`CometBatchScanExec`.
+
+## CometExecRule
+
+`CometExecRule` attempts to transform a Spark physical plan into a Comet plan.
+
+This rule traverses bottom-up from the original Spark plan and attempts to 
replace each node with a Comet equivalent. For example, a `ProjectExec` will be 
replaced by `CometProjectExec`.
+
+When replacing a node, various checks are performed to determine if Comet can 
support the operator and its expressions. If an operator or expression is not 
supported by Comet then the reason will be stored in a tag on the underlying 
Spark node. Running `explain` on a query will show any reasons that prevented 
the plan from being executed natively in Comet. If any part of the plan is not 
supported in Comet then the original Spark plan will be returned.
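A minimal model of the rule described in the hunk above (illustrative Python, not Comet's actual Scala implementation; the operator table, the `PlanNode` shape, and the `comet.unsupported` tag key are all hypothetical stand-ins):

```python
from dataclasses import dataclass, field

# Illustrative model only -- not Comet's actual Scala implementation.
# Operator names mirror the doc; the tag key "comet.unsupported" is made up.
SUPPORTED = {"ProjectExec": "CometProjectExec", "FilterExec": "CometFilterExec"}

@dataclass
class PlanNode:
    name: str
    children: list = field(default_factory=list)
    tags: dict = field(default_factory=dict)

def transform(node: PlanNode) -> PlanNode:
    """Bottom-up: transform children first, then try to replace this node."""
    children = [transform(c) for c in node.children]
    replacement = SUPPORTED.get(node.name)
    if replacement is None:
        # Store the reason in a tag on (a copy of) the Spark node.
        tags = {**node.tags, "comet.unsupported": f"{node.name} is not supported"}
        return PlanNode(node.name, children, tags)
    return PlanNode(replacement, children, dict(node.tags))

def comet_exec_rule(plan: PlanNode):
    """All-or-nothing: keep the original plan if any node was not replaced."""
    candidate = transform(plan)
    reasons = []
    def collect(n: PlanNode) -> None:
        if "comet.unsupported" in n.tags:
            reasons.append(n.tags["comet.unsupported"])
        for c in n.children:
            collect(c)
    collect(candidate)
    return (plan, reasons) if reasons else (candidate, [])
```

The all-or-nothing check mirrors the documented behavior: the Comet candidate is used only when every node was replaced; otherwise the original Spark plan is returned, with the recorded reasons available for explain-style reporting.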

Review Comment:
   Sorry, just seeing this today. The `explain` information will show up in the 
UI or `EXPLAIN` output only from Spark 4.0.0 onwards, as the 
`ExtendedExplainGenerator` trait was only added in Spark 4.0. Internally, we 
can always call `ExtendedExplainInfo.generateExtendedInfo(plan)`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] docs: Add a plugin overview page to the contributors guide [datafusion-comet]

2024-04-30 Thread via GitHub


sunchao commented on code in PR #345:
URL: https://github.com/apache/datafusion-comet/pull/345#discussion_r1585863201


##
docs/source/contributor-guide/plugin_overview.md:
##
@@ -0,0 +1,50 @@
+
+
+# Comet Plugin Overview
+
+The entry point to Comet is the `org.apache.comet.CometSparkSessionExtensions` 
class, which can be registered with Spark by adding the following setting to 
the Spark configuration when launching `spark-shell` or `spark-submit`:
+
+```
+--conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions
+```
+
+On initialization, this class registers two physical plan optimization rules 
with Spark: `CometScanRule` and `CometExecRule`. These rules run whenever a 
query stage is being planned.
+
+## CometScanRule
+
+`CometScanRule` replaces any Parquet scans with Comet Parquet scan classes.
+
+When the V1 data source API is being used, `FileSourceScanExec` is replaced 
with `CometScanExec`.
+
+When the V2 data source API is being used, `BatchScanExec` is replaced with 
`CometBatchScanExec`.
+
+## CometExecRule
+
+`CometExecRule` attempts to transform a Spark physical plan into a Comet plan.
+
+This rule traverses bottom-up from the original Spark plan and attempts to 
replace each node with a Comet equivalent. For example, a `ProjectExec` will be 
replaced by `CometProjectExec`.
+
+When replacing a node, various checks are performed to determine if Comet can 
support the operator and its expressions. If an operator or expression is not 
supported by Comet then the reason will be stored in a tag on the underlying 
Spark node. Running `explain` on a query will show any reasons that prevented 
the plan from being executed natively in Comet. If any part of the plan is not 
supported in Comet then the original Spark plan will be returned.
+
+Comet does not support partially replacing subsets of the plan because this 
would involve adding transitions to convert between row-based and columnar data 
between Spark operators and Comet operators and the overhead of this could 
outweigh the benefits of running parts of the plan natively in Comet.
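The trade-off in the paragraph above can be made concrete with a toy cost model (illustrative Python; the tuple-based node shape and operator names are hypothetical): every parent/child edge that crosses between a row-based Spark operator and a columnar Comet operator would need a conversion, so a partially replaced plan pays per-boundary transitions that an all-or-nothing plan avoids.

```python
def count_transitions(plan):
    """Count parent/child edges where execution would cross between
    row-based Spark operators and columnar Comet operators.
    Nodes are (name, is_comet, children) tuples; purely illustrative."""
    name, is_comet, children = plan
    total = 0
    for child in children:
        _, child_is_comet, _ = child
        total += (child_is_comet != is_comet) + count_transitions(child)
    return total

# A hypothetical partially replaced plan: a Comet project over a Spark sort
# over a Comet scan would need two row<->columnar conversions.
partial = ("CometProjectExec", True,
           [("SortExec", False,
             [("CometScanExec", True, [])])])
assert count_transitions(partial) == 2

# A fully replaced (or fully Spark) plan needs none within the stage.
full = ("CometProjectExec", True,
        [("CometSortExec", True,
          [("CometScanExec", True, [])])])
assert count_transitions(full) == 0
```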

Review Comment:
   Looks good. Thanks





Re: [PR] docs: Add a plugin overview page to the contributors guide [datafusion-comet]

2024-04-30 Thread via GitHub


sunchao merged PR #345:
URL: https://github.com/apache/datafusion-comet/pull/345




Re: [PR] docs: Add a plugin overview page to the contributors guide [datafusion-comet]

2024-04-30 Thread via GitHub


andygrove commented on code in PR #345:
URL: https://github.com/apache/datafusion-comet/pull/345#discussion_r1585441419


##
docs/source/contributor-guide/plugin_overview.md:
##
@@ -0,0 +1,50 @@
+
+
+# Comet Plugin Overview
+
+The entry point to Comet is the `org.apache.comet.CometSparkSessionExtensions` 
class, which can be registered with Spark by adding the following setting to 
the Spark configuration when launching `spark-shell` or `spark-submit`:
+
+```
+--conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions
+```
+
+On initialization, this class registers two physical plan optimization rules 
with Spark: `CometScanRule` and `CometExecRule`. These rules run whenever a 
query stage is being planned.
+
+## CometScanRule
+
+`CometScanRule` replaces any Parquet scans with Comet Parquet scan classes.
+
+When the V1 data source API is being used, `FileSourceScanExec` is replaced 
with `CometScanExec`.
+
+When the V2 data source API is being used, `BatchScanExec` is replaced with 
`CometBatchScanExec`.
+
+## CometExecRule
+
+`CometExecRule` attempts to transform a Spark physical plan into a Comet plan.
+
+This rule traverses bottom-up from the original Spark plan and attempts to 
replace each node with a Comet equivalent. For example, a `ProjectExec` will be 
replaced by `CometProjectExec`.
+
+When replacing a node, various checks are performed to determine if Comet can 
support the operator and its expressions. If an operator or expression is not 
supported by Comet then the reason will be stored in a tag on the underlying 
Spark node. Running `explain` on a query will show any reasons that prevented 
the plan from being executed natively in Comet. If any part of the plan is not 
supported in Comet then the original Spark plan will be returned.

Review Comment:
I removed the reference to `explain` here. I will add something in the 
future once I understand this part better.





Re: [PR] docs: Add a plugin overview page to the contributors guide [datafusion-comet]

2024-04-30 Thread via GitHub


andygrove commented on code in PR #345:
URL: https://github.com/apache/datafusion-comet/pull/345#discussion_r1585441776


##
docs/source/contributor-guide/plugin_overview.md:
##
@@ -0,0 +1,50 @@
+
+
+# Comet Plugin Overview
+
+The entry point to Comet is the `org.apache.comet.CometSparkSessionExtensions` 
class, which can be registered with Spark by adding the following setting to 
the Spark configuration when launching `spark-shell` or `spark-submit`:
+
+```
+--conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions
+```
+
+On initialization, this class registers two physical plan optimization rules 
with Spark: `CometScanRule` and `CometExecRule`. These rules run whenever a 
query stage is being planned.
+
+## CometScanRule
+
+`CometScanRule` replaces any Parquet scans with Comet Parquet scan classes.
+
+When the V1 data source API is being used, `FileSourceScanExec` is replaced 
with `CometScanExec`.
+
+When the V2 data source API is being used, `BatchScanExec` is replaced with 
`CometBatchScanExec`.
+
+## CometExecRule
+
+`CometExecRule` attempts to transform a Spark physical plan into a Comet plan.
+
+This rule traverses bottom-up from the original Spark plan and attempts to 
replace each node with a Comet equivalent. For example, a `ProjectExec` will be 
replaced by `CometProjectExec`.
+
+When replacing a node, various checks are performed to determine if Comet can 
support the operator and its expressions. If an operator or expression is not 
supported by Comet then the reason will be stored in a tag on the underlying 
Spark node. Running `explain` on a query will show any reasons that prevented 
the plan from being executed natively in Comet. If any part of the plan is not 
supported in Comet then the original Spark plan will be returned.
+
+Comet does not support partially replacing subsets of the plan because this 
would involve adding transitions to convert between row-based and columnar data 
between Spark operators and Comet operators and the overhead of this could 
outweigh the benefits of running parts of the plan natively in Comet.

Review Comment:
   @sunchao Thanks. I have updated this. Let me know what you think.





Re: [PR] docs: Add a plugin overview page to the contributors guide [datafusion-comet]

2024-04-29 Thread via GitHub


sunchao commented on code in PR #345:
URL: https://github.com/apache/datafusion-comet/pull/345#discussion_r1584106029


##
docs/source/contributor-guide/plugin_overview.md:
##
@@ -0,0 +1,50 @@
+
+
+# Comet Plugin Overview
+
+The entry point to Comet is the `org.apache.comet.CometSparkSessionExtensions` 
class, which can be registered with Spark by adding the following setting to 
the Spark configuration when launching `spark-shell` or `spark-submit`:
+
+```
+--conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions
+```
+
+On initialization, this class registers two physical plan optimization rules 
with Spark: `CometScanRule` and `CometExecRule`. These rules run whenever a 
query stage is being planned.
+
+## CometScanRule
+
+`CometScanRule` replaces any Parquet scans with Comet Parquet scan classes.
+
+When the V1 data source API is being used, `FileSourceScanExec` is replaced 
with `CometScanExec`.
+
+When the V2 data source API is being used, `BatchScanExec` is replaced with 
`CometBatchScanExec`.
+
+## CometExecRule
+
+`CometExecRule` attempts to transform a Spark physical plan into a Comet plan.
+
+This rule traverses bottom-up from the original Spark plan and attempts to 
replace each node with a Comet equivalent. For example, a `ProjectExec` will be 
replaced by `CometProjectExec`.
+
+When replacing a node, various checks are performed to determine if Comet can 
support the operator and its expressions. If an operator or expression is not 
supported by Comet then the reason will be stored in a tag on the underlying 
Spark node. Running `explain` on a query will show any reasons that prevented 
the plan from being executed natively in Comet. If any part of the plan is not 
supported in Comet then the original Spark plan will be returned.
+
+Comet does not support partially replacing subsets of the plan because this 
would involve adding transitions to convert between row-based and columnar data 
between Spark operators and Comet operators and the overhead of this could 
outweigh the benefits of running parts of the plan natively in Comet.

Review Comment:
   nit: maybe we should mention that this is within a Spark stage, rather than 
the whole Spark plan?





Re: [PR] docs: Add a plugin overview page to the contributors guide [datafusion-comet]

2024-04-29 Thread via GitHub


viirya commented on code in PR #345:
URL: https://github.com/apache/datafusion-comet/pull/345#discussion_r1583377840


##
docs/source/contributor-guide/plugin_overview.md:
##
@@ -0,0 +1,50 @@
+
+
+# Comet Plugin Overview
+
+The entry point to Comet is the `org.apache.comet.CometSparkSessionExtensions` 
class, which can be registered with Spark by adding the following setting to 
the Spark configuration when launching `spark-shell` or `spark-submit`:
+
+```
+--conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions
+```
+
+On initialization, this class registers two physical plan optimization rules 
with Spark: `CometScanRule` and `CometExecRule`. These rules run whenever a 
query stage is being planned.
+
+## CometScanRule
+
+`CometScanRule` replaces any Parquet scans with Comet Parquet scan classes.
+
+When the V1 data source API is being used, `FileSourceScanExec` is replaced 
with `CometScanExec`.
+
+When the V2 data source API is being used, `BatchScanExec` is replaced with 
`CometBatchScanExec`.
+
+## CometExecRule
+
+`CometExecRule` attempts to transform a Spark physical plan into a Comet plan.
+
+This rule traverses bottom-up from the original Spark plan and attempts to 
replace each node with a Comet equivalent. For example, a `ProjectExec` will be 
replaced by `CometProjectExec`.
+
+When replacing a node, various checks are performed to determine if Comet can 
support the operator and its expressions. If an operator or expression is not 
supported by Comet then the reason will be stored in a tag on the underlying 
Spark node. Running `explain` on a query will show any reasons that prevented 
the plan from being executed natively in Comet. If any part of the plan is not 
supported in Comet then the original Spark plan will be returned.

Review Comment:
   I've not tried, but don't we need to do anything to get the unsupported info 
from `explain`? For example, a specific version of Spark, or any code snippet 
before/after `explain`?
   
   cc @parthchandra 




