Re: [PR] docs: Add a plugin overview page to the contributors guide [datafusion-comet]
parthchandra commented on code in PR #345: URL: https://github.com/apache/datafusion-comet/pull/345#discussion_r1586638726 ## docs/source/contributor-guide/plugin_overview.md: ## @@ -0,0 +1,50 @@ + + +# Comet Plugin Overview + +The entry point to Comet is the `org.apache.comet.CometSparkSessionExtensions` class, which can be registered with Spark by adding the following setting to the Spark configuration when launching `spark-shell` or `spark-submit`: + +``` +--conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions +``` + +On initialization, this class registers two physical plan optimization rules with Spark: `CometScanRule` and `CometExecRule`. These rules run whenever a query stage is being planned. + +## CometScanRule + +`CometScanRule` replaces any Parquet scans with Comet Parquet scan classes. + +When the V1 data source API is being used, `FileSourceScanExec` is replaced with `CometScanExec`. + +When the V2 data source API is being used, `BatchScanExec` is replaced with `CometBatchScanExec`. + +## CometExecRule + +`CometExecRule` attempts to transform a Spark physical plan into a Comet plan. + +This rule traverses bottom-up from the original Spark plan and attempts to replace each node with a Comet equivalent. For example, a `ProjectExec` will be replaced by `CometProjectExec`. + +When replacing a node, various checks are performed to determine if Comet can support the operator and its expressions. If an operator or expression is not supported by Comet then the reason will be stored in a tag on the underlying Spark node. Running `explain` on a query will show any reasons that prevented the plan from being executed natively in Comet. If any part of the plan is not supported in Comet then the original Spark plan will be returned. Review Comment: Sorry, just seeing this today. The `explain` information will show up in the UI or `EXPLAIN` output only from Spark 4.0.0 onwards as the `ExtendedExplainGenerator` trait was only added in Spark 4.0. 
Internally, we can always call `ExtendedExplainInfo.generateExtendedInfo(plan)` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
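The tagging mechanism discussed in this thread (an unsupported operator's reason is stored in a tag on the Spark plan node, and an extended-explain hook reports the collected reasons) can be sketched as a toy model. This is plain Python, not Comet's actual Scala API; the class, tag key, and function names below are invented for illustration:

```python
# Toy model of the fallback-reason tagging described in the quoted doc.
# Real Comet stores reasons in a TreeNode tag on the Spark plan node and
# surfaces them through Spark's ExtendedExplainGenerator (Spark 4.0+);
# all names here are illustrative, not Comet's real identifiers.

class PlanNode:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.tags = {}  # stand-in for Spark's TreeNode tags

SUPPORTED = {"ProjectExec", "FilterExec"}  # hypothetical support list

def tag_unsupported(node):
    """Walk the plan and record a reason on each unsupported node."""
    for child in node.children:
        tag_unsupported(child)
    if node.name not in SUPPORTED:
        node.tags["comet.fallback.reason"] = f"{node.name} is not supported"

def collect_reasons(node):
    """What an extended-explain hook would report for this plan."""
    reasons = []
    for child in node.children:
        reasons.extend(collect_reasons(child))
    if "comet.fallback.reason" in node.tags:
        reasons.append(node.tags["comet.fallback.reason"])
    return reasons

plan = PlanNode("ProjectExec", [PlanNode("SortExec", [PlanNode("FilterExec")])])
tag_unsupported(plan)
print(collect_reasons(plan))  # ['SortExec is not supported']
```

The point of the sketch is that tagging and reporting are separate passes, which is why (per the comment above) the reasons exist internally on any Spark version even though the `EXPLAIN` integration needs the `ExtendedExplainGenerator` trait from Spark 4.0.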
Re: [PR] docs: Add a plugin overview page to the contributors guide [datafusion-comet]
sunchao commented on code in PR #345: URL: https://github.com/apache/datafusion-comet/pull/345#discussion_r1585863201 ## docs/source/contributor-guide/plugin_overview.md: ## @@ -0,0 +1,50 @@
+ +Comet does not support partially replacing subsets of the plan because this would involve adding transitions to convert between row-based and columnar data between Spark operators and Comet operators and the overhead of this could outweigh the benefits of running parts of the plan natively in Comet. Review Comment: Looks good. Thanks
Re: [PR] docs: Add a plugin overview page to the contributors guide [datafusion-comet]
sunchao merged PR #345: URL: https://github.com/apache/datafusion-comet/pull/345
Re: [PR] docs: Add a plugin overview page to the contributors guide [datafusion-comet]
andygrove commented on code in PR #345: URL: https://github.com/apache/datafusion-comet/pull/345#discussion_r1585441419 ## docs/source/contributor-guide/plugin_overview.md: ## @@ -0,0 +1,50 @@ + +When replacing a node, various checks are performed to determine if Comet can support the operator and its expressions. If an operator or expression is not supported by Comet then the reason will be stored in a tag on the underlying Spark node. Running `explain` on a query will show any reasons that prevented the plan from being executed natively in Comet. If any part of the plan is not supported in Comet then the original Spark plan will be returned. Review Comment: I removed the reference to `explain` here. I will add something in the future when I understand this part more.
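The scan replacement described in the quoted `CometScanRule` section is essentially a one-to-one substitution of scan operators. A toy lookup makes the idea concrete (Python stand-in for the Scala rule; the real rule pattern-matches Spark plan classes and also verifies the scan is a supported Parquet scan before replacing it):

```python
# Toy sketch of CometScanRule's substitution, per the quoted doc:
# V1 scans (FileSourceScanExec) become CometScanExec, and
# V2 scans (BatchScanExec) become CometBatchScanExec.
# Operators are plain strings here purely for illustration.

SCAN_REPLACEMENTS = {
    "FileSourceScanExec": "CometScanExec",   # V1 data source API
    "BatchScanExec": "CometBatchScanExec",   # V2 data source API
}

def replace_scans(plan_nodes):
    """Replace known scan operators; leave everything else untouched."""
    return [SCAN_REPLACEMENTS.get(node, node) for node in plan_nodes]

print(replace_scans(["ProjectExec", "FileSourceScanExec"]))
# ['ProjectExec', 'CometScanExec']
```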
Re: [PR] docs: Add a plugin overview page to the contributors guide [datafusion-comet]
andygrove commented on code in PR #345: URL: https://github.com/apache/datafusion-comet/pull/345#discussion_r1585441776 ## docs/source/contributor-guide/plugin_overview.md: ## @@ -0,0 +1,50 @@
+ +Comet does not support partially replacing subsets of the plan because this would involve adding transitions to convert between row-based and columnar data between Spark operators and Comet operators and the overhead of this could outweigh the benefits of running parts of the plan natively in Comet. Review Comment: @sunchao Thanks. I have updated this. Let me know what you think.
Re: [PR] docs: Add a plugin overview page to the contributors guide [datafusion-comet]
sunchao commented on code in PR #345: URL: https://github.com/apache/datafusion-comet/pull/345#discussion_r1584106029 ## docs/source/contributor-guide/plugin_overview.md: ## @@ -0,0 +1,50 @@
+ +Comet does not support partially replacing subsets of the plan because this would involve adding transitions to convert between row-based and columnar data between Spark operators and Comet operators and the overhead of this could outweigh the benefits of running parts of the plan natively in Comet. Review Comment: nit: maybe we should mention this applies within a Spark stage, rather than to the whole Spark plan?
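The all-or-nothing behavior under discussion here (scoped per query stage, as the review comment suggests) can be sketched as a toy bottom-up rewrite. This is a Python stand-in for the Scala rule; plans are nested tuples and the equivalence table is hypothetical:

```python
# Toy sketch of CometExecRule within one query stage: try to replace every
# node bottom-up with its Comet equivalent; if any node has none, keep the
# original Spark plan for the whole stage rather than inserting
# row<->columnar transitions between Spark and Comet operators.

COMET_EQUIVALENT = {  # illustrative subset, not Comet's real coverage
    "ProjectExec": "CometProjectExec",
    "FilterExec": "CometFilterExec",
}

def transform_stage(plan):
    """plan: nested tuples (name, children). Returns (new_plan, fully_native)."""
    name, children = plan
    all_native = name in COMET_EQUIVALENT
    new_children = []
    for child in children:
        new_child, native = transform_stage(child)
        new_children.append(new_child)
        all_native = all_native and native
    if not all_native:
        return plan, False  # fall back to the original Spark plan for the stage
    return (COMET_EQUIVALENT[name], new_children), True

stage = ("ProjectExec", [("FilterExec", [])])
print(transform_stage(stage))
# (('CometProjectExec', [('CometFilterExec', [])]), True)
```

Note how a single unsupported node anywhere in the stage makes `transform_stage` return the untouched original plan, which is the "no partial replacement" design the quoted paragraph motivates.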
Re: [PR] docs: Add a plugin overview page to the contributors guide [datafusion-comet]
viirya commented on code in PR #345: URL: https://github.com/apache/datafusion-comet/pull/345#discussion_r1583377840 ## docs/source/contributor-guide/plugin_overview.md: ## @@ -0,0 +1,50 @@ + +When replacing a node, various checks are performed to determine if Comet can support the operator and its expressions. If an operator or expression is not supported by Comet then the reason will be stored in a tag on the underlying Spark node. Running `explain` on a query will show any reasons that prevented the plan from being executed natively in Comet. If any part of the plan is not supported in Comet then the original Spark plan will be returned. Review Comment: I've not tried, but don't we need to do anything to get the unsupported info from `explain`? For example, a specific version of Spark, or any code snippet before/after `explain`?
cc @parthchandra