Re: [PR] [GLUTEN-9004][DOC] Adding doc for Partial Projection [incubator-gluten]

via GitHub Tue, 19 Aug 2025 11:48:04 -0700


steveburnett commented on code in PR #10135:
URL: https://github.com/apache/incubator-gluten/pull/10135#discussion_r2286036483



##########
docs/developers/PartialProject.md:
##########
@@ -0,0 +1,30 @@
+---
+layout: page
+title: PartialProject
+nav_order: 15
+parent: Developer Overview
+---
+
+# Partial Projection Support
+
+In Gluten, there is still a gap in supporting all Spark expressions natively 
(e.g., some json functions or Java UDFs). In this case, Gluten will choose the 
JVM code path to run the expressions, which can introduce performance 
regressions.

Review Comment:
   ```suggestion
   In Gluten, there is still a gap in supporting all Spark expressions natively 
(e.g., some JSON functions or Java UDFs). In this case, Gluten will choose the 
JVM code path to run the expressions, which can introduce performance 
regressions.
   ```



##########
docs/developers/PartialProject.md:
##########
@@ -0,0 +1,30 @@
+---
+layout: page
+title: PartialProject
+nav_order: 15
+parent: Developer Overview
+---
+
+# Partial Projection Support
+
+In Gluten, there is still a gap in supporting all Spark expressions natively 
(e.g., some json functions or Java UDFs). In this case, Gluten will choose the 
JVM code path to run the expressions, which can introduce performance 
regressions.
+
+Partial projections were added to improve performance in these cases. It 
allows Gluten to minimal data copy between JVM and C++, thus there is no big 
performance regression. 
+
+
+## Detailed Implementations
+
+### Adding Partial Projection for UDF
+
+For example, with the expression `hash(udf(col0)), col1, col2, col3, col4`, 
partial projection allows us to convert only `col0` to row or column to Arrow 
as input, and convert `udf(col0)` as an alias `partialProject1_`. Then, 
ProjectExecTransformer will handle `hash(partialProject1_), col1, col2, col3, 
col4, partialProject1_`. This feature saves the cost of converting the columnar 
format to row format and vice-versa.

Review Comment:
   No change suggested, just noting that I like this example as it's easy to 
understand.
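
For readers of this thread, the rewrite described in the quoted paragraph can be sketched as a small standalone model. This is only an illustration under stated assumptions, not Gluten's implementation: the `Col`, `Call`, and `split_projection` names are hypothetical, the set of natively unsupported functions is passed in by hand, and the alias counter simply mirrors the `partialProject1_` name used in the doc.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple, Union


@dataclass(frozen=True)
class Col:
    """A reference to an input column (hypothetical model type)."""
    name: str


@dataclass(frozen=True)
class Call:
    """A function call over sub-expressions (hypothetical model type)."""
    fn: str
    args: Tuple["Expr", ...]


Expr = Union[Col, Call]


def show(e: Expr) -> str:
    """Render an expression the way the doc writes it, e.g. hash(udf(col0))."""
    if isinstance(e, Col):
        return e.name
    return f"{e.fn}({', '.join(show(a) for a in e.args)})"


def input_columns(e: Expr) -> Set[str]:
    """Collect the column names an expression reads."""
    if isinstance(e, Col):
        return {e.name}
    cols: Set[str] = set()
    for a in e.args:
        cols |= input_columns(a)
    return cols


def split_projection(exprs: List[Expr], unsupported: Set[str]):
    """Split a project list into a JVM part and a native part.

    Each outermost call to an unsupported function is pulled out and
    replaced by an alias column; everything else stays native.
    """
    extracted: Dict[Expr, str] = {}  # extracted subtree -> alias

    def rewrite(e: Expr) -> Expr:
        if isinstance(e, Col):
            return e
        if e.fn in unsupported:
            # Assumed alias naming, following the doc's example.
            alias = extracted.setdefault(e, f"partialProject{len(extracted) + 1}_")
            return Col(alias)
        return Call(e.fn, tuple(rewrite(a) for a in e.args))

    native = [show(rewrite(e)) for e in exprs]
    jvm = {alias: show(sub) for sub, alias in extracted.items()}
    # Only these columns have to cross the native/JVM boundary.
    to_convert: Set[str] = set()
    for sub in extracted:
        to_convert |= input_columns(sub)
    return native, jvm, sorted(to_convert)


project_list = [
    Call("hash", (Call("udf", (Col("col0"),)),)),
    Col("col1"), Col("col2"), Col("col3"), Col("col4"),
]
native, jvm, to_convert = split_projection(project_list, {"udf"})
print(native)
print(jvm)
print(to_convert)
```

In this model only `col0` is handed to the JVM side, while `col1` through `col4` never leave the columnar format, which is the saving the quoted paragraph describes. Passing `{"from_json"}` as the unsupported set with `substr(from_json(col_a))` exercises the second case from the doc in the same way.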



##########
docs/developers/PartialProject.md:
##########
@@ -0,0 +1,30 @@
+---
+layout: page
+title: PartialProject
+nav_order: 15
+parent: Developer Overview
+---
+
+# Partial Projection Support
+
+In Gluten, there is still a gap in supporting all Spark expressions natively 
(e.g., some json functions or Java UDFs). In this case, Gluten will choose the 
JVM code path to run the expressions, which can introduce performance 
regressions.
+
+Partial projections were added to improve performance in these cases. It 
allows Gluten to minimal data copy between JVM and C++, thus there is no big 
performance regression. 
+
+
+## Detailed Implementations
+
+### Adding Partial Projection for UDF
+
+For example, with the expression `hash(udf(col0)), col1, col2, col3, col4`, 
partial projection allows us to convert only `col0` to row or column to Arrow 
as input, and convert `udf(col0)` as an alias `partialProject1_`. Then, 
ProjectExecTransformer will handle `hash(partialProject1_), col1, col2, col3, 
col4, partialProject1_`. This feature saves the cost of converting the columnar 
format to row format and vice-versa.
+
+
+## Adding Partial Projection for Unsupported Expressions
+
+The partial projection feature can also benefit from expressions that are not 
natively supported. For example, `substr(from_json(col_a))`. Since from_json is 
not fully supported, Gluten may use the JVM code path. Instead of projecting 
the whole expression, partial projection will attempt to project `from_json()` 
and perform a native projection of `substr()`.

Review Comment:
   ```suggestion
   The partial projection feature can also benefit from expressions that are 
not natively supported. For example, `substr(from_json(col_a))`. Since 
`from_json` is not fully supported, Gluten may use the JVM code path. Instead 
of projecting the whole expression, partial projection will attempt to project 
`from_json()` and perform a native projection of `substr()`.
   ```



##########
docs/developers/PartialProject.md:
##########
@@ -0,0 +1,30 @@
+---
+layout: page
+title: PartialProject
+nav_order: 15
+parent: Developer Overview
+---
+
+# Partial Projection Support
+
+In Gluten, there is still a gap in supporting all Spark expressions natively 
(e.g., some json functions or Java UDFs). In this case, Gluten will choose the 
JVM code path to run the expressions, which can introduce performance 
regressions.
+
+Partial projections were added to improve performance in these cases. It 
allows Gluten to minimal data copy between JVM and C++, thus there is no big 
performance regression. 

Review Comment:
   ```suggestion
   Partial projections, which allow Gluten to minimize data copying between the 
JVM and C++, were added to avoid these performance regressions.
   ```
   Suggestion made primarily for conciseness. Let me know if my suggestion 
changes the meaning in a way that is incorrect!


