Github user sarutak commented on the pull request:
https://github.com/apache/spark/pull/11301#issuecomment-204257179
@kiszk Now I'm inspecting this patch and I noticed some defects so far.
1.
I ran a job like as follows.
```
sc.parallelize(1 to 1, 1).toDF.show
```
And then, I got generated code like as follows.
```
/* 006 */ class SpecificSafeProjection extends
org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
/* 007 */
/* 008 */ private Object[] references;
/* 009 */ private MutableRow mutableRow;
/* 010 */ private org.apache.spark.sql.types.StructType schema;
/* 011 */
/* 012 */
/* 013 */ public SpecificSafeProjection(Object[] references) {
/* 014 */ this.references = references;
/* 015 */ mutableRow = (MutableRow) references[references.length - 1];
/* 016 */ this.schema = (org.apache.spark.sql.types.StructType)
references[0];
/* 017 */ }
/* 018 */
/* 019 */ public java.lang.Object apply(java.lang.Object _i) {
/* 020 */ InternalRow i = (InternalRow) _i;
/* 021 */ /* createexternalrow(if (isnull(input[0, int])) null else
input[0, int], StructField(value,IntegerType,false)) @ selectExpr at
<console>:27 */
/* 022 */ boolean isNull = false;
/* 023 */ final Object[] values = new Object[1];
```
The code above is about `SpecificSafeProjection` generated by
`boundTEncoder.fromRow` called in `Dataset#collect`.
I think the code above is not directly related to query or DSL written by
application developers so It's funny that call sites are in the comments.
2.
When columns are described in DSL style like $"column", call sites in
comment can be `$`.
You will reproduce it by the code bellow.
```
sc.parallelize(1 to 1, 1).select($"value").show
```
3.
Call sites of multiple jobs can be scrambled. For example, you can
reproduce by following code.
```
val df = sc.parallelize(1 to 1, 1).toDF
df.select($"value").show
df.selectExpr("value").show
```
After you run the second job above, you can see generated code as follows.
```
/* 006 */ class SpecificSafeProjection extends
org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
/* 007 */
/* 008 */ private Object[] references;
/* 009 */ private MutableRow mutableRow;
/* 010 */ private org.apache.spark.sql.types.StructType schema;
/* 011 */
/* 012 */
/* 013 */ public SpecificSafeProjection(Object[] references) {
/* 014 */ this.references = references;
/* 015 */ mutableRow = (MutableRow) references[references.length - 1];
/* 016 */ this.schema = (org.apache.spark.sql.types.StructType)
references[0];
/* 017 */ }
/* 018 */
/* 019 */ public java.lang.Object apply(java.lang.Object _i) {
/* 020 */ InternalRow i = (InternalRow) _i;
/* 021 */ /* createexternalrow(if (isnull(input[0, int])) null else
input[0, int], StructField(value,IntegerType,false)) @ selectExpr at
<console>:27 */
/* 022 */ boolean isNull = false;
/* 023 */ final Object[] values = new Object[1];
/* 024 */ /* if (isnull(input[0, int])) null else input[0, int] @
selectExpr at <console>:27 */
```
As you can see, `select` and `selectExpr` appear in the comment above
although `select` is not used in the second job.
Another case is here.
```
val df = sc.parallelize(1 to 1, 1).toDF
df.select($"value" + 1).show
df.selectExpr("value + 2").show
...
/* 009 */ final class GeneratedIterator extends
org.apache.spark.sql.execution.BufferedRowIterator {
...
/* 034 */ /*** CONSUME: Project [(value#1 + 2) AS (value + 2)#7] */
/* 035 */
/* 036 */ /*** CONSUME: WholeStageCodegen */
/* 037 */
/* 038 */ /* (input[0, int] + 2) @ selectExpr at <console>:27 */
/* 039 */ /* input[0, int] @ select at <console>:27 */
/* 040 */ /* input[0, int] @ selectExpr at <console>:27 */
/* 041 */ int inputadapter_value = inputadapter_row.getInt(0);
...
/* 050 */ }
```
4.
As I mentioned before, could you make `TreeNode` serializable?
>Sorry, I said we don't need to make TreeNode serializable but it's needed
otherwise origin is not serialized and origin information like callsite is not
in the comment in the generated code when WholeStageCodegen is disabled.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]