[GitHub] [spark] davidrabinowitz opened a new pull request #30372: [SPARK-33172][SQL] Adding support for UserDefinedType for Spark SQL Code generator

GitBox Fri, 13 Nov 2020 13:31:03 -0800


davidrabinowitz opened a new pull request #30372:
URL: https://github.com/apache/spark/pull/30372



   This PR is based on the master branch and replaces PR #30071.
   
   ### What changes were proposed in this pull request?
   Make `CodeGenerator.getValueFromVector()` handle `UserDefinedType`s correctly, the same way `CodeGenerator.javaType()` does.
   
   ### Why are the changes needed?
   Without this change, the generated Java code does not compile. The error is:
   ```
   org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 153, Column 126: No applicable constructor/method found for actual parameters "int, int"; candidates are: "public org.apache.spark.sql.vectorized.ColumnarRow org.apache.spark.sql.vectorized.ColumnVector.getStruct(int)"
   ```
   The fix makes sure the method call has just one parameter.
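
To illustrate the failure mode and the fix, here is a minimal, self-contained sketch. It is a simplified model and not Spark's actual code: the class `VectorAccessSketch` and its nested types are hypothetical stand-ins, and only the two method names mirror the ones mentioned above. It assumes that the row-style accessor emits a two-argument `getStruct(ordinal, numFields)` call, while `ColumnVector.getStruct` takes a single `int`, as the error message shows.

```java
// Simplified model for illustration only; the types below are hypothetical
// stand-ins, not Spark's real DataType hierarchy.
final class VectorAccessSketch {
    interface DataType {}

    static final class IntType implements DataType {}

    static final class StructType implements DataType {
        final int numFields;
        StructType(int numFields) { this.numFields = numFields; }
    }

    // A user-defined type that is physically stored as an underlying SQL type.
    static final class UserDefinedType implements DataType {
        final DataType sqlType;
        UserDefinedType(DataType sqlType) { this.sqlType = sqlType; }
    }

    // Row-style accessor: struct reads are emitted with (ordinal, numFields).
    static String getValue(String input, DataType type, String ordinal) {
        if (type instanceof UserDefinedType) {
            return getValue(input, ((UserDefinedType) type).sqlType, ordinal);
        }
        if (type instanceof StructType) {
            return input + ".getStruct(" + ordinal + ", " + ((StructType) type).numFields + ")";
        }
        return input + ".getInt(" + ordinal + ")";
    }

    // Vector-style accessor: unwrap the UDT first, so a struct-backed UDT
    // takes the single-argument getStruct branch instead of falling through
    // to the row-style accessor.
    static String getValueFromVector(String vector, DataType type, String rowId) {
        DataType resolved = type;
        while (resolved instanceof UserDefinedType) {
            resolved = ((UserDefinedType) resolved).sqlType;
        }
        if (resolved instanceof StructType) {
            return vector + ".getStruct(" + rowId + ")";
        }
        return getValue(vector, resolved, rowId);
    }

    public static void main(String[] args) {
        DataType vectorUdt = new UserDefinedType(new StructType(4));
        // prints "col.getStruct(rowId)"
        System.out.println(getValueFromVector("col", vectorUdt, "rowId"));
    }
}
```

In this model, skipping the unwrapping loop would send the struct-backed UDT to `getValue`, which emits the two-argument call that fails to compile against a column vector.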
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   I've added a unit test to verify that the proper code is generated: `getStruct(ordinal)`.
   
   ### How to verify
   
   To verify the fix, first create a table in BigQuery as follows:
   ```
   bq load --source_format NEWLINE_DELIMITED_JSON <TABLE> vector_test.data.json vector_test.schema.json
   ```
   The files are:
   
   - vector_test.data.json:
   ```
   {"name":"row1","num":"1","vector":{"type":"1","indices":[],"values":[1,2,3]}}
   {"name":"row2","num":"2","vector":{"type":"1","indices":[],"values":[4,5,6]}}
   {"name":"row3","num":"3","vector":{"type":"1","indices":[],"values":[7,8,9]}}
   ```
   
   - vector_test.schema.json:
   ```
   [
     {
       "mode": "NULLABLE",
       "name": "name",
       "type": "STRING"
     },
     {
       "mode": "NULLABLE",
       "name": "num",
       "type": "INTEGER"
     },
     {
       "description": "{spark.type=vector}",
       "fields": [
         {
           "mode": "NULLABLE",
           "name": "type",
           "type": "INTEGER"
         },
         {
           "mode": "NULLABLE",
           "name": "size",
           "type": "INTEGER"
         },
         {
           "mode": "REPEATED",
           "name": "indices",
           "type": "INTEGER"
         },
         {
           "mode": "REPEATED",
           "name": "values",
           "type": "FLOAT"
         }
       ],
       "mode": "NULLABLE",
       "name": "vector",
       "type": "RECORD"
     }
   ]
   ```
   A GCP account is needed for this, but the amount of data and the operations involved are well within the free tier.
   
   Run `spark-shell --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.17.3` and enter the following commands:
   ```
   val df = spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").load("<TABLE>")
   df.schema
   df.show()
   ```
   
   Note that when the format is changed to `bigquery`, a different code path is used that does not rely on the code generator and hence does not suffer from this issue.
   
   

