umehrot2 opened a new pull request #1223: [HUDI-530] Fix conversion of Spark 
struct type to Avro schema
URL: https://github.com/apache/incubator-hudi/pull/1223
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   With the migration of Hudi to Spark `2.4.4` and to the native `spark-avro` module, struct fields are converted incorrectly because `spark-avro` handles Avro schema conversion differently from `databricks-avro`. This was reported earlier for EMR in https://github.com/apache/incubator-hudi/issues/1034 and now exists on Hudi master as well.
   
   The issue is that `spark-avro` names the `Avro namespace` differently from `databricks-avro` when converting a Spark schema to an Avro schema. For example, suppose the data is:
   
   ```
   List("{ \"deviceId\": \"aaaaa\", \"eventType\": \"uditevent1\", 
\"eventTimeMilli\": 1574297893836, \"location\": { \"latitude\": 2.5, 
\"longitude\": 3.5 }}");
   ```
   
   `databricks-avro` used to convert this to an Avro schema in which the namespace of the `location` struct field contains the field name:
   ```
   {
     "type" : "record",
     "name" : "hudi_issue_1034_dec30_01_record",
     "namespace" : "hoodie.hudi_issue_1034_dec30_01",
     "fields" : [ {
       "name" : "deviceId",
       "type" : [ "string", "null" ]
     }, {
       "name" : "eventTimeMilli",
       "type" : [ "long", "null" ]
     }, {
       "name" : "location",
       "type" : [ {
         "type" : "record",
         "name" : "location",
         "namespace" : "hoodie.hudi_issue_1034_dec30_01.location",
         "fields" : [ {
           "name" : "latitude",
           "type" : [ "double", "null" ]
         }, {
           "name" : "longitude",
           "type" : [ "double", "null" ]
         } ]
       }, "null" ]
     } ]
   }
   ```
   `spark-avro` now converts the same data to the following, using the parent `record name` in the namespace instead:
   ```
   {
     "type" : "record",
     "name" : "hudi_issue_1034_dec31_01_record",
     "namespace" : "hoodie.hudi_issue_1034_dec31_01",
     "fields" : [ {
       "name" : "deviceId",
       "type" : [ "string", "null" ]
     }, {
       "name" : "eventTimeMilli",
       "type" : [ "long", "null" ]
     }, {
       "name" : "location",
       "type" : [ {
         "type" : "record",
         "name" : "location",
         "namespace" : 
"hoodie.hudi_issue_1034_dec31_01.hudi_issue_1034_dec31_01_record",
         "fields" : [ {
           "name" : "latitude",
           "type" : [ "double", "null" ]
         }, {
           "name" : "longitude",
           "type" : [ "double", "null" ]
         } ]
       }, "null" ]
     } ]
   }
   ```
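The practical consequence is that the nested record's fully qualified Avro name differs between the two converters, so Avro treats the two `location` records as different types during schema resolution. A minimal stdlib-only sketch (using a hypothetical `hoodie.example` namespace in place of the date-stamped names above, and trimming the schemas to the relevant parts) illustrates the difference:

```python
import json

# Abbreviated copy of the databricks-avro schema above: the nested record's
# namespace is the parent namespace plus the *field name* ("location").
databricks_schema = json.loads("""
{
  "type": "record", "name": "example_record", "namespace": "hoodie.example",
  "fields": [{
    "name": "location",
    "type": [{
      "type": "record", "name": "location",
      "namespace": "hoodie.example.location",
      "fields": [{"name": "latitude", "type": ["double", "null"]}]
    }, "null"]
  }]
}
""")

# Abbreviated copy of the spark-avro schema above: the nested record's
# namespace is the parent namespace plus the *parent record name*.
spark_avro_schema = json.loads("""
{
  "type": "record", "name": "example_record", "namespace": "hoodie.example",
  "fields": [{
    "name": "location",
    "type": [{
      "type": "record", "name": "location",
      "namespace": "hoodie.example.example_record",
      "fields": [{"name": "latitude", "type": ["double", "null"]}]
    }, "null"]
  }]
}
""")

def nested_record_fullname(schema, field_name):
    """Return the fully qualified Avro name (namespace + name) of the
    record nested inside the given nullable union field."""
    field = next(f for f in schema["fields"] if f["name"] == field_name)
    record = next(t for t in field["type"] if isinstance(t, dict))
    return record["namespace"] + "." + record["name"]

print(nested_record_fullname(databricks_schema, "location"))
# → hoodie.example.location.location
print(nested_record_fullname(spark_avro_schema, "location"))
# → hoodie.example.example_record.location
```

Because an Avro record's identity is its fully qualified name, the two schemas above are not interchangeable even though the field layout is identical.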
   
   This PR fixes the above issue, now that we have migrated to `spark-avro`.
   
   ## Brief change log
   
   - Fix conversion of Spark struct type to Avro schema
   - Modify the schema of the data used in unit tests and integration tests to include struct-type fields, so that any issue with struct types is caught early
   
   ## Verify this pull request
   
   This PR modifies the schema of the data used across unit tests and certain integration tests to include a struct field. From now on, unit and integration tests will catch any issue with struct fields.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
