umehrot2 opened a new pull request #1223: [HUDI-530] Fix conversion of Spark struct type to Avro schema
URL: https://github.com/apache/incubator-hudi/pull/1223

## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contributing.html before opening a pull request.*

## What is the purpose of the pull request

With the migration of Hudi to `spark 2.4.4` and to the native `spark-avro`, struct fields are converted incorrectly, because `spark-avro` handles Avro schema conversion differently from `databricks-avro`. This was reported earlier for EMR in https://github.com/apache/incubator-hudi/issues/1034 and now exists in Hudi master as well.

The issue is that `spark-avro` names the `Avro namespace` differently from `databricks-avro` when converting the schema to an Avro schema. For example, suppose the data is:

```
List("{ \"deviceId\": \"aaaaa\", \"eventType\": \"uditevent1\", \"eventTimeMilli\": 1574297893836, \"location\": { \"latitude\": 2.5, \"longitude\": 3.5 }}");
```

`databricks-avro` used to convert it to an Avro schema in which the namespace of the `location` struct field contains the field name:

```
{
  "type" : "record",
  "name" : "hudi_issue_1034_dec30_01_record",
  "namespace" : "hoodie.hudi_issue_1034_dec30_01",
  "fields" : [
    { "name" : "deviceId", "type" : [ "string", "null" ] },
    { "name" : "eventTimeMilli", "type" : [ "long", "null" ] },
    { "name" : "location", "type" : [ {
        "type" : "record",
        "name" : "location",
        "namespace" : "hoodie.hudi_issue_1034_dec30_01.location",
        "fields" : [
          { "name" : "latitude", "type" : [ "double", "null" ] },
          { "name" : "longitude", "type" : [ "double", "null" ] }
        ]
      }, "null" ] }
  ]
}
```

`spark-avro` now converts the same data to the following, using the name of the enclosing record in the namespace instead of the field name:

```
{
  "type" : "record",
  "name" : "hudi_issue_1034_dec31_01_record",
  "namespace" : "hoodie.hudi_issue_1034_dec31_01",
  "fields" : [
    { "name" : "deviceId", "type" : [ "string", "null" ] },
    { "name" : "eventTimeMilli", "type" : [ "long", "null" ] },
    { "name" : "location", "type" : [ {
        "type" : "record",
        "name" : "location",
        "namespace" : "hoodie.hudi_issue_1034_dec31_01.hudi_issue_1034_dec31_01_record",
        "fields" : [
          { "name" : "latitude", "type" : [ "double", "null" ] },
          { "name" : "longitude", "type" : [ "double", "null" ] }
        ]
      }, "null" ] }
  ]
}
```

This PR fixes the above issue now that we have migrated to `spark-avro`.

## Brief change log

- Fix conversion of Spark struct type to Avro schema
- Modify the schema of the data used in unit tests and integration tests to include struct type data as well, so that any issue with struct types is caught earlier

## Verify this pull request

This PR modifies the schema of the data used across unit tests and certain integration tests to include a struct field. From now on, unit and integration tests will catch any issue with struct fields.

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
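
The two namespace conventions described under "What is the purpose of the pull request" can be illustrated with a minimal Python sketch. It is not the code of `databricks-avro`, `spark-avro`, or this PR; it only reproduces the naming rule visible in the two schemas above. The function `to_avro`, the `style` argument, and the names `demo_record` and `hoodie.demo` are hypothetical and chosen for illustration.

```python
def to_avro(fields, record_name, namespace, style):
    """Build an Avro record schema (as a dict) from a nested dict of fields.

    fields: maps field name -> Avro primitive type name, or a nested dict
            for a struct field.
    style:  "databricks" appends the field name to the child namespace;
            "spark" appends the enclosing record name instead.
    This is a simplified illustration, not either library's implementation.
    """
    avro_fields = []
    for name, ftype in fields.items():
        if isinstance(ftype, dict):
            if style == "databricks":
                child_namespace = f"{namespace}.{name}"
            else:
                child_namespace = f"{namespace}.{record_name}"
            # The nested record is named after the field in both styles.
            ftype = to_avro(ftype, name, child_namespace, style)
        avro_fields.append({"name": name, "type": [ftype, "null"]})
    return {
        "type": "record",
        "name": record_name,
        "namespace": namespace,
        "fields": avro_fields,
    }


# Hypothetical struct mirroring the shape of the example data.
struct = {
    "deviceId": "string",
    "eventTimeMilli": "long",
    "location": {"latitude": "double", "longitude": "double"},
}

old_style = to_avro(struct, "demo_record", "hoodie.demo", "databricks")
new_style = to_avro(struct, "demo_record", "hoodie.demo", "spark")

# Namespace of the nested "location" record under each convention.
old_ns = old_style["fields"][2]["type"][0]["namespace"]
new_ns = new_style["fields"][2]["type"][0]["namespace"]
print(old_ns)
print(new_ns)
```

The nested record's `name` is the same in both cases; only its `namespace` differs, so the fully qualified name of the struct changes between the two converters.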
