gsanon opened a new issue, #12778:
URL: https://github.com/apache/hudi/issues/12778

   **Describe the problem you faced**
   
   Currently it is not possible to write Hudi data if the source DataFrame contains a field with a union type.
   
   Say you have an Avro schema with a union-type field:
   ```json
   {
     "name": "field",
     "type":
     [
       "null",
       "int",
       "string",
       "boolean"
     ],
     "default": null
   }
   ```
   
   If you create a Parquet file from this schema, the field is transformed into a struct with one `memberX` entry for each non-null type of the union: `field<struct<member0:int, member1:string, member2:boolean>>`
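
To make the naming concrete, here is a minimal sketch of the mapping described above. It is not the converter's actual code: `union_to_struct` is a hypothetical helper, it assumes every branch is a primitive type name, and it ignores special cases such as a union with a single non-null branch.

```python
def union_to_struct(avro_union):
    """Describe the struct that a multi-branch Avro union ends up as.

    Assumption: branches are primitive type names given as strings.
    The "null" branch is dropped and the rest are numbered in order.
    """
    branches = [t for t in avro_union if t != "null"]
    members = ", ".join(f"member{i}:{t}" for i, t in enumerate(branches))
    return f"struct<{members}>"


print(union_to_struct(["null", "int", "string", "boolean"]))
# struct<member0:int, member1:string, member2:boolean>
```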
   
   When writing this data with Hudi 0.x, the process fails because Hudi takes only the first type of the union and then tries to write the struct into that type. In our case you get something like:
   `java.lang.IllegalArgumentException: StructType(StructField(member0,IntegerType,true),StructField(member1,StringType,true),StructField(member2,BooleanType,true)) and IntegerType are incompatible:`
   
   This behavior was a bit [opaque](https://github.com/apache/hudi/blob/release-0.14.1/hudi-common/src/main/java/org/apache/hudi/internal/schema/convert/AvroInternalSchemaConverter.java#L199-L204) in 0.x, but 1.0.0 makes it explicit [here](https://github.com/apache/hudi/blob/14c292c626cd8d18b5997a90cfbb865befb5f6d2/hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java#L436-L439).
   
   So each time we encounter Parquet files containing fields with a union type, we need to pre-process the data as an inelegant workaround (renaming the `memberX` fields to avoid the union type detection).
   
   Since both Parquet and Avro allow union types, we could expect Hudi to handle them one way or another (for example, by keeping the member struct as it is?). What do you think?
   
   

