gtwuser commented on issue #2265:
URL: https://github.com/apache/hudi/issues/2265#issuecomment-1141003069

   > @kazdy I did a workaround by changing the schema to just `string`. In any case, the hive table column can be parsed using some UDFs into an array or list.
   @stym06 can you please share how you resolved this issue by casting to
   `string`? I was following this [stackoverflow
   link](https://stackoverflow.com/questions/60297547/handling-empty-arrays-in-pyspark-optional-binary-element-utf8-is-not-a-group)
   to achieve that, but got stuck at the requirement of making all incoming
   array records strings.
   It seems to work only if the arrays are surrounded by double quotes
   (`\"[]\"`).
   What I tried was to store the inferred schema from the array data in
   another column `colSchema`, but it didn't help. I'm also not sure how to
   get the data back out of `colSchema`, which holds the schema for the
   non-empty array column.
   Schema with all columns cast to `string` (any pointers would be highly
   appreciated):
   root
    |-- id: string (nullable = true)
    |-- some-array: string (nullable = true)
    |-- colSchema: string (nullable = true)
   
   Schema after from_json() is applied:
   root
    |-- id: string (nullable = true)
    |-- some-array: array (nullable = true)
    |    |-- element: struct (containsNull = true)
    |    |    |-- array-field-1: string (nullable = true)
    |    |    |-- array-field-2: string (nullable = true)
    |-- colSchema: string (nullable = true)
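   For what it's worth, the round-trip behind this workaround can be sketched
   in plain Python (the records below are hypothetical; in Spark the same two
   steps would be `to_json()` on the write side and `from_json()` with the
   known element schema on the read side):

   ```python
   # Sketch of the "cast arrays to string" workaround: storing the array
   # column as a JSON string sidesteps the Parquet error on empty arrays,
   # since an empty array just becomes the string "[]".
   import json

   # Hypothetical records matching the schemas above; record "2" has the
   # problematic empty array.
   records = [
       {"id": "1", "some-array": [{"array-field-1": "a", "array-field-2": "b"}]},
       {"id": "2", "some-array": []},
   ]

   # Write side: serialize the array column to a plain JSON string, so the
   # table stores `some-array` as string (nullable = true).
   stored = [{**r, "some-array": json.dumps(r["some-array"])} for r in records]

   # Read side: parse the string back into an array of structs -- the
   # from_json() step in Spark, applied with the known element schema.
   restored = [{**r, "some-array": json.loads(r["some-array"])} for r in stored]

   assert restored == records
   ```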
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
