gtwuser commented on issue #2265: URL: https://github.com/apache/hudi/issues/2265#issuecomment-1141003069
> @kazdy I did a workaround by changing the schema to just `string`. In any case, the hive table column can be parsed using some UDFs to an array or list

@stym06 can you please share how you resolved this issue by casting to `string`? I was following this [stackoverflow link](https://stackoverflow.com/questions/60297547/handling-empty-arrays-in-pyspark-optional-binary-element-utf8-is-not-a-group) to achieve that, but got stuck at the requirement of turning all incoming array records into strings. It seems to work only if the arrays are surrounded by double quotes (`"[]"`). What I tried was to infer the schema of the array data and store it in another column, `colSchema`, but that didn't help. I'm also not sure how to get the data out of the `colSchema` column, which holds the schema for the non-empty array column. Any pointers will be highly appreciated.

Schema with all columns cast to `string`:

```
root
 |-- id: string (nullable = true)
 |-- some-array: string (nullable = true)
 |-- colSchema: string (nullable = true)
```

Schema after `from_json()` is applied:

```
root
 |-- id: string (nullable = true)
 |-- some-array: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- array-field-1: string (nullable = true)
 |    |    |-- array-field-2: string (nullable = true)
 |-- colSchema: string (nullable = true)
```
