yma11 commented on PR #5437:
URL: 
https://github.com/apache/incubator-gluten/pull/5437#issuecomment-2062794394

   > Thanks! Not sure if this fuzzer test can cover many cases. Looks like the data generator only generates some simple data types
   > https://github.com/apache/parquet-mr/blob/master/parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/DataGenerator.java#L156-L165
   > But issues like [facebookincubator/velox#9242](https://github.com/facebookincubator/velox/issues/9242) can be related to complex data types.
   
   Yes. The parquet-mr DataGenerator only covers simple data types because it
is used for benchmark tests, so complex type support is a gap. Parquet read
issues are mainly about 1) legacy formats/schemas that Velox hasn't fully
covered, and 2) complex type reading when the schema is recognized correctly,
especially null/empty values. For 1), we can use the script in this PR, since
it generates files with different schemas depending on the Parquet version
used. We could also add more options to generate more kinds of schemas. For
2), we can either add complex type generation to it and try to upstream that,
or leverage the data generation in Gluten to cover null/empty values. Both are
fine. What do you think?
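
   To make option 2) more concrete, here is a rough sketch of what complex type generation with null/empty coverage could look like. It is plain Python rather than the parquet-mr DataGenerator or Gluten's generator, and every name in it (`gen_rows`, `gen_array`, `null_ratio`, `empty_ratio`, ...) is hypothetical. It only builds in-memory rows; writing them out with a Parquet writer is left out.

```python
import random
from typing import Any, Callable, Dict, List, Optional


def nullable(rng: random.Random, make: Callable[[], Any], null_ratio: float) -> Any:
    """Return None with probability null_ratio, otherwise a generated value."""
    return None if rng.random() < null_ratio else make()


def gen_array(rng: random.Random, null_ratio: float, empty_ratio: float,
              max_len: int = 3) -> Optional[List[Optional[int]]]:
    """Array column: the array may be null, empty, or hold null elements."""
    roll = rng.random()
    if roll < null_ratio:
        return None
    if roll < null_ratio + empty_ratio:
        return []
    return [nullable(rng, lambda: rng.randint(0, 100), null_ratio)
            for _ in range(rng.randint(1, max_len))]


def gen_map(rng: random.Random, null_ratio: float, empty_ratio: float,
            max_len: int = 3) -> Optional[Dict[str, Optional[float]]]:
    """Map column: keys are never null, but the map or its values may be."""
    roll = rng.random()
    if roll < null_ratio:
        return None
    if roll < null_ratio + empty_ratio:
        return {}
    return {f"k{i}": nullable(rng, rng.random, null_ratio)
            for i in range(rng.randint(1, max_len))}


def gen_struct(rng: random.Random, null_ratio: float,
               empty_ratio: float) -> Optional[Dict[str, Any]]:
    """Struct column with a nullable scalar field and a nested array field."""
    if rng.random() < null_ratio:
        return None
    return {
        "a": nullable(rng, lambda: rng.randint(0, 100), null_ratio),
        "b": gen_array(rng, null_ratio, empty_ratio),
    }


def gen_rows(n: int, seed: int = 0, null_ratio: float = 0.2,
             empty_ratio: float = 0.2) -> List[Dict[str, Any]]:
    """Generate n rows; a fixed seed makes a failing case reproducible."""
    rng = random.Random(seed)
    return [
        {
            "id": i,
            "arr": gen_array(rng, null_ratio, empty_ratio),
            "map": gen_map(rng, null_ratio, empty_ratio),
            "struct": gen_struct(rng, null_ratio, empty_ratio),
        }
        for i in range(n)
    ]
```

   The null and empty ratios are separate knobs so that a null collection, an empty collection, and a collection with null elements all show up, since those are distinct cases for the reader.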


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

