yma11 commented on PR #5437: URL: https://github.com/apache/incubator-gluten/pull/5437#issuecomment-2062794394
> Thanks! Not sure if this fuzzer test can cover many cases. Looks like the data generator only generates some simple data types https://github.com/apache/parquet-mr/blob/master/parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/DataGenerator.java#L156-L165 But issues like [facebookincubator/velox#9242](https://github.com/facebookincubator/velox/issues/9242) can be related to complex data types.

Yes. The parquet-mr DataGenerator only covers simple data types, since it is used for benchmark tests. Complex type support is a gap.

Parquet read issues are mainly about 1) legacy formats/schemas that Velox hasn't fully covered, and 2) complex type reading when the schema is recognized correctly, especially null/empty values.

For 1), we can use the script in this PR, since it generates files with different schemas depending on the parquet version used. We could add more options to generate more kinds of schemas.

For 2), we can add complex type generation to it and try to upstream it, or we can leverage the data generation in Gluten to cover null/empty values.

Both are fine. What do you think?
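To make case 2) a bit more concrete, here is a minimal sketch of the kind of rows a complex-type generator would need to emit. It is not the parquet-mr DataGenerator nor Gluten's data generation; the schema (`id`, `tags`, `attrs`, `point`), the helper names, and the `null_ratio`/`empty_ratio` knobs are all hypothetical and only illustrate mixing null and empty values at each nesting level.

```python
import random


def maybe(rng, null_ratio, make):
    """Return None with probability null_ratio, otherwise the result of make()."""
    return None if rng.random() < null_ratio else make()


def gen_list(rng, empty_ratio, make_elem, max_len=3):
    """Return an empty list with probability empty_ratio, else 1..max_len elements."""
    if rng.random() < empty_ratio:
        return []
    return [make_elem() for _ in range(rng.randint(1, max_len))]


def gen_row(rng, row_id, null_ratio=0.2, empty_ratio=0.2):
    """Build one row for a hypothetical schema:
    id: int, tags: list<int>, attrs: map<string,int>, point: struct<x:int, y:int>.
    Every nested field (and every nested element) may be null; lists and maps may be empty.
    """
    def elem():
        return maybe(rng, null_ratio, lambda: rng.randint(0, 99))

    def gen_map():
        values = gen_list(rng, empty_ratio, elem)
        return {"k%d" % i: v for i, v in enumerate(values)}

    return {
        "id": row_id,
        "tags": maybe(rng, null_ratio, lambda: gen_list(rng, empty_ratio, elem)),
        "attrs": maybe(rng, null_ratio, gen_map),
        "point": maybe(rng, null_ratio, lambda: {"x": elem(), "y": elem()}),
    }


def gen_rows(n, seed=0, **ratios):
    """Generate n rows deterministically from a seed so failures can be replayed."""
    rng = random.Random(seed)
    return [gen_row(rng, i, **ratios) for i in range(n)]
```

Rows like these would still have to be written out with each parquet version/writer under test; the sketch only covers the null/empty value distribution, and the fixed seed is there so a failing file can be regenerated.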
