Re: [PR] [GLUTEN-5434] Add script for parquet read fuzzer test [incubator-gluten]

via GitHub Wed, 17 Apr 2024 17:52:33 -0700


yma11 commented on PR #5437:
URL: https://github.com/apache/incubator-gluten/pull/5437#issuecomment-2062794394

> Thanks! Not sure if this fuzzer test can cover many cases. Looks like the
data generator only generates some simple data types
https://github.com/apache/parquet-mr/blob/master/parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/DataGenerator.java#L156-L165
But issues like
[facebookincubator/velox#9242](https://github.com/facebookincubator/velox/issues/9242)
can be related to complex data types.

Yes. The parquet-mr DataGenerator only covers simple data types, as it's used
for benchmark tests. Complex type support is a gap. Parquet read issues are
mainly about 1) legacy formats/schemas that Velox hasn't fully covered and
2) complex type reading when the schema can be recognized correctly, especially
null/empty values. For 1), we can use the script in this PR, as it generates
files with different schemas depending on the parquet version used. We could
add more options to control generating more kinds of schemas. For 2), we can
add complex type generation to it and try to upstream that, or we can leverage
the data generation in Gluten to cover null/empty values. Both are fine. What
do you think?
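
For 2), a rough sketch of what null/empty-aware complex type generation could
look like. This is only an illustration in plain Python, not the parquet-mr
DataGenerator or the Gluten generator; the schema dict layout, `gen_value`,
and the `null_prob`/`empty_prob` knobs are all hypothetical names, and writing
the rows out as Parquet is not shown.

```python
import random

def gen_value(schema, rng, null_prob=0.2, empty_prob=0.2, max_len=3):
    """Generate one random value for a small, hypothetical schema description."""
    # Any field, including nested ones, may be null.
    if rng.random() < null_prob:
        return None
    kind = schema["type"]
    if kind == "int":
        return rng.randint(-1000, 1000)
    if kind == "string":
        # An empty string is a different edge case from a null string.
        return "" if rng.random() < empty_prob else f"s{rng.randint(0, 99)}"
    if kind == "list":
        # An empty list is a different edge case from a null list.
        n = 0 if rng.random() < empty_prob else rng.randint(1, max_len)
        return [gen_value(schema["element"], rng, null_prob, empty_prob, max_len)
                for _ in range(n)]
    if kind == "map":
        n = 0 if rng.random() < empty_prob else rng.randint(1, max_len)
        # Keys are never null; only values go through the null/empty logic.
        return {f"k{i}": gen_value(schema["value"], rng, null_prob, empty_prob, max_len)
                for i in range(n)}
    if kind == "struct":
        return {name: gen_value(child, rng, null_prob, empty_prob, max_len)
                for name, child in schema["fields"].items()}
    raise ValueError(f"unsupported type: {kind}")

# Hypothetical schema: struct<id: int, tags: list<string>, attrs: map<string, list<int>>>
SCHEMA = {
    "type": "struct",
    "fields": {
        "id": {"type": "int"},
        "tags": {"type": "list", "element": {"type": "string"}},
        "attrs": {"type": "map",
                  "value": {"type": "list", "element": {"type": "int"}}},
    },
}

def gen_rows(count, seed=0, **kwargs):
    """Generate `count` rows reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [gen_value(SCHEMA, rng, **kwargs) for _ in range(count)]
```

A fixed seed keeps a failing case reproducible, and raising `null_prob` or
`empty_prob` would bias the data toward the null/empty combinations.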

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

