alamb opened a new issue, #75: URL: https://github.com/apache/parquet-testing/issues/75
## Use Case (What are you trying to do?)

We are trying to organize the implementation of Variant in the Rust implementation of parquet and arrow:

- https://github.com/apache/arrow-rs/issues/6736

We would like to make sure the Rust implementation is compatible with other implementations (which seem mostly JVM / Spark focused at the moment).

From what I can tell, the JVM based implementations are tested by verifying that values round trip to and from JSON. For example, see the `ParquetVariantShreddingSuite`: https://github.com/apache/spark/blob/418cfd1f78014698ac4baac21156341a11b771b3/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetVariantShreddingSuite.scala#L30

There are several limitations with this approach:

1. It doesn't ensure compatibility across language implementations (it only ensures consistency between a single implementation's reader and writer)
2. VARIANTs have more types than JSON (e.g. timestamps, etc.), so using JSON limits the range of types that can be tested

## What do I want

I would like example data in the parquet-testing repository that contains:

1. Example binary variant data (e.g. metadata and data fields)
2. A parquet file with a column that stores variant data (but does not "shred" any of the columns)
3. A parquet file with the same data as 2, but that stores several of the columns "shredded" (aka some of the fields stored in their own columns, as described in ['VariantShredding'](https://github.com/apache/parquet-format/blob/master/VariantShredding.md)) when storing in parquet files; see the schema sketch below

Each of the above should:

1. Have some sort of human interpretable description of the encoded values to help verify comparisons (e.g. text, markdown or json)
2. Cover the variant scalar types
3. Cover the variant nested types (struct, etc)

I recommend keeping the scalar and nested types in separate files / columns to make it easier to incrementally implement variant support (starting with non-nested types and then moving on to nested types).

Having the above data would permit other parquet implementations to start with a reader that can handle the basic types and then move on to more complex parts (like nested types and shredding). This is similar to how [`alltypes_plain.parquet`](https://github.com/apache/parquet-testing/blob/master/data/alltypes_plain.parquet) is used today.
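For item 3, a rough sketch of how the unshredded and shredded layouts differ may help. This is based on my reading of [VariantShredding.md](https://github.com/apache/parquet-format/blob/master/VariantShredding.md); the column name `v`, the example shredding schema (`a int, b string`), and the exact physical/logical types are illustrative only, and (per David's note below) files written by Spark 4.0 today may not carry the `VARIANT` annotation:

```
# Unshredded: the whole variant is carried in two binary fields
optional group v (VARIANT) {
  required binary metadata;
  required binary value;
}

# Shredded with "a int, b string": shredded fields get typed columns, and the
# binary `value` fields hold whatever does not match the shredding schema
optional group v (VARIANT) {
  required binary metadata;
  optional binary value;
  optional group typed_value {
    required group a {
      optional binary value;
      optional int32 typed_value;
    }
    required group b {
      optional binary value;
      optional binary typed_value (STRING);
    }
  }
}
```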
## Suggestions

@cashmand David Cashman suggests on the Parquet Dev list: https://lists.apache.org/thread/22dvcnm7v5d30slzc3hp8d9qq8syj1dq

> Hi Andrew, you should be able to create shredded files using OSS Spark
> 4.0. I think the only issue is that it doesn't have the logical type
> annotation yet, so readers wouldn't be able to distinguish it from a
> non-variant struct that happens to have the same schema. (Spark is
> able to infer that it is a Variant from the
> `org.apache.spark.sql.parquet.row.metadata` metadata.)
>
> The ParquetVariantShreddingSuite in Spark has some tests that write
> and read shredded parquet files. Below is an example that translates
> the first test into code that runs in spark-shell and writes a Parquet
> file. The shredding schema is set via conf. If you want to test types
> that Spark doesn't infer in parse_json (e.g. timestamp, binary), you
> can use `to_variant_object` to cast structured values to Variant.
>
> I won't have time to work on this in the next couple of weeks, but am
> happy to answer any questions.
>
> Thanks,
> David

```
scala> import org.apache.spark.sql.internal.SQLConf

scala> spark.conf.set(SQLConf.VARIANT_WRITE_SHREDDING_ENABLED.key, true)

scala> spark.conf.set(SQLConf.VARIANT_ALLOW_READING_SHREDDED.key, true)

scala> spark.conf.set(SQLConf.VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST.key, "a int, b string, c decimal(15, 1)")

scala> val df = spark.sql(
     | """
     | | select case
     | | when id = 0 then parse_json('{"a": 1, "b": "2", "c": 3.3, "d": 4.4}')
     | | when id = 1 then parse_json('{"a": [1,2,3], "b": "hello", "c": {"x": 0}}')
     | | when id = 2 then parse_json('{"A": 1, "c": 1.23}')
     | | end v from range(3)
     | |""".stripMargin)

scala> df.write.mode("overwrite").parquet("/tmp/shredded_test")

scala> spark.read.parquet("/tmp/shredded_test").show
+--------------------+
|                   v|
+--------------------+
|{"a":1,"b":"2","c...|
|{"a":[1,2,3],"b":...|
|    {"A":1,"c":1.23}|
+--------------------+
```
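On David's point about types that `parse_json` cannot produce (and limitation 2 above), one way to generate such data might be to extend his example with `to_variant_object`. The following is an unverified sketch in the same spark-shell style; the literal values, column name `v`, and output path are made up, and I have not checked exactly which scalar types `to_variant_object` carries through:

```
scala> val df2 = spark.sql(
     | """
     | | select to_variant_object(named_struct(
     | |   'ts',  timestamp'2024-01-01 12:00:00',
     | |   'd',   date'2024-01-01',
     | |   'bin', X'CAFE',
     | |   'f',   cast(1.5 as float)
     | | )) v from range(1)
     | |""".stripMargin)

scala> df2.write.mode("overwrite").parquet("/tmp/variant_scalar_test")
```

The resulting file, plus a human readable description of the expected values, could then be a candidate for the non-shredded scalar-type example described above.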
