alamb opened a new issue, #75: URL: https://github.com/apache/parquet-testing/issues/75
## Use Case (What are you trying to do?)

We are trying to organize the implementation of Variant in the Rust implementation of parquet and arrow:

- https://github.com/apache/arrow-rs/issues/6736

We would like to make sure the Rust implementation is compatible with other implementations (which seem mostly JVM / Spark focused at the moment).

From what I can tell, the JVM based implementations are tested by verifying that values round trip to and from JSON. For example, see the `ParquetVariantShreddingSuite`: https://github.com/apache/spark/blob/418cfd1f78014698ac4baac21156341a11b771b3/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetVariantShreddingSuite.scala#L30

There are several limitations with this approach:

1. It doesn't ensure compatibility across language implementations (it only ensures consistency between a single implementation's reader and writer)
2. VARIANTs have more types than JSON (e.g. timestamps, etc.), so using JSON limits the range of types that can be tested

## What do I want

I would like example data in the parquet-testing repository that contains:

1. Example binary variant data (e.g. metadata and data fields)
2. A parquet file with a column that stores variant data (but does not "shred" any of the columns)
3. A parquet file with the same data as 2, but that stores several of the columns "shredded" (aka some of the fields stored in their own columns, as described in ['VariantShredding'](https://github.com/apache/parquet-format/blob/master/VariantShredding.md)) when storing in parquet files; see the schema sketch below

Each of the above should:

1. Have some sort of human interpretable description of the encoded values to help verify comparisons (e.g. text, markdown or json)
2. Cover the variant scalar types
3. Cover the variant nested types (struct, etc)

I recommend keeping the scalar and nested types in separate files / columns to make it easier to incrementally implement variant support (starting with non-nested types and then moving on to nested types).

Having the above data would permit other parquet implementations to start with a reader that can handle the basic types and then move on to more complex parts (like nested types and shredding). This is similar to how [`alltypes_plain.parquet`](https://github.com/apache/parquet-testing/blob/master/data/alltypes_plain.parquet) is used today.
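For item 3, a rough sketch of how the unshredded and shredded layouts differ may help. This is based on my reading of [VariantShredding.md](https://github.com/apache/parquet-format/blob/master/VariantShredding.md); the column name `v`, the example shredding schema (`a int, b string`), and the exact physical/logical types are illustrative only, and (per David's note below) files written by Spark 4.0 today may not carry the `VARIANT` annotation:

```
# Unshredded: the whole variant is carried in two binary fields
optional group v (VARIANT) {
  required binary metadata;
  required binary value;
}

# Shredded with "a int, b string": shredded fields get typed columns, and the
# binary `value` fields hold whatever does not match the shredding schema
optional group v (VARIANT) {
  required binary metadata;
  optional binary value;
  optional group typed_value {
    required group a {
      optional binary value;
      optional int32 typed_value;
    }
    required group b {
      optional binary value;
      optional binary typed_value (STRING);
    }
  }
}
```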
## Suggestions

@cashmand David Cashman suggests on the Parquet Dev list: https://lists.apache.org/thread/22dvcnm7v5d30slzc3hp8d9qq8syj1dq

> Hi Andrew, you should be able to create shredded files using OSS Spark
> 4.0. I think the only issue is that it doesn't have the logical type
> annotation yet, so readers wouldn't be able to distinguish it from a
> non-variant struct that happens to have the same schema. (Spark is
> able to infer that it is a Variant from the
> `org.apache.spark.sql.parquet.row.metadata` metadata.)
>
> The ParquetVariantShreddingSuite in Spark has some tests that write
> and read shredded parquet files. Below is an example that translates
> the first test into code that runs in spark-shell and writes a Parquet
> file. The shredding schema is set via conf. If you want to test types
> that Spark doesn't infer in parse_json (e.g. timestamp, binary), you
> can use `to_variant_object` to cast structured values to Variant.
>
> I won't have time to work on this in the next couple of weeks, but am
> happy to answer any questions.
>
> Thanks,
> David

```
scala> import org.apache.spark.sql.internal.SQLConf

scala> spark.conf.set(SQLConf.VARIANT_WRITE_SHREDDING_ENABLED.key, true)

scala> spark.conf.set(SQLConf.VARIANT_ALLOW_READING_SHREDDED.key, true)

scala> spark.conf.set(SQLConf.VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST.key, "a int, b string, c decimal(15, 1)")

scala> val df = spark.sql(
     | """
     | | select case
     | | when id = 0 then parse_json('{"a": 1, "b": "2", "c": 3.3, "d": 4.4}')
     | | when id = 1 then parse_json('{"a": [1,2,3], "b": "hello", "c": {"x": 0}}')
     | | when id = 2 then parse_json('{"A": 1, "c": 1.23}')
     | | end v from range(3)
     | |""".stripMargin)

scala> df.write.mode("overwrite").parquet("/tmp/shredded_test")

scala> spark.read.parquet("/tmp/shredded_test").show
+--------------------+
|                   v|
+--------------------+
|{"a":1,"b":"2","c...|
|{"a":[1,2,3],"b":...|
|    {"A":1,"c":1.23}|
+--------------------+
```
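On David's point about types that `parse_json` cannot produce (and limitation 2 above), one way to generate such data might be to extend his example with `to_variant_object`. The following is an unverified sketch in the same spark-shell style; the literal values, column name `v`, and output path are made up, and I have not checked exactly which scalar types `to_variant_object` carries through:

```
scala> val df2 = spark.sql(
     | """
     | | select to_variant_object(named_struct(
     | |   'ts',  timestamp'2024-01-01 12:00:00',
     | |   'd',   date'2024-01-01',
     | |   'bin', X'CAFE',
     | |   'f',   cast(1.5 as float)
     | | )) v from range(1)
     | |""".stripMargin)

scala> df2.write.mode("overwrite").parquet("/tmp/variant_scalar_test")
```

The resulting file, plus a human readable description of the expected values, could then be a candidate for the non-shredded scalar-type example described above.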
