[
https://issues.apache.org/jira/browse/SPARK-10289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Cheng Lian updated SPARK-10289:
-------------------------------
Comment: was deleted
(was: User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/8460)
> A direct write API for testing Parquet compatibility
> ----------------------------------------------------
>
> Key: SPARK-10289
> URL: https://issues.apache.org/jira/browse/SPARK-10289
> Project: Spark
> Issue Type: Sub-task
> Components: SQL, Tests
> Affects Versions: 1.5.0
> Reporter: Cheng Lian
> Assignee: Cheng Lian
>
> Due to a set of unfortunate historical issues, it's relatively hard to
> achieve full interoperability among the various Parquet data models. Spark 1.5
> implemented all the backwards-compatibility rules defined in the parquet-format
> spec on the read path (SPARK-6774) to improve this. However, testing all those
> corner cases can be really challenging. Currently, we test Parquet
> compatibility/interoperability in two ways:
> # Generate Parquet files with other systems, bundle them into the Spark source
> tree as testing resources, and write test cases against them to ensure that we
> can interpret them correctly. Currently, parquet-thrift and parquet-protobuf
> compatibility is tested in this way.
> #- Pros: Easy to write test cases, and easy to test against multiple versions
> of a given external system/library (by generating Parquet files with those
> versions)
> #- Cons: Hard to track how the testing Parquet files were generated
> # Add external libraries as testing dependencies, and call their APIs directly
> to write Parquet files and verify them. Currently, parquet-avro compatibility
> is tested using this approach.
> #- Pros: Easy to track how the testing Parquet files are generated
> #- Cons:
> ##- Often requires code generation (Avro/Thrift/ProtoBuf/...), which either
> complicates the build system with build-time code generation or bloats the
> code base by checking in generated Java files. The former is especially
> annoying because Spark has two build systems and thus requires two sets of
> code generation plugins (e.g., for Avro, we need both sbt-avro and
> avro-maven-plugin).
> ##- Can only test a single version of a given target library
> Inspired by the
> [{{writeDirect}}|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-avro/src/test/java/org/apache/parquet/avro/TestArrayCompatibility.java#L945-L972]
> method in the parquet-avro test code, a direct write API could be a good
> complement for testing Parquet compatibility. Ideally, this API should
> # make it easy to construct arbitrarily complex Parquet records
> # provide a DSL that reflects the nested nature of Parquet records
> This way, it would be both easy to track how Parquet files are generated and
> easy to cover various versions of external libraries. However, test case
> authors must be careful when constructing test cases to ensure that the
> constructed Parquet structures are identical to those generated by the target
> systems/libraries. We're probably not going to replace the two approaches
> above with this API, but rather add it as a complement.
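> As a rough illustration only (not an existing Spark or parquet-mr API), such a
> helper could be built on parquet-mr's low-level {{RecordConsumer}}, following
> the pattern of parquet-avro's {{writeDirect}} test helper. The names
> {{DirectWriteSupport}}, {{DirectParquetWriter}}, and {{writeDirect}} below are
> hypothetical, and the sketch assumes parquet-mr 1.8.x with its (deprecated)
> {{ParquetWriter(Path, WriteSupport)}} constructor:
> {code:scala}
> import java.util.Collections
>
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.Path
> import org.apache.parquet.hadoop.ParquetWriter
> import org.apache.parquet.hadoop.api.WriteSupport
> import org.apache.parquet.hadoop.api.WriteSupport.WriteContext
> import org.apache.parquet.io.api.RecordConsumer
> import org.apache.parquet.schema.{MessageType, MessageTypeParser}
>
> // WriteSupport that hands the low-level RecordConsumer to a callback, so test
> // code can emit arbitrary Parquet structures field by field.
> class DirectWriteSupport(schema: MessageType, writeRecord: RecordConsumer => Unit)
>   extends WriteSupport[Void] {
>
>   private var recordConsumer: RecordConsumer = _
>
>   override def init(configuration: Configuration): WriteContext =
>     new WriteContext(schema, Collections.emptyMap[String, String]())
>
>   override def prepareForWrite(consumer: RecordConsumer): Unit = {
>     recordConsumer = consumer
>   }
>
>   override def write(record: Void): Unit = {
>     writeRecord(recordConsumer)
>   }
> }
>
> object DirectParquetWriter {
>   // Parses a Parquet schema string and invokes `body` once with the
>   // RecordConsumer, so the caller constructs the record directly.
>   def writeDirect(path: String, schema: String)(body: RecordConsumer => Unit): Unit = {
>     val messageType = MessageTypeParser.parseMessageType(schema)
>     val writeSupport = new DirectWriteSupport(messageType, body)
>     val writer = new ParquetWriter[Void](new Path(path), writeSupport)
>     try writer.write(null) finally writer.close()
>   }
> }
> {code}
> For example, writing a file that contains a single record with one required
> int32 field might look like this:
> {code:scala}
> DirectParquetWriter.writeDirect("/tmp/direct-test.parquet",
>   "message root { required int32 id; }") { rc =>
>   rc.startMessage()
>   rc.startField("id", 0)
>   rc.addInteger(42)
>   rc.endField("id", 0)
>   rc.endMessage()
> }
> {code}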
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)