[ https://issues.apache.org/jira/browse/SPARK-10289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-10289:
-------------------------------
    Fix Version/s: 1.6.0

> A direct write API for testing Parquet compatibility
> ----------------------------------------------------
>
>                 Key: SPARK-10289
>                 URL: https://issues.apache.org/jira/browse/SPARK-10289
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL, Tests
>    Affects Versions: 1.5.0
>            Reporter: Cheng Lian
>            Assignee: Cheng Lian
>             Fix For: 1.6.0
>
>
> Due to a set of unfortunate historical issues, it's relatively hard to 
> achieve full interoperability among the various Parquet data models. Spark 1.5 
> implemented all backwards-compatibility rules defined in the parquet-format 
> spec on the read path (SPARK-6774) to improve this.  However, testing all of 
> those corner cases can be really challenging.  Currently, we test Parquet 
> compatibility/interoperability in two ways:
> # Generate Parquet files with other systems, bundle them into the Spark source 
> tree as test resources, and write test cases against them to ensure that we 
> can interpret them correctly. Currently, parquet-thrift and parquet-protobuf 
> compatibility is tested this way.
> #- Pros: Easy to write test cases, and easy to test against multiple versions 
> of a given external system/library (by generating Parquet files with each of 
> those versions)
> #- Cons: Hard to track how the test Parquet files were generated
> # Add external libraries as test dependencies, and call their APIs directly 
> to write Parquet files and verify them. Currently, parquet-avro compatibility 
> is tested using this approach.
> #- Pros: Easy to track how the test Parquet files are generated
> #- Cons:
> ##- Often requires code generation (Avro/Thrift/ProtoBuf/...), which either 
> complicates the build system with build-time code generation or bloats the 
> code base by checking in generated Java files.  The former is especially 
> annoying because Spark has two build systems and therefore requires two sets 
> of plugins for code generation (e.g., for Avro, we need both sbt-avro and 
> avro-maven-plugin).
> ##- Can only test a single version of a given target library
> Inspired by the 
> [{{writeDirect}}|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-avro/src/test/java/org/apache/parquet/avro/TestArrayCompatibility.java#L945-L972]
>  method in the parquet-avro test code, a direct write API would be a good 
> complement for testing Parquet compatibility.  Ideally, this API should
> # make it easy to construct arbitrarily complex Parquet records, and
> # provide a DSL that reflects the nested nature of Parquet records.
> This way, it would be both easy to track how the Parquet files are generated 
> and easy to cover various versions of external libraries.  However, test case 
> authors must be really careful when constructing test cases, and must ensure 
> that the constructed Parquet structures are identical to those generated by 
> the target systems/libraries.  We're probably not going to replace the two 
> approaches above with this API, but rather add it as a complement.
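> A minimal sketch of what such a direct write DSL might look like, assuming 
> parquet-mr's low-level {{RecordConsumer}} API as the underlying writer; the 
> helper names ({{message}}, {{field}}, {{group}}, {{writeRecord}}) are 
> illustrative only, not a final API:
> {code:scala}
> import org.apache.parquet.io.api.RecordConsumer
> 
> // Hypothetical helpers: each one brackets the corresponding start/end calls
> // on a RecordConsumer, so nesting in the test code mirrors nesting in the
> // Parquet schema.
> def message(consumer: RecordConsumer)(body: => Unit): Unit = {
>   consumer.startMessage()
>   body
>   consumer.endMessage()
> }
> 
> def field(consumer: RecordConsumer, name: String, index: Int)(body: => Unit): Unit = {
>   consumer.startField(name, index)
>   body
>   consumer.endField(name, index)
> }
> 
> def group(consumer: RecordConsumer)(body: => Unit): Unit = {
>   consumer.startGroup()
>   body
>   consumer.endGroup()
> }
> 
> // Example: write one record containing a repeated int field nested inside a
> // group, mimicking the 2-level list layout that parquet-avro produces for an
> // Avro array of ints.
> def writeRecord(consumer: RecordConsumer): Unit = {
>   message(consumer) {
>     field(consumer, "f", 0) {
>       group(consumer) {
>         field(consumer, "array", 0) {
>           consumer.addInteger(1)
>           consumer.addInteger(2)
>         }
>       }
>     }
>   }
> }
> {code}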


