[ 
https://issues.apache.org/jira/browse/PARQUET-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286352#comment-17286352
 ] 

Gabor Szadovszky commented on PARQUET-1985:
-------------------------------------------

[~emkornfield], I agree CSV is not the best approach. I did not think about 
nested types. I think JSON is more widespread than Protobuf or Avro, so it has 
a higher chance of having an easy-to-use library for any given language. In 
addition, JSON is human-readable, which makes debugging easier (for non-binary 
types). On the other hand, JSON files would be much larger than Protobuf/Avro 
files. 
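
As a rough illustration of the nested-type point, a gold-data record could be 
written as one JSON object per line. The field names and the tagged base64 
convention for binary values ("$binary") are assumptions made up for this 
sketch, not something agreed on in this issue.

```python
import base64
import json


def encode_value(value):
    """Make a value JSON-serializable; bytes become a tagged base64 string.

    The {"$binary": ...} wrapper is a hypothetical convention for this sketch.
    """
    if isinstance(value, bytes):
        return {"$binary": base64.b64encode(value).decode("ascii")}
    if isinstance(value, dict):
        return {key: encode_value(item) for key, item in value.items()}
    if isinstance(value, list):
        return [encode_value(item) for item in value]
    return value


def decode_value(value):
    """Inverse of encode_value: turn tagged base64 strings back into bytes."""
    if isinstance(value, dict):
        if set(value) == {"$binary"}:
            return base64.b64decode(value["$binary"])
        return {key: decode_value(item) for key, item in value.items()}
    if isinstance(value, list):
        return [decode_value(item) for item in value]
    return value


# Hypothetical record with a nested struct, a list and a binary column.
record = {
    "id": 1,
    "name": {"first": "Ada", "last": None},
    "tags": ["a", "b"],
    "payload": b"\x00\xff",
}

# One record per line; sort_keys keeps the output stable across runs.
line = json.dumps(encode_value(record), sort_keys=True)
restored = decode_value(json.loads(line))
```

Nested structs and lists survive the round trip without extra rules, which is 
what CSV cannot offer; binary values still need a convention like the one 
assumed here.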

Whatever format we use to store the "gold data", we still need to properly 
specify how it is stored. Do we want to test logical types as well? From e.g. 
the Arrow/Impala point of view it makes sense, as they have the related types 
(e.g. timestamp, decimal). To validate these types we need to have the data in 
their rich form (e.g. as a timestamp/decimal and not binary). Meanwhile, 
parquet-mr does not have support for these types, so when we convert the 
binary values to these types we are not testing parquet-mr but the test 
itself. But maybe that is a parquet-mr related issue and we should provide the 
widest set of data available.
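
To make the "rich form" point concrete, here is a minimal sketch of what a 
test harness would have to do on its own for a DECIMAL column whose physical 
type is binary. The Parquet format stores the unscaled value as a big-endian 
two's-complement integer and takes the scale from the schema. The helper name 
is hypothetical; the point is that this conversion lives in the test code, not 
in the implementation under test.

```python
from decimal import Decimal


def decimal_from_parquet_bytes(raw: bytes, scale: int) -> Decimal:
    """Convert the binary form of a Parquet DECIMAL into a Decimal.

    raw   -- big-endian two's-complement unscaled value
    scale -- number of digits after the decimal point (from the schema)
    """
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(digit) for digit in str(abs(unscaled)))
    # Building from a (sign, digits, exponent) tuple is exact and does not
    # depend on the precision of the current decimal context.
    return Decimal((sign, digits, -scale))
```

With a scale of 2, the two bytes 0x30 0x39 (unscaled 12345) would be compared 
against the gold value 123.45.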

> Improve integration tests between implementations
> -------------------------------------------------
>
>                 Key: PARQUET-1985
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1985
>             Project: Parquet
>          Issue Type: Test
>          Components: parquet-testing
>            Reporter: Gabor Szadovszky
>            Priority: Major
>
> We lack proper integration tests between components. Fortunately, 
> we already have a git repository to upload test data: 
> https://github.com/apache/parquet-testing.
> The idea is the following.
> Create a directory structure for the different versions of the 
> implementations containing parquet files with defined data. The structure 
> definition shall be self-descriptive so we can write integration tests that 
> read the whole structure automatically and also work with files to be added 
> later.
> The following directory structure is an example that meets these requirements:
> {noformat}
> test-data/
> ├── impala
> │   ├── 3.2.0
> │   │   └── basic-data.parquet
> │   ├── 3.3.0
> │   │   └── basic-data.parquet
> │   └── 3.4.0
> │       ├── basic-data.lz4.parquet
> │       ├── basic-data.snappy.parquet
> │       ├── some-specific-issue-2.parquet
> │       ├── some-specific-issue-3.csv
> │       ├── some-specific-issue-3_mode1.parquet
> │       ├── some-specific-issue-3_mode2.parquet
> │       └── some-specific-issue-3.schema
> ├── parquet-cpp
> │   ├── 1.5.0
> │   │   ├── basic-data.lz4.parquet
> │   │   └── basic-data.parquet
> │   └── 1.6.0
> │       ├── basic-data.lz4.parquet
> │       └── some-specific-issue-2.parquet
> ├── parquet-mr
> │   ├── 1.10.2
> │   │   └── basic-data.parquet
> │   ├── 1.11.1
> │   │   ├── basic-data.parquet
> │   │   └── some-specific-issue-1.parquet
> │   ├── 1.12.0
> │   │   ├── basic-data.br.parquet
> │   │   ├── basic-data.lz4.parquet
> │   │   ├── basic-data.snappy.parquet
> │   │   ├── basic-data.zstd.parquet
> │   │   ├── some-specific-issue-1.parquet
> │   │   └── some-specific-issue-2.parquet
> │   ├── some-specific-issue-1.csv
> │   └── some-specific-issue-1.schema
> ├── basic-data.csv
> ├── basic-data.schema
> ├── some-specific-issue-2.csv
> └── some-specific-issue-2.schema
> {noformat}
> Parquet files are created at leaf level. The expected data is saved in CSV 
> format (to be specified: separators, how to save binary etc.); the expected 
> schema (to specify the data types independently from the parquet files) is 
> saved in .schema files. The csv and schema files can be saved at the same 
> level as the parquet files, or at upper levels if they are common to several 
> parquet files.
> Any comments about the idea are welcome. 
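
The lookup rule described above (expected data and schema at the same level as 
the parquet file or at any upper level) could be sketched roughly as follows. 
The naming convention is only inferred from the example tree, so it is an 
assumption: everything before the first "." in the file name, minus an 
optional "_variant" suffix, names the matching .csv and .schema files. The 
function names are hypothetical as well.

```python
from pathlib import Path
from typing import Iterator, Optional, Tuple


def base_name(parquet_file: Path) -> str:
    """Derive the shared name of the expected-data files (assumed convention).

    'basic-data.lz4.parquet'              -> 'basic-data'
    'some-specific-issue-3_mode1.parquet' -> 'some-specific-issue-3'
    """
    stem = parquet_file.name.split(".", 1)[0]
    return stem.split("_", 1)[0]


def find_upwards(start_dir: Path, root: Path, file_name: str) -> Optional[Path]:
    """Look for file_name in start_dir, then in each parent up to root."""
    current = start_dir.resolve()
    root = root.resolve()
    while True:
        candidate = current / file_name
        if candidate.is_file():
            return candidate
        # Stop at the test-data root (or at the filesystem root as a guard).
        if current == root or current.parent == current:
            return None
        current = current.parent


def discover(root: Path) -> Iterator[Tuple[Path, Optional[Path], Optional[Path]]]:
    """Yield (parquet, csv, schema) for every parquet file under root.

    csv or schema is None when no matching file exists on the way up.
    """
    for parquet_file in sorted(root.rglob("*.parquet")):
        name = base_name(parquet_file)
        yield (
            parquet_file,
            find_upwards(parquet_file.parent, root, name + ".csv"),
            find_upwards(parquet_file.parent, root, name + ".schema"),
        )
```

An integration test could iterate over discover(Path("test-data")) and fail 
early for any parquet file whose expected data or schema cannot be found, so 
files added later are picked up without changing the test.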



--
This message was sent by Atlassian Jira
(v8.3.4#803005)