[ https://issues.apache.org/jira/browse/PARQUET-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286274#comment-17286274 ]

Micah Kornfield commented on PARQUET-1985:
------------------------------------------

I think trying to shoehorn structured data into CSV might not be worthwhile.  
Instead I would propose either JSON, Protobuf (text representation), or Avro 
(probably the JSON representation).
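
As a rough illustration of the JSON option (the file name, JSON layout, and 
base64 convention below are assumptions, not a settled format), a verification 
step could decode a JSON "gold file" and compare it against what an 
implementation reads back:

{noformat}
import base64
import json

import pyarrow.parquet as pq

# Hypothetical "gold file": a JSON list of rows, with binary values
# base64-encoded so the file stays valid UTF-8.
with open("basic-data.expected.json") as f:
    expected_rows = json.load(f)

actual_rows = pq.read_table("basic-data.parquet").to_pylist()
assert len(actual_rows) == len(expected_rows)

for expected, actual in zip(expected_rows, actual_rows):
    for column, value in expected.items():
        if isinstance(actual[column], bytes):
            # Binary columns are compared after decoding the base64 text.
            value = base64.b64decode(value)
        assert actual[column] == value, f"mismatch in column {column}"
{noformat}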

 

In Arrow, at least, we've used both "gold files" like this suggests and a test 
harness that runs commands from different language bindings with temporary 
data. Both have been useful.
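
A minimal sketch of that harness approach, assuming per-binding writer/reader 
command-line tools (the command names below are placeholders, not real 
binaries):

{noformat}
import subprocess
import tempfile
from pathlib import Path

# Placeholder commands standing in for each binding's writer/reader tools.
WRITERS = {"cpp": ["parquet-cpp-writer"], "java": ["parquet-mr-writer"]}
READERS = {"cpp": ["parquet-cpp-reader"], "java": ["parquet-mr-reader"]}

def cross_check(data_json: str) -> None:
    with tempfile.TemporaryDirectory() as tmp:
        for writer_name, writer in WRITERS.items():
            out = Path(tmp) / f"{writer_name}.parquet"
            # Each writer serializes the same payload to a temporary file.
            subprocess.run([*writer, data_json, str(out)], check=True)
            for reader_name, reader in READERS.items():
                # Every reader must round-trip every writer's output.
                result = subprocess.run([*reader, str(out)],
                                        capture_output=True, check=True)
                assert result.stdout.decode() == data_json, \
                    (writer_name, reader_name)
{noformat}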

> Improve integration tests between implementations
> -------------------------------------------------
>
>                 Key: PARQUET-1985
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1985
>             Project: Parquet
>          Issue Type: Test
>          Components: parquet-testing
>            Reporter: Gabor Szadovszky
>            Priority: Major
>
> We lack proper integration tests between implementations. Fortunately, 
> we already have a git repository for uploading test data: 
> https://github.com/apache/parquet-testing.
> The idea is the following.
> Create a directory structure for the different versions of the 
> implementations containing parquet files with defined data. The structure 
> definition shall be self-descriptive so we can write integration tests that 
> read the whole structure automatically and also work with files added 
> later.
> The following directory structure is an example that satisfies the above requirements:
> {noformat}
> test-data/
> ├── impala
> │   ├── 3.2.0
> │   │   └── basic-data.parquet
> │   ├── 3.3.0
> │   │   └── basic-data.parquet
> │   └── 3.4.0
> │       ├── basic-data.lz4.parquet
> │       ├── basic-data.snappy.parquet
> │       ├── some-specific-issue-2.parquet
> │       ├── some-specific-issue-3.csv
> │       ├── some-specific-issue-3_mode1.parquet
> │       ├── some-specific-issue-3_mode2.parquet
> │       └── some-specific-issue-3.schema
> ├── parquet-cpp
> │   ├── 1.5.0
> │   │   ├── basic-data.lz4.parquet
> │   │   └── basic-data.parquet
> │   └── 1.6.0
> │       ├── basic-data.lz4.parquet
> │       └── some-specific-issue-2.parquet
> ├── parquet-mr
> │   ├── 1.10.2
> │   │   └── basic-data.parquet
> │   ├── 1.11.1
> │   │   ├── basic-data.parquet
> │   │   └── some-specific-issue-1.parquet
> │   ├── 1.12.0
> │   │   ├── basic-data.br.parquet
> │   │   ├── basic-data.lz4.parquet
> │   │   ├── basic-data.snappy.parquet
> │   │   ├── basic-data.zstd.parquet
> │   │   ├── some-specific-issue-1.parquet
> │   │   └── some-specific-issue-2.parquet
> │   ├── some-specific-issue-1.csv
> │   └── some-specific-issue-1.schema
> ├── basic-data.csv
> ├── basic-data.schema
> ├── some-specific-issue-2.csv
> └── some-specific-issue-2.schema
> {noformat}
> Parquet files are created at the leaf level. The expected data is saved in 
> CSV format (to be specified: separators, how to encode binary values, etc.), 
> and the expected schema (which specifies the data types independently of the 
> parquet files) is saved in .schema files. The csv and schema files can be 
> saved at the same level as the parquet files, or at upper levels if they are 
> common to several parquet files.
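> A sketch of how a test could resolve these companion files by searching 
> upward from each parquet file (the lookup rules here are an assumption based 
> on the example tree, and the actual comparison is left abstract):
> {noformat}
> from pathlib import Path
> 
> def find_expected(parquet_file: Path, root: Path, suffix: str) -> Path:
>     # Strip compression/mode suffixes: "basic-data.lz4.parquet" and
>     # "some-specific-issue-3_mode1.parquet" both map to their base name.
>     base = parquet_file.name.split(".")[0].split("_")[0]
>     # Look for the companion file at the same level, then in upper levels.
>     for directory in parquet_file.parents:
>         candidate = directory / (base + suffix)
>         if candidate.is_file():
>             return candidate
>         if directory == root:
>             break
>     raise FileNotFoundError(f"no {suffix} file for {parquet_file}")
> 
> root = Path("test-data")
> for parquet_file in root.rglob("*.parquet"):
>     csv_file = find_expected(parquet_file, root, ".csv")
>     schema_file = find_expected(parquet_file, root, ".schema")
>     # compare(parquet_file, csv_file, schema_file)  # harness-specific
> {noformat}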
> Any comments about the idea are welcome.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
