[
https://issues.apache.org/jira/browse/PARQUET-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286274#comment-17286274
]
Micah Kornfield commented on PARQUET-1985:
------------------------------------------
I think trying to shoehorn structured data into CSV might not be worthwhile.
Instead I would propose either JSON, Protobuf (text representation) or Avro
(probably the JSON representation).
In Arrow, at least, we've used both "gold files" like this suggests and a test
harness that runs commands from different language bindings against temporary
data. Both have been useful.
> Improve integration tests between implementations
> -------------------------------------------------
>
> Key: PARQUET-1985
> URL: https://issues.apache.org/jira/browse/PARQUET-1985
> Project: Parquet
> Issue Type: Test
> Components: parquet-testing
> Reporter: Gabor Szadovszky
> Priority: Major
>
> We currently lack proper integration tests between implementations. Fortunately,
> we already have a git repository for uploading test data:
> https://github.com/apache/parquet-testing.
> The idea is the following.
> Create a directory structure for the different versions of the
> implementations, containing parquet files with defined data. The structure
> definition shall be self-descriptive, so that we can write integration tests
> that read the whole structure automatically and also work with files added
> later.
> The following directory structure is an example for the previous requirements:
> {noformat}
> test-data/
> ├── impala
> │ ├── 3.2.0
> │ │ └── basic-data.parquet
> │ ├── 3.3.0
> │ │ └── basic-data.parquet
> │ └── 3.4.0
> │ ├── basic-data.lz4.parquet
> │ ├── basic-data.snappy.parquet
> │ ├── some-specific-issue-2.parquet
> │ ├── some-specific-issue-3.csv
> │ ├── some-specific-issue-3_mode1.parquet
> │ ├── some-specific-issue-3_mode2.parquet
> │ └── some-specific-issue-3.schema
> ├── parquet-cpp
> │ ├── 1.5.0
> │ │ ├── basic-data.lz4.parquet
> │ │ └── basic-data.parquet
> │ └── 1.6.0
> │ ├── basic-data.lz4.parquet
> │ └── some-specific-issue-2.parquet
> ├── parquet-mr
> │ ├── 1.10.2
> │ │ └── basic-data.parquet
> │ ├── 1.11.1
> │ │ ├── basic-data.parquet
> │ │ └── some-specific-issue-1.parquet
> │ ├── 1.12.0
> │ │ ├── basic-data.br.parquet
> │ │ ├── basic-data.lz4.parquet
> │ │ ├── basic-data.snappy.parquet
> │ │ ├── basic-data.zstd.parquet
> │ │ ├── some-specific-issue-1.parquet
> │ │ └── some-specific-issue-2.parquet
> │ ├── some-specific-issue-1.csv
> │ └── some-specific-issue-1.schema
> ├── basic-data.csv
> ├── basic-data.schema
> ├── some-specific-issue-2.csv
> └── some-specific-issue-2.schema
> {noformat}
> Parquet files are created at the leaf level. The expected data is saved in CSV
> format (to be specified: separators, how to encode binary values, etc.), and
> the expected schema (to specify the data types independently of the parquet
> files) is saved in .schema files. The csv and schema files can be saved at the
> same level as the parquet files, or at upper levels if they are common to
> several parquet files.
> Any comments about the idea are welcome.
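The resolution rule described above (each parquet file pairs with the nearest .csv and .schema files, searched upward from the leaf directory toward the root) could be sketched roughly as follows. This is a hypothetical illustration of the proposed layout, not an actual parquet-testing API; the function names and the `_modeN` suffix handling are assumptions based on the example tree in the issue.

```python
import os

def find_expected(parquet_path, ext, root):
    """Return the nearest expected-data file for a parquet file, searching
    from the parquet file's directory upward to `root` (inclusive).

    Assumption: basic-data.lz4.parquet shares expectations with
    basic-data.csv, and some-specific-issue-3_mode1.parquet with
    some-specific-issue-3.csv, so compression suffixes and any
    hypothetical _modeN suffix are stripped from the base name.
    """
    base = os.path.basename(parquet_path).split('.')[0].split('_mode')[0]
    d = os.path.dirname(parquet_path)
    while True:
        candidate = os.path.join(d, base + ext)
        if os.path.isfile(candidate):
            return candidate
        if os.path.abspath(d) == os.path.abspath(root):
            return None  # no expected file anywhere up to the root
        d = os.path.dirname(d)

def enumerate_cases(root):
    """Yield (parquet, expected_csv, expected_schema) triples for every
    parquet file found anywhere under the test-data root."""
    for dirpath, _, files in os.walk(root):
        for name in files:
            if name.endswith('.parquet'):
                p = os.path.join(dirpath, name)
                yield (p,
                       find_expected(p, '.csv', root),
                       find_expected(p, '.schema', root))
```

A harness built this way would pick up newly added files automatically, which is the self-descriptive property the issue asks for: adding a version directory or a new parquet file requires no test-code changes.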
--
This message was sent by Atlassian Jira
(v8.3.4#803005)