Thanks for your quick reply! Consider the following scenario: there are 200k rows of SQL data, where rows 0-100k contain mostly nulls while rows 100k-200k contain mostly non-null values. If we convert the two parts into Parquet files, we may get 0-100k.parquet (500 MB) and 100k-200k.parquet (1.3 GB). What is then the best practice for cutting these rows into Parquet files? Another question: should we keep the same RowGroup size within one Parquet file?
Thanks,
Lizhou

------------------ Original ------------------
From: "Uwe L. Korn" <uw...@xhochy.com>
Date: Tue, Apr 3, 2018 04:21 PM
To: "dev" <dev@parquet.apache.org>
Subject: Re: Solution to read/write multiple parquet files

Hello Lizhou,

on the Python side there is http://dask.pydata.org/en/latest/ , which can read large, distributed Parquet datasets. When used with `engine=pyarrow`, it also uses parquet-cpp under the hood. On the pure C++ side, I know that https://github.com/thrill/thrill has experimental Parquet support, but that is an experimental feature in an experimental framework, so be careful about relying on it.

In general, Parquet files should not exceed single-digit gigabyte sizes, and the RowGroups inside these files should be 128 MiB or less. You will be able to write tools that can deal with other sizes, but that will break the portability of Parquet files somewhat.

Uwe

On Tue, Apr 3, 2018, at 10:00 AM, 高立周 wrote:
> Hi experts,
> We have a storage engine that needs to manage a large set of data (PB
> level). Currently we store it as a single Parquet file. After some
> searching, it seems the data should be cut into multiple Parquet files
> for further reading/writing/managing. But I don't know whether there is
> already an open-source solution to read/write/manage multiple Parquet
> files. Our programming language is C++.
> Any comments/suggestions are welcome. Thanks!
>
>
> Regards,
> Lizhou
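To make the sizing guidance above concrete, here is a rough, stdlib-only Python sketch of planning row-group boundaries from an estimated bytes-per-row figure, so that each group lands near the 128 MiB target. The function name `plan_row_groups` and the per-row estimates (500 MB / 100k rows ≈ 5000 bytes/row for the null-heavy half, 1.3 GB / 100k rows ≈ 13000 bytes/row for the dense half) are my own illustrative assumptions derived from the numbers quoted in this thread, not anything from parquet-cpp itself:

```python
# Hedged sketch: plan (start, end) row ranges so that each range's
# *estimated* size stays at or under a row-group byte budget.
# bytes_per_row is an estimate the caller supplies (e.g. from a sample).

def plan_row_groups(num_rows, bytes_per_row, row_group_bytes=128 * 2**20):
    """Return a list of (start, end) row ranges, each <= row_group_bytes."""
    rows_per_group = max(1, row_group_bytes // bytes_per_row)
    groups = []
    start = 0
    while start < num_rows:
        end = min(start + rows_per_group, num_rows)
        groups.append((start, end))
        start = end
    return groups

# Null-heavy half: ~5000 bytes/row -> fewer, larger row-count groups.
sparse = plan_row_groups(100_000, 5_000)
# Dense half: ~13000 bytes/row -> more groups, each with fewer rows.
dense = plan_row_groups(100_000, 13_000)
```

Under this reading, the two files would naturally use different rows-per-group even though both target the same ~128 MiB byte size, which also suggests that row counts per RowGroup need not be identical within or across files as long as the byte sizes stay in range. The actual writing would then be done with whatever writer API you use (e.g. parquet-cpp's row-group writer or pyarrow's `row_group_size` parameter), one planned range at a time.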