Thanks for your quick reply! Consider the following scenario: there are 200k rows of SQL data, where rows 0-100k contain mostly nulls while rows 100k-200k contain mostly non-null values. If we convert the two parts into Parquet files, we may get 0-100k.parquet (500 MB) and 100k-200k.parquet (1.3 GB). What is then the best practice for cutting these rows into Parquet files? Another question: should we keep the same RowGroup size within one Parquet file?
Thanks,
Lizhou

------------------ Original ------------------
From: "Uwe L. Korn" <uw...@xhochy.com>
Date: Tue, Apr 3, 2018 04:21 PM
To: "dev" <dev@parquet.apache.org>
Subject: Re: Solution to read/write multiple parquet files

Hello Lizhou,

on the Python side there is http://dask.pydata.org/en/latest/ , which can read large, distributed Parquet datasets. When used with `engine=pyarrow`, it also uses parquet-cpp under the hood. On the pure C++ side, I know that https://github.com/thrill/thrill has experimental Parquet support, but that is an experimental feature in an experimental framework, so be careful about relying on it.

In general, Parquet files should not exceed single-digit gigabyte sizes, and the RowGroups inside these files should be 128 MiB or less. You will be able to write tools that can deal with other sizes, but that will break the portability of Parquet files somewhat.

Uwe

On Tue, Apr 3, 2018, at 10:00 AM, 高立周 wrote:
> Hi experts,
> We have a storage engine that needs to manage a large set of data (PB
> level). Currently we store it as a single Parquet file. After some
> searching, it seems the data should be cut into multiple Parquet files
> for further reading/writing/managing. But I don't know whether there is
> already an open-source solution to read/write/manage multiple Parquet
> files. Our programming language is C++.
> Any comments/suggestions are welcome. Thanks!
>
>
> Regards,
> Lizhou
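To make the sizing guidance above concrete, here is a rough, stdlib-only Python sketch of planning row-group boundaries from an estimated bytes-per-row figure, so that each group lands near the 128 MiB target. The function name `plan_row_groups` and the per-row estimates (500 MB / 100k rows ≈ 5000 bytes/row for the null-heavy half, 1.3 GB / 100k rows ≈ 13000 bytes/row for the dense half) are my own illustrative assumptions derived from the numbers quoted in this thread, not anything from parquet-cpp itself:

```python
# Hedged sketch: plan (start, end) row ranges so that each range's
# *estimated* size stays at or under a row-group byte budget.
# bytes_per_row is an estimate the caller supplies (e.g. from a sample).

def plan_row_groups(num_rows, bytes_per_row, row_group_bytes=128 * 2**20):
    """Return a list of (start, end) row ranges, each <= row_group_bytes."""
    rows_per_group = max(1, row_group_bytes // bytes_per_row)
    groups = []
    start = 0
    while start < num_rows:
        end = min(start + rows_per_group, num_rows)
        groups.append((start, end))
        start = end
    return groups

# Null-heavy half: ~5000 bytes/row -> fewer, larger row-count groups.
sparse = plan_row_groups(100_000, 5_000)
# Dense half: ~13000 bytes/row -> more groups, each with fewer rows.
dense = plan_row_groups(100_000, 13_000)
```

Under this reading, the two files would naturally use different rows-per-group even though both target the same ~128 MiB byte size, which also suggests that row counts per RowGroup need not be identical within or across files as long as the byte sizes stay in range. The actual writing would then be done with whatever writer API you use (e.g. parquet-cpp's row-group writer or pyarrow's `row_group_size` parameter), one planned range at a time.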