The reason so many small files are generated is probably that you are inserting into a partitioned table with three partition columns.

If you want larger Parquet files, you may try either avoiding the partitioned table altogether, or using fewer partition columns (e.g., only year, without month and day).
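For example, a rough sketch of the fewer-partition-columns approach could look like the following (col_a and col_b are only placeholders for whatever your real select list contains):

-- Sketch only: with a single partition column, each task writes to far
-- fewer partition directories, so the resulting files are larger.
-- The dynamic partition column (year) goes last in the SELECT list.
INSERT OVERWRITE TABLE tbl PARTITION (year)
SELECT col_a, col_b, year FROM tbl_tsv;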

Cheng

So you want to dump all data into a single large Parquet file?

On 10/7/15 1:55 PM, Younes Naguib wrote:

The original TSV file is 600GB, and it generated 40k files of 15-25MB.

y

*From:* Cheng Lian [mailto:lian.cs....@gmail.com]
*Sent:* October-07-15 3:18 PM
*To:* Younes Naguib; 'user@spark.apache.org'
*Subject:* Re: Parquet file size

Why do you want larger files? Don't the resulting Parquet files contain all the data from the original TSV file?

Cheng

On 10/7/15 11:07 AM, Younes Naguib wrote:

    Hi,

    I’m reading a large TSV file and creating Parquet files using
    Spark SQL:

    INSERT OVERWRITE TABLE tbl PARTITION (year, month, day) ....

    SELECT .... FROM tbl_tsv;

    This works nicely, but it generates small Parquet files (~15MB).

    I wanted to generate larger files; any idea how to address this?

    *Thanks,*

    *Younes Naguib*

Triton Digital | 1440 Ste-Catherine W., Suite 1200 | Montreal, QC H3G 1R8

    Tel.: +1 514 448 4037 x2688 | Tel.: +1 866 448 4037 x2688 |
    younes.nag...@tritondigital.com
