*Subject:* Re: Parquet file size
Hi,
In our case, we're using org.apache.hadoop.mapreduce.lib.input.FileInputFormat.SPLIT_MINSIZE to increase the size of the RDD partitions when loading text files, so it generates larger parquet files. We just set it in the Hadoop configuration.
The original TSV files total 600GB and generated 40k files of 15-25MB.
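As a back-of-envelope check (using the sizes reported in this thread, which are approximate), the numbers are self-consistent: 600 GB of input cut into ~15 MB pieces yields roughly 40k files, while 1 GiB input splits would yield only ~600 partitions, and hence ~600 larger Parquet files:

```python
# Sanity check of the file counts in this thread (sizes are illustrative).
total_bytes = 600 * 1024**3       # 600 GB of TSV input

small_file = 15 * 1024**2         # ~15 MB Parquet files actually produced
print(total_bytes // small_file)  # about 40k files

big_split = 1024**3               # 1 GiB minimum input split size
print(total_bytes // big_split)   # 600 partitions, so ~600 larger files
```

For reference, FileInputFormat.SPLIT_MINSIZE resolves to the Hadoop configuration key mapreduce.input.fileinputformat.split.minsize, which can be set on the SparkContext's Hadoop configuration (or passed through with the spark.hadoop.* property prefix).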
From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: October-07-15 3:18 PM
To: Younes Naguib; 'user@spark.apache.org'
Subject: Re: Parquet file size
Why do you want larger files? Doesn't the result Parquet file contain
all the data in the original TSV file?
Cheng
On 10/7/15 11:07 AM, Younes Naguib wrote:
Hi,
I’m reading a large tsv file, and creating parquet files using sparksql.
Tel.: +1 866 448 4037 x2688 | younes.naguib@tritondigital.com <younes.nag...@streamtheworld.com>
> --
> *From:* Cheng Lian [lian.cs@gmail.com]
> *Sent:* Wednesday, October 07, 2015 7:01 PM
>
> *To:* Younes Naguib; 'user@spark.apache.org'
> *Subject:* Re: Parquet file size
The reason why so many small files are generated is probably that you are inserting into a partitioned table with three partition columns. If you want larger Parquet files, you may try to either avoid using a partitioned table, or use fewer partition columns.
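To see why a dynamic-partition insert multiplies file counts: each write task can open one output file per partition value it encounters, so the upper bound on files is roughly tasks × distinct partition values. A rough sketch with assumed numbers (the task count and date range below are hypothetical, not from this thread):

```python
# Rough upper bound on output files for a dynamic-partition insert.
# Each write task may emit one file per (year, month, ...) value it sees.
partition_values = 2 * 12 * 30   # e.g. 2 years of daily partitions (assumed)
write_tasks = 200                # tasks performing the insert (assumed)
max_files = partition_values * write_tasks
print(max_files)                 # 144000 files in this hypothetical worst case
```

Dropping a partition column (or coalescing the data to fewer tasks before writing) shrinks one of the two factors, which is why either change produces fewer, larger files.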
On 10/7/15 11:07 AM, Younes Naguib wrote:
Hi,
I’m reading a large tsv file, and creating parquet files using sparksql:
insert overwrite table tbl partition(year, month,