Hi,

What is your motivation for merging the files? My guess is that you want
to achieve good performance, which historically has been associated with
large Parquet files. However, merging Parquet files by placing the
existing row groups one after the other won't really improve the
performance of parquet-mr readers such as Hive, and it will make
Impala's performance much worse if you happen to use it. The reason is
that good performance comes not from large files but from large row
groups (up to the HDFS block size). Please see PARQUET-1115
<https://issues.apache.org/jira/browse/PARQUET-1115>, which describes the
same problem with the parquet-tools merge command.
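
If you control how the files are written in the first place, a better
option is usually to ask for large row groups up front. A rough sketch
only, assuming the parquet-avro writer API and a 128 MB HDFS block size
(both are assumptions; adjust to your setup):

    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class LargeRowGroupWriter {
      // Ask the writer for row groups close to the HDFS block size instead
      // of trying to merge small row groups after the fact. 128 MB is only
      // an assumed block size here.
      static ParquetWriter<GenericRecord> openWriter(Configuration conf,
                                                     Schema avroSchema,
                                                     Path out) throws IOException {
        return AvroParquetWriter
            .<GenericRecord>builder(out)
            .withSchema(avroSchema)
            .withConf(conf)
            .withRowGroupSize(128 * 1024 * 1024)
            .build();
      }
    }

As far as I know, jobs that write through ParquetOutputFormat (Spark
included) honor the same setting via the "parquet.block.size" key on the
Hadoop configuration.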

Br,

Zoltan

On Fri, Oct 13, 2017 at 12:42 PM Siva Gudavalli <[email protected]> wrote:

> I have a Spark streaming app that is generating a lot of small Parquet
> files, and I need a way to produce big Parquet files instead.
>
> I tried to put something together that does the job for me (similar to
> merge from parquet-tools):
>
>     FileMetaData mergedMeta = mergedMetadata(inputFiles);
>
>     ParquetFileWriter writer = new ParquetFileWriter(conf,
>         mergedMeta.getSchema(), outputFile, ParquetFileWriter.Mode.CREATE);
>
>     writer.start();
>
>     for (Path input : inputFiles) {
>         writer.appendFile(conf, input);
>         hdfs.delete(input, false);
>     }
>
>     writer.end(mergedMeta.getKeyValueMetaData());
>
> I need to make sure the stats are up to date on these newly generated
> files. How can I do so? I am using parquet-tools 1.9.0 and I don't see
> stats here. Any ideas?
>
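
For reference, here is the quoted snippet filled out into a compilable
form. This is only a sketch: the mergedMetadata() helper from the mail
above is not shown, and it is assumed here to merge the footers of the
input files the way ParquetFileWriter.mergeMetadataFiles() does.

    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileWriter;
    import org.apache.parquet.hadoop.metadata.FileMetaData;

    public class MergeSmallParquetFiles {

      public static void merge(Configuration conf, List<Path> inputFiles,
                               Path outputFile) throws IOException {
        // Assumption: merging the input footers stands in for the poster's
        // own mergedMetadata() helper.
        FileMetaData mergedMeta =
            ParquetFileWriter.mergeMetadataFiles(inputFiles, conf).getFileMetaData();

        ParquetFileWriter writer = new ParquetFileWriter(
            conf, mergedMeta.getSchema(), outputFile, ParquetFileWriter.Mode.CREATE);

        writer.start();
        FileSystem hdfs = outputFile.getFileSystem(conf);
        for (Path input : inputFiles) {
          // appendFile() copies the input's row groups into the output as-is,
          // so small row groups stay small -- the limitation described in the
          // reply above and in PARQUET-1115.
          writer.appendFile(conf, input);
          hdfs.delete(input, false);
        }
        writer.end(mergedMeta.getKeyValueMetaData());
      }
    }

Note that, as in the original snippet, each input is deleted right after
it is appended, so a failure partway through leaves the data only
partially merged.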
