Zortan,

I agree; I am already setting my row group size to match my HDFS block size.
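For reference, a minimal sketch of how that pairing is typically configured at job-submission time. The 128 MB value and the jar name are just examples, not recommendations; `spark.hadoop.*` is Spark's standard passthrough for Hadoop configuration keys:

```shell
# Make the Parquet row group size ("parquet.block.size") equal to the
# HDFS block size so one row group fits in one HDFS block.
# 134217728 bytes = 128 MB (example value; match your cluster's setting).
spark-submit \
  --conf spark.hadoop.parquet.block.size=134217728 \
  --conf spark.hadoop.dfs.blocksize=134217728 \
  my-streaming-app.jar   # placeholder application jar
```

This only affects files written after the change; existing small files keep their small row groups.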

The issue I have is a streaming job that runs every minute and generates
tons of small files. I need to prepare this data in a way that is
optimized for reads.

How can I make sure the statistics in my Parquet files are not corrupted? Are
there any other tips I should keep in mind when preparing my data?
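For what it's worth, one way to eyeball the statistics is the `meta` command of parquet-tools, which prints the footer metadata including per-column-chunk min/max/null-count statistics when they are present (the path and jar name below are placeholders):

```shell
# Print row group and column chunk metadata for a Parquet file; statistics
# show up per column chunk when the writer recorded them.
# (requires a Hadoop environment; jar and path are placeholders)
hadoop jar parquet-tools-1.9.0.jar meta hdfs:///path/to/file.parquet
```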

Regards
Shiv



On Fri, Oct 13, 2017 at 9:21 AM, Zoltan Ivanfi <[email protected]> wrote:

> Hi,
>
> What is your motivation for merging the files? My guess is that you want to
> achieve good performance and historically that has been associated with
> large Parquet files. However, merging Parquet files by placing the same row
> groups one after the other won't really improve the performance of
> parquet-mr readers like Hive for example. It will make the performance of
> Impala much worse though, if you happen to use it. The reason for that is
> that good performance is not a result of large files but of large row
> groups (up to the HDFS block size). Please see PARQUET-1115
> <https://issues.apache.org/jira/browse/PARQUET-1115> that describes this
> same problem when using the parquet-tools merge command.
>
> Br,
>
> Zoltan
>
> On Fri, Oct 13, 2017 at 12:42 PM Siva Gudavalli <[email protected]>
> wrote:
>
> > I have a spark streaming app that is generating a lot of small parquet
> > files.
> > I need a way to prepare big parquet files.
> >
> > I tried to put something together that does the job for me (similar to
> > the merge command from parquet-tools):
> >
> >     FileMetaData mergedMeta = mergedMetadata(inputFiles);
> >
> >     ParquetFileWriter writer = new ParquetFileWriter(conf,
> >         mergedMeta.getSchema(), outputFile, ParquetFileWriter.Mode.CREATE);
> >
> >     writer.start();
> >     for (Path input : inputFiles) {
> >         // appendFile copies the row groups of the input file as-is
> >         writer.appendFile(conf, input);
> >     }
> >     writer.end(mergedMeta.getKeyValueMetaData());
> >
> >     // delete the inputs only after the merged file is fully written,
> >     // so a failure mid-merge does not lose data
> >     for (Path input : inputFiles) {
> >         hdfs.delete(input, false);
> >     }
> >
> >
> > I need to make sure the statistics are up to date on these newly
> > generated files. How can I do so? I am using parquet-tools 1.9.0 and I
> > don't see any statistics here. Any ideas?
> >
>
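Since the quoted discussion notes that byte-level concatenation of row groups does not help readers, the alternative is to rewrite the data so rows are re-grouped into large row groups. A sketch using the Spark shell, assuming Spark 2.x with a `spark` session and placeholder paths; `coalesce(1)` is only illustrative and should be tuned to the data volume:

```shell
# Rewriting (not concatenating) the small files lets the writer build
# large row groups. Paths and the row group size are placeholders.
spark-shell --conf spark.hadoop.parquet.block.size=134217728 <<'EOF'
spark.read.parquet("hdfs:///data/small-files")
     .coalesce(1)
     .write.parquet("hdfs:///data/compacted")
EOF
```

Unlike `appendFile`, this decodes and re-encodes the rows, so statistics are recomputed by the writer as a side effect.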
