Shiv,

Zoltan is right. Just because you've configured Parquet to have large row
groups doesn't mean it is able to create them. Sounds like your files are
one small row group per small file, so merging them without rewriting to
produce larger row groups wouldn't help. I recommend adding a merge job
that compacts small files by rewriting them once an hour or so. Make sure
the interval is long enough for a large amount of data to accumulate.
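
A rough sketch of what that hourly job could look like, using Spark to
rewrite the accumulated files (the paths, coalesce factor, and 128 MB row
group size below are made up; adjust them to your layout):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class HourlyCompaction {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("hourly-parquet-compaction")
            .getOrCreate();

        // parquet.block.size is the parquet-mr row group size; keep it at or
        // below the HDFS block size so a row group never spans two blocks.
        spark.sparkContext().hadoopConfiguration()
            .setInt("parquet.block.size", 128 * 1024 * 1024);

        // Read the small files written by the streaming job and rewrite them
        // as a few large files, each containing large row groups.
        Dataset<Row> pending = spark.read().parquet("hdfs:///data/events/pending/");
        pending.coalesce(4)
            .write()
            .mode(SaveMode.Append)
            .parquet("hdfs:///data/events/compacted/");

        spark.stop();
      }
    }

Because the data is actually rewritten rather than having footers appended,
the column statistics in the new footers are regenerated as part of the
write; you can spot-check them with parquet-tools meta on one of the
compacted files.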

rb

On Fri, Oct 13, 2017 at 6:38 AM, Siva Gudavalli <[email protected]> wrote:

> Zoltan,
>
> I agree; I am already setting my row group size the same as my block size.
>
> The issue I had is a streaming job running every minute that is generating
> tons of small files. I need to prepare this data in such a way that it is
> optimized for reads.
>
> How can I make sure my stats are not corrupted in the Parquet files? Are
> there any other tips I need to keep in mind when I am preparing my data?
>
> Regards
> Shiv
>
>
>
> On Fri, Oct 13, 2017 at 9:21 AM, Zoltan Ivanfi <[email protected]> wrote:
>
> > Hi,
> >
> > What is your motivation for merging the files? My guess is that you want
> > to achieve good performance, and historically that has been associated
> > with large Parquet files. However, merging Parquet files by placing the
> > same row groups one after the other won't really improve the performance
> > of parquet-mr readers such as Hive. It will make the performance of
> > Impala much worse, though, if you happen to use it. The reason is that
> > good performance comes not from large files but from large row groups
> > (up to the HDFS block size). Please see PARQUET-1115
> > <https://issues.apache.org/jira/browse/PARQUET-1115>, which describes
> > this same problem with the parquet-tools merge command.
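> >
> > As a rough illustration (the 128 MB value below is only an example and
> > your block size may differ), on the parquet-mr side the row group size
> > is set through the Hadoop configuration handed to the writer:
> >
> >     // keep parquet.block.size at or below the HDFS block size
> >     // (dfs.blocksize) so a row group never spans two blocks
> >     Configuration conf = new Configuration();
> >     conf.setInt("parquet.block.size", 128 * 1024 * 1024);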
> >
> > Br,
> >
> > Zoltan
> >
> > On Fri, Oct 13, 2017 at 12:42 PM Siva Gudavalli <[email protected]>
> > wrote:
> >
> > > I have a spark streaming app that is generating a lot of small parquet
> > > files.
> > > I need a way to prepare big parquet files.
> > >
> > > I tried to put something together that does the job for me (similar to
> > > merge from parquet-tools):
> > >
> > >     FileMetaData mergedMeta = mergedMetadata(inputFiles);
> > >
> > >     ParquetFileWriter writer = new ParquetFileWriter(conf,
> > >         mergedMeta.getSchema(), outputFile, ParquetFileWriter.Mode.CREATE);
> > >     writer.start();
> > >
> > >     for (Path input : inputFiles) {
> > >         writer.appendFile(conf, input);
> > >         hdfs.delete(input, false);
> > >     }
> > >
> > >     writer.end(mergedMeta.getKeyValueMetaData());
> > >
> > >
> > > I need to make sure stats are up to date on these newly generated
> > > files. How can I do so? I am using parquet-tools 1.9.0 and I don't see
> > > stats here. Any ideas?
> > >
> >
>



-- 
Ryan Blue
Software Engineer
Netflix
