Zoltan, I agree, I am already setting my row group size the same as my HDFS block size.
The issue I have is a streaming job running every minute that generates tons of small files. I need to prepare this data so that it is optimized for reads. How can I make sure the stats are not corrupted on the merged Parquet files? Are there any other tips I should keep in mind when preparing my data?

Regards
Shiv

On Fri, Oct 13, 2017 at 9:21 AM, Zoltan Ivanfi <[email protected]> wrote:

> Hi,
>
> What is your motivation for merging the files? My guess is that you want to
> achieve good performance, and historically that has been associated with
> large Parquet files. However, merging Parquet files by placing the same row
> groups one after the other won't really improve the performance of
> parquet-mr readers like Hive, for example. It will make the performance of
> Impala much worse, though, if you happen to use it. The reason is that good
> performance is not a result of large files but of large row groups (up to
> the HDFS block size). Please see PARQUET-1115
> <https://issues.apache.org/jira/browse/PARQUET-1115>, which describes this
> same problem when using the parquet-tools merge command.
>
> Br,
>
> Zoltan
>
> On Fri, Oct 13, 2017 at 12:42 PM Siva Gudavalli <[email protected]>
> wrote:
>
> > I have a Spark streaming app that is generating a lot of small Parquet
> > files. I need a way to prepare big Parquet files.
> >
> > I tried to put something together that does the job for me (similar to
> > merge from parquet-tools):
> >
> >     FileMetaData mergedMeta = mergedMetadata(inputFiles);
> >
> >     ParquetFileWriter writer = new ParquetFileWriter(conf,
> >         mergedMeta.getSchema(), outputFile, ParquetFileWriter.Mode.CREATE);
> >
> >     writer.start();
> >     for (Path input : inputFiles) {
> >         writer.appendFile(conf, input);
> >         hdfs.delete(input, false);
> >     }
> >     writer.end(mergedMeta.getKeyValueMetaData());
> >
> > I need to make sure stats are up to date on these newly generated files.
> > How can I do so? I am using parquet-tools 1.9.0 and I don't see stats
> > here. Any ideas?
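P.S. On the stats question: since appendFile copies row groups verbatim, the per-column-chunk min/max statistics travel with each row group, and a file-level view can be derived by combining them. A minimal sketch of that combining rule (plain Java, not the parquet-mr API; mergeMinMax is a hypothetical helper for illustration):

```java
// Illustration only: given per-file (min, max) pairs for one column,
// the merged file's stats are min-of-mins and max-of-maxes.
public class MergedStats {
    static long[] mergeMinMax(long[][] perFileMinMax) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long[] mm : perFileMinMax) {
            min = Math.min(min, mm[0]); // mm[0] = that file's column min
            max = Math.max(max, mm[1]); // mm[1] = that file's column max
        }
        return new long[] { min, max };
    }

    public static void main(String[] args) {
        // Three small files with (min, max) = (3,9), (1,4), (5,12)
        long[] merged = mergeMinMax(new long[][] { { 3, 9 }, { 1, 4 }, { 5, 12 } });
        System.out.println(merged[0] + "," + merged[1]); // prints "1,12"
    }
}
```

Note this only restates the combining rule; whether a given parquet-tools version displays those stats is a separate question.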
