Hi,

What is your motivation for merging the files? My guess is that you want to achieve good performance, and historically that has been associated with large Parquet files. However, merging Parquet files by placing the same row groups one after the other won't really improve the performance of parquet-mr readers like Hive. It will, however, make the performance of Impala much worse, if you happen to use it. The reason is that good performance comes not from large files but from large row groups (up to the HDFS block size). Please see PARQUET-1115 <https://issues.apache.org/jira/browse/PARQUET-1115>, which describes the same problem with the parquet-tools merge command.
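If what you really need is large row groups (and up-to-date statistics), one option is to re-encode the rows instead of appending existing row groups verbatim. Below is a minimal, untested sketch using the parquet-mr example object model; it assumes inputFiles is a List<Path>, takes the schema from the first input file's footer, and uses a 128 MB row group size, all of which you would adapt to your job:

    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.ParquetFileWriter;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.example.ExampleParquetWriter;
    import org.apache.parquet.hadoop.example.GroupReadSupport;
    import org.apache.parquet.schema.MessageType;

    public class CompactParquet {
      // Re-encode the rows of many small files into one file with large row
      // groups. Because the data is rewritten, column statistics are computed
      // again for the new file.
      public static void compact(Configuration conf, List<Path> inputFiles, Path outputFile)
          throws IOException {
        // Take the schema from the first input file's footer (assumes all
        // inputs share the same schema).
        MessageType schema = ParquetFileReader.readFooter(conf, inputFiles.get(0))
            .getFileMetaData().getSchema();

        ParquetWriter<Group> writer = ExampleParquetWriter.builder(outputFile)
            .withConf(conf)
            .withType(schema)
            .withRowGroupSize(128 * 1024 * 1024)  // roughly one row group per HDFS block
            .withWriteMode(ParquetFileWriter.Mode.CREATE)
            .build();

        for (Path input : inputFiles) {
          ParquetReader<Group> reader =
              ParquetReader.builder(new GroupReadSupport(), input).withConf(conf).build();
          Group record;
          while ((record = reader.read()) != null) {
            writer.write(record);
          }
          reader.close();
        }
        writer.close();
      }
    }

This is slower than appendFile because every row is decoded and encoded again, but it is what actually produces large row groups rather than a large file made of many small ones.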
Br,

Zoltan

On Fri, Oct 13, 2017 at 12:42 PM Siva Gudavalli <[email protected]> wrote:

> I have a Spark streaming app that is generating a lot of small Parquet
> files, and I need a way to prepare big Parquet files.
>
> I tried to put something together that does the job for me (similar to
> merge from parquet-tools):
>
>     FileMetaData mergedMeta = mergedMetadata(inputFiles);
>
>     ParquetFileWriter writer = new ParquetFileWriter(conf,
>         mergedMeta.getSchema(), outputFile, ParquetFileWriter.Mode.CREATE);
>
>     writer.start();
>
>     for (Path input : inputFiles) {
>       writer.appendFile(conf, input);
>       hdfs.delete(input, false);
>     }
>
>     writer.end(mergedMeta.getKeyValueMetaData());
>
> I need to make sure stats are up to date on these newly generated files.
> How can I do so? I am using parquet-tools 1.9.0 and I don't see stats
> here. Any ideas?
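As for verifying that the statistics are actually there: besides parquet-tools, the footer can be inspected programmatically. A rough sketch (reusing conf and outputFile from the snippet above; this is not from the original thread):

    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;

    // Print the per-column-chunk statistics stored in the file footer.
    ParquetMetadata footer = ParquetFileReader.readFooter(conf, outputFile);
    for (BlockMetaData block : footer.getBlocks()) {
      for (ColumnChunkMetaData column : block.getColumns()) {
        System.out.println(column.getPath() + " -> " + column.getStatistics());
      }
    }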
