Re: Is there a way to merge parquet small files?

2016-05-20 Thread Takeshi Yamamuro
Many small files could cause technical issues in both HDFS and Spark, though they do not generate many stages and tasks in recent versions of Spark. // maropu On Fri, May 20, 2016 at 2:41 PM, Gavin Yue wrote: > For logs file I would suggest save as gziped text file

Re: Is there a way to merge parquet small files?

2016-05-19 Thread Gavin Yue
For log files I would suggest saving as gzipped text files first. After aggregation, convert them into Parquet by merging a few files. > On May 19, 2016, at 22:32, Deng Ching-Mallete wrote: > > IMO, it might be better to merge or compact the parquet files instead of >
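A minimal sketch of that workflow (Spark 1.6-era API, matching this thread's timeframe), assuming a spark-shell session where sqlContext is available; the paths and partition count are placeholders:

  // Read gzipped text logs (gzip is decompressed transparently by the text reader),
  // then write them out as a handful of larger Parquet files.
  val logs = sqlContext.read.text("hdfs:///logs/raw/2016-05-19/*.gz")

  // ... parsing / aggregation would happen here ...

  // Coalesce to a small number of partitions so each output Parquet file is large.
  logs.coalesce(8)
      .write
      .parquet("hdfs:///logs/parquet/2016-05-19")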

Re: Is there a way to merge parquet small files?

2016-05-19 Thread Deng Ching-Mallete
IMO, it might be better to merge or compact the parquet files instead of keeping lots of small files in HDFS. Please refer to [1] for more info. We also encountered the same slow-query issue, and it was indeed caused by the many small parquet files. In our case, we were processing
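A hedged sketch of such a compaction pass, again assuming a spark-shell session with sqlContext; the paths and partition count are illustrative only:

  // Read the many small Parquet files back and rewrite them as fewer, larger files.
  val small = sqlContext.read.parquet("hdfs:///logs/parquet/2016-05-19")

  // repartition() shuffles but yields evenly sized files; coalesce() skips the
  // shuffle at the risk of skewed file sizes.
  small.repartition(16)
       .write
       .mode("overwrite")
       .parquet("hdfs:///logs/parquet-compacted/2016-05-19")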

Re: Is there a way to merge parquet small files?

2016-05-19 Thread Alexander Pivovarov
Try using the Hadoop setting mapreduce.input.fileinputformat.split.maxsize to control the RDD partition size. I heard that a DataFrame can read several files in one task. On Thu, May 19, 2016 at 8:50 PM, 王晓龙/0515 wrote: > I’m using a spark streaming program to store log message into
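For reference, that setting can be applied on the SparkContext's Hadoop configuration before reading; a sketch, with 256 MB as an arbitrary example value (it mainly affects Hadoop InputFormat-based reads such as sc.textFile):

  // Control the maximum input split size, and thus the RDD partition size,
  // for Hadoop InputFormat-based reads.
  sc.hadoopConfiguration.setLong(
    "mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024)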

Is there a way to merge parquet small files?

2016-05-19 Thread 王晓龙/01111515
I’m using a Spark Streaming program to store log messages into Parquet files every 10 mins. Now, when I query the parquet, it usually takes hundreds of thousands of stages to compute a single count. I looked into the parquet files’ path and found a great number of small files. Do the small files
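For context, a hypothetical sketch of the kind of job described above, with a coalesce before each write so every 10-minute batch produces a few larger Parquet files rather than many small ones (schema, paths, and column name are placeholders):

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(600))  // 10-minute batches
  val lines = ssc.textFileStream("hdfs:///logs/incoming")

  lines.foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty()) {
      import sqlContext.implicits._
      rdd.map(Tuple1.apply)
         .toDF("message")
         .coalesce(4)  // fewer, larger Parquet files per batch
         .write
         .parquet(s"hdfs:///logs/parquet/batch-${time.milliseconds}")
    }
  }

  ssc.start()
  ssc.awaitTermination()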