Re: Is there a way to merge parquet small files?
Many small files can cause technical issues in both HDFS and Spark, though they do not generate many stages and tasks in recent versions of Spark.

// maropu

On Fri, May 20, 2016 at 2:41 PM, Gavin Yue wrote:
> For log files I would suggest saving as gzipped text files first. After
> aggregation, convert them into Parquet by merging a few files.

--
Takeshi Yamamuro
Re: Is there a way to merge parquet small files?
For log files I would suggest saving as gzipped text files first. After aggregation, convert them into Parquet by merging a few files.

On May 19, 2016, at 22:32, Deng Ching-Mallete wrote:
> IMO, it might be better to merge or compact the Parquet files instead of
> keeping lots of small files in HDFS. Please refer to [1] for more info.
>
> We also encountered the same issue with slow queries, and it was indeed
> caused by the many small Parquet files. In our case, we were processing
> large data sets with batch jobs instead of a streaming job. To solve our
> issue, we just did a coalesce to reduce the number of partitions before
> saving in Parquet format.
>
> HTH,
> Deng
>
> [1] http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
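The gzip-then-Parquet conversion suggested above might look like the following sketch. The paths, the three-field log format, and the `parse_line` helper are all hypothetical (not from this thread), and the Spark call is wrapped in a function because it needs a live SparkSession:

```python
def parse_line(line: str):
    """Split a hypothetical 'TIMESTAMP LEVEL message' log line into fields."""
    ts, level, msg = line.split(" ", 2)
    return ts, level, msg

def gzipped_logs_to_parquet(spark, src_glob, dst_path, num_files=8):
    """Read aggregated .gz text logs and rewrite them as a few Parquet files.

    Sketch only: requires a running SparkSession; the paths and the
    three-field log format are assumptions.
    """
    lines = spark.read.text(src_glob)  # Spark decompresses .gz transparently
    rows = lines.rdd.map(lambda r: parse_line(r.value))
    df = spark.createDataFrame(rows, ["ts", "level", "msg"])
    # Write a small, fixed number of output files instead of one per partition
    df.coalesce(num_files).write.mode("overwrite").parquet(dst_path)
```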
Re: Is there a way to merge parquet small files?
IMO, it might be better to merge or compact the Parquet files instead of keeping lots of small files in HDFS. Please refer to [1] for more info.

We also encountered the same issue with slow queries, and it was indeed caused by the many small Parquet files. In our case, we were processing large data sets with batch jobs instead of a streaming job. To solve our issue, we just did a coalesce to reduce the number of partitions before saving in Parquet format.

HTH,
Deng

[1] http://blog.cloudera.com/blog/2009/02/the-small-files-problem/

On Fri, May 20, 2016 at 1:50 PM, 王晓龙/0515 wrote:
> I'm using a Spark Streaming program to store log messages into Parquet
> files every 10 minutes.
> Now, when I query the Parquet data, it usually takes hundreds of
> thousands of stages to compute a single count.
> I looked into the Parquet files' path and found a great number of small
> files.
>
> Did the small files cause the problem? Can I merge them, or is there a
> better way to solve it?
>
> Lots of thanks.
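The coalesce-before-save approach described above can be sketched as follows. The 128 MB target (a common HDFS block size), the paths, and the helper names are illustrative assumptions; the Spark call is wrapped in a function since it needs a live SparkSession:

```python
import math

def target_partitions(total_bytes: int, target_file_bytes: int = 128 * 1024 * 1024) -> int:
    """Number of partitions to coalesce to so each output file is roughly
    target_file_bytes (default 128 MB, a common HDFS block size)."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

def compact_parquet(spark, src_path, dst_path, total_bytes):
    """Rewrite many small Parquet files as a few larger ones (sketch only)."""
    n = target_partitions(total_bytes)
    df = spark.read.parquet(src_path)
    # coalesce avoids a full shuffle; repartition(n) would rebalance skewed data
    df.coalesce(n).write.mode("overwrite").parquet(dst_path)
```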
Re: Is there a way to merge parquet small files?
Try using the Hadoop setting mapreduce.input.fileinputformat.split.maxsize to control the RDD partition size. I heard that DataFrames can read several files in one task.

On Thu, May 19, 2016 at 8:50 PM, 王晓龙/0515 wrote:
> I'm using a Spark Streaming program to store log messages into Parquet
> files every 10 minutes.
> Now, when I query the Parquet data, it usually takes hundreds of
> thousands of stages to compute a single count.
> I looked into the Parquet files' path and found a great number of small
> files.
>
> Did the small files cause the problem? Can I merge them, or is there a
> better way to solve it?
>
> Lots of thanks.
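The setting mentioned above can be passed through to Hadoop via Spark's `spark.hadoop.*` configuration prefix; a minimal sketch, with the 128 MB value as an illustrative choice rather than a recommendation from this thread:

```
# e.g. in spark-defaults.conf (134217728 bytes = 128 MB, illustrative)
spark.hadoop.mapreduce.input.fileinputformat.split.maxsize  134217728
```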
Is there a way to merge parquet small files?
I'm using a Spark Streaming program to store log messages into Parquet files every 10 minutes. Now, when I query the Parquet data, it usually takes hundreds of thousands of stages to compute a single count. I looked into the Parquet files' path and found a great number of small files.

Did the small files cause the problem? Can I merge them, or is there a better way to solve it?

Lots of thanks.

The content of this email represents only the personal views and opinions of the sender, not those of China Merchants Bank Co., Ltd. and its branches, which accept no responsibility for its content. This email is intended solely for the recipient; if you have received it in error, please delete it immediately.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org