Re: Is there a way to merge parquet small files?

2016-05-20 Thread Takeshi Yamamuro
Many small files can cause performance issues in both HDFS and Spark,
though they do not generate many stages and tasks in recent versions
of Spark.
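
If you need to merge files that are already written, one option is to
read them back and rewrite them with fewer partitions. A minimal
sketch, assuming Spark 1.6 in the spark-shell where sc is the
SparkContext (the paths and partition count are hypothetical):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // Read all the small files under one directory into one DataFrame.
    val df = sqlContext.read.parquet("hdfs:///logs/parquet/2016-05-19")
    // Shrink to a few partitions and rewrite as a few larger files.
    df.coalesce(8).write.parquet("hdfs:///logs/parquet-merged/2016-05-19")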

// maropu



-- 
---
Takeshi Yamamuro


Re: Is there a way to merge parquet small files?

2016-05-19 Thread Gavin Yue
For log files, I would suggest saving them as gzipped text files first.
After aggregation, convert them into Parquet by merging a few files.
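
A minimal sketch of that two-step pipeline, assuming Spark 1.6 in the
spark-shell (the paths, the one-column LogLine schema, and the
partition count are all hypothetical):

    import org.apache.hadoop.io.compress.GzipCodec
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    case class LogLine(line: String) // placeholder schema for raw log lines

    // Step 1 (streaming side): append raw logs as gzipped text; small
    // gzip files are cheaper to accumulate than small Parquet files.
    // someRdd.saveAsTextFile("hdfs:///logs/text/2016-05-19", classOf[GzipCodec])

    // Step 2 (batch side): read a whole day of text files at once (gzip
    // is decompressed transparently) and rewrite as larger Parquet files.
    sc.textFile("hdfs:///logs/text/2016-05-19")
      .map(l => LogLine(l))
      .toDF()
      .coalesce(16)
      .write.parquet("hdfs:///logs/parquet/2016-05-19")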




Re: Is there a way to merge parquet small files?

2016-05-19 Thread Deng Ching-Mallete
IMO, it might be better to merge or compact the Parquet files instead of
keeping lots of small files in HDFS. Please refer to [1] for more info.

We also encountered the same issue with slow queries, and it was indeed
caused by the many small Parquet files. In our case, we were processing
large data sets with batch jobs instead of a streaming job. To solve it,
we simply did a coalesce to reduce the number of partitions before
saving in Parquet format.
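
For reference, a minimal sketch of that approach (Spark 1.6 syntax; the
partition count and output path are made up for illustration):

    // df is the DataFrame produced by the batch job. coalesce() reduces
    // the number of output partitions without a full shuffle, so fewer,
    // larger Parquet files get written; use repartition() instead if the
    // data also needs rebalancing across partitions.
    df.coalesce(32).write.parquet("hdfs:///output/logs-parquet")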

HTH,
Deng

[1] http://blog.cloudera.com/blog/2009/02/the-small-files-problem/


Re: Is there a way to merge parquet small files?

2016-05-19 Thread Alexander Pivovarov
Try the Hadoop setting mapreduce.input.fileinputformat.split.maxsize to
control the RDD partition size.
I heard that the DataFrame reader can read several files in one task.
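
A minimal sketch of setting that property, assuming it is applied to
the SparkContext's Hadoop configuration before the data is read (the
256 MB value is illustrative, not a recommendation):

    // Set the maximum input split size (in bytes) used when computing
    // input splits, per the suggestion above.
    sc.hadoopConfiguration.set(
      "mapreduce.input.fileinputformat.split.maxsize",
      (256L * 1024 * 1024).toString)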


Is there a way to merge parquet small files?

2016-05-19 Thread 王晓龙/01111515
I’m using a Spark Streaming program to store log messages into Parquet
files every 10 minutes.
Now, when I query the Parquet data, it usually takes hundreds of
thousands of stages to compute a single count.
I looked into the Parquet files’ path and found a great number of small
files.

Did the small files cause the problem? Can I merge them, or is there a
better way to solve it?

Lots of thanks.

