[ https://issues.apache.org/jira/browse/HIVE-25837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yao Guangdong updated HIVE-25837:
---------------------------------
Description:
Merging files in Hive can take a very long time in some cases. This happens
when there are thousands, or even tens of thousands, of small files, most of
them only a few KB or less. The current merge implementation considers only the
target size (default 256M), so a single map task may have to merge thousands or
even tens of thousands of small files, which takes too long.
To address this, we changed the code to consider not only the target size but
also the number of files merged per map task (default 1024 per map). This may
produce target files smaller than the user's setting, but compared with the
cost of the merge I think users can accept that.
was:
It will cost very long time in some cases when we use hive merge files.This
is because we have thousands, even tens of thousands or more small files.But
this files is very small.Most of small files only have a little kb.The merge
file implement is only consider the target size(default 256M) at now.Which make
one map will merge thousands, even tens of thousands or more small files.Which
will cost too long time.
In this case,we change the code not only consider the targe size but also
care about the number of merge files per map(default 1024/map).Which may cause
the target files small than user's setting,but compare with the cost on merge
files i think user can accept it.
> Hive merge file operation may cost too long time
> ------------------------------------------------
>
> Key: HIVE-25837
> URL: https://issues.apache.org/jira/browse/HIVE-25837
> Project: Hive
> Issue Type: Improvement
> Components: Hive
> Affects Versions: All Versions
> Reporter: Yao Guangdong
> Assignee: Yao Guangdong
> Priority: Major
> Attachments: HIVE-25837.0001.patch
>
>
> Merging files in Hive can take a very long time in some cases. This happens
> when there are thousands, or even tens of thousands, of small files, most of
> them only a few KB or less. The current merge implementation considers only
> the target size (default 256M), so a single map task may have to merge
> thousands or even tens of thousands of small files, which takes too long.
> To address this, we changed the code to consider not only the target size but
> also the number of files merged per map task (default 1024 per map). This may
> produce target files smaller than the user's setting, but compared with the
> cost of the merge I think users can accept that.
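
The grouping rule described in the issue can be illustrated with a minimal sketch. This is not the attached patch and does not use Hive's actual classes; the class name `MergeGrouping`, the method `group`, and its parameters are hypothetical and chosen only for illustration. It assumes a simple greedy pass in which a group is closed as soon as adding the next file would exceed the target size or the group already holds the maximum number of files.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Hive code and not the attached patch.
public class MergeGrouping {

    /**
     * Greedily groups file sizes (in bytes) into merge tasks.
     * A group is closed when adding the next file would exceed targetSize,
     * or when it already contains maxFilesPerTask files.
     */
    static List<List<Long>> group(List<Long> fileSizes, long targetSize, int maxFilesPerTask) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;

        for (long size : fileSizes) {
            // Size limit: only applies once the group has at least one file,
            // so a single oversized file still gets its own group.
            boolean sizeFull = !current.isEmpty() && currentBytes + size > targetSize;
            // Count limit: the additional condition proposed in the issue.
            boolean countFull = current.size() >= maxFilesPerTask;

            if (sizeFull || countFull) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

Under these assumptions, 5000 files of 1 KB each with a 256M target and a cap of 1024 files are split into five groups rather than one, so no single task has to merge all of them. Each resulting file is far smaller than the target size, which is the trade-off the issue accepts.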
--
This message was sent by Atlassian Jira
(v8.20.1#820001)