[ 
https://issues.apache.org/jira/browse/HDFS-8060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394660#comment-14394660
 ] 

Umesh Kacha commented on HDFS-8060:
-----------------------------------

Hi Chrish, thanks for the prompt response. Let me explain my use case.

I have a Java application that collects data from database servers, roughly 
1 TB every day. I need to compress these data files, which reside in 
directories laid out as follows:

ABC --dir
    2015-03-25 --dir
               0-1  --dir
                    0-1.dat --actual data file
               1-2  --dir
                    1-2.dat
DEF --dir
    2015-03-25 --dir
               0-1  --dir
                    0-1.dat --actual data file
               1-2  --dir
                    1-2.dat

As the structure above shows, there are hundreds of servers (named ABC, DEF, 
and so on). Each server directory contains one directory per business date, 
each business-date directory contains hourly directories (0-1, 1-2, ... 
23-24), and each hourly directory contains the data files.

I have two jobs running on these data files: a daily compress and a weekly 
merge. For the daily compress, I find each hourly data file with an 
fs.globStatus() pattern, so that part is easy. But to merge, I first need to 
copy all these hourly files into one directory using

copy(FileSystem srcFS, Path[] srcs, FileSystem dstFS, Path dst, boolean 
deleteSource, boolean overwrite, Configuration conf) 

and only then call copyMerge. With terabytes of files, the copy is slow and 
the merge that follows is slow as well. I hope this makes my use case 
clearer.
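
To make the request concrete, here is a minimal sketch of the behaviour I am 
asking for: walk the whole tree and append every file to one output, with no 
intermediate copy step. It uses plain java.nio on a local file system so that 
it is self-contained; it is not Hadoop code and not a proposed patch. The class 
and method names (RecursiveMerge, mergeRecursively) are made up for 
illustration. On HDFS the traversal could presumably be done with 
FileSystem.listFiles(path, true) instead of Files.walk, but that is an 
assumption about how an implementation might look.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, local-file-system analogue of a recursive copyMerge.
final class RecursiveMerge {

    // Appends every regular file found under root, at any depth, to out.
    // Files are appended in lexicographic path order (so "10-11" sorts
    // before "2-3"). Returns the number of files merged.
    static int mergeRecursively(Path root, Path out) throws IOException {
        Path absoluteOut = out.toAbsolutePath().normalize();
        List<Path> files;
        // Collect the inputs first, so the output file is never read back in.
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile)
                        .filter(p -> !p.toAbsolutePath().normalize().equals(absoluteOut))
                        .sorted()
                        .collect(Collectors.toList());
        }
        // Stream each input straight into the single merged output.
        try (OutputStream merged = Files.newOutputStream(out)) {
            for (Path file : files) {
                Files.copy(file, merged);
            }
        }
        return files.size();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java RecursiveMerge <rootDir> <outFile>");
            return;
        }
        int n = mergeRecursively(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("merged " + n + " files into " + args[1]);
    }
}
```

With the layout above, pointing such a method at ABC would merge 0-1.dat, 
1-2.dat and so on for every business date in a single pass.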

> org.apache.hadoop.fs.FileUtil.copyMerge should merge all files in recursive 
> sub directories
> -------------------------------------------------------------------------------------------
>
>                 Key: HDFS-8060
>                 URL: https://issues.apache.org/jira/browse/HDFS-8060
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Umesh Kacha
>
> org.apache.hadoop.fs.FileUtil.copyMerge does not find files recursively in 
> subdirectories. I am ready to contribute the code for this. This is my 
> first JIRA, so I don't know much about the process. Please validate; I feel 
> this feature would be very helpful. Since copyMerge does not support 
> recursive search in subdirectories, I need to copy files out of thousands 
> of directories, move them into one directory, and then give that directory 
> to copyMerge.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
