[ 
https://issues.apache.org/jira/browse/HDFS-8836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900303#comment-14900303
 ] 

Jan Filipiak commented on HDFS-8836:
------------------------------------

[~ajisakaa]
Your approach is quite similar to the one followed in this ticket: find 
zero-size files and treat them differently.
Ideally I would like to skip the empty files from the moment they get 
created, but 1) this is impractical, as many different applications create 
empty files and all of them would have to be fixed, and 2) sometimes these 
empty files are required for other purposes and are only harmful during the 
getmerge step. To explain case 2 a little more, imagine an application that 
uses directory A as an intermediate output that is consumed by many other 
applications. Sqoop is a good example of this. One could set up many Oozie 
coordinators that wait for A/_SUCCESS and then start processing the 
directory. There would be no safe time to delete the file, as one always 
risks that one of the coordinators does not run because it didn't find its 
"dataset" file.

Those two are the main reasons I consider this patch very helpful. If 
namespace size becomes a problem, one can always tackle that at a different 
level. Applying the default Hiddenfilefilter would help in my case, but this 
would need an option as well, and just skipping all the empty files is 
semantically more correct in this case.
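
To illustrate the "skip all the empty files" behaviour, here is a minimal 
sketch using plain java.nio on the local filesystem. It is not the code from 
the attached patches and it does not use the Hadoop FileSystem API; the class 
name MergeSkippingEmpty, the merge method and its skipEmpty flag are 
hypothetical names chosen for this example only.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class, not part of Hadoop.
public class MergeSkippingEmpty {

    /**
     * Concatenates the source files into target and appends a newline after
     * each file that was copied. When skipEmpty is true, zero-length files
     * (for example a _SUCCESS marker) contribute nothing, not even a newline.
     */
    static void merge(List<Path> sources, Path target, boolean skipEmpty)
            throws IOException {
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path src : sources) {
                // Check the file size up front instead of counting copied bytes.
                if (skipEmpty && Files.size(src) == 0) {
                    continue;
                }
                Files.copy(src, out);
                out.write('\n');
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("getmerge-demo");
        Path part0 = Files.write(dir.resolve("part-0"),
                "a".getBytes(StandardCharsets.UTF_8));
        Path success = Files.createFile(dir.resolve("_SUCCESS"));
        Path part1 = Files.write(dir.resolve("part-1"),
                "b".getBytes(StandardCharsets.UTF_8));
        Path merged = dir.resolve("merged.txt");

        merge(Arrays.asList(part0, success, part1), merged, true);
        System.out.print(new String(Files.readAllBytes(merged),
                StandardCharsets.UTF_8));
    }
}
```

With skipEmpty set to false, the empty _SUCCESS file would still produce a 
newline of its own, which is the blank line in the merged output that this 
ticket is about.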

> Skip newline on empty files with getMerge -nl
> ---------------------------------------------
>
>                 Key: HDFS-8836
>                 URL: https://issues.apache.org/jira/browse/HDFS-8836
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>    Affects Versions: 2.6.0, 2.7.1
>            Reporter: Jan Filipiak
>            Assignee: Kanaka Kumar Avvaru
>            Priority: Trivial
>         Attachments: HDFS-8836-01.patch, HDFS-8836-02.patch, 
> HDFS-8836-03.patch, HDFS-8836-04.patch, HDFS-8836-05.patch
>
>
> Hello everyone,
> I recently needed to use the newline option -nl with getMerge because the 
> files I needed to merge simply didn't end with one. I was merging all the 
> files from one directory, and unfortunately this directory also included 
> empty files, which effectively led to multiple newlines being appended 
> after some files. I needed to remove them manually afterwards.
> In this situation it might be good to have another argument that allows 
> skipping empty files.
> Things one could try in order to implement this feature:
> The call to IOUtils.copyBytes(in, out, getConf(), false); doesn't
> return the number of bytes copied, which would be convenient, as one could
> skip appending the newline when 0 bytes were copied. Alternatively, one 
> could check the file size beforehand.
> I posted this idea on the mailing list 
> http://mail-archives.apache.org/mod_mbox/hadoop-user/201507.mbox/%3C55B25140.3060005%40trivago.com%3E
>  but I didn't really get many responses, so I thought I might try this way.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
