[jira] Issue Comment Edited: (HADOOP-1572) should have utime method in HDFS & FIleSystem to set modification times.

Preston Pfarner (JIRA) Tue, 17 Feb 2009 13:51:24 -0800

    [ https://issues.apache.org/jira/browse/HADOOP-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674367#action_12674367 ]


pfarner edited comment on HADOOP-1572 at 2/17/09 1:49 PM:
------------------------------------------------------------------

Copying between filesystems (mentioned above) and restore from backup are 
familiar use cases.

Here's another use case: repeated update of the output of a data pipeline.  If 
we have input files A1, A2, ..., An that need to be aggregated into file B 
whenever another file Ai is created, then it's useful to have an easy way to 
know when B needs to be updated.  If one compares the modification time of B to 
the modification times of A1, ..., An, then a race condition can cause some 
updates of B to be delayed indefinitely.  If we could set the modification time 
of B, then we could avoid this race condition cleanly.  (Details below, for the 
curious.)

Race condition sequence:
  * A1 and A2 are created
  * transformation operation to create B starts, chooses A1 and A2 as inputs
  * A3 is created
  * output of the transformation operation is stored as B
At this point, B contains data from A1 and A2, but not from A3, and yet B's 
modification time is later than A3's.  If we could set the timestamp, we could 
choose A1,A2 as inputs, record the maximum of their timestamps (tmax) in the 
JobConf, and then create B with a modification time of tmax.  tmax would be 
less than A3's modification time, so it would be clear that B needs to be 
updated and the race condition would be prevented.
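
To make the sequence above concrete, here is a minimal sketch in plain Java.  
It is only a simulation: the timestamps are made-up long values, and the class 
and method names (StaleCheck, maxTimestamp, needsUpdate) are hypothetical and 
not part of Hadoop or of the prototype patch.

```java
import java.util.Arrays;

public class StaleCheck {
    /** Largest modification time among the given inputs (tmax in the text). */
    static long maxTimestamp(long... inputMtimes) {
        return Arrays.stream(inputMtimes).max().orElse(Long.MIN_VALUE);
    }

    /** B is considered stale if any input is newer than B's modification time. */
    static boolean needsUpdate(long outputMtime, long... inputMtimes) {
        return maxTimestamp(inputMtimes) > outputMtime;
    }

    public static void main(String[] args) {
        // Illustrative timestamps only (assumed values, arbitrary units).
        long a1 = 100, a2 = 110;           // A1 and A2 are created
        long tmax = maxTimestamp(a1, a2);  // job starts, chooses A1 and A2
        long a3 = 120;                     // A3 is created while the job runs
        long writeTime = 130;              // job output is stored as B

        // Without a settable mtime, B carries the write time and hides A3.
        System.out.println(needsUpdate(writeTime, a1, a2, a3)); // false
        // With B stamped as tmax, A3 is correctly seen as newer than B.
        System.out.println(needsUpdate(tmax, a1, a2, a3));      // true
    }
}
```

The only difference between the two checks is which timestamp B carries, which 
is why being able to set it is enough to close the race.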

To work around this problem in current systems, I create a secondary file 
whose text contains the timestamp for B, but this doubles the number of name 
node entries needed for B, and is slower than using the modification time.
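
For illustration, the secondary-file workaround might look roughly like the 
sketch below.  It uses java.nio on a local filesystem as a stand-in for HDFS, 
and the ".tmax" suffix and the names SidecarTimestamp, sidecarFor, writeTmax 
and readTmax are assumptions made for this example, not what my pipeline or 
Hadoop actually uses.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SidecarTimestamp {
    /** Path of the secondary file that sits next to the output (hypothetical naming). */
    static Path sidecarFor(Path output) {
        return output.resolveSibling(output.getFileName() + ".tmax");
    }

    /** Store tmax as decimal text in the secondary file. */
    static void writeTmax(Path output, long tmax) {
        try {
            byte[] text = Long.toString(tmax).getBytes(StandardCharsets.UTF_8);
            Files.write(sidecarFor(output), text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Read tmax back; this costs an extra open and read per staleness check. */
    static long readTmax(Path output) {
        try {
            byte[] text = Files.readAllBytes(sidecarFor(output));
            return Long.parseLong(new String(text, StandardCharsets.UTF_8).trim());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Every output file needs a companion file here, and each check has to open and 
parse it, which is the overhead described above.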


This change would be a significant improvement in my use of Hadoop, so I'm 
naturally motivated to help.  I've created a prototype of a patch (using the 
"option 2" style above), and I'll refine it, confirm that it complies with 
Hadoop's style guidelines, and post it here.  Any information you have on 
special complications would be appreciated.  (I know that some FileSystems 
won't support this operation, but I'm more worried about subtler problems.)

      was (Author: pfarner):
    Copying between filesystems (mentioned above) and restore from backup are 
familiar use cases.

Here's another use case: incremental update of a data pipeline.  If we have 
input files A1,A2,..An, which need to be aggregated into file B whenever 
another file Ai is created, then it's useful to have an easy way to know when B 
needs to be updated.  If one compares the modification time of B to the 
modification times of A1,..An, then a race condition can cause some updates of 
B to be delayed forever.  If we could modify the modification time of B, then 
we could avoid this race condition cleanly.  (details below, for the curious)

Race condition sequence:
  * A1 and A2 are created
  * transformation operation to create B starts, chooses A1 and A2 as inputs
  * A3 is created
  * output of the transformation operation is stored as B
At this point, B contains data from A1 and A2, but not from A3, and yet B's 
modification time is later than A3's.  If we could set the timestamp, we could 
choose A1,A2 as inputs, record the maximum of their timestamps (tmax) in the 
JobConf, and then create B with a modification time of tmax.  tmax would be 
less than A3's modification time, so the race condition would be prevented.

In order to avoid this problem in current systems, I create a secondary file 
containing the timestamp for B as text, but this doubles the number of name 
node entries needed for B, and is slower than using the modification time.


This change would be a significant improvement in my use of hadoop, so I'm 
naturally motivated to help.  I've created a prototype of a patch (using the 
"option 2" style above), and I'll refine it, confirm that I'm complying with 
hadoop's style guidelines, and post it here.  Any information you have on 
special complications would be appreciated.  (I know that some FileSystems 
won't support this operation, but I'm more worried about subtler problems).
  
> should have utime method in HDFS & FIleSystem to set modification times.
> ------------------------------------------------------------------------
>
>                 Key: HADOOP-1572
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1572
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Owen O'Malley
>
> It would be nice to modify the modification times of files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
