[
https://issues.apache.org/jira/browse/HADOOP-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674367#action_12674367
]
pfarner edited comment on HADOOP-1572 at 2/17/09 1:49 PM:
------------------------------------------------------------------
Copying between filesystems (mentioned above) and restore from backup are
familiar use cases.
Here's another use case: repeated update of the output of a data pipeline. If
we have input files A1, A2, ..., An that need to be aggregated into file B
whenever another file Ai is created, then it's useful to have an easy way to
know when B needs to be updated. If one compares the modification time of B to
the modification times of A1, ..., An, then a race condition can cause some
updates of B to be delayed indefinitely. If we could set the modification time
of B, then we could avoid this race condition cleanly. (Details below, for the
curious.)
Race condition sequence:
* A1 and A2 are created
* transformation operation to create B starts, chooses A1 and A2 as inputs
* A3 is created
* output of the transformation operation is stored as B
At this point, B contains data from A1 and A2, but not from A3, and yet B's
modification time is later than A3's. If we could set the timestamp, we could
choose A1,A2 as inputs, record the maximum of their timestamps (tmax) in the
JobConf, and then create B with a modification time of tmax. tmax would be
less than A3's modification time, so it would be clear that B needs to be
updated and the race condition would be prevented.
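To make the tmax idea concrete, here is a minimal sketch. It uses java.nio.file
on a local filesystem as a stand-in, since the HDFS call requested in this
issue is not assumed to exist; the class and method names (StalenessSketch,
maxModifiedMillis, writeOutput, needsUpdate) are hypothetical and are not taken
from the prototype patch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.List;

public class StalenessSketch {

    /** Largest modification time (ms since epoch) among the chosen inputs: tmax. */
    static long maxModifiedMillis(List<Path> inputs) throws IOException {
        long tmax = Long.MIN_VALUE;
        for (Path p : inputs) {
            tmax = Math.max(tmax, Files.getLastModifiedTime(p).toMillis());
        }
        return tmax;
    }

    /** Store the output as B, then stamp B with tmax rather than the write time. */
    static void writeOutput(Path b, byte[] data, long tmax) throws IOException {
        Files.write(b, data);
        Files.setLastModifiedTime(b, FileTime.fromMillis(tmax));
    }

    /** B needs an update if it is missing or any input is newer than B's stamp. */
    static boolean needsUpdate(Path b, List<Path> allInputs) throws IOException {
        if (!Files.exists(b)) {
            return true;
        }
        long bTime = Files.getLastModifiedTime(b).toMillis();
        for (Path p : allInputs) {
            if (Files.getLastModifiedTime(p).toMillis() > bTime) {
                return true;
            }
        }
        return false;
    }
}
```

In the sequence above, the job would call maxModifiedMillis on A1 and A2 when
it starts and writeOutput when it finishes; a later needsUpdate over A1..A3
then reports that B is stale. Note that the strict comparison treats an input
whose timestamp equals tmax as already included, so this still depends on the
timestamp granularity of the filesystem.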
To work around this problem in current systems, I create a secondary file
containing B's timestamp as text, but this doubles the number of name node
entries needed for B, and is slower than using the modification time.
This change would be a significant improvement in my use of hadoop, so I'm
naturally motivated to help. I've created a prototype of a patch (using the
"option 2" style above), and I'll refine it, confirm that I'm complying with
hadoop's style guidelines, and post it here. Any information you have on
special complications would be appreciated. (I know that some FileSystems
won't support this operation, but I'm more worried about subtler problems).
was (Author: pfarner):
Copying between filesystems (mentioned above) and restore from backup are
familiar use cases.
Here's another use case: incremental update of a data pipeline. If we have an
input files A1,A2,..An, which needs to be aggregated into file B whenever
another file Ai is created, then it's useful to have an easy way to know when B
needs to be updated. If one compares the modification time of B to the
modification times of A1,..An, then a race condition can cause some updates of
B to be delayed forever. If we could modify the modification time of B, then
we could avoid this race condition cleanly. (details below, for the curious)
Race condition sequence:
* A1 and A2 are created
* transformation operation to create B starts, chooses A1 and A2 as inputs
* A3 is created
* output of the transformation operation is stored as B
At this point, B contains data from A1 and A2, but not from A3, and yet B's
modification time is later than A3's. If we could set the timestamp, we could
choose A1,A2 as inputs, record the maximum of their timestamps (tmax) in the
JobConf, and then create B with a modification time of tmax. tmax would be
less than A3's modification time, so the race condition would be prevented.
In order to avoid this problem in current systems, I create a secondary file
containing the timestamp for B as text, but this doubles the number of name
node entries needed for B, and is slower than using the modification time.
This change would be a significant improvement in my use of hadoop, so I'm
naturally motivated to help. I've created a prototype of a patch (using the
"option 2" style above), and I'll refine it, confirm that I'm complying with
hadoop's style guidelines, and post it here. Any information you have on
special complications would be appreciated. (I know that some FileSystems
won't support this operation, but I'm more worried about subtler problems).
> should have utime method in HDFS & FileSystem to set modification times.
> ------------------------------------------------------------------------
>
> Key: HADOOP-1572
> URL: https://issues.apache.org/jira/browse/HADOOP-1572
> Project: Hadoop Core
> Issue Type: New Feature
> Components: dfs
> Reporter: Owen O'Malley
>
> It would be nice to modify the modification times of files.