[
https://issues.apache.org/jira/browse/HADOOP-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623489#action_12623489
]
Konstantin Shvachko commented on HADOOP-1869:
---------------------------------------------
I think this proposal is in the right direction.
According to HADOOP-3860 the name-node can currently perform 20-25 times more opens
per second than creates.
Which means that if we let every open / getBlockLocations be logged and flushed
we lose big: each read would then pay roughly the same journaling cost as a create.
Another observation is that map-reduce does a lot of {{ls}} operations, both for
directories and for individual files.
I have seen 20,000 per second. This happens when a job starts, and the volume
depends on the user input data and on how many tasks the job will run.
So maybe we should not log file access for {{ls}}, permission checking, etc. I
think it would be sufficient to write
{{OP_SET_ACCESSTIME}} only in the case of getBlockLocations().
Also, I think we should support access times only for regular files, not for
directories.
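Roughly, something like the sketch below (all names here are hypothetical, not
actual FSNamesystem code; the precision window is just one way to avoid a
transaction per read of the same file):
{code:java}
// Hypothetical sketch: update access time only from getBlockLocations(),
// only for regular files, and only when the stored value is stale by more
// than a configurable precision (e.g. one hour).
class AccessTimeSketch {
  // assumed config knob; not an actual Hadoop parameter name
  private final long accessTimePrecision = 60L * 60 * 1000; // 1 hour

  void onGetBlockLocations(INodeStub inode, long now, EditLogStub editLog) {
    if (inode.isDirectory()) {
      return; // no access times for directories
    }
    if (now - inode.getAccessTime() > accessTimePrecision) {
      inode.setAccessTime(now);
      // one journal record per file per precision window,
      // instead of one per read
      editLog.logSetAccessTime(inode.getPath(), now);
    }
  }

  // Minimal stand-ins so the sketch is self-contained.
  interface INodeStub {
    boolean isDirectory();
    long getAccessTime();
    void setAccessTime(long t);
    String getPath();
  }

  interface EditLogStub {
    void logSetAccessTime(String path, long atime);
  }
}
{code}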
Another alternative would be to keep access times only in the name-node's
memory. Would that be sufficient to detect "malicious"
behavior of some users? Name-nodes usually run for months, right? So before,
say, upgrading the name-node, or simply every (other) week,
administrators could look at files that have not been touched during that
period and act accordingly.
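The in-memory variant could be as simple as the following sketch (again, names
are hypothetical; since nothing is journaled, the access times would be lost on
a name-node restart):
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: keep access times only in name-node memory.
// Nothing is written to the edit log, so the data survives only as
// long as the name-node process does.
class InMemoryAccessTimes {
  private final Map<String, Long> lastAccess = new ConcurrentHashMap<>();

  // called from getBlockLocations(); costs one map update, no flush
  void recordAccess(String path) {
    lastAccess.put(path, System.currentTimeMillis());
  }

  // admin-side scan: paths not touched since 'cutoff' (e.g. a week ago)
  // are candidates for cleanup
  boolean untouchedSince(String path, long cutoff) {
    Long t = lastAccess.get(path);
    return t == null || t < cutoff;
  }
}
{code}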
My main concern is that even though with Dhruba's approach we will batch access
operations and will not lose time flushing them
individually, the journaling traffic will still double; that is, with each flush
more bytes need to be written. That means increased latency for each flush,
and bigger edits files.
It would be good to have some experimental data measuring throughput and
latency for getBlockLocations with and without {{OP_SET_ACCESSTIME}}
transactions. The easy way to test would be to use NNThroughputBenchmark.
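For example, something along these lines (the exact class location and options
vary between Hadoop versions, so take this invocation as illustrative):
{noformat}
hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark \
  -op open -threads 10 -files 100000
{noformat}
Running the same {{-op open}} load against a name-node with and without access
time journaling enabled would show the latency and throughput cost directly.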
> access times of HDFS files
> --------------------------
>
> Key: HADOOP-1869
> URL: https://issues.apache.org/jira/browse/HADOOP-1869
> Project: Hadoop Core
> Issue Type: New Feature
> Components: dfs
> Reporter: dhruba borthakur
> Assignee: dhruba borthakur
>
> HDFS should support some type of statistics that allows an administrator to
> determine when a file was last accessed.
> Since HDFS does not have quotas yet, it is likely that users keep on
> accumulating files in their home directories without much regard to the
> amount of space they are occupying. This causes memory-related problems with
> the namenode.
> Access times are costly to maintain. AFS does not maintain access times. I
> think DCE-DFS does maintain access times, with a coarse granularity.
> One proposal for HDFS would be to implement something like an "access bit".
> 1. This access-bit is set when a file is accessed. If the access bit is
> already set, then this call does not result in a transaction.
> 2. A FileSystem.clearAccessBits() indicates that the access bits of all files
> need to be cleared.
> An administrator can effectively use the above mechanism (maybe a daily cron
> job) to determine which files were recently used.
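A minimal sketch of the access-bit mechanism described above (names are
hypothetical; a real implementation would keep the bit on the inode itself and
deal with persistence):
{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the access-bit proposal: a set bit is cheap to
// test, so only the first access after a clear would cost a transaction.
class AccessBits {
  private final Set<String> accessed = ConcurrentHashMap.newKeySet();

  // called on each file access; returns true only when the bit flips,
  // i.e. when a transaction would actually be logged
  boolean markAccessed(String path) {
    return accessed.add(path);
  }

  // FileSystem.clearAccessBits() analogue: reset everything, e.g. from a
  // daily cron job, then later diff against the namespace to find files
  // that were never read during the window
  void clearAccessBits() {
    accessed.clear();
  }

  boolean wasAccessed(String path) {
    return accessed.contains(path);
  }
}
{code}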