[
https://issues.apache.org/jira/browse/HADOOP-9639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747831#comment-13747831
]
Sangjin Lee commented on HADOOP-9639:
-------------------------------------
[~jlowe], thanks for the great feedback. You pointed out some additional
details that the design needs to spell out.
I'd also like to clarify a couple of things, and hopefully that will set the
stage for answering some of your questions in the process. I see that these
points may not have been as clear as I thought they were. Sorry it's quite
long. :)
TL;DR
First, the *only* consumer of the reader locks (.in_use files) is the cleaner.
No other components use these files or check them (clients or node manager).
Furthermore, the only time the cleaner looks at these reader locks and the
timestamps in the names of these files is to prevent a race, as described in
step W2 in the doc.
To determine whether a particular cached entry is stale, the cleaner first
looks at the modification time of the directory that contains the cached file
(more on the directory modification time later). Then the cleaner writes the
cleaner lock to prevent clients from taking action, and proceeds to delete the
cached entry. However, there can be a race between the time the cleaner
determines that the cached entry is stale and the time it writes the cleaner
lock. A client may come in at just the right moment and start using the entry
between those two points. For this reason, the cleaner needs to double-check
whether there has been any *very* recent attempt to use this file (a few
seconds ago at most) before proceeding with the delete. The moment the cleaner
detects such a recent reader lock, it recognizes the race and skips this
directory.
So the primary reason that these reader locks (and the associated timestamps)
are needed is to prevent this race.
Another way of thinking about this: we could easily have designed this without
the timestamp, relying on the directory modification time alone. The cleaner
would check the directory modification time, write the cleaner lock, and then
re-check the directory modification time to see whether there was any recent
attempt to use the entry in between.
In fact, I'm thinking I may want to modify the design to take that approach.
The timestamp seems a bit confusing in terms of what it is used for. I'll
probably make that change.
Now on to why I use the modification time of the directory that contains the
cached entry; I'm talking about the modification time of the directory on
*HDFS*. I am using it to detect whether a new client came in and started using
the cached entry. Since a client is required to drop the reader lock, any use
of the cached entry will update the modification timestamp of the containing
directory. So it can be used as a proxy for when the cached entry was last
used. The timestamp gets updated one more time when the client removes the
reader lock at the end.
I originally considered using another file (like .last_used) to keep track of
the last used time of cached entries. However, we were concerned about the
impact of adding another file for each cached entry, and thus putting more
pressure on the name node. Thus, we settled on looking at the modification time
of the containing directory in lieu of that.
The only consumer of this information is also the cleaner.
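As a sanity check, here is a small Python sketch (local filesystem; HDFS likewise updates a directory's modification time when a file is created or deleted inside it) showing why the directory mtime works as a last-used proxy; the `.in_use` file name is illustrative:

```python
import os
import tempfile
import time

def last_used(entry_dir):
    """The containing directory's mtime doubles as the entry's last-used time.

    Dropping (or later removing) a reader-lock file in the directory updates
    the directory's modification time, so no extra .last_used file -- and no
    extra object on the name node -- is needed."""
    return os.path.getmtime(entry_dir)

# Demo: "using" the entry by dropping a reader lock advances the proxy.
entry = tempfile.mkdtemp()
before = last_used(entry)
time.sleep(1.1)                                         # let the clock tick
open(os.path.join(entry, "12345.in_use"), "w").close()  # client drops a reader lock
after = last_used(entry)
```

After the lock file is created, `after` is strictly greater than `before`, which is all the cleaner needs for its staleness check.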
With this, I'll add my comments to your points/questions.
{quote}
I'm thinking of the general case of permissions - just because the job client
has access to the local files during job submission does not mean the user
wants all those files available to anyone with cluster access. It's probably
less of an issue in practice if this is limited to just jars, but it's
definitely an issue if this is expanded to other distcache file types (e.g.:
data files for something like a map-side join).
{quote}
Agreed. I need to add clarifying details here. If you set the boolean flag, the
intent is that it covers only the job jar and the libjars, but it does not
apply to other files (for example, -files or -archives are excluded from this).
Let me know if you think that's reasonable. I'll add that clarification.
{quote}
How is the case of orphaned temporary files any different than the orphaned
read lock case? I would think the issue of staleness would apply there as well.
If a temporary file is over a day old, it's highly likely to be orphaned.
Nobody wants to wait a day to upload a distcache entry to HDFS, as it implies
it would be on the same order of time to localize it later.
{quote}
I see it's not explicitly stated, but checkAndUpload() may return the *temp
file* instead of the jar file under some scenarios. This pertains to that error
handling I talked about in my previous comment. So it is possible, although
unlikely, that jobs may use these temp files directly. In that case, if a temp
file were prematurely deleted, it could lead to localization failures.
Having said that, I think it may be possible to modify the logic of
checkAndUpload() slightly so that it always returns the intended jar. It would
make the algorithm a bit more involved (as it would entail retrying the
rename), but I might be able to make that change. If that can be done, then
any old temp files can be considered safe for clean-up.
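A hedged sketch of what that modified checkAndUpload() could look like, in Python against a local filesystem; the rename-retry structure is the point, while the names and retry count are assumptions of mine:

```python
import os
import shutil
import tempfile

def check_and_upload(local_jar, cache_dir, retries=3):
    """Always return the published jar path, never a temp file.

    If another client publishes the same jar first, discard our temp copy and
    use theirs; retry a bounded number of times."""
    final = os.path.join(cache_dir, os.path.basename(local_jar))
    for _ in range(retries):
        if os.path.exists(final):
            return final                 # already published by another client
        fd, tmp = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
        os.close(fd)
        shutil.copyfile(local_jar, tmp)  # "upload" the temp copy
        if os.path.exists(final):        # another client won while we copied
            os.remove(tmp)               # our temp file is now safe to discard
            continue
        os.rename(tmp, final)            # publish (rename is atomic on HDFS)
        return final
    return final if os.path.exists(final) else None
```

Because callers only ever see `final`, every leftover `.tmp` file is by construction unreferenced and can be cleaned up without risking localization failures.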
{quote}
Speaking of long-running jobs, an alternative would be to use the YARN
application ID (which clients grab just before submitting) as part of the read
lock. Then the cleaner can query the ResourceManager to know for certain
whether the job is still active.
{quote}
This is a great point. I think the issue of long-running apps wasn't fully
explored in the current version of the design. If there are apps that run
beyond the specified staleness value, then the cleaner could erroneously clean
up the cached entry, and it may result in localization failures. I think this
needs to be addressed.
As you mentioned, probably the best way to solve this cleanly is to add the
YARN app id so that the cleaner knows whether the app is active or not.
We originally shied away from using the app id because I wanted to keep the
cleaner as minimal as possible, relying only on HDFS. But this may be a
compelling reason to introduce the app id. I'll tinker with this and see if it
works.
{quote}
That would not be OK. The last job to initiate a reference on a distcache file
is not necessarily going to be the last one to relinquish that reference. Job A
starts first but is long-running, and job B starts later but is very quick. We
do not want to delete job A's reference because it's older than job B.
Otherwise job A could easily fail after job B completes if new tasks (think
reducers or failed maps) are later launched on nodes that have not localized
those distcache entries yet.
{quote}
You're right that this type of inversion can happen. But I think there are also
mitigating circumstances. This goes back to the observation that the only
consumer of the reader locks is the cleaner, and that it looks at them only to
determine whether the aforementioned race is occurring. So even if the
inversion occurs, it has an impact only when combined with that race. I think
this
could be easily addressed by doing this reader lock clean-up only on recent
cached entries.
At any rate, if we end up introducing the app id, this problem may be solved
along with it.
{quote}
Is the directory timestamp that important? We're localizing files (jars in this
case), not directories. A stable timestamp of the file being localized is key
to preventing unnecessary re-localization, but I don't see why that would be
changing.
{quote}
I hope I answered this question by explaining why we're using the modification
time of the containing directory.
> truly shared cache for jars (jobjar/libjar)
> -------------------------------------------
>
> Key: HADOOP-9639
> URL: https://issues.apache.org/jira/browse/HADOOP-9639
> Project: Hadoop Common
> Issue Type: New Feature
> Components: filecache
> Affects Versions: 2.0.4-alpha
> Reporter: Sangjin Lee
> Assignee: Sangjin Lee
> Attachments: shared_cache_design.pdf
>
>
> Currently there is the distributed cache that enables you to cache jars and
> files so that attempts from the same job can reuse them. However, sharing is
> limited with the distributed cache because it is normally on a per-job basis.
> On a large cluster, sometimes copying of jobjars and libjars becomes so
> prevalent that it consumes a large portion of the network bandwidth, not to
> speak of defeating the purpose of "bringing compute to where data is". This
> is wasteful because in most cases code doesn't change much across many jobs.
> I'd like to propose and discuss feasibility of introducing a truly shared
> cache so that multiple jobs from multiple users can share and cache jars.
> This JIRA is to open the discussion.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira