[ https://issues.apache.org/jira/browse/HADOOP-9639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747831#comment-13747831 ]

Sangjin Lee commented on HADOOP-9639:
-------------------------------------

[~jlowe], thanks for the great feedback. You pointed out some additional details 
that need to be spelled out in the design.

I'd also like to clarify a couple of things, which should hopefully answer some 
of your questions in the process. I see that these points may not have been as 
clear as I thought they were. Sorry it's quite long. :)

TL;DR

First, the *only* consumer of the reader locks (.in_use files) is the cleaner. 
No other component (client or node manager) uses or checks these files. 
Furthermore, the only time the cleaner looks at these reader locks and the 
timestamps in the names of these files is to prevent a race, as described in 
step W2 in the doc.

To determine whether a particular cached entry is stale, the cleaner first 
looks at the modification time of the directory that contains the cached file 
(more on the directory modification time later). Then the cleaner writes the 
cleaner lock to prevent clients from taking action, and proceeds to delete the 
cached entry. However, there is a window between the time the cleaner 
determines the cached entry is stale and the time it writes the cleaner lock. 
It is possible that a client comes in at just the right time and starts using 
this entry between those two points. For this reason, the cleaner needs to 
double-check whether there was any "recent" attempt to use this file before 
proceeding to delete, where "recent" means *very* recent (a few seconds ago at 
most). The moment the cleaner detects a recent reader lock, it recognizes this 
race and skips this directory.
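
To make the sequence concrete, here is a minimal sketch of that check. The 
lock file names (.cleaner_lock, .in_use.<timestamp>) and the overall structure 
are illustrative assumptions on my part, not the actual patch:

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative sketch only; lock names and structure are assumptions. */
class CleanerSketch {
  /** Returns true if the cached entry was deleted. */
  static boolean tryCleanEntry(FileSystem fs, Path entryDir,
      long staleMillis, long raceWindowMillis) throws IOException {
    long now = System.currentTimeMillis();
    // 1. Judge staleness from the entry directory's modification time.
    if (now - fs.getFileStatus(entryDir).getModificationTime() < staleMillis) {
      return false; // recently used; not a candidate
    }
    // 2. Write the cleaner lock so new clients back off.
    Path cleanerLock = new Path(entryDir, ".cleaner_lock");
    fs.createNewFile(cleanerLock);
    // 3. Double-check: did a reader lock appear between steps 1 and 2?
    for (FileStatus status : fs.listStatus(entryDir)) {
      String name = status.getPath().getName();
      if (name.startsWith(".in_use.")) {
        long lockTime = Long.parseLong(name.substring(".in_use.".length()));
        if (now - lockTime < raceWindowMillis) {
          fs.delete(cleanerLock, false);
          return false; // a client raced in; skip this directory
        }
      }
    }
    fs.delete(entryDir, true); // no recent reader; remove the cached entry
    return true;
  }
}
{code}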

So the primary reason that these reader locks (and the associated timestamps) 
are needed is to prevent this race.

Another way of thinking about this is, we could have easily come up with 
another idea that does not need this timestamp. We could rely on the directory 
modification time alone instead. The cleaner could check the directory 
modification time, write the cleaner lock, and then double-check the directory 
modification time to see if there was any recent attempt to use it in between. 
In fact, I'm thinking I may want to modify the design to take that approach, 
since the timestamp seems a bit confusing in terms of what it is used for. I'll 
probably make that change.
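
For illustration, that variant could look roughly like this (continuing with 
the fs and entryDir from the sketch above; it assumes the cleaner lock lives 
*outside* the entry directory, so that writing it does not itself bump the 
directory's modification time):

{code:java}
// Variant sketch: no reader-lock timestamps, directory mtime only.
long before = fs.getFileStatus(entryDir).getModificationTime();
// Assumption: the cleaner lock is a sibling file of the entry directory.
Path cleanerLock = new Path(entryDir.getParent(),
    entryDir.getName() + ".cleaner_lock");
fs.createNewFile(cleanerLock);
long after = fs.getFileStatus(entryDir).getModificationTime();
if (after != before) {
  fs.delete(cleanerLock, false); // a client came in; skip this entry
} else {
  fs.delete(entryDir, true);     // no use in between; safe to remove
}
{code}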

Now on to why I use the modification time of the directory that contains the 
cached entry; I'm talking about the modification time of the directory on 
*HDFS*. I am using it to detect whether a new client came in and started using 
the cached entry. Since a client is required to drop a reader lock into the 
directory, any use of the cached entry will update the modification timestamp 
of the containing directory. That makes it a good proxy for when the cached 
entry was last used. This timestamp gets updated one more time when the client 
removes the reader lock at the end.
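
In other words, the client-side protocol amounts to something like this (the 
lock naming is again a hypothetical of mine):

{code:java}
// Client-side sketch: every use of a cached entry drops a reader lock,
// which bumps the containing directory's mtime as a side effect.
Path readerLock = new Path(entryDir, ".in_use." + System.currentTimeMillis());
fs.createNewFile(readerLock); // dir mtime now says "in use as of now"
// ... the job localizes and runs against the cached jar ...
fs.delete(readerLock, false); // dir mtime now says "last used just now"
{code}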

I originally considered using another file (like .last_used) to keep track of 
the last used time of cached entries. However, we were concerned about the 
impact of adding another file for each cached entry and putting more pressure 
on the name node, so we settled on looking at the modification time of the 
containing directory instead.

The only consumer of this information is also the cleaner.

With that context, here are my comments on your points/questions.

{quote}
I'm thinking of the general case of permissions - just because the job client 
has access to the local files during job submission does not mean the user 
wants all those files available to anyone with cluster access. It's probably 
less of an issue in practice if this is limited to just jars, but it's 
definitely an issue if this is expanded to other distcache file types (e.g.: 
data files for something like a map-side join).
{quote}

Agreed. I need to add clarifying details here. If you set the boolean flag, the 
intent is that it covers only the job jar and the libjars; it does not apply to 
other files (for example, files passed via -files or -archives are excluded). 
Let me know if you think that's reasonable. I'll add that clarification.

{quote}
How is the case of orphaned temporary files any different than the orphaned 
read lock case? I would think the issue of staleness would apply there as well. 
If a temporary file is over a day old, it's highly likely to be orphaned. 
Nobody wants to wait a day to upload a distcache entry to HDFS, as it implies 
it would be on the same order of time to localize it later.
{quote}

I see this isn't explicitly stated in the doc, but checkAndUpload() may return 
the *temp file* instead of the jar file in some scenarios. This pertains to the error 
handling I talked about in my previous comment. So, it is possible that jobs 
may use these temp files directly. This is rather unlikely but in theory it is 
possible. In this case, if this temp file was prematurely deleted, it may lead 
to localization failures.

Having said that, I think it may be possible to modify the logic of 
checkAndUpload() slightly so that it always returns the intended jar. It would 
make the algorithm a bit more involved (as it would entail retrying the 
rename), but I might be able to make that change. If that can be done, then any closed 
temp files can be considered safe for clean-up.
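
Roughly, I'm picturing something like the following; the method name and 
structure are mine, not the current patch. The idea is to keep retrying the 
rename and never hand back the temp file:

{code:java}
// Sketch: always resolve to the final jar path, never the temp file.
static Path publishJar(FileSystem fs, Path tmp, Path jar, int maxRetries)
    throws IOException {
  for (int i = 0; i < maxRetries; i++) {
    if (fs.exists(jar)) {
      fs.delete(tmp, false); // another client won the race; use its copy
      return jar;
    }
    if (fs.rename(tmp, jar)) {
      return jar; // we published the jar ourselves
    }
    // rename failed, most likely because we lost a race; re-check and retry
  }
  throw new IOException("could not publish " + jar);
}
{code}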

{quote}
Speaking of long-running jobs, an alternative would be to use the YARN 
application ID (which clients grab just before submitting) as part of the read 
lock. Then the cleaner can query the ResourceManager to know for certain 
whether the job is still active.
{quote}

This is a great point. I think the issue of long-running apps wasn't fully 
explored in the current version of the design. If there are apps that run 
beyond the specified staleness value, then the cleaner could erroneously clean 
up the cached entry, and it may result in localization failures. I think this 
needs to be addressed.

As you mentioned, probably the best way to solve this cleanly is to add the 
YARN app id so that the cleaner knows whether the app is active or not.

We originally shied away from using the app id because I wanted to keep the 
cleaner as minimal as possible, relying only on HDFS. But this may be a 
compelling reason to introduce the use of the app id. I'll tinker with this and 
see if it works.
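
Something along these lines, assuming the app id gets encoded in the reader 
lock name. The parsing and the exact YarnClient calls here are my assumptions 
(older 2.x lines would use ConverterUtils.toApplicationId instead of 
ApplicationId.fromString):

{code:java}
// Sketch: the cleaner asks the RM whether the app holding a lock is live.
// conf is the cluster Configuration; appIdFromLockName was parsed earlier.
YarnClient yarn = YarnClient.createYarnClient();
yarn.init(conf);
yarn.start();
boolean active;
try {
  YarnApplicationState state = yarn
      .getApplicationReport(ApplicationId.fromString(appIdFromLockName))
      .getYarnApplicationState();
  active = state != YarnApplicationState.FINISHED
      && state != YarnApplicationState.FAILED
      && state != YarnApplicationState.KILLED;
} catch (ApplicationNotFoundException e) {
  active = false; // the RM has no record; treat the lock as orphaned
}
// if !active, the reader lock can be treated as stale regardless of age
{code}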

{quote}
That would not be OK. The last job to initiate a reference on a distcache file 
is not necessarily going to be the last one to relinquish that reference. Job A 
starts first but is long-running, and job B starts later but is very quick. We 
do not want to delete job A's reference because it's older than job B. 
Otherwise job A could easily fail after job B completes if new tasks (think 
reducers or failed maps) are later launched on nodes that have not localized 
those distcache entries yet.
{quote}

You're right that this type of inversion can happen. But I think there are also 
mitigating circumstances. This goes back to the observation that the only 
consumer of the reader locks is the cleaner, and that it looks at them only to 
determine if there is the aforementioned race. So even if the inversion occurs, 
it matters only if it coincides with that race occurring. I think this could be 
easily addressed by doing this reader-lock clean-up only on recently used 
cached entries.

At any rate, if we end up introducing the app id, this problem may be solved 
along with it.

{quote}
Is the directory timestamp that important? We're localizing files (jars in this 
case), not directories. A stable timestamp of the file being localized is key 
to preventing unnecessary re-localization, but I don't see why that would be 
changing.
{quote}

Hopefully the explanation above of why we use the modification time of the 
containing directory answers this question.

                
> truly shared cache for jars (jobjar/libjar)
> -------------------------------------------
>
>                 Key: HADOOP-9639
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9639
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: filecache
>    Affects Versions: 2.0.4-alpha
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: shared_cache_design.pdf
>
>
> Currently there is the distributed cache that enables you to cache jars and 
> files so that attempts from the same job can reuse them. However, sharing is 
> limited with the distributed cache because it is normally on a per-job basis. 
> On a large cluster, sometimes copying of jobjars and libjars becomes so 
> prevalent that it consumes a large portion of the network bandwidth, not to 
> mention defeating the purpose of "bringing compute to where data is". This 
> is wasteful because in most cases code doesn't change much across many jobs.
> I'd like to propose and discuss feasibility of introducing a truly shared 
> cache so that multiple jobs from multiple users can share and cache jars. 
> This JIRA is to open the discussion.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
