[
https://issues.apache.org/jira/browse/HADOOP-9639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745510#comment-13745510
]
Jason Lowe commented on HADOOP-9639:
------------------------------------
bq. Right now, it is a binary choice. If the binary key is set, the job jar and
libjars (if any) will be all sharable/shared. However, with the APIs you should
be able to have a finer-grained control. Is that acceptable? Could you give me
a scenario under which the client may want a finer-grained control? Is that
what you were getting at?
I'm thinking of the general case of permissions - just because the job client
has access to the local files during job submission does not mean the user
wants all those files available to anyone with cluster access. It's probably
less of an issue in practice if this is limited to just jars, but it's
definitely an issue if this is expanded to other distcache file types (e.g.,
data files for something like a map-side join).
{quote}
I've thought about cleaning up orphaned temporary files. The difficulty is to
determine authoritatively whether a certain temporary file is truly unused. One
could check whether the file is closed. But note that a closed temp file may be
in use (as a client may be using a temp file and it's being localized, etc.).
Otherwise, heuristics may become hairy; e.g. "old enough" (how old is old
enough?), etc.
{quote}
How is the case of orphaned temporary files any different than the orphaned
read lock case? I would think the issue of staleness would apply there as
well. If a temporary file is over a day old, it's highly likely to be
orphaned. Nobody wants to wait a day to upload a distcache entry to HDFS,
since that implies localizing it later would take on the same order of time.
Actually the temporary file case is a bit easier than the read lock case:
long-running jobs do exist, and a job that runs for a long time makes it much
harder to spot an orphaned read lock in a timely manner if the detection is
timestamp-based.
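As a rough sketch of the age-based heuristic for temporary files: the class name, method name, and the one-day threshold below are all illustrative assumptions, not part of any existing Hadoop API. A real cleaner would read the modification time from HDFS rather than take it as a parameter.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical age-based check for orphaned temporary uploads. */
final class TempFileStaleness {
    // Assumed threshold, taken from the "over a day old" rule of thumb.
    static final Duration MAX_TEMP_AGE = Duration.ofDays(1);

    /** Returns true if a temp file last modified at lastModified is likely orphaned at now. */
    static boolean isLikelyOrphaned(Instant lastModified, Instant now) {
        return Duration.between(lastModified, now).compareTo(MAX_TEMP_AGE) > 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2013-08-21T00:00:00Z");
        // A two-hour-old upload is still plausibly in progress.
        System.out.println(isLikelyOrphaned(now.minus(Duration.ofHours(2)), now));
        // A three-day-old temp file is almost certainly abandoned.
        System.out.println(isLikelyOrphaned(now.minus(Duration.ofDays(3)), now));
    }
}
```

The same check applied to read locks would misfire for any job that legitimately runs longer than the threshold, which is why the two cases differ.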
Speaking of long-running jobs, an alternative would be to use the YARN
application ID (which clients grab just before submitting) as part of the read
lock. Then the cleaner can query the ResourceManager to know for certain
whether the job is still active.
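Something along these lines is what I have in mind, as a minimal sketch only. `ReadLockCleaner` and `findOrphanedLocks` are hypothetical names, and the `Predicate<String>` stands in for whatever query the cleaner would make to the ResourceManager; the application ID strings are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical cleaner that decides lock liveness by application ID, not by age. */
final class ReadLockCleaner {
    /**
     * Returns the application IDs whose read locks can be removed.
     * isAppActive is an assumed stand-in for a ResourceManager lookup.
     */
    static List<String> findOrphanedLocks(List<String> lockOwnerAppIds,
                                          Predicate<String> isAppActive) {
        List<String> orphaned = new ArrayList<>();
        for (String appId : lockOwnerAppIds) {
            if (!isAppActive.test(appId)) {
                orphaned.add(appId);
            }
        }
        return orphaned;
    }

    public static void main(String[] args) {
        List<String> locks = Arrays.asList(
            "application_1377000000000_0001",   // long-running, still active
            "application_1377000000000_0002");  // finished or crashed
        Set<String> active = Collections.singleton("application_1377000000000_0001");
        System.out.println(findOrphanedLocks(locks, active::contains));
    }
}
```

With this scheme a lock held by a week-old but still-running job is never mistaken for an orphan, because its age is not consulted at all.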
bq. I think it should be safe to have the cleaner service remove all reader
locks except for the latest one.
That would not be OK. The last job to initiate a reference on a distcache file
is not necessarily going to be the last one to relinquish that reference.
Suppose job A starts first but is long-running, and job B starts later but is
very quick. We do not want to delete job A's reference just because it's older
than job B's. Otherwise job A could easily fail after job B completes if new
tasks (think reducers or failed maps) are later launched on nodes that have
not localized those distcache entries yet.
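The failure sequence can be walked through with a toy model. Everything here is hypothetical: locks are modeled as a map from job name to lock creation time, and `keepOnlyLatest` imitates the proposed "remove all reader locks except the latest" policy.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the "keep only the latest reader lock" policy. */
final class LatestLockOnlyDemo {
    /** Imitates the proposed cleaner: drop every lock except the newest one. */
    static Map<String, Long> keepOnlyLatest(Map<String, Long> locks) {
        String latest = null;
        for (Map.Entry<String, Long> e : locks.entrySet()) {
            if (latest == null || e.getValue() > locks.get(latest)) {
                latest = e.getKey();
            }
        }
        Map<String, Long> kept = new HashMap<>();
        if (latest != null) {
            kept.put(latest, locks.get(latest));
        }
        return kept;
    }

    /** Returns true if the cache entry ends up unreferenced while job A is still running. */
    static boolean jobALosesEntry() {
        Map<String, Long> locks = new HashMap<>();
        locks.put("jobA", 0L);   // long-running job, started first
        locks.put("jobB", 10L);  // quick job, started later
        Map<String, Long> afterCleaner = keepOnlyLatest(locks); // only jobB survives
        afterCleaner.remove("jobB");                            // jobB finishes and releases
        return afterCleaner.isEmpty();                          // entry now looks unused
    }

    public static void main(String[] args) {
        System.out.println("entry unreferenced while job A runs: " + jobALosesEntry());
    }
}
```

Once the entry looks unreferenced it is eligible for deletion, and any task of job A that later starts on a fresh node has nothing to localize.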
bq. We just need to be careful in selecting the directories to clean in this
manner as removing these files would also update the directory modification
time and "refresh" the directory artificially.
Is the directory timestamp that important? We're localizing files (jars in
this case), not directories. A stable timestamp of the file being localized is
key to preventing unnecessary re-localization, but I don't see why that would
be changing.
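To illustrate the distinction, here is a minimal sketch of a cache key built only from the file's own path and modification time. `LocalizedFileKey` is a hypothetical class, not the actual localization code; it only shows that a change to the parent directory's timestamp would leave such a key untouched.

```java
import java.util.Objects;

/** Hypothetical key identifying a localized file by path and file modification time only. */
final class LocalizedFileKey {
    private final String path;
    private final long fileModificationTime;

    LocalizedFileKey(String path, long fileModificationTime) {
        this.path = path;
        this.fileModificationTime = fileModificationTime;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof LocalizedFileKey)) {
            return false;
        }
        LocalizedFileKey that = (LocalizedFileKey) other;
        // The parent directory's timestamp is deliberately not part of the key.
        return fileModificationTime == that.fileModificationTime && path.equals(that.path);
    }

    @Override
    public int hashCode() {
        return Objects.hash(path, fileModificationTime);
    }

    public static void main(String[] args) {
        LocalizedFileKey before = new LocalizedFileKey("/sharedcache/a/lib.jar", 1000L);
        // Sibling files were removed from the directory; the jar itself is unchanged.
        LocalizedFileKey after = new LocalizedFileKey("/sharedcache/a/lib.jar", 1000L);
        System.out.println("re-localization needed: " + !before.equals(after));
    }
}
```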
> truly shared cache for jars (jobjar/libjar)
> -------------------------------------------
>
> Key: HADOOP-9639
> URL: https://issues.apache.org/jira/browse/HADOOP-9639
> Project: Hadoop Common
> Issue Type: New Feature
> Components: filecache
> Affects Versions: 2.0.4-alpha
> Reporter: Sangjin Lee
> Assignee: Sangjin Lee
> Attachments: shared_cache_design.pdf
>
>
> Currently there is the distributed cache that enables you to cache jars and
> files so that attempts from the same job can reuse them. However, sharing is
> limited with the distributed cache because it is normally on a per-job basis.
> On a large cluster, sometimes copying of jobjars and libjars becomes so
> prevalent that it consumes a large portion of the network bandwidth, not to
> speak of defeating the purpose of "bringing compute to where data is". This
> is wasteful because in most cases code doesn't change much across many jobs.
> I'd like to propose and discuss feasibility of introducing a truly shared
> cache so that multiple jobs from multiple users can share and cache jars.
> This JIRA is to open the discussion.