[
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725704#comment-13725704
]
David S. Wang commented on HDFS-4949:
-------------------------------------
HDFS-4949 meeting, July 29, 2013 2 PM @ Hortonworks office
====
Attendees
====
Aaron T. Myers
Andrew Wang
Arpit Gupta
Bikas Saha
Brandon Li
Colin McCabe
Dave Wang
Jing Zhao
Suresh Srinivas
Sanjay Radia
Todd Lipcon
Vinod Kumar Vavilapalli
Minutes
====
* General agreement to hold HDFS-2832 meeting some other day
* Andrew: Posted HDFS-4949 design doc upstream; Sanjay has read this, agrees
with the goals
Data path (zero-copy reads)
----
* Sanjay: quota mgmt - counted up front, not after cache is populated
* Colin: talking about ZCR (mmap) - used to implement caching at the DN level
** Considered copying everything into /dev/shm (e.g. Tachyon). But cannot
cache parts of a file, so limits our flexibility. Also, the associated fd
gives clients a way to control memory mgmt (will not release until that
descriptor is closed), which is not good because of buggy clients etc.
** Sanjay: you want an abstraction for a durable file. Colin: yes.
** Colin: ZCR currently doesn't have checksums, but will. Todd: assumption is
that DN will do the cksum when doing the mlock and communicate that to the
client so the client knows that it's safe to read.
** Todd: mincore() can tell you what's already in cache, but it's too granular,
very expensive to call, and can be out-of-date immediately.
** Assuming this is for local clients only obviously.
** ZCR uses ByteBuffers to avoid copies. Not entirely compatible with the
current DFSClient, since that uses byte arrays, which force a copy.
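The ByteBuffer point can be sketched in plain Java NIO (this is illustrative setup, not the HDFS client API): a MappedByteBuffer from FileChannel.map() is a view onto the kernel's page cache, while read(byte[]) must copy those pages into the Java heap.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZcrSketch {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("zcr", ".dat");
        Files.write(file, "hello zero-copy".getBytes(StandardCharsets.UTF_8));

        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // Copying path: read(byte[]/heap buffer) copies page-cache data
            // into the Java heap -- what the byte-array DFSClient API forces.
            ByteBuffer heap = ByteBuffer.allocate((int) ch.size());
            ch.read(heap);
            heap.flip();

            // Zero-copy path: map() returns a view onto the same pages the
            // kernel caches; no copy into the Java heap is made.
            MappedByteBuffer mapped =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());

            byte[] a = new byte[heap.remaining()];
            heap.get(a);
            byte[] b = new byte[mapped.remaining()];
            mapped.get(b);
            // Both paths see the same bytes; only the copying differs.
            System.out.println(new String(a, StandardCharsets.UTF_8)
                    .equals(new String(b, StandardCharsets.UTF_8)));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

This also illustrates the fd point above: the mapping stays valid (and pins the client's view) until the buffer is released.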
* This may have a conflict with inline checksums. Clients would have to be
aware of how to skip over checksums, and this would have to be in the app, not
the client, since we're talking mmap.
** HBase gets around this by disabling HDFS-level cksums, and doing it on their
level.
** Sanjay: QFS puts all of the cksums in the beginning of the file
** Todd: Liang Xie had an HBase study where he figured out that perf didn't
improve until he got to a TB of data, when the cksum files themselves dropped
out of cache.
* ZCR API can be made public? Colin, Todd: Yes.
** Hard to compete with Spark if this isn't public.
** Suresh: Will the app know if you got ZCR? Can be added as counters. Colin:
already have similar concepts today for SCR on a per-stream basis.
** Todd: SCR is fully transparent (uses today's API), while ZCR requires new
client API.
* Sanjay: Current policy is manual. Later policy can have system automatically
cache hot files. Need the fallback buffer in case you are remote.
** Todd: high perf apps will always use the ZCR API. Sometimes it will fall
back to a normal read, so no worse than today.
** Colin: should we have a flag that basically says "always mmap"? Can add it
later, don't know how useful this could be.
* Colin: no native support required for ZCR beyond what is there today. There
are some libhdfs changes, but not completely required. Java has mmap today.
** We will need a native call for locking though.
Centralized cache mgmt
----
* Andrew gave whiteboard presentation
** DN has mlock hooks, ulimit conf of how much it can cache
** NN sends heartbeats to DN with cache/uncache commands on whole blocks
** DN will send cache state to NN similar to block reports
** clients call getFileBlockLocations() with storageType arg. This returns the
current state of the cache.
** clients can issue CachingRequests, with a path that points to a file or
directory. If directory, then what is cached is what's in that directory (but
not recurse to subdirs), in order to support Hive. Can also specify user, pool
for quota mgmt. Can also specify # cache copies (must be <= replicationFactor).
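The CachingRequest shape sketched on the whiteboard might look like the following; all class and field names here are hypothetical, since the minutes do not fix an API:

```java
import java.util.Objects;

public class CachingRequestSketch {
    // Hypothetical request carrying what the meeting described: a path (file,
    // or directory cached non-recursively for the Hive case), a pool for
    // quota accounting, and a cache-copy count bounded by replication.
    static final class CachingRequest {
        final String path;         // file or directory to cache
        final String pool;         // pool charged for quota accounting
        final short cacheReplicas; // number of cached copies

        CachingRequest(String path, String pool,
                       short cacheReplicas, short replicationFactor) {
            if (cacheReplicas > replicationFactor) {
                throw new IllegalArgumentException("# cache copies ("
                        + cacheReplicas + ") must be <= replication factor ("
                        + replicationFactor + ")");
            }
            this.path = Objects.requireNonNull(path);
            this.pool = Objects.requireNonNull(pool);
            this.cacheReplicas = cacheReplicas;
        }
    }

    public static void main(String[] args) {
        CachingRequest r =
                new CachingRequest("/warehouse/t1", "etl", (short) 2, (short) 3);
        System.out.println(r.path + " cached in pool " + r.pool
                + " with " + r.cacheReplicas + " copies");
    }
}
```

The constructor check is the one invariant the minutes state explicitly: cache copies must not exceed the replication factor.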
* Quotas
** Quotas are on pools, not users. Quotas enforced on the NN.
** Questions about what is cached as machines come and go? Use
getFileBlockLocations() to get the cache request and current status of
fulfillment. May be unfulfilled due to quota, for instance.
*** Should cache requests from two pools be counted fully against both?
Half-half? Cluster capacity can be dynamic, so you always have potential quota
mgmt problems.
** Don't want to get this so complicated that you're basically implementing
another scheduler just for cache quotas.
** Suresh: Resource failures - how does this affect the pools? Should we have
priorities for pools? Priorities for individual CacheRequests?
** Andrew: suggestion of min/max/share (similar to VMware ESX VM memory
configuration).
** Suresh: fine with doing something very basic, and then be more intelligent
later.
** Sanjay: need to have some idea of per-pool priority to enforce min, to
figure out what to evict from the cache first in mem-constrained scenarios.
Also what happens once we have resources again?
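Andrew's min/max/share suggestion could work roughly as below; this is a hypothetical sketch of one eviction policy consistent with the discussion (pools at or under their guaranteed minimum are safe; above it, higher shares delay eviction), not a committed design:

```java
import java.util.Comparator;
import java.util.List;

public class PoolQuotaSketch {
    // Illustrative cache pool with the ESX-style min/max/share controls
    // floated in the meeting. All names are hypothetical.
    static final class Pool {
        final String name;
        final long minBytes;  // guaranteed cache memory
        final long maxBytes;  // hard cap on cache memory
        final long shares;    // relative weight for memory above the minimum
        final long usedBytes; // currently cached

        Pool(String name, long minBytes, long maxBytes,
             long shares, long usedBytes) {
            this.name = name;
            this.minBytes = minBytes;
            this.maxBytes = maxBytes;
            this.shares = shares;
            this.usedBytes = usedBytes;
        }
    }

    // Under memory pressure, evict first from the pool that most exceeds its
    // guaranteed minimum, discounted by its share weight. Pools at or below
    // their minimum are never victims, which enforces min.
    static Pool evictionCandidate(List<Pool> pools) {
        return pools.stream()
                .filter(p -> p.usedBytes > p.minBytes)
                .max(Comparator.comparingDouble(
                        p -> (double) (p.usedBytes - p.minBytes) / p.shares))
                .orElse(null);
    }

    public static void main(String[] args) {
        Pool hive = new Pool("hive", 10L << 30, 40L << 30, 1000, 30L << 30);
        Pool adhoc = new Pool("adhoc", 5L << 30, 20L << 30, 100, 10L << 30);
        // adhoc is 5 GB over min with few shares, so it is evicted first,
        // even though hive is further over its min in absolute bytes.
        System.out.println(evictionCandidate(List.of(hive, adhoc)).name);
    }
}
```

"What happens once we have resources again" would be the inverse step: re-admit cached data for pools furthest below their share-weighted entitlement.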
* Suresh: don't cache files being written to? ATM, others: yes.
* Cache entire files vs. shares of files in a request that can't be 100%
satisfied? Yes.
* Interaction with symlinks? Discuss on JIRA.
* Suresh: only cache something that is local to DN? Yes.
* Sanjay: should block reports include storage types? ATM: Treat RAM as
a special case at least for now (e.g. have a bit that indicates that this
replica is in RAM).
* Sanjay: leases for the CacheRequests? Todd: CacheRequests could have their
status listed as an API call, not a lease. Otherwise clients have to keep
requesting renewals, which sucks for existing folks like Hive.
** Suresh: how about an expiry time in the CacheRequest?
* Sanjay: Question about how big the RPCs are from NN -> DN
** Todd: assuming 10 MB blocks, and 100 GB cache memory for a DN, then max of
10K blocks in an RPC. But will be smaller due to incremental reports.
** Suresh: perhaps push down idea of collection id to DN, so NN can just say
"evict this collection" instead of having to send evicts for all of the blocks.
Can be addressed later in a compatible fashion.
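Todd's back-of-envelope above works out as follows (the block and cache sizes are the meeting's assumed figures, not measured values):

```java
public class RpcSizeEstimate {
    public static void main(String[] args) {
        // Assumed figures from the meeting: 10 MB blocks, 100 GB of cache
        // memory per DataNode. A full cache report then names about 10K
        // blocks; incremental reports would carry far fewer.
        long blockBytes = 10L * 1024 * 1024;         // 10 MB
        long cacheBytes = 100L * 1024 * 1024 * 1024; // 100 GB
        long maxBlocks = cacheBytes / blockBytes;
        System.out.println(maxBlocks + " blocks max per full report"); // 10240
    }
}
```

At roughly 10K block IDs per message, a full report stays a modest RPC, which is why the collection-id optimization can safely be deferred.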
* Suresh: how can we help?
** Colin: review on the ZCR would be great.
** Andrew: HDFS-4949: will start churning on this ASAP, over the next month.
Could use some help with testing.
* Suresh: branch to put this in? 2.3? That's where all of the feature work is
going.
> Centralized cache management in HDFS
> ------------------------------------
>
> Key: HDFS-4949
> URL: https://issues.apache.org/jira/browse/HDFS-4949
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode, namenode
> Affects Versions: 3.0.0, 2.3.0
> Reporter: Andrew Wang
> Assignee: Andrew Wang
> Attachments: caching-design-doc-2013-07-02.pdf
>
>
> HDFS currently has no support for managing or exposing in-memory caches at
> datanodes. This makes it harder for higher level application frameworks like
> Hive, Pig, and Impala to effectively use cluster memory, because they cannot
> explicitly cache important datasets or place their tasks for memory locality.