[
https://issues.apache.org/jira/browse/HDFS-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13794539#comment-13794539
]
Colin Patrick McCabe commented on HDFS-5096:
--------------------------------------------
bq. Looks like the TODO: support loading and storing snuck back in during a
rebase
Fixed.
bq. I'd lower the cachedBlocks GSet size to something like 1 or 2%. This makes
sense considering the blocksMap is set to 50% of heap, and there should be at
least an order of magnitude fewer cached blocks.
It's actually at 0.25% of the heap now. {{LightWeightGSet#computeCapacity}}
expresses capacities as a percentage of the maximum memory.
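For illustration, percentage-based sizing of this kind might look roughly like the sketch below: take a percentage of the maximum memory, divide by the entry reference size, and round down to a power of two. The names and the exact rounding are illustrative, not the actual {{LightWeightGSet#computeCapacity}} code.

```java
public class CapacitySketch {
  // Hedged sketch of percentage-based capacity sizing (illustrative only).
  static int computeCapacity(long maxMemoryBytes, double percentage,
                             int refSizeBytes) {
    // Bytes we are allowed to spend on references at this percentage.
    double budgetBytes = maxMemoryBytes * percentage / 100.0;
    long entries = (long) (budgetBytes / refSizeBytes);
    if (entries == 0) {
      return 0;
    }
    // Round down to the largest power of two that fits.
    int exponent = 0;
    while ((1L << (exponent + 1)) <= entries) {
      exponent++;
    }
    return 1 << exponent;
  }
}
```

For example, 0.25% of a 1 GiB heap with 8-byte references works out to a capacity of 262144 entries under this sketch.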
bq. Is adding the iterator to LightWeightCache necessary? Seems like that
should go on trunk instead, and there are no callers.
It's necessary to protect callers from getting undefined behavior when they
call {{Iterator#remove}}. It's just a simple wrapper, not a lot of code.
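The wrapper idea can be sketched as follows: delegate iteration, but route {{remove}} through a removal hook so the cache's bookkeeping stays consistent. The class and hook names here are illustrative, not the actual {{LightWeightCache}} code.

```java
import java.util.Iterator;
import java.util.function.Consumer;

// Illustrative sketch: a thin Iterator wrapper whose remove() keeps
// auxiliary bookkeeping in sync via a caller-supplied hook.
class RemovalAwareIterator<T> implements Iterator<T> {
  private final Iterator<T> inner;
  private final Consumer<T> onRemove;
  private T last;

  RemovalAwareIterator(Iterator<T> inner, Consumer<T> onRemove) {
    this.inner = inner;
    this.onRemove = onRemove;
  }

  @Override public boolean hasNext() { return inner.hasNext(); }

  @Override public T next() { return last = inner.next(); }

  @Override public void remove() {
    if (last == null) {
      throw new IllegalStateException("next() has not been called");
    }
    inner.remove();        // remove from the underlying structure first
    onRemove.accept(last); // then update any cache-side bookkeeping
    last = null;
  }
}
```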
bq. Seems like a generally useful class, could we put this in common?
Sure.
bq. Rather than randomly deciding to remove via iterator or pseudo-randomly,
let's just test each case definitively.
I use Random with a pre-defined seed, so it's deterministic.
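The determinism argument is just this: {{java.util.Random}} with a fixed seed produces the same sequence on every run, so a test that "randomly" picks between removal strategies is still reproducible. A minimal sketch (the seed and method names are illustrative):

```java
import java.util.Random;

class DeterministicChoice {
  // Produce n "random" booleans, e.g. iterator-remove vs. direct remove.
  // Fixing the seed makes the sequence identical across test runs.
  static boolean[] choose(int n, long seed) {
    Random r = new Random(seed);
    boolean[] picks = new boolean[n];
    for (int i = 0; i < n; i++) {
      picks[i] = r.nextBoolean();
    }
    return picks;
  }
}
```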
bq. [CachedBlock] should check for setting an invalid replication factor,
especially since we're re-using the top bit for the mark.
Added assert.
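The bit-packing constraint being asserted might look roughly like this sketch: if the top bit of a short is reserved for the mark, the replication factor must fit in the remaining 15 bits. This is illustrative, not the actual {{CachedBlock}} layout.

```java
// Hedged sketch of packing a replication factor and a mark bit into one
// short, with asserts guarding against invalid replication values.
class ReplicationAndMark {
  private short bits;

  void set(short replication, boolean mark) {
    assert replication >= 0 : "negative replication " + replication;
    // The top bit is reserved for the mark, so replication must fit in 15 bits.
    assert (replication & 0x8000) == 0 : "replication too large: " + replication;
    bits = (short) ((replication & 0x7fff) | (mark ? 0x8000 : 0));
  }

  short getReplication() { return (short) (bits & 0x7fff); }

  boolean getMark() { return (bits & 0x8000) != 0; }
}
```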
bq. I still think we should code share somehow with BlockInfo, having all this
triplets logic twice sucks. This could go into a trunk refactor patch.
I don't think it's really possible. BlockInfo already inherits from Block, and
Java doesn't support multiple inheritance. We can't use composition here
because the overhead of having another object inside each Block would be too
high. Perhaps we could refactor them both to support the same interface, but
it would be a big job.
bq. I think we should be taking the write lock, not the read lock in #rescan,
since in rescanCachedBlockMap we're modifying things, e.g. removeElement,
addNewPendingCached.
Yeah, I suppose by writing to cachedBlocks, we might be conflicting with other
threads that also hold the read lock, such as threads getting information
about cached replicas. I'll change it to take the write lock for now.
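The locking rule at issue can be sketched with a standard {{ReentrantReadWriteLock}}: a rescan that mutates shared state must hold the write lock, while query paths only need the read lock. The class below is a toy stand-in, not the actual CacheReplicationMonitor code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch of write-lock vs. read-lock usage during a rescan.
class RescanSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final List<String> cachedBlocks = new ArrayList<>();  // toy stand-in

  void rescan() {
    lock.writeLock().lock();  // write lock: we modify cachedBlocks below
    try {
      cachedBlocks.add("pendingCached:blk_1");  // illustrative mutation
    } finally {
      lock.writeLock().unlock();
    }
  }

  int countCached() {
    lock.readLock().lock();   // read lock suffices for queries
    try {
      return cachedBlocks.size();
    } finally {
      lock.readLock().unlock();
    }
  }
}
```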
bq. Javadoc incorrect on #rescanCachedBlockMap, those parameters don't exist
Fixed.
bq. Rename neededReplication to neededCached for uniformity
OK.
bq. Comment for neededReplication >= numCached needs to be updated (it's just
copy-pasted from the above)
OK.
bq. The neededCached case is passing in pendingUncached instead of pendingCached
Fixed, thanks.
bq. I think it's okay to populate the pending queues and do all the tracking
even in standby mode, since it'd be good to take over these duties immediately
on failover. We just shouldn't send any datanode operations.
Partly, I want to avoid having this thread run during cases like when
formatting the namespace, and running it only when active is a good way to do
that. Also, keep in mind that activating the CacheManager will kick off an
immediate rescan, so we won't have long to wait before we start sending off
cache directives after becoming active.
bq. processCacheReport, shouldReply is unused.
I got rid of all that; thanks.
bq. This merged class no longer uses the ReportProcessor class we refactored
out from BlockManager. I like ReportProcessor, but it's of dubious utility now.
We should either punt this change out to trunk, or revert it to keep our diff
down.
Good point. I'll try to sync us up with trunk here.
bq. Is there some way to enforce that a block is never present on multiple of a
datanode's CachedBlockLists?
A block can absolutely be on multiple of a datanode's CachedBlockLists. For
example, it can be on the 'cached' list and also on the 'pendingUncached' list,
if we know about it but want to get rid of it.
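That overlap can be pictured with the toy sketch below: the same block id legitimately sits on both the 'cached' list and the 'pendingUncached' list while an uncache is in flight. The list representation and method name are illustrative, not the actual datanode descriptor code.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: one block id on two of a datanode's lists at once.
class DatanodeListsSketch {
  final Set<Long> cached = new HashSet<>();
  final Set<Long> pendingUncached = new HashSet<>();

  void scheduleUncache(long blockId) {
    if (cached.contains(blockId)) {
      // The block stays on 'cached' until the datanode confirms the
      // uncache, so it is on both lists in the meantime.
      pendingUncached.add(blockId);
    }
  }
}
```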
bq. This would be clearer without prevCachedBlock
I added a comment clarifying how this works.
I'm going to add these comments now so that I don't lose them if I close the
window. Still have a few to address, I realize.
> Automatically cache new data added to a cached path
> ---------------------------------------------------
>
> Key: HDFS-5096
> URL: https://issues.apache.org/jira/browse/HDFS-5096
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: datanode, namenode
> Reporter: Andrew Wang
> Assignee: Colin Patrick McCabe
> Attachments: HDFS-5096-caching.005.patch, HDFS-5096-caching.006.patch
>
>
> For some applications, it's convenient to specify a path to cache, and have
> HDFS automatically cache new data added to the path without sending a new
> caching request or a manual refresh command.
> One example is new data appended to a cached file. It would be nice to
> re-cache a block at the new appended length, and cache new blocks added to
> the file.
> Another example is a cached Hive partition directory, where a user can drop
> new files directly into the partition. It would be nice if these new files
> were cached.
> In both cases, this automatic caching would happen after the file is closed,
> i.e. block replica is finalized.
--
This message was sent by Atlassian JIRA
(v6.1#6144)