[
https://issues.apache.org/jira/browse/HDFS-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13794539#comment-13794539
]
Colin Patrick McCabe commented on HDFS-5096:
--------------------------------------------
bq. Looks like the TODO: support loading and storing snuck back in during a
rebase
Fixed.
bq. I'd lower the cachedBlocks GSet size to something like 1 or 2%. This makes
sense considering the blocksMap is set to 50% of heap, and there should be at
least an order of magnitude fewer cached blocks.
It's actually at 0.25% of the heap now. {{LightWeightGSet#computeCapacity}}
expresses capacities as a percentage of the maximum memory.
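For illustration, percentage-based sizing of this kind might look roughly like the sketch below: take a percentage of the maximum memory, divide by the entry reference size, and round down to a power of two. The names and the exact rounding are illustrative, not the actual {{LightWeightGSet#computeCapacity}} code.

```java
public class CapacitySketch {
  // Hedged sketch of percentage-based capacity sizing (illustrative only).
  static int computeCapacity(long maxMemoryBytes, double percentage,
                             int refSizeBytes) {
    // Bytes we are allowed to spend on references at this percentage.
    double budgetBytes = maxMemoryBytes * percentage / 100.0;
    long entries = (long) (budgetBytes / refSizeBytes);
    if (entries == 0) {
      return 0;
    }
    // Round down to the largest power of two that fits.
    int exponent = 0;
    while ((1L << (exponent + 1)) <= entries) {
      exponent++;
    }
    return 1 << exponent;
  }
}
```

For example, 0.25% of a 1 GiB heap with 8-byte references works out to a capacity of 262144 entries under this sketch.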
bq. Is adding the iterator to LightWeightCache necessary? Seems like that
should go on trunk instead, and there are no callers.
It's necessary to protect callers from getting undefined behavior when they
call {{Iterator#remove}}. It's just a simple wrapper, not a lot of code.
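The wrapper idea can be sketched as follows: delegate iteration, but route {{remove}} through a removal hook so the cache's bookkeeping stays consistent. The class and hook names here are illustrative, not the actual {{LightWeightCache}} code.

```java
import java.util.Iterator;
import java.util.function.Consumer;

// Illustrative sketch: a thin Iterator wrapper whose remove() keeps
// auxiliary bookkeeping in sync via a caller-supplied hook.
class RemovalAwareIterator<T> implements Iterator<T> {
  private final Iterator<T> inner;
  private final Consumer<T> onRemove;
  private T last;

  RemovalAwareIterator(Iterator<T> inner, Consumer<T> onRemove) {
    this.inner = inner;
    this.onRemove = onRemove;
  }

  @Override public boolean hasNext() { return inner.hasNext(); }

  @Override public T next() { return last = inner.next(); }

  @Override public void remove() {
    if (last == null) {
      throw new IllegalStateException("next() has not been called");
    }
    inner.remove();        // remove from the underlying structure first
    onRemove.accept(last); // then update any cache-side bookkeeping
    last = null;
  }
}
```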
bq. Seems like a generally useful class, could we put this in common?
Sure.
bq. Rather than randomly deciding to remove via iterator or pseudo-randomly,
let's just test each case definitively.
I use Random with a pre-defined seed, so it's deterministic.
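The determinism argument is just this: {{java.util.Random}} with a fixed seed produces the same sequence on every run, so a test that "randomly" picks between removal strategies is still reproducible. A minimal sketch (the seed and method names are illustrative):

```java
import java.util.Random;

class DeterministicChoice {
  // Produce n "random" booleans, e.g. iterator-remove vs. direct remove.
  // Fixing the seed makes the sequence identical across test runs.
  static boolean[] choose(int n, long seed) {
    Random r = new Random(seed);
    boolean[] picks = new boolean[n];
    for (int i = 0; i < n; i++) {
      picks[i] = r.nextBoolean();
    }
    return picks;
  }
}
```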
bq. [CachedBlock] should check for setting an invalid replication factor,
especially since we're re-using the top bit for the mark.
Added assert.
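The bit-packing constraint being asserted might look roughly like this sketch: if the top bit of a short is reserved for the mark, the replication factor must fit in the remaining 15 bits. This is illustrative, not the actual {{CachedBlock}} layout.

```java
// Hedged sketch of packing a replication factor and a mark bit into one
// short, with asserts guarding against invalid replication values.
class ReplicationAndMark {
  private short bits;

  void set(short replication, boolean mark) {
    assert replication >= 0 : "negative replication " + replication;
    // The top bit is reserved for the mark, so replication must fit in 15 bits.
    assert (replication & 0x8000) == 0 : "replication too large: " + replication;
    bits = (short) ((replication & 0x7fff) | (mark ? 0x8000 : 0));
  }

  short getReplication() { return (short) (bits & 0x7fff); }

  boolean getMark() { return (bits & 0x8000) != 0; }
}
```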
bq. I still think we should code share somehow with BlockInfo, having all this
triplets logic twice sucks. This could go into a trunk refactor patch.
I don't think it's really possible. BlockInfo already inherits from Block, and
Java doesn't support multiple inheritance. We can't use composition here
because the overhead of having another object inside each Block would be too
high. Perhaps we could refactor them both to support the same interface, but
it would be a big job.
bq. I think we should be taking the write lock, not the read lock in #rescan,
since in rescanCachedBlockMap we're modifying things, e.g. removeElement,
addNewPendingCached.
Yeah, I suppose by writing to cachedBlocks, we might be conflicting with other
threads that also hold the read lock, such as threads getting information
about cached replicas. I'll change it to take the write lock for now.
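The locking rule at issue can be sketched with a standard {{ReentrantReadWriteLock}}: a rescan that mutates shared state must hold the write lock, while query paths only need the read lock. The class below is a toy stand-in, not the actual CacheReplicationMonitor code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch of write-lock vs. read-lock usage during a rescan.
class RescanSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final List<String> cachedBlocks = new ArrayList<>();  // toy stand-in

  void rescan() {
    lock.writeLock().lock();  // write lock: we modify cachedBlocks below
    try {
      cachedBlocks.add("pendingCached:blk_1");  // illustrative mutation
    } finally {
      lock.writeLock().unlock();
    }
  }

  int countCached() {
    lock.readLock().lock();   // read lock suffices for queries
    try {
      return cachedBlocks.size();
    } finally {
      lock.readLock().unlock();
    }
  }
}
```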
bq. Javadoc incorrect on #rescanCachedBlockMap, those parameters don't exist
Fixed.
bq. Rename neededReplication to neededCached for uniformity
OK.
bq. Comment for neededReplication >= numCached needs to be updated (it's just
copy-pasted from the above)
OK.
bq. The neededCached case is passing in pendingUncached instead of pendingCached
Fixed, thanks.
bq. I think it's okay to populate the pending queues and do all the tracking
even in standby mode, since it'd be good to take over these duties immediately
on failover. We just shouldn't send any datanode operations.
Partly, I want to avoid having this thread run during cases like when
formatting the namespace, and running it only when active is a good way to do
that. Also, keep in mind that activating the CacheManager will kick off an
immediate rescan, so we won't have long to wait before we start sending off
cache directives after becoming active.
bq. processCacheReport, shouldReply is unused.
I got rid of all that; thanks.
bq. This merged class no longer uses the ReportProcessor class we refactored
out from BlockManager. I like ReportProcessor, but it's of dubious utility now.
We should either punt this change out to trunk, or revert it to keep our diff
down.
Good point. I'll try to sync us up with trunk here.
bq. Is there some way to enforce that a block is never present on multiple of a
datanode's CachedBlockLists?
A block can absolutely be on multiple of a datanode's CachedBlockLists. For
example, it can be on the 'cached' list and also on the 'pendingUncached' list,
if we know about it but want to get rid of it.
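That overlap can be pictured with the toy sketch below: the same block id legitimately sits on both the 'cached' list and the 'pendingUncached' list while an uncache is in flight. The list representation and method name are illustrative, not the actual datanode descriptor code.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: one block id on two of a datanode's lists at once.
class DatanodeListsSketch {
  final Set<Long> cached = new HashSet<>();
  final Set<Long> pendingUncached = new HashSet<>();

  void scheduleUncache(long blockId) {
    if (cached.contains(blockId)) {
      // The block stays on 'cached' until the datanode confirms the
      // uncache, so it is on both lists in the meantime.
      pendingUncached.add(blockId);
    }
  }
}
```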
bq. This would be clearer without prevCachedBlock
I added a comment clarifying how this works.
I'm going to add these comments now so that I don't lose them if I close the
window. Still have a few to address, I realize.
> Automatically cache new data added to a cached path
> ---------------------------------------------------
>
> Key: HDFS-5096
> URL: https://issues.apache.org/jira/browse/HDFS-5096
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: datanode, namenode
> Reporter: Andrew Wang
> Assignee: Colin Patrick McCabe
> Attachments: HDFS-5096-caching.005.patch, HDFS-5096-caching.006.patch
>
>
> For some applications, it's convenient to specify a path to cache, and have
> HDFS automatically cache new data added to the path without sending a new
> caching request or a manual refresh command.
> One example is new data appended to a cached file. It would be nice to
> re-cache a block at the new appended length, and cache new blocks added to
> the file.
> Another example is a cached Hive partition directory, where a user can drop
> new files directly into the partition. It would be nice if these new files
> were cached.
> In both cases, this automatic caching would happen after the file is closed,
> i.e. block replica is finalized.
--
This message was sent by Atlassian JIRA
(v6.1#6144)