[ 
https://issues.apache.org/jira/browse/HDFS-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13793207#comment-13793207
 ] 

Andrew Wang commented on HDFS-5096:
-----------------------------------

This is only a partial review, but I wanted to dump my review comments thus far 
before the weekend. Mostly small comments, still wrapping my head around all 
the changes. Overall it feels pretty good though.

Nitty:
* Looks like the {{TODO: support loading and storing}} snuck back in during a 
rebase
* I'd lower the {{cachedBlocks}} GSet size to something like 1 or 2%. This 
makes sense considering the blocksMap is set to 50% of heap, and there should 
be at least an order of magnitude fewer cached blocks.
* Is adding the iterator to LightWeightCache necessary? There are no callers, 
so it seems like that change should go into trunk instead.
* Still some uncapitalized log/exception messages. I used this to check:
{code}
git diff | egrep "\+(.*)(Exception|LOG)(.*)\(\"[a-z]"
{code}

IntrusiveList:
* Seems like a generally useful class, could we put this in common?
* It'd be nice to reuse Java's {{List}} interface as much as possible for 
consistency.
** Instead of {{addToFront}} and {{addToBack}}, it looks like we're only using 
{{addToBack}} (thus {{add}}). {{List}} suggests {{add(int, Element)}} for a 
positional add, which is an optional operation.
** {{removeElement}} returns the next element, which isn't part of the normal 
{{remove}} contract. It could also be named simply {{remove}}.
** {{getLength}} to {{size}}
** etc etc
* Actually, since there are no {{get}}, {{set}}, or {{indexOf}} methods, this 
isn't really a list, right? It feels more like a double-ended queue 
({{Deque}}). If that interface matches your intent better, maybe copy it 
instead.
* Need javadoc on all public methods
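
To make the naming concrete, here's a minimal sketch of what a {{Deque}}-flavored intrusive list could look like with java.util-style names ({{addLast}}, {{remove}}, {{size}}); {{IntrusiveDeque}}, {{Element}}, and {{Node}} here are illustrative, not the patch's classes:

```java
// Illustrative sketch only: a Deque-flavored intrusive list with
// java.util-style naming. Not the patch's IntrusiveList.
class IntrusiveDeque<E extends IntrusiveDeque.Element<E>> {
  /** Elements embed their own prev/next links (the "intrusive" part). */
  interface Element<E> {
    E getPrev(); void setPrev(E prev);
    E getNext(); void setNext(E next);
  }

  private E first, last;
  private int size;

  /** Appends to the back in O(1), mirroring Deque#addLast. */
  void addLast(E e) {
    e.setPrev(last);
    e.setNext(null);
    if (last != null) {
      last.setNext(e);
    } else {
      first = e;
    }
    last = e;
    size++;
  }

  /** Unlinks e in O(1); no return value, per the usual remove contract. */
  void remove(E e) {
    E prev = e.getPrev(), next = e.getNext();
    if (prev != null) { prev.setNext(next); } else { first = next; }
    if (next != null) { next.setPrev(prev); } else { last = prev; }
    e.setPrev(null);
    e.setNext(null);
    size--;
  }

  /** Named size() rather than getLength(), to match java.util. */
  int size() { return size; }

  E peekFirst() { return first; }

  /** Illustrative element type, used only to exercise the sketch. */
  static class Node implements Element<Node> {
    private Node prev, next;
    public Node getPrev() { return prev; }
    public void setPrev(Node p) { prev = p; }
    public Node getNext() { return next; }
    public void setNext(Node n) { next = n; }
  }
}
```

Matching the {{Deque}} vocabulary also means callers already know the contract without reading extra javadoc.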

TestCachedBlockImplicitList:
* Should probably be named {{TestCachedBlocksLists}} and moved to the same 
package as {{CachedBlocksList}}. Or you could test a more generic 
implementation.
* Rather than pseudo-randomly deciding whether to remove via the iterator or 
directly, let's just test each case deterministically.
* The test lists aren't big enough to make the random testing effective. I'd 
expect to see O(1000) elements, not 4.
* How about some more elaborate random tests that do things like addition, 
removal, and traversal O(1000) times? Can compare against a {{LinkedList}} or 
something.
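
As a sketch of that oracle-style random test: drive the structure under test and a {{LinkedList}} with the same seeded sequence of O(1000) operations and compare after each step ({{ArrayDeque}} just stands in for the intrusive list here, and the names are made up):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedList;
import java.util.Random;

// Sketch of an oracle-style randomized test: apply the same operations to
// the list under test and to java.util.LinkedList, then compare.
// ArrayDeque is a stand-in for the intrusive list under test.
class RandomListOracleSketch {
  static void run(long seed, int ops) {
    Random random = new Random(seed);  // fixed seed => reproducible failures
    Deque<Integer> underTest = new ArrayDeque<>();
    LinkedList<Integer> oracle = new LinkedList<>();
    for (int i = 0; i < ops; i++) {
      if (oracle.isEmpty() || random.nextBoolean()) {
        int v = random.nextInt();
        underTest.addLast(v);
        oracle.addLast(v);
      } else {
        int got = underTest.removeFirst();
        int want = oracle.removeFirst();
        if (got != want) {
          throw new AssertionError("element mismatch at op " + i);
        }
      }
      if (underTest.size() != oracle.size()) {
        throw new AssertionError("size mismatch at op " + i);
      }
    }
  }
}
```

With a fixed seed, a failing run is reproducible, which keeps the O(1000)-operation test debuggable.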

CachedBlock:
* Although it has a {{triplets}} array, it's formatted differently from the 
triplets array in {{BlockInfo}}. CachedBlock goes {{prev, item, next}}, while 
{{BlockInfo}} goes {{item, prev, next}}. This is pretty confusing; let's make 
it like {{BlockInfo}}.
* The format and contents of {{triplets}} should also be explained in a comment.
* Having the Types for a cached block is also unfortunate
* Javadoc for {{getDatanodes}} mentions cached or planning to be cached; should 
also mention planning to be uncached, or just say "of a certain type".
* A little comment before {{setNext}} and {{getNext}} saying they're for 
LightWeightGSet would be nice.
* Class javadoc should explain how GSet and IntrusiveList are used, took me a 
while to understand the backreferences to the containing list. It's tricky code.
* The overrides for Element accidentally increased visibility to public from 
package-protected
* We should check for setting an invalid replication factor, especially since 
we're re-using the top bit for the mark.
* I still think we should code share somehow with BlockInfo, having all this 
triplets logic twice sucks. This could go into a trunk refactor patch.
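
For the replication-factor check, something along these lines could work (a sketch with illustrative names; if bit 15 of the short doubles as the mark, replication must fit in the low 15 bits):

```java
// Sketch of guarding the replication factor when the top bit of the short
// doubles as the mark flag. Names and layout are illustrative, not the
// actual CachedBlock code.
class ReplicationMarkSketch {
  private short replicationAndMark;  // bit 15 = mark, bits 0-14 = replication

  void setReplication(short replication) {
    // Reject anything that would clobber the mark bit.
    if (replication < 0) {
      throw new IllegalArgumentException(
          "invalid replication factor " + replication +
          "; must fit in 15 bits since the top bit stores the mark");
    }
    replicationAndMark =
        (short) ((replicationAndMark & 0x8000) | replication);
  }

  short getReplication() {
    return (short) (replicationAndMark & 0x7fff);
  }

  void setMark(boolean mark) {
    replicationAndMark = (short) (mark
        ? (replicationAndMark | 0x8000)
        : (replicationAndMark & 0x7fff));
  }

  boolean getMark() {
    return (replicationAndMark & 0x8000) != 0;
  }
}
```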

CacheReplicationMonitor:
* I think we should be taking the write lock, not the read lock in {{#rescan}}, 
since in {{rescanCachedBlockMap}} we're modifying things, e.g. 
{{removeElement}}, {{addNewPendingCached}}.
* The javadoc on {{#rescanCachedBlockMap}} is incorrect; those parameters 
don't exist.
* Rename {{neededReplication}} to {{neededCached}} for uniformity
* The comment for {{neededReplication >= numCached}} needs to be updated (it's 
just copy-pasted from the one above).
* The {{neededCached}} case is passing in {{pendingUncached}} instead of 
{{pendingCached}}
* I think it's okay to populate the pending queues and do all the tracking even 
in standby mode, since it'd be good to take over these duties immediately on 
failover. We just shouldn't send any datanode operations.
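
On the locking point, the shape I'd expect is roughly this (a sketch with made-up names, not the actual {{CacheReplicationMonitor}} code):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch: rescan() should take the write lock, not the read lock, because
// rescanCachedBlockMap() mutates shared state (e.g. removeElement,
// addNewPendingCached). Names are illustrative only.
class RescanSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  int mutations = 0;

  void rescan() {
    lock.writeLock().lock();  // write lock: we modify state below
    try {
      rescanCachedBlockMap();
    } finally {
      lock.writeLock().unlock();
    }
  }

  private void rescanCachedBlockMap() {
    mutations++;  // stands in for removeElement / addNewPendingCached
  }
}
```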

CacheManager:
* In {{processCacheReport}}, {{shouldReply}} is unused.
* This merged class no longer uses the ReportProcessor class we refactored out 
from BlockManager. I like ReportProcessor, but it's of dubious utility now. We 
should either punt this change out to trunk, or revert it to keep our diff down.
* Is there some way to enforce that a block is never present on more than one 
of a datanode's {{CachedBlocksList}}s?

{code}
CachedBlock cachedBlock =
    new CachedBlock(block.getBlockId(), (short)0, false);
CachedBlock prevCachedBlock = cachedBlocks.get(cachedBlock);
if (prevCachedBlock != null) {
  cachedBlock = prevCachedBlock;
{code}
* This would be clearer without {{prevCachedBlock}}
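
For instance, dropping the extra local (a sketch; {{CachedBlockStub}} and a plain {{HashMap}} stand in for {{CachedBlock}} and the {{cachedBlocks}} GSet):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the get-or-create lookup without the prevCachedBlock local.
// CachedBlockStub and the HashMap are stand-ins, not the real classes.
class GetOrCreateSketch {
  static class CachedBlockStub {
    final long blockId;
    CachedBlockStub(long blockId) { this.blockId = blockId; }
  }

  static final Map<Long, CachedBlockStub> cachedBlocks = new HashMap<>();

  /** Returns the tracked instance, creating and registering it if absent. */
  static CachedBlockStub getOrCreate(long blockId) {
    CachedBlockStub cachedBlock = cachedBlocks.get(blockId);
    if (cachedBlock == null) {
      cachedBlock = new CachedBlockStub(blockId);
      cachedBlocks.put(blockId, cachedBlock);
    }
    return cachedBlock;
  }
}
```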

Tests:
* Test adding a PCE for a path with repl 1 and verify; add a new PCE with repl 
3 for the same path and verify; remove the repl 3 PCE and verify.
* Different namespace operations for paranoia: create, delete, rename
* Writing more tests in general would be a good way of staying busy, since I 
still need more time to review thoroughly

> Automatically cache new data added to a cached path
> ---------------------------------------------------
>
>                 Key: HDFS-5096
>                 URL: https://issues.apache.org/jira/browse/HDFS-5096
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, namenode
>            Reporter: Andrew Wang
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-5096-caching.005.patch, HDFS-5096-caching.006.patch
>
>
> For some applications, it's convenient to specify a path to cache, and have 
> HDFS automatically cache new data added to the path without sending a new 
> caching request or a manual refresh command.
> One example is new data appended to a cached file. It would be nice to 
> re-cache a block at the new appended length, and cache new blocks added to 
> the file.
> Another example is a cached Hive partition directory, where a user can drop 
> new files directly into the partition. It would be nice if these new files 
> were cached.
> In both cases, this automatic caching would happen after the file is closed, 
> i.e. block replica is finalized.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
