[jira] [Commented] (HDFS-5096) Automatically cache new data added to a cached path

Andrew Wang (JIRA) Thu, 10 Oct 2013 17:42:04 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792166#comment-13792166
 ]


Andrew Wang commented on HDFS-5096:
-----------------------------------

Hey Colin, thanks for the writeup, it helped me understand the goals here. Only 
a cursory review since it doesn't compile for me right now:

* Getting compile errors right now in BlockManager and FSNamesystem
* Can we split this up? e.g. the merging of CacheManager with 
CacheReplicationManager, the datanode blockid changes, other stuff? It's big 
right now, since it rewrites both the NN logic and cached block tracking.
* Please capitalize log and exception messages, e.g. {{LOG.info("starting 
CacheReplicationMonitor with interval " +}}, {{throw new 
RuntimeException("called getPrev on an element that wasn't " +}}
* typo in CacheManager: {{   * Whether the caching is enabled.}}
* In {{TestDatanodeConfig}}, I notice we have a {{return}} if the memlockLimit 
is Long.MAX_VALUE. Can we make this an {{assumeTrue}} instead while you're 
there?
* I expected to see some new tests involving changing replication factors, 
caching of directories, adding new files to a cached directory, etc. Did I just 
miss this?
* CacheManager: rename {{setActive}} and {{setInactive}} to {{activate}} and 
{{deactivate}} or {{start}} and {{stop}}?

{code}
      if ((datanode != null) && 
          ((!pendingCached.contains(datanode)) &&
          ((corrupt == null) || (!corrupt.contains(datanode))))) {
{code}
* The condition above is hard to parse.
* Looks like there's some duplication between {{BlockInfo}} and {{CachedBlock}} 
now with the triplets. Can we share code here? This is especially important 
since the triplets code is confusing.
* Please add javadoc for the {{addNewPending*}} methods in CRMon.
* Where did {{CacheReplicationPolicy}} go? Looks like {{CRMon}} is just 
choosing randomly now. {{CRPolicy}} was doing something smarter than that, 
taking into consideration remaining cache capacity on the DN. Having the policy 
in a separate class is also nice logically.
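
To illustrate the point about keeping placement in its own class, here is a minimal sketch of a capacity-aware chooser. It is not the {{CacheReplicationPolicy}} code from the patch: the class name, the {{Node}} type, its {{cacheRemainingBytes}} field, and the "most remaining cache first" ordering are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch only; none of these names come from HDFS.
final class CapacityAwareCachePolicySketch {

  /** Stand-in for a datanode descriptor; both fields are assumed. */
  static final class Node {
    final String name;
    final long cacheRemainingBytes;

    Node(String name, long cacheRemainingBytes) {
      this.name = name;
      this.cacheRemainingBytes = cacheRemainingBytes;
    }
  }

  /**
   * Returns up to numNeeded candidates that have room for blockBytes,
   * ordered by remaining cache capacity, largest first.
   */
  static List<Node> chooseTargets(List<Node> candidates, long blockBytes,
      int numNeeded) {
    List<Node> eligible = new ArrayList<>();
    for (Node n : candidates) {
      // Skip nodes that cannot fit the block at all.
      if (n.cacheRemainingBytes >= blockBytes) {
        eligible.add(n);
      }
    }
    eligible.sort(Comparator
        .comparingLong((Node n) -> n.cacheRemainingBytes).reversed());
    int count = Math.min(Math.max(numNeeded, 0), eligible.size());
    return new ArrayList<>(eligible.subList(0, count));
  }

  public static void main(String[] args) {
    List<Node> nodes = new ArrayList<>();
    nodes.add(new Node("dn1", 64L));
    nodes.add(new Node("dn2", 512L));
    nodes.add(new Node("dn3", 256L));
    for (Node n : chooseTargets(nodes, 128L, 2)) {
      System.out.println(n.name + " " + n.cacheRemainingBytes);
    }
  }
}
```

With the policy isolated like this, the monitor only asks for targets, and a random or capacity-aware strategy can be swapped without touching the scanning logic.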

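One way the nested condition quoted in the review could be made easier to read is to pull it into a named helper with early returns. The sketch below is meant to be logically equivalent to the quoted expression; the helper name, the generic element type, and the use of {{Collection}} are assumptions, since the real types in the patch are not shown here.

```java
import java.util.Collection;

// Hypothetical helper; the name and parameter types are assumptions.
final class CacheTargetCheckSketch {

  /**
   * True when datanode is non-null, is not already pending, and is not
   * listed as corrupt. A null corrupt collection means "nothing corrupt".
   */
  static <T> boolean isUsableTarget(T datanode, Collection<T> pendingCached,
      Collection<T> corrupt) {
    if (datanode == null) {
      return false;
    }
    if (pendingCached.contains(datanode)) {
      return false;
    }
    boolean knownCorrupt = (corrupt != null) && corrupt.contains(datanode);
    return !knownCorrupt;
  }
}
```

The call site would then reduce to a single {{if (isUsableTarget(datanode, pendingCached, corrupt))}}, which keeps the three separate reasons for rejecting a datanode visible one per line.
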
> Automatically cache new data added to a cached path
> ---------------------------------------------------
>
>                 Key: HDFS-5096
>                 URL: https://issues.apache.org/jira/browse/HDFS-5096
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, namenode
>            Reporter: Andrew Wang
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-5096-caching.005.patch
>
>
> For some applications, it's convenient to specify a path to cache, and have 
> HDFS automatically cache new data added to the path without sending a new 
> caching request or a manual refresh command.
> One example is new data appended to a cached file. It would be nice to 
> re-cache a block at the new appended length, and cache new blocks added to 
> the file.
> Another example is a cached Hive partition directory, where a user can drop 
> new files directly into the partition. It would be nice if these new files 
> were cached.
> In both cases, this automatic caching would happen after the file is closed, 
> i.e. block replica is finalized.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
