[jira] [Commented] (HDFS-5096) Automatically cache new data added to a cached path

Andrew Wang (JIRA) Mon, 14 Oct 2013 15:08:35 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13794540#comment-13794540
 ]


Andrew Wang commented on HDFS-5096:
-----------------------------------

Hey Chris, I can field a few of these for Colin:

bq. Something seems off about the logic...

I stumbled on this too, so the comment should be improved. I think the logic is 
correct though:
* If the number of cached replicas is greater than or equal to the desired 
cache replication factor, then clear any pending cache operations
* If the number of cached replicas is less than the desired cache replication 
factor, then clear any pending uncache operations
Maybe writing the conditions as {{numCached >= neededRepl}} and 
{{numCached < neededRepl}} would be clearer, as well as renaming 
{{neededReplication}} to {{desiredCached}} or {{cacheReplFactor}} and rewording 
the javadoc.
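
To make the two cases concrete, here is a minimal sketch of the check as I read 
it. The class {{PendingOpsSketch}}, its two sets, and {{reconcile}} are 
hypothetical stand-ins for illustration, not the actual code in the patch; only 
{{numCached}} and {{neededRepl}} come from the names suggested above.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the per-block bookkeeping; NOT the real HDFS classes.
class PendingOpsSketch {
    // Datanodes that have been asked to cache the block (assumed representation).
    final Set<String> pendingCached = new HashSet<>();
    // Datanodes that have been asked to uncache the block (assumed representation).
    final Set<String> pendingUncached = new HashSet<>();

    /**
     * Drops pending operations that would push the block further away from
     * the desired cache replication factor.
     *
     * @param numCached  number of replicas currently cached
     * @param neededRepl desired cache replication factor
     */
    void reconcile(int numCached, int neededRepl) {
        if (numCached >= neededRepl) {
            // Already at or above the target: no more cache requests are needed.
            pendingCached.clear();
        }
        if (numCached < neededRepl) {
            // Below the target: uncaching would only make things worse.
            pendingUncached.clear();
        }
    }
}
```

The two conditions are mutually exclusive, so exactly one of the pending sets 
is cleared on each call.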

bq. CRMon#rescanFile...

If the mark hasn't been set, we use this PCE's repl factor. Otherwise the file 
has already been visited during this rescan (as evidenced by the mark already 
being set), so we want the max of the previous PCE repl factors and this PCE's. 
This handles duplicate PCEs for the same file, and could definitely use a 
comment.
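
A rough sketch of that mark-and-max behaviour, under my own assumptions: 
{{RescanSketch}}, {{FileState}}, {{startRescan}}, and the boolean mark that 
flips on each rescan are hypothetical names and simplifications, not the 
actual {{CRMon#rescanFile}} code.

```java
// Hypothetical simplification of the rescan bookkeeping; NOT the real HDFS code.
class RescanSketch {
    /** Per-file state touched during a rescan (assumed shape). */
    static final class FileState {
        boolean mark;     // value of the scan mark when this file was last visited
        short cacheRepl;  // cache replication factor computed for this file
    }

    // Flipped at the start of every rescan, so stale marks never match it.
    private boolean currentMark = false;

    void startRescan() {
        currentMark = !currentMark;
    }

    /** Applies one PCE's replication factor to a file it covers. */
    void rescanFile(FileState file, short pceRepl) {
        if (file.mark != currentMark) {
            // First visit in this rescan: take this PCE's factor as-is.
            file.mark = currentMark;
            file.cacheRepl = pceRepl;
        } else {
            // Already visited in this rescan by another PCE: keep the largest factor.
            file.cacheRepl = (short) Math.max(file.cacheRepl, pceRepl);
        }
    }
}
```

With duplicate PCEs requesting factors 2, 1, and 3 for the same file in one 
rescan, the file ends up at 3. The next rescan starts over from whichever PCE 
visits the file first.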

bq. NameNode: Is this HA change meant for this patch...

I think this is so we can assert the write lock in {{CacheManager#activate()}}. 
Holding the write lock here makes sense in general, so the change could be 
punted to trunk after this to reduce the branch diff.

> Automatically cache new data added to a cached path
> ---------------------------------------------------
>
>                 Key: HDFS-5096
>                 URL: https://issues.apache.org/jira/browse/HDFS-5096
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, namenode
>            Reporter: Andrew Wang
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-5096-caching.005.patch, HDFS-5096-caching.006.patch
>
>
> For some applications, it's convenient to specify a path to cache, and have 
> HDFS automatically cache new data added to the path without sending a new 
> caching request or a manual refresh command.
> One example is new data appended to a cached file. It would be nice to 
> re-cache a block at the new appended length, and cache new blocks added to 
> the file.
> Another example is a cached Hive partition directory, where a user can drop 
> new files directly into the partition. It would be nice if these new files 
> were cached.
> In both cases, this automatic caching would happen after the file is closed, 
> i.e. block replica is finalized.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
