[jira] [Commented] (HDFS-6093) Expose more caching information for debugging by users

Colin Patrick McCabe (JIRA) Sat, 15 Mar 2014 01:21:13 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-6093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13936087#comment-13936087
 ]


Colin Patrick McCabe commented on HDFS-6093:
--------------------------------------------

bq. Arpit said: In addition to reducing the timeout as you suggested, can we 
add some explanation to the command output, or update 
CentralizedCacheManagement.html in the docs?

I agree, we could add a short comment to the docs about this.  Now that the 
timeout has been reduced, there should be much less discrepancy between the 
output of the two commands, of course.

Taking a more detailed look at the patch now.

{code}
+  public FsStatus getCacheStatus() throws IOException {
{code}

I know it seems clever to reuse the same class for getStatus and 
getCacheStatus, but it could become a problem if someone later adds more fields 
to getStatus that don't apply to getCacheStatus.  I think we need our own type 
for this, to maintain sanity in the future.  It's not that much code.

{code}
   public long getMissingBlocksCount() throws IOException {
+    statistics.incrementReadOps(1);
     return dfs.getMissingBlocksCount();
{code}

Can we put the non-caching-related incrementReadOps changes in their own JIRA?  
It may seem like a trivial change, but it's kind of distracting from this JIRA. 
 Also I'm not sure I understand when we're "supposed" to increment this...

{code}
  /**
   * Number of replicas pending caching.
   */
  private long numPendingCaching;
  /**
   * Number of replicas pending uncaching.
   */
  private long numPendingUncaching;
{code}

Could use a linebreak after {{numPendingCaching}} for consistency.

Like I said earlier, I'd prefer to decouple the counter(s) that can be read 
from the CRM from the counters that the CRM uses internally during the scan.  
Using the same variable for both just invites bugs like the one Arpit pointed 
out, where rescan zeroes the counter outside the lock.

{code}
[CacheManager#processCacheReportImpl changes]
{code}

Incrementally updating the pendingUncached list and stats is a nice idea, but 
it seems too ambitious for 2.4 at this point.  Now that the CRM interval is 30 
seconds, it shouldn't be too bad to just wait for the CRM to update its stats 
and the lists.  Additionally, we don't even know that monitor is non-null at 
this point, so there is an NPE here, I think.  Let's leave this out and revisit 
it later.

> Expose more caching information for debugging by users
> ------------------------------------------------------
>
>                 Key: HDFS-6093
>                 URL: https://issues.apache.org/jira/browse/HDFS-6093
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: caching
>    Affects Versions: 2.4.0
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: hdfs-6093-1.patch
>
>
> When users submit a new cache directive, it's unclear if the NN has 
> recognized it and is actively trying to cache it, or if it's hung for some 
> other reason. It'd be nice to expose a "pending caching/uncaching" count the 
> same way we expose pending replication work.
> It'd also be nice to display the aggregate cache capacity and usage in 
> dfsadmin -report, since we already have have it as a metric and expose it 
> per-DN in report output.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (HDFS-6093) Expose more caching information for debugging by users

Reply via email to