[ https://issues.apache.org/jira/browse/HADOOP-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582803#action_12582803 ]

Pete Wyckoff commented on HADOOP-3110:
--------------------------------------

I've seen a case in the JobTracker where a large number of failed tasks at one 
time can cause the JobTracker to lock up because of this. The JT ends up writing 
hundreds of error messages to the log, and by the time it's done, the next set 
of tasks has failed because they all timed out waiting for the JobTracker to 
write the last batch of failures, and so on and on.

These are critical sections, so if you have a queue of 200 clients waiting to 
do an operation, each one has to wait 200 * (logging time) plus the rest of the 
work, and during those 200 log writes one of them will probably force a flush 
of the Java buffers to the OS.
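
To make the pattern concrete, here is a minimal, hypothetical sketch (the class 
and method names are made up; this is not actual JobTracker code) of logging 
inside a critical section versus deferring the log call until the lock has been 
released:

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;

    class CriticalSectionLoggingSketch {
      private static final Log LOG =
          LogFactory.getLog(CriticalSectionLoggingSketch.class);

      // Problematic pattern: formatting and appender I/O happen while the lock
      // is held, so every waiting caller also pays for the log writes of
      // everyone ahead of it.
      synchronized void recordFailureLoggingUnderLock(String taskId, String reason) {
        // ... update shared bookkeeping here ...
        LOG.info("Task " + taskId + " failed: " + reason); // written while holding the lock
      }

      // One mitigation: mutate state under the lock, log after releasing it.
      void recordFailureLoggingAfterLock(String taskId, String reason) {
        synchronized (this) {
          // ... update shared bookkeeping here ...
        }
        LOG.info("Task " + taskId + " failed: " + reason); // lock already released
      }
    }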

We don't have a benchmarking cluster here, unfortunately, but I can tell you one 
place where I perceive this pain: the FUSE-dfs mountable filesystem. When 
there are no DFS clients it's lightning fast, but add a few Map/Reduce jobs and 
the thing really, *really* slows down. The FUSE client ends up making multiple 
calls per filesystem operation, so these delays waiting for other processes to 
relinquish the NameNode global lock really hurt it. How much this contributes 
to that I don't know, but for real-time clients we know that DFS is pretty weak 
right now.




> NameNode does logging in critical sections just about everywhere
> ----------------------------------------------------------------
>
>                 Key: HADOOP-3110
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3110
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.14.0, 0.14.1, 0.14.2, 0.14.3, 0.14.4, 0.15.0, 0.15.1, 
> 0.15.2, 0.15.3, 0.16.0, 0.16.1
>         Environment: All
>            Reporter: Pete Wyckoff
>
> e.g., FSNameSystem.addStoredBlock (but almost every method has logging in its 
> critical sections).
> That method is synchronized and writes to Log.info for every block stored. 
> Normally that's not a big deal, but since this is in the NameNode and these 
> are critical sections...
> We shouldn't do any logging at all in critical sections, so even the info and 
> warn calls are bad. In many places in the code it would be hard to tease these 
> out (although eventually they really should be), but the system could start 
> using something like an AsyncAppender and see how much that improves 
> performance.
> Even though the log may have a buffer, the formatting and the writing are a 
> drag on performance when 100s/1000s of machines are trying to talk to the 
> NameNode.
> At a minimum, the most frequently triggered Log.info calls could be changed to 
> Log.debug.
> for reference: 
> http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/AsyncAppender.html
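
The AsyncAppender suggestion above could be tried without touching any of the 
call sites, assuming the commons-logging calls are backed by log4j as in a 
stock Hadoop install. A rough sketch (the class name and buffer size are just 
examples, not a tested configuration):

    import java.util.ArrayList;
    import java.util.Enumeration;
    import java.util.List;

    import org.apache.log4j.Appender;
    import org.apache.log4j.AsyncAppender;
    import org.apache.log4j.Logger;

    public class AsyncLoggingSetupSketch {
      // Wrap the root logger's existing appenders in one AsyncAppender so the
      // thread holding the NameNode lock only enqueues the event; a background
      // dispatcher thread does the formatting and the actual I/O.
      public static void wrapRootAppenders() {
        Logger root = Logger.getRootLogger();
        AsyncAppender async = new AsyncAppender();
        async.setBufferSize(1024); // example size only

        List<Appender> existing = new ArrayList<Appender>();
        Enumeration<?> e = root.getAllAppenders();
        while (e.hasMoreElements()) {
          existing.add((Appender) e.nextElement());
        }
        for (Appender a : existing) {
          async.addAppender(a);   // AsyncAppender forwards events to the old appenders
          root.removeAppender(a); // ...so they are no longer written synchronously
        }
        root.addAppender(async);
      }
    }

The trade-off is that messages show up slightly later and anything still queued 
is lost if the process dies, which seems acceptable for info/warn chatter.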

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
