[ https://issues.apache.org/jira/browse/HDFS-5241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daryn Sharp updated HDFS-5241:
------------------------------
Attachment: HDFS-5241.patch
No tests yet; requesting feedback before investing the time to write them.
Provides an option to enable async logging via a single background thread. The
performance gains are impressive under an ideal read-heavy load:
* fair lock = 26k ops/sec
* unfair lock = 58k ops/sec
* unfair lock + unbuffered appender = 120k ops/sec
A single thread consuming log messages from a queue populated by the 100 RPC
handlers is sufficient to improve performance. Additional threads showed no
significant improvement.
The problem is 100 threads colliding on log4j's synchronized method. The
contention is so high, and the logging call takes long enough, that a blocked
thread's futex has to call into the kernel. The context switch and rescheduling
wait ruin performance. By comparison, the time spent waiting to add a log
message to the queue is negligible. The futexes stay in userland.
The performance sweet spot is a queue sized to the number of handlers. As long
as the background thread can log messages faster than a handler can process the
next call, the handler is guaranteed a spot in the queue without a context
switch.
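To make the queuing idea concrete, here is a minimal sketch of a bounded queue
drained by a single background thread. It is not the code in HDFS-5241.patch:
the class name QueuedAuditLogger, the method names, and the Consumer<String>
sink that stands in for the real log4j call are all assumptions made for
illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch only; names and structure are not taken from the patch. */
final class QueuedAuditLogger {
    private final BlockingQueue<String> queue;
    private final Consumer<String> sink;      // stand-in for the real logging call
    private final Thread writer;
    private volatile boolean running = true;

    /** Sizes the queue to the number of handlers, per the sweet spot described above. */
    QueuedAuditLogger(int handlerCount, Consumer<String> sink) {
        this.queue = new ArrayBlockingQueue<>(handlerCount);
        this.sink = sink;
        this.writer = new Thread(this::drain, "audit-log-writer");
        this.writer.setDaemon(true);
        this.writer.start();
    }

    /** Called by handler threads; blocks only while the queue is full. */
    void logAuditMessage(String message) throws InterruptedException {
        queue.put(message);
    }

    /** Single consumer: only this thread ever touches the sink. */
    private void drain() {
        try {
            while (running || !queue.isEmpty()) {
                String message = queue.poll(100, TimeUnit.MILLISECONDS);
                if (message != null) {
                    sink.accept(message);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Stops the writer after it has drained everything queued so far. */
    void shutdown() throws InterruptedException {
        running = false;
        writer.join();
    }
}
```

In this sketch the handlers only ever contend on the queue, and the sink is
called from one thread. Anything still queued if the process exits before
shutdown() returns is never written.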
It's a configurable, undocumented option for now, since with async logging the
audit log becomes prone to data loss and to slightly offset timestamps.
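On the timestamp offset: if the logging framework stamps each entry when the
background thread writes it, the recorded time trails the time the handler
made the call. One possible way around that, shown here purely as an
illustration and not something the patch is known to do, is to capture the
time on the handler thread and carry it through the queue. AuditEvent and its
methods are hypothetical names.

```java
import java.time.Instant;

/** Hypothetical queue entry; not a class from HDFS-5241.patch. */
final class AuditEvent {
    final Instant requestedAt;   // captured on the handler thread, before queuing
    final String message;

    AuditEvent(Instant requestedAt, String message) {
        this.requestedAt = requestedAt;
        this.message = message;
    }

    /** Convenience factory a handler would call at enqueue time. */
    static AuditEvent now(String message) {
        return new AuditEvent(Instant.now(), message);
    }

    /** Formats with the handler-side time, not the time the writer thread runs. */
    String format() {
        return requestedAt + " " + message;
    }
}
```

The trade-off is that the handler pays for reading the clock and allocating
the event, and the log layout would have to print the carried timestamp rather
than its own.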
The call queue tends to run relatively dry, so I expect my other
connection-handling patches like HADOOP-9956 will have a larger impact.
> Provide alternate queuing audit logger to reduce logging contention
> -------------------------------------------------------------------
>
> Key: HDFS-5241
> URL: https://issues.apache.org/jira/browse/HDFS-5241
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: namenode
> Affects Versions: 2.0.0-alpha, 3.0.0
> Reporter: Daryn Sharp
> Assignee: Daryn Sharp
> Attachments: HDFS-5241.patch
>
>
> The default audit logger has extremely poor performance. The internal
> synchronization of log4j causes massive contention between the call handlers
> (100 by default) which drastically limits the throughput of the NN.