[ 
https://issues.apache.org/jira/browse/LOGCXX-500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585583#comment-16585583
 ] 

Denys Smolianiuk commented on LOGCXX-500:
-----------------------------------------

Hi Thorsten,

Thank you very much for giving it a try and providing your feedback!

I think it's probably worth splitting this patch into three different ones 
addressing problems 1-3. They are actually independent and have different 
dependencies.

1. Atomic queue patch. It currently uses Boost, which I understand might not be 
a feasible option. APR doesn't currently have any lock-free containers and I 
didn't really want to write one from scratch. Maybe we can find an existing one 
fitting log4cxx's constraints? Also, boost::lockfree::queue is a header-only 
library, so conditional compilation might be an option here too.

As for the usefulness of lock-free algorithms in general, I agree that there 
are cases where they do not improve performance. However, in this particular 
case, after running lots of tests, I do see real benefits. Also, on my previous 
project we had a proprietary lock-free logging library which performed really 
well. After all, there is no way an application can make use of newer 28-core 
CPUs if something as basic as logging causes independent threads to block 
each other.
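For illustration, the kind of structure I mean might look like the following minimal single-producer/single-consumer ring buffer built on std::atomic. This is only a sketch of the lock-free technique, not the patch itself (the patch uses boost::lockfree::queue, which is multi-producer/multi-consumer); the class and member names here are made up:

```cpp
#include <atomic>
#include <cstddef>
#include <optional>
#include <string>

// Minimal SPSC lock-free ring buffer: the producer (logging thread) pushes,
// the consumer (disk-writer thread) pops, and neither ever blocks on a mutex.
template <typename T, std::size_t Capacity>
class SpscQueue {
public:
    bool push(T value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % Capacity;
        if (next == tail_.load(std::memory_order_acquire))
            return false;                       // queue full; caller decides what to do
        buffer_[head] = std::move(value);
        head_.store(next, std::memory_order_release);
        return true;
    }

    std::optional<T> pop() {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return std::nullopt;                // queue empty
        T value = std::move(buffer_[tail]);
        tail_.store((tail + 1) % Capacity, std::memory_order_release);
        return value;
    }

private:
    T buffer_[Capacity];
    std::atomic<std::size_t> head_{0};          // next slot the producer writes
    std::atomic<std::size_t> tail_{0};          // next slot the consumer reads
};
```

The acquire/release pairs are what replace the mutex: the producer's release store on head_ publishes the written slot to the consumer's acquire load, so neither thread ever gets de-scheduled while holding a lock.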

2. RW mutex patch. It uses APR and doesn't introduce new dependencies. Can be 
applied independently.

3. Static stringstream. Depends on C++11, as APR doesn't have thread-local 
storage. It might not be feasible on some platforms now, but sooner or later 
C++11 will be supported everywhere, so this change is future-proof. It can 
also be made a compile-time option. By the way, not creating a stringstream 
every time not only prevents blocking, but also noticeably improves performance.
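A sketch of the per-thread stream reuse (the function name and message format here are made up; the real patch applies this inside log4cxx's MessageBuffer):

```cpp
#include <sstream>
#include <string>

// One stringstream per thread, constructed once and reused across log
// statements, instead of a fresh std::ostringstream per LOG4CXX_DEBUG().
inline std::string format_message(int value, const std::string& name) {
    thread_local std::ostringstream oss;    // constructed once per thread
    oss.str("");                            // reset contents between calls
    oss.clear();                            // reset any error flags
    oss << "value=" << value << " name=" << name;
    return oss.str();
}
```

The reset via str("")/clear() is what makes reuse safe between calls; without it, output from earlier log statements would accumulate in the stream.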

Thank you,
 Denys Smolianiuk

> Logging in Timing-Critical Applications
> ---------------------------------------
>
>                 Key: LOGCXX-500
>                 URL: https://issues.apache.org/jira/browse/LOGCXX-500
>             Project: Log4cxx
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.10.0
>            Reporter: Thorsten Schöning
>            Priority: Minor
>         Attachments: config.xml, main.cpp, non_blocking.diff, 
> non_blocking_wo_debian_control.diff
>
>
> The following arrived on the mailing list; providing it here as well, 
> mainly to additionally collect the given patches etc.:
> {quote}Hello All,
> I'd like to share some experience, as well as some patches, with regard 
> to using log4cxx in a timing-critical application. First, a few words about 
> our requirements: it's a service which must generate network 
> packets with up to a hundred microseconds of precision. Thus, it's very 
> important to have predictable code timing. One can argue that log4cxx 
> is not very well suited for such applications, but surprisingly it 
> works pretty well after some light tuning.
> So, what were the issues?
> Basically, from the library user's point of view they all looked the same: all 
> of a sudden, logging done with the LOG4CXX_DEBUG() macro could take 
> unexpectedly long to complete. For example, the same trace which 
> takes several μs 99% of the time would take hundreds of microseconds, 
> and sometimes even a few milliseconds. After further investigation, this 
> was traced down to a few root causes:
> 1. The async logger (which we have been using, of course) has an internal queue 
> to pass log entries to the background disk-writer thread. This queue is 
> mutex-protected, which might seem fine unless you think a little bit 
> more about it. First of all, someone calling LOG4CXX_DEBUG() to simply 
> put something into the log might not expect to be blocked inside it, 
> waiting for a mutex, at all. The second point is that, although measures 
> were taken to minimize the time the disk thread holds that lock, 
> OS schedulers often work in such a way that a thread which blocks on a 
> mutex gets de-scheduled. With a normal OS scheduler quantum, that means 
> the logging thread can be preempted for milliseconds.
> 2. There are mutexes protecting the internal state of both loggers 
> and appenders. This means that two separate threads calling 
> LOG4CXX_DEBUG() can block each other. Even if they are using different 
> loggers, they will block on the appender! This has the same consequences 
> for execution timing and performance as described above.
> 3. The std::stringstream constructor has some internal locks of its 
> own. Unfortunately, each MessageBuffer has its own instance of this 
> class. And also unfortunately, the MessageBuffer is created inside the 
> LOG4CXX_DEBUG() macro. There is an optimization to not create the 
> stringstream when logging simple strings, but as soon as your log statement 
> has a single '<<' operator, it is created.
> 4. Dynamic memory allocations. Unfortunately, there are still quite a few 
> of them, even though a memory pool is used in some other places. Thus, 
> hidden calls to new and malloc induce unpredictable delays.
> So, what did we do to mitigate these problems?
> 1. The natural solution for this issue was to use an atomic queue. There are 
> a few of them available, but we made use of boost::lockfree::queue, as it 
> can serve as a drop-in replacement, allowing us to keep all present 
> functionality.
> 2. After looking more into the code, it appeared that two concurrent 
> calls to LOG4CXX_DEBUG() from different threads are not harmful, 
> because the internal structures of the logger and appender are not being 
> changed there. The only thing that really requires protection is concurrency 
> between logging and configuring. Thus, we came to a solution: 
> read-write locks, where logging calls act as readers and 
> configuration/exit calls are writers. With such an approach, multiple 
> threads calling LOG4CXX_DEBUG() become free of any contention.
> 3. This problem also has one simple solution: make one static 
> std::stringstream object per thread using thread_local. 
> Unfortunately, we found one drawback: thread_local memory is not 
> released if a thread is neither detached nor joined. As there is some code 
> which does neither of these, we made the static stringstream an XML-file 
> configuration option. Also, there could be an issue with using multiple 
> MessageBuffer instances from within a single thread, but LOG4CXX_DEBUG() 
> does not do that.
> 4. At this time, we didn't do anything to address the dynamic memory 
> allocation issue.
> So, if you want to give our patch a try, it is attached. It's based on 
> log4cxx 0.10.0-12 as found in Debian. There are new SHARED_MUTEX and 
> LOCK_R/LOCK_W macros defined in mutex.h and synchronized.h for easy 
> switching between RW and conventional mutexes for benchmarking. Also, 
> there is a test application which spawns two threads doing logging 
> with a small sleep between iterations. It prints to stdout if a trace 
> statement took more than 500 microseconds. It might look familiar to 
> you because it's based on one of the examples from StackOverflow.
> In our testing, the modified log4cxx has about 20% better performance 
> overall, but more importantly for us, it has far fewer cases where a log 
> statement takes excessive time to complete. The second part is only 
> true for CPUs with >2 cores, where all threads can physically run in 
> parallel. We unfortunately still see rare cases of prolonged logging 
> calls; we have traced these down to the dynamic memory allocation issue.
> Any thoughts?
> Denys Smolianiuk
> Senior SW Engineer
> Harmonic Inc.{quote}
> http://mail-archives.apache.org/mod_mbox/logging-dev/201808.mbox/<1534433474.4460.1%40smtp-mail.outlook.com>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
