Hi all,

We discovered a performance regression with LTTng on recent kernels, related to the use of fadvise DONTNEED. The LTTng consumer makes a call to this syscall.
The following kernel commit causes the call to fadvise to sometimes be much slower.

Kernel commit info:
mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages
main tree: (since 3.9-rc1)
commit 67d46b296a1ba1477c0df8ff3bc5e0167a0b0732
stable tree: (since 3.8.1)
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit?id=bb01afe62feca1e7cdca60696f8b074416b0910d

On our test workload, the call to fadvise takes about 4-5 us before this patch is applied. With the patch applied, the syscall takes anywhere from 5 us up to 4 ms (4000 us).

The effect on LTTng is that the consumer is frozen for that whole period, which leads to dropped events in the trace.

If we remove the call to fadvise in src/common/consumer.c, we don't get any dropped events and we don't observe any bad side effects. (The added latency seems to come from the new call to lru_add_drain_all(). We removed this line from the kernel and the performance went back to normal.)

It is clearly a problem in the kernel, but since it impacts LTTng, we wanted to report it here first and ask for advice on what the next step should be to solve it.

If you want to see for yourself, you can find the trace with the long call to fadvise here:
http://www.dorsal.polymtl.ca/~rbeamonte/3.8.0~autocreated-4469887.tar.gz

Yannick and Raphael
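
P.S. To time the call outside of LTTng, something like the sketch below could be used. It is only an illustration and not the code in src/common/consumer.c: time_fadvise_dontneed() and elapsed_ns() are hypothetical helpers, and the scratch file path and size are arbitrary assumptions. It times a single posix_fadvise(POSIX_FADV_DONTNEED) call with clock_gettime(CLOCK_MONOTONIC).

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Nanoseconds elapsed between two CLOCK_MONOTONIC timestamps. */
static int64_t elapsed_ns(const struct timespec *start,
                          const struct timespec *end)
{
    return (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL
         + (end->tv_nsec - start->tv_nsec);
}

/*
 * Hypothetical helper (not LTTng code): write `len` zero bytes to a scratch
 * file at `path`, flush them, then time one POSIX_FADV_DONTNEED call over
 * that range. Returns the duration in nanoseconds, or -1 on any error.
 * The scratch file is removed before returning.
 */
static int64_t time_fadvise_dontneed(const char *path, size_t len)
{
    struct timespec t0, t1;
    int64_t ns = -1;
    char *buf = calloc(1, len);
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);

    if (buf && fd >= 0 && write(fd, buf, len) == (ssize_t)len) {
        /* Flush first: dirty pages cannot be dropped by DONTNEED. */
        fdatasync(fd);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        /* posix_fadvise() returns 0 on success, an error number otherwise. */
        int ret = posix_fadvise(fd, 0, (off_t)len, POSIX_FADV_DONTNEED);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        if (ret == 0)
            ns = elapsed_ns(&t0, &t1);
    }

    if (fd >= 0) {
        close(fd);
        unlink(path);
    }
    free(buf);
    return ns;
}
```

Calling time_fadvise_dontneed("/tmp/fadvise_probe.tmp", 1 << 20) in a loop on kernels with and without the commit should show whether the slow outliers appear. A single scratch file on an otherwise idle machine may not reproduce the traced workload, so absence of outliers there proves little.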
