On 12/10/2015 05:58 PM, Chao Fan wrote:
----- Original Message -----
From: "Wenjian Zhou/周文剑" <[email protected]>
To: "Atsushi Kumagai" <[email protected]>
Cc: [email protected]
Sent: Thursday, December 10, 2015 5:36:47 PM
Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
On 12/10/2015 04:14 PM, Atsushi Kumagai wrote:
Hello Kumagai,
On 12/04/2015 10:30 AM, Atsushi Kumagai wrote:
Hello, Zhou
On 12/02/2015 03:24 PM, Dave Young wrote:
Hi,
On 12/02/15 at 01:29pm, "Zhou, Wenjian/周文剑" wrote:
I think there is no problem if other test results are as expected.
--num-threads mainly reduces the time spent compressing.
So for lzo, it can't help much most of the time.
It seems the help text of --num-threads does not describe this exactly:
  [--num-threads THREADNUM]:
      Using multiple threads to read and compress data of each page
      in parallel. It will reduce the time for saving DUMPFILE.
      This feature only supports creating DUMPFILE in kdump-compressed
      format from VMCORE in kdump-compressed format or elf format.
Lzo is also a compression method; it should be mentioned that
--num-threads only supports zlib-compressed vmcore.
Sorry, it seems that something I said was not clear.
lzo is also supported. Since lzo compresses data at high speed, the
performance improvement is not so obvious most of the time.
It is also worth mentioning the recommended -d value for this feature.
Yes, I think it's worth mentioning. I forgot it.
I saw your patch, but I think I should confirm what the problem is first.
However, when "-d 31" is specified, it will be worse.
Fewer than 50 buffers are used to cache the compressed pages.
Even a page that has been filtered still takes a buffer.
So if "-d 31" is specified, the filtered pages will use a lot
of buffers, and the pages which need to be compressed can't
be compressed in parallel.
Could you explain in more detail why compression will not be parallel?
Indeed the buffers are used for filtered pages too, which sounds
inefficient.
However, I don't understand why it prevents parallel compression.
Think about this: in a huge memory, most of the pages will be filtered, and
we have 5 buffers.
page1       page2     page3     page4     page5     page6       page7 ...
[buffer1]   [buffer2] [buffer3] [buffer4] [buffer5]
unfiltered  filtered  filtered  filtered  filtered  unfiltered  filtered
Since a filtered page also takes a buffer, page6 can't be compressed
while page1 is being compressed.
That is why it prevents parallel compression.
Thanks for your explanation, I understand.
This is just an issue of the current implementation; there is no
reason to keep this restriction.
Further, according to Chao's benchmark, there is a big performance
degradation even if the number of thread is 1. (58s vs 240s)
The current implementation seems to have some problems, we should
solve them.
If "-d 31" is specified, on the one hand we can't save time by compressing
in parallel, and on the other hand we introduce some extra work by adding
"--num-threads". So it is obvious that there will be some performance
degradation.
Sure, there must be some overhead due to "some extra work" (e.g. exclusive
locking),
but "--num-threads=1 is 4 times slower than --num-threads=0" still sounds
too slow; the degradation is too big to be called "some extra work".
Both --num-threads=0 and --num-threads=1 are serial processing, so
the above "buffer fairness issue" can't be related to this degradation.
What do you think causes this degradation?
I can't reproduce such a result at the moment, so I can't do any further
investigation right now. I guess it may be caused by the underlying
implementation of pthread.
I reviewed the test results of patch v2 and found that the results
differ quite a lot between machines.
Hi Zhou Wenjian,
I have done more tests on another machine with 128G of memory, and got these results:
the size of the vmcore is 300M with "-d 31"
makedumpfile -l --message-level 1 -d 31:
time: 8.6s page-faults: 2272
makedumpfile -l --num-threads 1 --message-level 1 -d 31:
time: 28.1s page-faults: 2359
and the size of the vmcore is 2.6G with "-d 0".
On this machine, I get the same result as yours:
makedumpfile -c --message-level 1 -d 0:
time: 597s page-faults: 2287
makedumpfile -c --num-threads 1 --message-level 1 -d 0:
time: 602s page-faults: 2361
makedumpfile -c --num-threads 2 --message-level 1 -d 0:
time: 337s page-faults: 2397
makedumpfile -c --num-threads 4 --message-level 1 -d 0:
time: 175s page-faults: 2461
makedumpfile -c --num-threads 8 --message-level 1 -d 0:
time: 103s page-faults: 2611
But the machine from my first test is not under my control; should I wait
for it to do more tests?
If there are still some problems in my tests, please tell me.
Thanks a lot for your test, it seems that there is nothing wrong.
And I haven't got any idea about more tests...
Could you provide the information about your CPU?
I will do some further investigation later.
But I still believe it's better not to use "-l -d 31" and "--num-threads"
at the same time, though it's very strange that the performance
degradation is so big.
--
Thanks
Zhou
Thanks,
Chao Fan
It seems that I can get almost the same result as Chao on the "PRIMEQUEST
1800E".
###################################
- System: PRIMERGY RX300 S6
- CPU: Intel(R) Xeon(R) CPU x5660
- memory: 16GB
###################################
************ makedumpfile -d 7 ******************
threads-num (-l)   core-data 0   core-data 256
 0                     10            144
 4                      5            110
 8                      5            111
12                      6            111

************ makedumpfile -d 31 ******************
threads-num (-l)   core-data 0   core-data 256
 0                      0              0
 4                      2              2
 8                      2              3
12                      2              3
###################################
- System: PRIMEQUEST 1800E
- CPU: Intel(R) Xeon(R) CPU E7540
- memory: 32GB
###################################
************ makedumpfile -d 7 ******************
threads-num (-l)   core-data 0   core-data 256
 0                     34            270
 4                     63            154
 8                     64            131
12                     65            159

************ makedumpfile -d 31 ******************
threads-num (-l)   core-data 0   core-data 256
 0                      2              1
 4                     48             48
 8                     48             49
12                     49             50
I'm not so sure whether the big performance degradation is a problem.
But I think if it works as expected in the other cases, this won't be a
problem (or a problem that needs to be fixed), since the performance
degradation exists in theory.
Otherwise, the current implementation should be replaced by a new algorithm.
For example:
We can add an array to record whether each page is filtered or not,
and then only the unfiltered pages will take buffers.
We should discuss how to implement the new mechanism; I'll mention this later.
But I'm not sure whether it is worth it.
Since "-l -d 31" is already fast enough, the new algorithm also can't help much there.
Basically the faster, the better. There is no obvious target time.
If there is room for improvement, we should do it.
Maybe we can improve the performance of "-c -d 31" in some cases.
BTW, we can easily get the theoretical performance by using "--split".
--
Thanks
Zhou
_______________________________________________
kexec mailing list
[email protected]
http://lists.infradead.org/mailman/listinfo/kexec