Re: [deal.II] Re: measuring cpu and wall time for assembly routine

2022-10-23 Thread Simon Wiesheier
Sorry, I was wrong. Of course, it is the other way round.
The fast one is 3 times faster.

-Simon

Am So., 23. Okt. 2022 um 10:37 Uhr schrieb Peter Munch <
peterrmue...@gmail.com>:

> Now, I am lost. The fast one is 3 times slower!?
>
> Peter

Re: [deal.II] Re: measuring cpu and wall time for assembly routine

2022-10-23 Thread Peter Munch
Now, I am lost. The fast one is 3 times slower!?

Peter


Re: [deal.II] Re: measuring cpu and wall time for assembly routine

2022-10-23 Thread Simon Wiesheier
Certainly.
When using the slow path, i.e., MappingQ in version 9.3.2, the CPU time is
about 6.3 seconds.
In case of the fast path, i.e., MappingQGeneric in version 9.3.2, the CPU
time is about 18.7 seconds.
Roughly, the reinit() function associated with the FEPointEvaluation
objects is called about 1.2 million times.

Best,
Simon


Re: [deal.II] Re: measuring cpu and wall time for assembly routine

2022-10-22 Thread Peter Munch
Happy about that! May I ask you to post the results here? I am curious,
since I never actually compared timings (and only blindly trusted Martin).

Thanks,
Peter


Re: [deal.II] Re: measuring cpu and wall time for assembly routine

2022-10-22 Thread Simon Wiesheier
Yes, the issue is resolved and the computation time decreased significantly.

Thank you all!

-Simon


Re: [deal.II] Re: measuring cpu and wall time for assembly routine

2022-10-22 Thread Peter Munch
You are right. Release 9.3 uses the slow path for MappingQ. The reason is
that here
https://github.com/dealii/dealii/blob/ccfaddc2bab172d9d139dabc044d028f65bb480a/include/deal.II/matrix_free/fe_point_evaluation.h#L708-L711
we check for MappingQGeneric. At that time, MappingQ and MappingQGeneric
were different classes. In the meantime, we have merged the classes, so that
in release 9.4 and on master this is not an issue anymore. Is there a
chance that you could update deal.II? Alternatively, you could use
MappingQGeneric instead of MappingQ.

Hope this resolves this issue!

Peter
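
[Editorial note: Peter's suggestion amounts to a one-line change at the point where the mapping is created. A minimal, hedged sketch follows, assuming deal.II 9.3 headers; the variable names are illustrative, not from the original code, and the fragment is not a complete program:]

```cpp
#include <deal.II/fe/fe_q.h>
#include <deal.II/fe/mapping_q_generic.h>
#include <deal.II/matrix_free/fe_point_evaluation.h>

using namespace dealii;

// In release 9.3, the fast-path check in FEPointEvaluation looks for
// MappingQGeneric specifically, so this constructor takes the slow path:
//   MappingQ<1>        mapping(1);
// Replacing it with the (here equivalent) generic class enables the fast path:
MappingQGeneric<1>      mapping(1);
FEPointEvaluation<1, 1> fe_eval(mapping, FE_Q<1>(1),
                                update_values | update_gradients);
```

From release 9.4 on, the two classes are merged, so this workaround is only needed on 9.3.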


Re: [deal.II] Re: measuring cpu and wall time for assembly routine

2022-10-21 Thread Simon Wiesheier
I revised the appendix from my last message a little bit and now attach a
minimal working example (just 140 lines) along with a CMakeLists.txt.
After checking the profiling results from valgrind, the combination of
MappingQ with FE_Q does *not* take the fast path.

For info: I use deal.II version 9.3.2.

Best,
Simon

Am Do., 20. Okt. 2022 um 18:11 Uhr schrieb Simon Wiesheier <
simon.wieshe...@gmail.com>:

> " When you use FEPointEvaluation, you should construct it only once and
> re-use the same object for different points. Furthermore, you should also
> avoid creating "p_dofs" and the "std::vector" near the [...] I was not clear
> with my original message. Anyway, the problem is the FEValues object that
> gets used. I am confused by your other message that you use FE_Q together
> with MappingQ - that combination should be supported and if it is not, we
> should take a look at a (reduced) code from you. "
>
> I added a snippet of my code (see appendix) in which I describe the logic
> of what I am doing with FEPointEvaluation.
> In fact, constructing FEPointEvaluation (and the vector p_dofs) once and
> re-using them brings only minor improvements, as the overall cost is
> dominated by the call to reinit().
> But, of course, it helps at least.
>
> I am surprised too that the fast path is not used. Maybe you can identify
> a problem in my code.
> Thank you!
>
> Best,
> Simon
>
> Am Do., 20. Okt. 2022 um 17:02 Uhr schrieb Martin Kronbichler <
> kronbichler.mar...@gmail.com>:
>
>> Dear Simon,
>>
>> When you use FEPointEvaluation, you should construct it only once and
>> re-use the same object for different points. Furthermore, you should also
>> avoid creating "p_dofs" and the "std::vector" near the [...] I was not clear
>> with my original message. Anyway, the problem is the FEValues object that
>> gets used. I am confused by your other message that you use FE_Q together
>> with MappingQ - that combination should be supported and if it is not, we
>> should take a look at a (reduced) code from you.
>>
>> Regarding the high timings: there is some parallelization by tasks that
>> gets done inside the constructor of FEValues. This has good intentions for
>> the case that we are in 3D and have a reasonable amount of work to do.
>> However, you are in 1D (if I read your code correctly), and then it has
>> adverse effects. The reason is that the constructor of FEValues is very
>> likely completely dominated by memory allocation. When we have one thread,
>> everything is fine, but when we have multiple threads working, they will
>> start to interfere with each other when they request memory through
>> malloc(), which has to be coordinated by the operating system (and thus
>> gets slower). In fact, the big gap between compute time and wall time shows
>> that a lot of time is wasted as "system time" that does not do actual
>> work on the cores.
>>
>> I guess the library could have a better measure of when to spawn tasks in
>> FEValues in similar contexts, but it is a lot of work to get this right.
>> (This is why I keep avoiding it in critical functions.)
>>
>> Best,
>> Martin
>>
>>
>> On 20.10.22 16:47, Simon Wiesheier wrote:
>>
>> Update:
>>
>> I profiled my program with valgrind --tool=callgrind and could figure out
>> that
>> FEPointEvaluation creates an FEValues object along with a quadrature
>> object under the hood.
>> Closer inspection revealed that all constructors, destructors,...
>> associated with FEPointEvaluation
>> need roughly 5000 instructions more (per call!).
>> That said, FEValues is indeed the faster approach, at least for FE_Q
>> elements.
>>
>> export DEAL_II_NUM_THREADS=1
>> eliminated the gap between cpu and wall time.
>> Using FEValues directly, I get cpu time 19.8 seconds
>> and in the case of FEPointEvaluation cpu time = 21.9 seconds;
>> Wall times are in the same ballpark.
>> Out of curiosity, why produces multi-threading such high wall times (200
>> seconds) in my case?.
>>
>> These times are far too big given that the solution of the linear system
>> takes only about 13 seconds.
>> But based on what all of you have said, there is probably no other to way
>> to implement my problem.
>>
>> Best
>> Simon
>>
>> Am Do., 20. Okt. 2022 um 11:55 Uhr schrieb Simon Wiesheier <
>> simon.wieshe...@gmail.com>:
>>
>>> Dear Martin and Wolfgang,
>>>
>>> " You seem to be looking for FEPointEvaluation. That class is shown in
>>> step-19 and provides, for simple FiniteElement types, a much faster way to
>>> evaluate solutions at arbitrary points within a cell. Do you want to give
>>> it a try? "
>>>
>>> I implemented the FEPointEvaluation approach like this:
>>>
>>> FEPointEvaluation<1,1> fe_eval(mapping,
>>> FE_Q<1>(1),
>>> update_gradients |
>>> update_values);
>>> fe_eval.reinit(cell,
>>> make_array_view(std::vector>{ref_point_energy_vol}));
>>> Vector p_dofs(2);
>>> cell->get_dof_values(solution_global, p_dofs);
>>> 

Re: [deal.II] Re: measuring cpu and wall time for assembly routine

2022-10-20 Thread Simon Wiesheier
" When you use FEPointEvaluation, you should construct it only once and
re-use the same object for different points. Furthermore, you should also
avoid creating "p_dofs" and the "std::vector" on every call; I was not clear
with my original message. Anyway, the problem is the FEValues object that
gets used. I am confused by your other message that you use FE_Q together
with MappingQ - that combination should be supported and if it is not, we
should take a look at a (reduced) code from you. "

I added a snippet of my code (see appendix) in which I describe the logic
as to what I am doing with FEPointEvaluation.
In fact, constructing FEPointEvaluation (and the vector p_dofs) only once and
re-using them brings only a minor improvement, as the overall cost is dominated
by the call to reinit().
But, of course, it helps at least a little.

I am surprised too that the fast path is not used. Maybe you can identify a
problem in my code.
Thank you!

Best,
Simon


Re: [deal.II] Re: measuring cpu and wall time for assembly routine

2022-10-20 Thread Martin Kronbichler

Dear Simon,

When you use FEPointEvaluation, you should construct it only once and 
re-use the same object for different points. Furthermore, you should 
also avoid creating "p_dofs" and the "std::vector" on every call; I was 
not clear about this in my original message. Anyway, the problem is the 
FEValues object that gets used. I am confused by your other message that 
you use FE_Q together with MappingQ - that combination should be 
supported and if it is not, we should take a look at a (reduced) code 
from you.


Regarding the high timings: there is some parallelization by tasks that 
happens inside the constructor of FEValues. This has good intentions for 
the case that we are in 3D and have a reasonable amount of work to do. 
However, you are in 1D (if I read your code correctly), and then it has 
adverse effects. The reason is that the constructor of FEValues is very 
likely completely dominated by memory allocation. When we have 1 thread, 
everything is fine, but when multiple threads are working they will 
start to interfere with each other when they request memory through 
malloc(), which has to be coordinated by the operating system (and thus 
gets slower). In fact, the big gap between compute time and wall time 
shows that a lot of time is wasted on "system time" that does not do 
actual work on the cores.


I guess the library could have a better measure of when to spawn tasks 
in FEValues in similar contexts, but it is a lot of work to get this 
right. (This is why I keep avoiding it in critical functions.)


Best,
Martin



Re: [deal.II] Re: measuring cpu and wall time for assembly routine

2022-10-20 Thread Simon Wiesheier
" What type of Mapping are you using? If you take a look at
https://github.com/dealii/dealii/blob/ad13824e599601ee170cb2fd1c7c3099d3d5b0f7/source/matrix_free/fe_point_evaluation.cc#L40-L95
you can see when the fast path of FEPointEvaluation is taken; otherwise, the
slow path (FEValues) is used. One question: are you running in release or debug
mode? "

I use FE_Q<1>(1) with a MappingQ<1>(1) and
FE_Q<2>(1) with a MappingQ<2>(1).

I am running in release mode.

Best,
Simon


Re: [deal.II] Re: measuring cpu and wall time for assembly routine

2022-10-20 Thread Peter Munch
> FEPointEvaluation creates an FEValues object along with a quadrature
> object under the hood.
> Closer inspection revealed that all constructors, destructors,...
> associated with FEPointEvaluation
> need roughly 5000 instructions more (per call!).
> That said, FEValues is indeed the faster approach, at least for FE_Q
> elements.

What type of Mapping are you using? If you take a look at
https://github.com/dealii/dealii/blob/ad13824e599601ee170cb2fd1c7c3099d3d5b0f7/source/matrix_free/fe_point_evaluation.cc#L40-L95
you can see when the fast path of FEPointEvaluation is taken; otherwise, the
slow path (FEValues) is used. One question: are you running in release or debug
mode?

Hope this brings us closer to the issue,
Peter


Re: [deal.II] Re: measuring cpu and wall time for assembly routine

2022-10-20 Thread Simon Wiesheier
Update:

I profiled my program with valgrind --tool=callgrind and found that
FEPointEvaluation creates an FEValues object along with a quadrature object
under the hood.
Closer inspection revealed that all constructors, destructors, etc.
associated with FEPointEvaluation
need roughly 5000 more instructions (per call!).
That is, FEValues is indeed the faster approach, at least for FE_Q
elements.

export DEAL_II_NUM_THREADS=1
eliminated the gap between cpu and wall time.
Using FEValues directly, I get a cpu time of 19.8 seconds,
and in the case of FEPointEvaluation a cpu time of 21.9 seconds;
wall times are in the same ballpark.
Out of curiosity, why does multi-threading produce such high wall times (200
seconds) in my case?

These times are far too big given that the solution of the linear system
takes only about 13 seconds.
But based on what all of you have said, there is probably no other way
to implement my problem.

Best
Simon


Re: [deal.II] Re: measuring cpu and wall time for assembly routine

2022-10-20 Thread Simon Wiesheier
Dear Martin and Wolfgang,

" You seem to be looking for FEPointEvaluation. That class is shown in
step-19 and provides, for simple FiniteElement types, a much faster way to
evaluate solutions at arbitrary points within a cell. Do you want to give
it a try? "

I implemented the FEPointEvaluation approach like this:

FEPointEvaluation<1,1> fe_eval(mapping,
                               FE_Q<1>(1),
                               update_gradients | update_values);
fe_eval.reinit(cell,
               make_array_view(std::vector<Point<1>>{ref_point_energy_vol}));
Vector<double> p_dofs(2);
cell->get_dof_values(solution_global, p_dofs);
fe_eval.evaluate(make_array_view(p_dofs),
                 EvaluationFlags::values | EvaluationFlags::gradients);
double      val  = fe_eval.get_value(0);
Tensor<1,1> grad = fe_eval.get_gradient(0);

I am using FE_Q elements of degree one and a MappingQ object also of degree
one.

Frankly, I do not really understand the measured computation times.
My program has several loadsteps with nested Newton iterations:
Loadstep 1:
Assembly 1: cpu time 12.8 sec  wall time 268.7 sec
Assembly 2: cpu time 17.7 sec  wall time 275.2 sec
Assembly 3: cpu time 22.3 sec  wall time 272.6 sec
Assembly 4: cpu time 23.8 sec  wall time 271.3 sec
Loadstep 2:
Assembly 1: cpu time 14.3 sec  wall time 260.0 sec
Assembly 2: cpu time 16.9 sec  wall time 262.1 sec
Assembly 3: cpu time 18.5 sec  wall time 270.6 sec
Assembly 4: cpu time 17.1 sec  wall time 262.2 sec
...

Using FEValues instead of FEPointEvaluation, the results are:
Loadstep 1:
Assembly 1: cpu time 23.9 sec  wall time 171.0 sec
Assembly 2: cpu time 32.5 sec  wall time 168.9 sec
Assembly 3: cpu time 33.2 sec  wall time 168.0 sec
Assembly 4: cpu time 32.7 sec  wall time 166.9 sec
Loadstep 2:
Assembly 1: cpu time 24.9 sec  wall time 168.0 sec
Assembly 2: cpu time 34.7 sec  wall time 167.3 sec
Assembly 3: cpu time 33.9 sec  wall time 167.8 sec
Assembly 4: cpu time 34.3 sec  wall time 167.7 sec
...

Clearly, the fluctuations using FEValues are smaller than in the case of
FEPointEvaluation.
Anyway, using FEPointEvaluation the cpu time is smaller but the wall time
is substantially bigger.
If I am not mistaken, the values cpu time 34.3 sec and wall time 167.7 sec
mean that
the cpu needs 34.3 sec to execute my assembly routine and has to wait during
the remaining 167.7 - 34.3 = 133.4 seconds.
This huge gap between cpu and wall time has to be related to what I do with
FEValues or FEPointEvaluation,
as cpu and wall time are nearly balanced if I use neither of them.
What might be the problem?

Best
Simon






-- 
The deal.II project is located at http://www.dealii.org/
For mailing list/forum options, see 
https://groups.google.com/d/forum/dealii?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"deal.II User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to dealii+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/dealii/CAM50jEscu%2BSVwwUd6izNn9F9F1483QR%3DfFiBFbar27ZORDpeqA%40mail.gmail.com.


Re: [deal.II] Re: measuring cpu and wall time for assembly routine

2022-10-19 Thread Wolfgang Bangerth

On 10/19/22 08:45, Simon Wiesheier wrote:


What I want to do boils down to the following:
Given the reference co-ordinates of a point 'p', along with the cell on 
which 'p' lives,
give me the value and gradient of a finite element function evaluated at 
'p'.


My idea was to create a quadrature object with 'p' being the only 
quadrature point and pass this
quadrature object to the FEValues object and finally do the 
.reinit(cell) call (then, of course, get_function_values()...)
'p' is different for all (2.5 million) quadrature points, which is why I 
create the FEValues object so many times.


It's worth pointing out that this is exactly what VectorTools::point_values() 
does.
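
As a rough sketch of that kind of interface, here is the long-standing single-point variant `VectorTools::point_value()` (and its gradient analogue); the names `dof_handler`, `solution`, and `p` are placeholders, and the exact overload should be checked against the documentation of your deal.II version:

```cpp
#include <deal.II/numerics/vector_tools.h>

// Sketch only: evaluate a scalar finite element field at an arbitrary
// point 'p' given in real-space coordinates. deal.II locates the cell
// and computes the reference coordinates internally.
const double value =
  VectorTools::point_value(dof_handler, solution, p);

// The gradient analogue:
const Tensor<1, dim> gradient =
  VectorTools::point_gradient(dof_handler, solution, p);
```

Note that these convenience functions search for the cell on every call, which is exactly the cost that becomes prohibitive when repeated millions of times.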


(As others have already mentioned, if you want to do that many many 
times over, this is too expensive and you should be using 
FEPointEvaluation instead.)


Best
 W.

--

Wolfgang Bangerth  email: bange...@colostate.edu
   www: http://www.math.colostate.edu/~bangerth/



Re: [deal.II] Re: measuring cpu and wall time for assembly routine

2022-10-19 Thread Martin Kronbichler
Dear Simon,

You seem to be looking for FEPointEvaluation. That class is shown in
step-19 and provides, for simple FiniteElement types, a much faster way to
evaluate solutions at arbitrary points within a cell. Do you want to give
it a try? The issue you are facing is that FEValues uses a very general
entry point that performs precomputations which only pay off if the same
unit points are used many times. And even with identical unit points it is
not really fast; it is a general-purpose baseline that I would not
recommend for high-performance purposes.

As a final note, I would mention that FEPointEvaluation falls back to
FEValues for complicated FiniteElement types, so you might not get a
speedup in those cases. But we could work on that if you need it; today we
know much better what to do than a few years ago.

Best,
Martin
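
To make this concrete, here is a minimal sketch of the FEPointEvaluation pattern (modeled on the usage shown in step-19; the variable names are illustrative, and the exact API should be checked against the deal.II version in use):

```cpp
#include <deal.II/matrix_free/fe_point_evaluation.h>

// Construct once, outside the loop over points: this is where the
// expensive precomputations happen.
FEPointEvaluation<1, dim> evaluator(mapping, fe,
                                    update_values | update_gradients);

std::vector<double> cell_dof_values(fe.dofs_per_cell);

for (/* each point with its cell and reference coordinates 'unit_point' */)
  {
    // Gather the cell-local solution coefficients.
    cell->get_dof_values(solution,
                         cell_dof_values.begin(),
                         cell_dof_values.end());

    // reinit() with the unit point(s) on this cell: much cheaper than
    // constructing a new FEValues object per point.
    evaluator.reinit(cell, ArrayView<const Point<dim>>(&unit_point, 1));
    evaluator.evaluate(make_array_view(cell_dof_values),
                       EvaluationFlags::values | EvaluationFlags::gradients);

    const double         value    = evaluator.get_value(0);
    const Tensor<1, dim> gradient = evaluator.get_gradient(0);
  }
```

The key design point is that the object is constructed once and only `reinit()`-ed per point, so the per-point cost is limited to the actual evaluation.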

On Wed, 19 Oct 2022, 16:45 Simon Wiesheier wrote:

> " It's an environment variable. "
>
> I did
> $DEAL_II_NUM_THREADS
> and the variable is not set.
> But if it were set to one, why would this explain the gap between cpu and
> wall time?
>
> " My point is the constructor should not be called millions of times. You
> are not going to be able to get that function 100 times faster. It's best
> to find a way to call it less often. "
>
> What I want to do boils down to the following:
> Given the reference co-ordinates of a point 'p', along with the cell on
> which 'p' lives,
> give me the value and gradient of a finite element function evaluated at
> 'p'.
>
> My idea was to create a quadrature object with 'p' being the only
> quadrature point and pass this
> quadrature object to the FEValues object and finally do the .reinit(cell)
> call (then, of course, get_function_values()...)
> 'p' is different for all (2.5 million) quadrature points, which is why I
> create the FEValues object so many times.
>
> Do you have a different suggestion to solve my problem, i.e. to evaluate the
> finite element field and its derivatives at 'p'?
>
> Best,
> Simon
>
>
> On Wed., 19 Oct. 2022 at 16:17, Bruno Turcksin <
> bruno.turck...@gmail.com> wrote:
>
>> Simon,
>>
>> On Wed., 19 Oct. 2022 at 09:33, Simon Wiesheier
>> wrote:
>>
>>> Thank you for your answer!
>>>
>>> " Did you set DEAL_II_NUM_THREADS=1?"
>>>
>>> How can I double-check that?
>>> ccmake .
>>> only shows me the variables CMAKE_BUILD_TYPE and deal.II_DIR.
>>> But I do not know if this is the right place to look.
>>>
>> It's an environment variable. If you are using bash, you can do
>>
>> export DEAL_II_NUM_THREADS=1
>>
>>
>>>
>>> " That could explain why CPU and Wall time are different. Finally, if I
>>> understand correctly, you are calling the constructor of FEValues about 2.5
>>> million times. That means that the call to one FEValues constructor is
>>> 100/2.5e6 seconds about 40 microseconds. That doesn't seem too slow. "
>>>
>>> There was a typo in my post. It should be 160/2.5e6 seconds, about 64
>>> microseconds.
>>>
>> My point is the constructor should not be called millions of times. You
>> are not going to be able to get that function 100 times faster. It's best
>> to find a way to call it less often.
>>
>> Best,
>>
>> Bruno
>>


Re: [deal.II] Re: measuring cpu and wall time for assembly routine

2022-10-19 Thread Simon Wiesheier
" It's an environment variable. "

I did
$DEAL_II_NUM_THREADS
and the variable is not set.
But if it were set to one, why would this explain the gap between cpu and
wall time?

" My point is the constructor should not be called millions of times. You
are not going to be able to get that function 100 times faster. It's best
to find a way to call it less often. "

What I want to do boils down to the following:
Given the reference co-ordinates of a point 'p', along with the cell on
which 'p' lives,
give me the value and gradient of a finite element function evaluated at
'p'.

My idea was to create a quadrature object with 'p' being the only
quadrature point and pass this
quadrature object to the FEValues object and finally do the .reinit(cell)
call (then, of course, get_function_values()...)
'p' is different for all (2.5 million) quadrature points, which is why I
create the FEValues object so many times.
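
In code, the pattern described above looks roughly like this (a sketch with placeholder names such as `p_ref`, `fe`, `cell`, and `solution`; this is the expensive variant, since the FEValues constructor runs once per point):

```cpp
// One-point quadrature rule at the reference coordinates of 'p'.
const Quadrature<dim> quadrature(p_ref);

// Constructing FEValues triggers shape-function precomputations --
// cheap once, expensive when repeated 2.5 million times.
FEValues<dim> fe_values(fe, quadrature,
                        update_values | update_gradients);
fe_values.reinit(cell);

// With a single quadrature point, each output vector has one entry.
std::vector<double>         values(1);
std::vector<Tensor<1, dim>> gradients(1);
fe_values.get_function_values(solution, values);
fe_values.get_function_gradients(solution, gradients);
```
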

Do you have a different suggestion to solve my problem, i.e. to evaluate the
finite element field and its derivatives at 'p'?

Best,
Simon


On Wed., 19 Oct. 2022 at 16:17, Bruno Turcksin <
bruno.turck...@gmail.com> wrote:

> Simon,
>
> On Wed., 19 Oct. 2022 at 09:33, Simon Wiesheier
> wrote:
>
>> Thank you for your answer!
>>
>> " Did you set DEAL_II_NUM_THREADS=1?"
>>
>> How can I double-check that?
>> ccmake .
>> only shows me the variables CMAKE_BUILD_TYPE and deal.II_DIR.
>> But I do not know if this is the right place to look.
>>
> It's an environment variable. If you are using bash, you can do
>
> export DEAL_II_NUM_THREADS=1
>
>
>>
>> " That could explain why CPU and Wall time are different. Finally, if I
>> understand correctly, you are calling the constructor of FEValues about 2.5
>> million times. That means that the call to one FEValues constructor is
>> 100/2.5e6 seconds about 40 microseconds. That doesn't seem too slow. "
>>
>> There was a typo in my post. It should be 160/2.5e6 seconds, about 64
>> microseconds.
>>
> My point is the constructor should not be called millions of times. You
> are not going to be able to get that function 100 times faster. It's best
> to find a way to call it less often.
>
> Best,
>
> Bruno
>



Re: [deal.II] Re: measuring cpu and wall time for assembly routine

2022-10-19 Thread Bruno Turcksin
Simon,

On Wed., 19 Oct. 2022 at 09:33, Simon Wiesheier wrote:

> Thank you for your answer!
>
> " Did you set DEAL_II_NUM_THREADS=1?"
>
> How can I double-check that?
> ccmake .
> only shows me the variables CMAKE_BUILD_TYPE and deal.II_DIR.
> But I do not know if this is the right place to look.
>
It's an environment variable. If you are using bash, you can do

export DEAL_II_NUM_THREADS=1
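
For checking the variable from inside the program rather than in the shell, plain C++ suffices; the helper name `deal_ii_num_threads` is made up here, and deal.II itself reads this variable at start-up:

```cpp
#include <cstdlib>
#include <string>

// Return the value of the DEAL_II_NUM_THREADS environment variable,
// or "unset" if it is not defined in the environment.
std::string deal_ii_num_threads()
{
  const char *value = std::getenv("DEAL_II_NUM_THREADS");
  return value != nullptr ? std::string(value) : std::string("unset");
}
```
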


>
> " That could explain why CPU and Wall time are different. Finally, if I
> understand correctly, you are calling the constructor of FEValues about 2.5
> million times. That means that the call to one FEValues constructor is
> 100/2.5e6 seconds about 40 microseconds. That doesn't seem too slow. "
>
> There was a typo in my post. It should be 160/2.5e6 seconds, about 64
> microseconds.
>
My point is the constructor should not be called millions of times. You are
not going to be able to get that function 100 times faster. It's best to
find a way to call it less often.

Best,

Bruno



Re: [deal.II] Re: measuring cpu and wall time for assembly routine

2022-10-19 Thread Simon Wiesheier
Thank you for your answer!

" Did you set DEAL_II_NUM_THREADS=1?"

How can I double-check that?
ccmake .
only shows me the variables CMAKE_BUILD_TYPE and deal.II_DIR.
But I do not know if this is the right place to look.

" That could explain why CPU and Wall time are different. Finally, if I
understand correctly, you are calling the constructor of FEValues about 2.5
million times. That means that the call to one FEValues constructor is
100/2.5e6 seconds about 40 microseconds. That doesn't seem too slow. "

There was a typo in my post. It should be 160/2.5e6 seconds, about 64
microseconds.

Best,
Simon

On Wed., 19 Oct. 2022 at 15:08, Bruno Turcksin <
bruno.turck...@gmail.com> wrote:

> Simon,
>
> The best way to profile a code is to use a profiler. It can give a lot
> more information than what simple timers can do. You say that your code is
> not parallelized, but by default deal.II is multithreaded. Did you set
> DEAL_II_NUM_THREADS=1? That could explain why CPU and Wall time are
> different. Finally, if I understand correctly, you are calling the
> constructor of FEValues about 2.5 million times. That means that the call
> to one FEValues constructor is 100/2.5e6 seconds about 40 microseconds.
> That doesn't seem too slow.
>
> Best,
>
> Bruno
>
> On Wednesday, October 19, 2022 at 7:51:55 AM UTC-4 Simon wrote:
>
>> Dear all,
>>
>> I implemented two different versions to compute a stress for a given
>> strain and want to compare the associated computation times in release mode.
>>
>> version 1: stress = fun1(strain)   cpu time:  4.52 s   wall time:   4.53 s
>> version 2: stress = fun2(strain)   cpu time: 32.5 s    wall time: 167.5 s
>>
>> fun1 and fun2, respectively, are invoked for all quadrature points
>> (1,286,144 in the above example) defined on the triangulation. My program
>> is not parallelized.
>> In fun2, I call find_active_cell_around_point twice for two different
>> points on two different (helper) triangulations and initialize two
>> FEValues objects with the points 'ref_point_vol' and 'ref_point_dev' as
>> returned by find_active_cell_around_point.
>> FEValues<1> fe_vol(dof_handler_vol.get_fe(),
>>                    Quadrature<1>(ref_point_vol),
>>                    update_gradients | update_values);
>> FEValues<1> fe_values_energy_dev(this->dof_handler_dev.get_fe(),
>>                                  Quadrature<1>(ref_point_dev),
>>                                  update_gradients | update_values);
>>
>> I figured out that the initialization of the two FEValues objects is the
>> biggest portion of the above-mentioned times. In particular, if I comment
>> the initialization out, I get
>> cpu time: 6.54 s, wall time: 6.55 s.
>>
>> The triangulations associated with dof_handler_vol and dof_handler_dev
>> are both 1d and store only 4 and 16 elements, respectively. That said, I am
>> wondering why the initialization takes so long (roughly 100 seconds wall
>> time in total) and why this causes a gap between the cpu and wall time.
>> Unfortunately, I have to reinitialize them whenever fun2 is called,
>> because the point 'ref_point_vol' (see Quadrature<1>(ref_point_vol)) is
>> different in each call to fun2.
>>
>> Best
>> Simon
>>
>>
>>
