Hi,

Chao's observations are correct.

- Yes, it is the time per execution. The backprojection filter was called 43 times with an average of 0.0275691 s, so 1.185 s in total. The 0.025 s difference with what is displayed on your command line is not completely surprising, because the timing done automatically for each filter starts a bit after, and stops a bit before, the manual timing. Note that we have removed all timing other than RTK_TIME_EACH_FILTER in the current version.

- CudaFDKConeBeamReconstructionFilter is a "mini-pipeline" (see the ITK documentation) of a few filters, as documented here: http://www.openrtk.org/Doxygen/classrtk_1_1FDKConeBeamReconstructionFilter.html. The ExtractImageFilter is missing from the drawing, but the sum of the four filters (ExtractImageFilter, FDKWeightProjectionFilter, FFTRampImageFilter and FDKBackProjectionImageFilter) gives 43 * (0.0130324 + 0.0275 + 0.0389145 + 0.0275691) = 4.6 s. That is different from the 9.6 s you observe and I'm afraid I don't know why. I also did not see FDKWeightProjectionFilter in your list; did you remove some filters from it? Filters missing from the list might explain the missing seconds.

- I'm not sure CUDAWeighting is the longest. CUDA computation is asynchronous, so the backprojection, which should be the longest, might still be finishing while the weighting filter appears to run. If you want more accurate timings, you probably need to force synchronous computation.

I hope this helps.
Simon
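P.S. In case it helps, here is a minimal sketch of what "forcing synchronous computation" could look like around a single filter. It assumes ITK's itk::TimeProbe and the CUDA runtime; the helper name TimeFilterSynchronously is made up for illustration and is not an RTK function.

#include <cuda_runtime.h>   // cudaDeviceSynchronize
#include <itkTimeProbe.h>

// Brackets one filter update between two device synchronizations so the
// host-side probe measures the GPU work itself, not just the kernel launches.
template <typename TFilter>
double TimeFilterSynchronously(TFilter * filter)
{
  itk::TimeProbe probe;
  cudaDeviceSynchronize();   // let any previously launched kernels finish first
  probe.Start();
  filter->Update();          // runs the filter; CUDA kernels are launched asynchronously
  cudaDeviceSynchronize();   // wait until this filter's kernels have completed
  probe.Stop();
  return probe.GetTotal();   // elapsed wall-clock time in seconds
}

Setting the environment variable CUDA_LAUNCH_BLOCKING=1 before running should give a similar picture without recompiling, at the cost of overall speed.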
On Fri, Jul 20, 2018 at 1:03 PM, Chao Wu <wucha...@gmail.com> wrote:
> Hi,
> By a quick look, the time reported with RTK_TIME_EACH_FILTER seems to be the time per execution of each filter. I didn't look into the code, so I have no idea whether it is an average time or the time of the last execution. In addition (not shown in your example), if one filter has more than one instance in the pipeline, the report only lists the total number of executions of all instances.
> Regards, Chao
>
> Elena Padovani <elenapadovani...@gmail.com> wrote on Thu, Jul 19, 2018 at 11:44 AM:
>
>> Hi Simon,
>> Thank you for the fast reply. I changed RTK_CUDA_PROJECTIONS_SLAB_SIZE but unfortunately nothing has changed. I also compiled with the flag RTK_TIME_EACH_FILTER on, and I did not understand why it tells me that CudaFDKBackProjectionImageFilter took 0.0275 s, CudaFDKConeBeamReconstructionFilter took 9.58 s and CudaFFTRampImageFilter took 0.0389 s, while the PrintTiming method tells me that prefilter operations took 6.65 s, the ramp filter 1.71 s and backprojection 1.21 s. So my question is: where is the remaining time spent? For instance, is (Backprojection = 1.21 s) - (CudaFDKBackProjectionImageFilter = 0.0275 s) the time needed to copy the memory from CPU to GPU? The same holds for the ramp filter.
>> Moreover, it seems to me that what is taking long is the CUDAWeighting filter, so do you think that increasing the number of threads per block, which is now { 16, 16, 2 }, could help?
>>
>> Here is what the application shows me with the -v option:
>>
>> Reconstructing and writing... It took 11.8574 s
>> FDKConeBeamReconstructionFilter timing:
>>   Prefilter operations: 6.65107 s
>>   Ramp filter: 1.71472 s
>>   Backprojection: 1.21037 s
>>
>> *********************************************************************************
>> Probe Tag                                      Starts   Stops   Time (s)
>> *********************************************************************************
>> ConstantImageSource                                 1       1   0.0962241
>> CudaCropImageFilter                                43      43   0.00230094
>> CudaFDKBackProjectionImageFilter                   43      43   0.0275691
>> CudaFDKConeBeamReconstructionFilter                 1       1   9.58291
>> CudaFFTRampImageFilter                             43      43   0.0389145
>> ExtractImageFilter                                 43      43   0.0130324
>> FFTWRealToHalfHermitianForwardFFTImageFilter       12      12   0.00128049
>> ImageFileReader                                   686     686   0.0481416
>> ImageFileWriter                                     1       1   11.8383
>> ImageSeriesReader                                 686     686   0.0484766
>> ProjectionsReader                                   1       1   44.7685
>> Self                                              129     129   0.0506474
>> StreamingImageFilter                                2       2   27.713
>> VarianObiRawImageFilter                           686     686   0.0135297
>>
>> At the beginning I was using my own application with my own data; I have now switched back to the wiki VarianReconstruction test (with a 512^3 reconstructed volume).
>>
>> Thank you again,
>> Kind Regards
>>
>> Elena
>>
>> 2018-07-18 22:00 GMT+02:00 Simon Rit <simon....@creatis.insa-lyon.fr>:
>>
>>> Hi,
>>> Thanks for sharing your results.
>>> RTK uses CUFFT for the ramp filtering, which does its own blocks/grid management. For backprojection, it's pretty simple, see https://github.com/SimonRit/RTK/blob/master/src/rtkCudaFDKBackProjectionImageFilter.cu#L198 : it is mostly hardcoded, independent of the number of CUDA cores, and could be optimized. There is one compilation parameter that you can try to change to see if that speeds up the computation: the cmake variable RTK_CUDA_PROJECTIONS_SLAB_SIZE, which controls how many projections are backprojected simultaneously.
>>> We currently don't propose any way to use multiple GPUs.
>>> Please keep us posted if you continue to do some tests. In particular, I advise turning on RTK_TIME_EACH_FILTER in cmake so that, with the -v option of the applications, you get a report on how much time your program spent in each filter.
>>> Best regards,
>>> Simon
>>>
>>> On Wed, Jul 18, 2018 at 6:48 PM, Elena Padovani <elenapadovani...@gmail.com> wrote:
>>>
>>>> Hi RTK-users,
>>>>
>>>> I compiled RTK with CUDA and tried to set up a benchmark to analyze the performance trend of GPUs when using the CUDA FDK reconstruction filter. Precisely, when reconstructing the same volume from the same data set on an NVS510, a GTX860M and a GTX970M, I got results consistent with the number of CUDA cores of the GPUs. Indeed, when setting up this benchmark I was expecting a reduction in the reconstruction time as the number of CUDA cores increases (at least as long as the dimension of the reconstructed volume is not the actual bottleneck). However, when testing it on a Tesla P100 I got performance comparable to the GTX860M. Would you expect such a result?
>>>>
>>>> Unfortunately I am new to CUDA and I was wondering if any of you could help me figure this out.
>>>> How does RTK with CUDA manage the number of blocks and the grid dimension?
>>>> Is the number of blocks / the grid dimension dependent on the GPU's CUDA cores?
>>>> Is there a way to use multiple GPUs?
>>>>
>>>> The test was carried out with the following data:
>>>> - 360 projections
>>>> - reconstructed volume of 600x700x800 px
>>>>
>>>> Thank you in advance
>>>> Kind regards
>>>>
>>>> Elena
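Regarding the blocks/grid question in the quoted thread above: a backprojection launch configuration of the kind Simon points to typically uses a fixed thread block and derives the grid from the volume size, so it does not depend on the number of CUDA cores. Below is a minimal sketch of that pattern, not the actual RTK launch code; LaunchBackProjection, volumeDim and backProjectionKernel are placeholder names.

#include <cuda_runtime.h>

// Computes a launch configuration for a hypothetical backprojection kernel:
// a fixed 16 x 16 x 2 = 512-thread block, and a grid sized to cover the whole
// reconstructed volume (volumeDim holds the volume size in voxels).
void LaunchBackProjection(const unsigned int volumeDim[3])
{
  dim3 dimBlock(16, 16, 2);
  dim3 dimGrid((volumeDim[0] + dimBlock.x - 1) / dimBlock.x,
               (volumeDim[1] + dimBlock.y - 1) / dimBlock.y,
               (volumeDim[2] + dimBlock.z - 1) / dimBlock.z);
  // backProjectionKernel<<<dimGrid, dimBlock>>>(...);  // placeholder kernel name
  // The grid scales with the volume, not with the GPU, so how well the launched
  // blocks fill a given GPU is left to the hardware scheduler.
}

If RTK_CUDA_PROJECTIONS_SLAB_SIZE is exposed as a regular CMake cache variable, it can be changed at configure time, e.g. cmake -DRTK_CUDA_PROJECTIONS_SLAB_SIZE=<N> <RTK source dir>, followed by a rebuild.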
_______________________________________________
Rtk-users mailing list
Rtk-users@public.kitware.com
https://public.kitware.com/mailman/listinfo/rtk-users