Re: [Rtk-users] Slow CUDA FDK performance

Moritz Schaar Thu, 18 Nov 2021 06:10:30 -0800

Hi Simon,

Thank you for looking into it!
So there is no general issue, which is nice to know.
Sadly, I cannot run the old version, as I end up with the following error:
“ImportError: DLL load failed while importing _RTKPython”
ITK works, RTK doesn’t. Rebuilding doesn’t work either, as too many things 
have changed (CUDA, VS, ...).


In the meantime I figured out that modifying “m_ProjectionSubsetSize” helps to 
accelerate everything.
Checking this with “feldkamp.GetProjectionSubsetSize()”, I get a value of “2” 
for both the CPU and the CUDA version of the FDKConeBeamReconstructionFilter.
With your test code I, unsurprisingly, get very slow execution times with this 
value. Increasing the subset size to 200, I end up with timings similar to the 
ones you reported.
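
As a rough illustration of why the subset size matters: assuming the filter 
works through the projection stack in consecutive chunks of 
ProjectionSubsetSize projections (an assumption about the pipeline, although 
the 13 starts per CUDA filter in the probe log further down the thread are 
consistent with 200 projections in chunks of 16), the number of passes can be 
estimated as below. The helper name subset_passes is made up for this sketch.

```python
import math


def subset_passes(number_of_projections: int, subset_size: int) -> int:
    """Estimate how many chunks a projection stack is split into.

    Assumes consecutive chunks of `subset_size` projections, with a
    smaller final chunk if the stack size is not a multiple of it.
    """
    if subset_size < 1:
        raise ValueError("subset_size must be at least 1")
    return math.ceil(number_of_projections / subset_size)


# 200 projections, as in the test script and the benchmark dataset
for size in (2, 16, 200):
    print(size, subset_passes(200, size))
# 2 100
# 16 13
# 200 1
```

In the Python wrapping the value would presumably be changed with 
feldkamp.SetProjectionSubsetSize(200) before Update(); this is taken to be the 
setter counterpart of GetProjectionSubsetSize() and has not been checked 
against this exact RTK build.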

Now I am wondering where these defaults come from.
For CPU I assume that this comes from missing FFTW as given here 
https://github.com/SimonRit/RTK/blob/46ea6e190965bcc89421075830e365063aa8c51a/include/rtkFDKConeBeamReconstructionFilter.hxx#L48
However, I do not understand why the CUDA version defaults to 2 instead of 16.
In CMake I kept the default: RTK_CUDA_PROJECTIONS_SLAB_SIZE=16
From https://github.com/SimonRit/RTK/blob/master/rtkConfiguration.h.in#L35 I 
assume that this gets copied to SLAB_SIZE which will be used by all CUDA codes.
My rtkConfiguration.h also reflects this:
#ifndef SLAB_SIZE
#  define SLAB_SIZE 16
#endif
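
One way to double-check which value a given rtkConfiguration.h really defines 
is to parse it directly. This is only a sketch: the helper name read_define is 
made up, the sample text is the snippet above, and the location of the header 
depends on the build or install tree, so no path is assumed here.

```python
import re


def read_define(header_text: str, name: str):
    """Return the integer value of the first '#define NAME <int>' line, or None."""
    pattern = re.compile(
        r"^[ \t]*#[ \t]*define[ \t]+" + re.escape(name) + r"[ \t]+(\d+)[ \t]*$",
        re.MULTILINE,
    )
    match = pattern.search(header_text)
    return int(match.group(1)) if match else None


# Sample text copied from the header snippet quoted in this mail
sample = """#ifndef SLAB_SIZE
#  define SLAB_SIZE 16
#endif
"""
print(read_define(sample, "SLAB_SIZE"))  # 16
```

This only reports the compile-time SLAB_SIZE; it says nothing about the 
runtime value returned by GetProjectionSubsetSize().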

Do you have an explanation why this gets reduced from 16 to 2? Or a hint where 
I can have a look?

Best,
Moritz


From: Simon Rit <simon....@creatis.insa-lyon.fr>
Sent: Thursday, 18 November 2021 12:13
To: Moritz Schaar <sch...@imt.uni-luebeck.de>
Cc: rtk-users@public.kitware.com
Subject: Re: [Rtk-users] Slow CUDA FDK performance

Hi,
I compiled the Python packages with exactly the same configurations and I can't 
reproduce the issue:
old: CUDA 10.2, ITK 5.1.2, RTK 2.1.0 -> 0.9 s
0.019904613494873047
0.6475656032562256
Reconstructing...
0.9730124473571777

new: CUDA 11.5, ITK 5.2.1, RTK 2.3.0
0.017342329025268555
0.7650339603424072
Reconstructing...
0.8823671340942383

The code I ran is the following:
#!/usr/bin/env python
import sys
import itk
import time
from itk import RTK as rtk

if len ( sys.argv ) < 3:
  print( "Usage: FirstReconstruction <outputimage> <outputgeometry>" )
  sys.exit ( 1 )

# Defines the image type
GPUImageType = rtk.CudaImage[itk.F,3]
CPUImageType = rtk.Image[itk.F,3]

# Defines the RTK geometry object
geometry = rtk.ThreeDCircularProjectionGeometry.New()
numberOfProjections = 200
firstAngle = 0.
angularArc = 360.
sid = 600 # source to isocenter distance
sdd = 1200 # source to detector distance
for x in range(0,numberOfProjections):
  angle = firstAngle + x * angularArc / numberOfProjections
  geometry.AddProjection(sid,sdd,angle)

# Writing the geometry to disk
xmlWriter = rtk.ThreeDCircularProjectionGeometryXMLFileWriter.New()
xmlWriter.SetFilename ( sys.argv[2] )
xmlWriter.SetObject ( geometry )
xmlWriter.WriteFile()

# Create a stack of empty projection images
ConstantImageSourceType = rtk.ConstantImageSource[GPUImageType]
constantImageSource = ConstantImageSourceType.New()
origin = [ -127.75, -127.75, 0. ]
sizeOutput = [ 512, 512,  numberOfProjections ]
spacing = [ 0.5, 0.5, 0.5 ]
constantImageSource.SetOrigin( origin )
constantImageSource.SetSpacing( spacing )
constantImageSource.SetSize( sizeOutput )
constantImageSource.SetConstant(0.)

REIType = rtk.RayEllipsoidIntersectionImageFilter[CPUImageType, CPUImageType]
rei = REIType.New()
semiprincipalaxis = [ 50, 50, 50]
center = [ 0, 0, 10]
# Set GrayScale value, axes, center...
rei.SetDensity(2)
rei.SetAngle(0)
rei.SetCenter(center)
rei.SetAxis(semiprincipalaxis)
rei.SetGeometry( geometry )
rei.SetInput(constantImageSource.GetOutput())

# Create reconstructed image
constantImageSource2 = ConstantImageSourceType.New()
sizeOutput = [ 256 ] * 3
origin = [ -63.75 ] * 3
spacing = [ 0.5 ] *  3
constantImageSource2.SetOrigin( origin )
constantImageSource2.SetSpacing( spacing )
constantImageSource2.SetSize( sizeOutput )
constantImageSource2.SetConstant(0.)
t0 = time.time()
constantImageSource2.Update()
t1 = time.time()
print(t1-t0)

# Graft the projections to an itk::CudaImage
projections = GPUImageType.New()
t0 = time.time()
rei.Update()
t1 = time.time()
print(t1-t0)
projections.SetPixelContainer(rei.GetOutput().GetPixelContainer())
projections.CopyInformation(rei.GetOutput())
projections.SetBufferedRegion(rei.GetOutput().GetBufferedRegion())
projections.SetRequestedRegion(rei.GetOutput().GetRequestedRegion())

# FDK reconstruction
print("Reconstructing...")
FDKGPUType = rtk.CudaFDKConeBeamReconstructionFilter
feldkamp = FDKGPUType.New()
feldkamp.SetInput(0, constantImageSource2.GetOutput())
feldkamp.SetInput(1, projections)
feldkamp.SetGeometry(geometry)
feldkamp.GetRampFilter().SetTruncationCorrection(0.0)
feldkamp.GetRampFilter().SetHannCutFrequency(0.0)
t0 = time.time()
feldkamp.Update()
t1 = time.time()
print(t1-t0)
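
The repeated t0/t1 pattern in the script above could also be wrapped in a small 
context manager. This is a sketch using only the Python standard library; the 
name timed and the optional results dictionary are made up for illustration, 
and time.perf_counter() is swapped in for time.time() because it is a monotonic 
clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print (and optionally store) the wall-clock time of the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.3f} s")


# Stand-in workload; in the script this would be e.g. feldkamp.Update()
timings = {}
with timed("sleep", timings):
    time.sleep(0.05)
```

In the script this would read: with timed("Reconstruction"): feldkamp.Update()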

To be honest, I don't see what to do at this stage... Can you maybe check the 
same code with your two versions? Any other suggestions?
Simon

On Wed, Nov 10, 2021 at 10:03 AM Moritz Schaar <sch...@imt.uni-luebeck.de> 
wrote:
Hi Simon,

I completely agree that this is hard to track down. That’s why I am asking for 
directions ☺
To be more precise about the execution times of my example:
The timings given in pairs of 17.1/1.2 s and 19/7 s are only the required times 
of the reconstruction step itself.
Reading data, pre and post processing are not part of this time measurement.
So the 7 s average in python is similar to the 6.41 s I obtained from adding 
everything done in CudaFDKConeBeamReconstructionFilter using 
RTK_PROBE_EACH_FILTER.
The reconstruction step in python simply involves:

- Instantiation of a simple class, this doesn’t add anything to the timings
- Setting up ConstantImageSource with either rtk.Image or rtk.CudaImage
- Setting up FDKConeBeamReconstructionFilter/CudaFDKConeBeamReconstructionFilter
- Setting inputs, geometry and filter
- Update() and return result

Looks like there was a typo in my mail, the versions compared should be:
old: CUDA 10.2, ITK 5.1.2, RTK 2.1.0
new: CUDA 11.5, ITK 5.2.1, RTK 2.3.0

Sorry for the confusion and thanks for looking into it!

Best,
Moritz


From: Simon Rit <simon....@creatis.insa-lyon.fr>
Sent: Wednesday, 10 November 2021 09:32
To: Moritz Schaar <sch...@imt.uni-luebeck.de>
Cc: rtk-users@public.kitware.com
Subject: Re: [Rtk-users] Slow CUDA FDK performance

Hi Moritz,
Thanks for the report. It's a bit hard to be convinced that something is wrong 
without being able to reproduce it. From the RTK_PROBE_EACH_FILTER log, most of 
the time is spent reading the projections, which will be the same with or 
without CUDA, so I wonder if this is not the issue here. I can try to reproduce 
the issue; can you just confirm the two configurations: CUDA 10.2, ITK 5.2.1, 
RTK 2.1.0 vs CUDA 11.5, ITK 5.2.1, RTK 2.3.0?
Thanks,
Simon

On Fri, Nov 5, 2021 at 4:20 PM Moritz Schaar <sch...@imt.uni-luebeck.de> wrote:
Hi,

I recently upgraded my Windows 10 system to ITK 5.2.1 including RTK 2.3.0.
This also involved upgrading CUDA from 10.2 to 11.5, Visual Studio 2019 and 
even a Python update (3.8.5 to 3.8.12).
Using the Python wrapping of RTK, I implemented my own routines that use FDK 
similar to the rtkfdk application.
On the old system (ITK 5.2.1, RTK 2.1.0) I benchmarked the FDK for a 
512x512x200 dataset reconstructed into 256x256x256 with 1.0 mm isotropic voxel 
size.
The system is equipped with 24 CPU cores and one RTX 2080 Ti; the CPU 
version took 17.1 seconds and the CUDA version 1.2 seconds.
Running the new software version on the same system results in roughly 19 s CPU 
time but more than 7 s for the CUDA version.
I don’t care about the actual timings, but the relative increase of the CUDA 
version is what bothers me.
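
To put numbers on the relative increase, the slowdown factors can be computed 
from the timings quoted above. A minimal sketch; the helper name slowdown is 
made up, and the new timings are the rounded "roughly 19 s" and "more than 7 s" 
figures, so the CUDA factor is a lower bound.

```python
def slowdown(old_seconds: float, new_seconds: float) -> float:
    """Return how many times slower the new run is compared to the old one."""
    if old_seconds <= 0:
        raise ValueError("old_seconds must be positive")
    return new_seconds / old_seconds


# (old, new) reconstruction times in seconds, as quoted in this mail
reported = {"CPU": (17.1, 19.0), "CUDA": (1.2, 7.0)}
for name, (old, new) in reported.items():
    print(f"{name}: {slowdown(old, new):.1f}x")
# CPU: 1.1x
# CUDA: 5.8x
```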

To dig up some more information I recompiled RTK with RTK_PROBE_EACH_FILTER and 
ran rtkfdk.exe for the same data, this is what I got:
**************************************************************************************************************
Probe Tag                                    Starts    Stops     Time (s)       Memory (kB)    Cuda memory (kB)
**************************************************************************************************************
ChangeInformationImageFilter                 200       200       0.0211846      0              0
ConstantImageSource                          1         1         0.0305991      65668          0
CudaCropImageFilter                          13        13        0.0222911      15786.8        15753.8
CudaDisplacedDetectorImageFilter             13        13        0.0540568      10719.1        16384
CudaFDKBackProjectionImageFilter             13        13        0.0326397      5051.38        5041.23
CudaFDKConeBeamReconstructionFilter          1         1         5.72999        552184         211648
CudaFDKWeightProjectionFilter                13        13        0.0262806      -13892         630.154
CudaFFTRampImageFilter                       13        13        0.148416       43095.4        12499.7
CudaParkerShortScanImageFilter               13        13        0.0467202      2525.85        15753.8
ExtractImageFilter                           13        13        0.0259726      15812.3        -15753.8
ImageFileReader                              200       200       0.0226735      -0.16          0
ImageSeriesReader                            200       200       0.066097       6.12           0
ProjectionsReader                            1         1         26.0388        208488         0
StreamingImageFilter                         2         2         16.0663        547512         191840
VnlRealToHalfHermitianForwardFFTImageFilter  2         2         0.0208174      0              0

Following the conversation on the mailing list, 
https://public.kitware.com/pipermail/rtk-users/2018-July/010617.html, I see 
that the CudaFDKConeBeamReconstructionFilter takes 6.41 s of which roughly 1/3 
is spent in the CudaFFTRampImageFilter.
Sadly I don’t have these results for the old software version, so I can’t 
compare these values.

I also played around with v2.2.0, but it doesn’t make a difference.
Sadly, the version I used before (v2.1.0) won’t compile with CUDA 11.5 anymore. 
I tried small adjustments, e.g. this commit 
https://github.com/SimonRit/RTK/commit/3d3c7506087f5fa98aee75df5af5c30e7e51cbe6, 
to make things work, but without success.
Other errors come up when trying to set up ITK 5.1.2, so getting 
back the old version for comparison seems impossible.

Is there any direction you can point me to check what is actually the issue 
here? Or maybe someone has an idea what could be the reason? CUDA/RTK/ITK 
version?
Any help is appreciated.

Best,
Moritz

_______________________________________________
Rtk-users mailing list
Rtk-users@public.kitware.com
https://public.kitware.com/mailman/listinfo/rtk-users
