Chris,

As underlined by David, the time spent in raster I/O is presumably
negligible and not the issue here. How many polygons were generated in
this run? A good way of identifying a bottleneck is to run the process
under gdb, break regularly with Ctrl+C and display the backtrace, and
repeat that a few times.
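For example (a rough sketch; adapt the pgrep pattern to however the
process shows up on your system):

    $ gdb -p $(pgrep -f gdal_polygonize.py)
    (gdb) continue
    ... let it run for a while, then hit Ctrl+C ...
    (gdb) bt
    (gdb) continue
    ... repeat the Ctrl+C / bt cycle a few times ...

The frames that keep showing up near the top of the backtraces are
where the time is going.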
Threading the algorithm is indeed a potential way of speeding things
up, but reconciling the outputs of the different workers isn't
necessarily trivial. Currently the algorithm also works in a streaming
mode with respect to the output, which is what makes it possible to
write to GeoJSON, a format that only supports streamed writes. A
multithreaded version would presumably need a temporary file to hold
the intermediate results.
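To give an idea of the machinery involved, here is a rough, untested
sketch of that tiled approach using the GDAL/OGR Python bindings. The
input filename and tile size are placeholders, "DN" is the attribute
gdal_polygonize.py creates by default, and the dissolve step, which is
the hard part, is only hinted at:

    # Untested sketch of the tile-then-dissolve idea; this is not what
    # gdal_polygonize.py currently does.
    from osgeo import gdal, ogr

    TILE = 1024

    def tile_geotransform(ds, xoff, yoff):
        # Geotransform of a window, so that tile polygons land at the
        # right georeferenced coordinates.
        gt = ds.GetGeoTransform()
        return (gt[0] + xoff * gt[1] + yoff * gt[2], gt[1], gt[2],
                gt[3] + xoff * gt[4] + yoff * gt[5], gt[4], gt[5])

    src = gdal.Open('input.tif')
    band = src.GetRasterBand(1)

    out = ogr.GetDriverByName('Memory').CreateDataSource('pieces')
    layer = out.CreateLayer('pieces', geom_type=ogr.wkbPolygon)
    layer.CreateField(ogr.FieldDefn('DN', ogr.OFTInteger))

    for yoff in range(0, src.RasterYSize, TILE):
        for xoff in range(0, src.RasterXSize, TILE):
            xsize = min(TILE, src.RasterXSize - xoff)
            ysize = min(TILE, src.RasterYSize - yoff)
            # Copy the window into an in-memory raster so that
            # Polygonize only sees this tile; each iteration is what
            # could be handed to a worker.
            mem = gdal.GetDriverByName('MEM').Create(
                '', xsize, ysize, 1, band.DataType)
            mem.SetGeoTransform(tile_geotransform(src, xoff, yoff))
            mem.GetRasterBand(1).WriteRaster(
                0, 0, xsize, ysize,
                band.ReadRaster(xoff, yoff, xsize, ysize))
            gdal.Polygonize(mem.GetRasterBand(1), None, layer, 0)

    # Missing, and the hard part: merge polygons that share an edge on
    # a tile boundary and have the same DN value, then stream the
    # merged result out, e.g. to GeoJSON.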
> Hi David,
>
> Thanks for your response. I have a little more information since
> feeding your response to the project team:
>
> "The tif file is around 1.4 GB as you noted, and the data is similar
> to the result of an image classification, where each pixel value is
> in a range between (say) 1-5. After a classification this image is
> usually exported as a vector file (EVF or Shapefile), but in this
> case we want to use GeoJSON. This has taken both Mark and myself
> weeks to complete with gdal_polygonize, as you noted.
>
> I think an obvious way to speed this up would be threading: breaking
> the tiff file into tiles (say 1024x1024) and spreading these over
> the available cores. There would then need to be a way to dissolve
> the tile boundaries to complete the polygons, as we would not want
> obvious tile lines."
>
> Does this help?
>
> Many thanks,
>
> Chris
>
> On 11 January 2015 at 18:31, David Strip <[email protected]> wrote:
> > I'm surprised at your colleague's experience. We've run polygonize
> > on some large images and have never had this problem. The
> > g2.2xlarge instance is overkill in the sense that the code is not
> > multi-threaded, so the extra CPUs don't help. Also, as you have
> > already determined, the image is read in small chunks, so you
> > don't need large buffers for the image. But two weeks makes no
> > sense. In fact, your run shows that the job reaches 5% completion
> > in a couple of hours.
> >
> > The reason for so many reads (though 2.3 seconds out of "a few
> > hours" is negligible overhead) is that the algorithm operates on a
> > pair of adjacent raster lines at a time. This allows processing of
> > extremely large images with very modest memory requirements. It's
> > been a while since I've looked at the code, but from my
> > recollection the algorithm should scale approximately linearly in
> > the number of pixels and polygons in the image. Far more important
> > to the run-time is the nature of the image itself. If the input is
> > something like a satellite photo, your output can be orders of
> > magnitude larger than the input image, as you can get a polygon
> > for nearly every pixel. If the output format is a verbose one like
> > KML or JSON, the number of bytes needed to describe each pixel is
> > large. How big was the output in your colleague's run?
> >
> > The algorithm runs in two passes. If I'm reading the code right,
> > the progress indicator is designed to show 10% at the end of the
> > first pass. You will get a better estimate of the run-time on your
> > VM by noting the elapsed time to 10%, then the elapsed time from
> > 10% to 20%.
> >
> > Also, tell us more about the image. Is it a continuous-scale
> > raster, e.g. a photo? One way to significantly reduce the output
> > size (and hence the runtime), as well as to get more meaningful
> > output in most cases, is to posterize the image into a small
> > number of colors/tones. Then run a filter to remove isolated
> > pixels or small groups of pixels. Polygonize run on this
> > pre-processed image should perform better.
> >
> > Bear in mind that the algorithm is such that the first pass will
> > be very similar in run-time for the unprocessed and pre-processed
> > image. However, the second pass is more sensitive to the number of
> > polygons and should improve for the posterized image.
> >
> > Hopefully Frank will weigh in where I've gotten it wrong or missed
> > something.
> >
> > On 1/11/2015 10:11 AM, chris snow wrote:
> >
> > I have been informed by a colleague attempting to convert a 1.4 GB
> > TIF file using gdal_polygonize.py on a g2.2xlarge Amazon instance
> > (8 vCPU, 15 GB RAM) that the processing took over 2 weeks running
> > constantly. I have also been told that the same conversion using
> > commercial tooling was completed in a few minutes.
> >
> > As a result, I'm currently investigating to see if there is an
> > opportunity for improving the performance of the
> > gdal_polygonize.py TIF to JSON conversion. I ran strace while
> > attempting the same conversion, but stopped after a few hours (the
> > gdal_polygonize.py status indicator was showing between 5% and
> > 7.5% complete). The strace results are:
> >
> > % time     seconds  usecs/call     calls    errors syscall
> > ------ ----------- ----------- --------- --------- ----------------
> >  99.40    2.348443           9    252474           read
> >    ...
> >   0.00    0.000000           0         1           set_robust_list
> > ------ ----------- ----------- --------- --------- ----------------
> > 100.00    2.362624                256268       459 total
> >
> > FYI - I performed my test inside a Vagrant VirtualBox guest with
> > 30 GB of memory and 8 CPUs assigned to the guest.
> >
> > It appears that the input TIF file is read in small pieces at a
> > time.
> >
> > I have shared the results here in case anyone else is looking at
> > optimising the performance of the conversion or already has ideas
> > where the code can be optimised.
> >
> > Best regards,
> >
> > Chris
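Regarding David's posterize + filter suggestion above: for a
classified raster like this one, the "remove isolated pixels or small
groups of pixels" step already exists as gdal_sieve.py, for example
(the threshold and connectivity values are only illustrative):

    gdal_sieve.py -st 9 -8 classified.tif sieved.tif

which replaces raster polygons smaller than 9 pixels with the value of
their largest neighbour before you polygonize.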
Even

--
Spatialys - Geospatial professional services
http://www.spatialys.com