Hello, I was wondering whether the current raster library API could be extended to read/write a given number n of contiguous rows into a buffer, instead of a single row at a time. This may be of help for parallelization...
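
A minimal sketch of what such a multi-row call could look like, under stated assumptions. Everything here is hypothetical and for illustration only: `get_rows`, the `row_reader` callback type, and the in-memory `demo_map` are not existing raster library names. The callback stands in for whatever single-row get-row function the library provides, and the wrapper simply calls it once per row.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single-row reader: fills buf with ncols cells of row `row`
 * and returns 0 on success. It stands in for the library's existing
 * one-row get-row call. */
typedef int (*row_reader)(void *map, double *buf, int row, int ncols);

/* Hypothetical multi-row call: reads up to nrows contiguous rows, starting
 * at first_row, into buf (row-major, room for nrows * ncols cells).
 * Returns the number of rows actually read (clipped at the bottom of the
 * map), or -1 on error. */
int get_rows(row_reader read_row, void *map, double *buf,
             int first_row, int nrows, int map_rows, int ncols)
{
    int i, n;

    assert(read_row != NULL && buf != NULL);

    if (first_row < 0 || first_row >= map_rows || nrows <= 0)
        return -1;

    /* Clip the request so it never runs past the last row. */
    n = nrows;
    if (first_row + n > map_rows)
        n = map_rows - first_row;

    /* One single-row read per row, each into its own slice of buf. */
    for (i = 0; i < n; i++) {
        if (read_row(map, buf + (size_t)i * ncols, first_row + i, ncols) != 0)
            return -1;
    }
    return n;
}

/* Assumed in-memory stand-in for an open raster map (demo only). */
struct demo_map {
    int rows, cols;
    const double *cells; /* row-major cell values */
};

/* Demo reader: copies one row out of the in-memory map. */
static int demo_read_row(void *map, double *buf, int row, int ncols)
{
    const struct demo_map *m = map;
    int c;

    if (row < 0 || row >= m->rows || ncols != m->cols)
        return -1;
    for (c = 0; c < ncols; c++)
        buf[c] = m->cells[(size_t)row * m->cols + c];
    return 0;
}
```

A caller would invoke `get_rows` in a loop, advancing `first_row` by the returned count, and could then hand each filled buffer to worker threads. The reads themselves stay sequential in this sketch, so it does not address the row-caching and thread-safety concerns raised in the quoted thread.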

Regards,
Yann

On 4 April 2010 11:22, Jordan Neumeyer <[email protected]> wrote:
>
> On Thu, Apr 1, 2010 at 3:24 PM, Glynn Clements <[email protected]>
> wrote:
>>
>> Jordan Neumeyer wrote:
>>
>> > > > Just kind of my thought process about how I would try to go about
>> > > > parallelizing a module.
>> > >
>> > > The main issue with parallelising raster input is that the library
>> > > keeps a copy of the current row's data, so that consecutive reads of
>> > > the same row (as happen when upsampling) don't re-read the data.
>> > >
>> > > For concurrent access to a single map, you would need to either keep
>> > > one row per thread, or abandon caching. Also, you would need to use
>> > > pread() rather than lseek()+read().
>> >
>> > It sounds like you're talking about parallelism in I/O from a file or
>> > database. Neither of which is my intent or goal for this project. I will
>> > parallelize things after they have already been read into memory, and
>> > tasks are processor intensive. I wouldn't want to parallelize any I/O,
>> > but if I were to optimize I/O, I would make all I/O operations
>> > asynchronous, which can mimic parallelism in a sense. Queuing up the
>> > chunks of data and then processing them as resources become available.
>>
>> Most GRASS raster modules process data row-by-row, rather than reading
>> entire maps into memory. Reading maps into memory is frowned upon, as
>> GRASS is regularly used with maps which are too large to fit into
>> memory. Where the algorithm cannot operate row-by-row, use of a tile
>> cache is the next best alternative; see e.g. r.proj.seg (renamed to
>> r.proj in 7.0).
>
> That makes more sense. So a row is like a chunk from the map data? Kind of
> like the first row of pixels from an image. So from the first pixel to the
> width of the image is one row, then width plus one starts the next, and so
> on and so forth. How large are the rows generally?
>
>> Holding an entire map in memory is only considered acceptable if the
>> algorithm is inherently so slow that processing a gigabyte-sized map
>> simply wouldn't be feasible, or the access pattern is such that even a
>> tile-cache approach isn't feasible.
>>
>> In general, GRASS should be able to process multi-gigabyte maps even
>> on 32-bit systems, and work on multi-user systems where a process
>> cannot assume that it can use a significant proportion of the system's
>> total physical memory.
>
> Which is good. I didn't realize how big the data set could be. What's the
> biggest map you've seen?
>
>> > > It's more straightforward to read multiple maps concurrently. In 7.0,
>> > > this case should be thread-safe.
>> > >
>> > > Alternatively, you could have one thread for reading, one for writing,
>> > > and multiple worker threads for the actual processing. However, unless
>> > > the processing is complex, I/O will be the bottleneck.
>> > >
>> >
>> > I/O is generally a bottleneck anyway. Something always tends to be
>> > waiting on another.
>>
>> When I refer to I/O, I'm referring not just to read() and write(), but
>> also the (de)compression, conversion and resampling, i.e. everything
>> performed by the get/put-row functions. For many GRASS modules, this
>> takes more time than the actual processing.
>
> I can see why, especially for big maps since it's doing that row-by-row.
> So when a GRASS module loads a map the basic algorithm looks something like:
> 1) Read row
> 2) get-row function does necessary preprocessing
> 3) row is cached or held in memory. Does the caching take place after
> 4) row is processed
> 5) Display/write process? (Or is this after a couple of iterations, all of
>    them?)
> 6) repeat (1)
>
> Would it be beneficial/practical to parallelize some of the preprocessing
> like conversion and resampling before the caching occurs?
>
>> Finally, the thread title refers to libraries. Very little processing
>> occurs in the libraries; most of it is in the individual modules. So
>> there isn't much scope for "parallelising" the libraries. The main
>> issue for library functions is to ensure that they are thread-safe.
>> Most of the necessary work for the raster library has been done in
>> 7.0.
>
> I was trying to refer to all of the raster modules as a whole, but library
> is just what the modules share. I've changed the title from Parallelization
> of Raster and Vector libraries to Parallelization of Raster and Vector
> modules.
>
> Would I be working on GRASS 6.x or 7.x? Is there a minimum compiler version
> when using GCC/MinGW? Just curious because OpenMP tasks are only supported
> on GCC >= 4.4 (OpenMP support itself starts at GCC 4.2). Which may or may
> not be useful, but can be a valuable tool when you don't know how much data
> or how many "tasks" you have. Like processing a linked list or binary trees.
>
>> --
>> Glynn Clements <[email protected]>
>
> ~Jordan
>
> _______________________________________________
> grass-dev mailing list
> [email protected]
> http://lists.osgeo.org/mailman/listinfo/grass-dev
>

--
Yann Chemin
Senior Spatial Hydrologist
www.csu.edu.au/research/icwater
M +61-4-3740 7019
