2016-11-23 15:46 GMT+01:00 [email protected] <[email protected]>:

> Thanks Thierry.
>
> Please also see that with new satellites, the resolution is ever
> increasing (e.g. Sentinel
> http://m.esa.int/Our_Activities/Observing_the_Earth/Copernicus/Overview4)
It has always been so. Anytime you reach a reasonable size, they send a
new satellite with higher res / larger images :)

> I understand the tile thing and indeed a lot of the algos work on
> tiles, but there are other ways to do this and especially with real
> time geo queries on custom defined polygons, you go only so far with
> tiles. A reason why we are using GeoTrellis backed by Accumulo in
> order to pump data very fast in random order.

But that means you're dealing with preprocessed / graph-georeferenced
data (aka openstreetmap type of data). If you're dealing with raster,
your polygons are approximated by a set of tiles (with a nice tile size
well suited to your network / disk array).

I had reasonable success a long time ago (1991, I think), for Ifremer,
with an unbalanced, sort of quadtree-based decomposition for highly
irregular curves on the seabed. Tree node size / tile size was computed
to be exactly equal to the disk block size on a very slow medium
(sketched below). That sort of work is along the lines of a geographic
index for a database: optimise query accesses to geo-referenced
objects... what is hard, and probably what you are doing, is combining
geographic queries with graph queries (give me all houses in Belgium
within a ten-minute bus + walk trip to a primary school)(*).

(*) One can work that out on a raster(**) for speed; this is what GRASS
does, for example (also sketched below).

(**) I asked a student to accelerate some raster processing on a very
small FPGA a long time ago. Once he had understood he could pipeline
the design to increase the frequency, he then discovered that the FPGA
would happily grok data faster than the computer bus could provide it
:) leaving no bandwidth for the data to be written back to memory.

> We are adding 30+ servers to the cluster at the moment just to deal
> with the sizes as there is a project mapping energy landscapes
> https://vito.be/en/land-use/land-use/energy-landscapes. This thing is
> throwing YARN containers and uses CPU, like, intensively. It is not
> uncommon for me to see their workload eating everything for a serious
> amount of CPU seconds.

Only a few seconds?

> It would be silly not to plug Pharo into all of this infrastructure I
> think.

I've had quite bad results with Pharo on compute-intensive code
recently, so I'd plan carefully how I use it. On that sort of hardware,
in the projects I'm working on, 1000x faster than Pharo on a single
node is about an expected target.

> Especially given the PhD/Postdoc/brainiacs per square meter there. If
> you have seen the Lost TV show, well, it kind of feels like that,
> working at that place. Especially given that it is kind of hidden in
> the woods.
>
> Maybe you could have interesting interactions with them. These guys
> also have their own nuclear reactor and geothermal drilling.

I'd be interested, because we're working a bit on high performance
parallel runtimes and compilation for those. Would you be ready, one
day, to talk about it at our place? South of Paris, not too hard to
reach by public transport :)
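Roughly, the 1991 decomposition went like this (a from-memory sketch in
Python, not the original code; the 4 KiB block size and the point
layout are made-up assumptions):

# Unbalanced quadtree where each leaf's payload fits one disk block,
# so reading a leaf costs exactly one I/O on a slow medium.
BLOCK_SIZE = 4096                     # assumed bytes per disk block
POINT_SIZE = 16                       # two float64 coordinates
LEAF_CAPACITY = BLOCK_SIZE // POINT_SIZE
MAX_DEPTH = 32                        # guard against degenerate input

class QuadNode:
    def __init__(self, x0, y0, x1, y1, points, depth=0):
        self.bounds = (x0, y0, x1, y1)
        self.points = points          # payload, kept only in leaves
        self.children = []
        if len(points) > LEAF_CAPACITY and depth < MAX_DEPTH:
            self._split(depth)

    def _split(self, depth):
        x0, y0, x1, y1 = self.bounds
        xm, ym = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        for qx0, qy0, qx1, qy1 in ((x0, y0, xm, ym), (xm, y0, x1, ym),
                                   (x0, ym, xm, y1), (xm, ym, x1, y1)):
            inside = [(x, y) for (x, y) in self.points
                      if qx0 <= x < qx1 and qy0 <= y < qy1]
            # Children are created only where the curve passes, which
            # is what makes the tree unbalanced for irregular data.
            # (Points on the outer max edge are ignored for brevity.)
            if inside:
                self.children.append(
                    QuadNode(qx0, qy0, qx1, qy1, inside, depth + 1))
        self.points = []              # interior nodes carry no payload

Leaf size = block size means a range query costs one read per leaf it
touches, which was the whole point on that slow medium.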
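And the raster trick of (*) is essentially a cost-distance surface
grown from the school cells, in the spirit of GRASS's r.cost; a minimal
sketch, where the grid, the per-cell costs and the ten-minute threshold
are all invented for illustration:

import heapq
import numpy as np

cost = np.ones((200, 200))            # minutes to cross each cell
schools = [(50, 50), (120, 160)]      # seed cells (assumed)
houses = [(52, 55), (10, 10)]         # cells to test (assumed)

travel = np.full(cost.shape, np.inf)
pq = []
for r, c in schools:
    travel[r, c] = 0.0
    heapq.heappush(pq, (0.0, r, c))

# Dijkstra over grid cells: grow travel time outward from all schools.
while pq:
    t, r, c = heapq.heappop(pq)
    if t > travel[r, c]:
        continue                      # stale queue entry
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < travel.shape[0] and 0 <= nc < travel.shape[1]:
            nt = t + cost[nr, nc]
            if nt < travel[nr, nc]:
                travel[nr, nc] = nt
                heapq.heappush(pq, (nt, nr, nc))

# The graph-ish query degenerates to a per-cell threshold test:
print([h for h in houses if travel[h] <= 10.0])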
Thierry

> Phil

> On Wed, Nov 23, 2016 at 1:30 PM, Thierry Goubier
> <[email protected]> wrote:
>
>> Hi Phil,
>>
>> 2016-11-23 12:17 GMT+01:00 [email protected]
>> <[email protected]>:
>>
>>> [...]
>>>
>>> It is really important to have such features to avoid massive GC
>>> pauses.
>>>
>>> My use case is to load the data sets from here.
>>> http://proba-v.vgt.vito.be/sites/default/files/Product_User_Manual.pdf
>>
>> I've used that type of data before, a long time ago.
>>
>> I consider that tiled / on-demand block loading is the way to go for
>> those. Work with the header as long as possible, stream tiles if you
>> need to work on the full data set. There is a good chance that:
>>
>> 1- You're memory bound for anything you compute with them
>> 2- I/O times dominate, or become low enough not to care (very fast
>>    SSDs)
>> 3- It's very rare that you need full random access on the complete
>>    array
>> 4- GC doesn't matter
>>
>> Stream computing is your solution! This is how raster GIS are
>> implemented (a small sketch at the end of this mail).
>>
>> What is hard for me is manipulating a very large graph, or a sparse
>> very large structure, like a huge Famix model or an FPGA layout
>> model with a full design laid out on top. There, you're randomly
>> accessing the whole of the structure (or at least you see no obvious
>> partition) and the structure is too large for the memory or the GC.
>>
>> This is why, a long time ago, I had this idea of an in-memory
>> working set / on-disk full structure with automatic determination of
>> what the working set is.
>>
>> For pointers, have a look at the Graph500 and HPCG benchmarks,
>> especially the efficiency (ratio to peak) of HPCG runs, to see how
>> difficult these cases are.
>>
>> Regards,
>>
>> Thierry
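PS: the tile streaming recommended in the quoted mail above, as a
minimal sketch. The file name, dtype and dimensions are invented, and a
real PROBA-V product would be read through its own format reader, but
the access pattern is the point:

import numpy as np

ROWS, COLS, TILE = 40320, 40320, 1024   # invented raster dimensions

# Memory-mapped file: only the blocks actually touched get paged in.
data = np.memmap("proba_v_band.raw", dtype=np.int16,
                 mode="r", shape=(ROWS, COLS))

def tiles(rows, cols, tile):
    # Yield (row_slice, col_slice) windows covering the raster.
    for r in range(0, rows, tile):
        for c in range(0, cols, tile):
            yield (slice(r, min(r + tile, rows)),
                   slice(c, min(c + tile, cols)))

# Stream a global statistic without ever materialising the full array:
# each step touches one tile, the working set stays tiny, and the GC
# never sees a multi-gigabyte object.
total, count = 0.0, 0
for rs, cs in tiles(ROWS, COLS, TILE):
    block = np.asarray(data[rs, cs], dtype=np.float64)
    total += block.sum()
    count += block.size
print("mean =", total / count)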
