Hi Phil,

2016-11-23 12:17 GMT+01:00 [email protected] <
[email protected]>:

> [ ...]
>
> It is really important to have such features to avoid massive GC pauses.
>
> My use case is to load the data sets from here.
> http://proba-v.vgt.vito.be/sites/default/files/Product_User_Manual.pdf
>
I've used that type of data before, a long time ago.

I consider tiled / on-demand block loading to be the way to go for those.
Work with the header as long as possible, and stream tiles if you need to
work on the full data set. There is a good chance that:

1- You're memory bound for anything you compute with them
2- I/O time either dominates or becomes low enough not to matter (very fast SSDs)
3- It's very rare that you need full random access to the complete array
4- GC doesn't matter

Stream computing is your solution! This is how raster GIS systems are
implemented.

What is hard for me is manipulating a very large graph, or a very large
sparse structure, like a huge Famix model or an FPGA layout model with a
full design laid out on top. There, you're randomly accessing the whole
structure (or at least you see no obvious partition), and the structure is
too large for memory or for the GC.

This is why, a long time ago, I had the idea of an in-memory working set
over an on-disk full structure, with automatic determination of what the
working set is.

For pointers, have a look at the Graph500 and HPCG benchmarks, especially
the efficiency (ratio to peak) of HPCG runs, to see how difficult these
cases are.

Regards,

Thierry
