Hi Dennis,

On 8 December 2011 00:40, Dennis Frostlander <[email protected]> wrote:
> Hi,
>
> I am using the IndexedObjectStore for storing and then accessing large
> amounts of data - around 300 million objects.
> In the maps I am storing Longs as keys and simple objects with a few
> properties as values.
> The maps are backed in the file system by 5 files ranging in size from
> 3 GB to 11 GB.
>
> When I start accessing the data from the collections, I am experiencing
> quite slow performance - just enumerating all objects in the collection
> takes around 15 hours on a 7200 rpm hard drive, with 10 GB of memory
> available to the Java VM. The Java VM runs in server mode.
>

The IndexedObjectStore uses a very simple on-disk layout that tends to
result in very high levels of disk seeking. It doesn't scale to large
datasets very effectively.

> I can see that the machine resources - CPU, hard drives - are utilized
> very little; the respective performance counters are close to minimal.
>

Are you using Windows? Which counters are you monitoring? Disk throughput
will look minimal, because with heavy seeking the time goes into head
movement rather than data transfer. I forget the exact names of the
counters off the top of my head, but you need to look for counters like
CPU Wait Time and Disk Queue Length. CPU Wait Time is fairly easy to
understand: if it shows a high percentage, then disk IO is your bottleneck.

> I have tried to perform multi-threaded reads - in each thread I create a
> separate indexed store reader. But the result is similar - the benefit is
> very small.
>

If disk seeking is the issue, more threads are unlikely to improve
performance and may in fact make it worse.

> Could anyone give me any suggestions on how I can improve the data access
> and utilize the machine resources more efficiently?
>

There's no simple answer to this because it depends to a large extent on
your data access patterns. About the only suggestion I have is to start
looking at using a proper database instead. To get good performance out of
a database you need to ensure that data is organised according to your
access patterns. One typically effective way to achieve this is to create a
column using a PostGIS geometry type, add a GiST index on that column, and
then cluster the table by that index. That will organise the table contents
using the same ordering as your index, which has the effect of grouping
geographically close objects together on disk. There's a rough sketch of
what I mean in the P.S. at the end of this mail. Hope that makes sense.

> Yours sincerely,
> Dennis Frostlander
>
> P.S. On a related topic, I noticed that when the Java process runs in
> debug mode with a debugger attached (either IntelliJ IDEA or Eclipse),
> the read operations are an order of magnitude slower. Not really sure
> why though...
>

Java debuggers add a lot of overhead to execution. There's not much you can
do about it. If you're trying to detect bottlenecks in code you need to use
a profiler with probes targeted at specific code points, or rely on logging.

Brett
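P.S. To make the clustering suggestion above a little more concrete, here's
a rough sketch of the kind of thing I mean. The table and column names
(nodes, geom) are just placeholders; the actual schema will depend on the
objects you're storing.

    -- placeholder table holding the objects, keyed by id
    CREATE TABLE nodes (
        id bigint PRIMARY KEY
    );

    -- add a PostGIS geometry column (here: 2D points in SRID 4326)
    SELECT AddGeometryColumn('nodes', 'geom', 4326, 'POINT', 2);

    -- spatial (GiST) index on the geometry column
    CREATE INDEX nodes_geom_idx ON nodes USING GIST (geom);

    -- rewrite the table in index order so geographically close rows
    -- end up physically close together on disk
    CLUSTER nodes USING nodes_geom_idx;

    -- refresh planner statistics after the rewrite
    ANALYZE nodes;

Note that CLUSTER is a one-off rewrite of the table; rows loaded afterwards
aren't kept in clustered order, so you'd re-run it after large data loads.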
_______________________________________________
osmosis-dev mailing list
[email protected]
http://lists.openstreetmap.org/listinfo/osmosis-dev
