Incidentally, we had a little chat with Marcus yesterday about that. No, I don't think it is feasible to use a single image to store everything. Sure, it is convenient, it is cheap, and of course it is way better than having to communicate with an external DB, servers, or whatever.
But there's one thing you should already know: the days of vertical growth are over. Running a service (under a VM or not) on a single machine is asking for trouble:
 - limits on load
 - susceptibility to power outages and other reliability problems
 - etc.

Also, consider that the more data you need to process, the more CPU horsepower you need. Which means that yes, you can run a huge image with 64 GB of data in it, but the responsiveness of your service will quite often fall below any usability limit.

If we look at it in terms of the VM and pick just one thing, garbage collection, you will see that there are certain limits beyond which performance drops too much, so you will naturally start thinking about ways to split the data into separate chunks and run them on different machines/VMs. That is because the GC's mark algorithm is O(n), where n is the total number of references between objects, and the GC's scavenge algorithm is at best O(n), where n is the total number of objects in object memory, and at worst O(n), where n is the total memory used by objects. However you turn it, the point is that the time to run GC depends linearly on the amount of data.

Yes, we might invest a lot of effort in making the GC more clever, more complex and more robust, but no matter what you do, you cannot change the above facts. Any improvements will bring diminishing returns and won't change the picture radically.

That means that sooner or later you will have to deal with it: the problem of splitting data into multiple independent chunks and making your service run on multiple machines, in order to use more CPU power and more memory, be more reliable, etc. At that point, your main dilemma is to invent fast and robust interfaces for communication between images, or between image(s) and a database, etc.
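To make the linear-cost claim about the mark phase concrete, here is a toy sketch in Python. It is not how any real VM's GC is written; the `mark` and `chain_heap` names and the dict-based "heap" are made up purely for illustration. It only counts how many references a simple mark pass has to follow.

```python
from collections import deque


def mark(roots, refs):
    """Toy mark phase: visit everything reachable from roots.

    refs maps an object id to the list of object ids it references.
    Returns (marked_set, number_of_references_followed).
    """
    marked = set(roots)
    queue = deque(roots)
    edges_followed = 0
    while queue:
        obj = queue.popleft()
        for target in refs.get(obj, ()):
            edges_followed += 1          # every reference is looked at once
            if target not in marked:
                marked.add(target)
                queue.append(target)
    return marked, edges_followed


def chain_heap(n):
    """Hypothetical heap of n objects: object i references i+1 and i+2."""
    return {i: [j for j in (i + 1, i + 2) if j < n] for i in range(n)}


if __name__ == "__main__":
    for size in (1000, 2000, 4000):
        marked, edges = mark([0], chain_heap(size))
        print(size, len(marked), edges)
```

Doubling the number of objects (and so the number of references) doubles the work the mark pass does, and no cleverness in the loop changes that. Running the same pass over two independent heaps of half the size gives each one roughly half the work, which is the whole argument for splitting the data.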
We should concentrate on things that deal with inter-image communication and image-database communication, because that is the only way to make sure we can answer the problems the future will bring. Relying on a single huge image is a road to nowhere.

--
Best regards,
Igor Stasenko.
