Stephane Bailliez wrote:
> Torsten Curdt wrote:
>>
>>> Being a complete idiot for distributed computing, I would say it is
>>> easy to explode a JVM when doing such distributed jobs (whether it
>>> be from OOM or anything else).
>>
>> Then restrict what people can do - at least Google went that route.
>
> I don't know what Google did on the specifics :)
They came up with their own language for mapreduce jobs:
http://labs.google.com/papers/sawzall.html

> If you want to do that with Java and restrict memory usage, CPU usage
> and descriptor access within each in-VM instance, that's a considerable
> amount of work that likely implies writing a specific agent for the VM
> (or rather an agent for a specific VM, because it's pretty unlikely
> that you will get the same results across VMs), assuming that can then
> really be done at the classloader level for each task (which looks
> insanely complex to me if you have to consider allocation done at the
> parent classloader level, etc.)
>
> At least by forking a VM you can get some reasonably bounded control
> over resource usage (or at least memory) without bringing down
> everything, since a VM is already bounded to some degree.
>
>>> Failing jobs are not exactly uncommon, and running things in a
>>> sandboxed environment with less risk for the tracker seems like a
>>> perfectly reasonable choice. So yeah, VM pooling certainly makes
>>> perfect sense for it.
>>
>> I am still not convinced - sorry.
>>
>> It's a bit like wanting to run JSPs in a separate JVM because they
>> might take down the servlet container.
>
> That is a bit too extreme in granularity. I think it is more like
> running n different webapps within the same VM or not. If one webapp
> is a resource hog, separating it would not harm the n-1 other
> applications, and you could either create another server instance or
> move it away to another node.
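The bounded-fork point above can be sketched in a few lines: a parent process launches the task in a child JVM whose heap is capped with -Xmx, so an OOM in the task cannot take down the launching JVM. The class name "TaskRunner" and the 64 MB limit are illustrative assumptions, not anything from the thread.

```java
import java.util.ArrayList;
import java.util.List;

public class ForkedTaskLauncher {

    // Build the child JVM's command line; the -Xmx flag is the hard
    // upper bound on the child's heap that makes the fork "bounded".
    public static List<String> buildCommand(String mainClass, int maxHeapMb) {
        List<String> cmd = new ArrayList<>();
        cmd.add(System.getProperty("java.home") + "/bin/java");
        cmd.add("-Xmx" + maxHeapMb + "m"); // cap the child's heap
        cmd.add(mainClass);                // hypothetical task entry point
        return cmd;
    }

    // Launch the child JVM; a crash or OOM in it only kills the child.
    public static Process launch(String mainClass, int maxHeapMb)
            throws java.io.IOException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(mainClass, maxHeapMb));
        pb.inheritIO(); // share the parent's stdout/stderr for logging
        return pb.start();
    }

    public static void main(String[] args) {
        System.out.println(buildCommand("TaskRunner", 64));
    }
}
```

This gets memory isolation essentially for free, at the cost of one JVM startup per fork, which is the overhead the rest of the thread is about avoiding.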
> I know of environments with a large number of nodes (not related to
> Hadoop) where they also reboot a set of nodes daily to ensure that all
> machines are really in working condition. It's usually when a machine
> reboots due to a failure that someone has to rush to it, because some
> service forgot to be registered or things like that, so doing this
> periodic check gives people a better idea of their response time to
> failure. That depends on operational procedures, for sure.

This could be another implementation of the TaskTracker: a single JVM that forks a "replacement JVM" after either a given time or a given number of tasks executed. This avoids the JVM fork overhead on every task while also avoiding memory-leak problems. The replacement JVM could even be pre-forked and monitor the active one, taking over if it no longer responds (and killing it if necessary).

Sylvain

--
Sylvain Wallez - http://bluxte.net
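The replacement-JVM idea above boils down to a small retirement policy: a worker JVM serves a bounded number of tasks (or a bounded wall-clock lifetime) and then signals that it should be replaced, instead of being forked per task. A minimal sketch, with the class name and thresholds as illustrative assumptions; clock values are passed in explicitly to keep the policy testable.

```java
public class RecyclingWorker {
    private final int maxTasks;        // retire after this many tasks
    private final long maxLifetimeMs;  // ...or after this much wall-clock time
    private final long startedAt;
    private int tasksRun = 0;

    public RecyclingWorker(int maxTasks, long maxLifetimeMs, long nowMs) {
        this.maxTasks = maxTasks;
        this.maxLifetimeMs = maxLifetimeMs;
        this.startedAt = nowMs;
    }

    // Run one task; returns true if this JVM should keep serving tasks,
    // false if the tracker should hand over to the (pre-forked) replacement.
    public boolean runTask(Runnable task, long nowMs) {
        task.run();
        tasksRun++;
        return !shouldRetire(nowMs);
    }

    // Retirement condition: task count or lifetime exceeded.
    public boolean shouldRetire(long nowMs) {
        return tasksRun >= maxTasks || (nowMs - startedAt) >= maxLifetimeMs;
    }
}
```

Slow leaks are bounded by the lifetime limit, per-task fork cost is amortized over maxTasks tasks, and the pre-forked standby described in the mail would simply poll shouldRetire (or a liveness check) to decide when to take over.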
