The claims in the paper I was interested in were not issues like non-blocking I/O, etc. The claim that is of interest to Pig is that a memory allocation and garbage collection scheme beyond the control of the programmer is a bad fit for a large data processing system. This is a fundamental design choice in Java, and it fits the vast majority of Java's uses well. But for systems like Pig there seems to be no choice but to work around Java's memory management. I'll clarify this point in the document.

I took a closer look at NIO. My concern is that it does not give the level of control I want. NIO allows you to force a buffer to disk and request that a buffer be loaded, but you cannot force a page out of memory. It doesn't even guarantee that after you load a page it will really be loaded. One of the biggest issues in Pig right now is that we run out of memory, or get the garbage collector into a situation where it can't make sufficient progress. Perhaps switching to large buffers instead of many individual objects will address this. But I'm concerned that if we cannot explicitly force data out of memory onto disk, then we'll be back in the same boat of trusting the Java memory manager.
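
To make that concrete, here is a minimal sketch of the NIO mapped-buffer calls in question (the file name and buffer size are made up; this just shows what the API does and doesn't let you control):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedBufferSketch {
    public static void main(String[] args) throws Exception {
        RandomAccessFile file = new RandomAccessFile("spill.dat", "rw");
        FileChannel channel = file.getChannel();

        // Map 64MB of the file into memory.
        MappedByteBuffer buf =
            channel.map(FileChannel.MapMode.READ_WRITE, 0, 64 * 1024 * 1024);

        buf.force();      // flushes dirty pages to disk, but they may stay resident
        buf.load();       // requests that the mapping be paged in; only a hint
        boolean hint = buf.isLoaded();  // likewise only a hint, per the javadoc

        // There is no unmap() or evict(): pages leave memory only when the OS
        // decides, and the mapping itself is released only when the buffer is
        // garbage collected.
        channel.close();
        file.close();
    }
}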

Alan.

On May 14, 2009, at 7:43 PM, Ted Dunning wrote:

That Telegraph dataflow paper is pretty long in the tooth. Certainly several of their claims have little force any more (lack of non-blocking I/O, poor thread performance, no unmap, very expensive synchronization for uncontested locks). It is worth noting that they did all of their tests on the 1.3 JVM, and things have come an enormous way since then.

Certainly, it is worth having opaque containers based on byte arrays, but isn't that pretty much what the NIO byte buffers are there to provide? Wouldn't a virtual tuple type that was nothing more than a byte buffer, a type, and an offset do almost all of what is proposed here?
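
A rough sketch of what such a type might look like (the class and method names here are hypothetical, not anything in Pig or NIO today):

import java.nio.ByteBuffer;

// Hypothetical virtual tuple: no per-field Java objects, just a view into a
// shared byte buffer plus a type tag and a starting offset.
public class VirtualTuple {
    private final ByteBuffer data;  // shared (possibly memory-mapped) buffer
    private final byte type;        // tag describing the serialized layout
    private final int offset;       // where this tuple starts in the buffer

    public VirtualTuple(ByteBuffer data, byte type, int offset) {
        this.data = data;
        this.type = type;
        this.offset = offset;
    }

    public byte getType() {
        return type;
    }

    // Fields are decoded on demand instead of being held as live objects.
    public long getLongField(int fieldOffset) {
        return data.getLong(offset + fieldOffset);
    }
}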

On Thu, May 14, 2009 at 5:33 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

http://wiki.apache.org/pig/PigMemory

Alan.
