The claims in the paper I was interested in were not issues like non-
blocking I/O etc. The claim that is of interest to Pig is that a
memory allocation and garbage collection scheme that is beyond the
control of the programmer is a bad fit for a large data processing
system. This is a fundamental design choice in Java, and fits it well
for the vast majority of its uses. But for systems like Pig there
seems to be no choice but to work around Java's memory management.
I'll clarify this point in the document.
I took a closer look at NIO. My concern is that it does not give the
level of control I want. NIO allows you to force a buffer to disk and
request a buffer to load, but you cannot force a page out of memory.
It doesn't even guarantee that after you load a page it will really be
loaded. One of the biggest issues in Pig right now is that we run out
of memory or get the garbage collector into a situation where it can't make
sufficient progress. Perhaps switching to large buffers instead of
having many individual objects will address this. But I'm concerned
that if we cannot explicitly force data out of memory onto disk then
we'll be back in the same boat of trusting the Java memory manager.
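To make the concern concrete, here is a minimal sketch of what NIO's memory-mapped buffers do and do not offer. The class name and temp-file name are my own; the API calls (FileChannel.map, MappedByteBuffer.force/load) are real:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class MmapDemo {
    // Map a small file, write a byte, force it to disk, and read it back.
    static byte roundTrip() throws Exception {
        Path tmp = Files.createTempFile("pig-mmap", ".dat");
        try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "rw");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.put(0, (byte) 42);
            buf.force(); // flush dirty pages to disk -- this much NIO does allow
            buf.load();  // only a *hint* to bring pages in; residency not guaranteed
            // There is no evict()/unmap() counterpart: releasing the mapping is
            // left to the garbage collector, which is the lack of control at
            // issue here.
            return buf.get(0);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip());
    }
}
```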
On May 14, 2009, at 7:43 PM, Ted Dunning wrote:
That Telegraph dataflow paper is pretty long in the tooth. Certainly
several of their claims have little force any more (lack of non-
blocking I/O, poor thread performance, no unmap, very expensive
uncontested locks). It is worth noting that they did all of their
tests on an older JVM and things have come an enormous way since then.
Certainly, it is worth having opaque containers based on byte arrays, but
isn't that pretty much what the NIO byte buffers are there to provide?
Wouldn't a virtual tuple type that was nothing more than a byte array
and an offset do almost all of what is proposed here?
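A sketch of that virtual-tuple idea, to make it concrete. The class name, field layout (a 4-byte int followed by a length-prefixed UTF-8 string), and helper method are all invented for illustration; Pig's actual tuple serialization differs:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical "virtual tuple": a view over a shared byte array plus an
// offset, deserializing fields only when they are accessed, so many tuples
// share one large buffer instead of being many small heap objects.
public class VirtualTuple {
    private final byte[] data;
    private final int offset;

    public VirtualTuple(byte[] data, int offset) {
        this.data = data;
        this.offset = offset;
    }

    // First field: a 4-byte big-endian int at the tuple's offset.
    public int getInt() {
        return ByteBuffer.wrap(data, offset, 4).getInt();
    }

    // Second field: a UTF-8 string prefixed by its 4-byte length.
    public String getString() {
        int len = ByteBuffer.wrap(data, offset + 4, 4).getInt();
        return new String(data, offset + 8, len, StandardCharsets.UTF_8);
    }

    // Helper to serialize the example layout; returns bytes written.
    public static int write(byte[] data, int offset, int i, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.wrap(data, offset, 8 + bytes.length);
        buf.putInt(i).putInt(bytes.length).put(bytes);
        return 8 + bytes.length;
    }
}
```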
On Thu, May 14, 2009 at 5:33 PM, Alan Gates <ga...@yahoo-inc.com>