Army wrote:
Well I'm a day late, but I've finished my review of the patch...
Hi Army,
Thank you very much for the careful review and the helpful suggestions.
There were a number of points in your message, many of which I will
address when I re-submit the patch proposal, but there were several
points which are large enough that I'd like to address them specifically
in this message.
---------------
1) Should this be one patch or several? It is indeed awkward to
bundle all these changes together, and they aren't all interdependent.
For example, DERBY-170 can be easily broken out. Also, the changes to
DDMWriter.finalizeDSSLength() that I propose for DERBY-125 can certainly
be separated from the "offset API" changes that I propose for DERBY-491 and
DERBY-492. I'd like to hear the opinions of others on the list about whether
to break this patch up or not.
---------------
2) Can we make the tests better? Yes, and I'll be working on this,
but I think I need more help from you, particularly regarding the
off-by-one problem in finalizeDSSLength, which I discuss at length
below. This is the problem child in the testing area, and I need
some good ideas for how to make the tests better.
---------------
3) Is the DERBY-170 test specific to the JCC client? Yes, this test is
specific to the JCC client, and, more specifically, it is specific to
the "deferPrepares" behavior of the JCC client. The DERBY-170 bug is that
when DRDAConnThread.skipRemainder calls DDMReader.skipDSS to skip a DSS
message, large object messages which have been split across multiple DSS
segments are not handled properly.
The reproduction script triggers this condition by executing a prepared
statement with a parameter which is 1 million bytes in length.
When the JCC client executes a prepared statement in deferPrepares mode,
it waits until it sees the execute() call, then it sends the prepare,
describe, and execute DRDA messages all at once. This is substantially
different from the Derby client, which sends the prepare as soon as the
application calls it.
When the reproduction script for DERBY-170 is run using the Derby client,
the 1 million byte parameter is never sent, because the Derby client
throws the table-not-found exception on the Connection.prepareStatement() call.
So in both cases the skipRemainder method gets called, but only in the
JCC configuration is there a segmented DSS object to be skipped over.
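For reference, the shape of the repro is roughly the following (a sketch of
my understanding of the script; the table name and SQL here are placeholders,
not the real script):

    // Rough sketch of the DERBY-170 repro.  'conn' is an open client connection.
    static void repro(java.sql.Connection conn) throws java.sql.SQLException {
        byte[] big = new byte[1000000];                  // 1 million byte parameter
        java.sql.PreparedStatement ps =
            conn.prepareStatement("insert into NO_SUCH_TABLE values (?)");
        // Derby client: table-not-found is thrown by the prepare above, so the
        // large parameter is never sent.
        ps.setBytes(1, big);
        ps.executeUpdate();
        // JCC with deferPrepares: prepare/describe/execute all go out here,
        // including the segmented 1 MB DSS that skipRemainder() then has to skip.
    }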
It would be nice to be able to write a test which exposes the failure for
both the Derby client and the JCC client, but I'm not sure how to do that.
Do you know of a way to cause the skipRemainder() processing to be run
on an arbitrarily large object, other than the case mentioned in the bug script?
---------------
4) What's going on with this off-by-one bug, and how can we write a test
which clearly demonstrates the bug?
I think this is a very interesting question, and it is key to understanding my
changes, so I want to take the time to really dissect this problem in some
detail. After you read this, hopefully you will not only understand why the
test that I added doesn't show the bug, but also will be able to suggest how
I could write a test that *does* show the bug.
Consider a large DDM object which will be segmented across multiple DSS
segments by finalizeDSSLength(). The object must be split across K segments,
where K = (object length / 32765) + 1. For example, the DERBY-125 repro
script causes the creation of a 118,000 byte SQLDARD object, which must be
split across 4 DSS segments.
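To make that arithmetic concrete, here is a tiny standalone sketch (not Derby
code; the method name is mine, and the division is integer division):

    // Number of DSS segments needed for a DDM object of a given length,
    // using the formula above (integer division).
    static int segmentsNeeded(int objectLength) {
        return (objectLength / 32765) + 1;
    }
    // For the DERBY-125 repro: segmentsNeeded(118000) == 4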
In order to split a large object into multiple segments, finalizeDSSLength
must shift K-1 segments over "to the right" in order to make room for
inserting the 2-byte DSS continuation header at the front of each segment.
Note that the first segment does not have to be shifted; it already has a
DSS header, and we just have to set the continuation flag for that header.
Let us use zero-based terminology for the segments, so that our large object
of K segments will have segments 0 through K-1. Segment 0 does not need to
be shifted at all. Segment 1 needs to be shifted 2 bytes. Segment 2 needs to
be shifted 4 bytes. Segment i needs to be shifted i*2 bytes. There will be
a total of (K-1)*2 bytes added to the buffer by finalizeDSSLength().
The finalizeDSSLength() loop runs (K-1) times, proceeding from segment K-1
backwards to segment 1, shifting each segment to the right. After each
segment is shifted to its proper place in the buffer, its DSS header is written
in front of it. Segment 0 is not shifted at all, but its DSS header is updated
to contain the continuation flag.
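In rough pseudo-Java, the intended movement looks something like the sketch
below. This is only an illustration of the idea, not the actual DDMWriter
code; dssStart, K, lastSegLen and writeContinuationHeader() are names I made
up for this sketch:

    // Shift segments K-1 down to 1, each by i*2 bytes, and write a 2-byte
    // continuation header immediately in front of each shifted segment.
    for (int i = K - 1; i >= 1; i--) {
        int segStart = dssStart + i * 32765;                // segment i before shifting
        int segLen   = (i == K - 1) ? lastSegLen : 32765;   // last segment is partial
        int shift    = i * 2;                               // room for i headers
        System.arraycopy(bytes, segStart, bytes, segStart + shift, segLen);
        writeContinuationHeader(bytes, segStart + shift - 2);  // hypothetical helper
    }
    // Segment 0 stays where it is; only the continuation flag is set in its
    // existing DSS header.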
Now, on each pass through the loop, the bug occurs. Each time we shift a
segment, we shift the correct number of bytes, but we start 1 byte to the
left of where we are supposed to start, and our destination is also 1 byte to
the left of where the target of the copying is supposed to be. Because both
the source and destination of the copy are off by one, the copy is mostly
correct. But it does two things wrong, one at each end of the segment:
- we fail to copy the last byte of the segment to its new location. Since we
don't copy this last byte, the effect is that the message ends up containing
whatever was in this position in the 'bytes' buffer at that point.
- we also copy the last byte of the previous segment to a spot which is one
byte before the start of this segment. However, we then immediately
overwrite this byte with the DSS header for this segment. So this part of
the bug doesn't really matter.
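In terms of the sketch above, the difference between the buggy copy and the
corrected copy (which is essentially what the "+1" in my patch amounts to) is
just that one-byte adjustment of both the source and the destination:

    // Buggy: source and destination both start one byte too far to the left,
    // so the last byte of segment i never reaches its new position, and the
    // last byte of segment i-1 lands on the (soon overwritten) header spot.
    System.arraycopy(bytes, segStart - 1, bytes, segStart + shift - 1, segLen);

    // Corrected: start the copy at the true beginning of the segment.
    System.arraycopy(bytes, segStart, bytes, segStart + shift, segLen);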
Thus, the precise effect of the off-by-one bug is as follows:
For each segment in a multi-segment large DDM object, except for the first,
the last byte in the segment gets corrupted. The last byte of segment i
gets replaced by whatever value was i*2 bytes past it in the buffer.
For example, in the DERBY-125 repro script, which has a 4-segment SQLDARD
object, 3 bytes in that object are corrupted:
- the last byte in the object is replaced by whatever value happens to be
sitting 6 bytes to the right of it, in previously-unused space in the
'bytes' array.
- the last byte in segment 2 of the object is replaced by whatever value
happens to be sitting 4 bytes to the right of it. Note that this value is
whatever was the fourth byte of segment 3.
- the last byte in segment 1 of the object is replaced by whatever value
happens to be sitting 2 bytes to the right of it. Note that this value is
whatever was the second byte of segment 2.
Note that since we don't shift segment 0, we don't ever get an
ArrayIndexOutOfBoundsException.
Now that we have precisely described the impact of the bug, we are still
left with the question: why isn't this causing more symptoms? And, in
particular, why doesn't the DERBY-125 repro script show any symptoms?
Firstly, note that the total number of bytes which are damaged by the bug
is very, very small. In a K-segment object, which has roughly K*32765 bytes, only
K-1 bytes are damaged, so the damaged fraction is (K-1) / (K*32765), which
means that fewer than 1/32765 of the bytes in the object are damaged. That
is about 0.003 percent of the bytes in the object. Put differently, 99.997
percent of the bytes in the object are perfectly fine!!
Even so, computers are very picky, and so you'd think that an object which
was 99.997% OK would still be damaged enough to notice. But now let's take
a look at the DERBY-125 repro script in a bit more detail. That script
basically does a Connection.prepareStatement() of a statement which looks like:
select ID from TABLE where ID in (?, ?, ?, ?, ?, ?, ?, ?, ...
....
... ?, ?, ?, ?)
In the actual script, there are 2001 parameter markers in the statement.
This script causes the server to return a SQLDARD object which contains
some information for each parameter marker in the message: each parameter
is of type INTEGER, is length 4, is precision 10, is scale 0, is an input
parameter, and so forth.
However, when I worked through all the details, it turns out that the SQLDARD
information for each parameter marker in the statement is almost all zeros.
There are exactly 59 bytes of data for each parameter marker in the statement,
and, in hexadecimal, they look like this:
00 0A 00 00 00 00 00 00 00 00 00 04 01 F1 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 FF
00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00
Note that out of those 59 bytes, only 6 of them are non-zero:
- 0A is the precision
- 04 is the length of the integer
- 01 F1 is SQLType "INTEGER"
- FF indicates that we are not a User-Defined-Type
- 01 indicates that this is an INPUT parameter
When you repeat this 2001 times, as the reproduction script does, you get
(6*2001) non-zero bytes and 53*2001 zero bytes in the message, so only 12006
out of the 118,059 bytes in the parameter marker description data are non-zero.
Almost 90% of the bytes are zero.
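Spelled out, the counting is just:

    int markers = 2001;
    int nonZero = 6 * markers;   // 12006 non-zero bytes
    int zero    = 53 * markers;  // 106053 zero bytes
    int total   = 59 * markers;  // 118059 bytes of parameter marker data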
So when the bug occurs, and the last byte of segment i is replaced by the byte
which is i*2 bytes to the right of it, the odds turn out to be
overwhelmingly high that the damaged byte is a 0, and the byte that is i*2
to the right of it is also a 0.
So that's why the reproduction script doesn't expose the problem. The bug
occurs during the script, but a byte containing 00 is replaced by a byte
containing 00, and the client doesn't see anything wrong.
The reproduction script for bug 492 is more promising in this respect. That
script causes a large SQLCINRD object to be returned from the server to the
client. The SQLCINRD object is very similar to the SQLDARD object, in that
it has some data for each column in the result set, but the SQLCINRD object
which is returned in the DERBY-492 case also contains the column name, the
table name, and the schema name, so that the data for each column looks more
like:
12A0 06434F4C30373400 0000000000000000 .COL074......... ..|<............
12B0 00FF000000000000 0000000000000000 ................ ................
12C0 00000A5445535454 41424C4532000000 ...TESTTABLE2... ..........<.....
12D0 0342414800000006 434F4C3037340000 .BAH....COL074.. .........|<.....
There are still a lot of zeros in this data, but there is also lots of
non-zero data. So replacing the last byte in a segment full of this data
should be substantially more likely to cause damage that the client would
recognize.
Unfortunately, there are two problems with using the reproduction script
for DERBY-492 for demonstrating *just* the off-by-one bug:
- Firstly, DERBY-492 exhibits more problems than just the off-by-one bug. In
addition to that bug, DERBY-492 triggers the "writer.setOffset()" bug, and
that bug causes much more violent damage to the bytes buffer. So before the
client has a chance to notice any damage caused by the off-by-one bug, it
first complains about the damage caused by the setOffset bug.
- Secondly, even if the off-by-one bug were the only problem in the code under
test, the DERBY-492 script only generates a 2-segment SQLCINRD object, and
so the only byte damaged in that object is the very last byte, and that byte
is always 00, as is (usually) the byte which is two bytes beyond it in the
bytes buffer, because that byte is in newly-allocated space and hence is
initialized to 00. So even though the DERBY-492 data has a lot more
non-zero data, it still ends up having zeros in the only place where damage
occurs, which is the very last byte in the object. Now, I could change that
test to have more than 700 columns in it, but I'm still not sure whether the
off-by-one bug would expose real damage, because there are still large
stretches of all-zeros data in the DERBY-492 case, so unless I get lucky the
segment boundaries might still happen to land on that all-zeros data.
Note that the only "random" or "nondeterministic" part of this damage is that
the very last byte in the large DDM object gets replaced by an unknown byte
from the unused section of the 'bytes' array. If the bytes array was just
expanded in order to accommodate this object, then the unused section of the
array will contain zeros, but if the bytes array was previously used for a
larger amount of data, then those unused bytes contain unknown values. For
the other segments, the last byte of each segment gets replaced by a specific
byte from the next segment; that is, a nearby byte which is part of the same
object, and thus will be the same each time you run the test.
This makes me wonder whether it would be reasonable to modify ensureLength()
so that it pre-filled the buffer with non-zero data (for example, 0xFC), so
that the bug would be more likely to be obvious, since it would end up placing
0xFC into the last byte of the message, rather than 00 in most cases. I don't
think we'd always want ensureLength() to do this, for performance reasons, but
perhaps there is some way to conditionally compile such code into the test
version of DDMWriter?
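Something along these lines is what I have in mind. This is only a sketch:
the allocation logic shown here is approximate, and the DEBUG_FILL flag is a
placeholder for whatever conditional-compilation mechanism we'd settle on:

    // Sketch: poison newly-allocated buffer space in a debug build of DDMWriter,
    // so a stray copy from unwritten space shows up as 0xFC, not a silent 0x00.
    private void ensureLength(int length) {
        int needed = length + offset;
        if (needed > bytes.length) {
            byte[] newBytes = new byte[Math.max(bytes.length << 1, needed)];
            if (DEBUG_FILL) {
                // Only the unused tail keeps this value; the live prefix is
                // overwritten by the arraycopy below.
                java.util.Arrays.fill(newBytes, (byte) 0xFC);
            }
            System.arraycopy(bytes, 0, newBytes, 0, offset);
            bytes = newBytes;
        }
    }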
So the bottom line is that I don't know how to write a test which demonstrates
just the off-by-one bug all by itself, and so the DERBY-125 test case in
derbynet/prepStmt.java is probably of little value.
However, I do believe that the bug is real, for two reasons:
- Firstly, I've stepped through that code in painstaking detail enough times
that I've persuaded myself that my analysis is correct.
- Secondly, I happened into something by luck, which is that if I run the
lang/procedure.java test case, using the test harness, with my new test
cases for DERBY-491 and DERBY-492 included, and with all the other fixes
in this patch applied, then the off-by-one bug actually causes real damage!
This is because, when the repro script is run as part of the overall
procedure.java test, the 'bytes' buffer has already been expanded by the
time that we get to this test, and it contains some non-zero bytes at the
end, and so we end up replacing the last byte of the message with a
non-zero byte, which causes a client-side assertion to trip.
When you read my changes, Army, you noticed that in one test I'd added some
code to check that the value I selected back from the database matched the
value that I stored into the database, and you were wondering if this was
test code for the off-by-one bug. That was indeed my intent. After I found the
off-by-one bug, and started wondering about why it wasn't showing more
symptoms, I thought that maybe I was in fact getting corrupted column data
back, and so I changed the test to compare the actual value with the
expected value.
However, that was an irrelevant change, at least as far as the off-by-one
bug goes, because column data isn't sent as a single large DDM Object. Instead,
it is split via the LMTBLKPRC protocol, so it does not ever go through the
segmenting code in finalizeDSSLength(). So while checking the column data
is not a bad thing to do, it doesn't really have any relevance for that test,
and it definitely does not have anything to do with the off-by-one bug. So
probably I should just remove that if statement from that test.
---------------
I'm afraid that by now I've gone on much too long. I'd like to close this
message by asking the following:
- after considering my explanation of the bug, do you have an idea of a
better way to write a test for it? If not, should I just remove the
DERBY-125 test from prepStmt.java?
- Can you try the experiment of applying all of my patches *except* for
the off-by-one bug (just apply everything, then remove the "+1" from
finalizeDSSLength), and then run
java -Dframework=DerbyNetClient
org.apache.derby.functionTests.harness.RunTest lang/procedure.java
My belief is that you will get an error in the client from the
parseFastVCMorVCS routine in NetStatementReply.java saying that
"only one of the VCM, VCS length can be greater than 0". That, if you
can get it, is proof of the off-by-one bug causing damage.
thanks,
bryan