Hello,
It is apparent that we would benefit from extending the length
information stored about CLOBs in the Derby on-disk format.
Currently we set aside two bytes at the head of the data stream for
storing the byte length of the CLOB. This is not sufficient, and can
only be used for small CLOBs. For larger CLOBs we have to read through
all the data to find the length.
I see three major steps in this process:
a) Determine which information we want to store.
b) Determine how and when the information is obtained and stored.
c) Specify the actual on-disk format.
Regarding a), we have a few options:
1) Store byte length only.
2) Store character length only.
3) Store both byte length and character length.
4) In addition to 1, 2, or 3, store distribution data for the one-,
   two-, and three-byte character representations?
From a JDBC API point of view, the character length is the most
relevant one. Internally, Derby might benefit from knowing the byte
length as well, since it can't be easily calculated from the character
length. The reason for this is that we encode data using the modified
UTF-8 format, where a single character can take up from one to three bytes.
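To illustrate why the byte length can't be derived from the character
length alone, here is a small sketch of how the modified UTF-8 size of
a string could be computed (the same scheme java.io.DataOutputStream.writeUTF
uses; the class and method names are mine, not Derby's):

```java
public class ModifiedUtf8Length {
    // Byte length of a string in modified UTF-8: ASCII (except NUL)
    // takes one byte, NUL and U+0080..U+07FF take two, the rest of the
    // BMP takes three (supplementary characters are two surrogate
    // chars, three bytes each).
    static long utfLength(String s) {
        long len = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                len += 1;
            } else if (c <= 0x07FF) {
                len += 2; // includes NUL, encoded as 0xC0 0x80
            } else {
                len += 3;
            }
        }
        return len;
    }
}
```

So two strings with the same character length can have very different
byte lengths, which is why storing only one of the two loses information.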
Note that there are three aspects of finding the character length from
the raw bytes:
- reading all the raw bytes
- decoding the raw bytes
- potential page cache "pollution"
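The first two aspects can be sketched as follows: because the encoding
is variable-width, counting characters means inspecting the lead byte
of every encoded character in the stream. This is an illustrative
scanner, not Derby's actual code:

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class CharLengthScan {
    // Count characters by reading every byte of a modified UTF-8 stream;
    // each lead byte tells us how many continuation bytes to consume.
    static long charLength(InputStream in) throws IOException {
        long chars = 0;
        int b;
        while ((b = in.read()) != -1) {
            int trailing;
            if ((b & 0x80) == 0x00)      trailing = 0; // one-byte form
            else if ((b & 0xE0) == 0xC0) trailing = 1; // two-byte form
            else                         trailing = 2; // three-byte form
            for (int i = 0; i < trailing; i++) {
                if (in.read() == -1) throw new EOFException("truncated character");
            }
            chars++;
        }
        return chars;
    }

    public static void main(String[] args) throws IOException {
        // For these characters, standard UTF-8 bytes match modified UTF-8.
        byte[] data = "a\u00E9\u4E2D".getBytes(StandardCharsets.UTF_8);
        System.out.println(charLength(new ByteArrayInputStream(data)));
    }
}
```

Even this cheap scan still pulls every page of the CLOB through the
page cache, which is the "pollution" aspect above.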
The length often has to be obtained just to validate user-supplied
arguments (positions and lengths), which means Derby currently does a
lot of unnecessary work.
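For example, a Clob.getSubString(pos, len) style call can only be
validated against the character length. With a stored length the check
is trivial; the helper below is hypothetical and only shows the shape
of the check (JDBC positions are 1-based):

```java
public class ClobBounds {
    // Hypothetical argument check for a getSubString(pos, len) style
    // call, assuming the character length is stored and cheap to read.
    // Without it, rejecting a bad position can require decoding the
    // whole CLOB.
    static boolean validBounds(long pos, int len, long storedCharLength) {
        return pos >= 1 && len >= 0 && pos - 1 <= storedCharLength;
    }
}
```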
I'm bringing up b) because JDBC has methods for inserting data without
specifying the length. Throwing out some ideas:
- obtain required information on the fly, update immediately after
insert is done
- insert data without length information, then
* update on first subsequent request
* use a background task to update information
* add a "maintenance routine" for updating the information
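The first idea (obtain on the fly) could be as simple as wrapping the
insert stream so both counts are known the moment the insert finishes.
A sketch, with hypothetical names:

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Counts characters and modified UTF-8 bytes as the insert stream is
// consumed, so the header can be updated immediately after the insert.
public class CountingReader extends FilterReader {
    long charCount;
    long byteCount;

    public CountingReader(Reader in) {
        super(in);
    }

    private void tally(char c) {
        charCount++;
        if (c >= 0x0001 && c <= 0x007F) byteCount += 1;
        else if (c <= 0x07FF)           byteCount += 2; // includes NUL
        else                            byteCount += 3;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c != -1) tally((char) c);
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        for (int i = 0; i < n; i++) tally(buf[off + i]);
        return n;
    }
}
```

The attraction is that the data is only traversed once; the cost is
that the length fields must be written after the data itself.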
I'm asking people to share their opinions on this matter, and I'll try
to answer any questions you might have to move the discussion forward.
--
Kristian