Hello,

It is apparent that we would benefit from extending the length information stored about CLOBs in the Derby on-disk format.

Currently we set aside two bytes at the head of the data stream for storing the byte length of the CLOB. This is not sufficient, and can only be used for small CLOBs. For larger CLOBs we have to read through all the data to find the length.
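For illustration, assuming the two header bytes hold an unsigned 16-bit value, the limit works out like this (a sketch of the size limit only, not Derby's actual header logic; all names are made up):

```java
// Sketch: two unsigned header bytes can describe at most 65535 bytes,
// so larger CLOBs cannot record their length in the header at all.
class TwoByteHeader {
    static final long MAX_ENCODABLE = 0xFFFFL; // 65535 bytes

    static boolean fits(long byteLength) {
        return byteLength >= 0 && byteLength <= MAX_ENCODABLE;
    }

    /** Pack a fitting byte length into the two header bytes. */
    static byte[] encode(long byteLength) {
        return new byte[] { (byte) (byteLength >>> 8), (byte) byteLength };
    }

    /** Recover the byte length from the two header bytes. */
    static long decode(byte hi, byte lo) {
        return ((hi & 0xFFL) << 8) | (lo & 0xFFL);
    }
}
```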

I see three major steps in this process:
 a) Determine which information we want to store.
 b) Determine how and when the information is obtained and stored.
 c) Specify the actual on-disk format.


Regarding a), we have a few options:
 1) Store byte length only.
 2) Store character length only.
 3) Store both byte length and character length.
 4) Any of the above, plus distribution data for the one-, two-, and three-byte character representations?

From a JDBC API point of view, the character length is the most relevant one. Internally, Derby might benefit from knowing the byte length as well, since it can't be easily calculated from the character length. The reason for this is that we encode data using the modified UTF-8 format, where a single character can take up from one to three bytes.
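To make the one-to-three-byte point concrete, here is a sketch of how the modified UTF-8 byte length of a string could be computed (the encoding rules are those of java.io.DataOutput; the method itself is just an illustration, not Derby code):

```java
class ModifiedUtf8 {
    /** Byte length of s in modified UTF-8, where each char takes 1-3 bytes. */
    static long byteLength(String s) {
        long bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                bytes += 1; // ASCII range, except NUL
            } else if (c <= 0x07FF) {
                bytes += 2; // includes NUL, which is encoded as 0xC0 0x80
            } else {
                bytes += 3; // everything else, including surrogate halves
            }
        }
        return bytes;
    }
}
```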

Note that there are three aspects of finding the character length from the raw bytes:
 - reading all the raw bytes
 - decoding the raw bytes
 - potential page cache "pollution"
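For illustration, recovering the character length from raw modified UTF-8 bytes could look roughly like this. Only the lead bytes need to be decoded, but every byte of the stream still has to be fetched, which is the cost listed above (method name is illustrative):

```java
class CharLengthScan {
    /** Count characters by inspecting the lead byte of each encoded char. */
    static long charLength(byte[] raw) {
        long chars = 0;
        int i = 0;
        while (i < raw.length) {
            int b = raw[i] & 0xFF;
            if (b < 0x80) {
                i += 1; // single-byte char (0xxxxxxx)
            } else if (b < 0xE0) {
                i += 2; // two-byte char (110xxxxx ...)
            } else {
                i += 3; // three-byte char (1110xxxx ...)
            }
            chars++;
        }
        return chars;
    }
}
```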

The length often has to be obtained to validate user input (positions and lengths), which means Derby has to do a lot of unnecessary work.
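As a hypothetical example of such validation, a position check for something like Clob.getSubString() needs the character length just to reject bad arguments (names below are illustrative, not Derby's actual code):

```java
class PositionCheck {
    /** JDBC positions are 1-based; a position may point at most one past the end. */
    static boolean isValidPosition(long pos, long charLength) {
        return pos >= 1 && pos <= charLength + 1;
    }
}
```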

I'm bringing up b) because JDBC has methods for inserting data without specifying the length. Throwing out some ideas:
 - obtain the required information on the fly, and update immediately after the insert is done
 - insert data without length information, then
    * update on the first subsequent request
    * use a background task to update the information
    * add a "maintenance routine" for updating the information
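As a rough sketch of the "obtain on the fly" idea, the insert stream could be wrapped in a counting Reader so the character count is known once the data has been streamed in (class name is made up for illustration; a byte count could be tallied the same way using the one-to-three-byte rules):

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

class CountingReader extends FilterReader {
    private long charCount;

    CountingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c != -1) charCount++;
        return c;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = super.read(cbuf, off, len);
        if (n > 0) charCount += n;
        return n;
    }

    /** Character count seen so far; complete once the stream is exhausted. */
    long getCharCount() {
        return charCount;
    }
}
```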


I'm asking people to share their opinions on this matter, and I'll try to answer any questions you might have to move the discussion forward.


--
Kristian
