Mike Matrigali wrote:


Kristian Waagan wrote:
Hello,

In my work on DERBY-2646, I have stumbled upon some issues that can greatly affect the performance of accessing Clobs, especially when updating them.

Currently, Clobs are stored on disk in the modified UTF-8 encoding. This uses one to three bytes to represent a single character. Since the number of bytes per character varies, there is no easy way to calculate the byte position from the character position, or vice versa. The naive approach, and maybe even the only feasible one, is to decode the bytes from the start of the Clob.
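As a hedged illustration of the problem (a sketch only, not Derby's actual code), mapping a character position to a byte offset in modified UTF-8 requires summing the per-character byte lengths from the start:

```java
// Sketch: maps a character position to a byte offset in modified UTF-8
// by summing per-character encoded lengths from the start of the value.
// Illustrative only; not the DERBY-2646 implementation.
public class ModifiedUtf8Scan {

    /** Number of bytes modified UTF-8 uses for one char. */
    public static int encodedLength(char c) {
        if (c >= 0x0001 && c <= 0x007F) return 1; // single byte
        if (c <= 0x07FF) return 2;                // two bytes, incl. NUL (0x0000)
        return 3;                                 // everything else
    }

    /** Byte offset of the char at index charIndex (0-based). */
    public static long byteOffsetOfChar(String s, int charIndex) {
        long offset = 0;
        for (int i = 0; i < charIndex; i++) {
            offset += encodedLength(s.charAt(i));
        }
        return offset;
    }

    public static void main(String[] args) {
        // 1-byte, 2-byte and 3-byte characters: 'a', 'é' (U+00E9), '√' (U+221A)
        String s = "a\u00E9\u221A";
        System.out.println(byteOffsetOfChar(s, 2)); // 1 + 2 = 3
    }
}
```

The cost of every reposition is linear in the character position, which is exactly why repeated repositioning gets expensive.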

Note that the storage I speak of is the temporary storage of Clob copies. This is initiated when the user attempts to modify the Clob. I am not considering the case where the Clob is stored in the database itself.

Obviously, reading the Clob from the start every time you need to reposition is not very efficient. One optimization is to keep track of the "current position", but it might not help that much (depending on access pattern). This requires full knowledge about update actions, including on the various streams/writers. Another option is storing the Clob in UTF-16. This would allow direct mapping between byte and character positions, as far as I have understood (I had brief contact with Dag and Bernt offline), even in the case of surrogate pairs.
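A minimal sketch of the "current position" idea (hypothetical class, not proposed code): cache the last decoded character/byte position pair, continue forward seeks from the cache, and only rescan from the start on backward seeks:

```java
// Sketch of a position cache over a modified UTF-8 value: forward seeks
// continue from the cached position; backward seeks restart from byte 0.
// A String stands in for the decoded clob content. Illustrative only.
public class CachedPosition {
    private final String value;
    private int charPos = 0;   // cached character position (0-based)
    private long bytePos = 0;  // matching byte offset

    public CachedPosition(String value) { this.value = value; }

    static int encodedLength(char c) {
        if (c >= 0x0001 && c <= 0x007F) return 1;
        if (c <= 0x07FF) return 2;
        return 3;
    }

    /** Returns the byte offset of the given character position. */
    public long seekToChar(int target) {
        if (target < charPos) { // backward seek: must restart from the top
            charPos = 0;
            bytePos = 0;
        }
        while (charPos < target) {
            bytePos += encodedLength(value.charAt(charPos++));
        }
        return bytePos;
    }

    public static void main(String[] args) {
        CachedPosition p = new CachedPosition("a\u00E9\u221Ab");
        System.out.println(p.seekToChar(3)); // 1 + 2 + 3 = 6
    }
}
```

This is why the benefit depends so heavily on the access pattern: mostly-forward access amortizes well, while random backward access degenerates to scanning from the start.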

However, using UTF-16 imposes space overhead when operating on Clobs with US-ASCII characters, in fact the overhead is 100% (each character represented by two bytes instead of one). For some other languages (and/or character sets), using UTF-16 reduces the space requirements (two versus three bytes).
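To make the trade-off concrete, here is a hedged size comparison (illustrative only). In UTF-16 every Java char (code unit) takes exactly two bytes, so the byte offset of a character position is simply twice the position, even when surrogate pairs are present, since each surrogate counts as one char:

```java
// Illustrative size comparison between modified UTF-8 and UTF-16,
// counting Java chars (UTF-16 code units). Not Derby code.
public class EncodingSizes {

    public static int modifiedUtf8Bytes(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            n += (c >= 0x0001 && c <= 0x007F) ? 1 : (c <= 0x07FF) ? 2 : 3;
        }
        return n;
    }

    public static int utf16Bytes(String s) {
        return 2 * s.length(); // fixed two bytes per code unit
    }

    public static void main(String[] args) {
        System.out.println(modifiedUtf8Bytes("ascii")); // 5
        System.out.println(utf16Bytes("ascii"));        // 10: 100% overhead
        // Two CJK characters (U+4F60, U+597D):
        System.out.println(modifiedUtf8Bytes("\u4F60\u597D")); // 6
        System.out.println(utf16Bytes("\u4F60\u597D"));        // 4: UTF-16 smaller
    }
}
```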



To summarize my view on this...


Pros, UTF-8 : more space efficient for US-ASCII, same as used by store
Pros, UTF-16: direct mapping between char/byte pos (easier logic)

Cons, UTF-8 : requires "counting"/decoding to find byte position
Cons, UTF-16: space overhead for US-ASCII, must be converted when/if Clob goes back into the database
> Can you describe in more detail the situations where you are proposing to
> use UTF-16 vs. UTF-8? I know that there is a lot of performance overhead in
> converting from one to the other, and I know that in the past Derby has
> often converted back and forth through bad code paths. Are the changes you
> are proposing going to affect the non-update case?

Hi Mike,

Briefly stated, the encoding issue comes into play when the first update to the clob is issued. After this everything goes via the temporary copy, which is in memory or on disk depending on size.
Non-update cases are not affected.

If the user never issues a modification operation (setString, setCharacterStream, setAsciiStream, truncate), only streams from store will be used to fetch the required data.


> It would be nice if the following happens:
> 1) INSERT
>    From whatever input (stream, clob, string, ...) we convert it once
>    to the modified UTF-8 format that store uses on disk. In the case
>    of a stream, we should read it only once and never flow it to an
>    object or to disk before getting it into store.

For insertions of new clobs (or other appropriate data types) through PreparedStatement, this does happen - although I haven't checked whether using setAsciiStream causes a byte-char-byte conversion.

Note that there are two different setCharacterStream methods:
 A) PreparedStatement.setCharacterStream(column, Reader)
 B) Writer writer = Clob.setCharacterStream(pos)

Using B, a temporary clob will be created and the contents will spill to disk when the size threshold is reached.


> 2) SELECT
>    Should be converted once from modified UTF-8 into whatever format
>    is requested by the select, with no intermediate object or disk
>    copies.

I believe this is also the current state. Again, it is only after an update that a temporary clob is used.



> What is the expected usage pattern for an update on a clob that uses
> these "temporary" clobs?

The use of temporary clobs is triggered by updates to the clob. If clobs are updated, how many updates there will be per transaction is totally application-dependent.

> What is the usual input format, what is the
> usual output format?

The input formats are those of String, Reader and InputStream.
The output formats are those of String, Writer and OutputStream.

> Do you expect more than one update usually?

I don't know...

> Does an update have to rewrite the end of the file on a shrink or expand of
> the middle of the clob?

Expansion and shrinking are certainly necessary when using UTF-8.
When you do a setString(pos, str), you basically overwrite a range of existing characters with the characters of the insertion string. If the replaced characters are not represented with the same number of bytes as the inserted characters, the byte array on disk (or in memory) must be expanded or shrunk accordingly.

If we had used an encoding with a fixed number of bytes per char in store, we could have gotten away with skipping to the right position and then just streaming the new value into store.
As I see it, this is not possible when using UTF-8 in store.
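A hedged sketch of why a setString over modified UTF-8 can force the tail of the value to move (hypothetical helper, not Derby code): the replaced range and the replacement may occupy different byte counts, and the difference is how far the tail must shift.

```java
// Sketch: byte-count delta when a character range encoded in modified
// UTF-8 is overwritten by a replacement string. A positive result means
// the stored bytes must expand; negative means they must shrink.
public class ReplaceDelta {

    public static int modifiedUtf8Bytes(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            n += (c >= 0x0001 && c <= 0x007F) ? 1 : (c <= 0x07FF) ? 2 : 3;
        }
        return n;
    }

    /** Bytes the tail must shift when 'replaced' is overwritten by 'replacement'. */
    public static int shiftBytes(String replaced, String replacement) {
        return modifiedUtf8Bytes(replacement) - modifiedUtf8Bytes(replaced);
    }

    public static void main(String[] args) {
        // Replacing 'é' (2 bytes) with 'a' (1 byte) shrinks the value by one byte.
        System.out.println(shiftBytes("\u00E9", "a")); // -1
    }
}
```

With a fixed-width encoding this delta is always zero for same-length replacements, which is what would make the skip-and-stream approach possible.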

There is no functionality for inserting a new string without overwriting existing content, but appending to the value is possible. You can also truncate the clob.


Last, calling Connection.createClob() will always result in a temporary clob being created.



--
Kristian


I'm sure there are other aspects, and I would like some opinions and feedback on what to do. The two alternatives currently on the table are using the naive counting technique or changing to UTF-16. The former requires the fewest code changes.


To bound the scope of potential changes, I do plan to get this done for 10.3...



thanks,

