Mike Matrigali wrote:
Kristian Waagan wrote:
Hello,
In my work on DERBY-2646, I have stumbled upon some issues that can
greatly affect the performance of accessing Clobs, especially updating
them.
Currently Clobs are stored on disk in the modified UTF-8 encoding.
This uses one to three bytes to represent a single character. Since
the number of bytes per character varies, there is no easy way to
calculate the byte position from the character position, or vice
versa. The naive approach, and maybe even the only feasible one, is to
decode the bytes from the start of the Clob.
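To make the counting concrete: in modified UTF-8 (the same format
DataOutputStream.writeUTF produces) each UTF-16 code unit is encoded
independently as one, two or three bytes, so the leading byte alone tells a
scanner how far to skip. A minimal sketch of such a scan, over an
illustrative flat byte array rather than the store streams Derby actually
works with:

```java
public class ModifiedUtf8Scan {

    // Find the byte offset of a character (UTF-16 code unit) position by
    // decoding from the start -- the "naive counting" described above.
    static int byteOffsetOf(byte[] utf8, int charPos) {
        int byteOff = 0;
        for (int c = 0; c < charPos; c++) {
            int b = utf8[byteOff] & 0xFF;
            if (b < 0x80) {
                byteOff += 1;   // 0xxxxxxx: U+0001..U+007F
            } else if (b < 0xE0) {
                byteOff += 2;   // 110xxxxx: U+0000 and U+0080..U+07FF
            } else {
                byteOff += 3;   // 1110xxxx: U+0800..U+FFFF
            }
        }
        return byteOff;
    }

    public static void main(String[] args) {
        // "a" (1 byte), "ae" U+00E6 (2 bytes), euro U+20AC (3 bytes), "b"
        byte[] utf8 = {
            0x61,
            (byte) 0xC3, (byte) 0xA6,
            (byte) 0xE2, (byte) 0x82, (byte) 0xAC,
            0x62
        };
        System.out.println(byteOffsetOf(utf8, 3)); // 1 + 2 + 3 = 6
    }
}
```

Every repositioning from an unknown state pays this linear cost.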
Note that the storage I speak of is the temporary storage of Clob
copies. This is initiated when the user attempts to modify the Clob. I
am not considering the case where the Clob is stored in the database
itself.
Obviously, reading the Clob from the start every time you need to
reposition is not very efficient. One optimization is to keep track of
the "current position", but it might not help that much (depending on
the access pattern). This requires full knowledge of all update actions,
including those performed through the various streams/writers.
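As an illustrative sketch (not Derby code), the "current position"
bookkeeping could be as small as caching the last known (character, byte)
pair and resuming the scan from it on forward moves; in this simple sketch
a backward move just restarts from the beginning:

```java
public class PositionCache {
    private long charPos = 0;   // last known character position
    private long bytePos = 0;   // byte offset of that character

    // Returns the byte offset to start decoding from when seeking to
    // targetCharPos. Forward moves resume from the cache; backward moves
    // fall back to a full rescan from the start.
    long startingByteOffsetFor(long targetCharPos) {
        if (targetCharPos >= charPos) {
            return bytePos;
        }
        charPos = 0;
        bytePos = 0;
        return 0;
    }

    // Record progress after decoding some characters.
    void advance(long chars, long bytes) {
        charPos += chars;
        bytePos += bytes;
    }

    public static void main(String[] args) {
        PositionCache cache = new PositionCache();
        cache.advance(10, 14);                               // 10 chars = 14 bytes
        System.out.println(cache.startingByteOffsetFor(12)); // 14: resume
        System.out.println(cache.startingByteOffsetFor(5));  // 0: rescan
    }
}
```

How much this helps depends entirely on the access pattern: repeated
backward repositioning degenerates to full rescans.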
Another option is storing the Clob in UTF-16. This would allow direct
mapping between byte and character positions, as far as I have
understood (I had brief contact with Dag and Bernt offline), even in
the case of surrogate pairs.
However, using UTF-16 imposes a space overhead when operating on Clobs
with US-ASCII characters; in fact, the overhead is 100% (each character
is represented by two bytes instead of one). For some other languages
(and/or character sets), using UTF-16 reduces the space requirements
(two bytes versus three).
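Both points are easy to demonstrate in plain Java (UTF-16BE is used here to
avoid a byte-order mark). Since Java char positions already count UTF-16
code units, a character position p always starts at byte 2 * p, surrogate
pairs included:

```java
import java.io.UnsupportedEncodingException;

public class Utf16Overhead {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String ascii = "hello";
        // 100% overhead for US-ASCII: 10 bytes instead of 5.
        System.out.println(ascii.getBytes("UTF-16BE").length);   // 10

        // 4 chars (the supplementary character is a surrogate pair,
        // counting as 2 chars), so 8 bytes; char position p always
        // starts at byte 2 * p.
        String mixed = "a\uD83D\uDE00b";
        System.out.println(mixed.getBytes("UTF-16BE").length);   // 8

        // The same text needs 1 + 4 + 1 = 6 bytes in standard UTF-8
        // (modified UTF-8 encodes each surrogate in 3 bytes: 8 bytes).
        System.out.println(mixed.getBytes("UTF-8").length);      // 6
    }
}
```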
To summarize my view on this...
Pros, UTF-8 : more space efficient for US-ASCII, same as used by store
Pros, UTF-16: direct mapping between char/byte pos (easier logic)
Cons, UTF-8 : requires "counting"/decoding to find byte position
Cons, UTF-16: space overhead for US-ASCII, must be converted when/if
Clob goes back into the database
Can you describe in more detail in which situations you are proposing to
use UTF-16 vs. UTF-8? I know that there is a lot of performance overhead
in converting from one to the other, and I know that in the past Derby
often converted back and forth through bad code. Are the changes you
are proposing going to affect the non-update case?
Hi Mike,
Briefly stated, the encoding issue comes into play when the first update
to the clob is issued. After that, everything goes via the temporary
copy, which is held in memory or on disk depending on its size.
Non-update cases are not affected.
If the user never issues a modification operation (setString,
setCharacterStream, setAsciiStream, truncate), only streams from store
will be used to fetch the required data.
> It would be nice if the following happens:
> 1) INSERT
>    From whatever input (stream, clob, string ...) we convert it once
>    to the modified UTF-8 format that store uses on disk. In the case
>    of a stream we should read it only once and never flow it to object
>    or disk before getting it into store.
For insertions of new clobs (or other appropriate data types) through
PreparedStatement, this does happen - although I haven't checked if
using setAsciiStream causes a byte-char-byte conversion or not.
Note that there are two different types of setCharacterStream methods:
A) PreparedStatement.setCharacterStream(column, Reader)
B) Writer writer = Clob.setCharacterStream(pos)
Using B, a temporary clob will be created and the contents will spill to
disk when the size threshold is reached.
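A sketch of the two variants side by side; the table and column names are
invented, and an open Derby connection is assumed, so this is illustrative
rather than runnable as-is:

```java
import java.io.StringReader;
import java.io.Writer;
import java.sql.Clob;
import java.sql.Connection;
import java.sql.PreparedStatement;

public class ClobWriteVariants {

    // A) Stream the value straight in at insert time; the data should
    // flow to store once, without a temporary copy.
    static void insertStreaming(Connection conn, String text)
            throws Exception {
        PreparedStatement ps =
            conn.prepareStatement("INSERT INTO docs (body) VALUES (?)");
        ps.setCharacterStream(1, new StringReader(text), text.length());
        ps.executeUpdate();
        ps.close();
    }

    // B) Write through the Clob object itself; this is the path that
    // creates a temporary clob, spilling from memory to disk when the
    // size threshold is reached.
    static void overwriteFromStart(Clob clob, String text)
            throws Exception {
        Writer w = clob.setCharacterStream(1); // positions are 1-based
        w.write(text);
        w.close();
    }
}
```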
> 2) SELECT
>    Should be converted once from modified UTF-8 into whatever format
>    is requested by the select, with no intermediate object or disk
>    copies.
I believe this is also the current state. Again, a temporary clob is
used only after an update.
> What is the expected usage pattern for an update on a clob that uses
> these "temporary" clobs?
Use of temporary clobs is triggered by updates to the clob. If clobs
are updated, the number of updates per transaction is totally
application dependent.
> What is the usual input format, what is the
> usual output format?
The input formats are those of String, Reader and InputStream.
The output formats are those of String, Writer and OutputStream.
> Do you expect more than one update usually?
I don't know...
> Does an update have to rewrite the end of the file on a shrink or
> expand of the middle of the clob?
Expansion and shrinking are certainly necessary when using UTF-8.
When you do a setString(pos, str), you basically overwrite a range of
existing characters with the characters of the insertion string. If the
replaced characters are not represented by the same number of bytes as
the inserted ones, the byte array on disk (or in memory) must be
expanded or shrunk accordingly.
If we had used an encoding with a fixed number of bytes per character in
store, we could have gotten away with skipping to the right position and
then just streaming the new value into store.
As I see it, this is not possible when using UTF-8 in store.
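The resizing is easy to quantify, using writeUTF as a stand-in for the
store encoding (it emits the same modified UTF-8, prefixed with a two-byte
length). Replacing a range of characters with an equally long string can
still change the byte length:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class Utf8ResizeDelta {

    // Bytes needed for s in modified UTF-8 (writeUTF output minus the
    // two-byte length prefix).
    static int modifiedUtf8Length(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            return bos.size() - 2;
        } catch (IOException e) {
            // Cannot happen with an in-memory stream.
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        // setString(2, "\u00e6\u00f8\u00e5") on "abcde" overwrites the
        // three characters "bcd" with three other characters.
        int before = modifiedUtf8Length("bcd");                  // 3 bytes
        int after  = modifiedUtf8Length("\u00e6\u00f8\u00e5");   // 6 bytes
        // Same character count, but the byte array must grow by 3.
        System.out.println(after - before); // 3
    }
}
```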
There is no functionality for inserting a new string without overwriting
existing content, but appending to the value is possible. You can also
truncate the clob.
Last, calling Connection.createClob() will always result in a temporary
clob being created.
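For completeness, a sketch of the modification calls that all route
through such a temporary clob. It assumes an open connection and a JDBC 4
driver for createClob(), so it is illustrative rather than runnable as-is:

```java
import java.sql.Clob;
import java.sql.Connection;

public class TempClobUsage {
    static void demo(Connection conn) throws Exception {
        Clob clob = conn.createClob();           // always a temporary clob
        clob.setString(1, "hello");              // overwrite from position 1
        clob.setString(clob.length() + 1, "!");  // append: write past the end
        clob.truncate(5);                        // back to "hello"
    }
}
```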
--
Kristian
I'm sure there are other aspects, and I would like some opinions and
feedback on what to do. My two current alternatives on the table are
using the naive counting technique, or changing to UTF-16. The former
requires the fewest code changes.
To bound the scope of potential changes, I do plan to get this done
for 10.3...
thanks,