Mike Matrigali wrote:
Kristian Waagan wrote:
Hello,
In my work on DERBY-2646, I have stumbled upon some issues that can
greatly affect the performance of accessing Clobs, especially updating
them.
Currently Clobs are stored on disk in the modified UTF-8 encoding.
This uses one to three bytes to represent a single character. Since
the number of bytes per character varies, there is no easy way to
calculate the byte position from the character position, or vice
versa. The naive approach, and maybe even the only feasible one, is to
decode the bytes from the start of the Clob.
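To make the counting concrete: in modified UTF-8 (the same format
DataOutputStream.writeUTF produces) each UTF-16 code unit is encoded
independently as one, two or three bytes, so the leading byte alone tells a
scanner how far to skip. A minimal sketch of such a scan, over an
illustrative flat byte array rather than the store streams Derby actually
works with:

```java
public class ModifiedUtf8Scan {

    // Find the byte offset of a character (UTF-16 code unit) position by
    // decoding from the start -- the "naive counting" described above.
    static int byteOffsetOf(byte[] utf8, int charPos) {
        int byteOff = 0;
        for (int c = 0; c < charPos; c++) {
            int b = utf8[byteOff] & 0xFF;
            if (b < 0x80) {
                byteOff += 1;   // 0xxxxxxx: U+0001..U+007F
            } else if (b < 0xE0) {
                byteOff += 2;   // 110xxxxx: U+0000 and U+0080..U+07FF
            } else {
                byteOff += 3;   // 1110xxxx: U+0800..U+FFFF
            }
        }
        return byteOff;
    }

    public static void main(String[] args) {
        // "a" (1 byte), "ae" U+00E6 (2 bytes), euro U+20AC (3 bytes), "b"
        byte[] utf8 = {
            0x61,
            (byte) 0xC3, (byte) 0xA6,
            (byte) 0xE2, (byte) 0x82, (byte) 0xAC,
            0x62
        };
        System.out.println(byteOffsetOf(utf8, 3)); // 1 + 2 + 3 = 6
    }
}
```

Every repositioning from an unknown state pays this linear cost.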
Note that the storage I speak of is the temporary storage of Clob
copies. This is initiated when the user attempts to modify the Clob. I
am not considering the case where the Clob is stored in the database
itself.
Obviously, reading the Clob from the start every time you need to
reposition is not very efficient. One optimization is to keep track of
the "current position", but it might not help that much (depending on
the access pattern). This requires full knowledge of all update actions,
including those performed through the various streams/writers.
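As an illustrative sketch (not Derby code), the "current position"
bookkeeping could be as small as caching the last known (character, byte)
pair and resuming the scan from it on forward moves; in this simple sketch
a backward move just restarts from the beginning:

```java
public class PositionCache {
    private long charPos = 0;   // last known character position
    private long bytePos = 0;   // byte offset of that character

    // Returns the byte offset to start decoding from when seeking to
    // targetCharPos. Forward moves resume from the cache; backward moves
    // fall back to a full rescan from the start.
    long startingByteOffsetFor(long targetCharPos) {
        if (targetCharPos >= charPos) {
            return bytePos;
        }
        charPos = 0;
        bytePos = 0;
        return 0;
    }

    // Record progress after decoding some characters.
    void advance(long chars, long bytes) {
        charPos += chars;
        bytePos += bytes;
    }

    public static void main(String[] args) {
        PositionCache cache = new PositionCache();
        cache.advance(10, 14);                               // 10 chars = 14 bytes
        System.out.println(cache.startingByteOffsetFor(12)); // 14: resume
        System.out.println(cache.startingByteOffsetFor(5));  // 0: rescan
    }
}
```

How much this helps depends entirely on the access pattern: repeated
backward repositioning degenerates to full rescans.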
Another option is storing the Clob in UTF-16. This would allow direct
mapping between byte and character positions, as far as I have
understood (I had brief contact with Dag and Bernt offline), even in
the case of surrogate pairs.
However, using UTF-16 imposes a space overhead when operating on Clobs
with US-ASCII characters; in fact, the overhead is 100% (each character
is represented by two bytes instead of one). For some other languages
(and/or character sets), using UTF-16 reduces the space requirements
(two bytes versus three).
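Both points are easy to demonstrate in plain Java (UTF-16BE is used here to
avoid a byte-order mark). Since Java char positions already count UTF-16
code units, a character position p always starts at byte 2 * p, surrogate
pairs included:

```java
import java.io.UnsupportedEncodingException;

public class Utf16Overhead {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String ascii = "hello";
        // 100% overhead for US-ASCII: 10 bytes instead of 5.
        System.out.println(ascii.getBytes("UTF-16BE").length);   // 10

        // 4 chars (the supplementary character is a surrogate pair,
        // counting as 2 chars), so 8 bytes; char position p always
        // starts at byte 2 * p.
        String mixed = "a\uD83D\uDE00b";
        System.out.println(mixed.getBytes("UTF-16BE").length);   // 8

        // The same text needs 1 + 4 + 1 = 6 bytes in standard UTF-8
        // (modified UTF-8 encodes each surrogate in 3 bytes: 8 bytes).
        System.out.println(mixed.getBytes("UTF-8").length);      // 6
    }
}
```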
To summarize my view on this...
Pros, UTF-8 : more space efficient for US-ASCII, same as used by store
Pros, UTF-16: direct mapping between char/byte pos (easier logic)
Cons, UTF-8 : requires "counting"/decoding to find byte position
Cons, UTF-16: space overhead for US-ASCII, must be converted when/if
Clob goes back into the database
Can you describe in more detail in which situations you are proposing to
use UTF-16 vs. UTF-8? I know that there is a lot of performance overhead
in converting from one to the other, and I know that in the past Derby
often converted back and forth through bad code. Are the changes you
are proposing going to affect the non-update case?
Hi Mike,
Briefly stated, the encoding issue comes into play when the first update
to the clob is issued. After that, everything goes via the temporary
copy, which is held in memory or on disk depending on its size.
Non-update cases are not affected.
If the user never issues a modification operation (setString,
setCharacterStream, setAsciiStream, truncate), only streams from store
will be used to fetch the required data.
> It would be nice if the following happens:
> 1) INSERT
>    From whatever input (stream, clob, string ...) we convert it once
>    to the modified UTF-8 format that store uses on disk. In the case
>    of a stream we should read it only once and never flow it to object
>    or disk before getting it into store.
For insertions of new clobs (or other appropriate data types) through
PreparedStatement, this does happen - although I haven't checked if
using setAsciiStream causes a byte-char-byte conversion or not.
Note that there are two different types of setCharacterStream methods:
A) PreparedStatement.setCharacterStream(column, Reader)
B) Writer writer = Clob.setCharacterStream(pos)
Using B, a temporary clob will be created and the contents will spill to
disk when the size threshold is reached.
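A sketch of the two variants side by side; the table and column names are
invented, and an open Derby connection is assumed, so this is illustrative
rather than runnable as-is:

```java
import java.io.StringReader;
import java.io.Writer;
import java.sql.Clob;
import java.sql.Connection;
import java.sql.PreparedStatement;

public class ClobWriteVariants {

    // A) Stream the value straight in at insert time; the data should
    // flow to store once, without a temporary copy.
    static void insertStreaming(Connection conn, String text)
            throws Exception {
        PreparedStatement ps =
            conn.prepareStatement("INSERT INTO docs (body) VALUES (?)");
        ps.setCharacterStream(1, new StringReader(text), text.length());
        ps.executeUpdate();
        ps.close();
    }

    // B) Write through the Clob object itself; this is the path that
    // creates a temporary clob, spilling from memory to disk when the
    // size threshold is reached.
    static void overwriteFromStart(Clob clob, String text)
            throws Exception {
        Writer w = clob.setCharacterStream(1); // positions are 1-based
        w.write(text);
        w.close();
    }
}
```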
> 2) SELECT
>    Should be converted once from modified UTF-8 into whatever format
>    is requested by the select, with no intermediate object or disk
>    copies.
I believe this is also the current state. Again, a temporary clob is
used only after an update.
> What is the expected usage pattern for an update on a clob that uses
> these "temporary" clobs?
Use of temporary clobs is triggered by updates to the clob. If clobs
are updated, the number of updates per transaction is totally
application dependent.
> What is the usual input format, what is the
> usual output format?
The input formats are those of String, Reader and InputStream.
The output formats are those of String, Writer and OutputStream.
> Do you expect more than one update usually?
I don't know...
> Does an update have to rewrite the end of the file on a shrink or
> expand of the middle of the clob?
Expansion and shrinking are certainly necessary when using UTF-8.
When you do a setString(pos, str), you basically overwrite a range of
existing characters with the characters of the insertion string. If the
replaced characters are not represented by the same number of bytes as
the inserted ones, the byte array on disk (or in memory) must be
expanded or shrunk accordingly.
If we had used an encoding with a fixed number of bytes per character in
store, we could have gotten away with skipping to the right position and
then just streaming the new value into store.
As I see it, this is not possible when using UTF-8 in store.
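The resizing is easy to quantify, using writeUTF as a stand-in for the
store encoding (it emits the same modified UTF-8, prefixed with a two-byte
length). Replacing a range of characters with an equally long string can
still change the byte length:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class Utf8ResizeDelta {

    // Bytes needed for s in modified UTF-8 (writeUTF output minus the
    // two-byte length prefix).
    static int modifiedUtf8Length(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            return bos.size() - 2;
        } catch (IOException e) {
            // Cannot happen with an in-memory stream.
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        // setString(2, "\u00e6\u00f8\u00e5") on "abcde" overwrites the
        // three characters "bcd" with three other characters.
        int before = modifiedUtf8Length("bcd");                  // 3 bytes
        int after  = modifiedUtf8Length("\u00e6\u00f8\u00e5");   // 6 bytes
        // Same character count, but the byte array must grow by 3.
        System.out.println(after - before); // 3
    }
}
```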
There is no functionality for inserting a new string without overwriting
existing content, but appending to the value is possible. You can also
truncate the clob.
Last, calling Connection.createClob() will always result in a temporary
clob being created.
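For completeness, a sketch of the modification calls that all route
through such a temporary clob. It assumes an open connection and a JDBC 4
driver for createClob(), so it is illustrative rather than runnable as-is:

```java
import java.sql.Clob;
import java.sql.Connection;

public class TempClobUsage {
    static void demo(Connection conn) throws Exception {
        Clob clob = conn.createClob();           // always a temporary clob
        clob.setString(1, "hello");              // overwrite from position 1
        clob.setString(clob.length() + 1, "!");  // append: write past the end
        clob.truncate(5);                        // back to "hello"
    }
}
```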
--
Kristian
I'm sure there are other aspects, and I would like some opinions and
feedback on what to do. My two current alternatives on the table are
using the naive counting technique, or changing to UTF-16. The former
requires the fewest code changes.
To bound the scope of potential changes, I do plan to get this done
for 10.3...
thanks,