[
https://issues.apache.org/jira/browse/DERBY-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661224#action_12661224
]
Kristian Waagan commented on DERBY-3907:
----------------------------------------
I agree ReaderToUTF8Stream can be simplified quite a bit, and I'll take a look
at it.
By adding the StreamHeaderHolder to the constructor, I think we'll lose one
feature. The current implementation has the ability to go back and fill in the
header if the value fits into the buffer (32768 minus the header size). The value
written to the header is the byte count, and it is used as a hint for sizing
the character array used when materializing the string value (CHAR, VARCHAR,
LONG VARCHAR).
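
As a rough illustration of the back-fill behaviour described above, here is a
minimal sketch. It is not Derby's actual ReaderToUTF8Stream code. The class and
method names are hypothetical, standard UTF-8 stands in for Derby's modified
UTF-8 encoding, and treating a zeroed header as "length unknown" is an
assumption made for the example.

```java
import java.nio.charset.StandardCharsets;

public class HeaderBackfillSketch {
    static final int BUFFER_SIZE = 32768; // buffer size quoted in the comment
    static final int HEADER_SIZE = 2;     // two bytes set aside for the length

    /**
     * Writes a two-byte placeholder header followed by the encoded value.
     * If the encoded value fits in the buffer, goes back and stores the
     * byte count in the header; otherwise the header stays zero
     * (assumed here to mean "length unknown").
     */
    static byte[] encode(String value) {
        // Stand-in encoding; Derby uses a modified UTF-8 format.
        byte[] data = value.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[HEADER_SIZE + data.length];
        System.arraycopy(data, 0, out, HEADER_SIZE, data.length);
        if (data.length <= BUFFER_SIZE - HEADER_SIZE) {
            // Back-fill: big-endian byte count in the first two bytes.
            out[0] = (byte) (data.length >>> 8);
            out[1] = (byte) data.length;
        }
        return out;
    }

    /** Reads the two-byte header as an unsigned sizing hint. */
    static int readHint(byte[] encoded) {
        return ((encoded[0] & 0xFF) << 8) | (encoded[1] & 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(readHint(encode("abc")));
    }
}
```

A real implementation would stream from a Reader rather than encode a whole
String at once; the sketch only shows why the header can be filled in after
the fact when the entire value is still sitting in the buffer.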
This functionality can be kept, but I'm unsure how best to do it. Some
quick proposals:
a) Duplicate header generation code in ReaderToUTF8Stream (trying to avoid
this)
b) Pass around a header generation object instead of a "pre-generated" header.
c) Pass around reference to the StringDataValue/DVD object
d) Special handling for all non-Clob data values (requires data type
information)
e) Simply check header length, if two bytes long then update header (hacky?)
I'm not sure how much performance will be affected by removing the feature, but
options (d) and (e) seem pretty simple to implement. There are also some
alternative "sizing heuristics", but I don't know how effective they are (I've
seen InputStream.available() and the start-stop index on the page being used).
When resizing, CHAR grows by 64 bytes, VARCHAR by 4 KB. Also note that the
problem only affects values inserted with a stream/reader.
Also, since we are already spending two bytes per string value to store meta
information, it would be nice to actually use them optimally. I'm tempted to
start using those two bytes for the character count, instead of the byte count.
That way, the sizing hint would be exact for almost all CHAR, VARCHAR and LONG
VARCHAR values (exceptions are those values inserted with a stream through the
"lengthless overrides" where the byte representation exceeds ~32k bytes).
The downsides of this are that some decoding loops must be modified and
probably that the hint has to be ignored when reading old databases (pre 10.5).
I also need to do a more thorough search for places where this information is
used.
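
To make the difference between a byte-count hint and a character-count hint
concrete, here is a small sketch. The class and method names are hypothetical,
and standard UTF-8 is used as a stand-in for Derby's modified UTF-8 (the two
give the same lengths for the characters in this example).

```java
import java.nio.charset.StandardCharsets;

public class SizingHintSketch {
    /** Hint as described for the current format: number of encoded bytes. */
    static int byteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Hint as proposed: number of characters, the exact char[] size needed. */
    static int charCount(String s) {
        return s.length();
    }

    public static void main(String[] args) {
        String s = "bl\u00e5b\u00e6r"; // six characters, two of them non-ASCII
        System.out.println("chars=" + charCount(s) + " bytes=" + byteCount(s));
    }
}
```

For the six-character sample string the byte count is 8, so a character array
sized from the byte count would be two slots too large, while one sized from
the character count would be exact. For pure ASCII values the two hints
coincide.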
I think using the new header format for all string types is a bad idea, as it
will add an extra 3 bytes of overhead. Further, two of those bytes will never
be used due to the maximum allowed length of the non-Clob data types.
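
The overhead argument above can be checked with a little arithmetic. The
sketch below uses a hypothetical class name and rests on assumptions: the new
header is taken to be five bytes (the current two plus the three extra
mentioned), the length field is taken to count characters, and 32,700 is used
as the largest length of a non-Clob string type (the documented LONG VARCHAR
limit).

```java
public class HeaderOverheadSketch {
    static final int OLD_HEADER_BYTES = 2;
    // Assumption: current two bytes plus the three extra bytes mentioned.
    static final int NEW_HEADER_BYTES = OLD_HEADER_BYTES + 3;
    // Assumption: largest character length of a non-Clob string type.
    static final int MAX_NON_CLOB_LENGTH = 32700;

    /** Number of whole bytes needed to store a non-negative length value. */
    static int bytesNeeded(int maxLength) {
        int bits = 32 - Integer.numberOfLeadingZeros(maxLength);
        return Math.max(1, (bits + 7) / 8);
    }

    public static void main(String[] args) {
        System.out.println("bytes needed: " + bytesNeeded(MAX_NON_CLOB_LENGTH));
        System.out.println("extra overhead: " + (NEW_HEADER_BYTES - OLD_HEADER_BYTES));
    }
}
```

Under these assumptions a length of up to 32,700 fits in two bytes, so a
four-byte length field would carry two bytes that are always zero for CHAR,
VARCHAR and LONG VARCHAR values.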
BTW, I have added a static variable for the unknown length stream header
holder. A new patch will be uploaded later.
> Save useful length information for Clobs in store
> -------------------------------------------------
>
> Key: DERBY-3907
> URL: https://issues.apache.org/jira/browse/DERBY-3907
> Project: Derby
> Issue Type: Improvement
> Components: JDBC, Store
> Affects Versions: 10.5.0.0
> Reporter: Kristian Waagan
> Assignee: Kristian Waagan
> Attachments: derby-3907-1a-alternative_approach.diff,
> derby-3907-2b-header_write_preparation.diff,
> derby-3907-2b-header_write_preparation.stat
>
>
> The store should save useful length information for Clobs. This allows the
> length to be found without decoding the whole data stream.
> The following thread raised the issue on what information to store, and also
> contains some background information:
> http://www.nabble.com/Storing-length-information-for-CLOB-on-disk-tp19197535p19197535.html
> The information to store, and the exact format of it, is still to be
> discussed/determined.
> Currently two bytes are set aside for length information, which is inadequate.