[ 
https://issues.apache.org/jira/browse/DERBY-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661224#action_12661224
 ] 

Kristian Waagan commented on DERBY-3907:
----------------------------------------

I agree ReaderToUTF8Stream can be simplified quite a bit, and I'll take a look 
at it.

By adding the StreamHeaderHolder to the constructor, I think we'll lose one 
feature. The current implementation has the ability to go back and fill in the 
header if the value fits into the buffer (32768 bytes minus the header size). The value 
written to the header is the byte count, and it is used as a hint for sizing 
the character array used when materializing the string value (CHAR, VARCHAR, 
LONG VARCHAR).
This functionality can continue to exist, but I'm unsure how to do it. Some 
quick proposals:
 a) Duplicate header generation code in ReaderToUTF8Stream (trying to avoid 
this)
 b) Pass around a header generation object instead of a "pre-generated" header.
 c) Pass around reference to the StringDataValue/DVD object
 d) Special handling for all non-Clob data values (requires data type 
information)
 e) Simply check header length, if two bytes long then update header (hacky?)
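
To make option (e) a bit more concrete, here is a minimal sketch of the idea. 
It is not the actual ReaderToUTF8Stream code: the class and method names are 
hypothetical, standard UTF-8 stands in for Derby's encoding, the value is 
encoded in one go instead of being streamed, and I assume an all-zero header 
means "length unknown".

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of option (e); not Derby's ReaderToUTF8Stream. */
final class HeaderFillSketch {
    // Buffer size taken from the discussion above.
    static final int BUFFER_SIZE = 32768;

    /**
     * Writes the pre-generated header followed by the encoded value.
     * If the header is two bytes long and the whole value fits in the
     * buffer, the header is overwritten with the byte count afterwards.
     */
    static byte[] encode(byte[] header, String value) {
        // Stand-in encoding (assumption): plain UTF-8.
        byte[] data = value.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[header.length + data.length];
        System.arraycopy(header, 0, out, 0, header.length);
        System.arraycopy(data, 0, out, header.length, data.length);

        // Option (e): only touch the header if it is the two-byte variant.
        if (header.length == 2 && data.length <= BUFFER_SIZE - header.length) {
            out[0] = (byte) (data.length >>> 8);
            out[1] = (byte) data.length;
        }
        return out;
    }

    /** Reads the two-byte hint; 0 is treated as "unknown" in this sketch. */
    static int readHint(byte[] encoded) {
        return ((encoded[0] & 0xFF) << 8) | (encoded[1] & 0xFF);
    }
}
```

With a two-byte all-zero header, a short value gets its byte count written 
back, while a value larger than the buffer keeps the "unknown" header. A 
header of any other length is left untouched.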

I'm not sure how much performance will be affected by removing the feature, but 
options (d) and (e) seem pretty simple to implement. There are also some 
alternative "sizing heuristics", but I don't know how effective they are (I've 
seen InputStream.available() and the start-stop index on the page being used). 
When resizing, CHAR grows by 64 bytes, VARCHAR by 4 KB. Also note that the 
problem only affects values inserted with a stream/reader.
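
For a rough feel of what a missing hint could cost, the sketch below counts 
how many times an array has to be regrown when it grows by a fixed increment. 
The class name, the method and the initial capacities are assumptions made up 
for illustration. Only the 4 KB increment comes from the numbers above.

```java
/** Hypothetical sketch: counts regrowth steps for a fixed increment. */
final class GrowthSketch {
    /**
     * Returns how many times the capacity must be increased by
     * 'increment' before it can hold 'finalLength' elements.
     */
    static int reallocations(int initialCapacity, int increment, int finalLength) {
        int capacity = initialCapacity;
        int count = 0;
        while (capacity < finalLength) {
            capacity += increment; // each step implies an allocation and a copy
            count++;
        }
        return count;
    }
}
```

Starting at an assumed capacity of 4096 and growing by 4096, a value of 32000 
elements needs seven regrowth steps. With an exact hint the array is allocated 
once and never regrown.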

Also, since we are already spending two bytes per string value to store meta 
information, it would be nice to actually use them optimally.  I'm tempted to 
start using those two bytes for the character count, instead of the byte count. 
That way, the sizing hint would be exact for almost all CHAR, VARCHAR and LONG 
VARCHAR values (exceptions are those values inserted with a stream through the 
"lengthless overrides" where the byte representation exceeds ~32k bytes).
The downsides of this are that some decoding loops must be modified and 
probably that the hint has to be ignored when reading old databases (pre 10.5). 
I also need to do a more thorough search for places where this information is 
used.
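
To illustrate why the character count would make a better hint than the byte 
count, here is a small sketch. It assumes the stored format behaves like 
Java's modified UTF-8 (the encoding used by DataOutput.writeUTF), where a 
character takes one, two or three bytes. The class and method names are 
hypothetical.

```java
/** Hypothetical sketch comparing byte count and character count. */
final class LengthHintSketch {
    /** Byte length of 's' in Java's modified UTF-8 (assumed encoding). */
    static int modifiedUtf8Length(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                bytes += 1;           // plain ASCII, except U+0000
            } else if (c <= 0x07FF) {
                bytes += 2;           // U+0000 and U+0080..U+07FF
            } else {
                bytes += 3;           // everything else, incl. surrogates
            }
        }
        return bytes;
    }

    /** How many array slots a byte-count hint wastes for this value. */
    static int wastedSlots(String s) {
        return modifiedUtf8Length(s) - s.length();
    }
}
```

For pure ASCII data the two counts are equal, so the current hint is already 
exact. For a value like "bl\u00e5b\u00e6r" the byte count is 8 while the 
character count is 6, so a character array sized from the byte count is larger 
than needed.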

I think using the new header format for all string types is a bad idea, as it 
will add an extra 3 bytes of overhead. Further, two of those bytes will never 
be used due to the maximum allowed length of the non-Clob data types.


BTW, I have added a static variable for the unknown length stream header 
holder. A new patch will be uploaded later.

> Save useful length information for Clobs in store
> -------------------------------------------------
>
>                 Key: DERBY-3907
>                 URL: https://issues.apache.org/jira/browse/DERBY-3907
>             Project: Derby
>          Issue Type: Improvement
>          Components: JDBC, Store
>    Affects Versions: 10.5.0.0
>            Reporter: Kristian Waagan
>            Assignee: Kristian Waagan
>         Attachments: derby-3907-1a-alternative_approach.diff, 
> derby-3907-2b-header_write_preparation.diff, 
> derby-3907-2b-header_write_preparation.stat
>
>
> The store should save useful length information for Clobs. This allows the 
> length to be found without decoding the whole data stream.
> The following thread raised the issue on what information to store, and also 
> contains some background information: 
> http://www.nabble.com/Storing-length-information-for-CLOB-on-disk-tp19197535p19197535.html
> The information to store, and the exact format of it, is still to be 
> discussed/determined.
> Currently two bytes are set aside for length information, which is inadequate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
