Jianfeng Jia has posted comments on this change.

Change subject: ASTERIXDB-1102: VarSize Encoding to store length of String and 
ByteArray
......................................................................


Patch Set 8:

(9 comments)

https://asterix-gerrit.ics.uci.edu/#/c/449/8/hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/primitive/ByteArrayPointable.java
File 
hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/primitive/ByteArrayPointable.java:

Line 106:     private int hash = 0;
> Could we move those to the beginning of the class (i.e. before the first me
Done


Line 169:     public static int normalize(byte[] bytesPtr, int start) {
> Could you add a comment what "normalization" is in this case?
Done


https://asterix-gerrit.ics.uci.edu/#/c/449/8/hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/primitive/UTF8StringPointable.java
File 
hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/primitive/UTF8StringPointable.java:

Line 76:     private int stringLength;
> I would agree. Can we move from saving these values to using functions? Poi
I understand your worries, please refer to the above comments.


Line 76:     private int stringLength;
> I'm not sure that I'm happy that these pointables are stateful now, but if 
I've moved these values to the beginning and add the comments. 

We do need to be more careful about these values. It's all about the speedup 
and "safety" trade-offs. My thought is that by using the "set" method, the 
pointable itself is already not safe . What we need is a safer way to guarantee 
these state will be reset while the super set() functions are called. 

To this end, I change the strategy by inserting an abstract afterReset() 
function inside the AbstractPointable, which will be called at the end of set() 
functions. Then it should be safe to have the state inside the inherited class.


Line 92:      * Since the {@code utf8Length} and the {@code metaLength} are 
often used, we compute those two value in advance.
> s/value/values/
Done


https://asterix-gerrit.ics.uci.edu/#/c/449/8/hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/util/AbstractVarLenObjectBuilder.java
File 
hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/util/AbstractVarLenObjectBuilder.java:

Line 49:             int diff = estimateMetaLen - actualMetaLen;
> It would be nice if this case was covered in the UTF8StringPointableTest.
It is covered by UTF8StringBuilderTest.


https://asterix-gerrit.ics.uci.edu/#/c/449/8/hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/util/UTF8CharSequence.java
File 
hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/util/UTF8CharSequence.java:

Line 55:             buf = new char[utfLen];
> It seems relatively expensive to always allocate *and* copy the whole strin
First of all, this class is purely moved downward from Asterix to Hyracks. 

This allocation only happens when there is no more space to hold the new 
values. I checked the usage of this CharSequence, which is used in the string 
regex match/find function. In those places, we directly call the java regex 
pattern match class & functions. Considering the usage case, the regex match is 
a cpu heavy computaion, and the charAt() function is required to return the 
value fast. To this end, I would think the current design should be OK. 

One drawback, not limited to this class, is the allocation of a big chunk of 
memory can't be released. If there are only a few long values, and the rest are 
all short values, then the pre-allocated memory is wasted, and can't be 
released. I think it could be a potential threat. But I think it should be a 
separate and globally applied issue to discuss with.


https://asterix-gerrit.ics.uci.edu/#/c/449/8/hyracks/hyracks-storage-am-lsm-invertedindex/src/main/java/org/apache/hyracks/storage/am/lsm/invertedindex/tokenizers/AbstractUTF8StringBinaryTokenizer.java
File 
hyracks/hyracks-storage-am-lsm-invertedindex/src/main/java/org/apache/hyracks/storage/am/lsm/invertedindex/tokenizers/AbstractUTF8StringBinaryTokenizer.java:

Line 59:     //TODO: This UTF8Tokenizer strongly rely on the Asterix data 
format,
> s/rely/relies/
Done


https://asterix-gerrit.ics.uci.edu/#/c/449/8/hyracks/hyracks-util/src/main/java/org/apache/hyracks/util/string/UTF8StringUtil.java
File 
hyracks/hyracks-util/src/main/java/org/apache/hyracks/util/string/UTF8StringUtil.java:

Line 214:     public static int normalize(byte[] bytes, int start) {
> Could we document what "normalize" means in this case?
Done


-- 
To view, visit https://asterix-gerrit.ics.uci.edu/449
To unsubscribe, visit https://asterix-gerrit.ics.uci.edu/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7e95df0f06984b784ebac2c84b97e56a50207d27
Gerrit-PatchSet: 8
Gerrit-Project: hyracks
Gerrit-Branch: master
Gerrit-Owner: Jianfeng Jia <[email protected]>
Gerrit-Reviewer: Ian Maxon <[email protected]>
Gerrit-Reviewer: Jenkins <[email protected]>
Gerrit-Reviewer: Jianfeng Jia <[email protected]>
Gerrit-Reviewer: Preston Carman <[email protected]>
Gerrit-Reviewer: Taewoo Kim <[email protected]>
Gerrit-Reviewer: Till Westmann <[email protected]>
Gerrit-HasComments: Yes

Reply via email to