Jianfeng Jia has posted comments on this change. Change subject: ASTERIXDB-1102: VarSize Encoding to store length of String and ByteArray ......................................................................
Patch Set 8: (9 comments) https://asterix-gerrit.ics.uci.edu/#/c/449/8/hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/primitive/ByteArrayPointable.java File hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/primitive/ByteArrayPointable.java: Line 106: private int hash = 0; > Could we move those to the beginning of the class (i.e. before the first me Done Line 169: public static int normalize(byte[] bytesPtr, int start) { > Could you add a comment what "normalization" is in this case? Done https://asterix-gerrit.ics.uci.edu/#/c/449/8/hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/primitive/UTF8StringPointable.java File hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/primitive/UTF8StringPointable.java: Line 76: private int stringLength; > I would agree. Can we move from saving these values to using functions? Poi I understand your worries, please refer to the above comments. Line 76: private int stringLength; > I'm not sure that I'm happy that these pointables are stateful now, but if I've moved these values to the beginning and add the comments. We do need to be more careful about these values. It's all about the speedup and "safety" trade-offs. My thought is that by using the "set" method, the pointable itself is already not safe . What we need is a safer way to guarantee these state will be reset while the super set() functions are called. To this end, I change the strategy by inserting an abstract afterReset() function inside the AbstractPointable, which will be called at the end of set() functions. Then it should be safe to have the state inside the inherited class. Line 92: * Since the {@code utf8Length} and the {@code metaLength} are often used, we compute those two value in advance. > s/value/values/ Done https://asterix-gerrit.ics.uci.edu/#/c/449/8/hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/util/AbstractVarLenObjectBuilder.java File hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/util/AbstractVarLenObjectBuilder.java: Line 49: int diff = estimateMetaLen - actualMetaLen; > It would be nice if this case was covered in the UTF8StringPointableTest. It is covered by UTF8StringBuilderTest. https://asterix-gerrit.ics.uci.edu/#/c/449/8/hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/util/UTF8CharSequence.java File hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/util/UTF8CharSequence.java: Line 55: buf = new char[utfLen]; > It seems relatively expensive to always allocate *and* copy the whole strin First of all, this class is purely moved downward from Asterix to Hyracks. This allocation only happens when there is no more space to hold the new values. I checked the usage of this CharSequence, which is used in the string regex match/find function. In those places, we directly call the java regex pattern match class & functions. Considering the usage case, the regex match is a cpu heavy computaion, and the charAt() function is required to return the value fast. To this end, I would think the current design should be OK. One drawback, not limited to this class, is the allocation of a big chunk of memory can't be released. If there are only a few long values, and the rest are all short values, then the pre-allocated memory is wasted, and can't be released. I think it could be a potential threat. But I think it should be a separate and globally applied issue to discuss with. https://asterix-gerrit.ics.uci.edu/#/c/449/8/hyracks/hyracks-storage-am-lsm-invertedindex/src/main/java/org/apache/hyracks/storage/am/lsm/invertedindex/tokenizers/AbstractUTF8StringBinaryTokenizer.java File hyracks/hyracks-storage-am-lsm-invertedindex/src/main/java/org/apache/hyracks/storage/am/lsm/invertedindex/tokenizers/AbstractUTF8StringBinaryTokenizer.java: Line 59: //TODO: This UTF8Tokenizer strongly rely on the Asterix data format, > s/rely/relies/ Done https://asterix-gerrit.ics.uci.edu/#/c/449/8/hyracks/hyracks-util/src/main/java/org/apache/hyracks/util/string/UTF8StringUtil.java File hyracks/hyracks-util/src/main/java/org/apache/hyracks/util/string/UTF8StringUtil.java: Line 214: public static int normalize(byte[] bytes, int start) { > Could we document what "normalize" means in this case? Done -- To view, visit https://asterix-gerrit.ics.uci.edu/449 To unsubscribe, visit https://asterix-gerrit.ics.uci.edu/settings Gerrit-MessageType: comment Gerrit-Change-Id: I7e95df0f06984b784ebac2c84b97e56a50207d27 Gerrit-PatchSet: 8 Gerrit-Project: hyracks Gerrit-Branch: master Gerrit-Owner: Jianfeng Jia <[email protected]> Gerrit-Reviewer: Ian Maxon <[email protected]> Gerrit-Reviewer: Jenkins <[email protected]> Gerrit-Reviewer: Jianfeng Jia <[email protected]> Gerrit-Reviewer: Preston Carman <[email protected]> Gerrit-Reviewer: Taewoo Kim <[email protected]> Gerrit-Reviewer: Till Westmann <[email protected]> Gerrit-HasComments: Yes
