Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/20496 )
Change subject: IMPALA-12373: Small String Optimization for StringValue ...................................................................... Patch Set 14: (11 comments) http://gerrit.cloudera.org:8080/#/c/20496/14/be/src/exec/parquet/parquet-common.h File be/src/exec/parquet/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/20496/14/be/src/exec/parquet/parquet-common.h@573 PS14, Line 573: v->Assign(reinterpret_cast<char*>(const_cast<uint8_t*>(buffer)) + sizeof(int32_t), Can you add a comment about not smallifying the string? http://gerrit.cloudera.org:8080/#/c/20496/14/be/src/exec/text-converter.inline.h File be/src/exec/text-converter.inline.h: http://gerrit.cloudera.org:8080/#/c/20496/14/be/src/exec/text-converter.inline.h@99 PS14, Line 99: Wouldn't it be clearer to use SimpleString instead of StringValue from the start? We always assume that the string is not smallified. It would be also nice to have a comment about not smallifying during this function. http://gerrit.cloudera.org:8080/#/c/20496/14/be/src/runtime/buffered-tuple-stream.cc File be/src/runtime/buffered-tuple-stream.cc: http://gerrit.cloudera.org:8080/#/c/20496/14/be/src/runtime/buffered-tuple-stream.cc@948 PS14, Line 948: ComputeRowSize It could be mentioned in the function comments that this also smallifies strings. It may also make sense to change the name to ComputeRowSizeAndSmallifyStrings or something similar, as the current name gives the impression of being "read only" http://gerrit.cloudera.org:8080/#/c/20496/14/be/src/runtime/smallable-string.h File be/src/runtime/smallable-string.h: http://gerrit.cloudera.org:8080/#/c/20496/14/be/src/runtime/smallable-string.h@44 PS14, Line 44: SMALL_LIMIT future hack idea: the layout seems to be able to keep even bigger strings in the tuple. The planner could leave some extra bytes before the StringValue, and if len fits to 7 bits but > SMALL_LIMIT, then Ptr() could calculate the start of string based on length. This would make sense if based on the stats the strings are above SMALL_LIMIT but still quite small. http://gerrit.cloudera.org:8080/#/c/20496/14/be/src/runtime/sorter.cc File be/src/runtime/sorter.cc: http://gerrit.cloudera.org:8080/#/c/20496/14/be/src/runtime/sorter.cc@607 PS14, Line 607: string_values->push_back(string_val); : if (string_val->IsSmall()) continue; Do we actually need to collect these strings? The function could be renamed to CollectNonNullAndNonSmallVarSlots and skip small strings completely, as they need no further extra handling. http://gerrit.cloudera.org:8080/#/c/20496/14/be/src/runtime/sorter.cc@654 PS14, Line 654: if (string_val->IsSmall()) continue; see comment at line 607, small strings could be ignore completely. http://gerrit.cloudera.org:8080/#/c/20496/14/be/src/runtime/sorter.cc@686 PS14, Line 686: if (string_val->IsSmall()) continue; see comment at line 607 http://gerrit.cloudera.org:8080/#/c/20496/14/be/src/runtime/string-value.h File be/src/runtime/string-value.h: http://gerrit.cloudera.org:8080/#/c/20496/14/be/src/runtime/string-value.h@67 PS14, Line 67: void Assign(const StringValue& other) { string_impl_.Assign(other.string_impl_); } : : void Assign(char* ptr, int len) { optional: overloads with const std::string& and StringVal could be added as convenience functions as there are several places where assing is called with their ptr/len (StringVal is only valid if non null). http://gerrit.cloudera.org:8080/#/c/20496/14/be/src/runtime/string-value.h@78 PS14, Line 78: /// 'false' otherwise. In the latter case the object remains unmodified. Can you also mention the case when it is already smallified? http://gerrit.cloudera.org:8080/#/c/20496/14/be/src/runtime/tuple.cc File be/src/runtime/tuple.cc: http://gerrit.cloudera.org:8080/#/c/20496/14/be/src/runtime/tuple.cc@91 PS14, Line 91: if (string_v->IsSmall()) return; Can you add comments about the reason for return instead of continue? http://gerrit.cloudera.org:8080/#/c/20496/14/be/src/util/in-list-filter.cc File be/src/util/in-list-filter.cc: http://gerrit.cloudera.org:8080/#/c/20496/14/be/src/util/in-list-filter.cc@141 PS14, Line 141: if (s.IsSmall()) { : values_.insert(s) This seems inconsistent with newly_inserted_values_.total_len, as StringSetWithTotalLen::Insert increases the total size with len even for small strings. -- To view, visit http://gerrit.cloudera.org:8080/20496 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I741c3a5f12ab620b6b64b57d4c89b5f8e056efd3 Gerrit-Change-Number: 20496 Gerrit-PatchSet: 14 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Kurt Deschler <[email protected]> Gerrit-Reviewer: Michael Smith <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Mon, 13 Nov 2023 17:12:06 +0000 Gerrit-HasComments: Yes
