Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 )
Change subject: IMPALA-376: add built-in functions for parsing JSON ...................................................................... Patch Set 15: (24 comments) Thanks for your comments, Tim! Significant changes to the last patch: * handle memory allocation failures * replace std::vector with array in RapidJson so its memory are allocated by FuntionContext The memory tracking still needs to discuss. http://gerrit.cloudera.org:8080/#/c/10950/14//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/10950/14//COMMIT_MSG@27 PS14, Line 27: * Add e2e tests in exprs.test > Can we add a couple of tests to tests/custom_cluster/test_alloc_fail.py to Good point! Found that RapidJSON cannot handle failure of memory allocation. I have to throw exceptions in JsonUdfAllocator::Malloc to stop it. https://github.com/Tencent/rapidjson/issues/753#issuecomment-250058390 https://github.com/Tencent/rapidjson/issues/1269#issuecomment-392674252 http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/expr-test.cc File be/src/exprs/expr-test.cc: http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/expr-test.cc@8856 PS14, Line 8856: TestIsNull("get_json_object('{\"a\": 1}', '$.b')", TYPE_STRING); > Can you add some tests for: Done http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@51 PS9, Line 51: static const bool kNeedFree = false; > Done When dealing with allocation failures, I realized that memory allocated by FunctionContextImpl::results_pool_ (used by StringVal(FunctionContext* context, int len)) is not tracked. Looks like it's intentional. Do you know some background for this? Though I think RapidJson won't consume too much memory than the original JSON strings, we still need to track the memory consumption of the UDF, especially when we introduce cache for the parsed JSON paths in future. http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc@57 PS13, Line 57: void* Malloc(size_t size) { > I think we need to copy over the previous contents of 'originalPtr'. I look Done http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc@133 PS13, Line 133: > Can we use sb.GetSize() instead of strlen()? It seems unnecessary to walk t Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@109 PS14, Line 109: > These functions can all be static I think. Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@121 PS14, Line 121: > Use StringVal::CopyFrom() instead - it handles the edge case when memory al Done. Good point! http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@136 PS14, Line 136: if (v.IsNull()) return StringVal::null(); > Use StringVal::CopyFrom() instead Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@138 PS14, Line 138: // RapidJSON will quote the strings. It's incompatible with Hive's behavior when > SelectByKey, SelectByIndex, ExpandArrays and ProcessWildcard key could bene Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@139 PS14, Line 139: ing is at the root, so we conv > Our style says use pass by pointer for mutable arguments, and (generally) t OK. That's really strict. Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@142 PS14, Line 142: } > Why not use i for the index? here and below. Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@162 PS14, Line 162: queue->PushBack(item[key_ptr], *allocator); > Use braces here Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@172 PS14, Line 172: SizeType old_items = queue->Size(); > Use braces here Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@187 PS14, Line 187: to& v : ( > nit: "Encountered" . Here and in a couple of other places like line 205. Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@219 PS14, Line 219: RETURN_IF_OOM(ExtractValues(queue, allocator), -1); > Can't this be "return path_idx + 1". We don't use path_idx after returning. Sure. Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@224 PS14, Line 224: /// path_str. Return next unprocessed index in path_str. Return -1 for errors. > This is future work, but it would be nice in future to separate out the par Sure. Created IMPALA-7610 for this improvement. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@275 PS14, Line 275: ctx->SetError(msg.c_str()); > nit: could fit on one line. Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@283 PS14, Line 283: static bool ParseStringVal(FunctionContext* ctx, const StringVal& json_str, > Maybe comment something like: Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@294 PS14, Line 294: return true; > nit: one line Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@339 PS14, Line 339: } > We use .c_str() in SelectByKey(), so is this just to null terminate it? If Sure. Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@345 PS14, Line 345: if (i == path_str.len) { > Maybe file a JIRA and leave it here? Otherwise we'll lose track of it easil Done. Created IMPALA-7611. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/util/string-util.cc File be/src/util/string-util.cc: http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/util/string-util.cc@78 PS14, Line 78: uint8_t ch = *start++; > Use braces if the if doesn't fit on a single line. Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/util/string-util.cc@82 PS14, Line 82: while (start != end) { > Use braces if the if doesn't fit on a single line. Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/util/string-util.cc@85 PS14, Line 85: (ch >= '0' && ch <= '9') || ch == '_')) { > nit: ++start according to our style guide. Done -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 15 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Attila Jeges <[email protected]> Gerrit-Reviewer: Greg Rahn <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Mon, 24 Sep 2018 15:11:10 +0000 Gerrit-HasComments: Yes
