Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 )
Change subject: IMPALA-376: add built-in functions for parsing JSON ...................................................................... Patch Set 14: (23 comments) Based on the list discussion it sounds like we want to move forward with this. http://gerrit.cloudera.org:8080/#/c/10950/14//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/10950/14//COMMIT_MSG@27 PS14, Line 27: * Add e2e tests in exprs.test Can we add a couple of tests to tests/custom_cluster/test_alloc_fail.py to check handling of out of memory. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/expr-test.cc File be/src/exprs/expr-test.cc: http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/expr-test.cc@8856 PS14, Line 8856: TestIsNull("get_json_object('{illegal_json}', '$.a')", TYPE_STRING); Can you add some tests for: * trying to extract from just a JSON literal. E.g. numbers or strings * trying to extract from an empty list or object http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc@57 PS13, Line 57: void* Realloc(void* originalPtr, size_t originalSize, size_t newSize) { I think we need to copy over the previous contents of 'originalPtr'. I looked at the RapidJSON Realloc() implementations here and they both seem to do that: https://github.com/Tencent/rapidjson/blob/7e68aa0a21b7800ec98133cb106e49bd6536e25c/include/rapidjson/allocators.h http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc@133 PS13, Line 133: strlen Can we use sb.GetSize() instead of strlen()? It seems unnecessary to walk the string to find the null terminator. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@109 PS14, Line 109: StringVal ToStringVal(FunctionContext* ctx, JsonUdfAllocator* allocator, These functions can all be static I think. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@121 PS14, Line 121: memcpy(res.ptr, v->GetString(), v->GetStringLength()); Use StringVal::CopyFrom() instead - it handles the edge case when memory allocation fails. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@136 PS14, Line 136: return result; Use StringVal::CopyFrom() instead http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@138 PS14, Line 138: SelectByKey, SelectByIndex, ExpandArrays and ProcessWildcard key could benefit from short function comments explaining how they transform 'queue'. E.g. for SelectByKey something like: // Extract all the values for 'key' where objects in 'queue' contain that queue. // Replace the contents of queue with the values found. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@139 PS14, Line 139: (vector<JsonUdfValue*>& queue, Our style says use pass by pointer for mutable arguments, and (generally) to put mutable arguments last: https://google.github.io/styleguide/cppguide.html#Reference_Arguments Exception: we sometimes put mutable arguments before immutable for context data structures like "FunctionContext*" http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@142 PS14, Line 142: for (int k = 0; k < old_items; ++k) { Why not use i for the index? here and below. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@162 PS14, Line 162: for (JsonUdfValue* it = queue[k]->Begin(); it != queue[k]->End(); ++it) Use braces here http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@172 PS14, Line 172: for (auto it = queue[k]->MemberBegin(); it != queue[k]->MemberEnd(); ++it) Use braces here http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@187 PS14, Line 187: Encounter nit: "Encountered" . Here and in a couple of other places like line 205. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@219 PS14, Line 219: return ++path_idx; // path_idx points at ']' Can't this be "return path_idx + 1". We don't use path_idx after returning. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@224 PS14, Line 224: int ProcessNumberIndex(FunctionContext* ctx, vector<JsonUdfValue*>& queue, This is future work, but it would be nice in future to separate out the parsing of the path from the processing of the JSON. That way for constant path strings we could save re-parsing the path. The benefit probably isn't that huge but might be significant. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@275 PS14, Line 275: if (UNLIKELY(json_str.is_null || json_str.len == 0)) { nit: could fit on one line. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@283 PS14, Line 283: while (path_str.ptr[beg] == ' ' && beg < path_str.len) beg++; Maybe comment something like: // Strip off preceding whitespace. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@294 PS14, Line 294: if (!ParseStringVal(ctx, document, json_str)) { nit: one line http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@339 PS14, Line 339: string key = string(start, end); We use .c_str() in SelectByKey(), so is this just to null terminate it? If so, we should mention in a comment. E.g. // Convert to string to automatically null terminate. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@345 PS14, Line 345: // TODO support range syntax like [2 to 7] and keyword `last`. Maybe file a JIRA and leave it here? Otherwise we'll lose track of it easily. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/util/string-util.cc File be/src/util/string-util.cc: http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/util/string-util.cc@78 PS14, Line 78: if (!((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || ch == '_')) Use braces if the if doesn't fit on a single line. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/util/string-util.cc@82 PS14, Line 82: if (!((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || Use braces if the if doesn't fit on a single line. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/util/string-util.cc@85 PS14, Line 85: start++; nit: ++start according to our style guide. -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 14 Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Reviewer: Attila Jeges <atti...@cloudera.com> Gerrit-Reviewer: Greg Rahn <gr...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Thu, 20 Sep 2018 01:09:34 +0000 Gerrit-HasComments: Yes