Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
......................................................................


Patch Set 14:

(23 comments)

Based on the list discussion it sounds like we want to move forward with this.

http://gerrit.cloudera.org:8080/#/c/10950/14//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/10950/14//COMMIT_MSG@27
PS14, Line 27: * Add e2e tests in exprs.test
Can we add a couple of tests to tests/custom_cluster/test_alloc_fail.py to 
check handling of out of memory.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/expr-test.cc@8856
PS14, Line 8856:   TestIsNull("get_json_object('{illegal_json}', '$.a')", 
TYPE_STRING);
Can you add some tests for:

 * trying to extract from just a JSON literal. E.g. numbers or strings
 * trying to extract from an empty list or object


http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc@57
PS13, Line 57:   void* Realloc(void* originalPtr, size_t originalSize, size_t 
newSize) {
I think we need to copy over the previous contents of 'originalPtr'. I looked 
at the RapidJSON Realloc() implementations here and they both seem to do that: 
https://github.com/Tencent/rapidjson/blob/7e68aa0a21b7800ec98133cb106e49bd6536e25c/include/rapidjson/allocators.h


http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc@133
PS13, Line 133: strlen
Can we use sb.GetSize() instead of strlen()? It seems unnecessary to walk the 
string to find the null terminator.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@109
PS14, Line 109: StringVal ToStringVal(FunctionContext* ctx, JsonUdfAllocator* 
allocator,
These functions can all be static I think.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@121
PS14, Line 121:       memcpy(res.ptr, v->GetString(), v->GetStringLength());
Use StringVal::CopyFrom() instead - it handles the edge case when memory 
allocation fails.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@136
PS14, Line 136:   return result;
Use StringVal::CopyFrom() instead


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@138
PS14, Line 138:
SelectByKey, SelectByIndex, ExpandArrays and ProcessWildcard key could benefit 
from short function comments explaining how they transform 'queue'. E.g. for 
SelectByKey something like:

  // Extract all the values for 'key' where objects in 'queue' contain that 
queue.
  // Replace the contents of queue with the values found.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@139
PS14, Line 139: (vector<JsonUdfValue*>& queue,
Our style says use pass by pointer for mutable arguments, and (generally) to 
put mutable arguments last: 
https://google.github.io/styleguide/cppguide.html#Reference_Arguments

Exception: we sometimes put mutable arguments before immutable for context data 
structures like "FunctionContext*"


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@142
PS14, Line 142:   for (int k = 0; k < old_items; ++k) {
Why not use i for the index? here and below.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@162
PS14, Line 162:     for (JsonUdfValue* it = queue[k]->Begin(); it != 
queue[k]->End(); ++it)
Use braces here


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@172
PS14, Line 172:     for (auto it = queue[k]->MemberBegin(); it != 
queue[k]->MemberEnd(); ++it)
Use braces here


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@187
PS14, Line 187: Encounter
nit: "Encountered" . Here and in a couple of other places like line 205.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@219
PS14, Line 219:   return ++path_idx;  // path_idx points at ']'
Can't this be "return path_idx + 1". We don't use path_idx after returning.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@224
PS14, Line 224: int ProcessNumberIndex(FunctionContext* ctx, 
vector<JsonUdfValue*>& queue,
This is future work, but it would be nice in future to separate out the parsing 
of the path from the processing of the JSON. That way for constant path strings 
we could save re-parsing the path. The benefit probably isn't that huge but 
might be significant.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@275
PS14, Line 275:   if (UNLIKELY(json_str.is_null || json_str.len == 0)) {
nit: could fit on one line.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@283
PS14, Line 283:   while (path_str.ptr[beg] == ' ' && beg < path_str.len) beg++;
Maybe comment something like:
// Strip off preceding whitespace.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@294
PS14, Line 294:   if (!ParseStringVal(ctx, document, json_str)) {
nit: one line


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@339
PS14, Line 339:         string key = string(start, end);
We use .c_str() in SelectByKey(), so is this just to null terminate it? If so, 
we should mention in a comment. E.g.

  // Convert to string to automatically null terminate.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@345
PS14, Line 345:         // TODO support range syntax like [2 to 7] and keyword 
`last`.
Maybe file a JIRA and leave it here? Otherwise we'll lose track of it easily.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/util/string-util.cc@78
PS14, Line 78:   if (!((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || 
ch == '_'))
Use braces if the if doesn't fit on a single line.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/util/string-util.cc@82
PS14, Line 82:     if (!((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') ||
Use braces if the if doesn't fit on a single line.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/util/string-util.cc@85
PS14, Line 85:     start++;
nit: ++start according to our style guide.



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 14
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Attila Jeges <atti...@cloudera.com>
Gerrit-Reviewer: Greg Rahn <gr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Thu, 20 Sep 2018 01:09:34 +0000
Gerrit-HasComments: Yes

Reply via email to