Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
......................................................................


Patch Set 15:

(24 comments)

Thanks for your comments, Tim! Significant changes to the last patch:

* handle memory allocation failures
* replace std::vector with array in RapidJson so its memory are allocated by 
FuntionContext

The memory tracking still needs to discuss.

http://gerrit.cloudera.org:8080/#/c/10950/14//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/10950/14//COMMIT_MSG@27
PS14, Line 27: * Add e2e tests in exprs.test
> Can we add a couple of tests to tests/custom_cluster/test_alloc_fail.py to
Good point! Found that RapidJSON cannot handle failure of memory allocation. I 
have to throw exceptions in JsonUdfAllocator::Malloc to stop it.

https://github.com/Tencent/rapidjson/issues/753#issuecomment-250058390
https://github.com/Tencent/rapidjson/issues/1269#issuecomment-392674252


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/expr-test.cc@8856
PS14, Line 8856:   TestIsNull("get_json_object('{\"a\": 1}', '$.b')", 
TYPE_STRING);
> Can you add some tests for:
Done


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@51
PS9, Line 51:   static const bool kNeedFree = false;
> Done
When dealing with allocation failures, I realized that memory allocated by 
FunctionContextImpl::results_pool_ (used by StringVal(FunctionContext* context, 
int len)) is not tracked. Looks like it's intentional. Do you know some 
background for this?

Though I think RapidJson won't consume too much memory than the original JSON 
strings, we still need to track the memory consumption of the UDF, especially 
when we introduce cache for the parsed JSON paths in future.


http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc@57
PS13, Line 57:   void* Malloc(size_t size) {
> I think we need to copy over the previous contents of 'originalPtr'. I look
Done


http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc@133
PS13, Line 133:
> Can we use sb.GetSize() instead of strlen()? It seems unnecessary to walk t
Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@109
PS14, Line 109:
> These functions can all be static I think.
Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@121
PS14, Line 121:
> Use StringVal::CopyFrom() instead - it handles the edge case when memory al
Done. Good point!


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@136
PS14, Line 136:     if (v.IsNull()) return StringVal::null();
> Use StringVal::CopyFrom() instead
Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@138
PS14, Line 138:       // RapidJSON will quote the strings. It's incompatible 
with Hive's behavior when
> SelectByKey, SelectByIndex, ExpandArrays and ProcessWildcard key could bene
Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@139
PS14, Line 139: ing is at the root, so we conv
> Our style says use pass by pointer for mutable arguments, and (generally) t
OK. That's really strict. Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@142
PS14, Line 142:     }
> Why not use i for the index? here and below.
Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@162
PS14, Line 162:     queue->PushBack(item[key_ptr], *allocator);
> Use braces here
Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@172
PS14, Line 172:   SizeType old_items = queue->Size();
> Use braces here
Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@187
PS14, Line 187: to& v : (
> nit: "Encountered" . Here and in a couple of other places like line 205.
Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@219
PS14, Line 219:   RETURN_IF_OOM(ExtractValues(queue, allocator), -1);
> Can't this be "return path_idx + 1". We don't use path_idx after returning.
Sure. Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@224
PS14, Line 224: /// path_str. Return next unprocessed index in path_str. Return 
-1 for errors.
> This is future work, but it would be nice in future to separate out the par
Sure. Created IMPALA-7610 for this improvement.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@275
PS14, Line 275:     ctx->SetError(msg.c_str());
> nit: could fit on one line.
Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@283
PS14, Line 283: static bool ParseStringVal(FunctionContext* ctx, const 
StringVal& json_str,
> Maybe comment something like:
Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@294
PS14, Line 294:   return true;
> nit: one line
Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@339
PS14, Line 339:       }
> We use .c_str() in SelectByKey(), so is this just to null terminate it? If
Sure. Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@345
PS14, Line 345:         if (i == path_str.len) {
> Maybe file a JIRA and leave it here? Otherwise we'll lose track of it easil
Done. Created IMPALA-7611.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/util/string-util.cc@78
PS14, Line 78:   uint8_t ch = *start++;
> Use braces if the if doesn't fit on a single line.
Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/util/string-util.cc@82
PS14, Line 82:   while (start != end) {
> Use braces if the if doesn't fit on a single line.
Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/util/string-util.cc@85
PS14, Line 85:         (ch >= '0' && ch <= '9') || ch == '_')) {
> nit: ++start according to our style guide.
Done



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 15
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Attila Jeges <[email protected]>
Gerrit-Reviewer: Greg Rahn <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Mon, 24 Sep 2018 15:11:10 +0000
Gerrit-HasComments: Yes

Reply via email to