[
https://issues.apache.org/jira/browse/IMPALA-376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16633296#comment-16633296
]
ASF subversion and git services commented on IMPALA-376:
--------------------------------------------------------
Commit ddef2cb9b14e7f8cf9a68a2a382e10a8e0f91c3d in impala's branch
refs/heads/master from stiga-huang
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=ddef2cb ]
IMPALA-376: add built-in functions for parsing JSON
This patch implements the same functionality as the Hive UDF
get_json_object. We reuse RapidJSON to parse the JSON string. To track
the memory used by RapidJSON, we wrap FunctionContext in an allocator.
get_json_object accepts two parameters: a JSON string and a selector
(JSON path). We parse the JSON string into a Document tree and then
perform BFS according to the selector. For example, to process
get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'),
we first apply '$[*]' to extract all the items in the root array.
This gives a queue consisting of {a:1},{a:2},{a:3}, and we then apply
the '.a' selector to every value in the queue. The queue now holds the
final results 1,2,3. As there are multiple results, they are
encapsulated in an array, so the output is the string '[1,2,3]'.
More examples can be found in expr-test.cc.
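
The queue-based selection described above can be sketched in a few
lines of Python. This is an illustration only, not the patch's C++
code: the names tokenize and select are hypothetical, the selector
grammar is reduced to '.key', '[N]' and '[*]', and returning a single
string result without quotes is an assumption about the expected
output format.

```python
import json
import re
from collections import deque

# Hypothetical, reduced selector grammar: ".key", "[N]" or "[*]".
_TOKEN = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\*|\d+)\]")


def tokenize(selector):
    """Split a selector such as '$[*].a' into (kind, arg) steps."""
    if not selector.startswith("$"):
        raise ValueError("selector must start with '$'")
    steps = []
    pos = 1
    while pos < len(selector):
        match = _TOKEN.match(selector, pos)
        if match is None:
            raise ValueError("unsupported selector: " + selector)
        if match.group(1) is not None:
            steps.append(("key", match.group(1)))
        else:
            index = match.group(2)
            steps.append(("index", "*" if index == "*" else int(index)))
        pos = match.end()
    return steps


def select(json_text, selector):
    """Apply each selector step to every value currently in the queue."""
    queue = deque([json.loads(json_text)])
    for kind, arg in tokenize(selector):
        next_queue = deque()
        for value in queue:
            if kind == "key":
                if isinstance(value, dict) and arg in value:
                    next_queue.append(value[arg])
            elif isinstance(value, list):
                if arg == "*":
                    next_queue.extend(value)
                elif arg < len(value):
                    next_queue.append(value[arg])
        queue = next_queue
    if not queue:
        return None  # nothing matched
    if len(queue) == 1:
        only = queue[0]
        # Assumption: a lone string result is returned without quotes.
        if isinstance(only, str):
            return only
        return json.dumps(only, separators=(",", ":"))
    # Multiple results are encapsulated into an array.
    return json.dumps(list(queue), separators=(",", ":"))


print(select('[{"a":1}, {"a":2}, {"a":3}]', "$[*].a"))
```

For the example in this message, the queue holds the three objects
after '[*]' and the values 1, 2, 3 after '.a', so the sketch prints
[1,2,3].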
Test:
* Add unit tests in expr-test
* Add e2e tests in exprs.test
* Add tests in test_alloc_fail.py to check handling of out of memory
Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Reviewed-on: http://gerrit.cloudera.org:8080/10950
Reviewed-by: Impala Public Jenkins
Tested-by: Impala Public Jenkins
> Built-in functions for parsing JSON
> -----------------------------------
>
> Key: IMPALA-376
> URL: https://issues.apache.org/jira/browse/IMPALA-376
> Project: IMPALA
> Issue Type: New Feature
> Components: Backend
> Affects Versions: Product Backlog
> Environment: All supported environments
> Reporter: Zoltan Toth-Czifra
> Assignee: Quanlong Huang
> Priority: Minor
> Labels: built-in-function
>
> Hi,
> Hive comes with some useful built-in UDFs to process JSON objects.
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
> Namely:
> - get_json_object
> - json_tuple
> To make Impala and Hive tables and queries more interchangeable, I am
> proposing porting these UDFs to be part of Impala's built-in functions:
> http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_functions.html
> h4. Example
> Consider the following table *raw_log*
> ||action||parameters||
> |search|{"keyword":"hotel"}|
> |visit|{"url":"http://example.com"}|
> ...and the following query:
> {code}
> SELECT get_json_object(parameters, "$.keyword") AS keyword FROM raw_log
> WHERE action='search';
> {code}
> The query should return the following results:
> ||keyword||
> |hotel|