[ 
https://issues.apache.org/jira/browse/HADOOP-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645352#action_12645352
 ] 

Zheng Shao commented on HADOOP-4590:
------------------------------------

There are at least two alternative ways to do this:

A. Let the caller of the function provide SQL fragments (source table, columns, 
where condition, etc.) instead of a full SQL statement; the function can then 
construct the SQL with the additional filtering conditions. The fragments can 
be at different levels, depending on the complexity/freedom that we want to 
expose to the user. If the user gives us a full SQL statement, we can still do 
post-filtering by nesting it: "SELECT * FROM xxx WHERE yyy".

B. Let the caller of the function provide a SQL fragment with variables inside, 
and the function does variable substitutions.
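
A rough illustration of A and B, as a minimal sketch in Python rather than 
anything that exists in Hive. The function names (build_query, wrap_query, 
substitute), the subquery alias "t", and the sample table and column names are 
all hypothetical. It does plain string construction with no escaping, so it 
only shows the shape of the idea.

```python
from string import Template


def build_query(columns, src_table, where=None, extra_filter=None):
    """Alternative A: assemble the SQL from caller-supplied fragments."""
    # Combine the caller's condition with the library's additional filter.
    conditions = [c for c in (where, extra_filter) if c]
    sql = "SELECT {} FROM {}".format(", ".join(columns), src_table)
    if conditions:
        sql += " WHERE " + " AND ".join("({})".format(c) for c in conditions)
    return sql


def wrap_query(full_sql, extra_filter, alias="t"):
    """Alternative A fallback: post-filter a full statement by nesting it."""
    return "SELECT * FROM ({}) {} WHERE {}".format(full_sql, alias, extra_filter)


def substitute(template_sql, **variables):
    """Alternative B: fill named variables in a caller-supplied template."""
    # Template.substitute raises KeyError if a variable is left unfilled.
    return Template(template_sql).substitute(**variables)
```

In each case the library controls the final text of the query, so the extra 
filtering is visible in the SQL itself.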


My main argument against the handler approach is that there are not many cases 
in which the library would be able to change handlers without the user 
noticing.

1. The handler has to understand the row schema in order to do filtering, but 
that information is not available if the user gives us a full SQL statement.
2. The user could write "SELECT" instead of MAP in the query, and the 
handlers would not take effect.
3. The user could nest a subquery that contains a MAP working on a totally 
different schema, while the library really intends to change only the outer 
MAP.

An extreme analogy is that we construct a command line like "awk xxx | cut -f 2" 
by prepending/appending strings, not by setting an environment variable to ask 
"cut -f 2" to do the filtering.


> User-definable handlers for MAP and REDUCE transforms
> -----------------------------------------------------
>
>                 Key: HADOOP-4590
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4590
>             Project: Hadoop Core
>          Issue Type: Wish
>          Components: contrib/hive
>            Reporter: Venky Iyer
>
> Mappers can be specified (as before) like:
> .... MAP USING 'uri' .....
> uris are in a format to be decided upon; possibilities are
> protocol://resource/param=value,param2=value2
> or
> protocol: resource_string
> For example, shell commands are like 
> sh://uniq or 
> sh: sort | uniq
> When no protocol is specified, we assume the default to be sh://.
> Another example is pyfunc://foo.bar/baz=2 , which points to the bar(baz=2) 
> function from the foo module. 
> We can add handlers for these protocols like
> add handler sh shell (default)
> add handler pyfunc "python pyhive.py"
> and replace these handlers using appropriate syntax.
> Map and Reduce handlers can be distinct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
