On Wed, Apr 28, 2010 at 5:06 AM, Tim Robertson <[email protected]> wrote:

> Thanks Carl - that looks ideal for this.
>
>
> On Wed, Apr 28, 2010 at 10:59 AM, Carl Steinbach <[email protected]> wrote:
>
>> Hi Tim,
>>
>> Larry Ogrodnek has a nice blog post describing how to use Hive's
>> TRANSFORM/MAP/REDUCE syntax with Java code here:
>> http://dev.bizo.com/2009/10/hive-map-reduce-in-java.html
>>
>> A version of the library he describes in the blog post has been added to
>> Hive's contrib directory, along with some examples and illustrative test
>> cases. Check out the following files:
>>
>> src/java/org/apache/hadoop/hive/contrib/mr/example/IdentityMapper.java
>> src/java/org/apache/hadoop/hive/contrib/mr/example/WordCountReduce.java
>> src/java/org/apache/hadoop/hive/contrib/mr/GenericMR.java
>> src/java/org/apache/hadoop/hive/contrib/mr/Mapper.java
>> src/java/org/apache/hadoop/hive/contrib/mr/Output.java
>> src/java/org/apache/hadoop/hive/contrib/mr/Reducer.java
>> src/test/queries/clientpositive/java_mr_example.q
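[Editor's note: the contrib library above exposes a small Mapper/Output API. The sketch below mirrors that style with simplified stand-in interfaces so it runs without Hive on the classpath; check the contrib sources listed above for the exact signatures.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GenericMRSketch {
    // Simplified stand-ins for the contrib Mapper/Output interfaces.
    interface Output { void collect(String[] record); }
    interface Mapper { void map(String[] record, Output output); }

    // An identity mapper in this style simply re-emits each record.
    static final Mapper IDENTITY = new Mapper() {
        public void map(String[] record, Output output) {
            output.collect(record);
        }
    };

    // Drive the mapper over tab-delimited rows, the way the contrib
    // GenericMR driver feeds records read from stdin.
    static List<String[]> run(Mapper mapper, List<String> rows) {
        final List<String[]> collected = new ArrayList<String[]>();
        Output output = new Output() {
            public void collect(String[] record) { collected.add(record); }
        };
        for (String row : rows) {
            mapper.map(row.split("\t", -1), output);
        }
        return collected;
    }

    public static void main(String[] args) {
        for (String[] rec : run(IDENTITY, Arrays.asList("a\tb"))) {
            System.out.println(String.join("\t", rec));
        }
    }
}
```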
>>
>> Hope this helps.
>>
>> Carl
>>
>>
>> On Wed, Apr 28, 2010 at 1:11 AM, Tim Robertson <[email protected]
>> > wrote:
>>
>>> Ok, so it turns out I overlooked some things in my current MR job with
>>> the configure() and a UDF isn't enough.
>>>
>>> I do want to use the Hive Map keyword and call my own MR map().
>>>
>>> Currently my map() looks like the following, which works on a tab
>>> delimited input file:
>>>
>>> public void map(LongWritable key, Text value,
>>>     OutputCollector<Text, Text> collector, Reporter reporter)
>>>     throws IOException {
>>>   Pattern tab = Pattern.compile("\t");
>>>   String[] atoms = tab.split(value.toString());
>>>   String parsed = myParseFunction(atoms);
>>>   collector.collect(new Text(parsed), new Text(atoms[0]));
>>> }
>>>
>>> What would I need to implement to make this usable with a Map keyword in
>>> Hive please, so I can run this with input from table 1, to populate table
>>> 2?
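[Editor's note: with Hive's MAP ... USING syntax, Hive streams each input row to the child process as tab-delimited text on stdin and reads tab-delimited rows back from stdout, so the map() above becomes a standalone program. A minimal sketch; `myParseFunction` here is a placeholder for Tim's real parsing logic, which is not shown in the thread.]

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class ParseMapper {
    // Placeholder: stands in for the real parse function from Tim's job.
    static String myParseFunction(String[] atoms) {
        return atoms[atoms.length - 1];
    }

    // Transform one tab-delimited input row into one tab-delimited output row,
    // mirroring collector.collect(new Text(parsed), new Text(atoms[0])).
    static String mapRow(String row) {
        String[] atoms = row.split("\t", -1);
        return myParseFunction(atoms) + "\t" + atoms[0];
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(mapRow(line));
        }
    }
}
```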
>>>
>>> Sorry for this confusion, but it is not really clear to me - all help is
>>> very gratefully received.
>>>
>>> Cheers
>>> Tim
>>>
>>>
>>>
>>>
>>> On Tue, Apr 27, 2010 at 8:06 PM, Avram Aelony <[email protected]> wrote:
>>>
>>>>
>>>> Hi -
>>>>
>>>> If you would like to "simply take an input String (Text), run some Java,
>>>> and return a new (Text) by calling a function", then you may wish to
>>>> consider using the "map" and "reduce" keywords directly from Hive, with a
>>>> scripting language like Perl containing your mapper and reducer
>>>> code.
>>>>
>>>> for example:
>>>>
>>>> create external table some_input_table ( field_1 string ) row format 
>>>> (etc...);
>>>> create table your_next_table ( output_field_1 string, output_field_2 
>>>> string, output_field_3 string );
>>>>
>>>>
>>>> from (
>>>>    from some_input_table i
>>>>      map i.field_1 using 'some_custom_mapper_code.pl' ) mapper_output
>>>>    insert overwrite table your_next_table
>>>>      reduce mapper_output.* using 'some_custom_reducer_code.pl' as 
>>>> output_field_1, output_field_2, output_field_3
>>>> ;
>>>>
>>>> --test it
>>>> select * from your_next_table ;
>>>>
>>>> Hope that helps.
>>>>
>>>> cheers,
>>>> Avram
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tuesday, April 27, 2010, at 10:55AM, "Tim Robertson" 
>>>> <[email protected]> wrote:
>>>> >
>>>>
>>>>  Thanks Edward,
>>>>
>>>>  I get where you are coming from now with that explanation.
>>>>
>>>>  Cheers,
>>>> Tim
>>>>
>>>>
>>>> On Tue, Apr 27, 2010 at 7:53 PM, Edward Capriolo <[email protected]
>>>> > wrote:
>>>>
>>>>>
>>>>>
>>>>> On Tue, Apr 27, 2010 at 1:48 PM, Tim Robertson <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hmmm... I am not trying to serialize or deserialize custom content,
>>>>>> but simply take an input String (Text) run some Java  and return a new
>>>>>> (Text) by calling a function
>>>>>>
>>>>>>  Looking at public class UDFYear extends UDF { the annotation at the
>>>>>> top suggests extending UDF and adding the annotation, might be enough.
>>>>>>
>>>>>>  I'll try it anyways...
>>>>>> Tim
>>>>>>
>>>>>> On Tue, Apr 27, 2010 at 7:37 PM, Adam O'Donnell <[email protected]> wrote:
>>>>>>
>>>>>>> It sounds like what you want is a custom SerDe.  I have tried to
>>>>>>> write
>>>>>>> one but ran into some difficulty.
>>>>>>>
>>>>>>> On Tue, Apr 27, 2010 at 10:13 AM, Tim Robertson
>>>>>>>  <[email protected]> wrote:
>>>>>>> > Thanks Edward,
>>>>>>> > You are indeed correct - I am confused!
>>>>>>> > So I checked out the source, and poked around.  If I were to extend
>>>>>>> UDF and
>>>>>>> > implement  public Text evaluate(Text source) {
>>>>>>> > would I be heading along the correct lines to use what you say
>>>>>>> above?
>>>>>>> > Thanks,
>>>>>>> > Tim
>>>>>>> >
>>>>>>> >
>>>>>>> > On Tue, Apr 27, 2010 at 5:11 PM, Edward Capriolo <
>>>>>>> [email protected]>
>>>>>>> > wrote:
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> On Tue, Apr 27, 2010 at 10:22 AM, Tim Robertson
>>>>>>> >> <[email protected]> wrote:
>>>>>>> >>>
>>>>>>> >>> Hi,
>>>>>>> >>> I currently run a MapReduce job to rewrite a tab delimited file,
>>>>>>> and then
>>>>>>> >>> I use Hive for everything after that stage.
>>>>>>> >>> Am I correct in thinking that I can create a Jar with my own
>>>>>>> method which
>>>>>>> >>> can then be called in SQL?
>>>>>>> >>> Would the syntax be:
>>>>>>> >>>   hive> ADD JAR /tmp/parse.jar;
>>>>>>> >>>   hive> INSERT OVERWRITE TABLE target SELECT s.id,
>>>>>>> >>> s.canonical, parsedName FROM source s MAP s.canonical using
>>>>>>> 'parse' as
>>>>>>> >>> parsedName;
>>>>>>> >>> and parse be a MR job?  If so what are the input and output
>>>>>>> formats
>>>>>>> >>> please for the parse?  Or is it a class implementing an interface
>>>>>>> perhaps
>>>>>>> >>> and Hive takes care of the rest?
>>>>>>> >>> Thanks for any pointers,
>>>>>>> >>> Tim
>>>>>>> >>>
>>>>>>> >>
>>>>>>> >> Tim,
>>>>>>> >>
>>>>>>> >> A UDF is an SQL function, like toString() or max().
>>>>>>> >> An InputFormat teaches Hive to read data from key/value files.
>>>>>>> >> A SerDe tells Hive how to parse input data into columns.
>>>>>>> >> Finally, the map(), reduce(), and transform() keywords you described
>>>>>>> >> are a way to
>>>>>>> >> pipe data to an external process and read the results back in. Almost
>>>>>>> >> like a
>>>>>>> >> non-native Hive UDF.
>>>>>>> >>
>>>>>>> >> So you have munged up 4 concepts together :) Do not feel bad,
>>>>>>> >> however; I
>>>>>>> >> struggled through an InputFormat for the last month.
>>>>>>> >>
>>>>>>> >> It sounds most like you want a udf that takes a string and returns
>>>>>>> a
>>>>>>> >> canonical representation.
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>   hive> ADD JAR /tmp/parse.jar;
>>>>>>> >> create temporary function canonical as 'my.package.canonical';
>>>>>>> >> select canonical(my_column) from source;
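[Editor's note: the UDF Edward describes is a class extending org.apache.hadoop.hive.ql.exec.UDF with a public evaluate(Text) method, as Tim notes later in the thread. The sketch below uses plain String to stay dependency-free, and the trim/lower-case canonicalization is only an illustration, not Tim's actual parsing logic.]

```java
// In the real class: extends org.apache.hadoop.hive.ql.exec.UDF and
// takes/returns org.apache.hadoop.io.Text rather than String.
public class Canonical {
    public String evaluate(String source) {
        if (source == null) {
            return null; // Hive passes a NULL column value as null
        }
        // Illustrative canonicalization rule only.
        return source.trim().toLowerCase();
    }
}
```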
>>>>>>> >>
>>>>>>> >> Regards,
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >
>>>>>>> >
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>  --
>>>>>>> Adam J. O'Donnell, Ph.D.
>>>>>>> Immunet Corporation
>>>>>>> Cell: +1 (267) 251-0070
>>>>>>>
>>>>>>
>>>>>>
>>>>>  Tim,
>>>>>
>>>>> I think you are on the right track with the UDF approach.
>>>>>
>>>>> You could accomplish something similar with a SerDe, except from the
>>>>> client perspective it would be more "transparent".
>>>>>
>>>>> A UDF is a bit more reusable than a SerDe. You can only choose a SerDe
>>>>> once, when the table is created, but a UDF is applied to the result set.
>>>>>
>>>>> Edward
>>>>>
>>>>
>>>>
>>>
>>
>
Based on your description, this is looking more like a SerDe. I say this
because
map(), reduce(), and transform() are almost identical to a UDF, with the
exception that they can return multiple columns more easily.

Let's look at an example.

RawFile

1\t6,7,8

So by default Hive uses a TextInputFormat.
It reads the key as "1" and the value as "6,7,8".

Now, the default OutputFormat is HiveIgnoreKeyOutputFormat. Thus Hive drops
the "1",

and the row is now "6,7,8".

Now the SerDe takes over. Based on the DELIMITER you specified when you
created the table, the SerDe attempts to split the row.

If I chose DELIMITER ',', Hive would split the row into
['6','7','8']
If I chose the wrong delimiter, say '^K' or something whacky, Hive would not
split the row at all and you would get a single field back:
['6,7,8']

So you can choose the "wrong" delimiter, as Carl suggests above; essentially
this turns the row into a single string, and then you are free to operate on
it as one. You can use a UDF that works with a single string, or use the
map() syntax, which can split the string and return multiple columns.

Either approach is valid and will yield results, but treating a row as a
single string is anti-pattern-ish, since you ideally want Hive to understand
the schema of the row.
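[Editor's note: Edward's delimiter point in miniature. With the right delimiter the row becomes columns; with a delimiter that never occurs in the data, the whole row comes back as one string, which a UDF or map() script must then split itself. '^K' is the vertical-tab control character, \u000B.]

```java
import java.util.regex.Pattern;

public class DelimiterDemo {
    // split(regex, -1) keeps trailing empty fields, as a SerDe would.
    static String[] splitRow(String row, String delimiter) {
        return row.split(Pattern.quote(delimiter), -1);
    }

    public static void main(String[] args) {
        String row = "6,7,8";
        // Matching delimiter: three columns ['6','7','8'].
        System.out.println(splitRow(row, ",").length);      // prints 3
        // Absent delimiter ('^K'): one column ['6,7,8'].
        System.out.println(splitRow(row, "\u000B").length); // prints 1
    }
}
```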
