Hi Tim,

Larry Ogrodnek has a nice blog post describing how to use Hive's
TRANSFORM/MAP/REDUCE syntax with Java code here:
http://dev.bizo.com/2009/10/hive-map-reduce-in-java.html
A version of the library he describes in the blog post has been added to
Hive's contrib directory, along with some examples and illustrative test
cases. Check out the following files:

  src/java/org/apache/hadoop/hive/contrib/mr/example/IdentityMapper.java
  src/java/org/apache/hadoop/hive/contrib/mr/example/WordCountReduce.java
  src/java/org/apache/hadoop/hive/contrib/mr/GenericMR.java
  src/java/org/apache/hadoop/hive/contrib/mr/Mapper.java
  src/java/org/apache/hadoop/hive/contrib/mr/Output.java
  src/java/org/apache/hadoop/hive/contrib/mr/Reducer.java
  src/test/queries/clientpositive/java_mr_example.q
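If it helps, the Mapper in that package is wired up roughly like this. I'm
writing the class and method signatures from memory, so please check them
against IdentityMapper.java above rather than taking this as gospel;
ParseNameMapper and myParseFunction are just stand-ins for your own code:

  import org.apache.hadoop.hive.contrib.mr.GenericMR;
  import org.apache.hadoop.hive.contrib.mr.Mapper;
  import org.apache.hadoop.hive.contrib.mr.Output;

  // Sketch only: the contrib class/method names below are from memory.
  // Reads tab-separated rows on stdin (which is how Hive's MAP/TRANSFORM
  // hands you the selected columns) and writes tab-separated rows to stdout.
  public class ParseNameMapper {

    public static void main(final String[] args) throws Exception {
      new GenericMR().map(System.in, System.out, new Mapper() {
        public void map(final String[] record, final Output output) throws Exception {
          // record is already split on tabs, one entry per selected column
          final String parsed = myParseFunction(record);
          // emit (parsedName, first input column), like your collector.collect(...)
          output.collect(new String[] { parsed, record[0] });
        }
      });
    }

    // placeholder for your real parsing logic
    private static String myParseFunction(final String[] atoms) {
      return atoms[atoms.length - 1];
    }
  }

From Hive you would then ship the jar and use the MAP keyword much as in
your own example, something along the lines of
MAP s.id, s.canonical USING 'java -cp parse.jar ParseNameMapper' AS parsedName, id
- the only contract is tab-separated rows in on stdin and tab-separated
rows out on stdout.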
Hope this helps.

Carl

On Wed, Apr 28, 2010 at 1:11 AM, Tim Robertson <[email protected]> wrote:

> Ok, so it turns out I overlooked some things in my current MR job with the
> configure(), and a UDF isn't enough.
>
> I do want to use the Hive MAP keyword and call my own MR map().
>
> Currently my map() looks like the following, which works on a tab-delimited
> input file:
>
>   public void map(LongWritable key, Text value,
>       OutputCollector<Text, Text> collector, Reporter reporter)
>       throws IOException {
>     Pattern tab = Pattern.compile("\t");
>     String[] atoms = tab.split(value.toString());
>     String parsed = myParseFunction(atoms);
>     collector.collect(new Text(parsed), new Text(atoms[0]));
>   }
>
> What would I need to implement to make this usable with the MAP keyword in
> Hive please, so I can run this with input from table 1 to populate table 2?
>
> Sorry for this confusion, but it is not really clear to me - all help is
> very gratefully received.
>
> Cheers,
> Tim
>
>
> On Tue, Apr 27, 2010 at 8:06 PM, Avram Aelony <[email protected]> wrote:
>
>> Hi -
>>
>> If you would like to "simply take an input String (Text), run some Java
>> and return a new (Text) by calling a function", then you may wish to
>> consider using the "map" and "reduce" keywords directly from Hive with a
>> scripting language like Perl that contains your mapper and reducer code.
>>
>> For example:
>>
>>   create external table some_input_table ( field_1 string ) row format (etc...);
>>   create table your_next_table ( output_field_1 string, output_field_2 string,
>>     output_field_3 string );
>>
>>   from (
>>     from some_input_table i
>>     map i.field_1 using 'some_custom_mapper_code.pl' ) mapper_output
>>   insert overwrite table your_next_table
>>   reduce mapper_output.* using 'some_custom_reducer_code.pl' as
>>     output_field_1, output_field_2, output_field_3
>>   ;
>>
>>   -- test it
>>   select * from your_next_table;
>>
>> Hope that helps.
>>
>> cheers,
>> Avram
>>
>>
>> On Tuesday, April 27, 2010, at 10:55AM, "Tim Robertson"
>> <[email protected]> wrote:
>>
>>> Thanks Edward,
>>>
>>> I get where you are coming from now with that explanation.
>>>
>>> Cheers,
>>> Tim
>>>
>>>
>>> On Tue, Apr 27, 2010 at 7:53 PM, Edward Capriolo
>>> <[email protected]> wrote:
>>>
>>>> On Tue, Apr 27, 2010 at 1:48 PM, Tim Robertson
>>>> <[email protected]> wrote:
>>>>
>>>>> Hmmm... I am not trying to serialize or deserialize custom content,
>>>>> but simply take an input String (Text), run some Java and return a new
>>>>> (Text) by calling a function.
>>>>>
>>>>> Looking at "public class UDFYear extends UDF {", the annotation at the
>>>>> top suggests that extending UDF and adding the annotation might be
>>>>> enough.
>>>>>
>>>>> I'll try it anyways...
>>>>> Tim
>>>>>
>>>>> On Tue, Apr 27, 2010 at 7:37 PM, Adam O'Donnell <[email protected]> wrote:
>>>>>
>>>>>> It sounds like what you want is a custom SerDe. I have tried to write
>>>>>> one but ran into some difficulty.
>>>>>>
>>>>>> On Tue, Apr 27, 2010 at 10:13 AM, Tim Robertson
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks Edward,
>>>>>>>
>>>>>>> You are indeed correct - I am confused!
>>>>>>>
>>>>>>> So I checked out the source and poked around. If I were to extend UDF
>>>>>>> and implement "public Text evaluate(Text source) {", would I be
>>>>>>> heading along the correct lines to use what you say above?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Tim
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Apr 27, 2010 at 5:11 PM, Edward Capriolo
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> On Tue, Apr 27, 2010 at 10:22 AM, Tim Robertson
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I currently run a MapReduce job to rewrite a tab-delimited file,
>>>>>>>>> and then I use Hive for everything after that stage.
>>>>>>>>>
>>>>>>>>> Am I correct in thinking that I can create a jar with my own method
>>>>>>>>> which can then be called in SQL? Would the syntax be:
>>>>>>>>>
>>>>>>>>>   hive> ADD JAR /tmp/parse.jar;
>>>>>>>>>   hive> INSERT OVERWRITE TABLE target SELECT s.id, s.canonical,
>>>>>>>>>     parsedName FROM source s MAP s.canonical USING 'parse' AS parsedName;
>>>>>>>>>
>>>>>>>>> and parse be a MR job? If so, what are the input and output formats
>>>>>>>>> for the parse, please? Or is it a class implementing an interface,
>>>>>>>>> perhaps, and Hive takes care of the rest?
>>>>>>>>>
>>>>>>>>> Thanks for any pointers,
>>>>>>>>> Tim
>>>>>>>>>
>>>>>>>>
>>>>>>>> Tim,
>>>>>>>>
>>>>>>>> A UDF is a SQL function like toString() or max().
>>>>>>>> An InputFormat teaches Hive to read data from key/value files.
>>>>>>>> A SerDe tells Hive how to parse input data into columns.
>>>>>>>> Finally, the map(), reduce() and transform() keywords you described
>>>>>>>> are a way to pipe data to an external process and read the results
>>>>>>>> back in - almost like a non-native Hive UDF.
>>>>>>>>
>>>>>>>> So you have munged up four concepts together :) Do not feel bad,
>>>>>>>> however - I struggled through an InputFormat for the last month.
>>>>>>>>
>>>>>>>> It sounds most like you want a UDF that takes a string and returns
>>>>>>>> a canonical representation:
>>>>>>>>
>>>>>>>>   hive> ADD JAR /tmp/parse.jar;
>>>>>>>>   hive> create temporary function canonical as 'my.package.canonical';
>>>>>>>>   hive> select canonical(my_column) from source;
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Adam J. O'Donnell, Ph.D.
>>>>>> Immunet Corporation
>>>>>> Cell: +1 (267) 251-0070
>>>>>
>>>>
>>>> Tim,
>>>>
>>>> I think you are on the right track with the UDF approach.
>>>>
>>>> You could accomplish something similar with a SerDe, except that from
>>>> the client's perspective it would be more "transparent".
>>>>
>>>> A UDF is also a bit more reusable than a SerDe: you can only choose a
>>>> SerDe once, when the table is created, but a UDF is applied on the
>>>> result set.
>>>>
>>>> Edward
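Since the UDF route Edward describes above would also cover your case, here
is roughly the whole class you asked about. It is untested and from memory,
and the package and class names are only placeholders standing in for the
'my.package.canonical' in Edward's example:

  package com.example.hive.udf;  // placeholder package

  import org.apache.hadoop.hive.ql.exec.UDF;
  import org.apache.hadoop.io.Text;

  // Untested sketch. Row-level UDF: takes one string column and returns
  // its canonical form. The @Description annotation you saw on UDFYear is
  // optional metadata and is omitted here.
  public final class Canonical extends UDF {

    public Text evaluate(final Text source) {
      if (source == null) {
        return null; // let NULLs pass through untouched
      }
      // placeholder for your real canonicalisation/parsing logic
      return new Text(source.toString().trim());
    }
  }

Registered with Edward's two statements (add jar /tmp/parse.jar; create
temporary function canonical as 'com.example.hive.udf.Canonical';) it can
then be used inline, e.g. select canonical(s.canonical) from source s;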
