On Wed, Apr 28, 2010 at 10:16 AM, Edward Capriolo <[email protected]> wrote:
> On Wed, Apr 28, 2010 at 5:06 AM, Tim Robertson <[email protected]> wrote:
>
>> Thanks Carl - that looks ideal for this.
>>
>> On Wed, Apr 28, 2010 at 10:59 AM, Carl Steinbach <[email protected]> wrote:
>>
>>> Hi Tim,
>>>
>>> Larry Ogrodnek has a nice blog post describing how to use Hive's
>>> TRANSFORM/MAP/REDUCE syntax with Java code here:
>>> http://dev.bizo.com/2009/10/hive-map-reduce-in-java.html
>>>
>>> A version of the library he describes in the blog post has been added to
>>> Hive's contrib directory, along with some examples and illustrative test
>>> cases. Check out the following files:
>>>
>>> src/java/org/apache/hadoop/hive/contrib/mr/example/IdentityMapper.java
>>> src/java/org/apache/hadoop/hive/contrib/mr/example/WordCountReduce.java
>>> src/java/org/apache/hadoop/hive/contrib/mr/GenericMR.java
>>> src/java/org/apache/hadoop/hive/contrib/mr/Mapper.java
>>> src/java/org/apache/hadoop/hive/contrib/mr/Output.java
>>> src/java/org/apache/hadoop/hive/contrib/mr/Reducer.java
>>> src/test/queries/clientpositive/java_mr_example.q
>>>
>>> Hope this helps.
>>>
>>> Carl
>>>
>>> On Wed, Apr 28, 2010 at 1:11 AM, Tim Robertson <[email protected]> wrote:
>>>
>>>> Ok, so it turns out I overlooked some things in my current MR job with
>>>> configure(), and a UDF isn't enough.
>>>>
>>>> I do want to use the Hive MAP keyword and call my own MR map().
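[Editor's note: the contrib library Carl lists above wraps the streaming protocol in small Mapper/Output interfaces. The sketch below is a self-contained *mimic* of that style so it compiles on its own - the interface names and signatures are assumptions inferred from the file list, not copied from the actual contrib source.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A self-contained mimic of the GenericMR style from Hive's contrib
// directory. The real interfaces live in org.apache.hadoop.hive.contrib.mr;
// the shapes here are assumptions based on the file list above.
public class GenericMRSketch {

    // Collects output rows; assumed to write tab-separated lines in the real library.
    interface Output {
        void collect(String[] record) throws Exception;
    }

    // One record in, zero or more records out.
    interface Mapper {
        void map(String[] record, Output output) throws Exception;
    }

    // An identity mapper, mirroring the IdentityMapper example in contrib:
    // emits each record unchanged.
    static final Mapper IDENTITY = (record, output) -> output.collect(record);

    // Runs a mapper over tab-separated input lines and returns the
    // tab-separated output lines.
    static List<String> run(Mapper mapper, List<String> lines) {
        List<String> out = new ArrayList<>();
        Output output = record -> out.add(String.join("\t", record));
        try {
            for (String line : lines) {
                mapper.map(line.split("\t", -1), output);
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // The identity mapper passes both rows through unchanged.
        System.out.println(run(IDENTITY, Arrays.asList("1\ta", "2\tb")));
    }
}
```

The point of the pattern is that your map() logic only sees `String[]` records and never touches stdin/stdout directly; the harness owns the streaming protocol.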
>>>>
>>>> Currently my map() looks like the following, which works on a tab
>>>> delimited input file:
>>>>
>>>> public void map(LongWritable key, Text value,
>>>>     OutputCollector<Text, Text> collector, Reporter reporter)
>>>>     throws IOException {
>>>>   Pattern tab = Pattern.compile("\t");
>>>>   String[] atoms = tab.split(value.toString());
>>>>   String parsed = myParseFunction(atoms);
>>>>   collector.collect(new Text(parsed), new Text(atoms[0]));
>>>> }
>>>>
>>>> What would I need to implement to make this usable with a MAP keyword in
>>>> Hive please, so I can run this with input from table 1 to populate table 2?
>>>>
>>>> Sorry for the confusion, but it is not really clear to me - all help is
>>>> very gratefully received.
>>>>
>>>> Cheers,
>>>> Tim
>>>>
>>>> On Tue, Apr 27, 2010 at 8:06 PM, Avram Aelony <[email protected]> wrote:
>>>>
>>>>> Hi -
>>>>>
>>>>> If you would like to "simply take an input String (Text), run some Java,
>>>>> and return a new (Text) by calling a function", then you may wish to
>>>>> consider using the "map" and "reduce" keywords directly from Hive with
>>>>> a scripting language like Perl that contains your mapper and
>>>>> reducer code.
>>>>>
>>>>> For example:
>>>>>
>>>>> create external table some_input_table ( field_1 string ) row format
>>>>> (etc...);
>>>>> create table your_next_table ( output_field_1 string, output_field_2
>>>>> string, output_field_3 string );
>>>>>
>>>>> from (
>>>>>   from some_input_table i
>>>>>   map i.field_1 using 'some_custom_mapper_code.pl' ) mapper_output
>>>>> insert overwrite table your_next_table
>>>>> reduce mapper_output.* using 'some_custom_reducer_code.pl' as
>>>>>   output_field_1, output_field_2, output_field_3
>>>>> ;
>>>>>
>>>>> -- test it
>>>>> select * from your_next_table ;
>>>>>
>>>>> Hope that helps.
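[Editor's note: Avram's suggestion works just as well in Java as in Perl, because MAP/TRANSFORM only speaks a streaming contract: Hive pipes each input row to the script's stdin as a tab-separated line and reads tab-separated output rows back from stdout. Tim's map() reworked against that contract might look like the sketch below; myParseFunction here is a made-up stand-in for his real parsing logic.]

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Tim's map() as a Hive MAP/TRANSFORM script: reads tab-separated rows
// from stdin, writes tab-separated rows to stdout. No Hadoop types needed.
public class ParseScript {

    // Stand-in for the real parse logic (assumption, not Tim's code).
    static String myParseFunction(String[] atoms) {
        return atoms[0].trim().toLowerCase();
    }

    // One input row -> one output row: the parsed value, then the
    // original first column, mirroring collector.collect(...) above.
    static String transformLine(String line) {
        String[] atoms = line.split("\t", -1);
        return myParseFunction(atoms) + "\t" + atoms[0];
    }

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(transformLine(line));
        }
    }
}
```

It would then be invoked from Hive with something like `MAP s.canonical USING 'java ParseScript' AS parsedName, original` (class name and invocation hypothetical - in practice the jar must be shipped with ADD FILE and be on the java classpath of the command).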
>>>>>
>>>>> cheers,
>>>>> Avram
>>>>>
>>>>> On Tuesday, April 27, 2010, at 10:55 AM, "Tim Robertson"
>>>>> <[email protected]> wrote:
>>>>>
>>>>> Thanks Edward,
>>>>>
>>>>> I get where you are coming from now with that explanation.
>>>>>
>>>>> Cheers,
>>>>> Tim
>>>>>
>>>>> On Tue, Apr 27, 2010 at 7:53 PM, Edward Capriolo
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> On Tue, Apr 27, 2010 at 1:48 PM, Tim Robertson
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Hmmm... I am not trying to serialize or deserialize custom content,
>>>>>>> but simply take an input String (Text), run some Java, and return a new
>>>>>>> (Text) by calling a function.
>>>>>>>
>>>>>>> Looking at public class UDFYear extends UDF, the annotation at the
>>>>>>> top suggests extending UDF and adding the annotation might be enough.
>>>>>>>
>>>>>>> I'll try it anyway...
>>>>>>> Tim
>>>>>>>
>>>>>>> On Tue, Apr 27, 2010 at 7:37 PM, Adam O'Donnell <[email protected]> wrote:
>>>>>>>
>>>>>>>> It sounds like what you want is a custom SerDe. I have tried to write
>>>>>>>> one but ran into some difficulty.
>>>>>>>>
>>>>>>>> On Tue, Apr 27, 2010 at 10:13 AM, Tim Robertson
>>>>>>>> <[email protected]> wrote:
>>>>>>>> > Thanks Edward,
>>>>>>>> > You are indeed correct - I am confused!
>>>>>>>> > So I checked out the source and poked around. If I were to extend UDF and
>>>>>>>> > implement public Text evaluate(Text source),
>>>>>>>> > would I be heading along the correct lines to use what you say above?
>>>>>>>> > Thanks,
>>>>>>>> > Tim
>>>>>>>> >
>>>>>>>> > On Tue, Apr 27, 2010 at 5:11 PM, Edward Capriolo
>>>>>>>> > <[email protected]> wrote:
>>>>>>>> >>
>>>>>>>> >> On Tue, Apr 27, 2010 at 10:22 AM, Tim Robertson
>>>>>>>> >> <[email protected]> wrote:
>>>>>>>> >>>
>>>>>>>> >>> Hi,
>>>>>>>> >>> I currently run a MapReduce job to rewrite a tab delimited file, and then
>>>>>>>> >>> I use Hive for everything after that stage.
>>>>>>>> >>> Am I correct in thinking that I can create a jar with my own method which
>>>>>>>> >>> can then be called in SQL?
>>>>>>>> >>> Would the syntax be:
>>>>>>>> >>> hive> ADD JAR /tmp/parse.jar;
>>>>>>>> >>> hive> INSERT OVERWRITE TABLE target SELECT s.id,
>>>>>>>> >>> s.canonical, parsedName FROM source s MAP s.canonical using 'parse' as
>>>>>>>> >>> parsedName;
>>>>>>>> >>> and parse be a MR job? If so, what are the input and output formats
>>>>>>>> >>> for the parse, please? Or is it perhaps a class implementing an interface,
>>>>>>>> >>> with Hive taking care of the rest?
>>>>>>>> >>> Thanks for any pointers,
>>>>>>>> >>> Tim
>>>>>>>> >>>
>>>>>>>> >> Tim,
>>>>>>>> >>
>>>>>>>> >> A UDF is an SQL function like toString() or max().
>>>>>>>> >> An InputFormat teaches Hive to read data from key/value files.
>>>>>>>> >> A SerDe tells Hive how to parse input data into columns.
>>>>>>>> >> Finally, the map(), reduce(), and transform() keywords you described are
>>>>>>>> >> a way to pipe data to an external process and read the results back in -
>>>>>>>> >> almost like a non-native Hive UDF.
>>>>>>>> >>
>>>>>>>> >> So you have munged up 4 concepts together :) Do not feel bad, however;
>>>>>>>> >> I struggled through an InputFormat for the last month.
>>>>>>>> >>
>>>>>>>> >> It sounds most like you want a UDF that takes a string and returns a
>>>>>>>> >> canonical representation.
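[Editor's note: the UDF Edward describes might look like the sketch below. In real Hive code this class would extend org.apache.hadoop.hive.ql.exec.UDF and evaluate() would take and return org.apache.hadoop.io.Text, as UDFYear does; plain String is used here so the sketch compiles without the Hive jars, and the canonicalization rule itself (trim, collapse whitespace, lowercase) is a made-up example, not Tim's actual logic.]

```java
// Sketch of a "canonical" UDF: takes a string, returns a canonical form.
// In real Hive code: extend org.apache.hadoop.hive.ql.exec.UDF and use
// Text in place of String; this standalone version keeps only the logic.
public class Canonical /* extends UDF */ {

    public String evaluate(String source) {
        if (source == null) {
            return null;  // UDFs must tolerate NULL input rows
        }
        // Example rule only: trim, collapse runs of whitespace, lowercase.
        return source.trim().replaceAll("\\s+", " ").toLowerCase();
    }
}
```

Once packaged in a jar, it is registered and called exactly as in Edward's three-line `ADD JAR` / `create temporary function` / `select` snippet.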
>>>>>>>> >>
>>>>>>>> >> hive> ADD JAR /tmp/parse.jar;
>>>>>>>> >> hive> create temporary function canonical as 'my.package.canonical';
>>>>>>>> >> hive> select canonical(my_column) from source;
>>>>>>>> >>
>>>>>>>> >> Regards,
>>>>>>>>
>>>>>>>> --
>>>>>>>> Adam J. O'Donnell, Ph.D.
>>>>>>>> Immunet Corporation
>>>>>>>> Cell: +1 (267) 251-0070
>>>>>>
>>>>>> Tim,
>>>>>>
>>>>>> I think you are on the right track with the UDF approach.
>>>>>>
>>>>>> You could accomplish something similar with a SerDe, except that from the
>>>>>> client perspective it would be more "transparent".
>>>>>>
>>>>>> A UDF is a bit more reusable than a SerDe. You can only choose a SerDe
>>>>>> once, when the table is created, but your UDF is applied on the result set.
>>>>>>
>>>>>> Edward
>
> Based on your description, this is now looking more like a SerDe. I say this
> because map(), reduce(), and transform() are almost identical to a UDF, with the
> exception that they can return multiple columns more easily.
>
> Let's look at an example.
>
> RawFile:
>
> 1\t6,7,8
>
> By default Hive uses a TextInputFormat.
> It reads the key as "1" and the value as "6,7,8".
>
> The default OutputFormat is HiveIgnoreKeyTextOutputFormat, so Hive drops
> the "1", and the output is now "6,7,8".
>
> Now the SerDe takes over. Based on the DELIMITER you specified when you
> created the table, the SerDe attempts to split the row.
>
> If I chose the DELIMITER ',', Hive would split the row into
> ['6','7','8'].
> If I chose the wrong delimiter, say '^K' or something whacky, Hive really
> would not split the row and you would get 1 column back:
> ['6,7,8']
>
> So you can choose the 'wrong delimiter' as Carl suggests above; essentially
> this turns the row into a single string. Then you are free to operate on it
> as a single string.
> You can use a UDF that works with a single string, or use
> a map() syntax that can split the string and return multiple columns.
>
> Any approach is valid and will yield results, but treating a row as a
> single string is anti-pattern-ish, since you ideally want Hive to understand
> the schema of the row.

Just wanted to follow up a bit. One benefit of Carl's approach is that map(),
reduce(), and transform() are definitely easier to try. Making a SerDe takes a
while: you have to learn the framework, write some code, alter your current
table definition, etc. map(), reduce(), and transform() can be done without
writing Java code to extend Hive.
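[Editor's note: the delimiter behaviour Edward walks through - a matching delimiter yields three columns, a "wrong" one leaves the row as a single string - can be demonstrated with a plain split, independent of Hive:]

```java
import java.util.Arrays;

// Demonstrates Edward's SerDe delimiter example: once the key "1" is
// dropped, the row is the string "6,7,8"; the field delimiter alone
// decides how many columns come back.
public class DelimiterDemo {
    public static void main(String[] args) {
        String row = "6,7,8";

        // Matching delimiter: three columns, as Hive's SerDe would produce.
        String[] matched = row.split(",");
        System.out.println(Arrays.toString(matched));   // [6, 7, 8]

        // "Wrong" delimiter (^K, i.e. \u000B): nothing matches, so the
        // whole row stays as one column - the single-string trick above.
        String[] unmatched = row.split("\u000B");
        System.out.println(Arrays.toString(unmatched)); // [6,7,8]
    }
}
```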
