Thanks Edward. As I venture further into Hive I will want it to understand the schema of the row (especially for numeric types), so I will put the effort in now to understand SerDe.
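As a rough map of that SerDe effort: a read-only SerDe boils down to implementing org.apache.hadoop.hive.serde2.SerDe, where initialize() declares the column schema via an ObjectInspector and deserialize() turns each incoming Writable row into a list of column values. A minimal sketch, assuming the Hive 0.5-era serde2 API; the class name, package, and two-column schema are illustrative, not anyone's actual code:

  package com.example.hive;  // illustrative package

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;
  import java.util.Properties;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hive.serde2.SerDe;
  import org.apache.hadoop.hive.serde2.SerDeException;
  import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
  import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
  import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;

  public class TabDelimitedSerDe implements SerDe {

    private ObjectInspector inspector;
    private final List<Object> row = new ArrayList<Object>();

    public void initialize(Configuration conf, Properties tbl) throws SerDeException {
      // Illustrative two-column schema: id and canonical, both strings.
      List<String> names = Arrays.asList("id", "canonical");
      List<ObjectInspector> inspectors = new ArrayList<ObjectInspector>();
      inspectors.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
      inspectors.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
      inspector = ObjectInspectorFactory.getStandardStructObjectInspector(names, inspectors);
    }

    // Called once per row: split the raw line into column values.
    public Object deserialize(Writable blob) throws SerDeException {
      String[] atoms = blob.toString().split("\t", -1);
      row.clear();
      row.add(atoms.length > 0 ? atoms[0] : null);
      row.add(atoms.length > 1 ? atoms[1] : null);
      return row;
    }

    public ObjectInspector getObjectInspector() throws SerDeException {
      return inspector;
    }

    public Class<? extends Writable> getSerializedClass() {
      return Text.class;
    }

    // Read-only sketch: the write path is deliberately left unimplemented.
    public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
      throw new SerDeException("serialize() not implemented in this sketch");
    }
  }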
For context... We have a 200 million row MySQL table which we currently scan in batches and process (parsing, georeferencing etc.) into another 200 million row "fact" table which serves the web. I am moving from this to Hive processing, and having really amazing results so far. Sqoop -> Hive will be integrated soon. Following that, the source table will be ported to HBase, with Hive performing this processing on HBase input and then Sqoop pushing the processed results to MySQL / PostGIS DBs (and probably some Lucene). If I can get this SerDe working (as I find time) it is really going to allow us a clean processing workflow.

More questions to follow I am sure.

Tim

On Wed, Apr 28, 2010 at 4:32 PM, Edward Capriolo <[email protected]> wrote:

> On Wed, Apr 28, 2010 at 10:16 AM, Edward Capriolo <[email protected]> wrote:
>
>> On Wed, Apr 28, 2010 at 5:06 AM, Tim Robertson <[email protected]> wrote:
>>
>>> Thanks Carl - that looks ideal for this.
>>>
>>> On Wed, Apr 28, 2010 at 10:59 AM, Carl Steinbach <[email protected]> wrote:
>>>
>>>> Hi Tim,
>>>>
>>>> Larry Ogrodnek has a nice blog post describing how to use Hive's
>>>> TRANSFORM/MAP/REDUCE syntax with Java code here:
>>>> http://dev.bizo.com/2009/10/hive-map-reduce-in-java.html
>>>>
>>>> A version of the library he describes in the blog post has been added to
>>>> Hive's contrib directory, along with some examples and illustrative test
>>>> cases. Check out the following files:
>>>>
>>>> src/java/org/apache/hadoop/hive/contrib/mr/example/IdentityMapper.java
>>>> src/java/org/apache/hadoop/hive/contrib/mr/example/WordCountReduce.java
>>>> src/java/org/apache/hadoop/hive/contrib/mr/GenericMR.java
>>>> src/java/org/apache/hadoop/hive/contrib/mr/Mapper.java
>>>> src/java/org/apache/hadoop/hive/contrib/mr/Output.java
>>>> src/java/org/apache/hadoop/hive/contrib/mr/Reducer.java
>>>> src/test/queries/clientpositive/java_mr_example.q
>>>>
>>>> Hope this helps.
>>>>
>>>> Carl
>>>>
>>>> On Wed, Apr 28, 2010 at 1:11 AM, Tim Robertson <[email protected]> wrote:
>>>>
>>>>> Ok, so it turns out I overlooked some things in my current MR job with
>>>>> the configure(), and a UDF isn't enough.
>>>>>
>>>>> I do want to use the Hive MAP keyword and call my own MR map().
>>>>>
>>>>> Currently my map() looks like the following, which works on a
>>>>> tab-delimited input file:
>>>>>
>>>>> public void map(LongWritable key, Text value,
>>>>>     OutputCollector<Text, Text> collector, Reporter reporter)
>>>>>     throws IOException {
>>>>>   Pattern tab = Pattern.compile("\t");
>>>>>   String[] atoms = tab.split(value.toString());
>>>>>   String parsed = myParseFunction(atoms);
>>>>>   collector.collect(new Text(parsed), new Text(atoms[0]));
>>>>> }
>>>>>
>>>>> What would I need to implement to make this usable with the MAP keyword
>>>>> in Hive please, so I can run this with input from table 1 to populate
>>>>> table 2?
>>>>>
>>>>> Sorry for this confusion, but it is not really clear to me - all help
>>>>> is very gratefully received.
>>>>>
>>>>> Cheers
>>>>> Tim
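To hook a map() like Tim's into Hive's MAP ... USING syntax, the external program only needs to speak Hive's streaming convention: read tab-separated input rows on stdin and write tab-separated output rows on stdout. A minimal sketch of what the jar behind 'parse' might contain; ParseMapper and the body of myParseFunction are illustrative stand-ins, not the actual code from this thread:

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.InputStreamReader;

  // Hive pipes the selected columns to this process, one
  // tab-separated row per line, and reads tab-separated rows back.
  public class ParseMapper {

    public static void main(String[] args) throws IOException {
      BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
      String line;
      while ((line = in.readLine()) != null) {
        String[] atoms = line.split("\t", -1);
        String parsed = myParseFunction(atoms);  // existing parsing logic goes here
        // Each output column maps onto the AS (...) clause in the Hive query.
        System.out.println(parsed + "\t" + atoms[0]);
      }
    }

    // Placeholder for the real parsing/georeferencing code.
    private static String myParseFunction(String[] atoms) {
      return atoms.length > 1 ? atoms[1].trim() : "";
    }
  }

It would then be invoked with something like MAP s.canonical USING 'java -cp parse.jar ParseMapper' AS parsedName, id, once ADD FILE/ADD JAR has shipped the jar to the cluster. Carl's GenericMR pointer above wraps this same stdin/stdout protocol in a friendlier API. Avram's reply below makes the same point using Perl.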
>>>>> On Tue, Apr 27, 2010 at 8:06 PM, Avram Aelony <[email protected]> wrote:
>>>>>
>>>>>> Hi -
>>>>>>
>>>>>> If you would like to "simply take an input String (Text), run some Java
>>>>>> and return a new (Text) by calling a function" then you may wish to
>>>>>> consider using the "map" and "reduce" keywords directly from Hive and
>>>>>> using a scripting language like Perl that contains your mapper and
>>>>>> reducer code.
>>>>>>
>>>>>> for example:
>>>>>>
>>>>>> create external table some_input_table ( field_1 string ) row format
>>>>>> (etc...);
>>>>>> create table your_next_table ( output_field_1 string, output_field_2
>>>>>> string, output_field_3 string );
>>>>>>
>>>>>> from (
>>>>>>   from some_input_table i
>>>>>>   map i.field_1 using 'some_custom_mapper_code.pl' ) mapper_output
>>>>>> insert overwrite table your_next_table
>>>>>> reduce mapper_output.* using 'some_custom_reducer_code.pl' as
>>>>>>   output_field_1, output_field_2, output_field_3
>>>>>> ;
>>>>>>
>>>>>> -- test it
>>>>>> select * from your_next_table;
>>>>>>
>>>>>> Hope that helps.
>>>>>>
>>>>>> cheers,
>>>>>> Avram
>>>>>>
>>>>>> On Tuesday, April 27, 2010, at 10:55 AM, "Tim Robertson"
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Thanks Edward,
>>>>>>
>>>>>> I get where you are coming from now with that explanation.
>>>>>>
>>>>>> Cheers,
>>>>>> Tim
>>>>>>
>>>>>> On Tue, Apr 27, 2010 at 7:53 PM, Edward Capriolo
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> On Tue, Apr 27, 2010 at 1:48 PM, Tim Robertson
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hmmm... I am not trying to serialize or deserialize custom content,
>>>>>>>> but simply take an input String (Text), run some Java, and return a
>>>>>>>> new (Text) by calling a function.
>>>>>>>>
>>>>>>>> Looking at public class UDFYear extends UDF {, the annotation at
>>>>>>>> the top suggests extending UDF and adding the annotation might be
>>>>>>>> enough.
>>>>>>>>
>>>>>>>> I'll try it anyways...
>>>>>>>> Tim
>>>>>>>>
>>>>>>>> On Tue, Apr 27, 2010 at 7:37 PM, Adam O'Donnell
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> It sounds like what you want is a custom SerDe. I have tried to
>>>>>>>>> write one but ran into some difficulty.
>>>>>>>>>
>>>>>>>>> On Tue, Apr 27, 2010 at 10:13 AM, Tim Robertson
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>> > Thanks Edward,
>>>>>>>>> > You are indeed correct - I am confused!
>>>>>>>>> > So I checked out the source, and poked around. If I were to
>>>>>>>>> > extend UDF and implement public Text evaluate(Text source) {,
>>>>>>>>> > would I be heading along the correct lines to use what you say
>>>>>>>>> > above?
>>>>>>>>> > Thanks,
>>>>>>>>> > Tim
>>>>>>>>> >
>>>>>>>>> > On Tue, Apr 27, 2010 at 5:11 PM, Edward Capriolo
>>>>>>>>> > <[email protected]> wrote:
>>>>>>>>> >>
>>>>>>>>> >> On Tue, Apr 27, 2010 at 10:22 AM, Tim Robertson
>>>>>>>>> >> <[email protected]> wrote:
>>>>>>>>> >>>
>>>>>>>>> >>> Hi,
>>>>>>>>> >>> I currently run a MapReduce job to rewrite a tab-delimited
>>>>>>>>> >>> file, and then I use Hive for everything after that stage.
>>>>>>>>> >>> Am I correct in thinking that I can create a jar with my own
>>>>>>>>> >>> method which can then be called in SQL?
>>>>>>>>> >>> Would the syntax be:
>>>>>>>>> >>> hive> ADD JAR /tmp/parse.jar;
>>>>>>>>> >>> hive> INSERT OVERWRITE TABLE target SELECT s.id,
>>>>>>>>> >>> s.canonical, parsedName FROM source s MAP s.canonical using
>>>>>>>>> >>> 'parse' as parsedName;
>>>>>>>>> >>> and would parse be an MR job? If so, what are the input and
>>>>>>>>> >>> output formats for the parse, please? Or is it a class
>>>>>>>>> >>> implementing an interface perhaps, and Hive takes care of
>>>>>>>>> >>> the rest?
>>>>>>>>> >>> Thanks for any pointers,
>>>>>>>>> >>> Tim
>>>>>>>>> >>>
>>>>>>>>> >> Tim,
>>>>>>>>> >>
>>>>>>>>> >> A UDF is an SQL function, like toString() or max().
>>>>>>>>> >> An InputFormat teaches Hive to read data from key/value files.
>>>>>>>>> >> A SerDe tells Hive how to parse input data into columns.
>>>>>>>>> >> Finally, the map(), reduce(), transform() keywords you described
>>>>>>>>> >> are a way to pipe data to an external process and read the
>>>>>>>>> >> results back in. Almost like a non-native Hive UDF.
>>>>>>>>> >>
>>>>>>>>> >> So you have munged up 4 concepts together :) Do not feel bad
>>>>>>>>> >> however, I struggled through an InputFormat for the last month.
>>>>>>>>> >>
>>>>>>>>> >> It sounds most like you want a UDF that takes a string and
>>>>>>>>> >> returns a canonical representation.
>>>>>>>>> >>
>>>>>>>>> >> hive> ADD JAR /tmp/parse.jar;
>>>>>>>>> >> hive> create temporary function canonical as 'my.package.canonical';
>>>>>>>>> >> hive> select canonical(my_column) from source;
>>>>>>>>> >>
>>>>>>>>> >> Regards,
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Adam J. O'Donnell, Ph.D.
>>>>>>>>> Immunet Corporation
>>>>>>>>> Cell: +1 (267) 251-0070
>>>>>>>
>>>>>>> Tim,
>>>>>>>
>>>>>>> I think you are on the right track with the UDF approach.
>>>>>>>
>>>>>>> You could accomplish something similar with a SerDe, except that from
>>>>>>> the client perspective it would be more "transparent".
>>>>>>>
>>>>>>> A UDF is a bit more reusable than a SerDe. You can only choose a SerDe
>>>>>>> once, when the table is created, but your UDF is applied on the
>>>>>>> result set.
>>>>>>>
>>>>>>> Edward

>> Based on your description, this is now looking more like a SerDe. I say
>> this because map(), reduce(), transform() are almost identical to a UDF,
>> with the exception that they can return multiple columns more easily.
>>
>> Let's look at an example.
>>
>> RawFile
>>
>> 1\t6,7,8
>>
>> So by default Hive uses a TextInputFormat.
>> It reads the key as "1" and the value as "6,7,8".
>>
>> Now the default OutputFormat is HiveIgnoreKeyOutputFormat, thus Hive
>> drops the "1"
>>
>> and the output is now "6,7,8".
>>
>> Now the SerDe takes over. Based on the DELIMITER you specified when you
>> created the table, the SerDe attempts to split the row.
>>
>> If I choose a DELIMITER ',', Hive would split the row into
>> ['6','7','8']
>> If I choose the wrong delimiter, say '^K' or something whacky, Hive
>> would not split the row at all and you would get one column back:
>> ['6,7,8']
>>
>> So you can choose the 'wrong delimiter', as Carl suggests above;
>> essentially this turns the row into a single string. Then you are free
>> to operate on it as a single string. You can use a UDF that works with a
>> single string, or use a map() syntax that can split the string and
>> return multiple columns.
>>
>> Any approach is valid and will yield results, but treating a row as a
>> single string is anti-pattern-ish, since you ideally want Hive to
>> understand the schema of the row.

> Just wanted to follow up a bit. One benefit of Carl's approach is that
> map(), reduce(), transform() is definitely easier to try. Making a SerDe
> takes a while: you have to learn the framework, write some code, alter
> your current table definition, etc. map(), reduce(), transform() can be
> done without writing Java code to extend Hive.
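To round out the UDF route Edward outlines above, the Java side of his create temporary function example would look roughly like this. The package name and the canonicalisation logic are illustrative (Hive finds the evaluate() method by reflection, as the thread's public Text evaluate(Text source) question suggests):

  package com.example.hive;  // illustrative; must match the class name in CREATE TEMPORARY FUNCTION

  import org.apache.hadoop.hive.ql.exec.UDF;
  import org.apache.hadoop.io.Text;

  // Used as:
  //   ADD JAR /tmp/parse.jar;
  //   CREATE TEMPORARY FUNCTION canonical AS 'com.example.hive.Canonical';
  //   SELECT canonical(my_column) FROM source;
  public class Canonical extends UDF {

    private final Text result = new Text();

    public Text evaluate(Text source) {
      if (source == null) {
        return null;  // propagate SQL NULLs
      }
      // Placeholder canonicalisation: trim and collapse internal whitespace.
      result.set(source.toString().trim().replaceAll("\\s+", " "));
      return result;
    }
  }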
