Thanks Carl - that looks ideal for this.

On Wed, Apr 28, 2010 at 10:59 AM, Carl Steinbach <[email protected]> wrote:

> Hi Tim,
>
> Larry Ogrodnek has a nice blog post describing how to use Hive's
> TRANSFORM/MAP/REDUCE syntax with Java code here:
> http://dev.bizo.com/2009/10/hive-map-reduce-in-java.html
>
> A version of the library he describes in the blog post has been added to
> Hive's contrib directory, along with some examples and illustrative test
> cases. Check out the following files:
>
> src/java/org/apache/hadoop/hive/contrib/mr/example/IdentityMapper.java
> src/java/org/apache/hadoop/hive/contrib/mr/example/WordCountReduce.java
> src/java/org/apache/hadoop/hive/contrib/mr/GenericMR.java
> src/java/org/apache/hadoop/hive/contrib/mr/Mapper.java
> src/java/org/apache/hadoop/hive/contrib/mr/Output.java
> src/java/org/apache/hadoop/hive/contrib/mr/Reducer.java
> src/test/queries/clientpositive/java_mr_example.q
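
For a feel of the contrib library's style, here is a self-contained sketch of a word-count reducer in the shape the blog post shows for org.apache.hadoop.hive.contrib.mr. The Reducer/Output interfaces below are stand-ins written from that description so the sketch compiles without Hive on the classpath; the exact signatures in contrib may differ by version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class WordCountReduceSketch {

    // Stand-ins for the contrib interfaces (illustrative, not the real ones).
    interface Output {
        void collect(String[] record) throws Exception;
    }

    interface Reducer {
        // All records sharing `key` are handed over in a single call.
        void reduce(String key, Iterator<String[]> records, Output output)
                throws Exception;
    }

    // Sums the count column (field 1) per word, in the spirit of the
    // contrib WordCountReduce example.
    static final Reducer WORD_COUNT = new Reducer() {
        @Override
        public void reduce(String key, Iterator<String[]> records,
                           Output output) throws Exception {
            int sum = 0;
            while (records.hasNext()) {
                sum += Integer.parseInt(records.next()[1]);
            }
            output.collect(new String[] { key, String.valueOf(sum) });
        }
    };

    // Small driver so the sketch can be exercised without Hive.
    static String[] runWordCount(String key, List<String[]> records)
            throws Exception {
        final List<String[]> out = new ArrayList<>();
        WORD_COUNT.reduce(key, records.iterator(), new Output() {
            @Override
            public void collect(String[] record) {
                out.add(record);
            }
        });
        return out.get(0);
    }

    public static void main(String[] args) throws Exception {
        String[] result = runWordCount("hive", Arrays.asList(
                new String[] {"hive", "2"}, new String[] {"hive", "3"}));
        System.out.println(result[0] + "\t" + result[1]); // hive	5
    }
}
```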
>
> Hope this helps.
>
> Carl
>
>
> On Wed, Apr 28, 2010 at 1:11 AM, Tim Robertson 
> <[email protected]>wrote:
>
>> Ok, so it turns out I overlooked some things in my current MR job (the
>> configure() step), and a UDF isn't enough.
>>
>> I do want to use the Hive Map keyword and call my own MR map().
>>
>> Currently my map() looks like the following, which works on a
>> tab-delimited input file:
>>
>> public void map(LongWritable key, Text value,
>>     OutputCollector<Text, Text> collector, Reporter reporter)
>>     throws IOException {
>>   Pattern tab = Pattern.compile("\t");
>>   String[] atoms = tab.split(value.toString());
>>   String parsed = myParseFunction(atoms);
>>   collector.collect(new Text(parsed), new Text(atoms[0]));
>> }
>>
>> What would I need to implement to make this usable with a Map keyword in
>> Hive please, so I can run this with input from table 1, to populate table
>> 2?
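
Since Hive's MAP/TRANSFORM simply pipes each row to an external program as tab-separated columns on stdin and reads tab-separated rows back from stdout, the map() above can become a small standalone program along these lines. Note that myParseFunction below is only a placeholder for the real parsing logic in Tim's job, and the column order emitted is an assumption.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class ParseTransform {

    // Placeholder stand-in for Tim's real myParseFunction(atoms).
    static String myParseFunction(String[] atoms) {
        return atoms[atoms.length - 1].trim();
    }

    // One input row in, one output row out: parsedName \t first field.
    static String transformLine(String line) {
        String[] atoms = line.split("\t");
        return myParseFunction(atoms) + "\t" + atoms[0];
    }

    public static void main(String[] args) throws Exception {
        // Hive feeds rows on stdin and reads results from stdout.
        BufferedReader in =
                new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(transformLine(line));
        }
    }
}
```

From Hive this would be invoked via something like MAP s.canonical USING '...' AS parsedName, id — the exact command string (e.g. `java -cp parse.jar ParseTransform`) depends on how the jar is shipped and is an assumption here.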
>>
>> Sorry for this confusion, but it is not really clear to me - all help is
>> very gratefully received.
>>
>> Cheers
>> Tim
>>
>>
>>
>>
>> On Tue, Apr 27, 2010 at 8:06 PM, Avram Aelony <[email protected]> wrote:
>>
>>>
>>> Hi -
>>>
>>> If you would like to "simply take an input String (Text) run some Java  and 
>>> return a new (Text) by calling a function" then you may wish to consider 
>>> using the "map" and "reduce" keywords directly from Hive and using a 
>>> scripting language like Perl that contains your mapper and reducer code.
>>>
>>> for example:
>>>
>>> create external table some_input_table ( field_1 string ) row format 
>>> (etc...);
>>> create table your_next_table ( output_field_1 string, output_field_2 
>>> string, output_field_3 string );
>>>
>>>
>>> from (
>>>    from some_input_table i
>>>      map i.field_1 using 'some_custom_mapper_code.pl' ) mapper_output
>>>    insert overwrite table your_next_table
>>>      reduce mapper_output.* using 'some_custom_reducer_code.pl' as 
>>> output_field_1, output_field_2, output_field_3
>>> ;
>>>
>>> --test it
>>> select * from your_next_table ;
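
On the reducer side of a query like the one above, the external script just reads tab-separated rows from stdin; provided rows for one key arrive consecutively (e.g. via a CLUSTER BY in the query), it can aggregate runs of equal keys. A minimal sketch of that contract, with illustrative fields not taken from the thread:

```java
import java.util.ArrayList;
import java.util.List;

public class GroupingReducerSketch {

    // Turns key-sorted "key\tvalue" rows into "key\tcount" rows by
    // detecting boundaries between runs of equal keys.
    static List<String> reduceSorted(List<String> rows) {
        List<String> out = new ArrayList<>();
        String current = null;
        int count = 0;
        for (String row : rows) {
            String key = row.split("\t", 2)[0];
            if (!key.equals(current)) {
                if (current != null) {
                    out.add(current + "\t" + count);
                }
                current = key;
                count = 0;
            }
            count++;
        }
        if (current != null) {
            out.add(current + "\t" + count);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        rows.add("a\t1");
        rows.add("a\t2");
        rows.add("b\t9");
        for (String r : reduceSorted(rows)) {
            System.out.println(r);
        }
    }
}
```

In a real script the rows would come from stdin in a read loop, as in the mapper sketch earlier in the thread; the list-based driver here just makes the grouping logic easy to see.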
>>>
>>> Hope that helps.
>>>
>>> cheers,
>>> Avram
>>>
>>>
>>>
>>>
>>>
>>> On Tuesday, April 27, 2010, at 10:55AM, "Tim Robertson" 
>>> <[email protected]> wrote:
>>> >
>>>
>>>  Thanks Edward,
>>>
>>>  I get where you are coming from now with that explanation.
>>>
>>>  Cheers,
>>> Tim
>>>
>>>
>>> On Tue, Apr 27, 2010 at 7:53 PM, Edward Capriolo 
>>> <[email protected]>wrote:
>>>
>>>>
>>>>
>>>> On Tue, Apr 27, 2010 at 1:48 PM, Tim Robertson <
>>>> [email protected]> wrote:
>>>>
>>>>> Hmmm... I am not trying to serialize or deserialize custom content, but
>>>>> simply take an input String (Text), run some Java, and return a new
>>>>> (Text) by calling a function.
>>>>>
>>>>>  Looking at public class UDFYear extends UDF, the annotation at the top
>>>>> suggests that extending UDF and adding the annotation might be enough.
>>>>>
>>>>>  I'll try it anyways...
>>>>> Tim
>>>>>
>>>>> On Tue, Apr 27, 2010 at 7:37 PM, Adam O'Donnell <[email protected]>wrote:
>>>>>
>>>>>> It sounds like what you want is a custom SerDe.  I have tried to write
>>>>>> one but ran into some difficulty.
>>>>>>
>>>>>> On Tue, Apr 27, 2010 at 10:13 AM, Tim Robertson
>>>>>>  <[email protected]> wrote:
>>>>>> > Thanks Edward,
>>>>>> > You are indeed correct - I am confused!
>>>>>> > So I checked out the source, and poked around.  If I were to extend
>>>>>> > UDF and implement public Text evaluate(Text source), would I be
>>>>>> > heading along the correct lines to use what you say above?
>>>>>> > Thanks,
>>>>>> > Tim
>>>>>> >
>>>>>> >
>>>>>> > On Tue, Apr 27, 2010 at 5:11 PM, Edward Capriolo <
>>>>>> [email protected]>
>>>>>> > wrote:
>>>>>> >>
>>>>>> >>
>>>>>> >> On Tue, Apr 27, 2010 at 10:22 AM, Tim Robertson
>>>>>> >> <[email protected]> wrote:
>>>>>> >>>
>>>>>> >>> Hi,
>>>>>> >>> I currently run a MapReduce job to rewrite a tab delimited file,
>>>>>> >>> and then I use Hive for everything after that stage.
>>>>>> >>> Am I correct in thinking that I can create a Jar with my own
>>>>>> >>> method which can then be called in SQL?
>>>>>> >>> Would the syntax be:
>>>>>> >>>   hive> ADD JAR /tmp/parse.jar;
>>>>>> >>>   hive> INSERT OVERWRITE TABLE target SELECT s.id, s.canonical,
>>>>>> >>> parsedName FROM source s MAP s.canonical USING 'parse' AS parsedName;
>>>>>> >>> and parse be a MR job?  If so, what are the input and output formats
>>>>>> >>> for the parse, please?  Or is it perhaps a class implementing an
>>>>>> >>> interface, with Hive taking care of the rest?
>>>>>> >>> Thanks for any pointers,
>>>>>> >>> Tim
>>>>>> >>>
>>>>>> >>
>>>>>> >> Tim,
>>>>>> >>
>>>>>> >> A UDF is an SQL function, like toString() or max().
>>>>>> >> An InputFormat teaches Hive to read data from key/value files.
>>>>>> >> A SerDe tells Hive how to parse input data into columns.
>>>>>> >> Finally, the map(), reduce(), and transform() keywords you described
>>>>>> >> are a way to pipe data to an external process and read the results
>>>>>> >> back in. Almost like a UDF that is non-native to Hive.
>>>>>> >>
>>>>>> >> So you have munged up 4 concepts together :) Do not feel bad,
>>>>>> >> however; I struggled through an input format for the last month.
>>>>>> >>
>>>>>> >> It sounds most like you want a UDF that takes a string and returns
>>>>>> >> a canonical representation.
>>>>>> >>
>>>>>> >>
>>>>>> >>   hive> ADD JAR /tmp/parse.jar;
>>>>>> >> create temporary function canonical as 'my.package.canonical';
>>>>>> >> select canonical(my_column) from source;
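
A hedged sketch of the UDF shape Edward describes: a real Hive UDF extends org.apache.hadoop.hive.ql.exec.UDF and takes/returns org.apache.hadoop.io.Text, but plain String is used here so the sketch compiles without Hive on the classpath. The canonicalisation rule is a placeholder, not the actual parsing logic from this thread, and the class/package name is an assumption.

```java
public class Canonical /* extends org.apache.hadoop.hive.ql.exec.UDF in a real jar */ {

    // Hive resolves evaluate() by reflection on the registered UDF class.
    public String evaluate(String source) {
        if (source == null) {
            return null;
        }
        // Placeholder rule: trim, collapse internal whitespace, lowercase.
        return source.trim().replaceAll("\\s+", " ").toLowerCase();
    }
}
```

Registered roughly as in Edward's mail: ADD JAR /tmp/parse.jar; create temporary function canonical as 'my.package.Canonical';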
>>>>>> >>
>>>>>> >> Regards,
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >
>>>>>> >
>>>>>>
>>>>>>
>>>>>>
>>>>>>  --
>>>>>> Adam J. O'Donnell, Ph.D.
>>>>>> Immunet Corporation
>>>>>> Cell: +1 (267) 251-0070
>>>>>>
>>>>>
>>>>>
>>>>  Tim,
>>>>
>>>> I think you are on the right track with the UDF approach.
>>>>
>>>> You could accomplish something similar with a SerDe, except from the
>>>> client perspective it would be more "transparent".
>>>>
>>>> A UDF is a bit more reusable than a SerDe. You can only choose a SerDe
>>>> once, when the table is created, but your UDF is applied to the result
>>>> set.
>>>>
>>>> Edward
>>>>
>>>
>>>
>>
>
