Hi Tim,

Larry Ogrodnek has a nice blog post describing how to use Hive's
TRANSFORM/MAP/REDUCE syntax with Java code here:
http://dev.bizo.com/2009/10/hive-map-reduce-in-java.html
A version of the library he describes in the blog post has been added to
Hive's contrib directory, along with some examples and illustrative test
cases. Check out the following files:

  src/java/org/apache/hadoop/hive/contrib/mr/example/IdentityMapper.java
  src/java/org/apache/hadoop/hive/contrib/mr/example/WordCountReduce.java
  src/java/org/apache/hadoop/hive/contrib/mr/GenericMR.java
  src/java/org/apache/hadoop/hive/contrib/mr/Mapper.java
  src/java/org/apache/hadoop/hive/contrib/mr/Output.java
  src/java/org/apache/hadoop/hive/contrib/mr/Reducer.java
  src/test/queries/clientpositive/java_mr_example.q
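If it helps, the Mapper in that package is wired up roughly like this. I'm
writing the class and method signatures from memory, so please check them
against IdentityMapper.java above rather than taking this as gospel;
ParseNameMapper and myParseFunction are just stand-ins for your own code:

  import org.apache.hadoop.hive.contrib.mr.GenericMR;
  import org.apache.hadoop.hive.contrib.mr.Mapper;
  import org.apache.hadoop.hive.contrib.mr.Output;

  // Sketch only: the contrib class/method names below are from memory.
  // Reads tab-separated rows on stdin (which is how Hive's MAP/TRANSFORM
  // hands you the selected columns) and writes tab-separated rows to stdout.
  public class ParseNameMapper {

    public static void main(final String[] args) throws Exception {
      new GenericMR().map(System.in, System.out, new Mapper() {
        public void map(final String[] record, final Output output) throws Exception {
          // record is already split on tabs, one entry per selected column
          final String parsed = myParseFunction(record);
          // emit (parsedName, first input column), like your collector.collect(...)
          output.collect(new String[] { parsed, record[0] });
        }
      });
    }

    // placeholder for your real parsing logic
    private static String myParseFunction(final String[] atoms) {
      return atoms[atoms.length - 1];
    }
  }

From Hive you would then ship the jar and use the MAP keyword much as in
your own example, something along the lines of
MAP s.id, s.canonical USING 'java -cp parse.jar ParseNameMapper' AS parsedName, id
- the only contract is tab-separated rows in on stdin and tab-separated
rows out on stdout.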
Hope this helps.

Carl

On Wed, Apr 28, 2010 at 1:11 AM, Tim Robertson <[email protected]> wrote:

> Ok, so it turns out I overlooked some things in my current MR job with the
> configure(), and a UDF isn't enough.
>
> I do want to use the Hive MAP keyword and call my own MR map().
>
> Currently my map() looks like the following, which works on a tab-delimited
> input file:
>
>   public void map(LongWritable key, Text value,
>       OutputCollector<Text, Text> collector, Reporter reporter)
>       throws IOException {
>     Pattern tab = Pattern.compile("\t");
>     String[] atoms = tab.split(value.toString());
>     String parsed = myParseFunction(atoms);
>     collector.collect(new Text(parsed), new Text(atoms[0]));
>   }
>
> What would I need to implement to make this usable with the MAP keyword in
> Hive please, so I can run this with input from table 1 to populate table 2?
>
> Sorry for this confusion, but it is not really clear to me - all help is
> very gratefully received.
>
> Cheers,
> Tim
>
>
> On Tue, Apr 27, 2010 at 8:06 PM, Avram Aelony <[email protected]> wrote:
>
>> Hi -
>>
>> If you would like to "simply take an input String (Text), run some Java
>> and return a new (Text) by calling a function", then you may wish to
>> consider using the "map" and "reduce" keywords directly from Hive with a
>> scripting language like Perl that contains your mapper and reducer code.
>>
>> For example:
>>
>>   create external table some_input_table ( field_1 string ) row format (etc...);
>>   create table your_next_table ( output_field_1 string, output_field_2 string,
>>     output_field_3 string );
>>
>>   from (
>>     from some_input_table i
>>     map i.field_1 using 'some_custom_mapper_code.pl' ) mapper_output
>>   insert overwrite table your_next_table
>>   reduce mapper_output.* using 'some_custom_reducer_code.pl' as
>>     output_field_1, output_field_2, output_field_3
>>   ;
>>
>>   -- test it
>>   select * from your_next_table;
>>
>> Hope that helps.
>>
>> cheers,
>> Avram
>>
>>
>> On Tuesday, April 27, 2010, at 10:55AM, "Tim Robertson"
>> <[email protected]> wrote:
>>
>>> Thanks Edward,
>>>
>>> I get where you are coming from now with that explanation.
>>>
>>> Cheers,
>>> Tim
>>>
>>>
>>> On Tue, Apr 27, 2010 at 7:53 PM, Edward Capriolo
>>> <[email protected]> wrote:
>>>
>>>> On Tue, Apr 27, 2010 at 1:48 PM, Tim Robertson
>>>> <[email protected]> wrote:
>>>>
>>>>> Hmmm... I am not trying to serialize or deserialize custom content,
>>>>> but simply take an input String (Text), run some Java and return a new
>>>>> (Text) by calling a function.
>>>>>
>>>>> Looking at "public class UDFYear extends UDF {", the annotation at the
>>>>> top suggests that extending UDF and adding the annotation might be
>>>>> enough.
>>>>>
>>>>> I'll try it anyways...
>>>>> Tim
>>>>>
>>>>> On Tue, Apr 27, 2010 at 7:37 PM, Adam O'Donnell <[email protected]> wrote:
>>>>>
>>>>>> It sounds like what you want is a custom SerDe. I have tried to write
>>>>>> one but ran into some difficulty.
>>>>>>
>>>>>> On Tue, Apr 27, 2010 at 10:13 AM, Tim Robertson
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks Edward,
>>>>>>>
>>>>>>> You are indeed correct - I am confused!
>>>>>>>
>>>>>>> So I checked out the source and poked around. If I were to extend UDF
>>>>>>> and implement "public Text evaluate(Text source) {", would I be
>>>>>>> heading along the correct lines to use what you say above?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Tim
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Apr 27, 2010 at 5:11 PM, Edward Capriolo
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> On Tue, Apr 27, 2010 at 10:22 AM, Tim Robertson
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I currently run a MapReduce job to rewrite a tab-delimited file,
>>>>>>>>> and then I use Hive for everything after that stage.
>>>>>>>>>
>>>>>>>>> Am I correct in thinking that I can create a jar with my own method
>>>>>>>>> which can then be called in SQL? Would the syntax be:
>>>>>>>>>
>>>>>>>>>   hive> ADD JAR /tmp/parse.jar;
>>>>>>>>>   hive> INSERT OVERWRITE TABLE target SELECT s.id, s.canonical,
>>>>>>>>>     parsedName FROM source s MAP s.canonical USING 'parse' AS parsedName;
>>>>>>>>>
>>>>>>>>> and parse be a MR job? If so, what are the input and output formats
>>>>>>>>> for the parse, please? Or is it a class implementing an interface,
>>>>>>>>> perhaps, and Hive takes care of the rest?
>>>>>>>>>
>>>>>>>>> Thanks for any pointers,
>>>>>>>>> Tim
>>>>>>>>>
>>>>>>>>
>>>>>>>> Tim,
>>>>>>>>
>>>>>>>> A UDF is a SQL function like toString() or max().
>>>>>>>> An InputFormat teaches Hive to read data from key/value files.
>>>>>>>> A SerDe tells Hive how to parse input data into columns.
>>>>>>>> Finally, the map(), reduce() and transform() keywords you described
>>>>>>>> are a way to pipe data to an external process and read the results
>>>>>>>> back in - almost like a non-native Hive UDF.
>>>>>>>>
>>>>>>>> So you have munged up four concepts together :) Do not feel bad,
>>>>>>>> however - I struggled through an InputFormat for the last month.
>>>>>>>>
>>>>>>>> It sounds most like you want a UDF that takes a string and returns
>>>>>>>> a canonical representation:
>>>>>>>>
>>>>>>>>   hive> ADD JAR /tmp/parse.jar;
>>>>>>>>   hive> create temporary function canonical as 'my.package.canonical';
>>>>>>>>   hive> select canonical(my_column) from source;
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Adam J. O'Donnell, Ph.D.
>>>>>> Immunet Corporation
>>>>>> Cell: +1 (267) 251-0070
>>>>>
>>>>
>>>> Tim,
>>>>
>>>> I think you are on the right track with the UDF approach.
>>>>
>>>> You could accomplish something similar with a SerDe, except that from
>>>> the client's perspective it would be more "transparent".
>>>>
>>>> A UDF is also a bit more reusable than a SerDe: you can only choose a
>>>> SerDe once, when the table is created, but a UDF is applied on the
>>>> result set.
>>>>
>>>> Edward
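Since the UDF route Edward describes above would also cover your case, here
is roughly the whole class you asked about. It is untested and from memory,
and the package and class names are only placeholders standing in for the
'my.package.canonical' in Edward's example:

  package com.example.hive.udf;  // placeholder package

  import org.apache.hadoop.hive.ql.exec.UDF;
  import org.apache.hadoop.io.Text;

  // Untested sketch. Row-level UDF: takes one string column and returns
  // its canonical form. The @Description annotation you saw on UDFYear is
  // optional metadata and is omitted here.
  public final class Canonical extends UDF {

    public Text evaluate(final Text source) {
      if (source == null) {
        return null; // let NULLs pass through untouched
      }
      // placeholder for your real canonicalisation/parsing logic
      return new Text(source.toString().trim());
    }
  }

Registered with Edward's two statements (add jar /tmp/parse.jar; create
temporary function canonical as 'com.example.hive.udf.Canonical';) it can
then be used inline, e.g. select canonical(s.canonical) from source s;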
