Thanks Carl - that looks ideal for this.
On Wed, Apr 28, 2010 at 10:59 AM, Carl Steinbach <[email protected]> wrote:
> Hi Tim,
>
> Larry Ogrodnek has a nice blog post describing how to use Hive's
> TRANSFORM/MAP/REDUCE syntax with Java code here:
> http://dev.bizo.com/2009/10/hive-map-reduce-in-java.html
>
> A version of the library he describes in the blog post has been added to
> Hive's contrib directory, along with some examples and illustrative test
> cases. Check out the following files:
>
> src/java/org/apache/hadoop/hive/contrib/mr/example/IdentityMapper.java
> src/java/org/apache/hadoop/hive/contrib/mr/example/WordCountReduce.java
> src/java/org/apache/hadoop/hive/contrib/mr/GenericMR.java
> src/java/org/apache/hadoop/hive/contrib/mr/Mapper.java
> src/java/org/apache/hadoop/hive/contrib/mr/Output.java
> src/java/org/apache/hadoop/hive/contrib/mr/Reducer.java
> src/test/queries/clientpositive/java_mr_example.q
>
> Hope this helps.
>
> Carl
>
> On Wed, Apr 28, 2010 at 1:11 AM, Tim Robertson <[email protected]> wrote:
>
>> Ok, so it turns out I overlooked some things in my current MR job (in
>> configure()), and a UDF isn't enough.
>>
>> I do want to use the Hive MAP keyword and call my own MR map().
>>
>> Currently my map() looks like the following, which works on a
>> tab-delimited input file:
>>
>>   public void map(LongWritable key, Text value,
>>       OutputCollector<Text, Text> collector, Reporter reporter)
>>       throws IOException {
>>     Pattern tab = Pattern.compile("\t");
>>     String[] atoms = tab.split(value.toString());
>>     String parsed = myParseFunction(atoms);
>>     collector.collect(new Text(parsed), new Text(atoms[0]));
>>   }
>>
>> What would I need to implement to make this usable with the MAP keyword
>> in Hive, please, so I can run it with input from table 1 to populate
>> table 2?
>>
>> Sorry for this confusion, but it is not really clear to me - all help is
>> very gratefully received.
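To use Hive's MAP/TRANSFORM syntax, the map() above would be recast as a standalone program: Hive pipes each input row to the process as tab-delimited text on stdin, and each tab-delimited line the process writes to stdout becomes an output row. A minimal sketch of that shape follows; `myParseFunction` here is a hypothetical stand-in, since Tim's actual parsing logic is not shown in the thread.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Minimal Hive MAP/TRANSFORM script: Hive sends each input row on stdin as
// tab-delimited text; each line printed to stdout becomes an output row.
public class ParseMapper {

    // Hypothetical stand-in for Tim's myParseFunction(): the real logic is
    // unknown, so this simply joins the fields after the first with ':'.
    static String myParseFunction(String[] atoms) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < atoms.length; i++) {
            if (sb.length() > 0) sb.append(':');
            sb.append(atoms[i]);
        }
        return sb.toString();
    }

    // Turns one input row into one output row: parsedValue TAB firstField,
    // mirroring the collector.collect(parsed, atoms[0]) call above.
    static String transformLine(String line) {
        String[] atoms = line.split("\t", -1);
        return myParseFunction(atoms) + "\t" + atoms[0];
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(transformLine(line));
        }
    }
}
```

How the program is shipped to the cluster and invoked from the MAP ... USING clause is environment-specific; the key point is only the stdin/stdout, tab-delimited contract.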
>>
>> Cheers,
>> Tim
>>
>> On Tue, Apr 27, 2010 at 8:06 PM, Avram Aelony <[email protected]> wrote:
>>
>>> Hi -
>>>
>>> If you would like to "simply take an input String (Text), run some Java,
>>> and return a new (Text) by calling a function", then you may wish to
>>> consider using the "map" and "reduce" keywords directly from Hive with a
>>> scripting language like Perl that contains your mapper and reducer code.
>>>
>>> For example:
>>>
>>> create external table some_input_table ( field_1 string ) row format (etc...);
>>> create table your_next_table ( output_field_1 string, output_field_2 string,
>>>   output_field_3 string );
>>>
>>> from (
>>>   from some_input_table i
>>>   map i.field_1 using 'some_custom_mapper_code.pl' ) mapper_output
>>> insert overwrite table your_next_table
>>> reduce mapper_output.* using 'some_custom_reducer_code.pl' as
>>>   output_field_1, output_field_2, output_field_3
>>> ;
>>>
>>> -- test it
>>> select * from your_next_table;
>>>
>>> Hope that helps.
>>>
>>> cheers,
>>> Avram
>>>
>>> On Tuesday, April 27, 2010, at 10:55AM, "Tim Robertson"
>>> <[email protected]> wrote:
>>>
>>> Thanks Edward,
>>>
>>> I get where you are coming from now with that explanation.
>>>
>>> Cheers,
>>> Tim
>>>
>>> On Tue, Apr 27, 2010 at 7:53 PM, Edward Capriolo
>>> <[email protected]> wrote:
>>>
>>>> On Tue, Apr 27, 2010 at 1:48 PM, Tim Robertson
>>>> <[email protected]> wrote:
>>>>
>>>>> Hmmm... I am not trying to serialize or deserialize custom content, but
>>>>> simply take an input String (Text), run some Java, and return a new
>>>>> (Text) by calling a function.
>>>>>
>>>>> Looking at "public class UDFYear extends UDF {", the annotation at the
>>>>> top suggests extending UDF and adding the annotation might be enough.
>>>>>
>>>>> I'll try it anyway...
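On the reduce side of Avram's example, Hive pipes the mapper output to the reduce script sorted by key, again as tab-delimited lines on stdin, so a group ends exactly when the key changes. A word-count-style sketch of that contract (the key/count output is illustrative; Avram's actual reducer emits three fields of its own):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Minimal Hive REDUCE script: stdin carries tab-delimited "key TAB value"
// lines sorted by key, so each run of identical keys is one group.
public class CountReducer {

    // Emits one "key TAB count" line per run of identical keys.
    static List<String> reduce(List<String> sortedLines) {
        List<String> out = new ArrayList<>();
        String currentKey = null;
        long count = 0;
        for (String line : sortedLines) {
            String key = line.split("\t", -1)[0];
            if (currentKey != null && !currentKey.equals(key)) {
                out.add(currentKey + "\t" + count);  // key changed: flush group
                count = 0;
            }
            currentKey = key;
            count++;
        }
        if (currentKey != null) {
            out.add(currentKey + "\t" + count);  // flush the final group
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        List<String> lines = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            lines.add(line);
        }
        for (String outLine : reduce(lines)) {
            System.out.println(outLine);
        }
    }
}
```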
>>>>> Tim
>>>>>
>>>>> On Tue, Apr 27, 2010 at 7:37 PM, Adam O'Donnell <[email protected]> wrote:
>>>>>
>>>>>> It sounds like what you want is a custom SerDe. I have tried to write
>>>>>> one but ran into some difficulty.
>>>>>>
>>>>>> On Tue, Apr 27, 2010 at 10:13 AM, Tim Robertson
>>>>>> <[email protected]> wrote:
>>>>>> > Thanks Edward,
>>>>>> >
>>>>>> > You are indeed correct - I am confused!
>>>>>> >
>>>>>> > So I checked out the source and poked around. If I were to extend UDF
>>>>>> > and implement "public Text evaluate(Text source) {", would I be
>>>>>> > heading along the correct lines to use what you say above?
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Tim
>>>>>> >
>>>>>> > On Tue, Apr 27, 2010 at 5:11 PM, Edward Capriolo
>>>>>> > <[email protected]> wrote:
>>>>>> >>
>>>>>> >> On Tue, Apr 27, 2010 at 10:22 AM, Tim Robertson
>>>>>> >> <[email protected]> wrote:
>>>>>> >>>
>>>>>> >>> Hi,
>>>>>> >>>
>>>>>> >>> I currently run a MapReduce job to rewrite a tab-delimited file, and
>>>>>> >>> then I use Hive for everything after that stage.
>>>>>> >>>
>>>>>> >>> Am I correct in thinking that I can create a jar with my own method
>>>>>> >>> which can then be called in SQL? Would the syntax be:
>>>>>> >>>
>>>>>> >>> hive> ADD JAR /tmp/parse.jar;
>>>>>> >>> hive> INSERT OVERWRITE TABLE target
>>>>>> >>>       SELECT s.id, s.canonical, parsedName
>>>>>> >>>       FROM source s MAP s.canonical USING 'parse' AS parsedName;
>>>>>> >>>
>>>>>> >>> and 'parse' be a MR job? If so, what are the input and output
>>>>>> >>> formats for 'parse', please? Or is it a class implementing an
>>>>>> >>> interface, perhaps, with Hive taking care of the rest?
>>>>>> >>>
>>>>>> >>> Thanks for any pointers,
>>>>>> >>> Tim
>>>>>> >>>
>>>>>> >> Tim,
>>>>>> >>
>>>>>> >> A UDF is an SQL function, like toString() or max().
>>>>>> >> An InputFormat teaches Hive to read data from key/value files.
>>>>>> >> A SerDe tells Hive how to parse input data into columns.
>>>>>> >> Finally, the map(), reduce(), and transform() keywords you described
>>>>>> >> are a way to pipe data to an external process and read the results
>>>>>> >> back in - almost like a non-native Hive UDF.
>>>>>> >>
>>>>>> >> So you have munged four concepts together :) Do not feel bad,
>>>>>> >> however; I struggled through an InputFormat for the last month.
>>>>>> >>
>>>>>> >> It sounds most like you want a UDF that takes a string and returns a
>>>>>> >> canonical representation:
>>>>>> >>
>>>>>> >> hive> ADD JAR /tmp/parse.jar;
>>>>>> >> hive> CREATE TEMPORARY FUNCTION canonical AS 'my.package.canonical';
>>>>>> >> hive> SELECT canonical(my_column) FROM source;
>>>>>> >>
>>>>>> >> Regards,
>>>>>>
>>>>>> --
>>>>>> Adam J. O'Donnell, Ph.D.
>>>>>> Immunet Corporation
>>>>>> Cell: +1 (267) 251-0070
>>>>
>>>> Tim,
>>>>
>>>> I think you are on the right track with the UDF approach.
>>>>
>>>> You could accomplish something similar with a SerDe, and from the
>>>> client perspective it would be more "transparent".
>>>>
>>>> A UDF is a bit more reusable than a SerDe: you can only choose a SerDe
>>>> once, when the table is created, but your UDF is applied to the result
>>>> set.
>>>>
>>>> Edward
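The UDF route Edward recommends is the one Tim guessed at earlier in the thread: a class extending org.apache.hadoop.hive.ql.exec.UDF with a public Text evaluate(Text) method, registered via CREATE TEMPORARY FUNCTION. Since the Hive dependencies can't be shown self-contained here, the sketch below keeps only the shape of the evaluate() logic, using plain Strings and a made-up canonicalisation rule (trim, collapse whitespace, lower-case); the real rule would be Tim's name-parsing code.

```java
// Shape of the evaluate() logic for a Hive UDF, shown with plain Strings.
// In the real class you would extend org.apache.hadoop.hive.ql.exec.UDF and
// declare "public Text evaluate(Text source)" wrapping this logic; the
// canonicalisation rule below is a hypothetical placeholder.
public class Canonical {

    public static String evaluate(String source) {
        if (source == null) {
            return null;  // Hive UDFs conventionally map NULL to NULL
        }
        // Placeholder canonical form: trimmed, single-spaced, lower-case.
        return source.trim().replaceAll("\\s+", " ").toLowerCase();
    }
}
```

With the real class packaged as my.package.canonical in /tmp/parse.jar, Edward's three hive> statements above are exactly how it would be wired up and called.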
