On Wed, Apr 28, 2010 at 10:16 AM, Edward Capriolo <[email protected]> wrote:
> On Wed, Apr 28, 2010 at 5:06 AM, Tim Robertson <[email protected]> wrote:
>
>> Thanks Carl - that looks ideal for this.
>>
>> On Wed, Apr 28, 2010 at 10:59 AM, Carl Steinbach <[email protected]> wrote:
>>
>>> Hi Tim,
>>>
>>> Larry Ogrodnek has a nice blog post describing how to use Hive's
>>> TRANSFORM/MAP/REDUCE syntax with Java code here:
>>> http://dev.bizo.com/2009/10/hive-map-reduce-in-java.html
>>>
>>> A version of the library he describes in the blog post has been added to
>>> Hive's contrib directory, along with some examples and illustrative test
>>> cases. Check out the following files:
>>>
>>> src/java/org/apache/hadoop/hive/contrib/mr/example/IdentityMapper.java
>>> src/java/org/apache/hadoop/hive/contrib/mr/example/WordCountReduce.java
>>> src/java/org/apache/hadoop/hive/contrib/mr/GenericMR.java
>>> src/java/org/apache/hadoop/hive/contrib/mr/Mapper.java
>>> src/java/org/apache/hadoop/hive/contrib/mr/Output.java
>>> src/java/org/apache/hadoop/hive/contrib/mr/Reducer.java
>>> src/test/queries/clientpositive/java_mr_example.q
>>>
>>> Hope this helps.
>>>
>>> Carl
>>>
>>> On Wed, Apr 28, 2010 at 1:11 AM, Tim Robertson <[email protected]> wrote:
>>>
>>>> Ok, so it turns out I overlooked some things in my current MR job with
>>>> configure(), and a UDF isn't enough.
>>>>
>>>> I do want to use the Hive MAP keyword and call my own MR map().
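[Editor's note: the contrib library Carl lists above wraps the streaming protocol in small Mapper/Output interfaces. The sketch below is a self-contained *mimic* of that style so it compiles on its own - the interface names and signatures are assumptions inferred from the file list, not copied from the actual contrib source.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A self-contained mimic of the GenericMR style from Hive's contrib
// directory. The real interfaces live in org.apache.hadoop.hive.contrib.mr;
// the shapes here are assumptions based on the file list above.
public class GenericMRSketch {

    // Collects output rows; assumed to write tab-separated lines in the real library.
    interface Output {
        void collect(String[] record) throws Exception;
    }

    // One record in, zero or more records out.
    interface Mapper {
        void map(String[] record, Output output) throws Exception;
    }

    // An identity mapper, mirroring the IdentityMapper example in contrib:
    // emits each record unchanged.
    static final Mapper IDENTITY = (record, output) -> output.collect(record);

    // Runs a mapper over tab-separated input lines and returns the
    // tab-separated output lines.
    static List<String> run(Mapper mapper, List<String> lines) {
        List<String> out = new ArrayList<>();
        Output output = record -> out.add(String.join("\t", record));
        try {
            for (String line : lines) {
                mapper.map(line.split("\t", -1), output);
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // The identity mapper passes both rows through unchanged.
        System.out.println(run(IDENTITY, Arrays.asList("1\ta", "2\tb")));
    }
}
```

The point of the pattern is that your map() logic only sees `String[]` records and never touches stdin/stdout directly; the harness owns the streaming protocol.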
>>>>
>>>> Currently my map() looks like the following, which works on a tab
>>>> delimited input file:
>>>>
>>>> public void map(LongWritable key, Text value,
>>>>     OutputCollector<Text, Text> collector, Reporter reporter)
>>>>     throws IOException {
>>>>   Pattern tab = Pattern.compile("\t");
>>>>   String[] atoms = tab.split(value.toString());
>>>>   String parsed = myParseFunction(atoms);
>>>>   collector.collect(new Text(parsed), new Text(atoms[0]));
>>>> }
>>>>
>>>> What would I need to implement to make this usable with a MAP keyword in
>>>> Hive please, so I can run this with input from table 1 to populate table 2?
>>>>
>>>> Sorry for the confusion, but it is not really clear to me - all help is
>>>> very gratefully received.
>>>>
>>>> Cheers,
>>>> Tim
>>>>
>>>> On Tue, Apr 27, 2010 at 8:06 PM, Avram Aelony <[email protected]> wrote:
>>>>
>>>>> Hi -
>>>>>
>>>>> If you would like to "simply take an input String (Text), run some Java,
>>>>> and return a new (Text) by calling a function", then you may wish to
>>>>> consider using the "map" and "reduce" keywords directly from Hive with
>>>>> a scripting language like Perl that contains your mapper and
>>>>> reducer code.
>>>>>
>>>>> For example:
>>>>>
>>>>> create external table some_input_table ( field_1 string ) row format
>>>>> (etc...);
>>>>> create table your_next_table ( output_field_1 string, output_field_2
>>>>> string, output_field_3 string );
>>>>>
>>>>> from (
>>>>>   from some_input_table i
>>>>>   map i.field_1 using 'some_custom_mapper_code.pl' ) mapper_output
>>>>> insert overwrite table your_next_table
>>>>> reduce mapper_output.* using 'some_custom_reducer_code.pl' as
>>>>>   output_field_1, output_field_2, output_field_3
>>>>> ;
>>>>>
>>>>> -- test it
>>>>> select * from your_next_table ;
>>>>>
>>>>> Hope that helps.
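[Editor's note: Avram's suggestion works just as well in Java as in Perl, because MAP/TRANSFORM only speaks a streaming contract: Hive pipes each input row to the script's stdin as a tab-separated line and reads tab-separated output rows back from stdout. Tim's map() reworked against that contract might look like the sketch below; myParseFunction here is a made-up stand-in for his real parsing logic.]

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Tim's map() as a Hive MAP/TRANSFORM script: reads tab-separated rows
// from stdin, writes tab-separated rows to stdout. No Hadoop types needed.
public class ParseScript {

    // Stand-in for the real parse logic (assumption, not Tim's code).
    static String myParseFunction(String[] atoms) {
        return atoms[0].trim().toLowerCase();
    }

    // One input row -> one output row: the parsed value, then the
    // original first column, mirroring collector.collect(...) above.
    static String transformLine(String line) {
        String[] atoms = line.split("\t", -1);
        return myParseFunction(atoms) + "\t" + atoms[0];
    }

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(transformLine(line));
        }
    }
}
```

It would then be invoked from Hive with something like `MAP s.canonical USING 'java ParseScript' AS parsedName, original` (class name and invocation hypothetical - in practice the jar must be shipped with ADD FILE and be on the java classpath of the command).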
>>>>>
>>>>> cheers,
>>>>> Avram
>>>>>
>>>>> On Tuesday, April 27, 2010, at 10:55 AM, "Tim Robertson"
>>>>> <[email protected]> wrote:
>>>>>
>>>>> Thanks Edward,
>>>>>
>>>>> I get where you are coming from now with that explanation.
>>>>>
>>>>> Cheers,
>>>>> Tim
>>>>>
>>>>> On Tue, Apr 27, 2010 at 7:53 PM, Edward Capriolo
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> On Tue, Apr 27, 2010 at 1:48 PM, Tim Robertson
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Hmmm... I am not trying to serialize or deserialize custom content,
>>>>>>> but simply take an input String (Text), run some Java, and return a new
>>>>>>> (Text) by calling a function.
>>>>>>>
>>>>>>> Looking at public class UDFYear extends UDF, the annotation at the
>>>>>>> top suggests extending UDF and adding the annotation might be enough.
>>>>>>>
>>>>>>> I'll try it anyway...
>>>>>>> Tim
>>>>>>>
>>>>>>> On Tue, Apr 27, 2010 at 7:37 PM, Adam O'Donnell <[email protected]> wrote:
>>>>>>>
>>>>>>>> It sounds like what you want is a custom SerDe. I have tried to write
>>>>>>>> one but ran into some difficulty.
>>>>>>>>
>>>>>>>> On Tue, Apr 27, 2010 at 10:13 AM, Tim Robertson
>>>>>>>> <[email protected]> wrote:
>>>>>>>> > Thanks Edward,
>>>>>>>> > You are indeed correct - I am confused!
>>>>>>>> > So I checked out the source and poked around. If I were to extend UDF and
>>>>>>>> > implement public Text evaluate(Text source),
>>>>>>>> > would I be heading along the correct lines to use what you say above?
>>>>>>>> > Thanks,
>>>>>>>> > Tim
>>>>>>>> >
>>>>>>>> > On Tue, Apr 27, 2010 at 5:11 PM, Edward Capriolo
>>>>>>>> > <[email protected]> wrote:
>>>>>>>> >>
>>>>>>>> >> On Tue, Apr 27, 2010 at 10:22 AM, Tim Robertson
>>>>>>>> >> <[email protected]> wrote:
>>>>>>>> >>>
>>>>>>>> >>> Hi,
>>>>>>>> >>> I currently run a MapReduce job to rewrite a tab delimited file, and then
>>>>>>>> >>> I use Hive for everything after that stage.
>>>>>>>> >>> Am I correct in thinking that I can create a jar with my own method which
>>>>>>>> >>> can then be called in SQL?
>>>>>>>> >>> Would the syntax be:
>>>>>>>> >>> hive> ADD JAR /tmp/parse.jar;
>>>>>>>> >>> hive> INSERT OVERWRITE TABLE target SELECT s.id,
>>>>>>>> >>> s.canonical, parsedName FROM source s MAP s.canonical using 'parse' as
>>>>>>>> >>> parsedName;
>>>>>>>> >>> and parse be a MR job? If so, what are the input and output formats
>>>>>>>> >>> for the parse, please? Or is it perhaps a class implementing an interface,
>>>>>>>> >>> with Hive taking care of the rest?
>>>>>>>> >>> Thanks for any pointers,
>>>>>>>> >>> Tim
>>>>>>>> >>>
>>>>>>>> >> Tim,
>>>>>>>> >>
>>>>>>>> >> A UDF is an SQL function like toString() or max().
>>>>>>>> >> An InputFormat teaches Hive to read data from key/value files.
>>>>>>>> >> A SerDe tells Hive how to parse input data into columns.
>>>>>>>> >> Finally, the map(), reduce(), and transform() keywords you described are
>>>>>>>> >> a way to pipe data to an external process and read the results back in -
>>>>>>>> >> almost like a non-native Hive UDF.
>>>>>>>> >>
>>>>>>>> >> So you have munged up 4 concepts together :) Do not feel bad, however;
>>>>>>>> >> I struggled through an InputFormat for the last month.
>>>>>>>> >>
>>>>>>>> >> It sounds most like you want a UDF that takes a string and returns a
>>>>>>>> >> canonical representation.
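[Editor's note: the UDF Edward describes might look like the sketch below. In real Hive code this class would extend org.apache.hadoop.hive.ql.exec.UDF and evaluate() would take and return org.apache.hadoop.io.Text, as UDFYear does; plain String is used here so the sketch compiles without the Hive jars, and the canonicalization rule itself (trim, collapse whitespace, lowercase) is a made-up example, not Tim's actual logic.]

```java
// Sketch of a "canonical" UDF: takes a string, returns a canonical form.
// In real Hive code: extend org.apache.hadoop.hive.ql.exec.UDF and use
// Text in place of String; this standalone version keeps only the logic.
public class Canonical /* extends UDF */ {

    public String evaluate(String source) {
        if (source == null) {
            return null;  // UDFs must tolerate NULL input rows
        }
        // Example rule only: trim, collapse runs of whitespace, lowercase.
        return source.trim().replaceAll("\\s+", " ").toLowerCase();
    }
}
```

Once packaged in a jar, it is registered and called exactly as in Edward's three-line `ADD JAR` / `create temporary function` / `select` snippet.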
>>>>>>>> >>
>>>>>>>> >> hive> ADD JAR /tmp/parse.jar;
>>>>>>>> >> hive> create temporary function canonical as 'my.package.canonical';
>>>>>>>> >> hive> select canonical(my_column) from source;
>>>>>>>> >>
>>>>>>>> >> Regards,
>>>>>>>>
>>>>>>>> --
>>>>>>>> Adam J. O'Donnell, Ph.D.
>>>>>>>> Immunet Corporation
>>>>>>>> Cell: +1 (267) 251-0070
>>>>>>
>>>>>> Tim,
>>>>>>
>>>>>> I think you are on the right track with the UDF approach.
>>>>>>
>>>>>> You could accomplish something similar with a SerDe, except that from the
>>>>>> client perspective it would be more "transparent".
>>>>>>
>>>>>> A UDF is a bit more reusable than a SerDe. You can only choose a SerDe
>>>>>> once, when the table is created, but your UDF is applied on the result set.
>>>>>>
>>>>>> Edward
>
> Based on your description, this is now looking more like a SerDe. I say this
> because map(), reduce(), and transform() are almost identical to a UDF, with the
> exception that they can return multiple columns more easily.
>
> Let's look at an example.
>
> RawFile:
>
> 1\t6,7,8
>
> By default Hive uses a TextInputFormat.
> It reads the key as "1" and the value as "6,7,8".
>
> The default OutputFormat is HiveIgnoreKeyTextOutputFormat, so Hive drops
> the "1", and the output is now "6,7,8".
>
> Now the SerDe takes over. Based on the DELIMITER you specified when you
> created the table, the SerDe attempts to split the row.
>
> If I chose the DELIMITER ',', Hive would split the row into
> ['6','7','8'].
> If I chose the wrong delimiter, say '^K' or something whacky, Hive really
> would not split the row and you would get 1 column back:
> ['6,7,8']
>
> So you can choose the 'wrong delimiter' as Carl suggests above; essentially
> this turns the row into a single string. Then you are free to operate on it
> as a single string.
> You can use a UDF that works with a single string, or use
> a map() syntax that can split the string and return multiple columns.
>
> Any approach is valid and will yield results, but treating a row as a
> single string is anti-pattern-ish, since you ideally want Hive to understand
> the schema of the row.

Just wanted to follow up a bit. One benefit of Carl's approach is that map(),
reduce(), and transform() are definitely easier to try. Making a SerDe takes a
while: you have to learn the framework, write some code, alter your current
table definition, etc. map(), reduce(), and transform() can be done without
writing Java code to extend Hive.
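[Editor's note: the delimiter behaviour Edward walks through - a matching delimiter yields three columns, a "wrong" one leaves the row as a single string - can be demonstrated with a plain split, independent of Hive:]

```java
import java.util.Arrays;

// Demonstrates Edward's SerDe delimiter example: once the key "1" is
// dropped, the row is the string "6,7,8"; the field delimiter alone
// decides how many columns come back.
public class DelimiterDemo {
    public static void main(String[] args) {
        String row = "6,7,8";

        // Matching delimiter: three columns, as Hive's SerDe would produce.
        String[] matched = row.split(",");
        System.out.println(Arrays.toString(matched));   // [6, 7, 8]

        // "Wrong" delimiter (^K, i.e. \u000B): nothing matches, so the
        // whole row stays as one column - the single-string trick above.
        String[] unmatched = row.split("\u000B");
        System.out.println(Arrays.toString(unmatched)); // [6,7,8]
    }
}
```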
