Thanks Edward. As I venture further into Hive I will want it to understand the schema of the row (especially for numeric types), so I will put the effort in now to understand SerDe.
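As a rough map of that SerDe effort: a read-only SerDe boils down to implementing org.apache.hadoop.hive.serde2.SerDe, where initialize() declares the column schema via an ObjectInspector and deserialize() turns each incoming Writable row into a list of column values. A minimal sketch, assuming the Hive 0.5-era serde2 API; the class name, package, and two-column schema are illustrative, not anyone's actual code:

  package com.example.hive;  // illustrative package

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;
  import java.util.Properties;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hive.serde2.SerDe;
  import org.apache.hadoop.hive.serde2.SerDeException;
  import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
  import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
  import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;

  public class TabDelimitedSerDe implements SerDe {

    private ObjectInspector inspector;
    private final List<Object> row = new ArrayList<Object>();

    public void initialize(Configuration conf, Properties tbl) throws SerDeException {
      // Illustrative two-column schema: id and canonical, both strings.
      List<String> names = Arrays.asList("id", "canonical");
      List<ObjectInspector> inspectors = new ArrayList<ObjectInspector>();
      inspectors.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
      inspectors.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
      inspector = ObjectInspectorFactory.getStandardStructObjectInspector(names, inspectors);
    }

    // Called once per row: split the raw line into column values.
    public Object deserialize(Writable blob) throws SerDeException {
      String[] atoms = blob.toString().split("\t", -1);
      row.clear();
      row.add(atoms.length > 0 ? atoms[0] : null);
      row.add(atoms.length > 1 ? atoms[1] : null);
      return row;
    }

    public ObjectInspector getObjectInspector() throws SerDeException {
      return inspector;
    }

    public Class<? extends Writable> getSerializedClass() {
      return Text.class;
    }

    // Read-only sketch: the write path is deliberately left unimplemented.
    public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
      throw new SerDeException("serialize() not implemented in this sketch");
    }
  }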
For context... We have a 200 million row MySQL table which we currently scan in batches and process (parsing, georeferencing etc.) into another 200 million row "fact" table which serves the web. I am moving from this to Hive processing, and having really amazing results so far. Sqoop -> Hive will be integrated soon. Following that, the source table will be ported to HBase, with Hive performing this processing on HBase input and then Sqoop pushing the processed results to MySQL / PostGIS DBs (and probably some Lucene). If I can get this SerDe working (as I find time) it is really going to allow us a clean processing workflow.

More questions to follow I am sure.

Tim

On Wed, Apr 28, 2010 at 4:32 PM, Edward Capriolo <[email protected]> wrote:

> On Wed, Apr 28, 2010 at 10:16 AM, Edward Capriolo <[email protected]> wrote:
>
>> On Wed, Apr 28, 2010 at 5:06 AM, Tim Robertson <[email protected]> wrote:
>>
>>> Thanks Carl - that looks ideal for this.
>>>
>>> On Wed, Apr 28, 2010 at 10:59 AM, Carl Steinbach <[email protected]> wrote:
>>>
>>>> Hi Tim,
>>>>
>>>> Larry Ogrodnek has a nice blog post describing how to use Hive's
>>>> TRANSFORM/MAP/REDUCE syntax with Java code here:
>>>> http://dev.bizo.com/2009/10/hive-map-reduce-in-java.html
>>>>
>>>> A version of the library he describes in the blog post has been added to
>>>> Hive's contrib directory, along with some examples and illustrative test
>>>> cases. Check out the following files:
>>>>
>>>> src/java/org/apache/hadoop/hive/contrib/mr/example/IdentityMapper.java
>>>> src/java/org/apache/hadoop/hive/contrib/mr/example/WordCountReduce.java
>>>> src/java/org/apache/hadoop/hive/contrib/mr/GenericMR.java
>>>> src/java/org/apache/hadoop/hive/contrib/mr/Mapper.java
>>>> src/java/org/apache/hadoop/hive/contrib/mr/Output.java
>>>> src/java/org/apache/hadoop/hive/contrib/mr/Reducer.java
>>>> src/test/queries/clientpositive/java_mr_example.q
>>>>
>>>> Hope this helps.
>>>>
>>>> Carl
>>>>
>>>> On Wed, Apr 28, 2010 at 1:11 AM, Tim Robertson <[email protected]> wrote:
>>>>
>>>>> Ok, so it turns out I overlooked some things in my current MR job with
>>>>> the configure(), and a UDF isn't enough.
>>>>>
>>>>> I do want to use the Hive MAP keyword and call my own MR map().
>>>>>
>>>>> Currently my map() looks like the following, which works on a
>>>>> tab-delimited input file:
>>>>>
>>>>> public void map(LongWritable key, Text value,
>>>>>     OutputCollector<Text, Text> collector, Reporter reporter)
>>>>>     throws IOException {
>>>>>   Pattern tab = Pattern.compile("\t");
>>>>>   String[] atoms = tab.split(value.toString());
>>>>>   String parsed = myParseFunction(atoms);
>>>>>   collector.collect(new Text(parsed), new Text(atoms[0]));
>>>>> }
>>>>>
>>>>> What would I need to implement to make this usable with the MAP keyword
>>>>> in Hive please, so I can run this with input from table 1 to populate
>>>>> table 2?
>>>>>
>>>>> Sorry for this confusion, but it is not really clear to me - all help
>>>>> is very gratefully received.
>>>>>
>>>>> Cheers
>>>>> Tim
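To hook a map() like Tim's into Hive's MAP ... USING syntax, the external program only needs to speak Hive's streaming convention: read tab-separated input rows on stdin and write tab-separated output rows on stdout. A minimal sketch of what the jar behind 'parse' might contain; ParseMapper and the body of myParseFunction are illustrative stand-ins, not the actual code from this thread:

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.InputStreamReader;

  // Hive pipes the selected columns to this process, one
  // tab-separated row per line, and reads tab-separated rows back.
  public class ParseMapper {

    public static void main(String[] args) throws IOException {
      BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
      String line;
      while ((line = in.readLine()) != null) {
        String[] atoms = line.split("\t", -1);
        String parsed = myParseFunction(atoms);  // existing parsing logic goes here
        // Each output column maps onto the AS (...) clause in the Hive query.
        System.out.println(parsed + "\t" + atoms[0]);
      }
    }

    // Placeholder for the real parsing/georeferencing code.
    private static String myParseFunction(String[] atoms) {
      return atoms.length > 1 ? atoms[1].trim() : "";
    }
  }

It would then be invoked with something like MAP s.canonical USING 'java -cp parse.jar ParseMapper' AS parsedName, id, once ADD FILE/ADD JAR has shipped the jar to the cluster. Carl's GenericMR pointer above wraps this same stdin/stdout protocol in a friendlier API. Avram's reply below makes the same point using Perl.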
>>>>> On Tue, Apr 27, 2010 at 8:06 PM, Avram Aelony <[email protected]> wrote:
>>>>>
>>>>>> Hi -
>>>>>>
>>>>>> If you would like to "simply take an input String (Text), run some Java
>>>>>> and return a new (Text) by calling a function" then you may wish to
>>>>>> consider using the "map" and "reduce" keywords directly from Hive and
>>>>>> using a scripting language like Perl that contains your mapper and
>>>>>> reducer code.
>>>>>>
>>>>>> for example:
>>>>>>
>>>>>> create external table some_input_table ( field_1 string ) row format
>>>>>> (etc...);
>>>>>> create table your_next_table ( output_field_1 string, output_field_2
>>>>>> string, output_field_3 string );
>>>>>>
>>>>>> from (
>>>>>>   from some_input_table i
>>>>>>   map i.field_1 using 'some_custom_mapper_code.pl' ) mapper_output
>>>>>> insert overwrite table your_next_table
>>>>>> reduce mapper_output.* using 'some_custom_reducer_code.pl' as
>>>>>>   output_field_1, output_field_2, output_field_3
>>>>>> ;
>>>>>>
>>>>>> -- test it
>>>>>> select * from your_next_table;
>>>>>>
>>>>>> Hope that helps.
>>>>>>
>>>>>> cheers,
>>>>>> Avram
>>>>>>
>>>>>> On Tuesday, April 27, 2010, at 10:55 AM, "Tim Robertson"
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Thanks Edward,
>>>>>>
>>>>>> I get where you are coming from now with that explanation.
>>>>>>
>>>>>> Cheers,
>>>>>> Tim
>>>>>>
>>>>>> On Tue, Apr 27, 2010 at 7:53 PM, Edward Capriolo
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> On Tue, Apr 27, 2010 at 1:48 PM, Tim Robertson
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hmmm... I am not trying to serialize or deserialize custom content,
>>>>>>>> but simply take an input String (Text), run some Java, and return a
>>>>>>>> new (Text) by calling a function.
>>>>>>>>
>>>>>>>> Looking at public class UDFYear extends UDF {, the annotation at
>>>>>>>> the top suggests extending UDF and adding the annotation might be
>>>>>>>> enough.
>>>>>>>>
>>>>>>>> I'll try it anyways...
>>>>>>>> Tim
>>>>>>>>
>>>>>>>> On Tue, Apr 27, 2010 at 7:37 PM, Adam O'Donnell
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> It sounds like what you want is a custom SerDe. I have tried to
>>>>>>>>> write one but ran into some difficulty.
>>>>>>>>>
>>>>>>>>> On Tue, Apr 27, 2010 at 10:13 AM, Tim Robertson
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>> > Thanks Edward,
>>>>>>>>> > You are indeed correct - I am confused!
>>>>>>>>> > So I checked out the source, and poked around. If I were to
>>>>>>>>> > extend UDF and implement public Text evaluate(Text source) {,
>>>>>>>>> > would I be heading along the correct lines to use what you say
>>>>>>>>> > above?
>>>>>>>>> > Thanks,
>>>>>>>>> > Tim
>>>>>>>>> >
>>>>>>>>> > On Tue, Apr 27, 2010 at 5:11 PM, Edward Capriolo
>>>>>>>>> > <[email protected]> wrote:
>>>>>>>>> >>
>>>>>>>>> >> On Tue, Apr 27, 2010 at 10:22 AM, Tim Robertson
>>>>>>>>> >> <[email protected]> wrote:
>>>>>>>>> >>>
>>>>>>>>> >>> Hi,
>>>>>>>>> >>> I currently run a MapReduce job to rewrite a tab-delimited
>>>>>>>>> >>> file, and then I use Hive for everything after that stage.
>>>>>>>>> >>> Am I correct in thinking that I can create a jar with my own
>>>>>>>>> >>> method which can then be called in SQL?
>>>>>>>>> >>> Would the syntax be:
>>>>>>>>> >>> hive> ADD JAR /tmp/parse.jar;
>>>>>>>>> >>> hive> INSERT OVERWRITE TABLE target SELECT s.id,
>>>>>>>>> >>> s.canonical, parsedName FROM source s MAP s.canonical using
>>>>>>>>> >>> 'parse' as parsedName;
>>>>>>>>> >>> and would parse be an MR job? If so, what are the input and
>>>>>>>>> >>> output formats for the parse, please? Or is it a class
>>>>>>>>> >>> implementing an interface perhaps, and Hive takes care of
>>>>>>>>> >>> the rest?
>>>>>>>>> >>> Thanks for any pointers,
>>>>>>>>> >>> Tim
>>>>>>>>> >>>
>>>>>>>>> >> Tim,
>>>>>>>>> >>
>>>>>>>>> >> A UDF is an SQL function, like toString() or max().
>>>>>>>>> >> An InputFormat teaches Hive to read data from key/value files.
>>>>>>>>> >> A SerDe tells Hive how to parse input data into columns.
>>>>>>>>> >> Finally, the map(), reduce(), transform() keywords you described
>>>>>>>>> >> are a way to pipe data to an external process and read the
>>>>>>>>> >> results back in. Almost like a non-native Hive UDF.
>>>>>>>>> >>
>>>>>>>>> >> So you have munged up 4 concepts together :) Do not feel bad
>>>>>>>>> >> however, I struggled through an InputFormat for the last month.
>>>>>>>>> >>
>>>>>>>>> >> It sounds most like you want a UDF that takes a string and
>>>>>>>>> >> returns a canonical representation.
>>>>>>>>> >>
>>>>>>>>> >> hive> ADD JAR /tmp/parse.jar;
>>>>>>>>> >> hive> create temporary function canonical as 'my.package.canonical';
>>>>>>>>> >> hive> select canonical(my_column) from source;
>>>>>>>>> >>
>>>>>>>>> >> Regards,
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Adam J. O'Donnell, Ph.D.
>>>>>>>>> Immunet Corporation
>>>>>>>>> Cell: +1 (267) 251-0070
>>>>>>>
>>>>>>> Tim,
>>>>>>>
>>>>>>> I think you are on the right track with the UDF approach.
>>>>>>>
>>>>>>> You could accomplish something similar with a SerDe, except that from
>>>>>>> the client perspective it would be more "transparent".
>>>>>>>
>>>>>>> A UDF is a bit more reusable than a SerDe. You can only choose a SerDe
>>>>>>> once, when the table is created, but your UDF is applied on the
>>>>>>> result set.
>>>>>>>
>>>>>>> Edward

>> Based on your description, this is now looking more like a SerDe. I say
>> this because map(), reduce(), transform() are almost identical to a UDF,
>> with the exception that they can return multiple columns more easily.
>>
>> Let's look at an example.
>>
>> RawFile
>>
>> 1\t6,7,8
>>
>> So by default Hive uses a TextInputFormat.
>> It reads the key as "1" and the value as "6,7,8".
>>
>> Now the default OutputFormat is HiveIgnoreKeyOutputFormat, thus Hive
>> drops the "1"
>>
>> and the output is now "6,7,8".
>>
>> Now the SerDe takes over. Based on the DELIMITER you specified when you
>> created the table, the SerDe attempts to split the row.
>>
>> If I choose a DELIMITER ',', Hive would split the row into
>> ['6','7','8']
>> If I choose the wrong delimiter, say '^K' or something whacky, Hive
>> would not split the row at all and you would get one column back:
>> ['6,7,8']
>>
>> So you can choose the 'wrong delimiter', as Carl suggests above;
>> essentially this turns the row into a single string. Then you are free
>> to operate on it as a single string. You can use a UDF that works with a
>> single string, or use a map() syntax that can split the string and
>> return multiple columns.
>>
>> Any approach is valid and will yield results, but treating a row as a
>> single string is anti-pattern-ish, since you ideally want Hive to
>> understand the schema of the row.

> Just wanted to follow up a bit. One benefit of Carl's approach is that
> map(), reduce(), transform() is definitely easier to try. Making a SerDe
> takes a while: you have to learn the framework, write some code, alter
> your current table definition, etc. map(), reduce(), transform() can be
> done without writing Java code to extend Hive.
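To round out the UDF route Edward outlines above, the Java side of his create temporary function example would look roughly like this. The package name and the canonicalisation logic are illustrative (Hive finds the evaluate() method by reflection, as the thread's public Text evaluate(Text source) question suggests):

  package com.example.hive;  // illustrative; must match the class name in CREATE TEMPORARY FUNCTION

  import org.apache.hadoop.hive.ql.exec.UDF;
  import org.apache.hadoop.io.Text;

  // Used as:
  //   ADD JAR /tmp/parse.jar;
  //   CREATE TEMPORARY FUNCTION canonical AS 'com.example.hive.Canonical';
  //   SELECT canonical(my_column) FROM source;
  public class Canonical extends UDF {

    private final Text result = new Text();

    public Text evaluate(Text source) {
      if (source == null) {
        return null;  // propagate SQL NULLs
      }
      // Placeholder canonicalisation: trim and collapse internal whitespace.
      result.set(source.toString().trim().replaceAll("\\s+", " "));
      return result;
    }
  }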
