Re: Implement in clause with or clause
There are no risks, but it will be slower, especially when the list after IN is very long. Zheng 2010/8/3 我很快乐 896923...@qq.com: Thank you for your reply. Because my company requires that we use version 0.4.1, I couldn't upgrade. Could you tell me which risks there are if I use the OR clause (example: where id=1 or id=2 or id=3) to implement the IN clause (example: id in (1,2,3))? Thanks, LiuLei -- Yours, Zheng http://www.linkedin.com/in/zshao
Re: Hive support for latin1
Just change FetchTask.java: public boolean fetch(ArrayList<String> res) res.add(((Text) mSerde.serialize(io.o, io.oi)).toString()); Instead of using Text.toString(), use your own method to convert from raw bytes to unicode String. Zheng On Sun, Aug 1, 2010 at 8:31 PM, bc Wong bcwal...@cloudera.com wrote: Hi all, I'm trying to figure out how to query Hive on latin1 encoded data. I created a file with 256 characters, with unicode value 0-255, encoded in latin1. I made a table out of it. But when I do a select *, Hive returns the upper ascii rows as '\xef\xbf\xbd', which is the replacement character '\ufffd' encoded in UTF-8. Does anyone know how to work with non-UTF8 data? Cheers, -- bc Wong Cloudera Software Engineer -- Yours, Zheng http://www.linkedin.com/in/zshao
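For illustration, a minimal sketch of the kind of helper Zheng describes, assuming the underlying bytes are latin1 (ISO-8859-1); the class and method names are hypothetical, not part of Hive:

import java.io.UnsupportedEncodingException;
import org.apache.hadoop.io.Text;

// Hypothetical helper: decode the valid portion of a Text's backing byte array as latin1
// instead of relying on Text.toString(), which assumes UTF-8.
public final class Latin1Decoder {
  public static String decode(Text t) {
    try {
      return new String(t.getBytes(), 0, t.getLength(), "ISO-8859-1");
    } catch (UnsupportedEncodingException e) {
      // ISO-8859-1 is a standard charset, so this should never happen
      throw new RuntimeException(e);
    }
  }
}

The res.add(...) line above would then call Latin1Decoder.decode(...) on the serialized Text instead of calling toString().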
Re: built-in UTF8 checker
No, but it's very simple to write one.

import java.nio.charset.MalformedInputException;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class MyUTF8StringChecker extends UDF {
  public boolean evaluate(Text t) {
    try {
      // Text.validateUTF8 throws MalformedInputException on invalid byte sequences
      Text.validateUTF8(t.getBytes(), 0, t.getLength());
      return true;
    } catch (MalformedInputException e) {
      return false;
    }
  }
}

On Tue, Jul 20, 2010 at 12:03 PM, Ping Zhu p...@sharethis.com wrote: Hi, Are there any built-in functions in Hive to check whether a string is UTF8-encoded? I did some research about this issue but did not find useful resources. Thanks for your suggestions and help. Ping -- Yours, Zheng http://www.linkedin.com/in/zshao
Re: Hive and protocol buffers -- are there UDFs for dealing with them?
If you just need to scan the data once, it makes sense to use hive SerDe to read the data directly (which saves you one I/O round trip). If you need to read the data multiple times, then it's better to save the 3 columns into separate files. Zheng On Mon, Jul 12, 2010 at 5:08 PM, Leo Alekseyev dnqu...@gmail.com wrote: Hi all, I was wondering if anyone is using Hive with protocol buffers. The Hadoop wiki links to http://www.slideshare.net/ragho/hive-user-meeting-august-2009-facebook for SerDe examples; there it says that there is no built-in support for protobufs. Since this presentation is about a year old, I was wondering whether there appeared any UDFs, native or third-party, to deal with them. I am also curious about the relative efficiency of performing SerDe using UDFs in hive vs. running a separate hadoop job to first deserialize the data from protocol buffers into an ascii flat file with only the interesting fields (going from ~15 fields to ~3), and then doing the rest of the computation in hive. I'd appreciate any comments! Thanks, --Leo -- Yours, Zheng http://www.linkedin.com/in/zshao
Re: UDF which takes entire row as arg
Yes. Even a normal (non-generic) UDF might work if all columns can be converted to the same type: a UDF can accept a variable number of arguments of the same type. It would be a great addition to let UDF/UDAF handle * (as well as `regex` column specifications). The change is all at compile time, and is relatively simple. Zheng On Wed, Jul 7, 2010 at 8:31 PM, Edward Capriolo edlinuxg...@gmail.com wrote: You could write a generic UDF since they accept arbitrary signatures, but you would have to pass each column explicitly (no * support) -- Yours, Zheng http://www.linkedin.com/in/zshao
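As a rough illustration of the variable-length-argument point (a hypothetical sketch, not an existing Hive function), a plain UDF can declare a varargs evaluate method and be called with any number of string columns:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical example: concatenate however many string columns are passed in.
public final class ConcatRowUDF extends UDF {
  public Text evaluate(Text... cols) {
    StringBuilder sb = new StringBuilder();
    for (Text col : cols) {
      if (col == null) {
        continue;
      }
      if (sb.length() > 0) {
        sb.append('|');
      }
      sb.append(col.toString());
    }
    return new Text(sb.toString());
  }
}

You would still have to list the columns explicitly in the query (e.g. concat_row(col1, col2, col3)) until * support is added.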
Re: Create Table with Line Terminated other than '\n'
That patch basically throws an error if the user specifies a non-newline line terminator. Without the patch it would succeed but silently produce unexpected results. Sent from my iPhone On Jun 11, 2010, at 11:23 PM, Amr Awadallah a...@cloudera.com wrote: Zheng, I thought that was fixed per your work here, no? https://issues.apache.org/jira/browse/HIVE-302 Then what did you fix? -- amr On 6/10/2010 10:22 PM, Zheng Shao wrote: Also, changing LINES TERMINATED BY probably won't work, because hadoop's TextInputFormat does not allow line terminators other than \n. Zheng On Thu, Jun 10, 2010 at 6:31 PM, Carl Steinbach c...@cloudera.com wrote: Hi Shuja, The grammar for Hive's CREATE TABLE statement is discussed here: http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL#Create_Table You need to use the LINES TERMINATED BY clause in the CREATE TABLE statement in order to specify a line terminator other than \n. Carl On Thu, Jun 10, 2010 at 5:39 PM, Shuja Rehman shujamug...@gmail.com wrote: Hi, I want to create a table in Hive whose row format uses a line terminator other than '\n', so I can read an XML file as a single cell in one row and column of the table. Kindly let me know how to do this? Thanks -- Regards Shuja-ur-Rehman Baig _ MS CS - School of Science and Engineering Lahore University of Management Sciences (LUMS) Sector U, DHA, Lahore, 54792, Pakistan Cell: +92 3214207445
Re: Create Table with Line Terminated other than '\n'
Also, changing LINES TERMINATED BY probably won't work, because hadoop's TextInputFormat does not allow line terminators other than \n. Zheng On Thu, Jun 10, 2010 at 6:31 PM, Carl Steinbach c...@cloudera.com wrote: Hi Shuja, The grammar for Hive's CREATE TABLE statement is discussed here: http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL#Create_Table You need to use the LINES TERMINATED BY clause in the CREATE TABLE statement in order to specify a line terminator other than \n. Carl On Thu, Jun 10, 2010 at 5:39 PM, Shuja Rehman shujamug...@gmail.com wrote: Hi, I want to create a table in Hive whose row format uses a line terminator other than '\n', so I can read an XML file as a single cell in one row and column of the table. Kindly let me know how to do this? Thanks -- Regards Shuja-ur-Rehman Baig _ MS CS - School of Science and Engineering Lahore University of Management Sciences (LUMS) Sector U, DHA, Lahore, 54792, Pakistan Cell: +92 3214207445 -- Yours, Zheng http://www.linkedin.com/in/zshao
Re: BUG at optimizer or map side aggregate?
Nice finding! That's likely to be the cause. Can you open a JIRA issue on issues.apache.org/jira/browse/HIVE Zheng On Wed, May 12, 2010 at 1:05 AM, Ted Xu ted.xu...@gmail.com wrote: Zheng, Thank you for your reply. Well, it seems hard for me to repreduce this bug in a simpler query. However, if I change the alias of subquery 't1' (either the inner one or the join result), the bug disappears. I'm wondering if there is possible that table aliases of different level will conflict when their alias names are the same. 2010/5/12 Zheng Shao zsh...@gmail.com Yes that does seem to be a bug. Can you try if you can simply the query while reproducing the bug? That will make it a lot easier to debug and fix. Zheng On Tue, May 11, 2010 at 7:44 PM, Ted Xu ted.xu...@gmail.com wrote: Hi all, I think I found a bug, I'm not sure whether the problem is at optimizer (PPD) or at map side aggregate. See query listed below: - create table if not exists dm_fact_buyer_prd_info_d ( category_id string ,gmv_trade_num int ,user_id int ) PARTITIONED BY (ds int); set hive.optimize.ppd=true; set hive.map.aggr=true; explain select 20100426, category_id1,category_id2,assoc_idx from ( select category_id1 , category_id2 , count(distinct user_id) as assoc_idx from ( select t1.category_id as category_id1 , t2.category_id as category_id2 , t1.user_id from ( select category_id, user_id from dm_fact_buyer_prd_info_d where ds = 20100426 and ds 20100419 and category_id 0 and gmv_trade_num0 group by category_id, user_id ) t1 join ( select category_id, user_id from dm_fact_buyer_prd_info_d where ds = 20100426 and ds 20100419 and category_id 0 and gmv_trade_num 0 group by category_id, user_id ) t2 on t1.user_id=t2.user_id ) t1 group by category_id1, category_id2 ) t_o where category_id1 category_id2 and assoc_idx 2; The query above will fail when execute, throwing exception: can not cast UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text). I explained the query and the execute plan looks really wired (see the highlighted predicate): ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF dm_fact_buyer_prd_info_d)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL category_id)) (TOK_SELEXPR (TOK_TABLE_OR_COL user_id))) (TOK_WHERE (and (and (and (= (TOK_TABLE_OR_COL ds) 20100426) ( (TOK_TABLE_OR_COL ds) 20100419)) ( (TOK_TABLE_OR_COL category_id) 0)) ( (TOK_TABLE_OR_COL gmv_trade_num) 0))) (TOK_GROUPBY (TOK_TABLE_OR_COL category_id) (TOK_TABLE_OR_COL user_id t1) (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF dm_fact_buyer_prd_info_d)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL category_id)) (TOK_SELEXPR (TOK_TABLE_OR_COL user_id))) (TOK_WHERE (and (and (and (= (TOK_TABLE_OR_COL ds) 20100426) ( (TOK_TABLE_OR_COL ds) 20100419)) ( (TOK_TABLE_OR_COL category_id) 0)) ( (TOK_TABLE_OR_COL gmv_trade_num) 0))) (TOK_GROUPBY (TOK_TABLE_OR_COL category_id) (TOK_TABLE_OR_COL user_id t2) (= (. (TOK_TABLE_OR_COL t1) user_id) (. (TOK_TABLE_OR_COL t2) user_id (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) category_id) category_id1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t2) category_id) category_id2) (TOK_SELEXPR (. 
(TOK_TABLE_OR_COL t1) user_id) t1)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL category_id1)) (TOK_SELEXPR (TOK_TABLE_OR_COL category_id2)) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL user_id)) assoc_idx)) (TOK_GROUPBY (TOK_TABLE_OR_COL category_id1) (TOK_TABLE_OR_COL category_id2 t_o)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR 20100426) (TOK_SELEXPR (TOK_TABLE_OR_COL category_id1)) (TOK_SELEXPR (TOK_TABLE_OR_COL category_id2)) (TOK_SELEXPR (TOK_TABLE_OR_COL assoc_idx))) (TOK_WHERE (and ( (TOK_TABLE_OR_COL category_id1) (TOK_TABLE_OR_COL category_id2)) ( (TOK_TABLE_OR_COL assoc_idx) 2) STAGE DEPENDENCIES: Stage-1 is a root stage Stage-2 depends on stages: Stage-1, Stage-4 Stage-3 depends on stages: Stage-2 Stage-4 is a root stage Stage-2 depends on stages: Stage-1, Stage-4 Stage-3 depends on stages: Stage-2 Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias - Map Operator Tree: t_o:t1:t1:dm_fact_buyer_prd_info_d TableScan alias: dm_fact_buyer_prd_info_d Filter Operator predicate: expr: (UDFToDouble(ds
Re: why hive ignore my setting about reduce task number?
Do you need to get all records in order? In most of our use cases users are only interested in the top 100 or so. If you do limit 100 together with order by, it will be much faster. Sent from my iPhone On May 12, 2010, at 1:54 PM, luocan19826...@sohu.com wrote: Thanks, Ted. If I have very big data to sort, only 1 reduce task will be a performance issue. Does Hive have some way to optimize it? I have observed that the reduce task is very slow in my job.
Re: error: Both Left and Right Aliases Encountered in Join obj
Put t1.obj <> t2.obj in the WHERE clause. On Fri, Apr 30, 2010 at 12:14 AM, Harshit Kumar ku...@bike.snu.ac.kr wrote: Hi, I have a query like this: from spo t1 join spo t2 on (t1.sub=t2.sub and t1.obj<>t2.obj) insert overwrite table spojoin select t1.sub, t1.pre, t2.obj, t2.sub, t2.pre, t2.obj; Executing the above query gives the following error. FAILED: Error in semantic analysis: line 1:46 Both Left and Right Aliases Encountered in Join obj However, if I replace the <> operator with the == operator, it executes. Please let me know what I am doing wrong. Thanks Kumar -- Yours, Zheng http://www.linkedin.com/in/zshao
Re: HADOOP-4012 and bzip2 input splitting
Can you take a look at the job.xml link in your map-reduce job created by Hive and let me know the mapred.input.format.class? Is it HiveInputFormat or CombineHiveInputFormat? It should work if you set it to org.apache.hadoop.hive.ql.io.HiveInputFormat. Also, can you verify if https://issues.apache.org/jira/browse/MAPREDUCE-830 is in your hadoop distribution or not? Zheng On Wed, Apr 21, 2010 at 11:31 PM, 김영우 warwit...@gmail.com wrote: Zheng, Thanks for your quick reply, but there is only 1 mapper for my job with a 300 MB .bz2 file. I added the following in my core-site.xml: <property> <name>io.compression.codecs</name> <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec</value> </property> My table definition: create table test_bzip2 ( col1 string, . . col20 string ) row format delimited fields terminated by '\t' stored as textfile; A simple grouping/count query and the following is the query's plan: STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: test_bzip2 TableScan alias: test_bzip2 Select Operator expressions: expr: siteid type: string outputColumnNames: siteid Reduce Output Operator key expressions: expr: siteid type: string sort order: + Map-reduce partition columns: expr: siteid type: string tag: -1 value expressions: expr: 1 type: int Reduce Operator Tree: Group By Operator aggregations: expr: count(VALUE._col0) bucketGroup: false keys: expr: KEY._col0 type: string mode: complete outputColumnNames: _col0, _col1 Select Operator expressions: expr: _col0 type: string expr: _col1 type: bigint outputColumnNames: _col0, _col1 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1 I just verified that bz2 splitting works in my cluster using a simple Pig script; the Pig script creates 3 mappers for the M/R job. What should I check further? Job config info? - Youngwoo 2010/4/22 Zheng Shao zsh...@gmail.com It should be automatically supported. You don't need to do anything except adding the bzip2 codec in io.compression.codecs in hadoop configuration files (core-site.xml) Zheng On Wed, Apr 21, 2010 at 10:15 PM, 김영우 warwit...@gmail.com wrote: Hi, HADOOP-4012, https://issues.apache.org/jira/browse/HADOOP-4012 has been committed, and CDH3 supports bzip2 splitting. I'm wondering if Hive supports input splitting for bzip2 compressed text files (*.bz2). If not, should I implement a custom SerDe for bzip2 compressed files? Thanks, Youngwoo -- Yours, Zheng http://www.linkedin.com/in/zshao -- Yours, Zheng http://www.linkedin.com/in/zshao
Re: HADOOP-4012 and bzip2 input splitting
It should be automatically supported. You don't need to do anything except add the bzip2 codec to io.compression.codecs in the hadoop configuration files (core-site.xml). Zheng On Wed, Apr 21, 2010 at 10:15 PM, 김영우 warwit...@gmail.com wrote: Hi, HADOOP-4012, https://issues.apache.org/jira/browse/HADOOP-4012 has been committed, and CDH3 supports bzip2 splitting. I'm wondering if Hive supports input splitting for bzip2 compressed text files (*.bz2). If not, should I implement a custom SerDe for bzip2 compressed files? Thanks, Youngwoo -- Yours, Zheng http://www.linkedin.com/in/zshao
Re: Cluster By Algorithm?
It's as simple as taking a hash code of the key and taking it modulo the number of reducers. To get started, try any of the .q files in the clientpositive directory. On the code side, HiveKey.java has the implementation. Sent from my iPhone On Apr 11, 2010, at 2:48 PM, Aaron McCurry amccu...@gmail.com wrote: I have a search solution that is downstream of some Netezza data marts that I'm replacing with a Hive solution. We already partition the data for the search solution 32 ways, and I would like to take advantage of the data clustering in Hive (buckets) so that I don't have to do any post-processing. Is there documentation that describes how the data is hashed or how it's organized across the buckets? Or could someone point me to a class that implements it? Thanks! Aaron
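The idea, in a rough sketch (an illustration of the scheme, not the actual HiveKey.java code):

// Illustration of the bucketing scheme described above: hash the key, mask off the
// sign bit, and take the remainder modulo the number of buckets/reducers.
public final class BucketAssigner {
  public static int bucketFor(Object key, int numBuckets) {
    int hash = (key == null) ? 0 : key.hashCode();
    return (hash & Integer.MAX_VALUE) % numBuckets;
  }
}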
Re: Using newest hive release (0.5.0) - Problem with count(1)
Yes we use sun jdk 1.6 and it works. On Tue, Apr 6, 2010 at 12:32 PM, Aaron McCurry amccu...@gmail.com wrote: I am using 1.6, however it is the IBM jvm (not my choice). If the feature is known to work on the Sun JVM then I will deal with the problem another way. Thanks. Aaron On Tue, Apr 6, 2010 at 3:12 PM, Zheng Shao zsh...@gmail.com wrote: Are you using Java 1.5? Hive now requires Java 1.6 On Tue, Apr 6, 2010 at 7:23 AM, Aaron McCurry amccu...@gmail.com wrote: In the past I have used hive 0.3.0 successfully and now with a new project coming up I decided to give hive 0.5.0 a run and everything is working as expected, except for when I try to get a simple count of the table. The simple table is defined as: create table log_table (col1 string, col2 string, col3 string, col4 string, col5 string, col6 string) row format delimited fields terminated by '\t' stored as textfile; And the query I'm running is: select count(1) from log_table; From the hive command line I get the following errors: ... In order to set c constant number of reducers: set mapred.reduce.tasks=number Exception during encoding:java.lang.Exception: failed to write expression: GenericUDAFEvaluator$Mode=Class.new(); Continue... Exception during encoding:java.lang.Exception: failed to write expression: GenericUDAFEvaluator$Mode=Class.new(); Continue... Exception during encoding:java.lang.Exception: failed to write expression: GenericUDAFEvaluator$Mode=Class.new(); Continue... Exception during encoding:java.lang.Exception: failed to write expression: GenericUDAFEvaluator$Mode=Class.new(); Continue... Starting Job = job_201004010912_0015, Tracking URL = . And when looking at the failed hadoop jobs I see the following exception: Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableIntObjectInspector incompatible with org.apache.hadoop.hive.serde2.objectinspector.primitive.LongObjectInspector at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount$GenericUDAFCountEvaluator.merge(GenericUDAFCount.java:93) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:113) ... Is this a known issue? Am I missing something? Any guidance would be appreciated. Thanks! Aaron -- Yours, Zheng http://www.linkedin.com/in/zshao -- Yours, Zheng http://www.linkedin.com/in/zshao
Re: Truncation error when creating table with column containing struct with many fields
That change should be fine. Zheng On Tue, Apr 6, 2010 at 5:16 PM, Dilip Joseph dilip.antony.jos...@gmail.com wrote: Hello, I got the following error when creating a table with a column that has an ARRAY of STRUCTS with many fields. It appears that there is a 128 character limit on the column definition. FAILED: Error in metadata: javax.jdo.JDODataStoreException: Add request failed : INSERT INTO COLUMNS (SD_ID,COMMENT,COLUMN_NAME,TYPE_NAME,INTEGER_IDX) VALUES (?,?,?,?,?) NestedThrowables: java.sql.BatchUpdateException: A truncation error was encountered trying to shrink VARCHAR 'array<struct<id:int,fld1:bigint,fld2:int,fld3' to length 128. FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask I was able to get the table created after changing 128 to 256 in /metastore/src/model/package.jdo. Does anyone know if there are any adverse side-effects of doing so? Dilip -- Yours, Zheng http://www.linkedin.com/in/zshao
Re: create table exception
See http://wiki.apache.org/hadoop/Hive/AdminManual/MetastoreAdmin for details. Zheng On Mon, Apr 5, 2010 at 12:01 AM, Sagar Naik sn...@attributor.com wrote: Hi, As a trial, I am trying to set up hive in local DFS/MR mode. I have set <property> <name>hive.metastore.uris</name> <value>file:///data/hive/metastore/metadb</value> <description>The location of filestore metadata base dir</description> </property> in hive-site.xml, but I'm still getting the following error. Please help me get hive up and running. CREATE TABLE pokes (foo INT, bar STRING); 10/04/04 23:58:08 [main] INFO parse.ParseDriver: Parsing command: CREATE TABLE pokes (foo INT, bar STRING) 10/04/04 23:58:08 [main] INFO parse.ParseDriver: Parse Completed 10/04/04 23:58:08 [main] INFO parse.SemanticAnalyzer: Starting Semantic Analysis 10/04/04 23:58:08 [main] INFO parse.SemanticAnalyzer: Creating table pokes position=13 10/04/04 23:58:08 [main] INFO ql.Driver: Semantic Analysis Completed 10/04/04 23:58:08 [main] INFO ql.Driver: Starting command: CREATE TABLE pokes (foo INT, bar STRING) 10/04/04 23:58:08 [main] INFO exec.DDLTask: Default to LazySimpleSerDe for table pokes 10/04/04 23:58:08 [main] INFO hive.log: DDL: struct pokes { i32 foo, string bar} FAILED: Error in metadata: java.lang.IllegalArgumentException: URI: does not have a scheme 10/04/04 23:58:08 [main] ERROR exec.DDLTask: FAILED: Error in metadata: java.lang.IllegalArgumentException: URI: does not have a scheme org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: URI: does not have a scheme at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:281) at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:1281) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:119) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:99) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:64) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:582) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:462) at org.apache.hadoop.hive.ql.Driver.runCommand(Driver.java:324) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:312) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:123) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:181) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:287) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:155) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) Caused by: java.lang.IllegalArgumentException: URI: does not have a scheme at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:92) at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:828) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:838) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:275) ... 20 more FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask -- Yours, Zheng
Re: UDAF on AWS Hive
Hive 0.4 has limited support on complex types in UDAF. If you are looking for an ad-hoc solution, try putting the data into a single Text. It will be great if you can ask AWS guys upgrading Hive to 0.5. 0.5 has over 100 bug fixes and is much more stable. Zheng On Fri, Apr 2, 2010 at 1:11 PM, Matthew Bryan gou...@gmail.com wrote: I'm writing a basic group_concat UDAF for the Amazon version of Hiveand it's working fine for unordered groupings. But I can't seem to get an ordered version working (filling an array based on an IntWritable passed alongside). When I move from using Text return type on terminatePartial() to either Text[] or a State class I start getting errors: FAILED: Error in semantic analysis: org.apache.hadoop.hive.ql.metadata.HiveException: Cannot recognize return type class [Lorg.apache.hadoop.io.Text; from public org.apache.hadoop.io.Text[] com.company.hadoop.hive.udaf.UDAFGroupConcatN$GroupConcatNStringEvaluator.terminatePartial() or FAILED: Error in semantic analysis: org.apache.hadoop.hive.ql.metadata.HiveException: Cannot recognize return type class com.company.hadoop.hive.udaf.UDAFGroupConcatN$UDAFGroupConc atNState from public com.company.hadoop.hive.udaf.UDAFGroupConcatN$UDAFGroupConcatNState com.company.hadoop.hive.udaf.UDAFGroupConcatN$GroupConcatNStringEvaluator.terminatePartial () What limits are there on the return type of terminatePartial()shouldn't it just have to match the argument of merge and nothing more? Keep in mind this is the Amazon version of Hive (0.4 I think) I put both versions of the UDAF below, ordered and unordered. Thanks for your time. Matt # Working Unordered /*QUERY: select user, event, group_concat(details) from datatable group by user,event;*/ package com.company.hadoop.hive.udaf; import org.apache.hadoop.hive.ql.exec.UDAF; import org.apache.hadoop.hive.ql.exec.UDAFEvaluator; import org.apache.hadoop.io.Text; public class UDAFGroupConcat extends UDAF{ public static class GroupConcatStringEvaluator implements UDAFEvaluator { private Text mOutput; private boolean mEmpty; public GroupConcatStringEvaluator() { super(); init(); } public void init() { mOutput = null; mEmpty = true; } public boolean iterate(Text o) { if (o!=null) { if(mEmpty) { mOutput = new Text(o); mEmpty = false; } else { mOutput.set(mOutput.toString()+ +o.toString()); } } return true; } public Text terminatePartial() {return mEmpty ? null : mOutput;} public boolean merge(Text o) {return iterate(o);} public Text terminate() {return mEmpty ? null : mOutput;} } } Not Working Ordered # /*QUERY: select user, event, group_concatN(details, detail_id) from datatable group by user,event;*/ package com.company.hadoop.hive.udaf; import org.apache.hadoop.hive.ql.exec.UDAF; import org.apache.hadoop.hive.ql.exec.UDAFEvaluator; import org.apache.hadoop.io.Text; import org.apache.hadoop.io.IntWritable; public class UDAFGroupConcatN extends UDAF{ public static class GroupConcatNStringEvaluator implements UDAFEvaluator { private Text[] mArray; private boolean mEmpty; public GroupConcatNStringEvaluator() { super(); init(); } public void init() { mArray = new Text[5]; mEmpty = true; } public boolean iterate(Text o, IntWritable N) { if (o!=nullN!=null) { mArray[N.get()].set(o.toString()); mEmpty=false; } return true; } public Text[] terminatePartial() {return mEmpty ? null : mArray;} public boolean merge(Text[] o) { if (o!=null) { for(int i=0; i=5; i++){ if(mArray[i].getLength()==0){ mArray[i].set(o[i].toString()); } } } return true; } public Text[] terminate() {return mEmpty ? 
null : mArray;} } } -- Yours, Zheng
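A minimal sketch of the single-Text workaround suggested above, assuming the partial state is a fixed number of ordered string slots packed with a delimiter that never appears in the data (the class and method names are hypothetical):

import org.apache.hadoop.io.Text;

// Hypothetical helper for the 0.4 workaround: keep the partial aggregation state in
// one delimited Text so terminatePartial() and merge() only ever exchange Text.
public final class PackedSlots {
  private static final char SEP = '\u0002';

  // Pack the ordered slots into a single delimited string.
  public static Text pack(String[] slots) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < slots.length; i++) {
      if (i > 0) sb.append(SEP);
      sb.append(slots[i] == null ? "" : slots[i]);
    }
    return new Text(sb.toString());
  }

  // Unpack a partial back into n ordered slots; the -1 keeps trailing empty fields.
  public static String[] unpack(Text partial, int n) {
    String[] parts = partial.toString().split(String.valueOf(SEP), -1);
    String[] slots = new String[n];
    for (int i = 0; i < n && i < parts.length; i++) {
      slots[i] = parts[i].isEmpty() ? null : parts[i];
    }
    return slots;
  }
}

iterate() would fill the slot at the given index, terminatePartial() would return pack(slots), and merge(Text) would unpack and combine, so every signature stays on Text.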
Re: Sequence Files with data inside key
The easiest way is to write a SequenceFileInputFormat that returns a RecordReader that has key in the value and value in the key. Zheng On Fri, Apr 2, 2010 at 2:16 PM, Edward Capriolo edlinuxg...@gmail.com wrote: I have some sequence files in which all our data is in the key. http://osdir.com/ml/hive-user-hadoop-apache/2009-10/msg00027.html Has anyone tackled the above issue? -- Yours, Zheng
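A rough sketch of that wrapper idea, using the old org.apache.hadoop.mapred API that these Hive versions run on (the class name is hypothetical); a custom SequenceFileInputFormat subclass would return this reader from getRecordReader():

import java.io.IOException;
import org.apache.hadoop.mapred.RecordReader;

// Wraps a RecordReader<K, V> and presents it as RecordReader<V, K>, so the data stored
// in the SequenceFile key is handed to Hive as the value.
public class KeyValueSwappingRecordReader<K, V> implements RecordReader<V, K> {
  private final RecordReader<K, V> inner;

  public KeyValueSwappingRecordReader(RecordReader<K, V> inner) {
    this.inner = inner;
  }

  public boolean next(V key, K value) throws IOException {
    // delegate with the arguments swapped
    return inner.next(value, key);
  }

  public V createKey() { return inner.createValue(); }
  public K createValue() { return inner.createKey(); }
  public long getPos() throws IOException { return inner.getPos(); }
  public float getProgress() throws IOException { return inner.getProgress(); }
  public void close() throws IOException { inner.close(); }
}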
Re: date_sub() function returns wrong date because of daylight saving time difference
I will take a look. Thanks Bryan! On Thu, Apr 1, 2010 at 12:38 AM, Bryan Talbot btal...@aeriagames.com wrote: I guess most places are running their clusters with UTC time zones or these functions are not widely used. Any chance of getting a committer to look at the patch with unit tests? -Bryan On Mar 26, 2010, at Mar 26, 11:37 AM, Bryan Talbot wrote: Has anyone else been running into this issue? https://issues.apache.org/jira/browse/HIVE-1253 If not, what are we doing wrong to get hit by it? -Bryan -- Yours, Zheng
Re: unix_timestamp function
Setting TZ in your .bash_profile won't work because the map/reduce tasks runs on the hadoop clusters. If you start your hadoop tasktracker with that TZ setting, it will probably work. Zheng On Thu, Apr 1, 2010 at 3:32 PM, tom kersnick hiveu...@gmail.com wrote: So its working, but Im having a time zone issue. My servers are located in EST, but i need this data in PST. So when it converts this: hive select from_unixtime(1270145333,'-MM-dd HH:mm:ss') from ut2; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201003031204_0102, Tracking URL = http://master:50030/jobdetails.jsp?jobid=job_201003031204_0102 Kill Command = /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=master:54311 -kill job_201003031204_0102 2010-04-01 18:28:23,041 Stage-1 map = 0%, reduce = 0% 2010-04-01 18:28:37,315 Stage-1 map = 67%, reduce = 0% 2010-04-01 18:28:43,386 Stage-1 map = 100%, reduce = 0% 2010-04-01 18:28:46,412 Stage-1 map = 100%, reduce = 100% Ended Job = job_201003031204_0102 OK 2010-04-01 14:08:53 Time taken: 30.191 seconds I need it to be : 2010-04-01 11:08:53 I tried setting the variable in my .bash_profile for TZ=/ /Americas/ = no go. Nothing in the hive ddl link you is leading me in the right direction. Is there something you guys can recommend? I can write a script outside of hive, but it would be great if I can have users handle this within their queries. Thanks in advance! /tom On Thu, Apr 1, 2010 at 2:17 PM, tom kersnick hiveu...@gmail.com wrote: ok thanks I should have caught that. /tom On Thu, Apr 1, 2010 at 2:13 PM, Carl Steinbach c...@cloudera.com wrote: Hi Tom, Unix Time is defined as the number of *seconds* since January 1, 1970. It looks like the data you have in cola is in milliseconds. You need to divide this value by 1000 before calling from_unixtime() on the result. Thanks. Carl On Thu, Apr 1, 2010 at 2:02 PM, tom kersnick hiveu...@gmail.com wrote: Thanks, but there is something fishy going on. Im using hive 0.5.0 with hadoop 0.20.1 I tried the column as both a bigint and a string. According the hive ddl: string from_unixtime(int unixtime) Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of 1970-01-01 00:00:00 It looks like the input is int, that would be too small for my 1270145333155 timestamp. Any ideas? 
Example below: /tom hive describe ut; OK colabigint colbstring Time taken: 0.101 seconds hive select * from ut; OK 1270145333155tuesday Time taken: 0.065 seconds hive select from_unixtime(cola,'-MM-dd HH:mm:ss'),colb from ut; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201003031204_0083, Tracking URL = http://master:50030/jobdetails.jsp?jobid=job_201003031204_0083 Kill Command = /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=master:54311 -kill job_201003031204_0083 2010-04-01 16:57:32,407 Stage-1 map = 0%, reduce = 0% 2010-04-01 16:57:45,577 Stage-1 map = 100%, reduce = 0% 2010-04-01 16:57:48,605 Stage-1 map = 100%, reduce = 100% Ended Job = job_201003031204_0083 OK 42219-04-22 00:05:55tuesday Time taken: 18.066 seconds hive describe ut; OK colastring colbstring Time taken: 0.077 seconds hive select * from ut; OK 1270145333155tuesday Time taken: 0.065 seconds hive select from_unixtime(cola,'-MM-dd HH:mm:ss'),colb from ut; FAILED: Error in semantic analysis: line 1:7 Function Argument Type Mismatch from_unixtime: Looking for UDF from_unixtime with parameters [class org.apache.hadoop.io.Text, class org.apache.hadoop.io.Text] On Thu, Apr 1, 2010 at 1:37 PM, Carl Steinbach c...@cloudera.comwrote: Hi Tom, I think you want to use the from_unixtime UDF: hive describe function extended from_unixtime; describe function extended from_unixtime; OK from_unixtime(unix_time, format) - returns unix_time in the specified format Example: SELECT from_unixtime(0, '-MM-dd HH:mm:ss') FROM src LIMIT 1; '1970-01-01 00:00:00' Time taken: 0.647 seconds hive Thanks. Carl On Thu, Apr 1, 2010 at 1:11 PM, tom kersnick hiveu...@gmail.comwrote: hive describe ut; OK timebigint daystring Time taken: 0.128 seconds hive select * from ut; OK 1270145333155tuesday Time taken: 0.085 seconds When I run this simple query, I'm getting a NULL for the time column with data type bigint. hive select unix_timestamp(time),day from ut; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201003031204_0080, Tracking URL =
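One way to sidestep both issues in this thread (milliseconds vs. seconds, and the server time zone) is a small custom UDF that does the formatting itself; a hypothetical sketch, not a built-in function:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Hypothetical UDF: formats epoch *milliseconds* in a fixed time zone, so the result
// does not depend on the TZ setting of the tasktracker nodes.
public final class MillisToPacificUDF extends UDF {
  public Text evaluate(LongWritable epochMillis) {
    if (epochMillis == null) {
      return null;
    }
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    fmt.setTimeZone(TimeZone.getTimeZone("America/Los_Angeles"));
    return new Text(fmt.format(new Date(epochMillis.get())));
  }
}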
Re: How do I make Hive use a custom scheduler and not the default scheduler?
Hive also loads hadoop conf in HADOOP_HOME/conf. You can set it there. On 3/23/10, Ryan LeCompte lecom...@gmail.com wrote: Right now when we submit queries, it uses the hadoop scheduler. I have a custom fair share scheduler configured as well, but I see that jobs generated from our Hive queries never get picked up by that scheduler. Is there something in hive-site.xml that I can configure to make all queries use a particular scheduler? Thanks, Ryan -- Sent from my mobile device Yours, Zheng
Re: Performance Programming Comparison of JAQL, Hive, Pig and Java
Glad to know that Hive has a good performance compared with other languages. It will be great if you can publish the queries/codes in the benchmark, as well as environment setup, so that other people can rerun your benchmark easily. Zheng On Tue, Mar 23, 2010 at 7:11 AM, Rob Stewart robstewar...@googlemail.com wrote: Hi folks, As promised, today I have made available my findings and experiment results from my research project, examining the high level languages: Pig, Hive and JAQL. The project extends from existing studies, by evaluating the scale up, scale out, and runtime for 3 benchmarking applications. It also examines the ease of programming, and the computational power of each language. I've created two documents: - Publication - A slide-by-slide presentation. 16 slides - *Suitable for most readers* - dissertation results chapter (18 pages of text) You can find these documents at: http://www.macs.hw.ac.uk/~rs46/publications.html Excuse the .HTML link - It is useful for me to record the number of hits the publication receives. I welcome any feedback, either on this mailing list, or to my University email address for direct correspondence. Any questions regarding the benchmarks should be sent to my University email address. Thanks for taking an interest, Rob Stewart -- Yours, Zheng
Re: support for arrays, maps, structs while writing output of custom reduce script to table
From 0.5 (probably), we can add type information to the column names after AS. Note that the first level separator should be TAB, and the second separator should be ^B (and then ^C, etc) FROM (select * from srcTable DISTRIBUTE BY id SORT BY id) s INSERT OVERWRITE TABLE SS REDUCE * USING 'myreduce.py' AS (a INT, b INT, vals ARRAY<STRUCT<x:INT, y:STRING>>) ; On Mon, Mar 22, 2010 at 1:50 PM, Dilip Joseph dilip.antony.jos...@gmail.com wrote: Hello, Does Hive currently support arrays, maps, structs while using custom reduce/map scripts? 'myreduce.py' in the example below produces an array of structs delimited by \2s and \3s. CREATE TABLE SS ( a INT, b INT, vals ARRAY<STRUCT<x:INT, y:STRING>> ); FROM (select * from srcTable DISTRIBUTE BY id SORT BY id) s INSERT OVERWRITE TABLE SS REDUCE * USING 'myreduce.py' AS (a,b, vals) ; However, the query is failing with the following error message, even before the script is executed: FAILED: Error in semantic analysis: line 2:27 Cannot insert into target table because column number/types are different SS: Cannot convert column 2 from string to array<struct<x:int,y:string>>. I saw a discussion about this in http://www.mail-archive.com/hive-user@hadoop.apache.org/msg00160.html, dated over a year ago. Just wondering if there have been any updates. Thanks, Dilip -- Yours, Zheng
Re: support for arrays, maps, structs while writing output of custom reduce script to table
Great! This is a bug. Hive field names should be case-insensitive. Can you open a JIRA for that? Zheng On Mon, Mar 22, 2010 at 2:43 PM, Dilip Joseph dilip.antony.jos...@gmail.com wrote: Thanks Zheng, That worked. It appears that the type information is converted to lower case before comparison. The following statements where userId is used as a field name failed. hive> CREATE TABLE SS ( a INT, b INT, vals ARRAY<STRUCT<userId:INT, y:STRING>> ); OK Time taken: 0.309 seconds hive> FROM (select * from srcTable DISTRIBUTE BY id SORT BY id) s INSERT OVERWRITE TABLE SS REDUCE * USING 'myreduce.py' AS (a INT, b INT, vals ARRAY<STRUCT<userId:INT, y:STRING>> ) ; FAILED: Error in semantic analysis: line 2:27 Cannot insert into target table because column number/types are different SS: Cannot convert column 2 from array<struct<userId:int,y:string>> to array<struct<userid:int,y:string>>. The same queries worked fine after changing userId to userid. Dilip On Mon, Mar 22, 2010 at 2:20 PM, Zheng Shao zsh...@gmail.com wrote: From 0.5 (probably), we can add type information to the column names after AS. Note that the first level separator should be TAB, and the second separator should be ^B (and then ^C, etc) FROM (select * from srcTable DISTRIBUTE BY id SORT BY id) s INSERT OVERWRITE TABLE SS REDUCE * USING 'myreduce.py' AS (a INT, b INT, vals ARRAY<STRUCT<x:INT, y:STRING>>) ; On Mon, Mar 22, 2010 at 1:50 PM, Dilip Joseph dilip.antony.jos...@gmail.com wrote: Hello, Does Hive currently support arrays, maps, structs while using custom reduce/map scripts? 'myreduce.py' in the example below produces an array of structs delimited by \2s and \3s. CREATE TABLE SS ( a INT, b INT, vals ARRAY<STRUCT<x:INT, y:STRING>> ); FROM (select * from srcTable DISTRIBUTE BY id SORT BY id) s INSERT OVERWRITE TABLE SS REDUCE * USING 'myreduce.py' AS (a,b, vals) ; However, the query is failing with the following error message, even before the script is executed: FAILED: Error in semantic analysis: line 2:27 Cannot insert into target table because column number/types are different SS: Cannot convert column 2 from string to array<struct<x:int,y:string>>. I saw a discussion about this in http://www.mail-archive.com/hive-user@hadoop.apache.org/msg00160.html, dated over a year ago. Just wondering if there have been any updates. Thanks, Dilip -- Yours, Zheng -- _ Dilip Antony Joseph http://www.marydilip.info -- Yours, Zheng
Re: SerDe examples that use arrays and structs?
BinarySortableSerDe, LazySimpleSerDe, and LazyBinarySerDe all support arrays/structs. There is a UDF called size(var) that returns the size of an array. Zheng On Sun, Mar 21, 2010 at 9:19 PM, Adam O'Donnell a...@immunet.com wrote: First of all, thank you to all of the facebook guys for hosting the hive user group last week. Second of all, does anyone have some SerDe code that uses arrays and structs on deserialization? Also, is there a way inside of Hive to discover the number of elements in an array? Thanks and take care Adam -- Yours, Zheng
Re: delimiters for nested structures
Multiple levels of delimiters work as follows by default: the first level (the field delimiter) is \001 (^A, ascii code 1). Each level of struct and array takes one additional field delimiter (\002, etc). Each level of map takes 2 additional levels of field delimiter. So it will be: s1.name ^B s1.age ^A a1[0].x ^C a1[0].y ^B a1[1].x ^C a1[1].y ^A b1.key1 ^C b1.value1[0] ^D b1.value1[1] ^B b1.key2 ^C b1.value2[0] ^D b1.value2[1] Zheng On Fri, Mar 19, 2010 at 6:07 PM, Dilip Joseph dilip.antony.jos...@gmail.com wrote: Hello, What are the delimiters for data to be loaded into a table with nested arrays, structs, maps etc? For example: CREATE TABLE nested ( s1 STRUCT<name:STRING, age: INT>, a1 ARRAY<STRUCT<x:INT, y:INT>>, b1 MAP<STRING, ARRAY<INT>> ) Should I write a custom SerDe for this? Thank you, Dilip -- Yours, Zheng
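To make the scheme concrete, here is a small sketch that builds one text row for the nested table above using those default delimiters (the values are made up):

// Builds one line of text data for the `nested` table: ^A between top-level columns,
// ^B/^C/^D for each deeper level, exactly as described above.
public final class NestedRowExample {
  public static String exampleRow() {
    final char A = '\u0001', B = '\u0002', C = '\u0003', D = '\u0004';
    StringBuilder sb = new StringBuilder();
    // s1: STRUCT<name, age> -> fields separated by ^B
    sb.append("alice").append(B).append(30).append(A);
    // a1: ARRAY<STRUCT<x, y>> -> elements separated by ^B, struct fields by ^C
    sb.append(1).append(C).append(2).append(B).append(3).append(C).append(4).append(A);
    // b1: MAP<STRING, ARRAY<INT>> -> entries by ^B, key/value by ^C, array items by ^D
    sb.append("k1").append(C).append(10).append(D).append(11).append(B)
      .append("k2").append(C).append(20).append(D).append(21);
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(exampleRow());
  }
}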
Re: DynamicSerDe/TBinaryProtocol
What is the format of your data? TBinaryProtocol does not work with TextFile format, as you can imagine. On 3/10/10, Anty anty@gmail.com wrote: Hi: ALL I encounter a problem, any suggestion will be appreciated! MY hive version is 0.30.0 I create a table in CLI. CREATE TABLE table2 (boo int,bar string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe' WITH SERDEPROPERTIES ( 'serialization.format'=org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol') STORED AS TEXTFILE; Then a load some data to table2. INSERT OVERWRITE TABLE table2 SELECT foo,bar from pokes. Everything is OK. Also , i can issue queries against table2. But, when i change the protocol to TBinaryProtocol, CREATE TABLE table1 (boo int,bar string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe' WITH SERDEPROPERTIES ( 'serialization.format'='org.apache.thrift.protocol.TBinaryProtocol') STORED AS TEXTFILE; then load some data to table1 ,there is some error ,the loading process can't be completed. java.lang.RuntimeException: org.apache.hadoop.hive.serde2.SerDeException: org.apache.thrift.transport.TTransportException: Cannot read. Remote side has closed. Tried to read 1 bytes, but only got 0 bytes. at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:170) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.serde2.SerDeException: org.apache.thrift.transport.TTransportException: Cannot read. Remote side has closed. Tried to read 1 bytes, but only got 0 bytes. at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:328) at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165) ... 4 more Caused by: org.apache.hadoop.hive.serde2.SerDeException: org.apache.thrift.transport.TTransportException: Cannot read. Remote side has closed. Tried to read 1 bytes, but only got 0 bytes. at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe.deserialize(DynamicSerDe.java:135) at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:319) ... 5 more Caused by: org.apache.thrift.transport.TTransportException: Cannot read. Remote side has closed. Tried to read 1 bytes, but only got 0 bytes. at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:314) at org.apache.thrift.protocol.TBinaryProtocol.readByte(TBinaryProtocol.java:247) at org.apache.thrift.protocol.TBinaryProtocol.readFieldBegin(TBinaryProtocol.java:216) at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDeFieldList.deserialize(DynamicSerDeFieldList.java:163) at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDeStructBase.deserialize(DynamicSerDeStructBase.java:59) at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe.deserialize(DynamicSerDe.java:131) ... 6 more If there is something wrong with TBinaryProtocol? -- Best Regards Anty Rao -- Sent from my mobile device Yours, Zheng
Re: Hive UDF Unknown exception:
Try Double[]. Primitive arrays (like double[], int[]) are not supported yet, because that needs special handling for each of the primitive type. Zheng On Wed, Mar 10, 2010 at 4:55 PM, tom kersnick hiveu...@gmail.com wrote: Gents, Any ideas why this happens? Im using hive 0.50 with hadoop 20.2. This is a super simple UDF. Im just taking the length of the values and then dividing by pi. It keeps popping up with this error: FAILED: Unknown exception: [D cannot be cast to [Ljava.lang.Object; Here is my approach: package com.xyz.udf; import org.apache.hadoop.hive.ql.exec.UDF; import java.util.Collections; public final class test extends UDF { public double evaluate(double[] values) { final Integer len = values.length; final Integer pi = len / 3.14159265; return values[pi]; } } hive list jars; hive add jar /tmp/hive_aux/x-y-z-udf-1.0-SNAPSHOT.jar; Added /tmp/hive_aux/x-y-z-udf-1.0-SNAPSHOT.jar to class path hive create temporary function my_test as 'com.xyz.udf.test'; OK Time taken: 0.41 seconds hive show tables; OK userpool test Time taken: 3.167 seconds hive describe userpool; OK word string amount int Time taken: 0.098 seconds hive select my_test(amount) from userpool; FAILED: Unknown exception: [D cannot be cast to [Ljava.lang.Object; hive describe test; OK word string amount string Time taken: 0.134 seconds hive select my_test(amount) from test; FAILED: Unknown exception: [D cannot be cast to [Ljava.lang.Object; Thanks in advance! /tom -- Yours, Zheng
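For reference, a sketch of what the corrected signature would look like with Double[] (the rest of the logic mirrors the original example):

import org.apache.hadoop.hive.ql.exec.UDF;

// Sketch of the fix suggested above: declare the argument as Double[] (an object array),
// which Hive can map an array column to, instead of the primitive double[].
public final class TestArrayUDF extends UDF {
  public Double evaluate(Double[] values) {
    if (values == null || values.length == 0) {
      return null;
    }
    int idx = (int) (values.length / 3.14159265);  // same index arithmetic as the original
    return values[idx];
  }
}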
Re: problem with IS NOT NULL operator in hive
WHERE product_name IS NOT NULL AND product_name <> '' On Tue, Mar 9, 2010 at 12:45 AM, prakash sejwani prakashsejw...@gmail.com wrote: Yes, right. Can you give me a tip on how to exclude blank values? On Tue, Mar 9, 2010 at 2:13 PM, Zheng Shao zsh...@gmail.com wrote: So I guess you didn't exclude the blank ones? On Tue, Mar 9, 2010 at 12:41 AM, prakash sejwani prakashsejw...@gmail.com wrote: Yes, regexp_extract returns NULL or blank On Tue, Mar 9, 2010 at 2:05 PM, Zheng Shao zsh...@gmail.com wrote: What do you mean by product_name is present? If it is not present, does the regexp_extract return NULL? Zheng On Tue, Mar 9, 2010 at 12:13 AM, prakash sejwani prakashsejw...@gmail.com wrote: Hi all, I have a query below FROM ( SELECT h.* FROM ( -- Pull from the access_log SELECT ip, -- Reformat the time from the access log time, dt, --method, resource, protocol, status, length, referer, agent, -- Extract the product_id for the hit from the URL cast( regexp_extract(resource,'\q=([^\]+)', 1) AS STRING) AS product_name FROM a_log ) h )hit -- Insert the hit data into a separate search table INSERT OVERWRITE TABLE search SELECT ip, time, dt, product_name WHERE product_name IS NOT NULL; It is supposed to populate the search table only if product_name is present, but I get all of it. Any help would be appreciated. thanks prakash sejwani econify infotech mumbai -- Yours, Zheng
Re: All Map jobs fail with NPE in LazyStruct.uncheckedGetField
Do you want to try hive release 0.5.0 or hive trunk? We should have provided better error messages here: https://issues.apache.org/jira/browse/HIVE-1216 Zheng On Thu, Mar 4, 2010 at 12:34 PM, Tom Nichols tmnich...@gmail.com wrote: I am trying out Hive, using Cloudera's EC2 distribution (Hadoop 0.18.3, Hive 0.4.1, I believe) I'm trying to run the following query which causes every map task to fail with an NPE before making any progress: java.lang.NullPointerException at org.apache.hadoop.hive.serde2.lazy.LazyStruct.uncheckedGetField(LazyStruct.java:205) at org.apache.hadoop.hive.serde2.lazy.LazyStruct.getField(LazyStruct.java:182) at org.apache.hadoop.hive.serde2.objectinspector.LazySimpleStructObjectInspector.getStructFieldData(LazySimpleStructObjectInspector.java:141) at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.evaluate(ExprNodeColumnEvaluator.java:53) at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:74) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:332) at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:49) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:332) at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:175) at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:71) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47) The query: -- Get the node's max price and corresponding year/day/hour/month select isone.node_id, isone.day, isone.hour, isone.lmp from (select max(lmp) as mlmp, node_id from isone_lmp where isone_lmp.node_id = 400 group by node_id) maxlmp join isone_lmp isone on ( isone.node_id = maxlmp.node_id and isone.lmp=maxlmp.mlmp ); The table: CREATE TABLE isone_lmp ( node_id int, day string, hour int, minute int, energy float, congestion float, loss float, lmp float ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; The data looks like the following: 396,20090120,00,00,62.77,0,.78,63.55 397,20090120,00,00,62.77,0,.65,63.42 398,20090120,00,00,62.77,0,.65,63.42 399,20090120,00,00,62.77,0,.65,63.42 400,20090120,00,00,62.77,0,.65,63.42 401,20090120,00,00,62.77,0,-1.02,61.75 405,20090120,00,00,62.77,0,.21,62.98 It's about 15GB of data total; I can do a simple select count(1) from isone_lmp; which executes as expected. Any thoughts? I've been able to execute the same query on a smaller subset of data (2M rows as opposed to 500M) on a non-distributed setup locally. Thanks. -Tom -- Yours, Zheng
Re: complex query using FROM and INSERT in hive
there is an extra , before FROM cast(regexp_extract(resource, '/companies/(\\d+)', 1) AS INT) AS company_id, -- Run our User Defined Function (see src/com/econify/geoip/IpToCountry.java). Takes the IP of the hit and looks up its country -- ip_to_country(ip) AS ip_country FROM access_log On Tue, Mar 2, 2010 at 7:37 AM, prakash sejwani prakashsejw...@gmail.com wrote: when i run this query from hive console FROM ( SELECT h.*, p.title AS product_sku, p.description AS product_name, c.name AS company_name, c2.id AS product_company_id, c2.name AS product_company_name FROM ( -- Pull from the access_log SELECT ip, ident, user, -- Reformat the time from the access log from_unixtime(cast(unix_ timestamp(time, dd/MMM/:hh:mm:ss Z) AS INT)) AS time, method, resource, protocol, status, length, referer, agent, -- Extract the product_id for the hit from the URL cast(regexp_extract(resource, '/products/(\\d+)', 1) AS INT) AS product_id, -- Extract the company_id for the hit from the URL cast(regexp_extract(resource, '/companies/(\\d+)', 1) AS INT) AS company_id, -- Run our User Defined Function (see src/com/econify/geoip/IpToCountry.java). Takes the IP of the hit and looks up its country -- ip_to_country(ip) AS ip_country FROM access_log ) h -- Join each hit with its product or company (if it has one) LEFT OUTER JOIN products p ON (h.product_id = p.id) LEFT OUTER JOIN companies c ON (h.company_id = c.id) -- If the hit was for a product, we probably didn't get the company_id in the hit subquery, -- so join products.company_id with another instance of the companies table LEFT OUTER JOIN companies c2 ON (p.company_id = c2.id) -- Filter out all hits that weren't for a company or a product WHERE h.product_id IS NOT NULL OR h.company_id IS NOT NULL ) hit -- Insert the hit data into a seperate product_hits table INSERT OVERWRITE TABLE product_hits SELECT ip, ident, user, time, method, resource, protocol, status, length, referer, agent, product_id, product_company_id AS company_id, ip_country, product_name, product_company_name AS company_name WHERE product_name IS NOT NULL -- Insert the hit data insto a seperate company_hits table INSERT OVERWRITE TABLE company_hits SELECT ip, ident, user, time, method, resource, protocol, status, length, referer, agent, company_id, ip_country, company_name WHERE company_name IS NOT NULL; I get the following error FAILED: Parse Error: line 19:6 cannot recognize input 'FROM' in select expression thanks, prakash -- Yours, Zheng
Re: Hive User Group Meeting 3/18/2010 7pm at Facebook
We also created a Meetup group in case you prefer to register on meetup.com http://www.meetup.com/Hive-User-Group-Meeting/calendar/12741356/ We are hosting a Hive User Group Meeting, open to all current and potential hadoop/hive users. Agenda: * Hive Tutorial (Carl Steinbach, cloudera): 20 min * Hive User Case Study (Eva Tse, netflix): 20 min * New Features and API (Hive team, Facebook): 25 min JDBC/ODBC and CTAS(Create Table As Select) UDF/UDAF/UDTF (User-defined Functions) Create View/HBaseInputFormat (Hive and HBase integration) Hive Join Strategy (How Hive does the join) SerDe (Hive's serialization/deserialization framework) Hive is a scalable data warehouse infrastructure built on top of Hadoop. It provides tools to enable easy data ETL, a mechanism to put structures on the data, and the capability to querying and analysis of large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called HiveQL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with MapReduce to be able to plug in their custom mappers and reducers to perform more sophisticated analysis. The current largest deployment of Hive is the silver cluster at Facebook, which consists of 1100 nodes with 8 CPU-cores and 12 1TB-disk each. The total capacity is 8800 CPU-cores with 13 PB of raw storage space. More than 4 TB of compressed data (20+ TB uncompressed) are loaded into Hive every day. If you'd like to network with fellow Hive/Hadoop users online, feel free to find them here: http://www.facebook.com/event.php?eid=319237846974 Zheng On Fri, Feb 26, 2010 at 1:56 PM, Zheng Shao zsh...@gmail.com wrote: Hi all, We are going to hold the second Hive User Group Meeting at 7PM on 3/18/2010 Thursday. The agenda will be: * Hive Tutorial: 20 min * Hive User Case Study: 20 min * New Features and API: 25 min JDBC/ODBC and CTAS UDF/UDAF/UDTF Create View/HBaseInputFormat Hive Join Strategy SerDe The audience is beginner to intermediate Hive users/developers. *** The details are here: http://www.facebook.com/event.php?eid=319237846974 *** *** Please RSVP so we can schedule logistics accordingly. *** -- Yours, Zheng -- Yours, Zheng
Re: hive 0.50 on hadoop 0.22
Hi Massoud, Great work! Yes this is exactly the use of shims. When we see an API change across hadoop versions, we add a new function to the shims interface, and implement it in each of the shims. For this one, you probably want to wrap the logic in Driver.java into a single shim interface function, and implement that function in all shim versions. Does that make sense? Zheng On Mon, Mar 1, 2010 at 1:08 PM, Massoud Mazar massoud.ma...@avg.com wrote: Zheng, Thanks for answering. I've decided to give it (hive 0.50 on hadoop 0.22) a try. I'm a developer, but not a Java developer, so with some initial help I can spend time and work on this. Just to start, I modified the ShimLoader.java and copied the same HADOOP_SHIM_CLASSES and JETTY_SHIM_CLASSES from 0.20 to 0.22 to see where it breaks. I built and deployed hive 0.50 to a running hadoop 0.22 and did show tables; in hive, and I got this: Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.security.UserGroupInformation: method <init>()V not found at org.apache.hadoop.security.UnixUserGroupInformation.<init>(UnixUserGroupInformation.java:69) at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:271) at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:300) at org.apache.hadoop.hive.ql.Driver.<init>(Driver.java:243) at org.apache.hadoop.hive.ql.processors.CommandProcessorFactory.get(CommandProcessorFactory.java:40) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:116) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:181) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:287) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:187) Now, when I look at the UserGroupInformation class in hadoop 0.22 source code, it does not have a parameter-less constructor, but documentation at http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/security/UserGroupInformation.html shows such a constructor. Now, my question is: is this something that can be fixed by shims? Or is it a problem with hadoop?
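A rough sketch of the shim pattern Zheng describes (the interface and class names here are illustrative, not the actual Hive shim API): the version-specific login call that Driver.java currently makes would move behind one interface method, with one implementation per supported Hadoop version, chosen by the shim loader.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

// Hypothetical shim method: Driver.java would call this instead of touching
// UnixUserGroupInformation directly.
interface LoginShim {
  UserGroupInformation getLoginUser(Configuration conf) throws Exception;
}

// Implementation for older Hadoop versions (0.20 and earlier), which still have
// UnixUserGroupInformation, as seen in the stack trace above.
class Hadoop20LoginShim implements LoginShim {
  public UserGroupInformation getLoginUser(Configuration conf) throws Exception {
    return org.apache.hadoop.security.UnixUserGroupInformation.login(conf);
  }
}

// A 0.22 implementation would use whatever login API that version exposes, so only
// this shim class, not Driver.java, has to change per Hadoop release.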
Re: hive 0.50 on hadoop 0.22
Hi Mazar, We have not tried Hive on Hadoop higher than 0.20 yet. However, Hive has the shim infrastructure which makes it easy to port to new Hadoop versions. Please see the shim directory inside Hive. Zheng On Fri, Feb 26, 2010 at 1:59 PM, Massoud Mazar massoud.ma...@avg.com wrote: Is it possible to run release-0.5.0-rc0 on top of hadoop 0.22.0 (trunk)? -- Yours, Zheng
Hive User Group Meeting 3/18/2010 7pm at Facebook
Hi all, We are going to hold the second Hive User Group Meeting at 7PM on 3/18/2010 Thursday. The agenda will be: * Hive Tutorial: 20 min * Hive User Case Study: 20 min * New Features and API: 25 min JDBC/ODBC and CTAS UDF/UDAF/UDTF Create View/HBaseInputFormat Hive Join Strategy SerDe The audience is beginner to intermediate Hive users/developers. *** The details are here: http://www.facebook.com/event.php?eid=319237846974 *** *** Please RSVP so we can schedule logistics accordingly. *** -- Yours, Zheng
Re: How to generate Row Id in Hive?
Since Hive runs many mappers/reducers in parallel, there is no way to generate a globally unique increasing row id. If you are OK with that, you can easily write a non-deterministic UDF. See rand() (or UDFRand.java) for example. Please open a JIRA if you plan to work on that. Zheng On Wed, Feb 24, 2010 at 6:47 PM, Weiwei Hsieh whs...@slingmedia.com wrote: All, Could anyone tell me on how to generate a row id for a new record in Hive? Many thanks. weiwei -- Yours, Zheng
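To make that concrete, a minimal sketch of such a non-deterministic UDF, modeled on UDFRand, is below. The class name is illustrative; the counter is only unique within a single mapper or reducer task, not across the whole query. It would be registered with create temporary function like any other UDF.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.LongWritable;

// Must be marked non-deterministic so Hive does not fold or cache the result.
@UDFType(deterministic = false)
public class UDFRowCounter extends UDF {
  private final LongWritable counter = new LongWritable(0);

  public LongWritable evaluate() {
    counter.set(counter.get() + 1);  // 1, 2, 3, ... within each task
    return counter;
  }
}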
Re: Execution Error
Most probably $TMPDIR does not exist. I think by default it's /tmp/user. Can you mkdir ? On Thu, Feb 25, 2010 at 5:58 AM, Aryeh Berkowitz ar...@iswcorp.com wrote: Can anybody tell me why I’m getting this error? hive show tables; OK email html_href html_src ipadrr phone urls Time taken: 0.129 seconds hive SELECT DISTINCT a.url, a.signature, a.size from urls a; Total MapReduce jobs = 1 Launching Job 1 out of 1 java.io.IOException: No such file or directory at java.io.UnixFileSystem.createFileExclusively(Native Method) at java.io.File.checkAndCreate(File.java:1704) at java.io.File.createTempFile(File.java:1792) at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:87) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:107) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:55) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:630) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:504) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:382) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:138) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:197) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:303) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask -- Yours, Zheng
Re: How to generate Row Id in Hive?
Not right now. It should be pretty simple to do though. We can expose the current JobConf via a static method in ExecMapper. Zheng On Thu, Feb 25, 2010 at 7:52 AM, Todd Lipcon t...@cloudera.com wrote: Zheng: is there a way to get at the hadoop conf variables from within a query? If so, you could use mapred.task.id to get a unique string. -Todd On Thu, Feb 25, 2010 at 12:42 AM, Zheng Shao zsh...@gmail.com wrote: Since Hive runs many mappers/reducers in parallel, there is no way to generate a globally unique increasing row id. If you are OK with that, you can easily write a non-deterministic UDF. See rand() (or UDFRand.java) for example. Please open a JIRA if you plan to work on that. Zheng On Wed, Feb 24, 2010 at 6:47 PM, Weiwei Hsieh whs...@slingmedia.com wrote: All, Could anyone tell me on how to generate a row id for a new record in Hive? Many thanks. weiwei -- Yours, Zheng -- Yours, Zheng
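Until the JobConf is exposed that way, one workaround in the spirit of Todd's suggestion (a hedged sketch only, not an existing Hive feature) is to substitute a per-task random prefix for mapred.task.id and append a counter; uniqueness is then probabilistic rather than guaranteed, and the ids are not increasing.

import java.util.UUID;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.Text;

@UDFType(deterministic = false)
public class UDFUniqueId extends UDF {
  // Random prefix generated once per task JVM, standing in for mapred.task.id.
  private final String prefix = UUID.randomUUID().toString();
  private long counter = 0;
  private final Text result = new Text();

  public Text evaluate() {
    result.set(prefix + "-" + (counter++));
    return result;
  }
}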
[ANNOUNCE] Hive 0.5.0 released
Hi folks, We have released Hive 0.5.0. You can find it from the download page in 24 hours (still waiting to be mirrored) http://hadoop.apache.org/hive/releases.html#Download -- Yours, Zheng
Re: [ANNOUNCE] Hive 0.5.0 released
Thanks for the feedback. Which exact version of hadoop are you using? There is a bug in hadoop combinefileinputformat that was fixed recently. Zheng On 2/24/10, Ryan LeCompte lecom...@gmail.com wrote: Actually, I just fixed the problem by removing the following in hive-site.xml: property namehive.input.format/name valueorg.apache.hadoop.hive.ql.io.CombineHiveInputFormat/value /property Any reason why specifying the above would cause the error? We are using latest version of Hadoop. Thanks, Ryan On Wed, Feb 24, 2010 at 10:40 AM, Ryan LeCompte lecom...@gmail.com wrote: I actually just tried doing this (using same metastoredb, just using 0.5.0 release code), and now when I execute a simple query it immediately fails with the following in hive.log: 2010-02-24 10:39:31,950 WARN mapred.JobClient (JobClient.java:configureCommandLineOptions(539)) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2010-02-24 10:39:33,535 ERROR exec.ExecDriver (SessionState.java:printError(248)) - Ended Job = job_201002241035_0002 with errors 2010-02-24 10:39:33,555 ERROR ql.Driver (SessionState.java:printError(248)) - FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.ExecDriver Any ideas how to get this working? Thanks, Ryan On Wed, Feb 24, 2010 at 8:20 AM, Massoud Mazar massoud.ma...@avg.comwrote: Is it compatible with release-0.4.1-rc2 so I can just replace the code? -Original Message- From: Zheng Shao [mailto:zsh...@gmail.com] Sent: Wednesday, February 24, 2010 3:34 AM To: hive-user@hadoop.apache.org; hive-...@hadoop.apache.org Subject: [ANNOUNCE] Hive 0.5.0 released Hi folks, We have released Hive 0.5.0. You can find it from the download page in 24 hours (still waiting to be mirrored) http://hadoop.apache.org/hive/releases.html#Download -- Yours, Zheng -- Sent from my mobile device Yours, Zheng
Re: [ANNOUNCE] Hive 0.5.0 released
Yes, see http://issues.apache.org/jira/browse/HADOOP-5759?page=com.atlassian.jira.plugin.ext.subversion:subversion-commits-tabpanel The fix is committed to Hadoop 0.20.2 and 0.21.0. But you can continue to use Hive 0.5.0 if you remove that configuration. Zheng On Wed, Feb 24, 2010 at 10:17 AM, Ryan LeCompte lecom...@gmail.com wrote: Ah, interesting. Using Hadoop 0.20.1. Is this the problematic version? Thanks, Ryan On Wed, Feb 24, 2010 at 12:50 PM, Zheng Shao zsh...@gmail.com wrote: Thanks for the feedback. Which exact version of hadoop are you using? There is a bug in hadoop combinefileinputformat that was fixed recently. Zheng On 2/24/10, Ryan LeCompte lecom...@gmail.com wrote: Actually, I just fixed the problem by removing the following in hive-site.xml: property namehive.input.format/name valueorg.apache.hadoop.hive.ql.io.CombineHiveInputFormat/value /property Any reason why specifying the above would cause the error? We are using latest version of Hadoop. Thanks, Ryan On Wed, Feb 24, 2010 at 10:40 AM, Ryan LeCompte lecom...@gmail.com wrote: I actually just tried doing this (using same metastoredb, just using 0.5.0 release code), and now when I execute a simple query it immediately fails with the following in hive.log: 2010-02-24 10:39:31,950 WARN mapred.JobClient (JobClient.java:configureCommandLineOptions(539)) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2010-02-24 10:39:33,535 ERROR exec.ExecDriver (SessionState.java:printError(248)) - Ended Job = job_201002241035_0002 with errors 2010-02-24 10:39:33,555 ERROR ql.Driver (SessionState.java:printError(248)) - FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.ExecDriver Any ideas how to get this working? Thanks, Ryan On Wed, Feb 24, 2010 at 8:20 AM, Massoud Mazar massoud.ma...@avg.comwrote: Is it compatible with release-0.4.1-rc2 so I can just replace the code? -Original Message- From: Zheng Shao [mailto:zsh...@gmail.com] Sent: Wednesday, February 24, 2010 3:34 AM To: hive-user@hadoop.apache.org; hive-...@hadoop.apache.org Subject: [ANNOUNCE] Hive 0.5.0 released Hi folks, We have released Hive 0.5.0. You can find it from the download page in 24 hours (still waiting to be mirrored) http://hadoop.apache.org/hive/releases.html#Download -- Yours, Zheng -- Sent from my mobile device Yours, Zheng -- Yours, Zheng
Re: Error while starting hive
export HADOOP_CLASSPATH=/master/hadoop/json.jar:/master/hadoop/hbase-0.20.2/hbase-0.20.2.jar:/master/hadoop/hbase-0.20.2/lib/zookeeper-3.2.1.jar:/master/hadoop/hive/build/dist/lib/:/master/hadoop/hive/build/dist/lib/*.jar:/master/hadoop/hive/build/dist/conf/ should be: export HADOOP_CLASSPATH=/master/hadoop/json.jar:/master/hadoop/hbase-0.20.2/hbase-0.20.2.jar:/master/hadoop/hbase-0.20.2/lib/zookeeper-3.2.1.jar:/master/hadoop/hive/build/dist/lib/:/master/hadoop/hive/build/dist/lib/*.jar:/master/hadoop/hive/build/dist/conf/:$HADOOP_CLASSPATH Zheng On Sun, Feb 21, 2010 at 11:19 PM, Mafish Liu maf...@gmail.com wrote: This happens when hive fails to find hive jar files. Did you specify HIVE_HOME and HIVE_LIB in your system? 2010/2/22 Vidyasagar Venkata Nallapati vidyasagar.nallap...@onmobile.com: Hi, While starting hive I am still getting an error, attached are the hadoop env and hive-ste I am using phoe...@ph1:/master/hadoop/hive/build/dist$ bin/hive Exception in thread main java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:247) at org.apache.hadoop.util.RunJar.main(RunJar.java:149) Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at java.lang.ClassLoader.loadClass(ClassLoader.java:252) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) ... 3 more Regards Vidya DISCLAIMER: The information in this message is confidential and may be legally privileged. It is intended solely for the addressee. Access to this message by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, or distribution of the message, or any action or omission taken by you in reliance on it, is prohibited and may be unlawful. Please immediately contact the sender if you have received this message in error. Further, this e-mail may contain viruses and all reasonable precaution to minimize the risk arising there from is taken by OnMobile. OnMobile is not liable for any damage sustained by you as a result of any virus in this e-mail. All applicable virus checks should be carried out by you before opening this e-mail or any attachment thereto. Thank you - OnMobile Global Limited. -- maf...@gmail.com -- Yours, Zheng
[VOTE] hive 0.5.0 release (rc1)
Hi, I just made a release candidate at https://svn.apache.org/repos/asf/hadoop/hive/tags/release-0.5.0-rc1 The tarballs are at: http://people.apache.org/~zshao/hive-0.5.0-candidate-1/ The HWI startup problem is fixed in rc1. This supersedes the previous email about voting on rc0. Please vote. -- Yours, Zheng
Re: [VOTE] hive 0.5.0 release candidate 0
Can you generate a patch for 0.5? The patch does not work on branch-0.5 Zheng On 2/19/10, Edward Capriolo edlinuxg...@gmail.com wrote: On Fri, Feb 19, 2010 at 9:49 PM, Zheng Shao zsh...@gmail.com wrote: Hi, I just made a release candidate at https://svn.apache.org/repos/asf/hadoop/hive/tags/release-0.5.0-rc0 The tarballs are at: http://people.apache.org/~zshao/hive-0.4.1-candidate-3/ Please vote. -- Yours, Zheng -1 I would like to fix https://issues.apache.org/jira/browse/HIVE-1183 -- Sent from my mobile device Yours, Zheng
Re: SequenceFile compression on Amazon EMR not very good
hive.exec.compress.output controls whether or not to compress hive output. (This overrides mapred.output.compress in Hive). All other compression flags are from hadoop. Please see http://hadoop.apache.org/common/docs/r0.18.0/hadoop-default.html Zheng On Fri, Feb 19, 2010 at 5:53 AM, Saurabh Nanda saurabhna...@gmail.com wrote: And also hive.exec.compress.*. So that makes it three sets of configuration variables: mapred.output.compress.* io.seqfile.compress.* hive.exec.compress.* What's the relationship between these configuration parameters and which ones should I set to achieve a well compress output table? Saurabh. On Fri, Feb 19, 2010 at 7:16 PM, Saurabh Nanda saurabhna...@gmail.com wrote: I'm confused here Zheng. There are two sets of configuration variables. Those starting with io.* and those starting with mapred.*. For making sure that the final output table is compressed, which ones do I have to set? Saurabh. On Fri, Feb 19, 2010 at 12:37 AM, Zheng Shao zsh...@gmail.com wrote: Did you also: SET mapred.output.compression.codec=org.apacheGZipCode; Zheng On Thu, Feb 18, 2010 at 8:25 AM, Saurabh Nanda saurabhna...@gmail.com wrote: Hi Zheng, I cross checked. I am setting the following in my Hive script before the INSERT command: SET io.seqfile.compression.type=BLOCK; SET hive.exec.compress.output=true; A 132 MB (gzipped) input file going through a cleanup and getting populated in a sequencefile table is growing to 432 MB. What could be going wrong? Saurabh. On Wed, Feb 3, 2010 at 2:26 PM, Saurabh Nanda saurabhna...@gmail.com wrote: Thanks, Zheng. Will do some more tests and get back. Saurabh. On Mon, Feb 1, 2010 at 1:22 PM, Zheng Shao zsh...@gmail.com wrote: I would first check whether it is really the block compression or record compression. Also maybe the block size is too small but I am not sure that is tunable in SequenceFile or not. Zheng On Sun, Jan 31, 2010 at 9:03 PM, Saurabh Nanda saurabhna...@gmail.com wrote: Hi, The size of my Gzipped weblog files is about 35MB. However, upon enabling block compression, and inserting the logs into another Hive table (sequencefile), the file size bloats up to about 233MB. I've done similar processing on a local Hadoop/Hive cluster, and while the compressions is not as good as gzipping, it still is not this bad. What could be going wrong? I looked at the header of the resulting file and here's what it says: SEQ^Forg.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec Does Amazon Elastic MapReduce behave differently or am I doing something wrong? Saurabh. -- http://nandz.blogspot.com http://foodieforlife.blogspot.com -- Yours, Zheng -- http://nandz.blogspot.com http://foodieforlife.blogspot.com -- http://nandz.blogspot.com http://foodieforlife.blogspot.com -- Yours, Zheng -- http://nandz.blogspot.com http://foodieforlife.blogspot.com -- http://nandz.blogspot.com http://foodieforlife.blogspot.com -- Yours, Zheng
Re: Thrift Server Error Messages
Can you open a JIRA and help propose some concrete design of the change? That will help make it faster to have this feature. Thanks, Zheng On Fri, Feb 19, 2010 at 6:17 AM, Andy Kent andy.k...@forward.co.uk wrote: When executing commands on the hive command line it give really useful output if you have syntax errors in your query. When using the Thrift interface I seem to only be able to get errors like 'Error code: 11'. Is there a way to get at the human friendly error messages via the thrift interface? If not, is there a list of the thrift error codes and what they mean anywhere? If it's not available it would really great if this could be exposed via thrift. Thanks, Andy. -- Yours, Zheng
Re: computing median and percentiles
Hi Jerome, Is there any update on this? https://issues.apache.org/jira/browse/HIVE-259 Zheng On Fri, Feb 5, 2010 at 9:34 AM, Jerome Boulon jbou...@netflix.com wrote: Hi Bryan, I'm working on Hive-259. I'll post an update early next week. /Jerome. On 2/4/10 9:08 PM, Bryan Talbot btal...@aeriagames.com wrote: What's the best way to compute median and other percentiles using Hive 0.40? I've run across http://issues.apache.org/jira/browse/HIVE-259 but there doesn't seem to be any planned implementation yet. -Bryan -- Yours, Zheng
Re: Having trouble with lateral view
Jason, Do you want to open a JIRA and contrib your map_explode function to Hive? That will be greatly appreciated. Zheng On Fri, Feb 19, 2010 at 2:49 PM, Yongqiang He heyongqi...@software.ict.ac.cn wrote: Hi Jason, This is a known bug, see https://issues.apache.org/jira/browse/HIVE-1056 You can first disable ppd with “set hive.optimize.ppd=false;” Thanks Yongqiang On 2/19/10 2:23 PM, Jason Michael jmich...@videoegg.com wrote: I’m currently running a hive build from trunk, revision number 911889. I’ve built a UDTF called map_explode which just emits the key and value of each entry in a map as a row in the result table. The table I’m running it against looks like: hive describe mytable; product string from deserializer ... interactions mapstring,int from deserializer If I use the map_explode in the select clause, I get the expected results: hive select map_explode(interactions) as (key, value) from mytable where day = '2010-02-18' and hour = 1 limit 10; ... OK invite_impression 1 invite_impression 1 invite_impression 1 invite_impression 1 rollout 12 invite_impression 1 invite_impression 1 invite_impression 1 rollout 4 invite_impression 1 Time taken: 22.11 seconds However, if I try to use LATERAL JOIN to relate the exploded values back to the parent table, like so: hive select product, key, sum(value) from mytable LATERAL VIEW map_explode(interactions) interacts as key, value where day = '2010-02-18' and hour = 1 group by product, key; I get the following error: FAILED: Unknown exception: null Looking in hive.log, I see the follow stack trace: 2010-02-19 14:15:17,215 ERROR ql.Driver (SessionState.java:printError(255)) - FAILED: Unknown exception: null java.lang.NullPointerException at org.apache.hadoop.hive.ql.ppd.ExprWalkerProcFactory$ColumnExprProcessor.process(ExprWalkerProcFactory.java:87) at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:129) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:103) at org.apache.hadoop.hive.ql.ppd.ExprWalkerProcFactory.extractPushdownPreds(ExprWalkerProcFactory.java:273) at org.apache.hadoop.hive.ql.ppd.OpProcFactory$DefaultPPD.mergeWithChildrenPred(OpProcFactory.java:317) at org.apache.hadoop.hive.ql.ppd.OpProcFactory$DefaultPPD.process(OpProcFactory.java:258) at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:129) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:103) at org.apache.hadoop.hive.ql.ppd.PredicatePushDown.transform(PredicatePushDown.java:103) at org.apache.hadoop.hive.ql.optimizer.Optimizer.optimize(Optimizer.java:74) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:5758) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:125) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:304) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:377) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:138) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:197) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:303) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) I peeked at ExprWalkerProcFactory, but couldn’t readily see what was causing the problem. Any ideas? Jason -- Yours, Zheng
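For readers curious what such a UDTF involves, a rough sketch of a map_explode-style function is below (this is not Jason's actual code; the package and class names are illustrative). It emits one (key, value) row per entry of the input map. The function would then be registered with create temporary function and used exactly as in Jason's query.

package com.example.hive.udtf;  // hypothetical package

import java.util.ArrayList;
import java.util.Map;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.MapObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;

public class GenericUDTFMapExplode extends GenericUDTF {

  private MapObjectInspector mapOI;
  private final Object[] forwardObj = new Object[2];

  @Override
  public StructObjectInspector initialize(ObjectInspector[] args)
      throws UDFArgumentException {
    if (args.length != 1 || !(args[0] instanceof MapObjectInspector)) {
      throw new UDFArgumentException("map_explode() takes a single map argument");
    }
    mapOI = (MapObjectInspector) args[0];

    // Output is a struct with two columns: the map key and the map value.
    ArrayList<String> fieldNames = new ArrayList<String>();
    ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
    fieldNames.add("key");
    fieldNames.add("value");
    fieldOIs.add(mapOI.getMapKeyObjectInspector());
    fieldOIs.add(mapOI.getMapValueObjectInspector());
    return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
  }

  @Override
  public void process(Object[] args) throws HiveException {
    Map<?, ?> map = mapOI.getMap(args[0]);
    if (map == null) {
      return;
    }
    for (Map.Entry<?, ?> entry : map.entrySet()) {
      forwardObj[0] = entry.getKey();
      forwardObj[1] = entry.getValue();
      forward(forwardObj);  // one output row per map entry
    }
  }

  @Override
  public void close() throws HiveException {
    // nothing to clean up
  }
}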
[VOTE] hive 0.5.0 release candidate 0
Hi, I just made a release candidate at https://svn.apache.org/repos/asf/hadoop/hive/tags/release-0.5.0-rc0 The tarballs are at: http://people.apache.org/~zshao/hive-0.4.1-candidate-3/ Please vote. -- Yours, Zheng
Re: SequenceFile compression on Amazon EMR not very good
Did you also: SET mapred.output.compression.codec=org.apacheGZipCode; Zheng On Thu, Feb 18, 2010 at 8:25 AM, Saurabh Nanda saurabhna...@gmail.com wrote: Hi Zheng, I cross checked. I am setting the following in my Hive script before the INSERT command: SET io.seqfile.compression.type=BLOCK; SET hive.exec.compress.output=true; A 132 MB (gzipped) input file going through a cleanup and getting populated in a sequencefile table is growing to 432 MB. What could be going wrong? Saurabh. On Wed, Feb 3, 2010 at 2:26 PM, Saurabh Nanda saurabhna...@gmail.com wrote: Thanks, Zheng. Will do some more tests and get back. Saurabh. On Mon, Feb 1, 2010 at 1:22 PM, Zheng Shao zsh...@gmail.com wrote: I would first check whether it is really the block compression or record compression. Also maybe the block size is too small but I am not sure that is tunable in SequenceFile or not. Zheng On Sun, Jan 31, 2010 at 9:03 PM, Saurabh Nanda saurabhna...@gmail.com wrote: Hi, The size of my Gzipped weblog files is about 35MB. However, upon enabling block compression, and inserting the logs into another Hive table (sequencefile), the file size bloats up to about 233MB. I've done similar processing on a local Hadoop/Hive cluster, and while the compressions is not as good as gzipping, it still is not this bad. What could be going wrong? I looked at the header of the resulting file and here's what it says: SEQ^Forg.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec Does Amazon Elastic MapReduce behave differently or am I doing something wrong? Saurabh. -- http://nandz.blogspot.com http://foodieforlife.blogspot.com -- Yours, Zheng -- http://nandz.blogspot.com http://foodieforlife.blogspot.com -- http://nandz.blogspot.com http://foodieforlife.blogspot.com -- Yours, Zheng
Re: Question on modifying a table to become external
There is no command to do that right now. One way to go is to create another external table pointing to the same location (and forget about the old table). Or you can move the files first, before dropping and recreating the same table. Zheng On Thu, Feb 18, 2010 at 10:22 AM, Eva Tse e...@netflix.com wrote: We created a table without the ‘EXTERNAL’ qualifier but did specify a location for the warehouse. We would like to modify this to be an external table. We tried to drop the table, but it does delete the files in the S3 external location. Is there a way we could achieve this? Thanks, Eva. CREATE TABLE IF NOT EXISTS exampletable ( other_properties Map<string, string>, event_ts_ms bigint, hostname string ) PARTITIONED by (dateint int, hour int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' COLLECTION ITEMS TERMINATED BY '\004' MAP KEYS TERMINATED BY '\002' stored as SEQUENCEFILE LOCATION 's3n://bucketname/hive/warehouse/exampletable'; -- Yours, Zheng
Re: map join and OOM
https://issues.apache.org/jira/browse/HIVE-917 might be what you want (suppose both of the tables are already bucketed on the join column). Zheng On Thu, Feb 18, 2010 at 2:53 PM, Ning Zhang nzh...@facebook.com wrote: 1GB of the small table is usually too large for map-side joins. If the raw data is 1GB, it could be 10x larger when it is read into main memory as Java objects. Our default value is 10MB. Another factor to determine whether to use map-side join is the number of rows in the small table. If it is too large, each mapper will spend long time to process the join (each mapper reads the whole small table into a hash table in main memory and joins a split of the large table). Thanks, Ning On Feb 18, 2010, at 2:45 PM, Edward Capriolo wrote: I have Hive 4.1-rc2. My query runs in Time taken: 312.956 seconds using the map/reduce join. I was interested in using mapjoin, I get an OOM error. hive java.lang.OutOfMemoryError: GC overhead limit exceeded at org.apache.hadoop.hive.ql.util.jdbm.recman.RecordFile.getNewNode(RecordFile.java:369) My pageviews is 8GB and my client_ips is ~ 1GB property namemapred.child.java.opts/name value-Xmx778m/value /property [ecapri...@nyhadoopdata10 ~]$ hive Hive history file=/tmp/ecapriolo/hive_job_log_ecapriolo_201002181717_253155276.txt hive explain Select /*+ MAPJOIN( client_ips )*/clientip_id,client_ip, SUM(bytes_sent) as X from pageviews join client_ips on pageviews.clientip_id=client_ips.id where year=2010 AND month=02 and day=17 group by clientip_id,client_ip ; OK ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF pageviews) (TOK_TABREF client_ips) (= (. (TOK_TABLE_OR_COL pageviews) clientip_id) (. (TOK_TABLE_OR_COL client_ips) id (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST client_ips))) (TOK_SELEXPR (TOK_TABLE_OR_COL clientip_id)) (TOK_SELEXPR (TOK_TABLE_OR_COL client_ip)) (TOK_SELEXPR (TOK_FUNCTION SUM (TOK_TABLE_OR_COL bytes_sent)) X)) (TOK_WHERE (and (AND (= (TOK_TABLE_OR_COL year) 2010) (= (TOK_TABLE_OR_COL month) 02)) (= (TOK_TABLE_OR_COL day) 17))) (TOK_GROUPBY (TOK_TABLE_OR_COL clientip_id) (TOK_TABLE_OR_COL client_ip STAGE DEPENDENCIES: Stage-1 is a root stage Stage-2 depends on stages: Stage-1 Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias - Map Operator Tree: pageviews TableScan alias: pageviews Filter Operator predicate: expr: (((UDFToDouble(year) = UDFToDouble(2010)) and (UDFToDouble(month) = UDFToDouble(2))) and (UDFToDouble(day) = UDFToDouble(17))) type: boolean Common Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {clientip_id} {bytes_sent} {year} {month} {day} 1 {client_ip} keys: 0 1 outputColumnNames: _col13, _col17, _col22, _col23, _col24, _col26 Position of Big Table: 0 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat Local Work: Map Reduce Local Work Alias - Map Local Tables: client_ips Fetch Operator limit: -1 Alias - Map Local Operator Tree: client_ips TableScan alias: client_ips Common Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {clientip_id} {bytes_sent} {year} {month} {day} 1 {client_ip} keys: 0 1 outputColumnNames: _col13, _col17, _col22, _col23, _col24, _col26 Position of Big Table: 0 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output 
format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat Stage: Stage-2 Map Reduce Alias - Map Operator Tree: hdfs://nyhadoopname1.ops.about.com:8020/tmp/hive-ecapriolo/975920219/10002 Select Operator expressions: expr: _col13 type: int expr: _col17 type: int expr: _col22 type: string expr: _col23
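As a conceptual illustration of what Ning describes (this is not Hive's actual implementation), each mapper in a map-side join effectively does the following: load the whole small table into an in-memory hash table, then probe it with every row of its big-table split. This is why the small table's in-memory footprint as Java objects, not just its on-disk size, is what matters.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapJoinSketch {
  // smallTable and bigSplit are rows as string arrays; column 0 is the join key.
  public static List<String> join(List<String[]> smallTable, List<String[]> bigSplit) {
    // Build phase: every mapper holds the entire small table in memory.
    Map<String, String[]> hash = new HashMap<String, String[]>();
    for (String[] row : smallTable) {
      hash.put(row[0], row);
    }
    // Probe phase: stream the big-table split and look up each key.
    List<String> out = new ArrayList<String>();
    for (String[] row : bigSplit) {
      String[] match = hash.get(row[0]);
      if (match != null) {
        out.add(row[0] + "\t" + row[1] + "\t" + match[1]);
      }
    }
    return out;
  }
}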
Re: Hive Server Leaking File Descriptors?
This is actually a bug in MAPREDUCE-1504, but we will try to find a workaround. https://issues.apache.org/jira/browse/HIVE-1181 Given that release 0.5.0 is much wanted right now, I don't think we want to wait purely for 0.5.0 since the ultimate fix should come from Hadoop. We will definitely get HIVE-1181 for branch 0.5. Zheng -- Forwarded message -- From: Andy Kent andy.k...@forward.co.uk Date: Thu, Feb 18, 2010 at 3:17 PM Subject: Re: Hive Server Leaking File Descriptors? To: hive-user@hadoop.apache.org hive-user@hadoop.apache.org On 18 Feb 2010, at 20:29, Zheng Shao zsh...@gmail.com wrote: I've tried to look into it a bit more and it seems to happen on load data inpath This is inline with what we have been seeing as we do around 200 load data statements per day and leak approx the same number of file descriptors. Is there any chance this fix will make it into the 0.5 release? -- Yours, Zheng
Re: NoClassDef error
The stacktrace that you showed is from the hive cli right? Did you define HADOOP_CLASSPATH somewhere? Hive modifies HADOOP_CLASSPATH so it's important to modify it by export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/my/new/path instead of directly overwriting it. Zheng On Thu, Feb 18, 2010 at 9:22 PM, Vidyasagar Venkata Nallapati vidyasagar.nallap...@onmobile.com wrote: Hi, I have kept the hive/conf in the HADOOP_CLASSPATH Also I have verified that there are no hive jars in the hadoop directory and also added the property namehadoop.bin.path/name value/usr/bin/hadoop/value !-- note that the hive shell script also uses this property name -- descriptionPath to hadoop binary. Assumes that by default we are executing from hive/description /property But am still getting the same error if a run on multi node cluster, its working in a single node setup. Regards Vidyasagar N V From: Yi Mao [mailto:ymaob...@gmail.com] Sent: Wednesday, February 17, 2010 11:28 PM To: hive-user@hadoop.apache.org Subject: Re: NoClassDef error I think you also need hive/conf in the classpath. On Wed, Feb 17, 2010 at 2:23 AM, Vidyasagar Venkata Nallapati vidyasagar.nallap...@onmobile.com wrote: Hi , When starting the hive I am getting an error even after I am including in class path, attached is the hadoop-env I am using. Exception in thread main java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:247) at org.apache.hadoop.util.RunJar.main(RunJar.java:149) Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at java.lang.ClassLoader.loadClass(ClassLoader.java:252) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) ... 3 more Regards Vidyasagar N V DISCLAIMER: The information in this message is confidential and may be legally privileged. It is intended solely for the addressee. Access to this message by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, or distribution of the message, or any action or omission taken by you in reliance on it, is prohibited and may be unlawful. Please immediately contact the sender if you have received this message in error. Further, this e-mail may contain viruses and all reasonable precaution to minimize the risk arising there from is taken by OnMobile. OnMobile is not liable for any damage sustained by you as a result of any virus in this e-mail. All applicable virus checks should be carried out by you before opening this e-mail or any attachment thereto. Thank you - OnMobile Global Limited. DISCLAIMER: The information in this message is confidential and may be legally privileged. It is intended solely for the addressee. Access to this message by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, or distribution of the message, or any action or omission taken by you in reliance on it, is prohibited and may be unlawful. Please immediately contact the sender if you have received this message in error. Further, this e-mail may contain viruses and all reasonable precaution to minimize the risk arising there from is taken by OnMobile. OnMobile is not liable for any damage sustained by you as a result of any virus in this e-mail. 
All applicable virus checks should be carried out by you before opening this e-mail or any attachment thereto. Thank you - OnMobile Global Limited. -- Yours, Zheng
Re: NoClassDef error
In which directory did you start hive? hive should be started in build/dist Zheng On Wed, Feb 17, 2010 at 2:23 AM, Vidyasagar Venkata Nallapati vidyasagar.nallap...@onmobile.com wrote: Hi , When starting the hive I am getting an error even after I am including in class path, attached is the hadoop-env I am using. Exception in thread main java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:247) at org.apache.hadoop.util.RunJar.main(RunJar.java:149) Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at java.lang.ClassLoader.loadClass(ClassLoader.java:252) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) ... 3 more Regards Vidyasagar N V DISCLAIMER: The information in this message is confidential and may be legally privileged. It is intended solely for the addressee. Access to this message by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, or distribution of the message, or any action or omission taken by you in reliance on it, is prohibited and may be unlawful. Please immediately contact the sender if you have received this message in error. Further, this e-mail may contain viruses and all reasonable precaution to minimize the risk arising there from is taken by OnMobile. OnMobile is not liable for any damage sustained by you as a result of any virus in this e-mail. All applicable virus checks should be carried out by you before opening this e-mail or any attachment thereto. Thank you - OnMobile Global Limited. -- Yours, Zheng
Re: Help with Compressed Storage
I just corrected the wiki page. It will also be a good idea to support case-insensitive boolean values in the code. Zheng On Wed, Feb 17, 2010 at 9:27 AM, Brent Miller brentalanmil...@gmail.com wrote: Thanks Adam, that works for me as well. It seems that the property for hive.exec.compress.output is case sensitive, and when it is set to TRUE (as it is on the compressed storage page on the wiki) it is ignored by hive. -Brent On Tue, Feb 16, 2010 at 4:24 PM, Adam O'Donnell a...@immunet.com wrote: Adding these to my hive-site.xml file worked fine: property namehive.exec.compress.output/name valuetrue/value descriptionCompress output/description /property property namemapred.output.compression.type/name valueBLOCK/value descriptionBlock compression/description /property On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller brentalanmil...@gmail.com wrote: Hello, I've seen issues similar to this one come up once or twice before, but I haven't ever seen a solution to the problem that I'm having. I was following the Compressed Storage page on the Hive Wiki http://wiki.apache.org/hadoop/CompressedStorage and realized that the sequence files that are created in the warehouse directory are actually uncompressed and larger than than the originals. For example, I have a table 'test1' who's input data looks something like: 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,BD43 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341 ... And after creating a second table 'test1_comp' that was crated with the STORED AS SEQUENCEFILE directive and the compression options SET as described in the wiki, I can look at the resultant sequence files and see that they're just plain (uncompressed) text: SEQ org.apache.hadoop.io.BytesWritable org.apache.hadoop.io.Text+�c�!Y�M �� Z^��= 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,BD43= 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43= 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341= 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141= ... I've tried messing around with different org.apache.hadoop.io.compress.* options, but the sequence files always come out uncompressed. Has anybody ever seen this or know away to keep the data compressed? Since the input text is so uniform, we get huge space savings from compression and would like to store the data this way if possible. I'm using Hadoop 20.1 and Hive that I checked out from SVN about a week ago. Thanks, Brent -- Adam J. O'Donnell, Ph.D. Immunet Corporation Cell: +1 (267) 251-0070 -- Yours, Zheng
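A tiny sketch of the case-insensitive handling suggested above (illustrative only, not the actual Hive code path):

public class BooleanConfValue {
  // Treats "true", "TRUE", "True", ... identically; anything else is false.
  public static boolean parse(String value, boolean defaultValue) {
    if (value == null) {
      return defaultValue;
    }
    return value.trim().equalsIgnoreCase("true");
  }
}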
Re: hive ant spead ups
I think this is worth exploring. Unit test time keeps getting longer as we add more code and more tests. Do you want to start a JIRA issue and discuss it more there? Zheng On Wed, Feb 17, 2010 at 8:53 AM, Edward Capriolo edlinuxg...@gmail.com wrote: I made an ant target quick-test, which differs from test in that it has no dependencies. <target name="quick-test"> <iterate target="test"/> </target> <target name="test" depends="clean-test,jar"> <iterate target="test"/> <iterate-cpp target="test"/> </target> time ant -Dhadoop.version='0.18.3' -Doffline=true -Dtestcase=TestCliDriver -Dqfile=alter1.q quick-test BUILD SUCCESSFUL Total time: 15 seconds real 0m16.250s user 0m20.965s sys 0m1.579s time ant -Dhadoop.version='0.18.3' -Doffline=true -Dtestcase=TestCliDriver -Dqfile=alter1.q test BUILD SUCCESSFUL Total time: 26 seconds real 0m26.564s user 0m31.307s sys 0m2.346s It goes without saying that Hive's ant build is very different than a makefile. Most makefiles can set simple flags, say 'make.ok', so that running a target like 'make install' will not cause the dependent tasks to be re-run. Excuse my ignorance if this is some built-in ant switch like '--no-deps'. Should we set flags in hive so the build process can intelligently skip work that is already done? -- Yours, Zheng
Re: Help with Compressed Storage
There is no special setting for bz2. Can you get the debug log? Zheng On Wed, Feb 17, 2010 at 9:02 PM, prasenjit mukherjee pmukher...@quattrowireless.com wrote: So I tried the same with .gz files and it worked. I am using the following hadoop version :Hadoop 0.20.1+169.56 with cloudera's ami-2359bf4a. I thought that hadoop0.20 does support bz2 compression, hence same should work with hive as well. Interesting note is that Pig works fine on the same bz2 data. Is there any tweaking/config setup I need to do for hive to take bz2 files as input ? On Thu, Feb 18, 2010 at 8:31 AM, prasenjit mukherjee pmukher...@quattrowireless.com wrote: I have a similar issue with bz2 files. I have the hadoop directories : /ip/data/ : containing unzipped text files ( foo1.txt, foo2.txt ) /ip/datacompressed/ : containing same files bzipped ( foo1.bz2, foo2.bz2 ) CREATE EXTERNAL TABLE tx_log(id1 string, id2 string, id3 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002' LOCATION '/ip/datacompressed/'; SELECT * FROM tx_log limit 1; The command works fine with LOCATION '/ip/data/' but doesnt work with LOCATION '/ip/datacompressed/' Any pointers ? I thought ( like Pig ) hive automatically detects .bz2 extensions and applies appropriate decompression. Am I wrong ? -Prasen On Thu, Feb 18, 2010 at 3:04 AM, Zheng Shao zsh...@gmail.com wrote: I just corrected the wiki page. It will also be a good idea to support case-insensitive boolean values in the code. Zheng On Wed, Feb 17, 2010 at 9:27 AM, Brent Miller brentalanmil...@gmail.com wrote: Thanks Adam, that works for me as well. It seems that the property for hive.exec.compress.output is case sensitive, and when it is set to TRUE (as it is on the compressed storage page on the wiki) it is ignored by hive. -Brent On Tue, Feb 16, 2010 at 4:24 PM, Adam O'Donnell a...@immunet.com wrote: Adding these to my hive-site.xml file worked fine: property namehive.exec.compress.output/name valuetrue/value descriptionCompress output/description /property property namemapred.output.compression.type/name valueBLOCK/value descriptionBlock compression/description /property On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller brentalanmil...@gmail.com wrote: Hello, I've seen issues similar to this one come up once or twice before, but I haven't ever seen a solution to the problem that I'm having. I was following the Compressed Storage page on the Hive Wiki http://wiki.apache.org/hadoop/CompressedStorage and realized that the sequence files that are created in the warehouse directory are actually uncompressed and larger than than the originals. For example, I have a table 'test1' who's input data looks something like: 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,BD43 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341 ... And after creating a second table 'test1_comp' that was crated with the STORED AS SEQUENCEFILE directive and the compression options SET as described in the wiki, I can look at the resultant sequence files and see that they're just plain (uncompressed) text: SEQ org.apache.hadoop.io.BytesWritable org.apache.hadoop.io.Text+�c�!Y�M �� Z^��= 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,BD43= 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43= 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341= 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141= ... I've tried messing around with different org.apache.hadoop.io.compress.* options, but the sequence files always come out uncompressed. 
Has anybody ever seen this or know away to keep the data compressed? Since the input text is so uniform, we get huge space savings from compression and would like to store the data this way if possible. I'm using Hadoop 20.1 and Hive that I checked out from SVN about a week ago. Thanks, Brent -- Adam J. O'Donnell, Ph.D. Immunet Corporation Cell: +1 (267) 251-0070 -- Yours, Zheng -- Yours, Zheng
Re: Help with Compressed Storage
Just remember that we need to have the BZipCodec class in the following hadoop configuration: Can you check? io.compression.codecs Zheng On Wed, Feb 17, 2010 at 11:21 PM, prasenjit mukherjee prasen@gmail.com wrote: So this is the command I ran, first with with small.gz (which worked fine) and then with small.bz2 ( which didnt work ) : drop table small_table; CREATE TABLE small_table(id1 string, id2 string, id3 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; LOAD DATA LOCAL INPATH '/root/data/small.gz' OVERWRITE INTO TABLE small_table; select * from small_table limit 1; For gz files I do see the following lines in hive_debug : 10/02/18 01:59:23 DEBUG ipc.RPC: Call: getBlockLocations 1 10/02/18 01:59:23 DEBUG util.NativeCodeLoader: Trying to load the custom-built native-hadoop library... 10/02/18 01:59:23 DEBUG util.NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path 10/02/18 01:59:23 DEBUG util.NativeCodeLoader: java.library.path=/usr/java/jdk1.6.0_14/jre/lib/amd64/server:/usr/java/jdk1.6.0_14/jre/lib/amd64:/usr/java/jdk1.6.0_14/jre/../lib/amd64:/usr/java/packages/lib/amd64:/lib:/usr/lib 10/02/18 01:59:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 10/02/18 01:59:23 DEBUG fs.FSInputChecker: DFSClient readChunk got seqno 0 offsetInBlock 0 lastPacketInBlock true packetLen 88 aid1 bid2 cid3 But for bzip files there is none : 10/02/18 01:57:18 DEBUG ipc.RPC: Call: getBlockLocations 2 10/02/18 01:57:18 DEBUG fs.FSInputChecker: DFSClient readChunk got seqno 0 offsetInBlock 0 lastPacketInBlock true packetLen 85 10/02/18 01:57:18 WARN lazy.LazyStruct: Missing fields! Expected 3 fields but only got 1! Ignoring similar problems. BZh91AYSYǧ �Y @ TP?* ���SFL� cѶѶ�$� � �w��U�)„�=8O� NULL NULL Let me know if you still need the debug files. Attached are the small.gz and small.bzip2 files. Thanks and appreciate, -Prasen On Thu, Feb 18, 2010 at 11:52 AM, Zheng Shao zsh...@gmail.com wrote: There is no special setting for bz2. Can you get the debug log? Zheng On Wed, Feb 17, 2010 at 9:02 PM, prasenjit mukherjee pmukher...@quattrowireless.com wrote: So I tried the same with .gz files and it worked. I am using the following hadoop version :Hadoop 0.20.1+169.56 with cloudera's ami-2359bf4a. I thought that hadoop0.20 does support bz2 compression, hence same should work with hive as well. Interesting note is that Pig works fine on the same bz2 data. Is there any tweaking/config setup I need to do for hive to take bz2 files as input ? On Thu, Feb 18, 2010 at 8:31 AM, prasenjit mukherjee pmukher...@quattrowireless.com wrote: I have a similar issue with bz2 files. I have the hadoop directories : /ip/data/ : containing unzipped text files ( foo1.txt, foo2.txt ) /ip/datacompressed/ : containing same files bzipped ( foo1.bz2, foo2.bz2 ) CREATE EXTERNAL TABLE tx_log(id1 string, id2 string, id3 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002' LOCATION '/ip/datacompressed/'; SELECT * FROM tx_log limit 1; The command works fine with LOCATION '/ip/data/' but doesnt work with LOCATION '/ip/datacompressed/' Any pointers ? I thought ( like Pig ) hive automatically detects .bz2 extensions and applies appropriate decompression. Am I wrong ? -Prasen On Thu, Feb 18, 2010 at 3:04 AM, Zheng Shao zsh...@gmail.com wrote: I just corrected the wiki page. It will also be a good idea to support case-insensitive boolean values in the code. 
Zheng On Wed, Feb 17, 2010 at 9:27 AM, Brent Miller brentalanmil...@gmail.com wrote: Thanks Adam, that works for me as well. It seems that the property for hive.exec.compress.output is case sensitive, and when it is set to TRUE (as it is on the compressed storage page on the wiki) it is ignored by hive. -Brent On Tue, Feb 16, 2010 at 4:24 PM, Adam O'Donnell a...@immunet.com wrote: Adding these to my hive-site.xml file worked fine: property namehive.exec.compress.output/name valuetrue/value descriptionCompress output/description /property property namemapred.output.compression.type/name valueBLOCK/value descriptionBlock compression/description /property On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller brentalanmil...@gmail.com wrote: Hello, I've seen issues similar to this one come up once or twice before, but I haven't ever seen a solution to the problem that I'm having. I was following the Compressed Storage page on the Hive Wiki http://wiki.apache.org/hadoop/CompressedStorage and realized that the sequence files that are created in the warehouse
Re: Help with Compressed Storage
Try this one to see if it works: hive -hiveconf io.compression.codecs=xxx,yyy,zzz Zheng On Wed, Feb 17, 2010 at 11:33 PM, prasenjit mukherjee prasen@gmail.com wrote: Thanks for the pointer, that was indeed the problem. The specific AMI I was using didn't include bzip2 codecs in their hadoop-site.xml. Is there a way I can pass those parameters from hive, so that I don't need to manually change the file? -Thanks, Prasen On Thu, Feb 18, 2010 at 12:54 PM, Zheng Shao zsh...@gmail.com wrote: Just remember that we need to have the BZipCodec class in the following hadoop configuration: Can you check? io.compression.codecs Zheng -- Yours, Zheng
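The reason this setting matters: Hadoop resolves codecs from file extensions through CompressionCodecFactory, which only knows about the classes listed in io.compression.codecs, so a missing BZip2Codec entry makes .bz2 files come back as raw bytes. A small stand-alone check (the file path below is illustrative) would be:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // The factory is built from io.compression.codecs in the loaded configuration.
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(new Path("/ip/datacompressed/foo1.bz2"));
    System.out.println(codec == null
        ? "No codec registered for .bz2 -- check io.compression.codecs"
        : "Resolved codec: " + codec.getClass().getName());
  }
}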
[VOTE] release hive 0.5.0
Hive branch 0.5 was created 5 weeks ago: https://svn.apache.org/viewvc/hadoop/hive/branches/branch-0.5/ It has also been running as the production version of Hive at Facebook for 2 weeks. We'd like to start making release candidates (for 0.5.0) from branch 0.5. Please vote. -- Yours, Zheng
Re: Hive Server Leaking File Descriptors?
Can you go to that box, sudo as root, and do lsof | grep 12345 where 12345 is the process id of the hive server? We should be able to see the names of the files that are open. Zheng On Mon, Feb 15, 2010 at 7:42 AM, Andy Kent andy.k...@forward.co.uk wrote: Nope, no luck so far. We have upped the number of file descriptors and are having to restart hive every week or so :( Any other suggestions would be greatly appreciated. On 15 Feb 2010, at 14:09, Bennie Schut wrote: Did this help? I'm running into a similar problem. slowly leaking connections to 50010 and after a hive restart all is ok again. Andy Kent wrote: I can give try and give it a go. I'm not convinced though as we are working with CSV files and don't touch sequence files at all at the moment. We are using the Clodera Ubuntu Packages for Hadoop 0.20.1+133 and Hive 0.40 On 25 Jan 2010, at 15:30, Jay Booth wrote: Actually, we had an issue with this, it was a bug in SequenceFile where if there were problems opening a file, it would leave a filehandle open and never close it. Here's the patch -- It's already fixed in 0.21/trunk, if I get some time this week I'll submit it against 0.20.2 -- could you apply this to hadoop and let me know if it fixes things for you? On Mon, Jan 25, 2010 at 10:11 AM, Jay Booth jaybo...@gmail.commailto:jaybo...@gmail.com wrote: Yeah, I'd guess that this is a Hive issue, although it could be a combination.. maybe if you're doing queries and then closing your thrift connection before reading all results, Hive doesn't know what to do and leaves the connection open? Once the west coast folks wake up, they might have a better answer for you than I do. On Mon, Jan 25, 2010 at 9:06 AM, Andy Kent andy.k...@forward.co.ukmailto:andy.k...@forward.co.uk wrote: On 25 Jan 2010, at 13:59, Jay Booth wrote: That's the datanode port.. if I had to guess, Hive's connecting to DFS directly for some reason (maybe for select * queries?) and not finishing their reads or closing the connections after. Thanks for the response. That's what I was suspecting. I have triple checked and our Ruby code and it is defiantly closing it's thrift connections properly. I'll try running some different queries and see if I can suss out some examples of which ones are leaky. Is this something that I should post to Jira or is it a known issue? I can't believe other people haven't noticed this? SequenceFile.patch -- Yours, Zheng
Re: Got sun.misc.InvalidJarIndexException: Invalid index
MySQL is recommended for multiple-node deployment of Hive. Can you try MySQL? Zheng On Mon, Feb 8, 2010 at 6:32 PM, Mafish Liu maf...@gmail.com wrote: Hi, all: I'm deploying hive from node A to node B. Hive on node A works properly while on node B, when I try to create a new table, I got the following exception: 2010-02-08 10:15:38,339 ERROR exec.DDLTask (SessionState.java:printError(279)) - FAILED: Error in metadata: javax.jdo.JDOUserException: Exception during population of metadata for org.apache.hadoop.hive.metastore.model.MDatabase NestedThrowables: sun.misc.InvalidJarIndexException: Invalid index org.apache.hadoop.hive.ql.metadata.HiveException: javax.jdo.JDOUserException: Exception during population of metadata for org.apache.hadoop.hive.metastore.model.MDatabase NestedThrowables: sun.misc.InvalidJarIndexException: Invalid index at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:258) at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:879) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:103) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:379) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:285) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:123) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:181) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:287) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:165) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) Caused by: javax.jdo.JDOUserException: Exception during population of metadata for org.apache.hadoop.hive.metastore.model.MDatabase NestedThrowables: sun.misc.InvalidJarIndexException: Invalid index at org.datanucleus.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:350) at org.datanucleus.ObjectManagerImpl.getExtent(ObjectManagerImpl.java:3741) at org.datanucleus.store.rdbms.query.JDOQLQueryCompiler.compileCandidates(JDOQLQueryCompiler.java:411) at org.datanucleus.store.rdbms.query.QueryCompiler.executionCompile(QueryCompiler.java:312) at org.datanucleus.store.rdbms.query.JDOQLQueryCompiler.compile(JDOQLQueryCompiler.java:225) at org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:174) at org.datanucleus.store.query.Query.executeQuery(Query.java:1443) at org.datanucleus.store.rdbms.query.JDOQLQuery.executeQuery(JDOQLQuery.java:244) at org.datanucleus.store.query.Query.executeWithArray(Query.java:1357) at org.datanucleus.jdo.JDOQuery.execute(JDOQuery.java:242) at org.apache.hadoop.hive.metastore.ObjectStore.getMDatabase(ObjectStore.java:283) at org.apache.hadoop.hive.metastore.ObjectStore.getDatabase(ObjectStore.java:301) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:146) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:118) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:100) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.init(HiveMetaStoreClient.java:74) at 
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:783) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:794) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:252) ... 16 more Caused by: sun.misc.InvalidJarIndexException: Invalid index at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:854) at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:762) at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:732) at sun.misc.URLClassPath$1.next(URLClassPath.java:195) at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:205) at java.net.URLClassLoader$3$1.run(URLClassLoader.java:393) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader$3.next(URLClassLoader.java:390) at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:415) at sun.misc.CompoundEnumeration.next(CompoundEnumeration.java:27) at sun.misc.CompoundEnumeration.hasMoreElements(CompoundEnumeration.java:36) at
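If you do switch to MySQL, a quick way to sanity-check the connection settings before putting them into the metastore configuration (javax.jdo.option.ConnectionURL, ConnectionUserName and ConnectionPassword) is a tiny JDBC program like the sketch below. The host, database name and credentials are examples only, and it assumes the MySQL JDBC driver jar is on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;

public class MetastoreConnCheck {
  public static void main(String[] args) throws Exception {
    // Example values; substitute whatever you plan to configure for the metastore.
    Class.forName("com.mysql.jdbc.Driver");
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://metastore-host:3306/hive_metastore", "hive", "secret");
    System.out.println("Connected: " + !conn.isClosed());
    conn.close();
  }
}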
Re: SerDe issue
Hi Roberto, The reason that Text is passed in is that the table is defined as TextFile format (the default). There are some examples (*.q files) of using SequenceFile format (CREATE TABLE xxx STORED AS SEQUENCEFILE). SEQUENCEFILE will return BytesWritable by default. Please have a try. Zheng On Fri, Feb 12, 2010 at 1:05 PM, Roberto Congiu roberto.con...@openx.org wrote: Hey guys, I wrote a SerDe to support lwes (http://lwes.org) using BinarySortableSerDe as a model. The code is very similar, and I serialize an lwes event to a BytesWritable, and deserialize from it. Serialization is fine... however, when I run an insert into... select, the Deserialize method is passed a Text object instead of a BytesWritable object as expected. Hive generates 2 jobs, and it fails on the mapper in the second. getSerializedClass() is set correctly: public Class<? extends Writable> getSerializedClass() { LOG.debug("JournalSerDe::getSerializedClass()"); return BytesWritable.class; } And I don't see any relevant difference between BinarySortableSerDe and my code. Does anybody have a hint on what may be happening? Thanks, Roberto -- Yours, Zheng
Re: Hive Installation Error
What commands did you run? With which release? Zheng On Wed, Feb 10, 2010 at 11:20 PM, Vidyasagar Venkata Nallapati vidyasagar.nallap...@onmobile.com wrote: Hi, Installation is giving an error as master/hadoop/hadoop-0.20.1/build.xml:895: 'java5.home' is not defined. Forrest requires Java 5. Please pass -Djava5.home=base of Java 5 distribution to Ant on the command-line. at org.apache.tools.ant.taskdefs.Exit.execute(Exit.java:142) at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:288) at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106) at org.apache.tools.ant.Task.perform(Task.java:348) at org.apache.tools.ant.Target.execute(Target.java:357) at org.apache.tools.ant.Target.performTasks(Target.java:385) at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1337) at org.apache.tools.ant.Project.executeTarget(Project.java:1306) at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41) at org.apache.tools.ant.Project.executeTargets(Project.java:1189) at org.apache.tools.ant.Main.runBuild(Main.java:758) at org.apache.tools.ant.Main.startAnt(Main.java:217) at org.apache.tools.ant.launch.Launcher.run(Launcher.java:257) at org.apache.tools.ant.launch.Launcher.main(Launcher.java:104) Total time: 7 minutes 11 seconds Please guide on this case. Regards Vidyasagar N V DISCLAIMER: The information in this message is confidential and may be legally privileged. It is intended solely for the addressee. Access to this message by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, or distribution of the message, or any action or omission taken by you in reliance on it, is prohibited and may be unlawful. Please immediately contact the sender if you have received this message in error. Further, this e-mail may contain viruses and all reasonable precaution to minimize the risk arising there from is taken by OnMobile. OnMobile is not liable for any damage sustained by you as a result of any virus in this e-mail. All applicable virus checks should be carried out by you before opening this e-mail or any attachment thereto. Thank you - OnMobile Global Limited. -- Yours, Zheng
Re: Distributing additional files for reduce scripts
add file myfile.txt; You can find some examples in *.q files in the distribution. Zheng On Thu, Feb 11, 2010 at 10:23 PM, Adam O'Donnell a...@immunet.com wrote: Guys: How do you go about distributing additional files that may be needed by your reduce scripts? For example, I need to distribute a GeoIP database with my reduce script to do some lookups. Thanks! Adam -- Adam J. O'Donnell, Ph.D. Immunet Corporation Cell: +1 (267) 251-0070 -- Yours, Zheng
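Files registered with add file are shipped to each task's working directory, so the reduce script can open them by name. A minimal sketch of wiring a GeoIP lookup into a reduce script (the table, script, and output columns are made up):

add file GeoIP.dat;
add file geo_reducer.py;

FROM (SELECT ip, url FROM weblogs CLUSTER BY ip) t
INSERT OVERWRITE TABLE geo_hits
SELECT TRANSFORM (t.ip, t.url)
USING 'python geo_reducer.py'
AS (country, hits);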
Re: hive map reduce output
Another possible reason is that we have seen the hadoop framework sometimes fail to return the correct count to the clients. In all these cases, the count is smaller than the number of rows actually loaded. Which version of hadoop are you using? Zheng On Mon, Feb 8, 2010 at 11:27 PM, Jeff Hammerbacher ham...@cloudera.com wrote: Hey wd, Actually, what version are you running? Your bug sounds an awful lot like http://issues.apache.org/jira/browse/HIVE-327, which was fixed many moons ago. Thanks, Jeff On Mon, Feb 8, 2010 at 11:25 PM, Carl Steinbach c...@cloudera.com wrote: Hi wd, Please file a JIRA ticket for this issue. Thanks. Carl On Mon, Feb 8, 2010 at 7:05 PM, wd w...@wdicc.com wrote: hi, I've used hive map reduce to process some log files. I found that hive outputs a message like "num1 rows loaded to table_name" on every run, but num1 does not equal the result of executing select count(1) from table_name. I think this should be a bug. If we cannot count the right number, why do we output that message? -- Yours, Zheng
Re: Lzo problem throwing java.io.IOException:java.io.EOFException
Looks like an lzo codec problem. Can you try a simple mapreduce program that outputs lzo-compressed data in the same output file format as your hive table? On 2/9/10, Bennie Schut bsc...@ebuddy.com wrote: I have a bit of an edge case on using lzo which I think might be related to HIVE-524. When running a query like this: select distinct login_cldr_id as cldr_id from chatsessions_load; I get a java.io.IOException:java.io.EOFException without much of a description. I know the output should be a single value and noticed it decided to use 2 reducers. One of the reducers produced a 0 byte file which I imagine will be the cause of the IOException. If I do set mapred.reduce.tasks=1 it works correctly since there is no 0 byte file anymore. I also noticed when using gzip I don't see this problem at all. Since I use -- Sent from my mobile device Yours, Zheng
Re: Using UDFs stored on HDFS
Yes that's correct. I prefer to download the jars in add jar. Zheng On Mon, Feb 8, 2010 at 3:46 PM, Philip Zeyliger phi...@cloudera.com wrote: Hi folks, I have a quick question about UDF support in Hive. I'm on the 0.5 branch. Can you use a UDF where the jar which contains the function is on HDFS, and not on the local filesystem. Specifically, the following does not seem to work: # This is Hive 0.5, from svn $bin/hive Hive history file=/tmp/philip/hive_job_log_philip_201002081541_370227273.txt hive add jar hdfs://localhost/FooTest.jar; Added hdfs://localhost/FooTest.jar to class path hive create temporary function cube as 'com.cloudera.FooTestUDF'; FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask Does this work for other people? I could probably fix it by changing add jar to download remote jars locally, when necessary (to load them into the classpath), or update URLClassLoader (or whatever is underneath there) to read directly from HDFS, which seems a bit more fragile. But I wanted to make sure that my interpretation of what's going on is right before I have at it. Thanks, -- Philip -- Yours, Zheng
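In the meantime, copying the jar to the local filesystem and adding it from there should work; a rough sketch (the local path and the test table are made up, the class name is the one from the thread):

add jar /tmp/FooTest.jar;
create temporary function cube as 'com.cloudera.FooTestUDF';
select cube(col1) from foo_test;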
Re: LZO Compression on trunk
That seems to be a bug. Are you using hive trunk or any release? On 2/5/10, Bennie Schut bsc...@ebuddy.com wrote: I have a tab-separated file which I have loaded with load data inpath. Then I do a SET hive.exec.compress.output=true; SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec; SET mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec; select distinct login_cldr_id as cldr_id from chatsessions_load; Ended Job = job_201001151039_1641 OK NULL NULL NULL Time taken: 49.06 seconds however if I start it without the set commands I get this: Ended Job = job_201001151039_1642 OK 2283 Time taken: 45.308 seconds Which is the correct result. When I do an insert overwrite on an rcfile table it will actually compress the data correctly. When I disable compression and query this new table the result is correct. When I enable compression it's wrong again. I see no errors in the logs. Any ideas why this might happen? -- Sent from my mobile device Yours, Zheng
Re: Hive Installation Problem
HI guys, Can you have a try to make the following directory the same as mine? Once this is done, remove the build directory, and run ant package. Does this solve the problem? [zs...@dev ~/.ant] ls -lR .: total 3896 drwxr-xr-x 2 zshao users4096 Feb 5 13:04 apache-ivy-2.0.0-rc2 -rw-r--r-- 1 zshao users 3965953 Nov 4 2008 apache-ivy-2.0.0-rc2-bin.zip -rw-r--r-- 1 zshao users 0 Feb 5 13:04 apache-ivy-2.0.0-rc2.installed drwxr-xr-x 3 zshao users4096 Feb 5 13:07 cache drwxr-xr-x 2 zshao users4096 Feb 5 13:04 lib ./apache-ivy-2.0.0-rc2: total 880 -rw-r--r-- 1 zshao users 893199 Oct 28 2008 ivy-2.0.0-rc2.jar ./cache: total 4 drwxr-xr-x 3 zshao users 4096 Feb 4 19:30 hadoop ./cache/hadoop: total 4 drwxr-xr-x 3 zshao users 4096 Feb 5 13:08 core ./cache/hadoop/core: total 4 drwxr-xr-x 2 zshao users 4096 Feb 4 19:30 sources ./cache/hadoop/core/sources: total 127436 -rw-r--r-- 1 zshao users 14427013 Aug 20 2008 hadoop-0.17.2.1.tar.gz -rw-r--r-- 1 zshao users 30705253 Jan 22 2009 hadoop-0.18.3.tar.gz -rw-r--r-- 1 zshao users 42266180 Nov 13 2008 hadoop-0.19.0.tar.gz -rw-r--r-- 1 zshao users 42813980 Apr 8 2009 hadoop-0.20.0.tar.gz ./lib: total 880 -rw-r--r-- 1 zshao users 893199 Feb 5 13:04 ivy-2.0.0-rc2.jar Zheng On Fri, Feb 5, 2010 at 5:49 AM, Vidyasagar Venkata Nallapati vidyasagar.nallap...@onmobile.com wrote: Hi , We are still getting the problem [ivy:retrieve] no resolved descriptor found: launching default resolve Overriding previous definition of property ivy.version [ivy:retrieve] using ivy parser to parse file:/master/hadoop/hive/shims/ivy.xml [ivy:retrieve] :: resolving dependencies :: org.apache.hadoop.hive#shims;work...@ph1 [ivy:retrieve] confs: [default] [ivy:retrieve] validate = true [ivy:retrieve] refresh = false [ivy:retrieve] resolving dependencies for configuration 'default' [ivy:retrieve] == resolving dependencies for org.apache.hadoop.hive#shims;work...@ph1 [default] [ivy:retrieve] == resolving dependencies org.apache.hadoop.hive#shims;work...@ph1-hadoop#core;0.20.1 [default-*] [ivy:retrieve] default: Checking cache for: dependency: hadoop#core;0.20.1 {*=[*]} [ivy:retrieve] hadoop-source: no ivy file nor artifact found for hadoop#core;0.20.1 [ivy:retrieve] tried https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.1/core-0.20.1.pom And the .pom for this is not getting copied, please suggest something on this. Regards Vidyasagar N V From: baburaj.S [mailto:babura...@onmobile.com] Sent: Friday, February 05, 2010 4:59 PM To: hive-user@hadoop.apache.org Subject: RE: Hive Installation Problem No I don’t have the variable defined. Any other things that I have to check. Is this happening because I am trying for Hadoop 0.20.1 Babu From: Carl Steinbach [mailto:c...@cloudera.com] Sent: Friday, February 05, 2010 3:07 PM To: hive-user@hadoop.apache.org Subject: Re: Hive Installation Problem Hi Babu, ~/.ant/cache is the default Ivy cache directory for Hive, but if the environment variable IVY_HOME is set it will use $IVY_HOME/cache instead. Is it possible that you have this environment variable set to a value different than ~/.ant? On Fri, Feb 5, 2010 at 12:09 AM, baburaj.S babura...@onmobile.com wrote: I have tried the same but still the installation is giving the same error. I don't know if it is looking in the cache . Can we make any change in ivysettings.xml that it has to resolve the file from the file system rather through an url. 
Babu -Original Message- From: Zheng Shao [mailto:zsh...@gmail.com] Sent: Friday, February 05, 2010 12:47 PM To: hive-user@hadoop.apache.org Subject: Re: Hive Installation Problem Added to http://wiki.apache.org/hadoop/Hive/FAQ Zheng On Thu, Feb 4, 2010 at 11:11 PM, Zheng Shao zsh...@gmail.com wrote: Try this: cd ~/.ant/cache/hadoop/core/sources wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz Zheng On Thu, Feb 4, 2010 at 10:23 PM, baburaj.S babura...@onmobile.com wrote: Hello , I am new to Hadoop and is trying to install Hive now. We have the following setup at our side OS - Ubuntu 9.10 Hadoop - 0.20.1 Hive installation tried - 0.4.0 . The Hadoop is installed and is working fine . Now when we were installing Hive I got error that it couldn't resolve the dependencies. I changed the shims build and properties xml to make the dependencies look for Hadoop 0.20.1 . But now when I call the ant script I get the following error ivy-retrieve-hadoop-source: [ivy:retrieve] :: Ivy 2.0.0-rc2 - 20081028224207 :: http://ant.apache.org/ivy/ : :: loading settings :: file = /master/hive/ivy/ivysettings.xml [ivy:retrieve] :: resolving dependencies :: org.apache.hadoop.hive#shims;working [ivy:retrieve] confs: [default] [ivy:retrieve] :: resolution report :: resolve 953885ms :: artifacts dl 0ms
Re: Concurrently load data into Hive tables?
We can load data/insert overwrite data concurrently as long as they are different partitions. On Thu, Feb 4, 2010 at 6:51 AM, Ryan LeCompte lecom...@gmail.com wrote: Hey guys, Is it possible to concurrently load data into Hive tables (same table, different partition)? I'd like to concurrently execute the LOAD DATA command by two separate processes. Is Hive thread-safe in this regard? Or is it best to run the LOAD DATA commands serially? How about running two Hive queries concurrently that both output their results into different partitions of another Hive table? Thanks! Ryan -- Yours, Zheng
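In other words, two sessions loading into distinct partitions of the same table should be safe to run concurrently; a sketch with a made-up partitioned table:

-- session 1
LOAD DATA INPATH '/staging/2010-02-03' INTO TABLE page_views PARTITION (dt='2010-02-03');
-- session 2, running at the same time
LOAD DATA INPATH '/staging/2010-02-04' INTO TABLE page_views PARTITION (dt='2010-02-04');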
Re: Question about Hive supporting new Hadoop MapReduce API
We haven't had a plan yet. It will be great to draw out the pros/cons of moving to the new MapReduce API. Do you want to open a JIRA to discuss it? Zheng On Thu, Feb 4, 2010 at 5:46 PM, Schubert Zhang zson...@gmail.com wrote: Does anyone know the plan of Hive to support new Hadoop MapReduce API? In current hive 0.4 and trunk, still using deprecated Hadoop API, we want to know the plan. -- Yours, Zheng
Re: computing median and percentiles
I would say, just create a histogram of (value, count) pairs, sort at the end, and return the value at the percentile. This assumes that the number of unique values is not big, which can easily be enforced by using round(number, digits). Zheng On Thu, Feb 4, 2010 at 9:08 PM, Bryan Talbot btal...@aeriagames.com wrote: What's the best way to compute median and other percentiles using Hive 0.40? I've run across http://issues.apache.org/jira/browse/HIVE-259 but there doesn't seem to be any planned implementation yet. -Bryan -- Yours, Zheng
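A rough sketch of the histogram step (the table and column names are made up); the result is small enough that picking the value at the desired percentile can be done by walking the cumulative counts afterwards:

SELECT round(value, 2) AS bucket, count(1) AS cnt
FROM measurements
GROUP BY round(value, 2)
ORDER BY bucket;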
Re: Hive Installation Problem
Added to http://wiki.apache.org/hadoop/Hive/FAQ Zheng On Thu, Feb 4, 2010 at 11:11 PM, Zheng Shao zsh...@gmail.com wrote: Try this: cd ~/.ant/cache/hadoop/core/sources wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz Zheng On Thu, Feb 4, 2010 at 10:23 PM, baburaj.S babura...@onmobile.com wrote: Hello , I am new to Hadoop and is trying to install Hive now. We have the following setup at our side OS - Ubuntu 9.10 Hadoop - 0.20.1 Hive installation tried - 0.4.0 . The Hadoop is installed and is working fine . Now when we were installing Hive I got error that it couldn't resolve the dependencies. I changed the shims build and properties xml to make the dependencies look for Hadoop 0.20.1 . But now when I call the ant script I get the following error ivy-retrieve-hadoop-source: [ivy:retrieve] :: Ivy 2.0.0-rc2 - 20081028224207 :: http://ant.apache.org/ivy/ : :: loading settings :: file = /master/hive/ivy/ivysettings.xml [ivy:retrieve] :: resolving dependencies :: org.apache.hadoop.hive#shims;working [ivy:retrieve] confs: [default] [ivy:retrieve] :: resolution report :: resolve 953885ms :: artifacts dl 0ms - | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| - | default | 1 | 0 | 0 | 0 || 0 | 0 | - [ivy:retrieve] [ivy:retrieve] :: problems summary :: [ivy:retrieve] WARNINGS [ivy:retrieve] module not found: hadoop#core;0.20.1 [ivy:retrieve] hadoop-source: tried [ivy:retrieve] -- artifact hadoop#core;0.20.1!hadoop.tar.gz(source): [ivy:retrieve] http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz [ivy:retrieve] apache-snapshot: tried [ivy:retrieve] https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.1/core-0.20.1.pom [ivy:retrieve] -- artifact hadoop#core;0.20.1!hadoop.tar.gz(source): [ivy:retrieve] https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.1/hadoop-0.20.1.tar.gz [ivy:retrieve] maven2: tried [ivy:retrieve] http://repo1.maven.org/maven2/hadoop/core/0.20.1/core-0.20.1.pom [ivy:retrieve] -- artifact hadoop#core;0.20.1!hadoop.tar.gz(source): [ivy:retrieve] http://repo1.maven.org/maven2/hadoop/core/0.20.1/core-0.20.1.tar.gz [ivy:retrieve] :: [ivy:retrieve] :: UNRESOLVED DEPENDENCIES :: [ivy:retrieve] :: [ivy:retrieve] :: hadoop#core;0.20.1: not found [ivy:retrieve] :: [ivy:retrieve] ERRORS [ivy:retrieve] Server access Error: Connection timed out url=http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz [ivy:retrieve] Server access Error: Connection timed out url=https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.1/core-0.20.1.pom [ivy:retrieve] Server access Error: Connection timed out url=https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.1/hadoop-0.20.1.tar.gz [ivy:retrieve] Server access Error: Connection timed out url=http://repo1.maven.org/maven2/hadoop/core/0.20.1/core-0.20.1.pom [ivy:retrieve] Server access Error: Connection timed out url=http://repo1.maven.org/maven2/hadoop/core/0.20.1/core-0.20.1.tar.gz [ivy:retrieve] [ivy:retrieve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS BUILD FAILED /master/hive/build.xml:148: The following error occurred while executing this line: /master/hive/build.xml:93: The following error occurred while executing this line: /master/hive/shims/build.xml:64: The following error occurred while executing this line: /master/hive/build-common.xml:172: impossible to resolve dependencies: resolve failed - see output for details Total 
time: 15 minutes 55 seconds I have even tried to download hadoop-0.20.1.tar.gz and put it in the ant cache of the user. Still the same error is repeated. I am stuck and not able to install it. Any help on the above will be greatly appreciated. Babu
Re: Resolvers for UDAFs
Can you post the Hive query? What are the types of the parameters that you passed to the function? Zheng On Wed, Feb 3, 2010 at 3:23 AM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi, I am writing a UDAF which takes in 4 parameters. I have 2 cases - one where all the paramters are ints, and second where the last parameter is double. I wrote two evaluators for this, with iterate as public boolean iterate(int max, int groupBy, int attribute, int count) and public boolean iterate(int max, int groupBy, int attribute, double count) However, when I run a query, I get the exception: org.apache.hadoop.hive.ql.exec.AmbiguousMethodException: Ambiguous method for class org.apache.hadoop.hive.udaf.TopXPerGroup with [int, int, int, int] at org.apache.hadoop.hive.ql.exec.DefaultUDAFEvaluatorResolver.getEvaluatorClass(DefaultUDAFEvaluatorResolver.java:83) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge.getEvaluator(GenericUDAFBridge.java:57) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getGenericUDAFEvaluator(FunctionRegistry.java:594) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getGenericUDAFEvaluator(SemanticAnalyzer.java:1882) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapGroupByOperator(SemanticAnalyzer.java:2270) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapAggr1MR(SemanticAnalyzer.java:2821) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:4543) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5058) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:4999) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5020) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:4999) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5020) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:5587) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:114) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:317) at org.apache.hadoop.hive.ql.Driver.runCommand(Driver.java:370) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:362) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:140) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:200) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:311) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) One option for me is to write a resolver which I will do. But, I just wanted to know if this is a bug in hive whereby it is not able to get the write evaluator. Or if this is a gap in my understanding. I look forward to hearing your views on this. Thanks and Regards, Sonal -- Yours, Zheng
Re: Converting multiple joins into a single multi-way join
See ql/src/test/queries/clientpositive/uniquejoin.q FROM UNIQUEJOIN PRESERVE T1 a (a.key), PRESERVE T2 b (b.key), PRESERVE T3 c (c.key) SELECT a.key, b.key, c.key; FROM UNIQUEJOIN T1 a (a.key), T2 b (b.key), T3 c (c.key) SELECT a.key, b.key, c.key; FROM UNIQUEJOIN T1 a (a.key), T2 b (b.key-1), T3 c (c.key) SELECT a.key, b.key, c.key; FROM UNIQUEJOIN PRESERVE T1 a (a.key, a.val), PRESERVE T2 b (b.key, b.val), PRESERVE T3 c (c.key, c.val) SELECT a.key, a.val, b.key, b.val, c.key, c.val; FROM UNIQUEJOIN PRESERVE T1 a (a.key), T2 b (b.key), PRESERVE T3 c (c.key) SELECT a.key, b.key, c.key; FROM UNIQUEJOIN PRESERVE T1 a (a.key), T2 b(b.key) SELECT a.key, b.key; Zheng On Wed, Feb 3, 2010 at 2:07 AM, bharath v bharathvissapragada1...@gmail.com wrote: Hi , Can anyone give me an example in which there is an optimization of Converting multiple joins into a single multi-way join .. i.e., reducing the number of map-reduce jobs . I read this from hive's design document but I couldn't find any example . Can anyone point me to the same?? Kindly help, Thanks -- Yours, Zheng
Re: Converting multiple joins into a single multi-way join
https://issues.apache.org/jira/browse/HIVE-591 On Wed, Feb 3, 2010 at 1:34 PM, Zheng Shao zsh...@gmail.com wrote: See ql/src/test/queries/clientpositive/uniquejoin.q FROM UNIQUEJOIN PRESERVE T1 a (a.key), PRESERVE T2 b (b.key), PRESERVE T3 c (c.key) SELECT a.key, b.key, c.key; FROM UNIQUEJOIN T1 a (a.key), T2 b (b.key), T3 c (c.key) SELECT a.key, b.key, c.key; FROM UNIQUEJOIN T1 a (a.key), T2 b (b.key-1), T3 c (c.key) SELECT a.key, b.key, c.key; FROM UNIQUEJOIN PRESERVE T1 a (a.key, a.val), PRESERVE T2 b (b.key, b.val), PRESERVE T3 c (c.key, c.val) SELECT a.key, a.val, b.key, b.val, c.key, c.val; FROM UNIQUEJOIN PRESERVE T1 a (a.key), T2 b (b.key), PRESERVE T3 c (c.key) SELECT a.key, b.key, c.key; FROM UNIQUEJOIN PRESERVE T1 a (a.key), T2 b(b.key) SELECT a.key, b.key; Zheng On Wed, Feb 3, 2010 at 2:07 AM, bharath v bharathvissapragada1...@gmail.com wrote: Hi , Can anyone give me an example in which there is an optimization of Converting multiple joins into a single multi-way join .. i.e., reducing the number of map-reduce jobs . I read this from hive's design document but I couldn't find any example . Can anyone point me to the same?? Kindly help, Thanks -- Yours, Zheng -- Yours, Zheng
Re: intermediate data written to the disk?
If the join key is the same, you can use unique join to make sure it's done in a single map-reduce job. Zheng On Wed, Feb 3, 2010 at 1:25 AM, bharath v bharathvissapragada1...@gmail.com wrote: Hi , I have a small doubt in how hive handles queries containing join of more than 2 tables . Suppose we have 3 tables A,B,C .. and the plan is ((AB)C) .. We can join A,B in a map reduce job and join the resultant table with C. I have a doubt whether the result of AB is stored to disk before joining with C or is it streamed directly to join with C (I don't know how , just a guess) . Any help is appreciated , Thanks -- Yours, Zheng
Re: Help writing UDAF with custom object
Which version of Hive are you using? I looked at the code for trunk and cannot find PrimitiveObjectInspectorFactory.java:166 Zheng On Mon, Feb 1, 2010 at 3:41 AM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi Zheng, Thanks for your response. I had initially used ints, but due to the error I got, I changed them to Integers. I have now reverted the code to use ints as suggested by you. My problem: I have a table called products_bought which has a number of products bought by each customer ordered by count bought. I want to get the top x customers of each product. Table products_bought product_id customer_id product_count 1 1 6 1 2 5 1 3 4 2 1 8 2 2 4 2 3 1 I want the say, top 2 results per products. Which will be: product_id customer_id product_count 1 1 6 1 2 5 2 1 8 2 2 4 Solution: I create a jar with the code I sent and do the following steps in cli 1. add jar jarname 2. create temporary function topx as 'class name'; 3. select topx(2, product_id, customer_id, product_count) from products_bought The logs give me the error: 0/02/01 16:56:28 DEBUG ipc.RPC: Call: mkdirs 23 10/02/01 16:56:28 INFO parse.SemanticAnalyzer: Completed getting MetaData in Semantic Analysis 10/02/01 16:56:28 DEBUG parse.SemanticAnalyzer: Created Table Plan for products_bought org.apache.hadoop.hive.ql.exec.tablescanopera...@72d8978c 10/02/01 16:56:28 DEBUG exec.FunctionRegistry: Looking up GenericUDAF: topx FAILED: Unknown exception : Internal error: Cannot recognize int 10/02/01 16:56:28 ERROR ql.Driver: FAILED: Unknown exception : Internal error: Cannot recognize int java.lang.RuntimeException: Internal error: Cannot recognize int at org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory.getPrimitiveObjectInspectorFromClass(PrimitiveObjectInspectorFactory.java:166) at org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$PrimitiveConversionHelper.init(GenericUDFUtils.java:197) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge$GenericUDAFBridgeEvaluator.init(GenericUDAFBridge.java:123) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getGenericUDAFInfo(SemanticAnalyzer.java:1592) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapGroupByOperator(SemanticAnalyzer.java:1912) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapAggr1MR(SemanticAnalyzer.java:2452) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:3733) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:4184) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:4425) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:76) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:249) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:281) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:123) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:181) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:287) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) I am going through the code mentioned by Zheng to see if there is something wrong I am doing. 
At this point of time, I think my main concern is to get the function to output something and to verify that Hive specific hooks are in place. If you have any suggestions, please do let me know. Thanks and Regards, Sonal On Mon, Feb 1, 2010 at 1:19 PM, Zheng Shao zsh...@gmail.com wrote: The first problem is: private Integer key; private Integer attribute; private Integer count; Java Integer objects are non-modifiable, which means we have to create a new object per row (which in turn makes the code really inefficient). You can change it to private int to make it efficient (and also works for Hive). Second, can you post your Hive query? It seems your code does not do what you want. You might want to take a look at http://issues.apache.org/jira/browse/HIVE-894 for the UDAF max_n and see how that works for Hive. Zheng On Sun, Jan 31, 2010 at 9:38 PM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi, I am writing a UDAF which returns the top x results per key. Lets say my input is key attribute count 1 1 6 1 2 5 1 3 4
Re: Resolvers for UDAFs
Hi Sonal, 1. We usually move the group_by column out of the UDAF - just like we do SELECT key, sum(value) FROM table. I think you should write: SELECT customer_id, topx(2, product_id, product_count) FROM products_bought and in topx: public boolean iterate(int max, int attribute, int count). 2. Can you run describe products_bought? Does product_count column have type int? You might want to try removing the other interate function to see whether that solves the problem. Zheng On Wed, Feb 3, 2010 at 9:58 PM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi Zheng, My query is: select a.myTable.key, a.myTable.attribute, a.myTable.count from (select explode (t.pc) as myTable from (select topx(2, product_id, customer_id, product_count) as pc from (select product_id, customer_id, product_count from products_bought order by product_id, product_count desc) r ) t )a; My overloaded iterators are: public boolean iterate(int max, int groupBy, int attribute, int count) public boolean iterate(int max, int groupBy, int attribute, double count) Before overloading, my query was running fine. My table products_bought is: product_id int, customer_id int, product_count int And I get: FAILED: Error in semantic analysis: Ambiguous method for class org.apache.hadoop.hive.udaf.TopXPerGroup with [int, int, int, int] The hive logs say: 2010-02-03 11:18:15,721 ERROR processors.DeleteResourceProcessor (SessionState.java:printError(255)) - Usage: delete [FILE|JAR|ARCHIVE] value [value]* 2010-02-03 11:22:14,663 ERROR ql.Driver (SessionState.java:printError(255)) - FAILED: Error in semantic analysis: Ambiguous method for class org.apache.hadoop.hive.udaf.TopXPerGroup with [int, int, int, int] org.apache.hadoop.hive.ql.exec.AmbiguousMethodException: Ambiguous method for class org.apache.hadoop.hive.udaf.TopXPerGroup with [int, int, int, int] at org.apache.hadoop.hive.ql.exec.DefaultUDAFEvaluatorResolver.getEvaluatorClass(DefaultUDAFEvaluatorResolver.java:83) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge.getEvaluator(GenericUDAFBridge.java:57) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getGenericUDAFEvaluator(FunctionRegistry.java:594) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getGenericUDAFEvaluator(SemanticAnalyzer.java:1882) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapGroupByOperator(SemanticAnalyzer.java:2270) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapAggr1MR(SemanticAnalyzer.java:2821) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:4543) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5058) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:4999) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5020) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:4999) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5020) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:5587) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:114) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:317) at org.apache.hadoop.hive.ql.Driver.runCommand(Driver.java:370) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:362) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:140) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:200) at 
org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:311) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Thanks and Regards, Sonal On Thu, Feb 4, 2010 at 12:12 AM, Zheng Shao zsh...@gmail.com wrote: Can you post the Hive query? What are the types of the parameters that you passed to the function? Zheng On Wed, Feb 3, 2010 at 3:23 AM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi, I am writing a UDAF which takes in 4 parameters. I have 2 cases - one where all the paramters are ints, and second where the last parameter is double. I wrote two evaluators for this, with iterate as public boolean iterate(int max, int groupBy, int attribute, int count) and public boolean iterate(int max, int groupBy, int attribute, double count) However, when I run a query, I get the exception: org.apache.hadoop.hive.ql.exec.AmbiguousMethodException
Re: Resolvers for UDAFs
Yes it should be: SELECT customer_id, topx(2, product_id, product_count) FROM products_bought GROUP BY customer_id; On Wed, Feb 3, 2010 at 11:31 PM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi Zheng, Wouldnt the query you mentioned need a group by clause? I need the top x customers per product id. Sorry, can you please explain. Thanks and Regards, Sonal On Thu, Feb 4, 2010 at 12:07 PM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi Zheng, Thanks for your email and your feedback. I will try to change the code as suggested by you. Here is the output of describe: hive describe products_bought; OK product_id int customer_id int product_count int My function was working fine earlier with this table and iterate(int, int, int, int). Once I introduced the other iterate, it stopped working. Thanks and Regards, Sonal On Thu, Feb 4, 2010 at 11:37 AM, Zheng Shao zsh...@gmail.com wrote: Hi Sonal, 1. We usually move the group_by column out of the UDAF - just like we do SELECT key, sum(value) FROM table. I think you should write: SELECT customer_id, topx(2, product_id, product_count) FROM products_bought and in topx: public boolean iterate(int max, int attribute, int count). 2. Can you run describe products_bought? Does product_count column have type int? You might want to try removing the other interate function to see whether that solves the problem. Zheng On Wed, Feb 3, 2010 at 9:58 PM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi Zheng, My query is: select a.myTable.key, a.myTable.attribute, a.myTable.count from (select explode (t.pc) as myTable from (select topx(2, product_id, customer_id, product_count) as pc from (select product_id, customer_id, product_count from products_bought order by product_id, product_count desc) r ) t )a; My overloaded iterators are: public boolean iterate(int max, int groupBy, int attribute, int count) public boolean iterate(int max, int groupBy, int attribute, double count) Before overloading, my query was running fine. 
My table products_bought is: product_id int, customer_id int, product_count int And I get: FAILED: Error in semantic analysis: Ambiguous method for class org.apache.hadoop.hive.udaf.TopXPerGroup with [int, int, int, int] The hive logs say: 2010-02-03 11:18:15,721 ERROR processors.DeleteResourceProcessor (SessionState.java:printError(255)) - Usage: delete [FILE|JAR|ARCHIVE] value [value]* 2010-02-03 11:22:14,663 ERROR ql.Driver (SessionState.java:printError(255)) - FAILED: Error in semantic analysis: Ambiguous method for class org.apache.hadoop.hive.udaf.TopXPerGroup with [int, int, int, int] org.apache.hadoop.hive.ql.exec.AmbiguousMethodException: Ambiguous method for class org.apache.hadoop.hive.udaf.TopXPerGroup with [int, int, int, int] at org.apache.hadoop.hive.ql.exec.DefaultUDAFEvaluatorResolver.getEvaluatorClass(DefaultUDAFEvaluatorResolver.java:83) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge.getEvaluator(GenericUDAFBridge.java:57) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getGenericUDAFEvaluator(FunctionRegistry.java:594) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getGenericUDAFEvaluator(SemanticAnalyzer.java:1882) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapGroupByOperator(SemanticAnalyzer.java:2270) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapAggr1MR(SemanticAnalyzer.java:2821) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:4543) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5058) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:4999) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5020) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:4999) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5020) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:5587) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:114) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:317) at org.apache.hadoop.hive.ql.Driver.runCommand(Driver.java:370) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:362) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:140) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:200) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:311) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method
Re: SequenceFile compression on Amazon EMR not very good
I would first check whether it is really using block compression or record compression. Also, maybe the block size is too small, but I am not sure whether that is tunable in SequenceFile. Zheng On Sun, Jan 31, 2010 at 9:03 PM, Saurabh Nanda saurabhna...@gmail.com wrote: Hi, The size of my Gzipped weblog files is about 35MB. However, upon enabling block compression, and inserting the logs into another Hive table (sequencefile), the file size bloats up to about 233MB. I've done similar processing on a local Hadoop/Hive cluster, and while the compression is not as good as gzipping, it still is not this bad. What could be going wrong? I looked at the header of the resulting file and here's what it says: SEQ^Forg.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec Does Amazon Elastic MapReduce behave differently or am I doing something wrong? Saurabh. -- http://nandz.blogspot.com http://foodieforlife.blogspot.com -- Yours, Zheng
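For reference, these are the knobs usually involved when writing block-compressed SequenceFiles from Hive; the property names below are from the Hadoop 0.20-era configuration and are worth double-checking against the version EMR runs:

SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;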
Re: UDAF/UDTF question
The easiest way to go is to write a UDAF to return the answer in array<struct<decile:int, value:double>>. Then you can do: (note that explode is a predefined UDTF) SELECT tmp.key, tmp2.d.decile, tmp2.d.value FROM (SELECT key, Decile(value) as deciles GROUP BY key) tmp LATERAL VIEW explode(tmp.deciles) tmp2 AS d Zheng On Thu, Jan 28, 2010 at 2:07 PM, Jason Michael jmich...@videoegg.com wrote: Hello all, What would be the best way to write a function that would perform aggregation computations on records in a table and return multiple rows (and possibly columns)? For example, imagine a function called DECILES that computes all the deciles for a given measure and returns them as 10 rows with 2 columns, decile and value. It seems like what I want is some sort of combination of a UDAF and a UDTF. Does such an animal exist in the Hive world? Jason -- Yours, Zheng
Re: help!
Can you take a look at /tmp/user/hive.log? There should be some exceptions there. Zheng On Wed, Jan 27, 2010 at 7:59 PM, Fu Ecy fuzhijie1...@gmail.com wrote: I want to load some files on HDFS to a hive table, but there is an execption as follow: hive load data inpath '/group/taobao/taobao/dw/stb/20100125/collect_info/*' into table collect_info; Loading data to table collect_info Failed with exception addFiles: error while moving files!!! FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask But, when I download the files from HDFS to local machine, then load them into the table, it works. Data in '/group/taobao/taobao/dw/stb/20100125/collect_info/*' is a little more than 200GB. I need to use the Hive to make some statistics. much thanks :-) -- Yours, Zheng
Re: help!
When Hive loads data from HDFS, it moves the files instead of copying the files. That means the current user should have write permissions to the source files/directories as well. Can you check that? Zheng On Wed, Jan 27, 2010 at 11:18 PM, Fu Ecy fuzhijie1...@gmail.com wrote: property namehive.metastore.warehouse.dir/name value/group/tbdev/kunlun/henshao/hive//value descriptionlocation of default database for the warehouse/description /property property namehive.exec.scratchdir/name value/group/tbdev/kunlun/henshao/hive/temp/value descriptionScratch space for Hive jobs/description /property [kun...@gate2 ~]$ hive --config config/ -u root -p root Hive history file=/tmp/kunlun/hive_job_log_kunlun_201001281514_422659187.txt hive create table pokes (foo int, bar string); OK Time taken: 0.825 seconds Yes, I have the permission for Hive's warehouse directory and tmp directory. 2010/1/28 김영우 warwit...@gmail.com Hi Fu, Your query seems correct but I think, It's a problem related HDFS permission. Did you set right permission for Hive's warehouse directory and tmp directory? Seems user 'kunlun' does not have WRITE permission for hive warehouse directory. Youngwoo 2010/1/28 Fu Ecy fuzhijie1...@gmail.com 2010-01-27 12:58:22,182 ERROR ql.Driver (SessionState.java:printError(303)) - FAILED: Parse Error: line 2:10 cannot recognize input ',' in column type org.apache.hadoop.hive.ql.parse.ParseException: line 2:10 cannot recognize input ',' in column type at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:357) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:249) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:290) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:163) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:221) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:335) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:165) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) 2010-01-27 12:58:40,394 ERROR hive.log (MetaStoreUtils.java:logAndThrowMetaException(570)) - Got exception: org.apache.hadoop .security.AccessControlException org.apache.hadoop.security.AccessControlException: Permission denied: user=kunlun, access=WR ITE, inode=user:hadoop:cug-admin:rwxr-xr-x 2010-01-27 12:58:40,395 ERROR hive.log (MetaStoreUtils.java:logAndThrowMetaException(571)) - org.apache.hadoop.security.Acces sControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=kunlun, access=WRITE, inode=us er:hadoop:cug-admin:rwxr-xr-x at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:96) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:58) at 
org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:831) at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:257) at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1118) at org.apache.hadoop.hive.metastore.Warehouse.mkdirs(Warehouse.java:123) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table(HiveMetaStore.java:505) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:256) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:254) at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:883) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:105) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:388) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:294) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:163) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:221) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:335) at
Re: help!
Please see http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL for how to use External table. You don't need to load into external table because external table can directly point to your data directory. Zheng On Wed, Jan 27, 2010 at 11:38 PM, Fu Ecy fuzhijie1...@gmail.com wrote: hive CREATE EXTERNAL TABLE collect_info ( id string, t1 string, t2 string, t3 string, t4 string, t5 string, collector string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE; OK Time taken: 0.234 seconds hive load data inpath '/group/taobao/taobao/dw/stb/20100125/collect_info/coll_9.collect_info575' overwrite into table collect_info; Loading data to table collect_info Failed with exception replaceFiles: error while moving files!!! FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask It doesn't wok. 2010/1/28 Fu Ecy fuzhijie1...@gmail.com I think this is the problem, I don't have the write permissions to the source files/directories. Thank you, Shao :-) 2010/1/28 Zheng Shao zsh...@gmail.com When Hive loads data from HDFS, it moves the files instead of copying the files. That means the current user should have write permissions to the source files/directories as well. Can you check that? Zheng On Wed, Jan 27, 2010 at 11:18 PM, Fu Ecy fuzhijie1...@gmail.com wrote: property namehive.metastore.warehouse.dir/name value/group/tbdev/kunlun/henshao/hive//value descriptionlocation of default database for the warehouse/description /property property namehive.exec.scratchdir/name value/group/tbdev/kunlun/henshao/hive/temp/value descriptionScratch space for Hive jobs/description /property [kun...@gate2 ~]$ hive --config config/ -u root -p root Hive history file=/tmp/kunlun/hive_job_log_kunlun_201001281514_422659187.txt hive create table pokes (foo int, bar string); OK Time taken: 0.825 seconds Yes, I have the permission for Hive's warehouse directory and tmp directory. 2010/1/28 김영우 warwit...@gmail.com Hi Fu, Your query seems correct but I think, It's a problem related HDFS permission. Did you set right permission for Hive's warehouse directory and tmp directory? Seems user 'kunlun' does not have WRITE permission for hive warehouse directory. 
Youngwoo 2010/1/28 Fu Ecy fuzhijie1...@gmail.com 2010-01-27 12:58:22,182 ERROR ql.Driver (SessionState.java:printError(303)) - FAILED: Parse Error: line 2:10 cannot recognize input ',' in column type org.apache.hadoop.hive.ql.parse.ParseException: line 2:10 cannot recognize input ',' in column type at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:357) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:249) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:290) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:163) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:221) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:335) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:165) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) 2010-01-27 12:58:40,394 ERROR hive.log (MetaStoreUtils.java:logAndThrowMetaException(570)) - Got exception: org.apache.hadoop .security.AccessControlException org.apache.hadoop.security.AccessControlException: Permission denied: user=kunlun, access=WR ITE, inode=user:hadoop:cug-admin:rwxr-xr-x 2010-01-27 12:58:40,395 ERROR hive.log (MetaStoreUtils.java:logAndThrowMetaException(571)) - org.apache.hadoop.security.Acces sControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=kunlun, access=WRITE, inode=us er:hadoop:cug-admin:rwxr-xr-x at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:96
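To make the external-table suggestion earlier in this thread concrete: with an external table the LOCATION clause points Hive at the existing directory, so no load (and hence no move requiring write permission on the source) is needed. Reusing the table definition and path from the thread:

CREATE EXTERNAL TABLE collect_info (
  id string, t1 string, t2 string, t3 string,
  t4 string, t5 string, collector string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/group/taobao/taobao/dw/stb/20100125/collect_info';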
Re: Can not run hive 0.4.1
Can you post the traces in /tmp/user/hive.log? Zheng On Tue, Jan 26, 2010 at 12:40 AM, Jeff Zhang zjf...@gmail.com wrote: Hi all, I follow the get started wiki page, but I use the hive 0.4.1 release version rather than svn trunk. And when I invoke command: show tables; It shows the following error message, anyone has encounter this problem before ? hive show tables; FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask -- Best Regards Jeff Zhang -- Yours, Zheng
Re: Can not run hive 0.4.1
This usually happens when there is a problem in the metastore configuration. Did you change any hive configurations? Zheng On Tue, Jan 26, 2010 at 1:41 AM, Jeff Zhang zjf...@gmail.com wrote: The following is the logs: 2010-01-26 17:23:51,509 ERROR exec.DDLTask (SessionState.java:printError(279)) - FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException org.apache.hadoop.hive.ql.metadata.HiveException: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException at org.apache.hadoop.hive.ql.metadata.Hive.getTablesByPattern(Hive.java:400) at org.apache.hadoop.hive.ql.metadata.Hive.getAllTables(Hive.java:387) at org.apache.hadoop.hive.ql.exec.DDLTask.showTables(DDLTask.java:352) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:143) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:379) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:285) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:123) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:181) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:287) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Caused by: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1186) at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:803) at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:698) at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:161) at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:178) at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:122) at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:101) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:130) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:146) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:118) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:100) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.init(HiveMetaStoreClient.java:74) at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:783) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:794) at org.apache.hadoop.hive.ql.metadata.Hive.getTablesByPattern(Hive.java:398) ... 
13 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at javax.jdo.JDOHelper$16.run(JDOHelper.java:1956) at java.security.AccessController.doPrivileged(Native Method) at javax.jdo.JDOHelper.invoke(JDOHelper.java:1951) at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1159) ... 29 more On Tue, Jan 26, 2010 at 4:52 PM, Zheng Shao zsh...@gmail.com wrote: Can you post the traces in /tmp/user/hive.log? Zheng On Tue, Jan 26, 2010 at 12:40 AM, Jeff Zhang zjf...@gmail.com wrote: Hi all, I follow the get started wiki page, but I use the hive 0.4.1 release version rather than svn trunk. And when I invoke command: show tables; It shows the following error message, anyone has encounter this problem before ? hive show tables; FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask -- Best Regards Jeff Zhang -- Yours, Zheng -- Best Regards Jeff Zhang -- Yours, Zheng
Re: Can not run hive 0.4.1
In which directory did you run hive? Try ant package -Doffline=true on hive trunk. Zheng On Tue, Jan 26, 2010 at 2:14 AM, Jeff Zhang zjf...@gmail.com wrote: No, I did not change anything. and BTW, I sync the Hive from svn, but can not build it, the following is the error message: [ivy:retrieve] :: resolution report :: resolve 7120ms :: artifacts dl 454644ms - | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| - | default | 4 | 0 | 0 | 0 || 4 | 0 | - [ivy:retrieve] [ivy:retrieve] :: problems summary :: [ivy:retrieve] WARNINGS [ivy:retrieve] [FAILED ] hadoop#core;0.18.3!hadoop.tar.gz(source): Downloaded file size doesn't match expected Content Length for http://archive.apache.org/dist/hadoop/core/hadoop-0.18.3/hadoop-0.18.3.tar.gz. Please retry. (154498ms) [ivy:retrieve] [FAILED ] hadoop#core;0.18.3!hadoop.tar.gz(source): (0ms) [ivy:retrieve] hadoop-source: tried [ivy:retrieve] http://archive.apache.org/dist/hadoop/core/hadoop-0.18.3/hadoop-0.18.3.tar.gz [ivy:retrieve] apache-snapshot: tried [ivy:retrieve] https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.18.3/hadoop-0.18.3.tar.gz [ivy:retrieve] maven2: tried [ivy:retrieve] http://repo1.maven.org/maven2/hadoop/core/0.18.3/core-0.18.3.tar.gz [ivy:retrieve] [FAILED ] hadoop#core;0.19.0!hadoop.tar.gz(source): Downloaded file size doesn't match expected Content Length for http://archive.apache.org/dist/hadoop/core/hadoop-0.19.0/hadoop-0.19.0.tar.gz. Please retry. (153130ms) [ivy:retrieve] [FAILED ] hadoop#core;0.19.0!hadoop.tar.gz(source): (0ms) [ivy:retrieve] hadoop-source: tried [ivy:retrieve] http://archive.apache.org/dist/hadoop/core/hadoop-0.19.0/hadoop-0.19.0.tar.gz [ivy:retrieve] apache-snapshot: tried [ivy:retrieve] https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.19.0/hadoop-0.19.0.tar.gz [ivy:retrieve] maven2: tried [ivy:retrieve] http://repo1.maven.org/maven2/hadoop/core/0.19.0/core-0.19.0.tar.gz [ivy:retrieve] [FAILED ] hadoop#core;0.20.0!hadoop.tar.gz(source): Downloaded file size doesn't match expected Content Length for http://archive.apache.org/dist/hadoop/core/hadoop-0.20.0/hadoop-0.20.0.tar.gz. Please retry. 
(147000ms) [ivy:retrieve] [FAILED ] hadoop#core;0.20.0!hadoop.tar.gz(source): (0ms) [ivy:retrieve] hadoop-source: tried [ivy:retrieve] http://archive.apache.org/dist/hadoop/core/hadoop-0.20.0/hadoop-0.20.0.tar.gz [ivy:retrieve] apache-snapshot: tried [ivy:retrieve] https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.0/hadoop-0.20.0.tar.gz [ivy:retrieve] maven2: tried [ivy:retrieve] http://repo1.maven.org/maven2/hadoop/core/0.20.0/core-0.20.0.tar.gz [ivy:retrieve] :: [ivy:retrieve] :: FAILED DOWNLOADS :: [ivy:retrieve] :: ^ see resolution messages for details ^ :: [ivy:retrieve] :: [ivy:retrieve] :: hadoop#core;0.18.3!hadoop.tar.gz(source) [ivy:retrieve] :: hadoop#core;0.19.0!hadoop.tar.gz(source) [ivy:retrieve] :: hadoop#core;0.20.0!hadoop.tar.gz(source) [ivy:retrieve] :: [ivy:retrieve] [ivy:retrieve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS BUILD FAILED /root/Hive_trunk/build.xml:148: The following error occurred while executing this line: /root/Hive_trunk/build.xml:93: The following error occurred while executing this line: /root/Hive_trunk/shims/build.xml:55: The following error occurred while executing this line: /root/Hive_trunk/build-common.xml:173: impossible to resolve dependencies: resolve failed - see output for details On Tue, Jan 26, 2010 at 6:04 PM, Zheng Shao zsh...@gmail.com wrote: This usually happens when there is a problem in the metastore configuration. Did you change any hive configurations? Zheng On Tue, Jan 26, 2010 at 1:41 AM, Jeff Zhang zjf...@gmail.com wrote: The following is the logs: 2010-01-26 17:23:51,509 ERROR exec.DDLTask (SessionState.java:printError(279)) - FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException org.apache.hadoop.hive.ql.metadata.HiveException: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables
Re: How can I implement a cursor in Hive? ...or... Can I implement a CROSS APPLY in Hive?...or... How can I do a FOR or WHILE loop (inside or outside) of Hive?
We can use a combination of UDAF and LATERAL VIEW to implement what you want. 1. Define a UDAF like this: max_n(5, products_bought, customer_id) which returns the top 5 products_bought and their customer_id in type of arraystructcol0:int,col1:int 2. Use the Lateral views (with explode) to transform a single row into multiple rows. SELECT t.product_id, t5.products_bought, t5.customer_id FROM ( SELECT product_id, max_n(5, products_bought, customer_id) as top5 FROM temp GROUP BY product_id) t LATERAL VIEW explode(t.top5) t5 AS products_bought, customer_id; See http://wiki.apache.org/hadoop/Hive/LanguageManual/LateralView Paul is the author of UDTF and Lateral view. He might be able to give you more details. Zheng On Mon, Jan 25, 2010 at 10:47 PM, Mike Roberts m...@spyfu.com wrote: I'm trying to use Hive to solve a fairly common SQL scenario that I run into. I have boiled the problem down into its most basic form: You have a table of transactions defined as so: CREATE TABLE transactions (product_id INT, customer_id INT) || |--Transactions--| |---product_id (INT)-| |---customer_id(INT)-| || The goal is simple: For each product, produce a list of the top 5 largest customers. So, the base query would look like this: SELECT product_id, customer_id, count(*) as products_bought FROM transactions GROUP BY product_id, customer_id You could insert that value into another table called products_bought defined as: CREATE TABLE prod_bought (product_id INT, customer_id INT, products_bought INT) Now you have an intermediate result that tells you how many times each customer bought each product. But, obviously, that doesn't completely solve the problem. At this point, in order to solve the problem, you'd have to use a cursor or a CROSS APPLY. Here's an example in T-SQL: --THE CURSOR METHOD: DECLARE @productId int; DECLARE product_cur CURSOR FOR SELECT DISTINCT product_id FROM transactions t OPEN product_cur FETCH product_cur into @productId WHILE (@@FETCH_STATUS -1) BEGIN FETCH product_cur into @productId INSERT top_customers_by_product SELECT TOP 5 product_id, customer_id, products_bought FROM prod_bought WHERE product_id = @productId ORDER BY products_bought desc END CLOSE Domains DEALLOCATE Domains --THE CROSS APPLY METHOD: --First create a user defined function CREATE FUNCTION dbo.fn_GetTopXCustomers(@ProductId INT) RETURNS TABLE AS RETURN SELECT TOP 5 product_id, customer_id, products_bought FROM prod_bought WHERE product_id = @productId ORDER BY products_bought desc GO --Build a table of distinct product Ids SELECT DISTINCT product_id INTO temp_distinct_product_ids FROM transactions --Run the CROSS APPLY SELECT A.product_id , A.customer_id , A.products_bought INTO top_customers_by_product FROM temp_distinct_product_ids T CROSS APPLY dbo.fn_GetTopXCustomers(T.product_id) A Okay, so there are two ways I could solve the problem in SQL (CROSS APPLY is dramatically faster for anyone that cares). How can I do the same thing in Hive? Here's the question restated: How can I implement a cursor in Hive? How can I do a for or while loop in Hive? Can I implement a CROSS APPLY in Hive? I realize that I can implement a cursor outside of Hive and just execute the same Hive script over and over and over again. And, that's not a horrible solution as long as it leverages the full power of Hadoop. My concern is that each of the individual queries that is run inside are fairly inexpensive, but the total number of products makes the total job *very* expensive. 
Also, the solution should be reusable -- I'd really prefer not to write a custom jar every time I run into this problem. Actually, I’m also not particularly religious about using Hive. If there’s some other tech that does what I need, that’s cool too. Thanks in advance. Mike Roberts -- Yours, Zheng
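For concreteness, here is a minimal end-to-end sketch of the approach above against Mike's transactions table. It assumes the custom max_n UDAF from step 1 has already been written and registered (it is not a Hive built-in), and it mirrors the lateral-view query from Zheng's reply:

SELECT t.product_id, t5.products_bought, t5.customer_id
FROM (
  SELECT product_id, max_n(5, products_bought, customer_id) AS top5
  FROM (
    -- the intermediate "prod_bought" aggregation, inlined as a subquery
    SELECT product_id, customer_id, count(*) AS products_bought
    FROM transactions
    GROUP BY product_id, customer_id
  ) pb
  GROUP BY product_id
) t
LATERAL VIEW explode(t.top5) t5 AS products_bought, customer_id;

The inner query replaces the cursor/CROSS APPLY step: the per-product top-5 selection happens inside the UDAF, so the whole job runs as ordinary map-reduce stages instead of a per-product loop.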
Re: Error after loading data
Hi Ankit, org.apache.hadoop.mapreduce.lib.input.XmlInputFormat implements the new mapreduce InputFormat API, while Hive needs an InputFormat that implements org.apache.hadoop.mapred.InputFormat (the old API). This might work: http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/api/edu/umd/cloud9/collection/XMLInputFormat.html Or you might want to adapt the XMLInputFormat to the old API so Hive can read from it. Zheng On Fri, Jan 22, 2010 at 10:58 AM, ankit bhatnagar abhatna...@gmail.com wrote: Hi all, I am loading data from an xml file into a hive schema. add jar build/contrib/hadoop-mapred-0.22.0-SNAPSHOT.jar CREATE TABLE IF NOT EXISTS PARSE_XML( column1 String, column2 String ) STORED AS INPUTFORMAT 'org.apache.hadoop.mapreduce.lib.input.XmlInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'; LOAD DATA LOCAL INPATH './hive-svn/build/dist/examples/files/upload.xml' OVERWRITE INTO TABLE PARSE_XML; I was able to create the table; however, I got the following error - FAILED: Error in semantic analysis: line 1:14 Input Format must implement InputFormat parse_xml - when I do a select on the table. Ankit -- Yours, Zheng
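For illustration, a hedged sketch of what the DDL could look like once an old-API InputFormat is on the classpath; the cloud9 class linked above is one candidate, and the jar path here is hypothetical:

ADD JAR /path/to/cloud9.jar;  -- hypothetical path to the jar containing an old-API XMLInputFormat

CREATE TABLE IF NOT EXISTS parse_xml (
  column1 STRING,
  column2 STRING
)
STORED AS
  INPUTFORMAT 'edu.umd.cloud9.collection.XMLInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat';

The only change from the original DDL is the INPUTFORMAT class; everything else, including the LOAD DATA statement, stays the same.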
Re: Deleted input files after load
If you want the files to stay there, you can try CREATE EXTERNAL TABLE with a location (instead of create table + load). Zheng On Fri, Jan 22, 2010 at 10:51 AM, Bill Graham billgra...@gmail.com wrote: Hive doesn't delete the files upon load; it moves them to a location under the Hive warehouse directory. Try looking under /user/hive/warehouse/t_word_count. On Fri, Jan 22, 2010 at 10:44 AM, Shiva shiv...@gmail.com wrote: Hi, For the first time I used Hive to load a couple of word count data input files into tables, with and without OVERWRITE. Both times the input file in HDFS got deleted. Is that expected behavior? I couldn't find any definitive answer on the Hive wiki. hive LOAD DATA INPATH '/user/vmplanet/output/part-0' OVERWRITE INTO TABLE t_word_count; Env.: Using Hadoop 0.20.1 and the latest Hive on Ubuntu 9.10 running in VMware. Thanks, Shiva -- Yours, Zheng
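A minimal sketch of the EXTERNAL TABLE approach Zheng describes; the column layout is invented for illustration and should match however the word-count files are actually formatted:

CREATE EXTERNAL TABLE t_word_count (
  word STRING,
  cnt INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/vmplanet/output/';
-- No LOAD DATA step is needed: Hive reads the files where they already are,
-- and dropping the table later removes only the metadata, not the files.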
Re: hive multiple inserts
https://issues.apache.org/jira/browse/HIVE-634 As far as I know there is nobody working on that right now. If you are interested, we can work together on that. Let's move the discussion to the JIRA. Zheng On Tue, Jan 12, 2010 at 3:27 AM, Anty anty@gmail.com wrote: Thanks Zheng. Yes, we have used RegexSerDe in some use cases, but the speed is indeed slower, so we don't want to use regular expressions if not necessary. I found HIVE-634 https://issues.apache.org/jira/browse/HIVE-634 is what I need, allowing the user to specify a field delimiter in any format: INSERT OVERWRITE LOCAL DIRECTORY '/mnt/daily_timelines' [ ROW FORMAT DELIMITED | SERDE ... ] [ FILE FORMAT ...] SELECT * FROM daily_timelines; Is somebody still working on this feature? On Tue, Jan 12, 2010 at 2:28 PM, Zheng Shao zsh...@gmail.com wrote: Yes, we only support one-byte delimiters for performance reasons. You can use the RegexSerDe in the contrib package for any row format that allows a regular expression (including your case), but the speed will be slower. Zheng On Mon, Jan 11, 2010 at 5:54 PM, Anty anty@gmail.com wrote: Thanks Zheng. It does work. I have another question: if the field delimiter is a string, it looks like LazySimpleSerDe can't work. Does LazySimpleSerDe not support string field delimiters, only one-byte control characters? On Tue, Jan 12, 2010 at 3:05 AM, Zheng Shao zsh...@gmail.com wrote: For your second question, currently we can do it with a little extra work: 1. Create an external table on the target directory with the field delimiter you want; 2. Run the query and insert overwrite the target external table. For the first question we can also do a similar thing (create a bunch of external tables and then insert), but I think we should fix the problem. Zheng On Mon, Jan 11, 2010 at 8:31 AM, Anty anty@gmail.com wrote: Hi: I came across the same problem, there is no data. I have one more question: can I specify the field delimiter for the output file, not just the default ctrl-A field delimiter? On Fri, Jan 8, 2010 at 2:23 PM, wd w...@wdicc.com wrote: Hi, I've tried the hive svn version; it seems this bug still exists. svn st -v 896805 896744 namit . 896805 894292 namit eclipse-templates 896805 894292 namit eclipse-templates/.classpath 896805 765509 zshao eclipse-templates/TestHive.launchtemplate 896805 765509 zshao eclipse-templates/TestMTQueries.l .. svn revision 896805? The following is the execution log. hive from test INSERT OVERWRITE LOCAL DIRECTORY '/home/stefdong/tmp/0' select * where a = 1 INSERT OVERWRITE LOCAL DIRECTORY '/home/stefdong/tmp/1' select * where a = 3; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201001071716_4691, Tracking URL = http://abc.com:50030/jobdetails.jsp?jobid=job_201001071716_4691 Kill Command = hadoop job -Dmapred.job.tracker=abc.com:9001 -kill job_201001071716_4691 2010-01-08 14:14:55,442 Stage-2 map = 0%, reduce = 0% 2010-01-08 14:15:00,643 Stage-2 map = 100%, reduce = 0% Ended Job = job_201001071716_4691 Copying data to local directory /home/stefdong/tmp/0 Copying data to local directory /home/stefdong/tmp/0 13 Rows loaded to /home/stefdong/tmp/0 9 Rows loaded to /home/stefdong/tmp/1 OK Time taken: 9.409 seconds thx.
2010/1/6 wd w...@wdicc.com hi, A single insert can extract data into '/tmp/out/1'. I can even see xxx rows loaded to '/tmp/out/0', xxx rows loaded to '/tmp/out/1', etc. in multi inserts, but there is no data in fact. Haven't tried the svn revision, will try it today. thx. 2010/1/5 Zheng Shao zsh...@gmail.com Looks like a bug. What is the svn revision of Hive? Did you verify that a single insert into '/tmp/out/1' produces non-empty files? Zheng On Tue, Jan 5, 2010 at 12:51 AM, wd w...@wdicc.com wrote: In the hive wiki: Hive extension (multiple inserts): FROM from_statement INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1 [INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ... I'm trying to use hive multi inserts to extract data from hive to local disk. The following is the hql: from test_tbl INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/0' select * where id%10=0 INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/1' select * where id%10=1 INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/2' select * where id%10=2 This hql can execute, but only /tmp/out/0 has a datafile in it; the other directories are empty. Why does this happen? A bug?
Re: hive multiple inserts
For your second question, currently we can do it with a little extra work: 1. Create an external table on the target directory with the field delimiter you want; 2. Run the query and insert overwrite the target external table. For the first question we can also do a similar thing (create a bunch of external tables and then insert), but I think we should fix the problem. Zheng On Mon, Jan 11, 2010 at 8:31 AM, Anty anty@gmail.com wrote: Hi: I came across the same problem, there is no data. I have one more question: can I specify the field delimiter for the output file, not just the default ctrl-A field delimiter? On Fri, Jan 8, 2010 at 2:23 PM, wd w...@wdicc.com wrote: Hi, I've tried the hive svn version; it seems this bug still exists. svn st -v 896805 896744 namit . 896805 894292 namit eclipse-templates 896805 894292 namit eclipse-templates/.classpath 896805 765509 zshao eclipse-templates/TestHive.launchtemplate 896805 765509 zshao eclipse-templates/TestMTQueries.l .. svn revision 896805? The following is the execution log. hive from test INSERT OVERWRITE LOCAL DIRECTORY '/home/stefdong/tmp/0' select * where a = 1 INSERT OVERWRITE LOCAL DIRECTORY '/home/stefdong/tmp/1' select * where a = 3; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201001071716_4691, Tracking URL = http://abc.com:50030/jobdetails.jsp?jobid=job_201001071716_4691 Kill Command = hadoop job -Dmapred.job.tracker=abc.com:9001 -kill job_201001071716_4691 2010-01-08 14:14:55,442 Stage-2 map = 0%, reduce = 0% 2010-01-08 14:15:00,643 Stage-2 map = 100%, reduce = 0% Ended Job = job_201001071716_4691 Copying data to local directory /home/stefdong/tmp/0 Copying data to local directory /home/stefdong/tmp/0 13 Rows loaded to /home/stefdong/tmp/0 9 Rows loaded to /home/stefdong/tmp/1 OK Time taken: 9.409 seconds thx. 2010/1/6 wd w...@wdicc.com hi, A single insert can extract data into '/tmp/out/1'. I can even see xxx rows loaded to '/tmp/out/0', xxx rows loaded to '/tmp/out/1', etc. in multi inserts, but there is no data in fact. Haven't tried the svn revision, will try it today. thx. 2010/1/5 Zheng Shao zsh...@gmail.com Looks like a bug. What is the svn revision of Hive? Did you verify that a single insert into '/tmp/out/1' produces non-empty files? Zheng On Tue, Jan 5, 2010 at 12:51 AM, wd w...@wdicc.com wrote: In the hive wiki: Hive extension (multiple inserts): FROM from_statement INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1 [INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ... I'm trying to use hive multi inserts to extract data from hive to local disk. The following is the hql: from test_tbl INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/0' select * where id%10=0 INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/1' select * where id%10=1 INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/2' select * where id%10=2 This hql can execute, but only /tmp/out/0 has a datafile in it; the other directories are empty. Why does this happen? A bug? -- Yours, Zheng -- Best Regards Anty Rao -- Yours, Zheng
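A rough sketch of the two-step workaround above; the table name, columns, delimiter and target directory are made up for illustration:

-- Step 1: external table whose location is the directory you want the output in
CREATE EXTERNAL TABLE export_pipe_delim (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/tmp/out/pipe_delim/';

-- Step 2: write the query results into it; the files land in /tmp/out/pipe_delim/
-- using '|' as the field delimiter instead of the default ctrl-A
INSERT OVERWRITE TABLE export_pipe_delim
SELECT id, name FROM test_tbl;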
Re: hive multiple inserts
Yes, we only support one-byte delimiters for performance reasons. You can use the RegexSerDe in the contrib package for any row format that allows a regular expression (including your case), but the speed will be slower. Zheng On Mon, Jan 11, 2010 at 5:54 PM, Anty anty@gmail.com wrote: Thanks Zheng. It does work. I have another question: if the field delimiter is a string, it looks like LazySimpleSerDe can't work. Does LazySimpleSerDe not support string field delimiters, only one-byte control characters? On Tue, Jan 12, 2010 at 3:05 AM, Zheng Shao zsh...@gmail.com wrote: For your second question, currently we can do it with a little extra work: 1. Create an external table on the target directory with the field delimiter you want; 2. Run the query and insert overwrite the target external table. For the first question we can also do a similar thing (create a bunch of external tables and then insert), but I think we should fix the problem. Zheng On Mon, Jan 11, 2010 at 8:31 AM, Anty anty@gmail.com wrote: Hi: I came across the same problem, there is no data. I have one more question: can I specify the field delimiter for the output file, not just the default ctrl-A field delimiter? On Fri, Jan 8, 2010 at 2:23 PM, wd w...@wdicc.com wrote: Hi, I've tried the hive svn version; it seems this bug still exists. svn st -v 896805 896744 namit . 896805 894292 namit eclipse-templates 896805 894292 namit eclipse-templates/.classpath 896805 765509 zshao eclipse-templates/TestHive.launchtemplate 896805 765509 zshao eclipse-templates/TestMTQueries.l .. svn revision 896805? The following is the execution log. hive from test INSERT OVERWRITE LOCAL DIRECTORY '/home/stefdong/tmp/0' select * where a = 1 INSERT OVERWRITE LOCAL DIRECTORY '/home/stefdong/tmp/1' select * where a = 3; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201001071716_4691, Tracking URL = http://abc.com:50030/jobdetails.jsp?jobid=job_201001071716_4691 Kill Command = hadoop job -Dmapred.job.tracker=abc.com:9001 -kill job_201001071716_4691 2010-01-08 14:14:55,442 Stage-2 map = 0%, reduce = 0% 2010-01-08 14:15:00,643 Stage-2 map = 100%, reduce = 0% Ended Job = job_201001071716_4691 Copying data to local directory /home/stefdong/tmp/0 Copying data to local directory /home/stefdong/tmp/0 13 Rows loaded to /home/stefdong/tmp/0 9 Rows loaded to /home/stefdong/tmp/1 OK Time taken: 9.409 seconds thx. 2010/1/6 wd w...@wdicc.com hi, A single insert can extract data into '/tmp/out/1'. I can even see xxx rows loaded to '/tmp/out/0', xxx rows loaded to '/tmp/out/1', etc. in multi inserts, but there is no data in fact. Haven't tried the svn revision, will try it today. thx. 2010/1/5 Zheng Shao zsh...@gmail.com Looks like a bug. What is the svn revision of Hive? Did you verify that a single insert into '/tmp/out/1' produces non-empty files? Zheng On Tue, Jan 5, 2010 at 12:51 AM, wd w...@wdicc.com wrote: In the hive wiki: Hive extension (multiple inserts): FROM from_statement INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1 [INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ... I'm trying to use hive multi inserts to extract data from hive to local disk.
The following is the hql: from test_tbl INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/0' select * where id%10=0 INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/1' select * where id%10=1 INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/2' select * where id%10=2 This hql can execute, but only /tmp/out/0 has a datafile in it; the other directories are empty. Why does this happen? A bug? -- Yours, Zheng -- Best Regards Anty Rao -- Yours, Zheng -- Best Regards Anty Rao -- Yours, Zheng
Re: Speedup of test target
Unfortunately, the trunk does not run tests in parallel yet. The majority of the time is spent in TestCliDriver, which contains over 200 .q files. We will need to separate the working directories and metastore directories to make these .q files run in parallel. Zheng On Thu, Jan 7, 2010 at 11:46 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Since apache was granted a clover license, I was looking to add a clover target to hive. I know there was recently a jira issue on running ant tests in parallel. I have a modest Core 2 Duo laptop that takes quite a while on the test target. Does the trunk currently run tests in parallel by default, and if not, how can I enable this? Also, what are people out there using to run the test target hardware-wise, and how long does ant test take? Thanks, Edward -- Yours, Zheng
Re: hive multiple inserts
Looks like a bug. What is the svn revision of Hive? Did you verify that a single insert into '/tmp/out/1' produces non-empty files? Zheng On Tue, Jan 5, 2010 at 12:51 AM, wd w...@wdicc.com wrote: In the hive wiki: Hive extension (multiple inserts): FROM from_statement INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1 [INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ... I'm trying to use hive multi inserts to extract data from hive to local disk. The following is the hql: from test_tbl INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/0' select * where id%10=0 INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/1' select * where id%10=1 INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/2' select * where id%10=2 This hql can execute, but only /tmp/out/0 has a datafile in it; the other directories are empty. Why does this happen? A bug? -- Yours, Zheng
RE: Populating MAP type columns
Hi Saurabh, I think we can do it with the following 3 UDFs: make_map(trim(split(cookies, ",")), "="). ArrayList<String> split(String) - see http://issues.apache.org/jira/browse/HIVE-642. ArrayList<String> trim(ArrayList<String>) and HashMap<String,String> make_map(ArrayList<String>, String separator) - these last 2 need to be written; please open a JIRA for each. It will be great if you are interested in working on that. There are some examples in the contrib directory already (search for UDFExampleAdd). See http://wiki.apache.org/hadoop/Hive/HowToContribute Zheng From: Saurabh Nanda [mailto:saurabhna...@gmail.com] Sent: Tuesday, January 05, 2010 2:01 AM To: hive-user@hadoop.apache.org Subject: Populating MAP type columns From http://wiki.apache.org/hadoop/Hive/Tutorial#Map.28Associative_Arrays.29_Operations it seems that "Such structures can only be created programmatically currently." What does this mean exactly? Do I have to use the Java-based API to insert data into such columns? If that is the case, has someone written a UDF which lets me import weblog cookie data into a MAP column using only Hive QL? The cookie data is of the following format: cookie_name1=value; cookie_name2=value; cookie_name3=value If there is no such UDF available, would it be a good idea to include one in the standard Hive distribution? Thanks, Saurabh.
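To make the intent concrete, here is a hypothetical usage sketch once the proposed UDFs exist (split, the array-trim variant, and make_map are proposals above, not current built-ins); the weblog table, its columns, and the ';' separator are assumptions based on the cookie format shown:

CREATE TABLE weblog (
  request_id STRING,
  cookies STRING  -- e.g. 'cookie_name1=value; cookie_name2=value; cookie_name3=value'
);

SELECT request_id,
       make_map(trim(split(cookies, ';')), '=') AS cookie_map
FROM weblog;
-- cookie_map would be a MAP<STRING,STRING> such as
-- {'cookie_name1' -> 'value', 'cookie_name2' -> 'value', 'cookie_name3' -> 'value'}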
Re: Null values in hive output
Hi Eric, Most probably there are leading/trailing spaces in the columns that are defined as int. If Hive cannot parse the field successfully, the field will become null. You can try this to find out the rows: SELECT * FROM raw_facts WHERE year IS NULL; (A sketch of one way to clean the data follows at the end of this message.) Zheng On Mon, Jan 4, 2010 at 4:10 PM, Eric Sammer e...@lifeless.net wrote: All: I apologize in advance if this is common. I've searched and I can't find an explanation. I'm loading a plain text tab delimited file into a Hive (0.4.1-dev) table. This file is a small sample set of my full dataset and is the result of an M/R job, written by TextOutputFormat, if it matters. When I query the table, a small percentage (a few hundred out of a few million) of the rows contain null values, whereas the input file does not contain any null values. The number of null field records seems to grow proportionally to the total number of records at a relatively constant rate. It looks as if it's a SerDe error / misconfiguration of some kind, but I can't pinpoint anything that would cause the issue. To confirm, I've done an fs -cat of the file to local disk and used cut and sort to confirm all fields are properly formatted and populated. Below is the extended table description along with some additional information. Any help is greatly appreciated, as using Hive for simple aggregation is saving me a ton of time over hand-writing the M/R jobs myself. I'm sure there's something I've done wrong. Unfortunately, I'm in a situation where I can't deal with any portion of the records being dumped (part of a reporting system). Original create: hive create table raw_facts ( year int, month int, day int, application string, company_id int, country_code string, receiver_code_id int, keyword string, total int ) row format delimited fields terminated by '\t'; (I've also tried row format TEXTFORMAT or whatever it is; all fields were null - assumed it was because hive was expecting ^A delimited.)
Table: hive describe extended raw_facts; OK year int month int day int application string company_id int country_code string receiver_code_id int keyword string total int Detailed Table Information Table(tableName:raw_facts, dbName:default, owner:snip, createTime:1262631537, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:year, type:int, comment:null), FieldSchema(name:month, type:int, comment:null), FieldSchema(name:day, type:int, comment:null), FieldSchema(name:application, type:string, comment:null), FieldSchema(name:company_id, type:int, comment:null), FieldSchema(name:country_code, type:string, comment:null), FieldSchema(name:receiver_code_id, type:int, comment:null), FieldSchema(name:keyword, type:string, comment:null), FieldSchema(name:total, type:int, comment:null)], location:hdfs://snip/home/hive/warehouse/raw_facts, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=9,field.delim= }), bucketCols:[], sortCols:[], parameters:{}), partitionKeys:[], parameters:{}) Sample (real) rows: (these are tab separated in the file) 2009 12 01 f 98 US 171 test 222 2009 12 01 f 98 US 199 test 222 2009 12 01 f 98 US 220 test 222 Load command used: hive load data inpath 'hdfs://snip/some/path/out/part-r-0' overwrite into table raw_facts ; Some queries: hive select count(1) from raw_facts; OK 4723253 hive select count(1) from raw_facts where year is null; OK 277 hive select year,count(1) from raw_facts group by year; OK NULL 277 2009 4722976 Thanks in advance. -- Eric Sammer e...@lifless.net http://esammer.blogspot.com -- Yours, Zheng
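One way to confirm the whitespace theory and work around it, sketched here under the assumption that the suspect fields really do carry stray spaces; the staging table raw_facts_str is hypothetical:

-- Load the numeric fields as STRING first, then trim and cast.
CREATE TABLE raw_facts_str (
  year STRING, month STRING, day STRING, application STRING,
  company_id STRING, country_code STRING, receiver_code_id STRING,
  keyword STRING, total STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Rows whose year field carries extra whitespace:
SELECT * FROM raw_facts_str WHERE trim(year) <> year;

-- Clean insert into the typed table:
INSERT OVERWRITE TABLE raw_facts
SELECT CAST(trim(year) AS INT), CAST(trim(month) AS INT), CAST(trim(day) AS INT),
       application, CAST(trim(company_id) AS INT), country_code,
       CAST(trim(receiver_code_id) AS INT), keyword, CAST(trim(total) AS INT)
FROM raw_facts_str;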