Re: Implement in clause with or clause
There are no risks, but it will be slower, especially when the list after IN is very long. Zheng 2010/8/3 我很快乐 896923...@qq.com: Thank you for your reply. Because my company requires that we use version 0.4.1, I couldn't upgrade. Could you tell me which risks there are if I use the OR clause (example: where id=1 or id=2 or id=3) to implement the IN clause (example: id in (1,2,3))? Thanks, LiuLei -- Yours, Zheng http://www.linkedin.com/in/zshao
Re: Hive support for latin1
Just change FetchTask.java: public boolean fetch(ArrayList<String> res) res.add(((Text) mSerde.serialize(io.o, io.oi)).toString()); Instead of using Text.toString(), use your own method to convert from raw bytes to unicode String. Zheng On Sun, Aug 1, 2010 at 8:31 PM, bc Wong bcwal...@cloudera.com wrote: Hi all, I'm trying to figure out how to query Hive on latin1 encoded data. I created a file with 256 characters, with unicode value 0-255, encoded in latin1. I made a table out of it. But when I do a select *, Hive returns the upper ascii rows as '\xef\xbf\xbd', which is the replacement character '\ufffd' encoded in UTF-8. Does anyone know how to work with non-UTF8 data? Cheers, -- bc Wong Cloudera Software Engineer -- Yours, Zheng http://www.linkedin.com/in/zshao
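For illustration, a minimal sketch of the kind of helper Zheng describes, assuming the underlying bytes are latin1 (ISO-8859-1); the class and method names are hypothetical, not part of Hive:

import java.io.UnsupportedEncodingException;
import org.apache.hadoop.io.Text;

// Hypothetical helper: decode the valid portion of a Text's backing byte array as latin1
// instead of relying on Text.toString(), which assumes UTF-8.
public final class Latin1Decoder {
  public static String decode(Text t) {
    try {
      return new String(t.getBytes(), 0, t.getLength(), "ISO-8859-1");
    } catch (UnsupportedEncodingException e) {
      // ISO-8859-1 is a standard charset, so this should never happen
      throw new RuntimeException(e);
    }
  }
}

The res.add(...) line above would then call Latin1Decoder.decode(...) on the serialized Text instead of calling toString().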
Re: built-in UTF8 checker
No, but it's very simple to write one.

import java.nio.charset.MalformedInputException;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class MyUTF8StringChecker extends UDF {
  public boolean evaluate(Text t) {
    try {
      // Text.validateUTF8 throws MalformedInputException on invalid byte sequences
      Text.validateUTF8(t.getBytes(), 0, t.getLength());
      return true;
    } catch (MalformedInputException e) {
      return false;
    }
  }
}

On Tue, Jul 20, 2010 at 12:03 PM, Ping Zhu p...@sharethis.com wrote: Hi, Are there any built-in functions in Hive to check whether a string is UTF8-encoded? I did some research about this issue but did not find useful resources. Thanks for your suggestions and help. Ping -- Yours, Zheng http://www.linkedin.com/in/zshao
Re: Hive and protocol buffers -- are there UDFs for dealing with them?
If you just need to scan the data once, it makes sense to use hive SerDe to read the data directly (which saves you one I/O round trip). If you need to read the data multiple times, then it's better to save the 3 columns into separate files. Zheng On Mon, Jul 12, 2010 at 5:08 PM, Leo Alekseyev dnqu...@gmail.com wrote: Hi all, I was wondering if anyone is using Hive with protocol buffers. The Hadoop wiki links to http://www.slideshare.net/ragho/hive-user-meeting-august-2009-facebook for SerDe examples; there it says that there is no built-in support for protobufs. Since this presentation is about a year old, I was wondering whether there appeared any UDFs, native or third-party, to deal with them. I am also curious about the relative efficiency of performing SerDe using UDFs in hive vs. running a separate hadoop job to first deserialize the data from protocol buffers into an ascii flat file with only the interesting fields (going from ~15 fields to ~3), and then doing the rest of the computation in hive. I'd appreciate any comments! Thanks, --Leo -- Yours, Zheng http://www.linkedin.com/in/zshao
Re: UDF which takes entire row as arg
Yes. Even a normal (non-generic) UDF might work if all columns can be converted to the same type: a UDF can accept a variable number of arguments of the same type. It would be a great addition to let UDF/UDAF handle * (as well as `regex` column specifications). The change is all at compile time, and is relatively simple. Zheng On Wed, Jul 7, 2010 at 8:31 PM, Edward Capriolo edlinuxg...@gmail.com wrote: You could write a generic UDF since they accept arbitrary signatures, but you would have to pass each column explicitly (no * support) -- Yours, Zheng http://www.linkedin.com/in/zshao
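As a rough illustration of the variable-length-argument point (a hypothetical sketch, not an existing Hive function), a plain UDF can declare a varargs evaluate method and be called with any number of string columns:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical example: concatenate however many string columns are passed in.
public final class ConcatRowUDF extends UDF {
  public Text evaluate(Text... cols) {
    StringBuilder sb = new StringBuilder();
    for (Text col : cols) {
      if (col == null) {
        continue;
      }
      if (sb.length() > 0) {
        sb.append('|');
      }
      sb.append(col.toString());
    }
    return new Text(sb.toString());
  }
}

You would still have to list the columns explicitly in the query (e.g. concat_row(col1, col2, col3)) until * support is added.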
Re: Create Table with Line Terminated other than '\n'
That patch basically throws an error if the user specifies a non-newline line terminator. Without the patch it would succeed but silently produce unexpected results. Sent from my iPhone On Jun 11, 2010, at 11:23 PM, Amr Awadallah a...@cloudera.com wrote: Zheng, I thought that was fixed per your work here, no? https://issues.apache.org/jira/browse/HIVE-302 Then what did you fix? -- amr On 6/10/2010 10:22 PM, Zheng Shao wrote: Also, changing LINES TERMINATED BY probably won't work, because hadoop's TextInputFormat does not allow line terminators other than \n. Zheng On Thu, Jun 10, 2010 at 6:31 PM, Carl Steinbach c...@cloudera.com wrote: Hi Shuja, The grammar for Hive's CREATE TABLE statement is discussed here: http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL#Create_Table You need to use the LINES TERMINATED BY clause in the CREATE TABLE statement in order to specify a line terminator other than \n. Carl On Thu, Jun 10, 2010 at 5:39 PM, Shuja Rehman shujamug...@gmail.com wrote: Hi, I want to create a table in Hive whose row format uses a line terminator other than '\n', so I can read an XML file as a single cell in one row and column of the table. Kindly let me know how to do this? Thanks -- Regards Shuja-ur-Rehman Baig _ MS CS - School of Science and Engineering Lahore University of Management Sciences (LUMS) Sector U, DHA, Lahore, 54792, Pakistan Cell: +92 3214207445
Re: Create Table with Line Terminated other than '\n'
Also, changing LINES TERMINATED BY probably won't work, because hadoop's TextInputFormat does not allow line terminators other than \n. Zheng On Thu, Jun 10, 2010 at 6:31 PM, Carl Steinbach c...@cloudera.com wrote: Hi Shuja, The grammar for Hive's CREATE TABLE statement is discussed here: http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL#Create_Table You need to use the LINES TERMINATED BY clause in the CREATE TABLE statement in order to specify a line terminator other than \n. Carl On Thu, Jun 10, 2010 at 5:39 PM, Shuja Rehman shujamug...@gmail.com wrote: Hi, I want to create a table in Hive whose row format uses a line terminator other than '\n', so I can read an XML file as a single cell in one row and column of the table. Kindly let me know how to do this? Thanks -- Regards Shuja-ur-Rehman Baig _ MS CS - School of Science and Engineering Lahore University of Management Sciences (LUMS) Sector U, DHA, Lahore, 54792, Pakistan Cell: +92 3214207445 -- Yours, Zheng http://www.linkedin.com/in/zshao
Re: BUG at optimizer or map side aggregate?
Nice finding! That's likely to be the cause. Can you open a JIRA issue on issues.apache.org/jira/browse/HIVE Zheng On Wed, May 12, 2010 at 1:05 AM, Ted Xu ted.xu...@gmail.com wrote: Zheng, Thank you for your reply. Well, it seems hard for me to repreduce this bug in a simpler query. However, if I change the alias of subquery 't1' (either the inner one or the join result), the bug disappears. I'm wondering if there is possible that table aliases of different level will conflict when their alias names are the same. 2010/5/12 Zheng Shao zsh...@gmail.com Yes that does seem to be a bug. Can you try if you can simply the query while reproducing the bug? That will make it a lot easier to debug and fix. Zheng On Tue, May 11, 2010 at 7:44 PM, Ted Xu ted.xu...@gmail.com wrote: Hi all, I think I found a bug, I'm not sure whether the problem is at optimizer (PPD) or at map side aggregate. See query listed below: - create table if not exists dm_fact_buyer_prd_info_d ( category_id string ,gmv_trade_num int ,user_id int ) PARTITIONED BY (ds int); set hive.optimize.ppd=true; set hive.map.aggr=true; explain select 20100426, category_id1,category_id2,assoc_idx from ( select category_id1 , category_id2 , count(distinct user_id) as assoc_idx from ( select t1.category_id as category_id1 , t2.category_id as category_id2 , t1.user_id from ( select category_id, user_id from dm_fact_buyer_prd_info_d where ds = 20100426 and ds 20100419 and category_id 0 and gmv_trade_num0 group by category_id, user_id ) t1 join ( select category_id, user_id from dm_fact_buyer_prd_info_d where ds = 20100426 and ds 20100419 and category_id 0 and gmv_trade_num 0 group by category_id, user_id ) t2 on t1.user_id=t2.user_id ) t1 group by category_id1, category_id2 ) t_o where category_id1 category_id2 and assoc_idx 2; The query above will fail when execute, throwing exception: can not cast UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text). I explained the query and the execute plan looks really wired (see the highlighted predicate): ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF dm_fact_buyer_prd_info_d)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL category_id)) (TOK_SELEXPR (TOK_TABLE_OR_COL user_id))) (TOK_WHERE (and (and (and (= (TOK_TABLE_OR_COL ds) 20100426) ( (TOK_TABLE_OR_COL ds) 20100419)) ( (TOK_TABLE_OR_COL category_id) 0)) ( (TOK_TABLE_OR_COL gmv_trade_num) 0))) (TOK_GROUPBY (TOK_TABLE_OR_COL category_id) (TOK_TABLE_OR_COL user_id t1) (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF dm_fact_buyer_prd_info_d)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL category_id)) (TOK_SELEXPR (TOK_TABLE_OR_COL user_id))) (TOK_WHERE (and (and (and (= (TOK_TABLE_OR_COL ds) 20100426) ( (TOK_TABLE_OR_COL ds) 20100419)) ( (TOK_TABLE_OR_COL category_id) 0)) ( (TOK_TABLE_OR_COL gmv_trade_num) 0))) (TOK_GROUPBY (TOK_TABLE_OR_COL category_id) (TOK_TABLE_OR_COL user_id t2) (= (. (TOK_TABLE_OR_COL t1) user_id) (. (TOK_TABLE_OR_COL t2) user_id (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) category_id) category_id1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t2) category_id) category_id2) (TOK_SELEXPR (. 
(TOK_TABLE_OR_COL t1) user_id) t1)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL category_id1)) (TOK_SELEXPR (TOK_TABLE_OR_COL category_id2)) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL user_id)) assoc_idx)) (TOK_GROUPBY (TOK_TABLE_OR_COL category_id1) (TOK_TABLE_OR_COL category_id2 t_o)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR 20100426) (TOK_SELEXPR (TOK_TABLE_OR_COL category_id1)) (TOK_SELEXPR (TOK_TABLE_OR_COL category_id2)) (TOK_SELEXPR (TOK_TABLE_OR_COL assoc_idx))) (TOK_WHERE (and ( (TOK_TABLE_OR_COL category_id1) (TOK_TABLE_OR_COL category_id2)) ( (TOK_TABLE_OR_COL assoc_idx) 2) STAGE DEPENDENCIES: Stage-1 is a root stage Stage-2 depends on stages: Stage-1, Stage-4 Stage-3 depends on stages: Stage-2 Stage-4 is a root stage Stage-2 depends on stages: Stage-1, Stage-4 Stage-3 depends on stages: Stage-2 Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias - Map Operator Tree: t_o:t1:t1:dm_fact_buyer_prd_info_d TableScan alias: dm_fact_buyer_prd_info_d Filter Operator predicate: expr: (UDFToDouble(ds
Re: why hive ignore my setting about reduce task number?
Do you need to get all records in order? In most of our use cases users are only interested in the top 100 or so. If you do limit 100 together with order by, it will be much faster. Sent from my iPhone On May 12, 2010, at 1:54 PM, luocan19826...@sohu.com wrote: Thanks, Ted. If I have very big data to sort, only 1 reduce task will be a performance issue. Does Hive have some way to optimize it? I have observed that the reduce task is very slow in my job.
Re: error: Both Left and Right Aliases Encountered in Join obj
Put t1.obj <> t2.obj in the WHERE clause. On Fri, Apr 30, 2010 at 12:14 AM, Harshit Kumar ku...@bike.snu.ac.kr wrote: Hi, I have a query like this: from spo t1 join spo t2 on (t1.sub=t2.sub and t1.obj<>t2.obj) insert overwrite table spojoin select t1.sub, t1.pre, t2.obj, t2.sub, t2.pre, t2.obj; Executing the above query gives the following error. FAILED: Error in semantic analysis: line 1:46 Both Left and Right Aliases Encountered in Join obj However, if I replace the <> operator with the == operator, it executes. Please let me know what I am doing wrong. Thanks Kumar -- Yours, Zheng http://www.linkedin.com/in/zshao
Re: HADOOP-4012 and bzip2 input splitting
Can you take a look at the job.xml link in your map-reduce job created by Hive and let me know the mapred.input.format.class? Is it HiveInputFormat or CombineHiveInputFormat? It should work if you set it to org.apache.hadoop.hive.ql.io.HiveInputFormat. Also, can you verify if https://issues.apache.org/jira/browse/MAPREDUCE-830 is in your hadoop distribution or not? Zheng On Wed, Apr 21, 2010 at 11:31 PM, 김영우 warwit...@gmail.com wrote: Zheng, Thanks for your quick reply, but there is only 1 mapper for my job with a 300 MB .bz2 file. I added the following in my core-site.xml: <property> <name>io.compression.codecs</name> <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec</value> </property> My table definition: create table test_bzip2 ( col1 string, . . col20 string ) row format delimited fields terminated by '\t' stored as textfile; A simple grouping/count query and the following is the query's plan: STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: test_bzip2 TableScan alias: test_bzip2 Select Operator expressions: expr: siteid type: string outputColumnNames: siteid Reduce Output Operator key expressions: expr: siteid type: string sort order: + Map-reduce partition columns: expr: siteid type: string tag: -1 value expressions: expr: 1 type: int Reduce Operator Tree: Group By Operator aggregations: expr: count(VALUE._col0) bucketGroup: false keys: expr: KEY._col0 type: string mode: complete outputColumnNames: _col0, _col1 Select Operator expressions: expr: _col0 type: string expr: _col1 type: bigint outputColumnNames: _col0, _col1 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1 I just verified that bz2 splitting works in my cluster using a simple Pig script; the Pig script creates 3 mappers for the M/R job. What should I check further? Job config info? - Youngwoo 2010/4/22 Zheng Shao zsh...@gmail.com It should be automatically supported. You don't need to do anything except adding the bzip2 codec in io.compression.codecs in hadoop configuration files (core-site.xml) Zheng On Wed, Apr 21, 2010 at 10:15 PM, 김영우 warwit...@gmail.com wrote: Hi, HADOOP-4012, https://issues.apache.org/jira/browse/HADOOP-4012 has been committed, and CDH3 supports bzip2 splitting. I'm wondering if Hive supports input splitting for bzip2 compressed text files (*.bz2). If not, should I implement a custom SerDe for bzip2 compressed files? Thanks, Youngwoo -- Yours, Zheng http://www.linkedin.com/in/zshao -- Yours, Zheng http://www.linkedin.com/in/zshao
Re: HADOOP-4012 and bzip2 input splitting
It should be automatically supported. You don't need to do anything except add the bzip2 codec to io.compression.codecs in the hadoop configuration files (core-site.xml). Zheng On Wed, Apr 21, 2010 at 10:15 PM, 김영우 warwit...@gmail.com wrote: Hi, HADOOP-4012, https://issues.apache.org/jira/browse/HADOOP-4012 has been committed, and CDH3 supports bzip2 splitting. I'm wondering if Hive supports input splitting for bzip2 compressed text files (*.bz2). If not, should I implement a custom SerDe for bzip2 compressed files? Thanks, Youngwoo -- Yours, Zheng http://www.linkedin.com/in/zshao
Re: Cluster By Algorithm?
It's as simple as taking a hash code of the key and taking it modulo the number of reducers. To get started, try any of the .q files in the clientpositive directory. On the code side, HiveKey.java has the implementation. Sent from my iPhone On Apr 11, 2010, at 2:48 PM, Aaron McCurry amccu...@gmail.com wrote: I have a search solution that is downstream of some Netezza data marts that I'm replacing with a Hive solution. We already partition the data for the search solution 32 ways, and I would like to take advantage of the data clustering in Hive (buckets) so that I don't have to do any post-processing. Is there documentation that describes how the data is hashed or how it's organized across the buckets? Or could someone point me to a class that implements it? Thanks! Aaron
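The idea, in a rough sketch (an illustration of the scheme, not the actual HiveKey.java code):

// Illustration of the bucketing scheme described above: hash the key, mask off the
// sign bit, and take the remainder modulo the number of buckets/reducers.
public final class BucketAssigner {
  public static int bucketFor(Object key, int numBuckets) {
    int hash = (key == null) ? 0 : key.hashCode();
    return (hash & Integer.MAX_VALUE) % numBuckets;
  }
}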
Re: Using newest hive release (0.5.0) - Problem with count(1)
Yes we use sun jdk 1.6 and it works. On Tue, Apr 6, 2010 at 12:32 PM, Aaron McCurry amccu...@gmail.com wrote: I am using 1.6, however it is the IBM jvm (not my choice). If the feature is known to work on the Sun JVM then I will deal with the problem another way. Thanks. Aaron On Tue, Apr 6, 2010 at 3:12 PM, Zheng Shao zsh...@gmail.com wrote: Are you using Java 1.5? Hive now requires Java 1.6 On Tue, Apr 6, 2010 at 7:23 AM, Aaron McCurry amccu...@gmail.com wrote: In the past I have used hive 0.3.0 successfully and now with a new project coming up I decided to give hive 0.5.0 a run and everything is working as expected, except for when I try to get a simple count of the table. The simple table is defined as: create table log_table (col1 string, col2 string, col3 string, col4 string, col5 string, col6 string) row format delimited fields terminated by '\t' stored as textfile; And the query I'm running is: select count(1) from log_table; From the hive command line I get the following errors: ... In order to set c constant number of reducers: set mapred.reduce.tasks=number Exception during encoding:java.lang.Exception: failed to write expression: GenericUDAFEvaluator$Mode=Class.new(); Continue... Exception during encoding:java.lang.Exception: failed to write expression: GenericUDAFEvaluator$Mode=Class.new(); Continue... Exception during encoding:java.lang.Exception: failed to write expression: GenericUDAFEvaluator$Mode=Class.new(); Continue... Exception during encoding:java.lang.Exception: failed to write expression: GenericUDAFEvaluator$Mode=Class.new(); Continue... Starting Job = job_201004010912_0015, Tracking URL = . And when looking at the failed hadoop jobs I see the following exception: Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableIntObjectInspector incompatible with org.apache.hadoop.hive.serde2.objectinspector.primitive.LongObjectInspector at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount$GenericUDAFCountEvaluator.merge(GenericUDAFCount.java:93) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:113) ... Is this a known issue? Am I missing something? Any guidance would be appreciated. Thanks! Aaron -- Yours, Zheng http://www.linkedin.com/in/zshao -- Yours, Zheng http://www.linkedin.com/in/zshao
Re: Truncation error when creating table with column containing struct with many fields
That change should be fine. Zheng On Tue, Apr 6, 2010 at 5:16 PM, Dilip Joseph dilip.antony.jos...@gmail.com wrote: Hello, I got the following error when creating a table with a column that has an ARRAY of STRUCTS with many fields. It appears that there is a 128 character limit on the column definition. FAILED: Error in metadata: javax.jdo.JDODataStoreException: Add request failed : INSERT INTO COLUMNS (SD_ID,COMMENT,COLUMN_NAME,TYPE_NAME,INTEGER_IDX) VALUES (?,?,?,?,?) NestedThrowables: java.sql.BatchUpdateException: A truncation error was encountered trying to shrink VARCHAR 'array<struct<id:int,fld1:bigint,fld2:int,fld3' to length 128. FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask I was able to get the table created after changing 128 to 256 in /metastore/src/model/package.jdo. Does anyone know if there are any adverse side-effects of doing so? Dilip -- Yours, Zheng http://www.linkedin.com/in/zshao
Re: create table exception
See http://wiki.apache.org/hadoop/Hive/AdminManual/MetastoreAdmin for details. Zheng On Mon, Apr 5, 2010 at 12:01 AM, Sagar Naik sn...@attributor.com wrote: Hi, As a trial, I am trying to set up hive in local DFS/MR mode. I have set <property> <name>hive.metastore.uris</name> <value>file:///data/hive/metastore/metadb</value> <description>The location of filestore metadata base dir</description> </property> in hive-site.xml, but I'm still getting the following error. Please help me get hive up and running. CREATE TABLE pokes (foo INT, bar STRING); 10/04/04 23:58:08 [main] INFO parse.ParseDriver: Parsing command: CREATE TABLE pokes (foo INT, bar STRING) 10/04/04 23:58:08 [main] INFO parse.ParseDriver: Parse Completed 10/04/04 23:58:08 [main] INFO parse.SemanticAnalyzer: Starting Semantic Analysis 10/04/04 23:58:08 [main] INFO parse.SemanticAnalyzer: Creating table pokes position=13 10/04/04 23:58:08 [main] INFO ql.Driver: Semantic Analysis Completed 10/04/04 23:58:08 [main] INFO ql.Driver: Starting command: CREATE TABLE pokes (foo INT, bar STRING) 10/04/04 23:58:08 [main] INFO exec.DDLTask: Default to LazySimpleSerDe for table pokes 10/04/04 23:58:08 [main] INFO hive.log: DDL: struct pokes { i32 foo, string bar} FAILED: Error in metadata: java.lang.IllegalArgumentException: URI: does not have a scheme 10/04/04 23:58:08 [main] ERROR exec.DDLTask: FAILED: Error in metadata: java.lang.IllegalArgumentException: URI: does not have a scheme org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: URI: does not have a scheme at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:281) at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:1281) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:119) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:99) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:64) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:582) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:462) at org.apache.hadoop.hive.ql.Driver.runCommand(Driver.java:324) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:312) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:123) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:181) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:287) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:155) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) Caused by: java.lang.IllegalArgumentException: URI: does not have a scheme at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:92) at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:828) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:838) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:275) ... 20 more FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask -- Yours, Zheng
Re: UDAF on AWS Hive
Hive 0.4 has limited support on complex types in UDAF. If you are looking for an ad-hoc solution, try putting the data into a single Text. It will be great if you can ask AWS guys upgrading Hive to 0.5. 0.5 has over 100 bug fixes and is much more stable. Zheng On Fri, Apr 2, 2010 at 1:11 PM, Matthew Bryan gou...@gmail.com wrote: I'm writing a basic group_concat UDAF for the Amazon version of Hiveand it's working fine for unordered groupings. But I can't seem to get an ordered version working (filling an array based on an IntWritable passed alongside). When I move from using Text return type on terminatePartial() to either Text[] or a State class I start getting errors: FAILED: Error in semantic analysis: org.apache.hadoop.hive.ql.metadata.HiveException: Cannot recognize return type class [Lorg.apache.hadoop.io.Text; from public org.apache.hadoop.io.Text[] com.company.hadoop.hive.udaf.UDAFGroupConcatN$GroupConcatNStringEvaluator.terminatePartial() or FAILED: Error in semantic analysis: org.apache.hadoop.hive.ql.metadata.HiveException: Cannot recognize return type class com.company.hadoop.hive.udaf.UDAFGroupConcatN$UDAFGroupConc atNState from public com.company.hadoop.hive.udaf.UDAFGroupConcatN$UDAFGroupConcatNState com.company.hadoop.hive.udaf.UDAFGroupConcatN$GroupConcatNStringEvaluator.terminatePartial () What limits are there on the return type of terminatePartial()shouldn't it just have to match the argument of merge and nothing more? Keep in mind this is the Amazon version of Hive (0.4 I think) I put both versions of the UDAF below, ordered and unordered. Thanks for your time. Matt # Working Unordered /*QUERY: select user, event, group_concat(details) from datatable group by user,event;*/ package com.company.hadoop.hive.udaf; import org.apache.hadoop.hive.ql.exec.UDAF; import org.apache.hadoop.hive.ql.exec.UDAFEvaluator; import org.apache.hadoop.io.Text; public class UDAFGroupConcat extends UDAF{ public static class GroupConcatStringEvaluator implements UDAFEvaluator { private Text mOutput; private boolean mEmpty; public GroupConcatStringEvaluator() { super(); init(); } public void init() { mOutput = null; mEmpty = true; } public boolean iterate(Text o) { if (o!=null) { if(mEmpty) { mOutput = new Text(o); mEmpty = false; } else { mOutput.set(mOutput.toString()+ +o.toString()); } } return true; } public Text terminatePartial() {return mEmpty ? null : mOutput;} public boolean merge(Text o) {return iterate(o);} public Text terminate() {return mEmpty ? null : mOutput;} } } Not Working Ordered # /*QUERY: select user, event, group_concatN(details, detail_id) from datatable group by user,event;*/ package com.company.hadoop.hive.udaf; import org.apache.hadoop.hive.ql.exec.UDAF; import org.apache.hadoop.hive.ql.exec.UDAFEvaluator; import org.apache.hadoop.io.Text; import org.apache.hadoop.io.IntWritable; public class UDAFGroupConcatN extends UDAF{ public static class GroupConcatNStringEvaluator implements UDAFEvaluator { private Text[] mArray; private boolean mEmpty; public GroupConcatNStringEvaluator() { super(); init(); } public void init() { mArray = new Text[5]; mEmpty = true; } public boolean iterate(Text o, IntWritable N) { if (o!=nullN!=null) { mArray[N.get()].set(o.toString()); mEmpty=false; } return true; } public Text[] terminatePartial() {return mEmpty ? null : mArray;} public boolean merge(Text[] o) { if (o!=null) { for(int i=0; i=5; i++){ if(mArray[i].getLength()==0){ mArray[i].set(o[i].toString()); } } } return true; } public Text[] terminate() {return mEmpty ? 
null : mArray;} } } -- Yours, Zheng
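A minimal sketch of the single-Text workaround suggested above, assuming the partial state is a fixed number of ordered string slots packed with a delimiter that never appears in the data (the class and method names are hypothetical):

import org.apache.hadoop.io.Text;

// Hypothetical helper for the 0.4 workaround: keep the partial aggregation state in
// one delimited Text so terminatePartial() and merge() only ever exchange Text.
public final class PackedSlots {
  private static final char SEP = '\u0002';

  // Pack the ordered slots into a single delimited string.
  public static Text pack(String[] slots) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < slots.length; i++) {
      if (i > 0) sb.append(SEP);
      sb.append(slots[i] == null ? "" : slots[i]);
    }
    return new Text(sb.toString());
  }

  // Unpack a partial back into n ordered slots; the -1 keeps trailing empty fields.
  public static String[] unpack(Text partial, int n) {
    String[] parts = partial.toString().split(String.valueOf(SEP), -1);
    String[] slots = new String[n];
    for (int i = 0; i < n && i < parts.length; i++) {
      slots[i] = parts[i].isEmpty() ? null : parts[i];
    }
    return slots;
  }
}

iterate() would fill the slot at the given index, terminatePartial() would return pack(slots), and merge(Text) would unpack and combine, so every signature stays on Text.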
Re: Sequence Files with data inside key
The easiest way is to write a SequenceFileInputFormat that returns a RecordReader that has key in the value and value in the key. Zheng On Fri, Apr 2, 2010 at 2:16 PM, Edward Capriolo edlinuxg...@gmail.com wrote: I have some sequence files in which all our data is in the key. http://osdir.com/ml/hive-user-hadoop-apache/2009-10/msg00027.html Has anyone tackled the above issue? -- Yours, Zheng
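A rough sketch of that wrapper idea, using the old org.apache.hadoop.mapred API that these Hive versions run on (the class name is hypothetical); a custom SequenceFileInputFormat subclass would return this reader from getRecordReader():

import java.io.IOException;
import org.apache.hadoop.mapred.RecordReader;

// Wraps a RecordReader<K, V> and presents it as RecordReader<V, K>, so the data stored
// in the SequenceFile key is handed to Hive as the value.
public class KeyValueSwappingRecordReader<K, V> implements RecordReader<V, K> {
  private final RecordReader<K, V> inner;

  public KeyValueSwappingRecordReader(RecordReader<K, V> inner) {
    this.inner = inner;
  }

  public boolean next(V key, K value) throws IOException {
    // delegate with the arguments swapped
    return inner.next(value, key);
  }

  public V createKey() { return inner.createValue(); }
  public K createValue() { return inner.createKey(); }
  public long getPos() throws IOException { return inner.getPos(); }
  public float getProgress() throws IOException { return inner.getProgress(); }
  public void close() throws IOException { inner.close(); }
}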
Re: date_sub() function returns wrong date because of daylight saving time difference
I will take a look. Thanks Bryan! On Thu, Apr 1, 2010 at 12:38 AM, Bryan Talbot btal...@aeriagames.com wrote: I guess most places are running their clusters with UTC time zones or these functions are not widely used. Any chance of getting a committer to look at the patch with unit tests? -Bryan On Mar 26, 2010, at Mar 26, 11:37 AM, Bryan Talbot wrote: Has anyone else been running into this issue? https://issues.apache.org/jira/browse/HIVE-1253 If not, what are we doing wrong to get hit by it? -Bryan -- Yours, Zheng
Re: unix_timestamp function
Setting TZ in your .bash_profile won't work because the map/reduce tasks runs on the hadoop clusters. If you start your hadoop tasktracker with that TZ setting, it will probably work. Zheng On Thu, Apr 1, 2010 at 3:32 PM, tom kersnick hiveu...@gmail.com wrote: So its working, but Im having a time zone issue. My servers are located in EST, but i need this data in PST. So when it converts this: hive select from_unixtime(1270145333,'-MM-dd HH:mm:ss') from ut2; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201003031204_0102, Tracking URL = http://master:50030/jobdetails.jsp?jobid=job_201003031204_0102 Kill Command = /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=master:54311 -kill job_201003031204_0102 2010-04-01 18:28:23,041 Stage-1 map = 0%, reduce = 0% 2010-04-01 18:28:37,315 Stage-1 map = 67%, reduce = 0% 2010-04-01 18:28:43,386 Stage-1 map = 100%, reduce = 0% 2010-04-01 18:28:46,412 Stage-1 map = 100%, reduce = 100% Ended Job = job_201003031204_0102 OK 2010-04-01 14:08:53 Time taken: 30.191 seconds I need it to be : 2010-04-01 11:08:53 I tried setting the variable in my .bash_profile for TZ=/ /Americas/ = no go. Nothing in the hive ddl link you is leading me in the right direction. Is there something you guys can recommend? I can write a script outside of hive, but it would be great if I can have users handle this within their queries. Thanks in advance! /tom On Thu, Apr 1, 2010 at 2:17 PM, tom kersnick hiveu...@gmail.com wrote: ok thanks I should have caught that. /tom On Thu, Apr 1, 2010 at 2:13 PM, Carl Steinbach c...@cloudera.com wrote: Hi Tom, Unix Time is defined as the number of *seconds* since January 1, 1970. It looks like the data you have in cola is in milliseconds. You need to divide this value by 1000 before calling from_unixtime() on the result. Thanks. Carl On Thu, Apr 1, 2010 at 2:02 PM, tom kersnick hiveu...@gmail.com wrote: Thanks, but there is something fishy going on. Im using hive 0.5.0 with hadoop 0.20.1 I tried the column as both a bigint and a string. According the hive ddl: string from_unixtime(int unixtime) Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of 1970-01-01 00:00:00 It looks like the input is int, that would be too small for my 1270145333155 timestamp. Any ideas? 
Example below: /tom hive describe ut; OK colabigint colbstring Time taken: 0.101 seconds hive select * from ut; OK 1270145333155tuesday Time taken: 0.065 seconds hive select from_unixtime(cola,'-MM-dd HH:mm:ss'),colb from ut; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201003031204_0083, Tracking URL = http://master:50030/jobdetails.jsp?jobid=job_201003031204_0083 Kill Command = /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=master:54311 -kill job_201003031204_0083 2010-04-01 16:57:32,407 Stage-1 map = 0%, reduce = 0% 2010-04-01 16:57:45,577 Stage-1 map = 100%, reduce = 0% 2010-04-01 16:57:48,605 Stage-1 map = 100%, reduce = 100% Ended Job = job_201003031204_0083 OK 42219-04-22 00:05:55tuesday Time taken: 18.066 seconds hive describe ut; OK colastring colbstring Time taken: 0.077 seconds hive select * from ut; OK 1270145333155tuesday Time taken: 0.065 seconds hive select from_unixtime(cola,'-MM-dd HH:mm:ss'),colb from ut; FAILED: Error in semantic analysis: line 1:7 Function Argument Type Mismatch from_unixtime: Looking for UDF from_unixtime with parameters [class org.apache.hadoop.io.Text, class org.apache.hadoop.io.Text] On Thu, Apr 1, 2010 at 1:37 PM, Carl Steinbach c...@cloudera.comwrote: Hi Tom, I think you want to use the from_unixtime UDF: hive describe function extended from_unixtime; describe function extended from_unixtime; OK from_unixtime(unix_time, format) - returns unix_time in the specified format Example: SELECT from_unixtime(0, '-MM-dd HH:mm:ss') FROM src LIMIT 1; '1970-01-01 00:00:00' Time taken: 0.647 seconds hive Thanks. Carl On Thu, Apr 1, 2010 at 1:11 PM, tom kersnick hiveu...@gmail.comwrote: hive describe ut; OK timebigint daystring Time taken: 0.128 seconds hive select * from ut; OK 1270145333155tuesday Time taken: 0.085 seconds When I run this simple query, I'm getting a NULL for the time column with data type bigint. hive select unix_timestamp(time),day from ut; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201003031204_0080, Tracking URL =
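One way to sidestep both issues in this thread (milliseconds vs. seconds, and the server time zone) is a small custom UDF that does the formatting itself; a hypothetical sketch, not a built-in function:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Hypothetical UDF: formats epoch *milliseconds* in a fixed time zone, so the result
// does not depend on the TZ setting of the tasktracker nodes.
public final class MillisToPacificUDF extends UDF {
  public Text evaluate(LongWritable epochMillis) {
    if (epochMillis == null) {
      return null;
    }
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    fmt.setTimeZone(TimeZone.getTimeZone("America/Los_Angeles"));
    return new Text(fmt.format(new Date(epochMillis.get())));
  }
}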
Re: How do I make Hive use a custom scheduler and not the default scheduler?
Hive also loads hadoop conf in HADOOP_HOME/conf. You can set it there. On 3/23/10, Ryan LeCompte lecom...@gmail.com wrote: Right now when we submit queries, it uses the hadoop scheduler. I have a custom fair share scheduler configured as well, but I see that jobs generated from our Hive queries never get picked up by that scheduler. Is there something in hive-site.xml that I can configure to make all queries use a particular scheduler? Thanks, Ryan -- Sent from my mobile device Yours, Zheng
Re: Performance Programming Comparison of JAQL, Hive, Pig and Java
Glad to know that Hive has a good performance compared with other languages. It will be great if you can publish the queries/codes in the benchmark, as well as environment setup, so that other people can rerun your benchmark easily. Zheng On Tue, Mar 23, 2010 at 7:11 AM, Rob Stewart robstewar...@googlemail.com wrote: Hi folks, As promised, today I have made available my findings and experiment results from my research project, examining the high level languages: Pig, Hive and JAQL. The project extends from existing studies, by evaluating the scale up, scale out, and runtime for 3 benchmarking applications. It also examines the ease of programming, and the computational power of each language. I've created two documents: - Publication - A slide-by-slide presentation. 16 slides - *Suitable for most readers* - dissertation results chapter (18 pages of text) You can find these documents at: http://www.macs.hw.ac.uk/~rs46/publications.html Excuse the .HTML link - It is useful for me to record the number of hits the publication receives. I welcome any feedback, either on this mailing list, or to my University email address for direct correspondence. Any questions regarding the benchmarks should be sent to my University email address. Thanks for taking an interest, Rob Stewart -- Yours, Zheng
Re: support for arrays, maps, structs while writing output of custom reduce script to table
From 0.5 (probably), we can add type information to the column names after AS. Note that the first level separator should be TAB, and the second separator should be ^B (and then ^C, etc) FROM (select * from srcTable DISTRIBUTE BY id SORT BY id) s INSERT OVERWRITE TABLE SS REDUCE * USING 'myreduce.py' AS (a INT, b INT, vals ARRAY<STRUCT<x:INT, y:STRING>>) ; On Mon, Mar 22, 2010 at 1:50 PM, Dilip Joseph dilip.antony.jos...@gmail.com wrote: Hello, Does Hive currently support arrays, maps, structs while using custom reduce/map scripts? 'myreduce.py' in the example below produces an array of structs delimited by \2s and \3s. CREATE TABLE SS ( a INT, b INT, vals ARRAY<STRUCT<x:INT, y:STRING>> ); FROM (select * from srcTable DISTRIBUTE BY id SORT BY id) s INSERT OVERWRITE TABLE SS REDUCE * USING 'myreduce.py' AS (a,b, vals) ; However, the query is failing with the following error message, even before the script is executed: FAILED: Error in semantic analysis: line 2:27 Cannot insert into target table because column number/types are different SS: Cannot convert column 2 from string to array<struct<x:int,y:string>>. I saw a discussion about this in http://www.mail-archive.com/hive-user@hadoop.apache.org/msg00160.html, dated over a year ago. Just wondering if there have been any updates. Thanks, Dilip -- Yours, Zheng
Re: support for arrays, maps, structs while writing output of custom reduce script to table
Great! This is a bug. Hive field names should be case-insensitive. Can you open a JIRA for that? Zheng On Mon, Mar 22, 2010 at 2:43 PM, Dilip Joseph dilip.antony.jos...@gmail.com wrote: Thanks Zheng, That worked. It appears that the type information is converted to lower case before comparison. The following statements where userId is used as a field name failed. hive> CREATE TABLE SS ( a INT, b INT, vals ARRAY<STRUCT<userId:INT, y:STRING>> ); OK Time taken: 0.309 seconds hive> FROM (select * from srcTable DISTRIBUTE BY id SORT BY id) s INSERT OVERWRITE TABLE SS REDUCE * USING 'myreduce.py' AS (a INT, b INT, vals ARRAY<STRUCT<userId:INT, y:STRING>> ) ; FAILED: Error in semantic analysis: line 2:27 Cannot insert into target table because column number/types are different SS: Cannot convert column 2 from array<struct<userId:int,y:string>> to array<struct<userid:int,y:string>>. The same queries worked fine after changing userId to userid. Dilip On Mon, Mar 22, 2010 at 2:20 PM, Zheng Shao zsh...@gmail.com wrote: From 0.5 (probably), we can add type information to the column names after AS. Note that the first level separator should be TAB, and the second separator should be ^B (and then ^C, etc) FROM (select * from srcTable DISTRIBUTE BY id SORT BY id) s INSERT OVERWRITE TABLE SS REDUCE * USING 'myreduce.py' AS (a INT, b INT, vals ARRAY<STRUCT<x:INT, y:STRING>>) ; On Mon, Mar 22, 2010 at 1:50 PM, Dilip Joseph dilip.antony.jos...@gmail.com wrote: Hello, Does Hive currently support arrays, maps, structs while using custom reduce/map scripts? 'myreduce.py' in the example below produces an array of structs delimited by \2s and \3s. CREATE TABLE SS ( a INT, b INT, vals ARRAY<STRUCT<x:INT, y:STRING>> ); FROM (select * from srcTable DISTRIBUTE BY id SORT BY id) s INSERT OVERWRITE TABLE SS REDUCE * USING 'myreduce.py' AS (a,b, vals) ; However, the query is failing with the following error message, even before the script is executed: FAILED: Error in semantic analysis: line 2:27 Cannot insert into target table because column number/types are different SS: Cannot convert column 2 from string to array<struct<x:int,y:string>>. I saw a discussion about this in http://www.mail-archive.com/hive-user@hadoop.apache.org/msg00160.html, dated over a year ago. Just wondering if there have been any updates. Thanks, Dilip -- Yours, Zheng -- _ Dilip Antony Joseph http://www.marydilip.info -- Yours, Zheng
Re: SerDe examples that use arrays and structs?
BinarySortableSerDe, LazySimpleSerDe, and LazyBinarySerDe all support arrays/structs. There is a UDF called size(var) that returns the size of an array. Zheng On Sun, Mar 21, 2010 at 9:19 PM, Adam O'Donnell a...@immunet.com wrote: First of all, thank you to all of the facebook guys for hosting the hive user group last week. Second of all, does anyone have some SerDe code that uses arrays and structs on deserialization? Also, is there a way inside of Hive to discover the number of elements in an array? Thanks and take care Adam -- Yours, Zheng
Re: delimiters for nested structures
Multiple levels of delimiters work as follows by default: the first level (the field delimiter) is \001 (^A, ascii code 1). Each level of struct and array takes one additional field delimiter (\002, etc). Each level of map takes 2 additional levels of field delimiter. So it will be: s1.name ^B s1.age ^A a1[0].x ^C a1[0].y ^B a1[1].x ^C a1[1].y ^A b1.key1 ^C b1.value1[0] ^D b1.value1[1] ^B b1.key2 ^C b1.value2[0] ^D b1.value2[1] Zheng On Fri, Mar 19, 2010 at 6:07 PM, Dilip Joseph dilip.antony.jos...@gmail.com wrote: Hello, What are the delimiters for data to be loaded into a table with nested arrays, structs, maps etc? For example: CREATE TABLE nested ( s1 STRUCT<name:STRING, age: INT>, a1 ARRAY<STRUCT<x:INT, y:INT>>, b1 MAP<STRING, ARRAY<INT>> ) Should I write a custom SerDe for this? Thank you, Dilip -- Yours, Zheng
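To make the scheme concrete, here is a small sketch that builds one text row for the nested table above using those default delimiters (the values are made up):

// Builds one line of text data for the `nested` table: ^A between top-level columns,
// ^B/^C/^D for each deeper level, exactly as described above.
public final class NestedRowExample {
  public static String exampleRow() {
    final char A = '\u0001', B = '\u0002', C = '\u0003', D = '\u0004';
    StringBuilder sb = new StringBuilder();
    // s1: STRUCT<name, age> -> fields separated by ^B
    sb.append("alice").append(B).append(30).append(A);
    // a1: ARRAY<STRUCT<x, y>> -> elements separated by ^B, struct fields by ^C
    sb.append(1).append(C).append(2).append(B).append(3).append(C).append(4).append(A);
    // b1: MAP<STRING, ARRAY<INT>> -> entries by ^B, key/value by ^C, array items by ^D
    sb.append("k1").append(C).append(10).append(D).append(11).append(B)
      .append("k2").append(C).append(20).append(D).append(21);
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(exampleRow());
  }
}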
Re: DynamicSerDe/TBinaryProtocol
What is the format of your data? TBinaryProtocol does not work with TextFile format, as you can imagine. On 3/10/10, Anty anty@gmail.com wrote: Hi: ALL I encounter a problem, any suggestion will be appreciated! MY hive version is 0.30.0 I create a table in CLI. CREATE TABLE table2 (boo int,bar string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe' WITH SERDEPROPERTIES ( 'serialization.format'=org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol') STORED AS TEXTFILE; Then a load some data to table2. INSERT OVERWRITE TABLE table2 SELECT foo,bar from pokes. Everything is OK. Also , i can issue queries against table2. But, when i change the protocol to TBinaryProtocol, CREATE TABLE table1 (boo int,bar string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe' WITH SERDEPROPERTIES ( 'serialization.format'='org.apache.thrift.protocol.TBinaryProtocol') STORED AS TEXTFILE; then load some data to table1 ,there is some error ,the loading process can't be completed. java.lang.RuntimeException: org.apache.hadoop.hive.serde2.SerDeException: org.apache.thrift.transport.TTransportException: Cannot read. Remote side has closed. Tried to read 1 bytes, but only got 0 bytes. at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:170) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.serde2.SerDeException: org.apache.thrift.transport.TTransportException: Cannot read. Remote side has closed. Tried to read 1 bytes, but only got 0 bytes. at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:328) at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165) ... 4 more Caused by: org.apache.hadoop.hive.serde2.SerDeException: org.apache.thrift.transport.TTransportException: Cannot read. Remote side has closed. Tried to read 1 bytes, but only got 0 bytes. at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe.deserialize(DynamicSerDe.java:135) at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:319) ... 5 more Caused by: org.apache.thrift.transport.TTransportException: Cannot read. Remote side has closed. Tried to read 1 bytes, but only got 0 bytes. at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:314) at org.apache.thrift.protocol.TBinaryProtocol.readByte(TBinaryProtocol.java:247) at org.apache.thrift.protocol.TBinaryProtocol.readFieldBegin(TBinaryProtocol.java:216) at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDeFieldList.deserialize(DynamicSerDeFieldList.java:163) at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDeStructBase.deserialize(DynamicSerDeStructBase.java:59) at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe.deserialize(DynamicSerDe.java:131) ... 6 more If there is something wrong with TBinaryProtocol? -- Best Regards Anty Rao -- Sent from my mobile device Yours, Zheng
Re: Hive UDF Unknown exception:
Try Double[]. Primitive arrays (like double[], int[]) are not supported yet, because that needs special handling for each of the primitive type. Zheng On Wed, Mar 10, 2010 at 4:55 PM, tom kersnick hiveu...@gmail.com wrote: Gents, Any ideas why this happens? Im using hive 0.50 with hadoop 20.2. This is a super simple UDF. Im just taking the length of the values and then dividing by pi. It keeps popping up with this error: FAILED: Unknown exception: [D cannot be cast to [Ljava.lang.Object; Here is my approach: package com.xyz.udf; import org.apache.hadoop.hive.ql.exec.UDF; import java.util.Collections; public final class test extends UDF { public double evaluate(double[] values) { final Integer len = values.length; final Integer pi = len / 3.14159265; return values[pi]; } } hive list jars; hive add jar /tmp/hive_aux/x-y-z-udf-1.0-SNAPSHOT.jar; Added /tmp/hive_aux/x-y-z-udf-1.0-SNAPSHOT.jar to class path hive create temporary function my_test as 'com.xyz.udf.test'; OK Time taken: 0.41 seconds hive show tables; OK userpool test Time taken: 3.167 seconds hive describe userpool; OK word string amount int Time taken: 0.098 seconds hive select my_test(amount) from userpool; FAILED: Unknown exception: [D cannot be cast to [Ljava.lang.Object; hive describe test; OK word string amount string Time taken: 0.134 seconds hive select my_test(amount) from test; FAILED: Unknown exception: [D cannot be cast to [Ljava.lang.Object; Thanks in advance! /tom -- Yours, Zheng
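For reference, a sketch of what the corrected signature would look like with Double[] (the rest of the logic mirrors the original example):

import org.apache.hadoop.hive.ql.exec.UDF;

// Sketch of the fix suggested above: declare the argument as Double[] (an object array),
// which Hive can map an array column to, instead of the primitive double[].
public final class TestArrayUDF extends UDF {
  public Double evaluate(Double[] values) {
    if (values == null || values.length == 0) {
      return null;
    }
    int idx = (int) (values.length / 3.14159265);  // same index arithmetic as the original
    return values[idx];
  }
}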
Re: problem with IS NOT NULL operator in hive
WHERE product_name IS NOT NULL AND product_name <> '' On Tue, Mar 9, 2010 at 12:45 AM, prakash sejwani prakashsejw...@gmail.com wrote: Yes, right. Can you give me a tip on how to exclude blank values? On Tue, Mar 9, 2010 at 2:13 PM, Zheng Shao zsh...@gmail.com wrote: So I guess you didn't exclude the blank ones? On Tue, Mar 9, 2010 at 12:41 AM, prakash sejwani prakashsejw...@gmail.com wrote: Yes, regexp_extract returns NULL or blank On Tue, Mar 9, 2010 at 2:05 PM, Zheng Shao zsh...@gmail.com wrote: What do you mean by product_name is present? If it is not present, does the regexp_extract return NULL? Zheng On Tue, Mar 9, 2010 at 12:13 AM, prakash sejwani prakashsejw...@gmail.com wrote: Hi all, I have a query below FROM ( SELECT h.* FROM ( -- Pull from the access_log SELECT ip, -- Reformat the time from the access log time, dt, --method, resource, protocol, status, length, referer, agent, -- Extract the product_id for the hit from the URL cast( regexp_extract(resource,'\q=([^\]+)', 1) AS STRING) AS product_name FROM a_log ) h )hit -- Insert the hit data into a separate search table INSERT OVERWRITE TABLE search SELECT ip, time, dt, product_name WHERE product_name IS NOT NULL; It is supposed to populate the search table only if product_name is present, but I get all of it. Any help would be appreciated. thanks prakash sejwani econify infotech mumbai -- Yours, Zheng
Re: All Map jobs fail with NPE in LazyStruct.uncheckedGetField
Do you want to try hive release 0.5.0 or hive trunk? We should have provided better error messages here: https://issues.apache.org/jira/browse/HIVE-1216 Zheng On Thu, Mar 4, 2010 at 12:34 PM, Tom Nichols tmnich...@gmail.com wrote: I am trying out Hive, using Cloudera's EC2 distribution (Hadoop 0.18.3, Hive 0.4.1, I believe) I'm trying to run the following query which causes every map task to fail with an NPE before making any progress: java.lang.NullPointerException at org.apache.hadoop.hive.serde2.lazy.LazyStruct.uncheckedGetField(LazyStruct.java:205) at org.apache.hadoop.hive.serde2.lazy.LazyStruct.getField(LazyStruct.java:182) at org.apache.hadoop.hive.serde2.objectinspector.LazySimpleStructObjectInspector.getStructFieldData(LazySimpleStructObjectInspector.java:141) at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.evaluate(ExprNodeColumnEvaluator.java:53) at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:74) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:332) at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:49) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:332) at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:175) at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:71) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47) The query: -- Get the node's max price and corresponding year/day/hour/month select isone.node_id, isone.day, isone.hour, isone.lmp from (select max(lmp) as mlmp, node_id from isone_lmp where isone_lmp.node_id = 400 group by node_id) maxlmp join isone_lmp isone on ( isone.node_id = maxlmp.node_id and isone.lmp=maxlmp.mlmp ); The table: CREATE TABLE isone_lmp ( node_id int, day string, hour int, minute int, energy float, congestion float, loss float, lmp float ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; The data looks like the following: 396,20090120,00,00,62.77,0,.78,63.55 397,20090120,00,00,62.77,0,.65,63.42 398,20090120,00,00,62.77,0,.65,63.42 399,20090120,00,00,62.77,0,.65,63.42 400,20090120,00,00,62.77,0,.65,63.42 401,20090120,00,00,62.77,0,-1.02,61.75 405,20090120,00,00,62.77,0,.21,62.98 It's about 15GB of data total; I can do a simple select count(1) from isone_lmp; which executes as expected. Any thoughts? I've been able to execute the same query on a smaller subset of data (2M rows as opposed to 500M) on a non-distributed setup locally. Thanks. -Tom -- Yours, Zheng
Re: complex query using FROM and INSERT in hive
there is an extra , before FROM cast(regexp_extract(resource, '/companies/(\\d+)', 1) AS INT) AS company_id, -- Run our User Defined Function (see src/com/econify/geoip/IpToCountry.java). Takes the IP of the hit and looks up its country -- ip_to_country(ip) AS ip_country FROM access_log On Tue, Mar 2, 2010 at 7:37 AM, prakash sejwani prakashsejw...@gmail.com wrote: when i run this query from hive console FROM ( SELECT h.*, p.title AS product_sku, p.description AS product_name, c.name AS company_name, c2.id AS product_company_id, c2.name AS product_company_name FROM ( -- Pull from the access_log SELECT ip, ident, user, -- Reformat the time from the access log from_unixtime(cast(unix_ timestamp(time, dd/MMM/:hh:mm:ss Z) AS INT)) AS time, method, resource, protocol, status, length, referer, agent, -- Extract the product_id for the hit from the URL cast(regexp_extract(resource, '/products/(\\d+)', 1) AS INT) AS product_id, -- Extract the company_id for the hit from the URL cast(regexp_extract(resource, '/companies/(\\d+)', 1) AS INT) AS company_id, -- Run our User Defined Function (see src/com/econify/geoip/IpToCountry.java). Takes the IP of the hit and looks up its country -- ip_to_country(ip) AS ip_country FROM access_log ) h -- Join each hit with its product or company (if it has one) LEFT OUTER JOIN products p ON (h.product_id = p.id) LEFT OUTER JOIN companies c ON (h.company_id = c.id) -- If the hit was for a product, we probably didn't get the company_id in the hit subquery, -- so join products.company_id with another instance of the companies table LEFT OUTER JOIN companies c2 ON (p.company_id = c2.id) -- Filter out all hits that weren't for a company or a product WHERE h.product_id IS NOT NULL OR h.company_id IS NOT NULL ) hit -- Insert the hit data into a seperate product_hits table INSERT OVERWRITE TABLE product_hits SELECT ip, ident, user, time, method, resource, protocol, status, length, referer, agent, product_id, product_company_id AS company_id, ip_country, product_name, product_company_name AS company_name WHERE product_name IS NOT NULL -- Insert the hit data insto a seperate company_hits table INSERT OVERWRITE TABLE company_hits SELECT ip, ident, user, time, method, resource, protocol, status, length, referer, agent, company_id, ip_country, company_name WHERE company_name IS NOT NULL; I get the following error FAILED: Parse Error: line 19:6 cannot recognize input 'FROM' in select expression thanks, prakash -- Yours, Zheng
Re: Hive User Group Meeting 3/18/2010 7pm at Facebook
We also created a Meetup group in case you prefer to register on meetup.com http://www.meetup.com/Hive-User-Group-Meeting/calendar/12741356/ We are hosting a Hive User Group Meeting, open to all current and potential hadoop/hive users. Agenda: * Hive Tutorial (Carl Steinbach, cloudera): 20 min * Hive User Case Study (Eva Tse, netflix): 20 min * New Features and API (Hive team, Facebook): 25 min JDBC/ODBC and CTAS(Create Table As Select) UDF/UDAF/UDTF (User-defined Functions) Create View/HBaseInputFormat (Hive and HBase integration) Hive Join Strategy (How Hive does the join) SerDe (Hive's serialization/deserialization framework) Hive is a scalable data warehouse infrastructure built on top of Hadoop. It provides tools to enable easy data ETL, a mechanism to put structures on the data, and the capability to querying and analysis of large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called HiveQL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with MapReduce to be able to plug in their custom mappers and reducers to perform more sophisticated analysis. The current largest deployment of Hive is the silver cluster at Facebook, which consists of 1100 nodes with 8 CPU-cores and 12 1TB-disk each. The total capacity is 8800 CPU-cores with 13 PB of raw storage space. More than 4 TB of compressed data (20+ TB uncompressed) are loaded into Hive every day. If you'd like to network with fellow Hive/Hadoop users online, feel free to find them here: http://www.facebook.com/event.php?eid=319237846974 Zheng On Fri, Feb 26, 2010 at 1:56 PM, Zheng Shao zsh...@gmail.com wrote: Hi all, We are going to hold the second Hive User Group Meeting at 7PM on 3/18/2010 Thursday. The agenda will be: * Hive Tutorial: 20 min * Hive User Case Study: 20 min * New Features and API: 25 min JDBC/ODBC and CTAS UDF/UDAF/UDTF Create View/HBaseInputFormat Hive Join Strategy SerDe The audience is beginner to intermediate Hive users/developers. *** The details are here: http://www.facebook.com/event.php?eid=319237846974 *** *** Please RSVP so we can schedule logistics accordingly. *** -- Yours, Zheng -- Yours, Zheng
Re: hive 0.50 on hadoop 0.22
Hi Massoud, Great work! Yes this is exactly the use of shims. When we see an API change across hadoop versions, we add a new function to the shims interface, and implement it in each of the shims. For this one, you probably want to wrap the logic in Driver.java into a single shim interface function, and implement that function in all shim versions. Does that make sense? Zheng On Mon, Mar 1, 2010 at 1:08 PM, Massoud Mazar massoud.ma...@avg.com wrote: Zheng, Thanks for answering. I've decided to give it (hive 0.50 on hadoop 0.22) a try. I'm a developer, but not a Java developer, so with some initial help I can spend time and work on this. Just to start, I modified the ShimLoader.java and copied the same HADOOP_SHIM_CLASSES and JETTY_SHIM_CLASSES from 0.20 to 0.22 to see where it breaks. I built and deployed hive 0.50 to a running hadoop 0.22 and did show tables; in hive, and I got this: Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.security.UserGroupInformation: method <init>()V not found at org.apache.hadoop.security.UnixUserGroupInformation.<init>(UnixUserGroupInformation.java:69) at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:271) at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:300) at org.apache.hadoop.hive.ql.Driver.<init>(Driver.java:243) at org.apache.hadoop.hive.ql.processors.CommandProcessorFactory.get(CommandProcessorFactory.java:40) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:116) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:181) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:287) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:187) Now, when I look at the UserGroupInformation class in hadoop 0.22 source code, it does not have a parameter-less constructor, but documentation at http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/security/UserGroupInformation.html shows such a constructor. Now, my question is: is this something that can be fixed by shims? Or is it a problem with hadoop?
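A rough sketch of the shim pattern Zheng describes (the interface and class names here are illustrative, not the actual Hive shim API): the version-specific login call that Driver.java currently makes would move behind one interface method, with one implementation per supported Hadoop version, chosen by the shim loader.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

// Hypothetical shim method: Driver.java would call this instead of touching
// UnixUserGroupInformation directly.
interface LoginShim {
  UserGroupInformation getLoginUser(Configuration conf) throws Exception;
}

// Implementation for older Hadoop versions (0.20 and earlier), which still have
// UnixUserGroupInformation, as seen in the stack trace above.
class Hadoop20LoginShim implements LoginShim {
  public UserGroupInformation getLoginUser(Configuration conf) throws Exception {
    return org.apache.hadoop.security.UnixUserGroupInformation.login(conf);
  }
}

// A 0.22 implementation would use whatever login API that version exposes, so only
// this shim class, not Driver.java, has to change per Hadoop release.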
Re: hive 0.50 on hadoop 0.22
Hi Mazar, We have not tried Hive on Hadoop higher than 0.20 yet. However, Hive has the shim infrastructure which makes it easy to port to new Hadoop versions. Please see the shim directory inside Hive. Zheng On Fri, Feb 26, 2010 at 1:59 PM, Massoud Mazar massoud.ma...@avg.com wrote: Is it possible to run release-0.5.0-rc0 on top of hadoop 0.22.0 (trunk)? -- Yours, Zheng
Hive User Group Meeting 3/18/2010 7pm at Facebook
Hi all, We are going to hold the second Hive User Group Meeting at 7PM on 3/18/2010 Thursday. The agenda will be: * Hive Tutorial: 20 min * Hive User Case Study: 20 min * New Features and API: 25 min JDBC/ODBC and CTAS UDF/UDAF/UDTF Create View/HBaseInputFormat Hive Join Strategy SerDe The audience is beginner to intermediate Hive users/developers. *** The details are here: http://www.facebook.com/event.php?eid=319237846974 *** *** Please RSVP so we can schedule logistics accordingly. *** -- Yours, Zheng
Re: How to generate Row Id in Hive?
Since Hive runs many mappers/reducers in parallel, there is no way to generate a globally unique increasing row id. If you are OK with that, you can easily write a non-deterministic UDF. See rand() (or UDFRand.java) for example. Please open a JIRA if you plan to work on that. Zheng On Wed, Feb 24, 2010 at 6:47 PM, Weiwei Hsieh whs...@slingmedia.com wrote: All, Could anyone tell me on how to generate a row id for a new record in Hive? Many thanks. weiwei -- Yours, Zheng
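To make that concrete, a minimal sketch of such a non-deterministic UDF, modeled on UDFRand, is below. The class name is illustrative; the counter is only unique within a single mapper or reducer task, not across the whole query. It would be registered with create temporary function like any other UDF.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.LongWritable;

// Must be marked non-deterministic so Hive does not fold or cache the result.
@UDFType(deterministic = false)
public class UDFRowCounter extends UDF {
  private final LongWritable counter = new LongWritable(0);

  public LongWritable evaluate() {
    counter.set(counter.get() + 1);  // 1, 2, 3, ... within each task
    return counter;
  }
}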
Re: Execution Error
Most probably $TMPDIR does not exist. I think by default it's /tmp/user. Can you mkdir ? On Thu, Feb 25, 2010 at 5:58 AM, Aryeh Berkowitz ar...@iswcorp.com wrote: Can anybody tell me why I’m getting this error? hive show tables; OK email html_href html_src ipadrr phone urls Time taken: 0.129 seconds hive SELECT DISTINCT a.url, a.signature, a.size from urls a; Total MapReduce jobs = 1 Launching Job 1 out of 1 java.io.IOException: No such file or directory at java.io.UnixFileSystem.createFileExclusively(Native Method) at java.io.File.checkAndCreate(File.java:1704) at java.io.File.createTempFile(File.java:1792) at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:87) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:107) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:55) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:630) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:504) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:382) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:138) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:197) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:303) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask -- Yours, Zheng
Re: How to generate Row Id in Hive?
Not right now. It should be pretty simple to do though. We can expose the current JobConf via a static method in ExecMapper. Zheng On Thu, Feb 25, 2010 at 7:52 AM, Todd Lipcon t...@cloudera.com wrote: Zheng: is there a way to get at the hadoop conf variables from within a query? If so, you could use mapred.task.id to get a unique string. -Todd On Thu, Feb 25, 2010 at 12:42 AM, Zheng Shao zsh...@gmail.com wrote: Since Hive runs many mappers/reducers in parallel, there is no way to generate a globally unique increasing row id. If you are OK with that, you can easily write a non-deterministic UDF. See rand() (or UDFRand.java) for example. Please open a JIRA if you plan to work on that. Zheng On Wed, Feb 24, 2010 at 6:47 PM, Weiwei Hsieh whs...@slingmedia.com wrote: All, Could anyone tell me on how to generate a row id for a new record in Hive? Many thanks. weiwei -- Yours, Zheng -- Yours, Zheng
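Until the JobConf is exposed that way, one workaround in the spirit of Todd's suggestion (a hedged sketch only, not an existing Hive feature) is to substitute a per-task random prefix for mapred.task.id and append a counter; uniqueness is then probabilistic rather than guaranteed, and the ids are not increasing.

import java.util.UUID;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.Text;

@UDFType(deterministic = false)
public class UDFUniqueId extends UDF {
  // Random prefix generated once per task JVM, standing in for mapred.task.id.
  private final String prefix = UUID.randomUUID().toString();
  private long counter = 0;
  private final Text result = new Text();

  public Text evaluate() {
    result.set(prefix + "-" + (counter++));
    return result;
  }
}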
[ANNOUNCE] Hive 0.5.0 released
Hi folks, We have released Hive 0.5.0. You can find it from the download page in 24 hours (still waiting to be mirrored) http://hadoop.apache.org/hive/releases.html#Download -- Yours, Zheng
Re: [ANNOUNCE] Hive 0.5.0 released
Thanks for the feedback. Which exact version of hadoop are you using? There is a bug in hadoop combinefileinputformat that was fixed recently. Zheng On 2/24/10, Ryan LeCompte lecom...@gmail.com wrote: Actually, I just fixed the problem by removing the following in hive-site.xml: property namehive.input.format/name valueorg.apache.hadoop.hive.ql.io.CombineHiveInputFormat/value /property Any reason why specifying the above would cause the error? We are using latest version of Hadoop. Thanks, Ryan On Wed, Feb 24, 2010 at 10:40 AM, Ryan LeCompte lecom...@gmail.com wrote: I actually just tried doing this (using same metastoredb, just using 0.5.0 release code), and now when I execute a simple query it immediately fails with the following in hive.log: 2010-02-24 10:39:31,950 WARN mapred.JobClient (JobClient.java:configureCommandLineOptions(539)) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2010-02-24 10:39:33,535 ERROR exec.ExecDriver (SessionState.java:printError(248)) - Ended Job = job_201002241035_0002 with errors 2010-02-24 10:39:33,555 ERROR ql.Driver (SessionState.java:printError(248)) - FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.ExecDriver Any ideas how to get this working? Thanks, Ryan On Wed, Feb 24, 2010 at 8:20 AM, Massoud Mazar massoud.ma...@avg.comwrote: Is it compatible with release-0.4.1-rc2 so I can just replace the code? -Original Message- From: Zheng Shao [mailto:zsh...@gmail.com] Sent: Wednesday, February 24, 2010 3:34 AM To: hive-user@hadoop.apache.org; hive-...@hadoop.apache.org Subject: [ANNOUNCE] Hive 0.5.0 released Hi folks, We have released Hive 0.5.0. You can find it from the download page in 24 hours (still waiting to be mirrored) http://hadoop.apache.org/hive/releases.html#Download -- Yours, Zheng -- Sent from my mobile device Yours, Zheng
Re: [ANNOUNCE] Hive 0.5.0 released
Yes, see http://issues.apache.org/jira/browse/HADOOP-5759?page=com.atlassian.jira.plugin.ext.subversion:subversion-commits-tabpanel The fix is committed to Hadoop 0.20.2 and 0.21.0. But you can continue to use Hive 0.5.0 if you remove that configuration. Zheng On Wed, Feb 24, 2010 at 10:17 AM, Ryan LeCompte lecom...@gmail.com wrote: Ah, interesting. Using Hadoop 0.20.1. Is this the problematic version? Thanks, Ryan On Wed, Feb 24, 2010 at 12:50 PM, Zheng Shao zsh...@gmail.com wrote: Thanks for the feedback. Which exact version of hadoop are you using? There is a bug in hadoop combinefileinputformat that was fixed recently. Zheng On 2/24/10, Ryan LeCompte lecom...@gmail.com wrote: Actually, I just fixed the problem by removing the following in hive-site.xml: property namehive.input.format/name valueorg.apache.hadoop.hive.ql.io.CombineHiveInputFormat/value /property Any reason why specifying the above would cause the error? We are using latest version of Hadoop. Thanks, Ryan On Wed, Feb 24, 2010 at 10:40 AM, Ryan LeCompte lecom...@gmail.com wrote: I actually just tried doing this (using same metastoredb, just using 0.5.0 release code), and now when I execute a simple query it immediately fails with the following in hive.log: 2010-02-24 10:39:31,950 WARN mapred.JobClient (JobClient.java:configureCommandLineOptions(539)) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2010-02-24 10:39:33,535 ERROR exec.ExecDriver (SessionState.java:printError(248)) - Ended Job = job_201002241035_0002 with errors 2010-02-24 10:39:33,555 ERROR ql.Driver (SessionState.java:printError(248)) - FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.ExecDriver Any ideas how to get this working? Thanks, Ryan On Wed, Feb 24, 2010 at 8:20 AM, Massoud Mazar massoud.ma...@avg.comwrote: Is it compatible with release-0.4.1-rc2 so I can just replace the code? -Original Message- From: Zheng Shao [mailto:zsh...@gmail.com] Sent: Wednesday, February 24, 2010 3:34 AM To: hive-user@hadoop.apache.org; hive-...@hadoop.apache.org Subject: [ANNOUNCE] Hive 0.5.0 released Hi folks, We have released Hive 0.5.0. You can find it from the download page in 24 hours (still waiting to be mirrored) http://hadoop.apache.org/hive/releases.html#Download -- Yours, Zheng -- Sent from my mobile device Yours, Zheng -- Yours, Zheng
Re: Error while starting hive
export HADOOP_CLASSPATH=/master/hadoop/json.jar:/master/hadoop/hbase-0.20.2/hbase-0.20.2.jar:/master/hadoop/hbase-0.20.2/lib/zookeeper-3.2.1.jar:/master/hadoop/hive/build/dist/lib/:/master/hadoop/hive/build/dist/lib/*.jar:/master/hadoop/hive/build/dist/conf/ should be: export HADOOP_CLASSPATH=/master/hadoop/json.jar:/master/hadoop/hbase-0.20.2/hbase-0.20.2.jar:/master/hadoop/hbase-0.20.2/lib/zookeeper-3.2.1.jar:/master/hadoop/hive/build/dist/lib/:/master/hadoop/hive/build/dist/lib/*.jar:/master/hadoop/hive/build/dist/conf/:$HADOOP_CLASSPATH Zheng On Sun, Feb 21, 2010 at 11:19 PM, Mafish Liu maf...@gmail.com wrote: This happens when hive fails to find hive jar files. Did you specify HIVE_HOME and HIVE_LIB in your system? 2010/2/22 Vidyasagar Venkata Nallapati vidyasagar.nallap...@onmobile.com: Hi, While starting hive I am still getting an error, attached are the hadoop env and hive-ste I am using phoe...@ph1:/master/hadoop/hive/build/dist$ bin/hive Exception in thread main java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:247) at org.apache.hadoop.util.RunJar.main(RunJar.java:149) Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at java.lang.ClassLoader.loadClass(ClassLoader.java:252) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) ... 3 more Regards Vidya DISCLAIMER: The information in this message is confidential and may be legally privileged. It is intended solely for the addressee. Access to this message by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, or distribution of the message, or any action or omission taken by you in reliance on it, is prohibited and may be unlawful. Please immediately contact the sender if you have received this message in error. Further, this e-mail may contain viruses and all reasonable precaution to minimize the risk arising there from is taken by OnMobile. OnMobile is not liable for any damage sustained by you as a result of any virus in this e-mail. All applicable virus checks should be carried out by you before opening this e-mail or any attachment thereto. Thank you - OnMobile Global Limited. -- maf...@gmail.com -- Yours, Zheng
[VOTE] hive 0.5.0 release (rc1)
Hi, I just made a release candidate at https://svn.apache.org/repos/asf/hadoop/hive/tags/release-0.5.0-rc1 The tarballs are at: http://people.apache.org/~zshao/hive-0.5.0-candidate-1/ The HWI startup problem is fixed in rc1. This supersedes the previous email about voting on rc0. Please vote. -- Yours, Zheng
Re: [VOTE] hive 0.5.0 release candidate 0
Can you generate a patch for 0.5? The patch does not work on branch-0.5 Zheng On 2/19/10, Edward Capriolo edlinuxg...@gmail.com wrote: On Fri, Feb 19, 2010 at 9:49 PM, Zheng Shao zsh...@gmail.com wrote: Hi, I just made a release candidate at https://svn.apache.org/repos/asf/hadoop/hive/tags/release-0.5.0-rc0 The tarballs are at: http://people.apache.org/~zshao/hive-0.4.1-candidate-3/ Please vote. -- Yours, Zheng -1 I would like to fix https://issues.apache.org/jira/browse/HIVE-1183 -- Sent from my mobile device Yours, Zheng
Re: SequenceFile compression on Amazon EMR not very good
hive.exec.compress.output controls whether or not to compress hive output. (This overrides mapred.output.compress in Hive). All other compression flags are from hadoop. Please see http://hadoop.apache.org/common/docs/r0.18.0/hadoop-default.html Zheng On Fri, Feb 19, 2010 at 5:53 AM, Saurabh Nanda saurabhna...@gmail.com wrote: And also hive.exec.compress.*. So that makes it three sets of configuration variables: mapred.output.compress.* io.seqfile.compress.* hive.exec.compress.* What's the relationship between these configuration parameters and which ones should I set to achieve a well compress output table? Saurabh. On Fri, Feb 19, 2010 at 7:16 PM, Saurabh Nanda saurabhna...@gmail.com wrote: I'm confused here Zheng. There are two sets of configuration variables. Those starting with io.* and those starting with mapred.*. For making sure that the final output table is compressed, which ones do I have to set? Saurabh. On Fri, Feb 19, 2010 at 12:37 AM, Zheng Shao zsh...@gmail.com wrote: Did you also: SET mapred.output.compression.codec=org.apacheGZipCode; Zheng On Thu, Feb 18, 2010 at 8:25 AM, Saurabh Nanda saurabhna...@gmail.com wrote: Hi Zheng, I cross checked. I am setting the following in my Hive script before the INSERT command: SET io.seqfile.compression.type=BLOCK; SET hive.exec.compress.output=true; A 132 MB (gzipped) input file going through a cleanup and getting populated in a sequencefile table is growing to 432 MB. What could be going wrong? Saurabh. On Wed, Feb 3, 2010 at 2:26 PM, Saurabh Nanda saurabhna...@gmail.com wrote: Thanks, Zheng. Will do some more tests and get back. Saurabh. On Mon, Feb 1, 2010 at 1:22 PM, Zheng Shao zsh...@gmail.com wrote: I would first check whether it is really the block compression or record compression. Also maybe the block size is too small but I am not sure that is tunable in SequenceFile or not. Zheng On Sun, Jan 31, 2010 at 9:03 PM, Saurabh Nanda saurabhna...@gmail.com wrote: Hi, The size of my Gzipped weblog files is about 35MB. However, upon enabling block compression, and inserting the logs into another Hive table (sequencefile), the file size bloats up to about 233MB. I've done similar processing on a local Hadoop/Hive cluster, and while the compressions is not as good as gzipping, it still is not this bad. What could be going wrong? I looked at the header of the resulting file and here's what it says: SEQ^Forg.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec Does Amazon Elastic MapReduce behave differently or am I doing something wrong? Saurabh. -- http://nandz.blogspot.com http://foodieforlife.blogspot.com -- Yours, Zheng -- http://nandz.blogspot.com http://foodieforlife.blogspot.com -- http://nandz.blogspot.com http://foodieforlife.blogspot.com -- Yours, Zheng -- http://nandz.blogspot.com http://foodieforlife.blogspot.com -- http://nandz.blogspot.com http://foodieforlife.blogspot.com -- Yours, Zheng
Re: Thrift Server Error Messages
Can you open a JIRA and help propose some concrete design of the change? That will help make it faster to have this feature. Thanks, Zheng On Fri, Feb 19, 2010 at 6:17 AM, Andy Kent andy.k...@forward.co.uk wrote: When executing commands on the hive command line it give really useful output if you have syntax errors in your query. When using the Thrift interface I seem to only be able to get errors like 'Error code: 11'. Is there a way to get at the human friendly error messages via the thrift interface? If not, is there a list of the thrift error codes and what they mean anywhere? If it's not available it would really great if this could be exposed via thrift. Thanks, Andy. -- Yours, Zheng
Re: computing median and percentiles
Hi Jerome, Is there any update on this? https://issues.apache.org/jira/browse/HIVE-259 Zheng On Fri, Feb 5, 2010 at 9:34 AM, Jerome Boulon jbou...@netflix.com wrote: Hi Bryan, I'm working on Hive-259. I'll post an update early next week. /Jerome. On 2/4/10 9:08 PM, Bryan Talbot btal...@aeriagames.com wrote: What's the best way to compute median and other percentiles using Hive 0.40? I've run across http://issues.apache.org/jira/browse/HIVE-259 but there doesn't seem to be any planned implementation yet. -Bryan -- Yours, Zheng
Re: Having trouble with lateral view
Jason, Do you want to open a JIRA and contrib your map_explode function to Hive? That will be greatly appreciated. Zheng On Fri, Feb 19, 2010 at 2:49 PM, Yongqiang He heyongqi...@software.ict.ac.cn wrote: Hi Jason, This is a known bug, see https://issues.apache.org/jira/browse/HIVE-1056 You can first disable ppd with “set hive.optimize.ppd=false;” Thanks Yongqiang On 2/19/10 2:23 PM, Jason Michael jmich...@videoegg.com wrote: I’m currently running a hive build from trunk, revision number 911889. I’ve built a UDTF called map_explode which just emits the key and value of each entry in a map as a row in the result table. The table I’m running it against looks like: hive describe mytable; product string from deserializer ... interactions mapstring,int from deserializer If I use the map_explode in the select clause, I get the expected results: hive select map_explode(interactions) as (key, value) from mytable where day = '2010-02-18' and hour = 1 limit 10; ... OK invite_impression 1 invite_impression 1 invite_impression 1 invite_impression 1 rollout 12 invite_impression 1 invite_impression 1 invite_impression 1 rollout 4 invite_impression 1 Time taken: 22.11 seconds However, if I try to use LATERAL JOIN to relate the exploded values back to the parent table, like so: hive select product, key, sum(value) from mytable LATERAL VIEW map_explode(interactions) interacts as key, value where day = '2010-02-18' and hour = 1 group by product, key; I get the following error: FAILED: Unknown exception: null Looking in hive.log, I see the follow stack trace: 2010-02-19 14:15:17,215 ERROR ql.Driver (SessionState.java:printError(255)) - FAILED: Unknown exception: null java.lang.NullPointerException at org.apache.hadoop.hive.ql.ppd.ExprWalkerProcFactory$ColumnExprProcessor.process(ExprWalkerProcFactory.java:87) at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:129) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:103) at org.apache.hadoop.hive.ql.ppd.ExprWalkerProcFactory.extractPushdownPreds(ExprWalkerProcFactory.java:273) at org.apache.hadoop.hive.ql.ppd.OpProcFactory$DefaultPPD.mergeWithChildrenPred(OpProcFactory.java:317) at org.apache.hadoop.hive.ql.ppd.OpProcFactory$DefaultPPD.process(OpProcFactory.java:258) at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:129) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:103) at org.apache.hadoop.hive.ql.ppd.PredicatePushDown.transform(PredicatePushDown.java:103) at org.apache.hadoop.hive.ql.optimizer.Optimizer.optimize(Optimizer.java:74) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:5758) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:125) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:304) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:377) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:138) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:197) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:303) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) I peeked at ExprWalkerProcFactory, but couldn’t readily see what was causing the problem. Any ideas? Jason -- Yours, Zheng
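For readers curious what such a UDTF involves, a rough sketch of a map_explode-style function is below (this is not Jason's actual code; the package and class names are illustrative). It emits one (key, value) row per entry of the input map. The function would then be registered with create temporary function and used exactly as in Jason's query.

package com.example.hive.udtf;  // hypothetical package

import java.util.ArrayList;
import java.util.Map;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.MapObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;

public class GenericUDTFMapExplode extends GenericUDTF {

  private MapObjectInspector mapOI;
  private final Object[] forwardObj = new Object[2];

  @Override
  public StructObjectInspector initialize(ObjectInspector[] args)
      throws UDFArgumentException {
    if (args.length != 1 || !(args[0] instanceof MapObjectInspector)) {
      throw new UDFArgumentException("map_explode() takes a single map argument");
    }
    mapOI = (MapObjectInspector) args[0];

    // Output is a struct with two columns: the map key and the map value.
    ArrayList<String> fieldNames = new ArrayList<String>();
    ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
    fieldNames.add("key");
    fieldNames.add("value");
    fieldOIs.add(mapOI.getMapKeyObjectInspector());
    fieldOIs.add(mapOI.getMapValueObjectInspector());
    return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
  }

  @Override
  public void process(Object[] args) throws HiveException {
    Map<?, ?> map = mapOI.getMap(args[0]);
    if (map == null) {
      return;
    }
    for (Map.Entry<?, ?> entry : map.entrySet()) {
      forwardObj[0] = entry.getKey();
      forwardObj[1] = entry.getValue();
      forward(forwardObj);  // one output row per map entry
    }
  }

  @Override
  public void close() throws HiveException {
    // nothing to clean up
  }
}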
[VOTE] hive 0.5.0 release candidate 0
Hi, I just made a release candidate at https://svn.apache.org/repos/asf/hadoop/hive/tags/release-0.5.0-rc0 The tarballs are at: http://people.apache.org/~zshao/hive-0.4.1-candidate-3/ Please vote. -- Yours, Zheng
Re: SequenceFile compression on Amazon EMR not very good
Did you also: SET mapred.output.compression.codec=org.apacheGZipCode; Zheng On Thu, Feb 18, 2010 at 8:25 AM, Saurabh Nanda saurabhna...@gmail.com wrote: Hi Zheng, I cross checked. I am setting the following in my Hive script before the INSERT command: SET io.seqfile.compression.type=BLOCK; SET hive.exec.compress.output=true; A 132 MB (gzipped) input file going through a cleanup and getting populated in a sequencefile table is growing to 432 MB. What could be going wrong? Saurabh. On Wed, Feb 3, 2010 at 2:26 PM, Saurabh Nanda saurabhna...@gmail.com wrote: Thanks, Zheng. Will do some more tests and get back. Saurabh. On Mon, Feb 1, 2010 at 1:22 PM, Zheng Shao zsh...@gmail.com wrote: I would first check whether it is really the block compression or record compression. Also maybe the block size is too small but I am not sure that is tunable in SequenceFile or not. Zheng On Sun, Jan 31, 2010 at 9:03 PM, Saurabh Nanda saurabhna...@gmail.com wrote: Hi, The size of my Gzipped weblog files is about 35MB. However, upon enabling block compression, and inserting the logs into another Hive table (sequencefile), the file size bloats up to about 233MB. I've done similar processing on a local Hadoop/Hive cluster, and while the compressions is not as good as gzipping, it still is not this bad. What could be going wrong? I looked at the header of the resulting file and here's what it says: SEQ^Forg.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec Does Amazon Elastic MapReduce behave differently or am I doing something wrong? Saurabh. -- http://nandz.blogspot.com http://foodieforlife.blogspot.com -- Yours, Zheng -- http://nandz.blogspot.com http://foodieforlife.blogspot.com -- http://nandz.blogspot.com http://foodieforlife.blogspot.com -- Yours, Zheng
Re: Question on modifying a table to become external
There is no command to do that right now. One way to go is to create another external table pointing to the same location (and forget about the old table). Or you can move the files first, before dropping and recreating the same table. Zheng On Thu, Feb 18, 2010 at 10:22 AM, Eva Tse e...@netflix.com wrote: We created a table without the ‘EXTERNAL’ qualifier but did specify a location for the warehouse. We would like to modify this to be an external table. We tried to drop the table, but it does delete the files in the S3 external location. Is there a way we could achieve this? Thanks, Eva. CREATE TABLE IF NOT EXISTS exampletable ( other_properties Map<string, string>, event_ts_ms bigint, hostname string ) PARTITIONED by (dateint int, hour int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' COLLECTION ITEMS TERMINATED BY '\004' MAP KEYS TERMINATED BY '\002' stored as SEQUENCEFILE LOCATION 's3n://bucketname/hive/warehouse/exampletable'; -- Yours, Zheng
Re: map join and OOM
https://issues.apache.org/jira/browse/HIVE-917 might be what you want (suppose both of the tables are already bucketed on the join column). Zheng On Thu, Feb 18, 2010 at 2:53 PM, Ning Zhang nzh...@facebook.com wrote: 1GB of the small table is usually too large for map-side joins. If the raw data is 1GB, it could be 10x larger when it is read into main memory as Java objects. Our default value is 10MB. Another factor to determine whether to use map-side join is the number of rows in the small table. If it is too large, each mapper will spend long time to process the join (each mapper reads the whole small table into a hash table in main memory and joins a split of the large table). Thanks, Ning On Feb 18, 2010, at 2:45 PM, Edward Capriolo wrote: I have Hive 4.1-rc2. My query runs in Time taken: 312.956 seconds using the map/reduce join. I was interested in using mapjoin, I get an OOM error. hive java.lang.OutOfMemoryError: GC overhead limit exceeded at org.apache.hadoop.hive.ql.util.jdbm.recman.RecordFile.getNewNode(RecordFile.java:369) My pageviews is 8GB and my client_ips is ~ 1GB property namemapred.child.java.opts/name value-Xmx778m/value /property [ecapri...@nyhadoopdata10 ~]$ hive Hive history file=/tmp/ecapriolo/hive_job_log_ecapriolo_201002181717_253155276.txt hive explain Select /*+ MAPJOIN( client_ips )*/clientip_id,client_ip, SUM(bytes_sent) as X from pageviews join client_ips on pageviews.clientip_id=client_ips.id where year=2010 AND month=02 and day=17 group by clientip_id,client_ip ; OK ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF pageviews) (TOK_TABREF client_ips) (= (. (TOK_TABLE_OR_COL pageviews) clientip_id) (. (TOK_TABLE_OR_COL client_ips) id (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST client_ips))) (TOK_SELEXPR (TOK_TABLE_OR_COL clientip_id)) (TOK_SELEXPR (TOK_TABLE_OR_COL client_ip)) (TOK_SELEXPR (TOK_FUNCTION SUM (TOK_TABLE_OR_COL bytes_sent)) X)) (TOK_WHERE (and (AND (= (TOK_TABLE_OR_COL year) 2010) (= (TOK_TABLE_OR_COL month) 02)) (= (TOK_TABLE_OR_COL day) 17))) (TOK_GROUPBY (TOK_TABLE_OR_COL clientip_id) (TOK_TABLE_OR_COL client_ip STAGE DEPENDENCIES: Stage-1 is a root stage Stage-2 depends on stages: Stage-1 Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias - Map Operator Tree: pageviews TableScan alias: pageviews Filter Operator predicate: expr: (((UDFToDouble(year) = UDFToDouble(2010)) and (UDFToDouble(month) = UDFToDouble(2))) and (UDFToDouble(day) = UDFToDouble(17))) type: boolean Common Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {clientip_id} {bytes_sent} {year} {month} {day} 1 {client_ip} keys: 0 1 outputColumnNames: _col13, _col17, _col22, _col23, _col24, _col26 Position of Big Table: 0 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat Local Work: Map Reduce Local Work Alias - Map Local Tables: client_ips Fetch Operator limit: -1 Alias - Map Local Operator Tree: client_ips TableScan alias: client_ips Common Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {clientip_id} {bytes_sent} {year} {month} {day} 1 {client_ip} keys: 0 1 outputColumnNames: _col13, _col17, _col22, _col23, _col24, _col26 Position of Big Table: 0 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output 
format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat Stage: Stage-2 Map Reduce Alias - Map Operator Tree: hdfs://nyhadoopname1.ops.about.com:8020/tmp/hive-ecapriolo/975920219/10002 Select Operator expressions: expr: _col13 type: int expr: _col17 type: int expr: _col22 type: string expr: _col23
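As a conceptual illustration of what Ning describes (this is not Hive's actual implementation), each mapper in a map-side join effectively does the following: load the whole small table into an in-memory hash table, then probe it with every row of its big-table split. This is why the small table's in-memory footprint as Java objects, not just its on-disk size, is what matters.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapJoinSketch {
  // smallTable and bigSplit are rows as string arrays; column 0 is the join key.
  public static List<String> join(List<String[]> smallTable, List<String[]> bigSplit) {
    // Build phase: every mapper holds the entire small table in memory.
    Map<String, String[]> hash = new HashMap<String, String[]>();
    for (String[] row : smallTable) {
      hash.put(row[0], row);
    }
    // Probe phase: stream the big-table split and look up each key.
    List<String> out = new ArrayList<String>();
    for (String[] row : bigSplit) {
      String[] match = hash.get(row[0]);
      if (match != null) {
        out.add(row[0] + "\t" + row[1] + "\t" + match[1]);
      }
    }
    return out;
  }
}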
Re: Hive Server Leaking File Descriptors?
This is actually a bug in MAPREDUCE-1504, but we will try to find a workaround. https://issues.apache.org/jira/browse/HIVE-1181 Given that release 0.5.0 is much wanted right now, I don't think we want to wait purely for 0.5.0 since the ultimate fix should come from Hadoop. We will definitely get HIVE-1181 for branch 0.5. Zheng -- Forwarded message -- From: Andy Kent andy.k...@forward.co.uk Date: Thu, Feb 18, 2010 at 3:17 PM Subject: Re: Hive Server Leaking File Descriptors? To: hive-user@hadoop.apache.org hive-user@hadoop.apache.org On 18 Feb 2010, at 20:29, Zheng Shao zsh...@gmail.com wrote: I've tried to look into it a bit more and it seems to happen on load data inpath This is inline with what we have been seeing as we do around 200 load data statements per day and leak approx the same number of file descriptors. Is there any chance this fix will make it into the 0.5 release? -- Yours, Zheng
Re: NoClassDef error
The stacktrace that you showed is from the hive cli right? Did you define HADOOP_CLASSPATH somewhere? Hive modifies HADOOP_CLASSPATH so it's important to modify it by export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/my/new/path instead of directly overwriting it. Zheng On Thu, Feb 18, 2010 at 9:22 PM, Vidyasagar Venkata Nallapati vidyasagar.nallap...@onmobile.com wrote: Hi, I have kept the hive/conf in the HADOOP_CLASSPATH Also I have verified that there are no hive jars in the hadoop directory and also added the property namehadoop.bin.path/name value/usr/bin/hadoop/value !-- note that the hive shell script also uses this property name -- descriptionPath to hadoop binary. Assumes that by default we are executing from hive/description /property But am still getting the same error if a run on multi node cluster, its working in a single node setup. Regards Vidyasagar N V From: Yi Mao [mailto:ymaob...@gmail.com] Sent: Wednesday, February 17, 2010 11:28 PM To: hive-user@hadoop.apache.org Subject: Re: NoClassDef error I think you also need hive/conf in the classpath. On Wed, Feb 17, 2010 at 2:23 AM, Vidyasagar Venkata Nallapati vidyasagar.nallap...@onmobile.com wrote: Hi , When starting the hive I am getting an error even after I am including in class path, attached is the hadoop-env I am using. Exception in thread main java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:247) at org.apache.hadoop.util.RunJar.main(RunJar.java:149) Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at java.lang.ClassLoader.loadClass(ClassLoader.java:252) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) ... 3 more Regards Vidyasagar N V DISCLAIMER: The information in this message is confidential and may be legally privileged. It is intended solely for the addressee. Access to this message by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, or distribution of the message, or any action or omission taken by you in reliance on it, is prohibited and may be unlawful. Please immediately contact the sender if you have received this message in error. Further, this e-mail may contain viruses and all reasonable precaution to minimize the risk arising there from is taken by OnMobile. OnMobile is not liable for any damage sustained by you as a result of any virus in this e-mail. All applicable virus checks should be carried out by you before opening this e-mail or any attachment thereto. Thank you - OnMobile Global Limited. DISCLAIMER: The information in this message is confidential and may be legally privileged. It is intended solely for the addressee. Access to this message by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, or distribution of the message, or any action or omission taken by you in reliance on it, is prohibited and may be unlawful. Please immediately contact the sender if you have received this message in error. Further, this e-mail may contain viruses and all reasonable precaution to minimize the risk arising there from is taken by OnMobile. OnMobile is not liable for any damage sustained by you as a result of any virus in this e-mail. 
All applicable virus checks should be carried out by you before opening this e-mail or any attachment thereto. Thank you - OnMobile Global Limited. -- Yours, Zheng
Re: NoClassDef error
In which directory did you start hive? hive should be started in build/dist Zheng On Wed, Feb 17, 2010 at 2:23 AM, Vidyasagar Venkata Nallapati vidyasagar.nallap...@onmobile.com wrote: Hi , When starting the hive I am getting an error even after I am including in class path, attached is the hadoop-env I am using. Exception in thread main java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:247) at org.apache.hadoop.util.RunJar.main(RunJar.java:149) Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at java.lang.ClassLoader.loadClass(ClassLoader.java:252) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) ... 3 more Regards Vidyasagar N V DISCLAIMER: The information in this message is confidential and may be legally privileged. It is intended solely for the addressee. Access to this message by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, or distribution of the message, or any action or omission taken by you in reliance on it, is prohibited and may be unlawful. Please immediately contact the sender if you have received this message in error. Further, this e-mail may contain viruses and all reasonable precaution to minimize the risk arising there from is taken by OnMobile. OnMobile is not liable for any damage sustained by you as a result of any virus in this e-mail. All applicable virus checks should be carried out by you before opening this e-mail or any attachment thereto. Thank you - OnMobile Global Limited. -- Yours, Zheng
Re: Help with Compressed Storage
I just corrected the wiki page. It will also be a good idea to support case-insensitive boolean values in the code. Zheng On Wed, Feb 17, 2010 at 9:27 AM, Brent Miller brentalanmil...@gmail.com wrote: Thanks Adam, that works for me as well. It seems that the property for hive.exec.compress.output is case sensitive, and when it is set to TRUE (as it is on the compressed storage page on the wiki) it is ignored by hive. -Brent On Tue, Feb 16, 2010 at 4:24 PM, Adam O'Donnell a...@immunet.com wrote: Adding these to my hive-site.xml file worked fine: property namehive.exec.compress.output/name valuetrue/value descriptionCompress output/description /property property namemapred.output.compression.type/name valueBLOCK/value descriptionBlock compression/description /property On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller brentalanmil...@gmail.com wrote: Hello, I've seen issues similar to this one come up once or twice before, but I haven't ever seen a solution to the problem that I'm having. I was following the Compressed Storage page on the Hive Wiki http://wiki.apache.org/hadoop/CompressedStorage and realized that the sequence files that are created in the warehouse directory are actually uncompressed and larger than than the originals. For example, I have a table 'test1' who's input data looks something like: 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,BD43 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341 ... And after creating a second table 'test1_comp' that was crated with the STORED AS SEQUENCEFILE directive and the compression options SET as described in the wiki, I can look at the resultant sequence files and see that they're just plain (uncompressed) text: SEQ org.apache.hadoop.io.BytesWritable org.apache.hadoop.io.Text+�c�!Y�M �� Z^��= 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,BD43= 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43= 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341= 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141= ... I've tried messing around with different org.apache.hadoop.io.compress.* options, but the sequence files always come out uncompressed. Has anybody ever seen this or know away to keep the data compressed? Since the input text is so uniform, we get huge space savings from compression and would like to store the data this way if possible. I'm using Hadoop 20.1 and Hive that I checked out from SVN about a week ago. Thanks, Brent -- Adam J. O'Donnell, Ph.D. Immunet Corporation Cell: +1 (267) 251-0070 -- Yours, Zheng
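A tiny sketch of the case-insensitive handling suggested above (illustrative only, not the actual Hive code path):

public class BooleanConfValue {
  // Treats "true", "TRUE", "True", ... identically; anything else is false.
  public static boolean parse(String value, boolean defaultValue) {
    if (value == null) {
      return defaultValue;
    }
    return value.trim().equalsIgnoreCase("true");
  }
}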
Re: hive ant spead ups
I think this is worth exploring. Unit test time keeps getting longer as we add more code and more tests. Do you want to start a JIRA issue and discuss it more there? Zheng On Wed, Feb 17, 2010 at 8:53 AM, Edward Capriolo edlinuxg...@gmail.com wrote: I made an ant target quick-test, which differs from test in that it has no dependencies. <target name="quick-test"> <iterate target="test"/> </target> <target name="test" depends="clean-test,jar"> <iterate target="test"/> <iterate-cpp target="test"/> </target> time ant -Dhadoop.version='0.18.3' -Doffline=true -Dtestcase=TestCliDriver -Dqfile=alter1.q quick-test BUILD SUCCESSFUL Total time: 15 seconds real 0m16.250s user 0m20.965s sys 0m1.579s time ant -Dhadoop.version='0.18.3' -Doffline=true -Dtestcase=TestCliDriver -Dqfile=alter1.q test BUILD SUCCESSFUL Total time: 26 seconds real 0m26.564s user 0m31.307s sys 0m2.346s It goes without saying that Hive's ant build is very different than a makefile. Most makefiles can set simple flags, say 'make.ok', so that running a target like 'make install' will not cause the dependent tasks to be re-run. Excuse my ignorance if this is some built-in ant switch like '--no-deps'. Should we set flags in hive so the build process can intelligently skip work that is already done? -- Yours, Zheng
Re: Help with Compressed Storage
There is no special setting for bz2. Can you get the debug log? Zheng On Wed, Feb 17, 2010 at 9:02 PM, prasenjit mukherjee pmukher...@quattrowireless.com wrote: So I tried the same with .gz files and it worked. I am using the following hadoop version :Hadoop 0.20.1+169.56 with cloudera's ami-2359bf4a. I thought that hadoop0.20 does support bz2 compression, hence same should work with hive as well. Interesting note is that Pig works fine on the same bz2 data. Is there any tweaking/config setup I need to do for hive to take bz2 files as input ? On Thu, Feb 18, 2010 at 8:31 AM, prasenjit mukherjee pmukher...@quattrowireless.com wrote: I have a similar issue with bz2 files. I have the hadoop directories : /ip/data/ : containing unzipped text files ( foo1.txt, foo2.txt ) /ip/datacompressed/ : containing same files bzipped ( foo1.bz2, foo2.bz2 ) CREATE EXTERNAL TABLE tx_log(id1 string, id2 string, id3 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002' LOCATION '/ip/datacompressed/'; SELECT * FROM tx_log limit 1; The command works fine with LOCATION '/ip/data/' but doesnt work with LOCATION '/ip/datacompressed/' Any pointers ? I thought ( like Pig ) hive automatically detects .bz2 extensions and applies appropriate decompression. Am I wrong ? -Prasen On Thu, Feb 18, 2010 at 3:04 AM, Zheng Shao zsh...@gmail.com wrote: I just corrected the wiki page. It will also be a good idea to support case-insensitive boolean values in the code. Zheng On Wed, Feb 17, 2010 at 9:27 AM, Brent Miller brentalanmil...@gmail.com wrote: Thanks Adam, that works for me as well. It seems that the property for hive.exec.compress.output is case sensitive, and when it is set to TRUE (as it is on the compressed storage page on the wiki) it is ignored by hive. -Brent On Tue, Feb 16, 2010 at 4:24 PM, Adam O'Donnell a...@immunet.com wrote: Adding these to my hive-site.xml file worked fine: property namehive.exec.compress.output/name valuetrue/value descriptionCompress output/description /property property namemapred.output.compression.type/name valueBLOCK/value descriptionBlock compression/description /property On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller brentalanmil...@gmail.com wrote: Hello, I've seen issues similar to this one come up once or twice before, but I haven't ever seen a solution to the problem that I'm having. I was following the Compressed Storage page on the Hive Wiki http://wiki.apache.org/hadoop/CompressedStorage and realized that the sequence files that are created in the warehouse directory are actually uncompressed and larger than than the originals. For example, I have a table 'test1' who's input data looks something like: 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,BD43 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341 ... And after creating a second table 'test1_comp' that was crated with the STORED AS SEQUENCEFILE directive and the compression options SET as described in the wiki, I can look at the resultant sequence files and see that they're just plain (uncompressed) text: SEQ org.apache.hadoop.io.BytesWritable org.apache.hadoop.io.Text+�c�!Y�M �� Z^��= 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,BD43= 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43= 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341= 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141= ... I've tried messing around with different org.apache.hadoop.io.compress.* options, but the sequence files always come out uncompressed. 
Has anybody ever seen this or know away to keep the data compressed? Since the input text is so uniform, we get huge space savings from compression and would like to store the data this way if possible. I'm using Hadoop 20.1 and Hive that I checked out from SVN about a week ago. Thanks, Brent -- Adam J. O'Donnell, Ph.D. Immunet Corporation Cell: +1 (267) 251-0070 -- Yours, Zheng -- Yours, Zheng
Re: Help with Compressed Storage
Just remember that we need to have the BZipCodec class in the following hadoop configuration: Can you check? io.compression.codecs Zheng On Wed, Feb 17, 2010 at 11:21 PM, prasenjit mukherjee prasen@gmail.com wrote: So this is the command I ran, first with with small.gz (which worked fine) and then with small.bz2 ( which didnt work ) : drop table small_table; CREATE TABLE small_table(id1 string, id2 string, id3 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; LOAD DATA LOCAL INPATH '/root/data/small.gz' OVERWRITE INTO TABLE small_table; select * from small_table limit 1; For gz files I do see the following lines in hive_debug : 10/02/18 01:59:23 DEBUG ipc.RPC: Call: getBlockLocations 1 10/02/18 01:59:23 DEBUG util.NativeCodeLoader: Trying to load the custom-built native-hadoop library... 10/02/18 01:59:23 DEBUG util.NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path 10/02/18 01:59:23 DEBUG util.NativeCodeLoader: java.library.path=/usr/java/jdk1.6.0_14/jre/lib/amd64/server:/usr/java/jdk1.6.0_14/jre/lib/amd64:/usr/java/jdk1.6.0_14/jre/../lib/amd64:/usr/java/packages/lib/amd64:/lib:/usr/lib 10/02/18 01:59:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 10/02/18 01:59:23 DEBUG fs.FSInputChecker: DFSClient readChunk got seqno 0 offsetInBlock 0 lastPacketInBlock true packetLen 88 aid1 bid2 cid3 But for bzip files there is none : 10/02/18 01:57:18 DEBUG ipc.RPC: Call: getBlockLocations 2 10/02/18 01:57:18 DEBUG fs.FSInputChecker: DFSClient readChunk got seqno 0 offsetInBlock 0 lastPacketInBlock true packetLen 85 10/02/18 01:57:18 WARN lazy.LazyStruct: Missing fields! Expected 3 fields but only got 1! Ignoring similar problems. BZh91AYSYǧ �Y @ TP?* ���SFL� cѶѶ�$� � �w��U�)„�=8O� NULL NULL Let me know if you still need the debug files. Attached are the small.gz and small.bzip2 files. Thanks and appreciate, -Prasen On Thu, Feb 18, 2010 at 11:52 AM, Zheng Shao zsh...@gmail.com wrote: There is no special setting for bz2. Can you get the debug log? Zheng On Wed, Feb 17, 2010 at 9:02 PM, prasenjit mukherjee pmukher...@quattrowireless.com wrote: So I tried the same with .gz files and it worked. I am using the following hadoop version :Hadoop 0.20.1+169.56 with cloudera's ami-2359bf4a. I thought that hadoop0.20 does support bz2 compression, hence same should work with hive as well. Interesting note is that Pig works fine on the same bz2 data. Is there any tweaking/config setup I need to do for hive to take bz2 files as input ? On Thu, Feb 18, 2010 at 8:31 AM, prasenjit mukherjee pmukher...@quattrowireless.com wrote: I have a similar issue with bz2 files. I have the hadoop directories : /ip/data/ : containing unzipped text files ( foo1.txt, foo2.txt ) /ip/datacompressed/ : containing same files bzipped ( foo1.bz2, foo2.bz2 ) CREATE EXTERNAL TABLE tx_log(id1 string, id2 string, id3 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002' LOCATION '/ip/datacompressed/'; SELECT * FROM tx_log limit 1; The command works fine with LOCATION '/ip/data/' but doesnt work with LOCATION '/ip/datacompressed/' Any pointers ? I thought ( like Pig ) hive automatically detects .bz2 extensions and applies appropriate decompression. Am I wrong ? -Prasen On Thu, Feb 18, 2010 at 3:04 AM, Zheng Shao zsh...@gmail.com wrote: I just corrected the wiki page. It will also be a good idea to support case-insensitive boolean values in the code. 
Zheng On Wed, Feb 17, 2010 at 9:27 AM, Brent Miller brentalanmil...@gmail.com wrote: Thanks Adam, that works for me as well. It seems that the property for hive.exec.compress.output is case sensitive, and when it is set to TRUE (as it is on the compressed storage page on the wiki) it is ignored by hive. -Brent On Tue, Feb 16, 2010 at 4:24 PM, Adam O'Donnell a...@immunet.com wrote: Adding these to my hive-site.xml file worked fine: property namehive.exec.compress.output/name valuetrue/value descriptionCompress output/description /property property namemapred.output.compression.type/name valueBLOCK/value descriptionBlock compression/description /property On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller brentalanmil...@gmail.com wrote: Hello, I've seen issues similar to this one come up once or twice before, but I haven't ever seen a solution to the problem that I'm having. I was following the Compressed Storage page on the Hive Wiki http://wiki.apache.org/hadoop/CompressedStorage and realized that the sequence files that are created in the warehouse
Re: Help with Compressed Storage
Try this one to see if it works: hive -hiveconf io.compression.codecs=xxx,yyy,zzz Zheng On Wed, Feb 17, 2010 at 11:33 PM, prasenjit mukherjee prasen@gmail.com wrote: Thanks for the pointer, that was indeed the problem. The specific AMI I was using didn't include bzip2 codecs in their hadoop-site.xml. Is there a way I can pass those parameters from hive, so that I don't need to manually change the file? -Thanks, Prasen On Thu, Feb 18, 2010 at 12:54 PM, Zheng Shao zsh...@gmail.com wrote: Just remember that we need to have the BZipCodec class in the following hadoop configuration: Can you check? io.compression.codecs Zheng -- Yours, Zheng
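The reason this setting matters: Hadoop resolves codecs from file extensions through CompressionCodecFactory, which only knows about the classes listed in io.compression.codecs, so a missing BZip2Codec entry makes .bz2 files come back as raw bytes. A small stand-alone check (the file path below is illustrative) would be:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // The factory is built from io.compression.codecs in the loaded configuration.
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(new Path("/ip/datacompressed/foo1.bz2"));
    System.out.println(codec == null
        ? "No codec registered for .bz2 -- check io.compression.codecs"
        : "Resolved codec: " + codec.getClass().getName());
  }
}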
[VOTE] release hive 0.5.0
Hive branch 0.5 was created 5 weeks ago: https://svn.apache.org/viewvc/hadoop/hive/branches/branch-0.5/ It has also been running as the production version of Hive at Facebook for 2 weeks. We'd like to start making release candidates (for 0.5.0) from branch 0.5. Please vote. -- Yours, Zheng
Re: Hive Server Leaking File Descriptors?
Can you go to that box, sudo as root, and do lsof | grep 12345 where 12345 is the process id of the hive server? We should be able to see the names of the files that are open. Zheng On Mon, Feb 15, 2010 at 7:42 AM, Andy Kent andy.k...@forward.co.uk wrote: Nope, no luck so far. We have upped the number of file descriptors and are having to restart hive every week or so :( Any other suggestions would be greatly appreciated. On 15 Feb 2010, at 14:09, Bennie Schut wrote: Did this help? I'm running into a similar problem. slowly leaking connections to 50010 and after a hive restart all is ok again. Andy Kent wrote: I can give try and give it a go. I'm not convinced though as we are working with CSV files and don't touch sequence files at all at the moment. We are using the Clodera Ubuntu Packages for Hadoop 0.20.1+133 and Hive 0.40 On 25 Jan 2010, at 15:30, Jay Booth wrote: Actually, we had an issue with this, it was a bug in SequenceFile where if there were problems opening a file, it would leave a filehandle open and never close it. Here's the patch -- It's already fixed in 0.21/trunk, if I get some time this week I'll submit it against 0.20.2 -- could you apply this to hadoop and let me know if it fixes things for you? On Mon, Jan 25, 2010 at 10:11 AM, Jay Booth jaybo...@gmail.commailto:jaybo...@gmail.com wrote: Yeah, I'd guess that this is a Hive issue, although it could be a combination.. maybe if you're doing queries and then closing your thrift connection before reading all results, Hive doesn't know what to do and leaves the connection open? Once the west coast folks wake up, they might have a better answer for you than I do. On Mon, Jan 25, 2010 at 9:06 AM, Andy Kent andy.k...@forward.co.ukmailto:andy.k...@forward.co.uk wrote: On 25 Jan 2010, at 13:59, Jay Booth wrote: That's the datanode port.. if I had to guess, Hive's connecting to DFS directly for some reason (maybe for select * queries?) and not finishing their reads or closing the connections after. Thanks for the response. That's what I was suspecting. I have triple checked and our Ruby code and it is defiantly closing it's thrift connections properly. I'll try running some different queries and see if I can suss out some examples of which ones are leaky. Is this something that I should post to Jira or is it a known issue? I can't believe other people haven't noticed this? SequenceFile.patch -- Yours, Zheng
Re: Got sun.misc.InvalidJarIndexException: Invalid index
MySQL is recommended for multiple-node deployment of Hive. Can you try MySQL? Zheng On Mon, Feb 8, 2010 at 6:32 PM, Mafish Liu maf...@gmail.com wrote: Hi, all: I'm deploying hive from node A to node B. Hive on node A works properly while on node B, when I try to create a new table, I got the following exception: 2010-02-08 10:15:38,339 ERROR exec.DDLTask (SessionState.java:printError(279)) - FAILED: Error in metadata: javax.jdo.JDOUserException: Exception during population of metadata for org.apache.hadoop.hive.metastore.model.MDatabase NestedThrowables: sun.misc.InvalidJarIndexException: Invalid index org.apache.hadoop.hive.ql.metadata.HiveException: javax.jdo.JDOUserException: Exception during population of metadata for org.apache.hadoop.hive.metastore.model.MDatabase NestedThrowables: sun.misc.InvalidJarIndexException: Invalid index at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:258) at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:879) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:103) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:379) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:285) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:123) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:181) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:287) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:165) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) Caused by: javax.jdo.JDOUserException: Exception during population of metadata for org.apache.hadoop.hive.metastore.model.MDatabase NestedThrowables: sun.misc.InvalidJarIndexException: Invalid index at org.datanucleus.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:350) at org.datanucleus.ObjectManagerImpl.getExtent(ObjectManagerImpl.java:3741) at org.datanucleus.store.rdbms.query.JDOQLQueryCompiler.compileCandidates(JDOQLQueryCompiler.java:411) at org.datanucleus.store.rdbms.query.QueryCompiler.executionCompile(QueryCompiler.java:312) at org.datanucleus.store.rdbms.query.JDOQLQueryCompiler.compile(JDOQLQueryCompiler.java:225) at org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:174) at org.datanucleus.store.query.Query.executeQuery(Query.java:1443) at org.datanucleus.store.rdbms.query.JDOQLQuery.executeQuery(JDOQLQuery.java:244) at org.datanucleus.store.query.Query.executeWithArray(Query.java:1357) at org.datanucleus.jdo.JDOQuery.execute(JDOQuery.java:242) at org.apache.hadoop.hive.metastore.ObjectStore.getMDatabase(ObjectStore.java:283) at org.apache.hadoop.hive.metastore.ObjectStore.getDatabase(ObjectStore.java:301) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:146) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:118) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:100) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.init(HiveMetaStoreClient.java:74) at 
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:783) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:794) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:252) ... 16 more Caused by: sun.misc.InvalidJarIndexException: Invalid index at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:854) at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:762) at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:732) at sun.misc.URLClassPath$1.next(URLClassPath.java:195) at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:205) at java.net.URLClassLoader$3$1.run(URLClassLoader.java:393) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader$3.next(URLClassLoader.java:390) at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:415) at sun.misc.CompoundEnumeration.next(CompoundEnumeration.java:27) at sun.misc.CompoundEnumeration.hasMoreElements(CompoundEnumeration.java:36) at
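If you do switch to MySQL, a quick way to sanity-check the connection settings before putting them into the metastore configuration (javax.jdo.option.ConnectionURL, ConnectionUserName and ConnectionPassword) is a tiny JDBC program like the sketch below. The host, database name and credentials are examples only, and it assumes the MySQL JDBC driver jar is on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;

public class MetastoreConnCheck {
  public static void main(String[] args) throws Exception {
    // Example values; substitute whatever you plan to configure for the metastore.
    Class.forName("com.mysql.jdbc.Driver");
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://metastore-host:3306/hive_metastore", "hive", "secret");
    System.out.println("Connected: " + !conn.isClosed());
    conn.close();
  }
}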
Re: SerDe issue
Hi Roberto, The reason that Text is passed in is that the table is defined as TextFile format (the default). There are some examples (*.q files) of using SequenceFile format (CREATE TABLE xxx STORED AS SEQUENCEFILE). SEQUENCEFILE will return BytesWritable by default. Please have a try. Zheng On Fri, Feb 12, 2010 at 1:05 PM, Roberto Congiu roberto.con...@openx.org wrote: Hey guys, I wrote a SerDe to support lwes (http://lwes.org) using BinarySortableSerDe as a model. The code is very similar, and I serialize an lwes event to a BytesWritable, and deserialize from it. Serialization is fine... however, when I run an insert into... select, the Deserialize method is passed a Text object instead of a BytesWritable object as expected. Hive generates 2 jobs, and it fails on the mapper in the second. getSerializedClass() is set correctly: public Class<? extends Writable> getSerializedClass() { LOG.debug("JournalSerDe::getSerializedClass()"); return BytesWritable.class; } And I don't see any relevant difference between BinarySortableSerDe and my code. Does anybody have a hint on what may be happening? Thanks, Roberto -- Yours, Zheng
Re: Hive Installation Error
What commands did you run? With which release? Zheng On Wed, Feb 10, 2010 at 11:20 PM, Vidyasagar Venkata Nallapati vidyasagar.nallap...@onmobile.com wrote: Hi, Installation is giving an error as master/hadoop/hadoop-0.20.1/build.xml:895: 'java5.home' is not defined. Forrest requires Java 5. Please pass -Djava5.home=base of Java 5 distribution to Ant on the command-line. at org.apache.tools.ant.taskdefs.Exit.execute(Exit.java:142) at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:288) at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106) at org.apache.tools.ant.Task.perform(Task.java:348) at org.apache.tools.ant.Target.execute(Target.java:357) at org.apache.tools.ant.Target.performTasks(Target.java:385) at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1337) at org.apache.tools.ant.Project.executeTarget(Project.java:1306) at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41) at org.apache.tools.ant.Project.executeTargets(Project.java:1189) at org.apache.tools.ant.Main.runBuild(Main.java:758) at org.apache.tools.ant.Main.startAnt(Main.java:217) at org.apache.tools.ant.launch.Launcher.run(Launcher.java:257) at org.apache.tools.ant.launch.Launcher.main(Launcher.java:104) Total time: 7 minutes 11 seconds Please guide on this case. Regards Vidyasagar N V DISCLAIMER: The information in this message is confidential and may be legally privileged. It is intended solely for the addressee. Access to this message by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, or distribution of the message, or any action or omission taken by you in reliance on it, is prohibited and may be unlawful. Please immediately contact the sender if you have received this message in error. Further, this e-mail may contain viruses and all reasonable precaution to minimize the risk arising there from is taken by OnMobile. OnMobile is not liable for any damage sustained by you as a result of any virus in this e-mail. All applicable virus checks should be carried out by you before opening this e-mail or any attachment thereto. Thank you - OnMobile Global Limited. -- Yours, Zheng
Re: Distributing additional files for reduce scripts
add file myfile.txt; You can find some examples in *.q files in the distribution. Zheng On Thu, Feb 11, 2010 at 10:23 PM, Adam O'Donnell a...@immunet.com wrote: Guys: How do you go about distributing additional files that may be needed by your reduce scripts? For example, I need to distribute a GeoIP database with my reduce script to do some lookups. Thanks! Adam -- Adam J. O'Donnell, Ph.D. Immunet Corporation Cell: +1 (267) 251-0070 -- Yours, Zheng
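Files registered with add file are shipped to each task's working directory, so the reduce script can open them by name. A minimal sketch of wiring a GeoIP lookup into a reduce script (the table, script, and output columns are made up):

add file GeoIP.dat;
add file geo_reducer.py;

FROM (SELECT ip, url FROM weblogs CLUSTER BY ip) t
INSERT OVERWRITE TABLE geo_hits
SELECT TRANSFORM (t.ip, t.url)
USING 'python geo_reducer.py'
AS (country, hits);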
Re: hive map reduce output
Another possible reason is that we have seen the hadoop framework sometimes fail to return the correct count to the clients. In all these cases, the count is smaller than the number of rows actually loaded. Which version of hadoop are you using? Zheng On Mon, Feb 8, 2010 at 11:27 PM, Jeff Hammerbacher ham...@cloudera.com wrote: Hey wd, Actually, what version are you running? Your bug sounds an awful lot like http://issues.apache.org/jira/browse/HIVE-327, which was fixed many moons ago. Thanks, Jeff On Mon, Feb 8, 2010 at 11:25 PM, Carl Steinbach c...@cloudera.com wrote: Hi wd, Please file a JIRA ticket for this issue. Thanks. Carl On Mon, Feb 8, 2010 at 7:05 PM, wd w...@wdicc.com wrote: hi, I've used hive map reduce to process some log files. I found that hive outputs a message like "num1 rows loaded to table_name" on every run, but num1 does not equal the result of executing select count(1) from table_name. I think this should be a bug. If we cannot count the right number, why do we output that message? -- Yours, Zheng
Re: Lzo problem throwing java.io.IOException:java.io.EOFException
Looks like an lzo codec problem. Can you try a simple mapreduce program that outputs lzo-compressed data in the same output file format as your hive table? On 2/9/10, Bennie Schut bsc...@ebuddy.com wrote: I have a bit of an edge case on using lzo which I think might be related to HIVE-524. When running a query like this: select distinct login_cldr_id as cldr_id from chatsessions_load; I get a java.io.IOException:java.io.EOFException without much of a description. I know the output should be a single value and noticed it decided to use 2 reducers. One of the reducers produced a 0 byte file which I imagine will be the cause of the IOException. If I do set mapred.reduce.tasks=1 it works correctly since there is no 0 byte file anymore. I also noticed when using gzip I don't see this problem at all. Since I use -- Sent from my mobile device Yours, Zheng
Re: Using UDFs stored on HDFS
Yes that's correct. I prefer to download the jars in add jar. Zheng On Mon, Feb 8, 2010 at 3:46 PM, Philip Zeyliger phi...@cloudera.com wrote: Hi folks, I have a quick question about UDF support in Hive. I'm on the 0.5 branch. Can you use a UDF where the jar which contains the function is on HDFS, and not on the local filesystem. Specifically, the following does not seem to work: # This is Hive 0.5, from svn $bin/hive Hive history file=/tmp/philip/hive_job_log_philip_201002081541_370227273.txt hive add jar hdfs://localhost/FooTest.jar; Added hdfs://localhost/FooTest.jar to class path hive create temporary function cube as 'com.cloudera.FooTestUDF'; FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask Does this work for other people? I could probably fix it by changing add jar to download remote jars locally, when necessary (to load them into the classpath), or update URLClassLoader (or whatever is underneath there) to read directly from HDFS, which seems a bit more fragile. But I wanted to make sure that my interpretation of what's going on is right before I have at it. Thanks, -- Philip -- Yours, Zheng
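In the meantime, copying the jar to the local filesystem and adding it from there should work; a rough sketch (the local path and the test table are made up, the class name is the one from the thread):

add jar /tmp/FooTest.jar;
create temporary function cube as 'com.cloudera.FooTestUDF';
select cube(col1) from foo_test;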
Re: LZO Compression on trunk
That seems to be a bug. Are you using hive trunk or any release? On 2/5/10, Bennie Schut bsc...@ebuddy.com wrote: I have a tab-separated file which I have loaded with load data inpath. Then I do a SET hive.exec.compress.output=true; SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec; SET mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec; select distinct login_cldr_id as cldr_id from chatsessions_load; Ended Job = job_201001151039_1641 OK NULL NULL NULL Time taken: 49.06 seconds however if I start it without the set commands I get this: Ended Job = job_201001151039_1642 OK 2283 Time taken: 45.308 seconds Which is the correct result. When I do an insert overwrite on an rcfile table it will actually compress the data correctly. When I disable compression and query this new table the result is correct. When I enable compression it's wrong again. I see no errors in the logs. Any ideas why this might happen? -- Sent from my mobile device Yours, Zheng
Re: Hive Installation Problem
HI guys, Can you have a try to make the following directory the same as mine? Once this is done, remove the build directory, and run ant package. Does this solve the problem? [zs...@dev ~/.ant] ls -lR .: total 3896 drwxr-xr-x 2 zshao users4096 Feb 5 13:04 apache-ivy-2.0.0-rc2 -rw-r--r-- 1 zshao users 3965953 Nov 4 2008 apache-ivy-2.0.0-rc2-bin.zip -rw-r--r-- 1 zshao users 0 Feb 5 13:04 apache-ivy-2.0.0-rc2.installed drwxr-xr-x 3 zshao users4096 Feb 5 13:07 cache drwxr-xr-x 2 zshao users4096 Feb 5 13:04 lib ./apache-ivy-2.0.0-rc2: total 880 -rw-r--r-- 1 zshao users 893199 Oct 28 2008 ivy-2.0.0-rc2.jar ./cache: total 4 drwxr-xr-x 3 zshao users 4096 Feb 4 19:30 hadoop ./cache/hadoop: total 4 drwxr-xr-x 3 zshao users 4096 Feb 5 13:08 core ./cache/hadoop/core: total 4 drwxr-xr-x 2 zshao users 4096 Feb 4 19:30 sources ./cache/hadoop/core/sources: total 127436 -rw-r--r-- 1 zshao users 14427013 Aug 20 2008 hadoop-0.17.2.1.tar.gz -rw-r--r-- 1 zshao users 30705253 Jan 22 2009 hadoop-0.18.3.tar.gz -rw-r--r-- 1 zshao users 42266180 Nov 13 2008 hadoop-0.19.0.tar.gz -rw-r--r-- 1 zshao users 42813980 Apr 8 2009 hadoop-0.20.0.tar.gz ./lib: total 880 -rw-r--r-- 1 zshao users 893199 Feb 5 13:04 ivy-2.0.0-rc2.jar Zheng On Fri, Feb 5, 2010 at 5:49 AM, Vidyasagar Venkata Nallapati vidyasagar.nallap...@onmobile.com wrote: Hi , We are still getting the problem [ivy:retrieve] no resolved descriptor found: launching default resolve Overriding previous definition of property ivy.version [ivy:retrieve] using ivy parser to parse file:/master/hadoop/hive/shims/ivy.xml [ivy:retrieve] :: resolving dependencies :: org.apache.hadoop.hive#shims;work...@ph1 [ivy:retrieve] confs: [default] [ivy:retrieve] validate = true [ivy:retrieve] refresh = false [ivy:retrieve] resolving dependencies for configuration 'default' [ivy:retrieve] == resolving dependencies for org.apache.hadoop.hive#shims;work...@ph1 [default] [ivy:retrieve] == resolving dependencies org.apache.hadoop.hive#shims;work...@ph1-hadoop#core;0.20.1 [default-*] [ivy:retrieve] default: Checking cache for: dependency: hadoop#core;0.20.1 {*=[*]} [ivy:retrieve] hadoop-source: no ivy file nor artifact found for hadoop#core;0.20.1 [ivy:retrieve] tried https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.1/core-0.20.1.pom And the .pom for this is not getting copied, please suggest something on this. Regards Vidyasagar N V From: baburaj.S [mailto:babura...@onmobile.com] Sent: Friday, February 05, 2010 4:59 PM To: hive-user@hadoop.apache.org Subject: RE: Hive Installation Problem No I don’t have the variable defined. Any other things that I have to check. Is this happening because I am trying for Hadoop 0.20.1 Babu From: Carl Steinbach [mailto:c...@cloudera.com] Sent: Friday, February 05, 2010 3:07 PM To: hive-user@hadoop.apache.org Subject: Re: Hive Installation Problem Hi Babu, ~/.ant/cache is the default Ivy cache directory for Hive, but if the environment variable IVY_HOME is set it will use $IVY_HOME/cache instead. Is it possible that you have this environment variable set to a value different than ~/.ant? On Fri, Feb 5, 2010 at 12:09 AM, baburaj.S babura...@onmobile.com wrote: I have tried the same but still the installation is giving the same error. I don't know if it is looking in the cache . Can we make any change in ivysettings.xml that it has to resolve the file from the file system rather through an url. 
Babu -Original Message- From: Zheng Shao [mailto:zsh...@gmail.com] Sent: Friday, February 05, 2010 12:47 PM To: hive-user@hadoop.apache.org Subject: Re: Hive Installation Problem Added to http://wiki.apache.org/hadoop/Hive/FAQ Zheng On Thu, Feb 4, 2010 at 11:11 PM, Zheng Shao zsh...@gmail.com wrote: Try this: cd ~/.ant/cache/hadoop/core/sources wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz Zheng On Thu, Feb 4, 2010 at 10:23 PM, baburaj.S babura...@onmobile.com wrote: Hello , I am new to Hadoop and is trying to install Hive now. We have the following setup at our side OS - Ubuntu 9.10 Hadoop - 0.20.1 Hive installation tried - 0.4.0 . The Hadoop is installed and is working fine . Now when we were installing Hive I got error that it couldn't resolve the dependencies. I changed the shims build and properties xml to make the dependencies look for Hadoop 0.20.1 . But now when I call the ant script I get the following error ivy-retrieve-hadoop-source: [ivy:retrieve] :: Ivy 2.0.0-rc2 - 20081028224207 :: http://ant.apache.org/ivy/ : :: loading settings :: file = /master/hive/ivy/ivysettings.xml [ivy:retrieve] :: resolving dependencies :: org.apache.hadoop.hive#shims;working [ivy:retrieve] confs: [default] [ivy:retrieve] :: resolution report :: resolve 953885ms :: artifacts dl 0ms
Re: Concurrently load data into Hive tables?
We can load data/insert overwrite data concurrently as long as they are different partitions. On Thu, Feb 4, 2010 at 6:51 AM, Ryan LeCompte lecom...@gmail.com wrote: Hey guys, Is it possible to concurrently load data into Hive tables (same table, different partition)? I'd like to concurrently execute the LOAD DATA command by two separate processes. Is Hive thread-safe in this regard? Or is it best to run the LOAD DATA commands serially? How about running two Hive queries concurrently that both output their results into different partitions of another Hive table? Thanks! Ryan -- Yours, Zheng
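In other words, two sessions loading into distinct partitions of the same table should be safe to run concurrently; a sketch with a made-up partitioned table:

-- session 1
LOAD DATA INPATH '/staging/2010-02-03' INTO TABLE page_views PARTITION (dt='2010-02-03');
-- session 2, running at the same time
LOAD DATA INPATH '/staging/2010-02-04' INTO TABLE page_views PARTITION (dt='2010-02-04');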
Re: Question about Hive supporting new Hadoop MapReduce API
We haven't had a plan yet. It will be great to draw out the pros/cons of moving to the new MapReduce API. Do you want to open a JIRA to discuss it? Zheng On Thu, Feb 4, 2010 at 5:46 PM, Schubert Zhang zson...@gmail.com wrote: Does anyone know the plan of Hive to support new Hadoop MapReduce API? In current hive 0.4 and trunk, still using deprecated Hadoop API, we want to know the plan. -- Yours, Zheng
Re: computing median and percentiles
I would say, just create a histogram of (value, count) pairs, sort at the end, and return the value at the percentile. This assumes that the number of unique values is not big, which can easily be enforced by using round(number, digits). Zheng On Thu, Feb 4, 2010 at 9:08 PM, Bryan Talbot btal...@aeriagames.com wrote: What's the best way to compute median and other percentiles using Hive 0.40? I've run across http://issues.apache.org/jira/browse/HIVE-259 but there doesn't seem to be any planned implementation yet. -Bryan -- Yours, Zheng
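A rough sketch of the histogram step (the table and column names are made up); the result is small enough that picking the value at the desired percentile can be done by walking the cumulative counts afterwards:

SELECT round(value, 2) AS bucket, count(1) AS cnt
FROM measurements
GROUP BY round(value, 2)
ORDER BY bucket;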
Re: Hive Installation Problem
Added to http://wiki.apache.org/hadoop/Hive/FAQ Zheng On Thu, Feb 4, 2010 at 11:11 PM, Zheng Shao zsh...@gmail.com wrote: Try this: cd ~/.ant/cache/hadoop/core/sources wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz Zheng On Thu, Feb 4, 2010 at 10:23 PM, baburaj.S babura...@onmobile.com wrote: Hello , I am new to Hadoop and is trying to install Hive now. We have the following setup at our side OS - Ubuntu 9.10 Hadoop - 0.20.1 Hive installation tried - 0.4.0 . The Hadoop is installed and is working fine . Now when we were installing Hive I got error that it couldn't resolve the dependencies. I changed the shims build and properties xml to make the dependencies look for Hadoop 0.20.1 . But now when I call the ant script I get the following error ivy-retrieve-hadoop-source: [ivy:retrieve] :: Ivy 2.0.0-rc2 - 20081028224207 :: http://ant.apache.org/ivy/ : :: loading settings :: file = /master/hive/ivy/ivysettings.xml [ivy:retrieve] :: resolving dependencies :: org.apache.hadoop.hive#shims;working [ivy:retrieve] confs: [default] [ivy:retrieve] :: resolution report :: resolve 953885ms :: artifacts dl 0ms - | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| - | default | 1 | 0 | 0 | 0 || 0 | 0 | - [ivy:retrieve] [ivy:retrieve] :: problems summary :: [ivy:retrieve] WARNINGS [ivy:retrieve] module not found: hadoop#core;0.20.1 [ivy:retrieve] hadoop-source: tried [ivy:retrieve] -- artifact hadoop#core;0.20.1!hadoop.tar.gz(source): [ivy:retrieve] http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz [ivy:retrieve] apache-snapshot: tried [ivy:retrieve] https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.1/core-0.20.1.pom [ivy:retrieve] -- artifact hadoop#core;0.20.1!hadoop.tar.gz(source): [ivy:retrieve] https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.1/hadoop-0.20.1.tar.gz [ivy:retrieve] maven2: tried [ivy:retrieve] http://repo1.maven.org/maven2/hadoop/core/0.20.1/core-0.20.1.pom [ivy:retrieve] -- artifact hadoop#core;0.20.1!hadoop.tar.gz(source): [ivy:retrieve] http://repo1.maven.org/maven2/hadoop/core/0.20.1/core-0.20.1.tar.gz [ivy:retrieve] :: [ivy:retrieve] :: UNRESOLVED DEPENDENCIES :: [ivy:retrieve] :: [ivy:retrieve] :: hadoop#core;0.20.1: not found [ivy:retrieve] :: [ivy:retrieve] ERRORS [ivy:retrieve] Server access Error: Connection timed out url=http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz [ivy:retrieve] Server access Error: Connection timed out url=https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.1/core-0.20.1.pom [ivy:retrieve] Server access Error: Connection timed out url=https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.1/hadoop-0.20.1.tar.gz [ivy:retrieve] Server access Error: Connection timed out url=http://repo1.maven.org/maven2/hadoop/core/0.20.1/core-0.20.1.pom [ivy:retrieve] Server access Error: Connection timed out url=http://repo1.maven.org/maven2/hadoop/core/0.20.1/core-0.20.1.tar.gz [ivy:retrieve] [ivy:retrieve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS BUILD FAILED /master/hive/build.xml:148: The following error occurred while executing this line: /master/hive/build.xml:93: The following error occurred while executing this line: /master/hive/shims/build.xml:64: The following error occurred while executing this line: /master/hive/build-common.xml:172: impossible to resolve dependencies: resolve failed - see output for details Total 
time: 15 minutes 55 seconds I have even tried to download hadoop-0.20.1.tar.gz and put it in the ant cache of the user. Still the same error is repeated. I am stuck and not able to install it. Any help on the above will be greatly appreciated. Babu
Re: Resolvers for UDAFs
Can you post the Hive query? What are the types of the parameters that you passed to the function? Zheng On Wed, Feb 3, 2010 at 3:23 AM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi, I am writing a UDAF which takes in 4 parameters. I have 2 cases - one where all the paramters are ints, and second where the last parameter is double. I wrote two evaluators for this, with iterate as public boolean iterate(int max, int groupBy, int attribute, int count) and public boolean iterate(int max, int groupBy, int attribute, double count) However, when I run a query, I get the exception: org.apache.hadoop.hive.ql.exec.AmbiguousMethodException: Ambiguous method for class org.apache.hadoop.hive.udaf.TopXPerGroup with [int, int, int, int] at org.apache.hadoop.hive.ql.exec.DefaultUDAFEvaluatorResolver.getEvaluatorClass(DefaultUDAFEvaluatorResolver.java:83) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge.getEvaluator(GenericUDAFBridge.java:57) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getGenericUDAFEvaluator(FunctionRegistry.java:594) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getGenericUDAFEvaluator(SemanticAnalyzer.java:1882) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapGroupByOperator(SemanticAnalyzer.java:2270) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapAggr1MR(SemanticAnalyzer.java:2821) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:4543) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5058) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:4999) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5020) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:4999) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5020) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:5587) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:114) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:317) at org.apache.hadoop.hive.ql.Driver.runCommand(Driver.java:370) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:362) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:140) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:200) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:311) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) One option for me is to write a resolver which I will do. But, I just wanted to know if this is a bug in hive whereby it is not able to get the write evaluator. Or if this is a gap in my understanding. I look forward to hearing your views on this. Thanks and Regards, Sonal -- Yours, Zheng
Re: Converting multiple joins into a single multi-way join
See ql/src/test/queries/clientpositive/uniquejoin.q FROM UNIQUEJOIN PRESERVE T1 a (a.key), PRESERVE T2 b (b.key), PRESERVE T3 c (c.key) SELECT a.key, b.key, c.key; FROM UNIQUEJOIN T1 a (a.key), T2 b (b.key), T3 c (c.key) SELECT a.key, b.key, c.key; FROM UNIQUEJOIN T1 a (a.key), T2 b (b.key-1), T3 c (c.key) SELECT a.key, b.key, c.key; FROM UNIQUEJOIN PRESERVE T1 a (a.key, a.val), PRESERVE T2 b (b.key, b.val), PRESERVE T3 c (c.key, c.val) SELECT a.key, a.val, b.key, b.val, c.key, c.val; FROM UNIQUEJOIN PRESERVE T1 a (a.key), T2 b (b.key), PRESERVE T3 c (c.key) SELECT a.key, b.key, c.key; FROM UNIQUEJOIN PRESERVE T1 a (a.key), T2 b(b.key) SELECT a.key, b.key; Zheng On Wed, Feb 3, 2010 at 2:07 AM, bharath v bharathvissapragada1...@gmail.com wrote: Hi , Can anyone give me an example in which there is an optimization of Converting multiple joins into a single multi-way join .. i.e., reducing the number of map-reduce jobs . I read this from hive's design document but I couldn't find any example . Can anyone point me to the same?? Kindly help, Thanks -- Yours, Zheng
Re: Converting multiple joins into a single multi-way join
https://issues.apache.org/jira/browse/HIVE-591 On Wed, Feb 3, 2010 at 1:34 PM, Zheng Shao zsh...@gmail.com wrote: See ql/src/test/queries/clientpositive/uniquejoin.q FROM UNIQUEJOIN PRESERVE T1 a (a.key), PRESERVE T2 b (b.key), PRESERVE T3 c (c.key) SELECT a.key, b.key, c.key; FROM UNIQUEJOIN T1 a (a.key), T2 b (b.key), T3 c (c.key) SELECT a.key, b.key, c.key; FROM UNIQUEJOIN T1 a (a.key), T2 b (b.key-1), T3 c (c.key) SELECT a.key, b.key, c.key; FROM UNIQUEJOIN PRESERVE T1 a (a.key, a.val), PRESERVE T2 b (b.key, b.val), PRESERVE T3 c (c.key, c.val) SELECT a.key, a.val, b.key, b.val, c.key, c.val; FROM UNIQUEJOIN PRESERVE T1 a (a.key), T2 b (b.key), PRESERVE T3 c (c.key) SELECT a.key, b.key, c.key; FROM UNIQUEJOIN PRESERVE T1 a (a.key), T2 b(b.key) SELECT a.key, b.key; Zheng On Wed, Feb 3, 2010 at 2:07 AM, bharath v bharathvissapragada1...@gmail.com wrote: Hi , Can anyone give me an example in which there is an optimization of Converting multiple joins into a single multi-way join .. i.e., reducing the number of map-reduce jobs . I read this from hive's design document but I couldn't find any example . Can anyone point me to the same?? Kindly help, Thanks -- Yours, Zheng -- Yours, Zheng
Re: intermediate data written to the disk?
If the join key is the same, you can use unique join to make sure it's done in a single map-reduce job. Zheng On Wed, Feb 3, 2010 at 1:25 AM, bharath v bharathvissapragada1...@gmail.com wrote: Hi , I have a small doubt in how hive handles queries containing join of more than 2 tables . Suppose we have 3 tables A,B,C .. and the plan is ((AB)C) .. We can join A,B in a map reduce job and join the resultant table with C. I have a doubt whether the result of AB is stored to disk before joining with C or is it streamed directly to join with C (I don't know how , just a guess) . Any help is appreciated , Thanks -- Yours, Zheng
Re: Help writing UDAF with custom object
Which version of Hive are you using? I looked at the code for trunk and cannot find PrimitiveObjectInspectorFactory.java:166 Zheng On Mon, Feb 1, 2010 at 3:41 AM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi Zheng, Thanks for your response. I had initially used ints, but due to the error I got, I changed them to Integers. I have now reverted the code to use ints as suggested by you. My problem: I have a table called products_bought which has a number of products bought by each customer ordered by count bought. I want to get the top x customers of each product. Table products_bought product_id customer_id product_count 1 1 6 1 2 5 1 3 4 2 1 8 2 2 4 2 3 1 I want the say, top 2 results per products. Which will be: product_id customer_id product_count 1 1 6 1 2 5 2 1 8 2 2 4 Solution: I create a jar with the code I sent and do the following steps in cli 1. add jar jarname 2. create temporary function topx as 'class name'; 3. select topx(2, product_id, customer_id, product_count) from products_bought The logs give me the error: 0/02/01 16:56:28 DEBUG ipc.RPC: Call: mkdirs 23 10/02/01 16:56:28 INFO parse.SemanticAnalyzer: Completed getting MetaData in Semantic Analysis 10/02/01 16:56:28 DEBUG parse.SemanticAnalyzer: Created Table Plan for products_bought org.apache.hadoop.hive.ql.exec.tablescanopera...@72d8978c 10/02/01 16:56:28 DEBUG exec.FunctionRegistry: Looking up GenericUDAF: topx FAILED: Unknown exception : Internal error: Cannot recognize int 10/02/01 16:56:28 ERROR ql.Driver: FAILED: Unknown exception : Internal error: Cannot recognize int java.lang.RuntimeException: Internal error: Cannot recognize int at org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory.getPrimitiveObjectInspectorFromClass(PrimitiveObjectInspectorFactory.java:166) at org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$PrimitiveConversionHelper.init(GenericUDFUtils.java:197) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge$GenericUDAFBridgeEvaluator.init(GenericUDAFBridge.java:123) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getGenericUDAFInfo(SemanticAnalyzer.java:1592) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapGroupByOperator(SemanticAnalyzer.java:1912) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapAggr1MR(SemanticAnalyzer.java:2452) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:3733) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:4184) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:4425) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:76) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:249) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:281) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:123) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:181) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:287) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) I am going through the code mentioned by Zheng to see if there is something wrong I am doing. 
At this point of time, I think my main concern is to get the function to output something and to verify that Hive specific hooks are in place. If you have any suggestions, please do let me know. Thanks and Regards, Sonal On Mon, Feb 1, 2010 at 1:19 PM, Zheng Shao zsh...@gmail.com wrote: The first problem is: private Integer key; private Integer attribute; private Integer count; Java Integer objects are non-modifiable, which means we have to create a new object per row (which in turn makes the code really inefficient). You can change it to private int to make it efficient (and also works for Hive). Second, can you post your Hive query? It seems your code does not do what you want. You might want to take a look at http://issues.apache.org/jira/browse/HIVE-894 for the UDAF max_n and see how that works for Hive. Zheng On Sun, Jan 31, 2010 at 9:38 PM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi, I am writing a UDAF which returns the top x results per key. Lets say my input is key attribute count 1 1 6 1 2 5 1 3 4
Re: Resolvers for UDAFs
Hi Sonal, 1. We usually move the group_by column out of the UDAF - just like we do SELECT key, sum(value) FROM table. I think you should write: SELECT customer_id, topx(2, product_id, product_count) FROM products_bought and in topx: public boolean iterate(int max, int attribute, int count). 2. Can you run describe products_bought? Does product_count column have type int? You might want to try removing the other interate function to see whether that solves the problem. Zheng On Wed, Feb 3, 2010 at 9:58 PM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi Zheng, My query is: select a.myTable.key, a.myTable.attribute, a.myTable.count from (select explode (t.pc) as myTable from (select topx(2, product_id, customer_id, product_count) as pc from (select product_id, customer_id, product_count from products_bought order by product_id, product_count desc) r ) t )a; My overloaded iterators are: public boolean iterate(int max, int groupBy, int attribute, int count) public boolean iterate(int max, int groupBy, int attribute, double count) Before overloading, my query was running fine. My table products_bought is: product_id int, customer_id int, product_count int And I get: FAILED: Error in semantic analysis: Ambiguous method for class org.apache.hadoop.hive.udaf.TopXPerGroup with [int, int, int, int] The hive logs say: 2010-02-03 11:18:15,721 ERROR processors.DeleteResourceProcessor (SessionState.java:printError(255)) - Usage: delete [FILE|JAR|ARCHIVE] value [value]* 2010-02-03 11:22:14,663 ERROR ql.Driver (SessionState.java:printError(255)) - FAILED: Error in semantic analysis: Ambiguous method for class org.apache.hadoop.hive.udaf.TopXPerGroup with [int, int, int, int] org.apache.hadoop.hive.ql.exec.AmbiguousMethodException: Ambiguous method for class org.apache.hadoop.hive.udaf.TopXPerGroup with [int, int, int, int] at org.apache.hadoop.hive.ql.exec.DefaultUDAFEvaluatorResolver.getEvaluatorClass(DefaultUDAFEvaluatorResolver.java:83) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge.getEvaluator(GenericUDAFBridge.java:57) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getGenericUDAFEvaluator(FunctionRegistry.java:594) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getGenericUDAFEvaluator(SemanticAnalyzer.java:1882) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapGroupByOperator(SemanticAnalyzer.java:2270) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapAggr1MR(SemanticAnalyzer.java:2821) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:4543) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5058) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:4999) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5020) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:4999) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5020) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:5587) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:114) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:317) at org.apache.hadoop.hive.ql.Driver.runCommand(Driver.java:370) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:362) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:140) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:200) at 
org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:311) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Thanks and Regards, Sonal On Thu, Feb 4, 2010 at 12:12 AM, Zheng Shao zsh...@gmail.com wrote: Can you post the Hive query? What are the types of the parameters that you passed to the function? Zheng On Wed, Feb 3, 2010 at 3:23 AM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi, I am writing a UDAF which takes in 4 parameters. I have 2 cases - one where all the paramters are ints, and second where the last parameter is double. I wrote two evaluators for this, with iterate as public boolean iterate(int max, int groupBy, int attribute, int count) and public boolean iterate(int max, int groupBy, int attribute, double count) However, when I run a query, I get the exception: org.apache.hadoop.hive.ql.exec.AmbiguousMethodException
Re: Resolvers for UDAFs
Yes it should be: SELECT customer_id, topx(2, product_id, product_count) FROM products_bought GROUP BY customer_id; On Wed, Feb 3, 2010 at 11:31 PM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi Zheng, Wouldnt the query you mentioned need a group by clause? I need the top x customers per product id. Sorry, can you please explain. Thanks and Regards, Sonal On Thu, Feb 4, 2010 at 12:07 PM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi Zheng, Thanks for your email and your feedback. I will try to change the code as suggested by you. Here is the output of describe: hive describe products_bought; OK product_id int customer_id int product_count int My function was working fine earlier with this table and iterate(int, int, int, int). Once I introduced the other iterate, it stopped working. Thanks and Regards, Sonal On Thu, Feb 4, 2010 at 11:37 AM, Zheng Shao zsh...@gmail.com wrote: Hi Sonal, 1. We usually move the group_by column out of the UDAF - just like we do SELECT key, sum(value) FROM table. I think you should write: SELECT customer_id, topx(2, product_id, product_count) FROM products_bought and in topx: public boolean iterate(int max, int attribute, int count). 2. Can you run describe products_bought? Does product_count column have type int? You might want to try removing the other interate function to see whether that solves the problem. Zheng On Wed, Feb 3, 2010 at 9:58 PM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi Zheng, My query is: select a.myTable.key, a.myTable.attribute, a.myTable.count from (select explode (t.pc) as myTable from (select topx(2, product_id, customer_id, product_count) as pc from (select product_id, customer_id, product_count from products_bought order by product_id, product_count desc) r ) t )a; My overloaded iterators are: public boolean iterate(int max, int groupBy, int attribute, int count) public boolean iterate(int max, int groupBy, int attribute, double count) Before overloading, my query was running fine. 
My table products_bought is: product_id int, customer_id int, product_count int And I get: FAILED: Error in semantic analysis: Ambiguous method for class org.apache.hadoop.hive.udaf.TopXPerGroup with [int, int, int, int] The hive logs say: 2010-02-03 11:18:15,721 ERROR processors.DeleteResourceProcessor (SessionState.java:printError(255)) - Usage: delete [FILE|JAR|ARCHIVE] value [value]* 2010-02-03 11:22:14,663 ERROR ql.Driver (SessionState.java:printError(255)) - FAILED: Error in semantic analysis: Ambiguous method for class org.apache.hadoop.hive.udaf.TopXPerGroup with [int, int, int, int] org.apache.hadoop.hive.ql.exec.AmbiguousMethodException: Ambiguous method for class org.apache.hadoop.hive.udaf.TopXPerGroup with [int, int, int, int] at org.apache.hadoop.hive.ql.exec.DefaultUDAFEvaluatorResolver.getEvaluatorClass(DefaultUDAFEvaluatorResolver.java:83) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge.getEvaluator(GenericUDAFBridge.java:57) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getGenericUDAFEvaluator(FunctionRegistry.java:594) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getGenericUDAFEvaluator(SemanticAnalyzer.java:1882) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapGroupByOperator(SemanticAnalyzer.java:2270) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapAggr1MR(SemanticAnalyzer.java:2821) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:4543) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5058) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:4999) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5020) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:4999) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5020) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:5587) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:114) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:317) at org.apache.hadoop.hive.ql.Driver.runCommand(Driver.java:370) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:362) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:140) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:200) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:311) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method
Re: SequenceFile compression on Amazon EMR not very good
I would first check whether it is really using block compression or record compression. Also, maybe the block size is too small, but I am not sure whether that is tunable in SequenceFile. Zheng On Sun, Jan 31, 2010 at 9:03 PM, Saurabh Nanda saurabhna...@gmail.com wrote: Hi, The size of my Gzipped weblog files is about 35MB. However, upon enabling block compression, and inserting the logs into another Hive table (sequencefile), the file size bloats up to about 233MB. I've done similar processing on a local Hadoop/Hive cluster, and while the compression is not as good as gzipping, it still is not this bad. What could be going wrong? I looked at the header of the resulting file and here's what it says: SEQ^Forg.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec Does Amazon Elastic MapReduce behave differently or am I doing something wrong? Saurabh. -- http://nandz.blogspot.com http://foodieforlife.blogspot.com -- Yours, Zheng
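For reference, these are the knobs usually involved when writing block-compressed SequenceFiles from Hive; the property names below are from the Hadoop 0.20-era configuration and are worth double-checking against the version EMR runs:

SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;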
Re: UDAF/UDTF question
The easiest way to go is to write a UDAF to return the answer in array<struct<decile:int, value:double>>. Then you can do: (note that explode is a predefined UDTF) SELECT tmp.key, tmp2.d.decile, tmp2.d.value FROM (SELECT key, Decile(value) as deciles GROUP BY key) tmp LATERAL VIEW explode(tmp.deciles) tmp2 AS d Zheng On Thu, Jan 28, 2010 at 2:07 PM, Jason Michael jmich...@videoegg.com wrote: Hello all, What would be the best way to write a function that would perform aggregation computations on records in a table and return multiple rows (and possibly columns)? For example, imagine a function called DECILES that computes all the deciles for a given measure and returns them as 10 rows with 2 columns, decile and value. It seems like what I want is some sort of combination of a UDAF and a UDTF. Does such an animal exist in the Hive world? Jason -- Yours, Zheng
Re: help!
Can you take a look at /tmp/user/hive.log? There should be some exceptions there. Zheng On Wed, Jan 27, 2010 at 7:59 PM, Fu Ecy fuzhijie1...@gmail.com wrote: I want to load some files on HDFS to a hive table, but there is an execption as follow: hive load data inpath '/group/taobao/taobao/dw/stb/20100125/collect_info/*' into table collect_info; Loading data to table collect_info Failed with exception addFiles: error while moving files!!! FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask But, when I download the files from HDFS to local machine, then load them into the table, it works. Data in '/group/taobao/taobao/dw/stb/20100125/collect_info/*' is a little more than 200GB. I need to use the Hive to make some statistics. much thanks :-) -- Yours, Zheng
Re: help!
When Hive loads data from HDFS, it moves the files instead of copying the files. That means the current user should have write permissions to the source files/directories as well. Can you check that? Zheng On Wed, Jan 27, 2010 at 11:18 PM, Fu Ecy fuzhijie1...@gmail.com wrote: property namehive.metastore.warehouse.dir/name value/group/tbdev/kunlun/henshao/hive//value descriptionlocation of default database for the warehouse/description /property property namehive.exec.scratchdir/name value/group/tbdev/kunlun/henshao/hive/temp/value descriptionScratch space for Hive jobs/description /property [kun...@gate2 ~]$ hive --config config/ -u root -p root Hive history file=/tmp/kunlun/hive_job_log_kunlun_201001281514_422659187.txt hive create table pokes (foo int, bar string); OK Time taken: 0.825 seconds Yes, I have the permission for Hive's warehouse directory and tmp directory. 2010/1/28 김영우 warwit...@gmail.com Hi Fu, Your query seems correct but I think, It's a problem related HDFS permission. Did you set right permission for Hive's warehouse directory and tmp directory? Seems user 'kunlun' does not have WRITE permission for hive warehouse directory. Youngwoo 2010/1/28 Fu Ecy fuzhijie1...@gmail.com 2010-01-27 12:58:22,182 ERROR ql.Driver (SessionState.java:printError(303)) - FAILED: Parse Error: line 2:10 cannot recognize input ',' in column type org.apache.hadoop.hive.ql.parse.ParseException: line 2:10 cannot recognize input ',' in column type at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:357) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:249) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:290) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:163) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:221) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:335) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:165) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) 2010-01-27 12:58:40,394 ERROR hive.log (MetaStoreUtils.java:logAndThrowMetaException(570)) - Got exception: org.apache.hadoop .security.AccessControlException org.apache.hadoop.security.AccessControlException: Permission denied: user=kunlun, access=WR ITE, inode=user:hadoop:cug-admin:rwxr-xr-x 2010-01-27 12:58:40,395 ERROR hive.log (MetaStoreUtils.java:logAndThrowMetaException(571)) - org.apache.hadoop.security.Acces sControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=kunlun, access=WRITE, inode=us er:hadoop:cug-admin:rwxr-xr-x at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:96) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:58) at 
org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:831) at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:257) at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1118) at org.apache.hadoop.hive.metastore.Warehouse.mkdirs(Warehouse.java:123) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table(HiveMetaStore.java:505) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:256) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:254) at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:883) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:105) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:388) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:294) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:163) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:221) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:335) at
Re: help!
Please see http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL for how to use External table. You don't need to load into external table because external table can directly point to your data directory. Zheng On Wed, Jan 27, 2010 at 11:38 PM, Fu Ecy fuzhijie1...@gmail.com wrote: hive CREATE EXTERNAL TABLE collect_info ( id string, t1 string, t2 string, t3 string, t4 string, t5 string, collector string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE; OK Time taken: 0.234 seconds hive load data inpath '/group/taobao/taobao/dw/stb/20100125/collect_info/coll_9.collect_info575' overwrite into table collect_info; Loading data to table collect_info Failed with exception replaceFiles: error while moving files!!! FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask It doesn't wok. 2010/1/28 Fu Ecy fuzhijie1...@gmail.com I think this is the problem, I don't have the write permissions to the source files/directories. Thank you, Shao :-) 2010/1/28 Zheng Shao zsh...@gmail.com When Hive loads data from HDFS, it moves the files instead of copying the files. That means the current user should have write permissions to the source files/directories as well. Can you check that? Zheng On Wed, Jan 27, 2010 at 11:18 PM, Fu Ecy fuzhijie1...@gmail.com wrote: property namehive.metastore.warehouse.dir/name value/group/tbdev/kunlun/henshao/hive//value descriptionlocation of default database for the warehouse/description /property property namehive.exec.scratchdir/name value/group/tbdev/kunlun/henshao/hive/temp/value descriptionScratch space for Hive jobs/description /property [kun...@gate2 ~]$ hive --config config/ -u root -p root Hive history file=/tmp/kunlun/hive_job_log_kunlun_201001281514_422659187.txt hive create table pokes (foo int, bar string); OK Time taken: 0.825 seconds Yes, I have the permission for Hive's warehouse directory and tmp directory. 2010/1/28 김영우 warwit...@gmail.com Hi Fu, Your query seems correct but I think, It's a problem related HDFS permission. Did you set right permission for Hive's warehouse directory and tmp directory? Seems user 'kunlun' does not have WRITE permission for hive warehouse directory. 
Youngwoo 2010/1/28 Fu Ecy fuzhijie1...@gmail.com 2010-01-27 12:58:22,182 ERROR ql.Driver (SessionState.java:printError(303)) - FAILED: Parse Error: line 2:10 cannot recognize input ',' in column type org.apache.hadoop.hive.ql.parse.ParseException: line 2:10 cannot recognize input ',' in column type at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:357) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:249) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:290) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:163) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:221) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:335) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:165) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) 2010-01-27 12:58:40,394 ERROR hive.log (MetaStoreUtils.java:logAndThrowMetaException(570)) - Got exception: org.apache.hadoop .security.AccessControlException org.apache.hadoop.security.AccessControlException: Permission denied: user=kunlun, access=WR ITE, inode=user:hadoop:cug-admin:rwxr-xr-x 2010-01-27 12:58:40,395 ERROR hive.log (MetaStoreUtils.java:logAndThrowMetaException(571)) - org.apache.hadoop.security.Acces sControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=kunlun, access=WRITE, inode=us er:hadoop:cug-admin:rwxr-xr-x at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:96
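To make the external-table suggestion earlier in this thread concrete: with an external table the LOCATION clause points Hive at the existing directory, so no load (and hence no move requiring write permission on the source) is needed. Reusing the table definition and path from the thread:

CREATE EXTERNAL TABLE collect_info (
  id string, t1 string, t2 string, t3 string,
  t4 string, t5 string, collector string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/group/taobao/taobao/dw/stb/20100125/collect_info';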
Re: Can not run hive 0.4.1
Can you post the traces in /tmp/user/hive.log? Zheng On Tue, Jan 26, 2010 at 12:40 AM, Jeff Zhang zjf...@gmail.com wrote: Hi all, I follow the get started wiki page, but I use the hive 0.4.1 release version rather than svn trunk. And when I invoke command: show tables; It shows the following error message, anyone has encounter this problem before ? hive show tables; FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask -- Best Regards Jeff Zhang -- Yours, Zheng
Re: Can not run hive 0.4.1
This usually happens when there is a problem in the metastore configuration. Did you change any hive configurations? Zheng On Tue, Jan 26, 2010 at 1:41 AM, Jeff Zhang zjf...@gmail.com wrote: The following is the logs: 2010-01-26 17:23:51,509 ERROR exec.DDLTask (SessionState.java:printError(279)) - FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException org.apache.hadoop.hive.ql.metadata.HiveException: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException at org.apache.hadoop.hive.ql.metadata.Hive.getTablesByPattern(Hive.java:400) at org.apache.hadoop.hive.ql.metadata.Hive.getAllTables(Hive.java:387) at org.apache.hadoop.hive.ql.exec.DDLTask.showTables(DDLTask.java:352) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:143) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:379) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:285) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:123) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:181) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:287) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Caused by: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1186) at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:803) at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:698) at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:161) at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:178) at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:122) at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:101) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:130) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:146) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:118) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:100) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.init(HiveMetaStoreClient.java:74) at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:783) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:794) at org.apache.hadoop.hive.ql.metadata.Hive.getTablesByPattern(Hive.java:398) ... 
13 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at javax.jdo.JDOHelper$16.run(JDOHelper.java:1956) at java.security.AccessController.doPrivileged(Native Method) at javax.jdo.JDOHelper.invoke(JDOHelper.java:1951) at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1159) ... 29 more On Tue, Jan 26, 2010 at 4:52 PM, Zheng Shao zsh...@gmail.com wrote: Can you post the traces in /tmp/user/hive.log? Zheng On Tue, Jan 26, 2010 at 12:40 AM, Jeff Zhang zjf...@gmail.com wrote: Hi all, I follow the get started wiki page, but I use the hive 0.4.1 release version rather than svn trunk. And when I invoke command: show tables; It shows the following error message, anyone has encounter this problem before ? hive show tables; FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask -- Best Regards Jeff Zhang -- Yours, Zheng -- Best Regards Jeff Zhang -- Yours, Zheng
Re: Can not run hive 0.4.1
In which directory did you run hive? Try ant package -Doffline=true on hive trunk. Zheng On Tue, Jan 26, 2010 at 2:14 AM, Jeff Zhang zjf...@gmail.com wrote: No, I did not change anything. and BTW, I sync the Hive from svn, but can not build it, the following is the error message: [ivy:retrieve] :: resolution report :: resolve 7120ms :: artifacts dl 454644ms - | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| - | default | 4 | 0 | 0 | 0 || 4 | 0 | - [ivy:retrieve] [ivy:retrieve] :: problems summary :: [ivy:retrieve] WARNINGS [ivy:retrieve] [FAILED ] hadoop#core;0.18.3!hadoop.tar.gz(source): Downloaded file size doesn't match expected Content Length for http://archive.apache.org/dist/hadoop/core/hadoop-0.18.3/hadoop-0.18.3.tar.gz. Please retry. (154498ms) [ivy:retrieve] [FAILED ] hadoop#core;0.18.3!hadoop.tar.gz(source): (0ms) [ivy:retrieve] hadoop-source: tried [ivy:retrieve] http://archive.apache.org/dist/hadoop/core/hadoop-0.18.3/hadoop-0.18.3.tar.gz [ivy:retrieve] apache-snapshot: tried [ivy:retrieve] https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.18.3/hadoop-0.18.3.tar.gz [ivy:retrieve] maven2: tried [ivy:retrieve] http://repo1.maven.org/maven2/hadoop/core/0.18.3/core-0.18.3.tar.gz [ivy:retrieve] [FAILED ] hadoop#core;0.19.0!hadoop.tar.gz(source): Downloaded file size doesn't match expected Content Length for http://archive.apache.org/dist/hadoop/core/hadoop-0.19.0/hadoop-0.19.0.tar.gz. Please retry. (153130ms) [ivy:retrieve] [FAILED ] hadoop#core;0.19.0!hadoop.tar.gz(source): (0ms) [ivy:retrieve] hadoop-source: tried [ivy:retrieve] http://archive.apache.org/dist/hadoop/core/hadoop-0.19.0/hadoop-0.19.0.tar.gz [ivy:retrieve] apache-snapshot: tried [ivy:retrieve] https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.19.0/hadoop-0.19.0.tar.gz [ivy:retrieve] maven2: tried [ivy:retrieve] http://repo1.maven.org/maven2/hadoop/core/0.19.0/core-0.19.0.tar.gz [ivy:retrieve] [FAILED ] hadoop#core;0.20.0!hadoop.tar.gz(source): Downloaded file size doesn't match expected Content Length for http://archive.apache.org/dist/hadoop/core/hadoop-0.20.0/hadoop-0.20.0.tar.gz. Please retry. 
(147000ms) [ivy:retrieve] [FAILED ] hadoop#core;0.20.0!hadoop.tar.gz(source): (0ms) [ivy:retrieve] hadoop-source: tried [ivy:retrieve] http://archive.apache.org/dist/hadoop/core/hadoop-0.20.0/hadoop-0.20.0.tar.gz [ivy:retrieve] apache-snapshot: tried [ivy:retrieve] https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.0/hadoop-0.20.0.tar.gz [ivy:retrieve] maven2: tried [ivy:retrieve] http://repo1.maven.org/maven2/hadoop/core/0.20.0/core-0.20.0.tar.gz [ivy:retrieve] :: [ivy:retrieve] :: FAILED DOWNLOADS :: [ivy:retrieve] :: ^ see resolution messages for details ^ :: [ivy:retrieve] :: [ivy:retrieve] :: hadoop#core;0.18.3!hadoop.tar.gz(source) [ivy:retrieve] :: hadoop#core;0.19.0!hadoop.tar.gz(source) [ivy:retrieve] :: hadoop#core;0.20.0!hadoop.tar.gz(source) [ivy:retrieve] :: [ivy:retrieve] [ivy:retrieve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS BUILD FAILED /root/Hive_trunk/build.xml:148: The following error occurred while executing this line: /root/Hive_trunk/build.xml:93: The following error occurred while executing this line: /root/Hive_trunk/shims/build.xml:55: The following error occurred while executing this line: /root/Hive_trunk/build-common.xml:173: impossible to resolve dependencies: resolve failed - see output for details On Tue, Jan 26, 2010 at 6:04 PM, Zheng Shao zsh...@gmail.com wrote: This usually happens when there is a problem in the metastore configuration. Did you change any hive configurations? Zheng On Tue, Jan 26, 2010 at 1:41 AM, Jeff Zhang zjf...@gmail.com wrote: The following is the logs: 2010-01-26 17:23:51,509 ERROR exec.DDLTask (SessionState.java:printError(279)) - FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException org.apache.hadoop.hive.ql.metadata.HiveException: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables
Re: How can I implement a cursor in Hive? ...or... Can I implement a CROSS APPLY in Hive?...or... How can I do a FOR or WHILE loop (inside or outside) of Hive?
We can use a combination of UDAF and LATERAL VIEW to implement what you want. 1. Define a UDAF like this: max_n(5, products_bought, customer_id) which returns the top 5 products_bought and their customer_id in type of arraystructcol0:int,col1:int 2. Use the Lateral views (with explode) to transform a single row into multiple rows. SELECT t.product_id, t5.products_bought, t5.customer_id FROM ( SELECT product_id, max_n(5, products_bought, customer_id) as top5 FROM temp GROUP BY product_id) t LATERAL VIEW explode(t.top5) t5 AS products_bought, customer_id; See http://wiki.apache.org/hadoop/Hive/LanguageManual/LateralView Paul is the author of UDTF and Lateral view. He might be able to give you more details. Zheng On Mon, Jan 25, 2010 at 10:47 PM, Mike Roberts m...@spyfu.com wrote: I'm trying to use Hive to solve a fairly common SQL scenario that I run into. I have boiled the problem down into its most basic form: You have a table of transactions defined as so: CREATE TABLE transactions (product_id INT, customer_id INT) || |--Transactions--| |---product_id (INT)-| |---customer_id(INT)-| || The goal is simple: For each product, produce a list of the top 5 largest customers. So, the base query would look like this: SELECT product_id, customer_id, count(*) as products_bought FROM transactions GROUP BY product_id, customer_id You could insert that value into another table called products_bought defined as: CREATE TABLE prod_bought (product_id INT, customer_id INT, products_bought INT) Now you have an intermediate result that tells you how many times each customer bought each product. But, obviously, that doesn't completely solve the problem. At this point, in order to solve the problem, you'd have to use a cursor or a CROSS APPLY. Here's an example in T-SQL: --THE CURSOR METHOD: DECLARE @productId int; DECLARE product_cur CURSOR FOR SELECT DISTINCT product_id FROM transactions t OPEN product_cur FETCH product_cur into @productId WHILE (@@FETCH_STATUS -1) BEGIN FETCH product_cur into @productId INSERT top_customers_by_product SELECT TOP 5 product_id, customer_id, products_bought FROM prod_bought WHERE product_id = @productId ORDER BY products_bought desc END CLOSE Domains DEALLOCATE Domains --THE CROSS APPLY METHOD: --First create a user defined function CREATE FUNCTION dbo.fn_GetTopXCustomers(@ProductId INT) RETURNS TABLE AS RETURN SELECT TOP 5 product_id, customer_id, products_bought FROM prod_bought WHERE product_id = @productId ORDER BY products_bought desc GO --Build a table of distinct product Ids SELECT DISTINCT product_id INTO temp_distinct_product_ids FROM transactions --Run the CROSS APPLY SELECT A.product_id , A.customer_id , A.products_bought INTO top_customers_by_product FROM temp_distinct_product_ids T CROSS APPLY dbo.fn_GetTopXCustomers(T.product_id) A Okay, so there are two ways I could solve the problem in SQL (CROSS APPLY is dramatically faster for anyone that cares). How can I do the same thing in Hive? Here's the question restated: How can I implement a cursor in Hive? How can I do a for or while loop in Hive? Can I implement a CROSS APPLY in Hive? I realize that I can implement a cursor outside of Hive and just execute the same Hive script over and over and over again. And, that's not a horrible solution as long as it leverages the full power of Hadoop. My concern is that each of the individual queries that is run inside are fairly inexpensive, but the total number of products makes the total job *very* expensive. 
Also, the solution should be reusable -- I'd really prefer not to write a custom jar every time I run into this problem. Actually, I’m also not particularly religious about using Hive. If there’s some other tech that does what I need, that’s cool too. Thanks in advance. Mike Roberts -- Yours, Zheng
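For concreteness, here is a minimal end-to-end sketch of the approach above against Mike's transactions table. It assumes the custom max_n UDAF from step 1 has already been written and registered (it is not a Hive built-in), and it mirrors the lateral-view query from Zheng's reply:

SELECT t.product_id, t5.products_bought, t5.customer_id
FROM (
  SELECT product_id, max_n(5, products_bought, customer_id) AS top5
  FROM (
    -- the intermediate "prod_bought" aggregation, inlined as a subquery
    SELECT product_id, customer_id, count(*) AS products_bought
    FROM transactions
    GROUP BY product_id, customer_id
  ) pb
  GROUP BY product_id
) t
LATERAL VIEW explode(t.top5) t5 AS products_bought, customer_id;

The inner query replaces the cursor/CROSS APPLY step: the per-product top-5 selection happens inside the UDAF, so the whole job runs as ordinary map-reduce stages instead of a per-product loop.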
Re: Error after loading data
Hi Ankit, org.apache.hadoop.mapreduce.lib.input.XmlInputFormat implements the new mapreduce InputFormat API, while Hive needs an InputFormat that implements org.apache.hadoop.mapred.InputFormat (the old API). This might work: http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/api/edu/umd/cloud9/collection/XMLInputFormat.html Or you might want to adapt the XMLInputFormat to the old API so Hive can read from it. Zheng On Fri, Jan 22, 2010 at 10:58 AM, ankit bhatnagar abhatna...@gmail.com wrote: Hi all, I am loading data from an xml file into a hive schema. add jar build/contrib/hadoop-mapred-0.22.0-SNAPSHOT.jar CREATE TABLE IF NOT EXISTS PARSE_XML( column1 String, column2 String ) STORED AS INPUTFORMAT 'org.apache.hadoop.mapreduce.lib.input.XmlInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'; LOAD DATA LOCAL INPATH './hive-svn/build/dist/examples/files/upload.xml' OVERWRITE INTO TABLE PARSE_XML; I was able to create the table; however, I got the following error - FAILED: Error in semantic analysis: line 1:14 Input Format must implement InputFormat parse_xml - when I do a select on the table. Ankit -- Yours, Zheng
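For illustration, a hedged sketch of what the DDL could look like once an old-API InputFormat is on the classpath; the cloud9 class linked above is one candidate, and the jar path here is hypothetical:

ADD JAR /path/to/cloud9.jar;  -- hypothetical path to the jar containing an old-API XMLInputFormat

CREATE TABLE IF NOT EXISTS parse_xml (
  column1 STRING,
  column2 STRING
)
STORED AS
  INPUTFORMAT 'edu.umd.cloud9.collection.XMLInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat';

The only change from the original DDL is the INPUTFORMAT class; everything else, including the LOAD DATA statement, stays the same.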
Re: Deleted input files after load
If you want the files to stay there, you can try CREATE EXTERNAL TABLE with a location (instead of create table + load). Zheng On Fri, Jan 22, 2010 at 10:51 AM, Bill Graham billgra...@gmail.com wrote: Hive doesn't delete the files upon load; it moves them to a location under the Hive warehouse directory. Try looking under /user/hive/warehouse/t_word_count. On Fri, Jan 22, 2010 at 10:44 AM, Shiva shiv...@gmail.com wrote: Hi, For the first time I used Hive to load a couple of word count data input files into tables, with and without OVERWRITE. Both times the input file in HDFS got deleted. Is that expected behavior? I couldn't find any definitive answer on the Hive wiki. hive LOAD DATA INPATH '/user/vmplanet/output/part-0' OVERWRITE INTO TABLE t_word_count; Env.: Using Hadoop 0.20.1 and the latest Hive on Ubuntu 9.10 running in VMware. Thanks, Shiva -- Yours, Zheng
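A minimal sketch of the EXTERNAL TABLE approach Zheng describes; the column layout is invented for illustration and should match however the word-count files are actually formatted:

CREATE EXTERNAL TABLE t_word_count (
  word STRING,
  cnt INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/vmplanet/output/';
-- No LOAD DATA step is needed: Hive reads the files where they already are,
-- and dropping the table later removes only the metadata, not the files.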
Re: hive multiple inserts
https://issues.apache.org/jira/browse/HIVE-634 As far as I know there is nobody working on that right now. If you are interested, we can work together on that. Let's move the discussion to the JIRA. Zheng On Tue, Jan 12, 2010 at 3:27 AM, Anty anty@gmail.com wrote: Thanks Zheng. Yes, we have used RegexSerDe in some use cases, but the speed is indeed slower, so we don't want to use regular expressions if not necessary. I found HIVE-634 https://issues.apache.org/jira/browse/HIVE-634 is what I need, allowing the user to specify a field delimiter in any format: INSERT OVERWRITE LOCAL DIRECTORY '/mnt/daily_timelines' [ ROW FORMAT DELIMITED | SERDE ... ] [ FILE FORMAT ...] SELECT * FROM daily_timelines; Is somebody still working on this feature? On Tue, Jan 12, 2010 at 2:28 PM, Zheng Shao zsh...@gmail.com wrote: Yes, we only support one-byte delimiters for performance reasons. You can use the RegexSerDe in the contrib package for any row format that allows a regular expression (including your case), but the speed will be slower. Zheng On Mon, Jan 11, 2010 at 5:54 PM, Anty anty@gmail.com wrote: Thanks Zheng. It does work. I have another question: if the field delimiter is a string, it looks like LazySimpleSerDe can't work. Does LazySimpleSerDe not support string field delimiters, only one-byte control characters? On Tue, Jan 12, 2010 at 3:05 AM, Zheng Shao zsh...@gmail.com wrote: For your second question, currently we can do it with a little extra work: 1. Create an external table on the target directory with the field delimiter you want; 2. Run the query and insert overwrite the target external table. For the first question we can also do a similar thing (create a bunch of external tables and then insert), but I think we should fix the problem. Zheng On Mon, Jan 11, 2010 at 8:31 AM, Anty anty@gmail.com wrote: Hi: I came across the same problem, there is no data. I have one more question: can I specify the field delimiter for the output file, not just the default ctrl-A field delimiter? On Fri, Jan 8, 2010 at 2:23 PM, wd w...@wdicc.com wrote: Hi, I've tried the hive svn version; it seems this bug still exists. svn st -v 896805 896744 namit . 896805 894292 namit eclipse-templates 896805 894292 namit eclipse-templates/.classpath 896805 765509 zshao eclipse-templates/TestHive.launchtemplate 896805 765509 zshao eclipse-templates/TestMTQueries.l .. svn revision 896805? The following is the execution log. hive from test INSERT OVERWRITE LOCAL DIRECTORY '/home/stefdong/tmp/0' select * where a = 1 INSERT OVERWRITE LOCAL DIRECTORY '/home/stefdong/tmp/1' select * where a = 3; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201001071716_4691, Tracking URL = http://abc.com:50030/jobdetails.jsp?jobid=job_201001071716_4691 Kill Command = hadoop job -Dmapred.job.tracker=abc.com:9001 -kill job_201001071716_4691 2010-01-08 14:14:55,442 Stage-2 map = 0%, reduce = 0% 2010-01-08 14:15:00,643 Stage-2 map = 100%, reduce = 0% Ended Job = job_201001071716_4691 Copying data to local directory /home/stefdong/tmp/0 Copying data to local directory /home/stefdong/tmp/0 13 Rows loaded to /home/stefdong/tmp/0 9 Rows loaded to /home/stefdong/tmp/1 OK Time taken: 9.409 seconds thx.
2010/1/6 wd w...@wdicc.com hi, A single insert can extract data into '/tmp/out/1'. I can even see xxx rows loaded to '/tmp/out/0', xxx rows loaded to '/tmp/out/1', etc. in multi inserts, but there is no data in fact. Haven't tried the svn revision, will try it today. thx. 2010/1/5 Zheng Shao zsh...@gmail.com Looks like a bug. What is the svn revision of Hive? Did you verify that a single insert into '/tmp/out/1' produces non-empty files? Zheng On Tue, Jan 5, 2010 at 12:51 AM, wd w...@wdicc.com wrote: In the hive wiki: Hive extension (multiple inserts): FROM from_statement INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1 [INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ... I'm trying to use hive multi inserts to extract data from hive to local disk. The following is the hql: from test_tbl INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/0' select * where id%10=0 INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/1' select * where id%10=1 INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/2' select * where id%10=2 This hql can execute, but only /tmp/out/0 has a datafile in it; the other directories are empty. Why does this happen? A bug?
Re: hive multiple inserts
For your second question, currently we can do it with a little extra work: 1. Create an external table on the target directory with the field delimiter you want; 2. Run the query and insert overwrite the target external table. For the first question we can also do a similar thing (create a bunch of external tables and then insert), but I think we should fix the problem. Zheng On Mon, Jan 11, 2010 at 8:31 AM, Anty anty@gmail.com wrote: Hi: I came across the same problem, there is no data. I have one more question: can I specify the field delimiter for the output file, not just the default ctrl-A field delimiter? On Fri, Jan 8, 2010 at 2:23 PM, wd w...@wdicc.com wrote: Hi, I've tried the hive svn version; it seems this bug still exists. svn st -v 896805 896744 namit . 896805 894292 namit eclipse-templates 896805 894292 namit eclipse-templates/.classpath 896805 765509 zshao eclipse-templates/TestHive.launchtemplate 896805 765509 zshao eclipse-templates/TestMTQueries.l .. svn revision 896805? The following is the execution log. hive from test INSERT OVERWRITE LOCAL DIRECTORY '/home/stefdong/tmp/0' select * where a = 1 INSERT OVERWRITE LOCAL DIRECTORY '/home/stefdong/tmp/1' select * where a = 3; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201001071716_4691, Tracking URL = http://abc.com:50030/jobdetails.jsp?jobid=job_201001071716_4691 Kill Command = hadoop job -Dmapred.job.tracker=abc.com:9001 -kill job_201001071716_4691 2010-01-08 14:14:55,442 Stage-2 map = 0%, reduce = 0% 2010-01-08 14:15:00,643 Stage-2 map = 100%, reduce = 0% Ended Job = job_201001071716_4691 Copying data to local directory /home/stefdong/tmp/0 Copying data to local directory /home/stefdong/tmp/0 13 Rows loaded to /home/stefdong/tmp/0 9 Rows loaded to /home/stefdong/tmp/1 OK Time taken: 9.409 seconds thx. 2010/1/6 wd w...@wdicc.com hi, A single insert can extract data into '/tmp/out/1'. I can even see xxx rows loaded to '/tmp/out/0', xxx rows loaded to '/tmp/out/1', etc. in multi inserts, but there is no data in fact. Haven't tried the svn revision, will try it today. thx. 2010/1/5 Zheng Shao zsh...@gmail.com Looks like a bug. What is the svn revision of Hive? Did you verify that a single insert into '/tmp/out/1' produces non-empty files? Zheng On Tue, Jan 5, 2010 at 12:51 AM, wd w...@wdicc.com wrote: In the hive wiki: Hive extension (multiple inserts): FROM from_statement INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1 [INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ... I'm trying to use hive multi inserts to extract data from hive to local disk. The following is the hql: from test_tbl INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/0' select * where id%10=0 INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/1' select * where id%10=1 INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/2' select * where id%10=2 This hql can execute, but only /tmp/out/0 has a datafile in it; the other directories are empty. Why does this happen? A bug? -- Yours, Zheng -- Best Regards Anty Rao -- Yours, Zheng
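A rough sketch of the two-step workaround above; the table name, columns, delimiter and target directory are made up for illustration:

-- Step 1: external table whose location is the directory you want the output in
CREATE EXTERNAL TABLE export_pipe_delim (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/tmp/out/pipe_delim/';

-- Step 2: write the query results into it; the files land in /tmp/out/pipe_delim/
-- using '|' as the field delimiter instead of the default ctrl-A
INSERT OVERWRITE TABLE export_pipe_delim
SELECT id, name FROM test_tbl;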
Re: hive multiple inserts
Yes, we only support one-byte delimiters for performance reasons. You can use the RegexSerDe in the contrib package for any row format that allows a regular expression (including your case), but the speed will be slower. Zheng On Mon, Jan 11, 2010 at 5:54 PM, Anty anty@gmail.com wrote: Thanks Zheng. It does work. I have another question: if the field delimiter is a string, it looks like LazySimpleSerDe can't work. Does LazySimpleSerDe not support string field delimiters, only one-byte control characters? On Tue, Jan 12, 2010 at 3:05 AM, Zheng Shao zsh...@gmail.com wrote: For your second question, currently we can do it with a little extra work: 1. Create an external table on the target directory with the field delimiter you want; 2. Run the query and insert overwrite the target external table. For the first question we can also do a similar thing (create a bunch of external tables and then insert), but I think we should fix the problem. Zheng On Mon, Jan 11, 2010 at 8:31 AM, Anty anty@gmail.com wrote: Hi: I came across the same problem, there is no data. I have one more question: can I specify the field delimiter for the output file, not just the default ctrl-A field delimiter? On Fri, Jan 8, 2010 at 2:23 PM, wd w...@wdicc.com wrote: Hi, I've tried the hive svn version; it seems this bug still exists. svn st -v 896805 896744 namit . 896805 894292 namit eclipse-templates 896805 894292 namit eclipse-templates/.classpath 896805 765509 zshao eclipse-templates/TestHive.launchtemplate 896805 765509 zshao eclipse-templates/TestMTQueries.l .. svn revision 896805? The following is the execution log. hive from test INSERT OVERWRITE LOCAL DIRECTORY '/home/stefdong/tmp/0' select * where a = 1 INSERT OVERWRITE LOCAL DIRECTORY '/home/stefdong/tmp/1' select * where a = 3; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201001071716_4691, Tracking URL = http://abc.com:50030/jobdetails.jsp?jobid=job_201001071716_4691 Kill Command = hadoop job -Dmapred.job.tracker=abc.com:9001 -kill job_201001071716_4691 2010-01-08 14:14:55,442 Stage-2 map = 0%, reduce = 0% 2010-01-08 14:15:00,643 Stage-2 map = 100%, reduce = 0% Ended Job = job_201001071716_4691 Copying data to local directory /home/stefdong/tmp/0 Copying data to local directory /home/stefdong/tmp/0 13 Rows loaded to /home/stefdong/tmp/0 9 Rows loaded to /home/stefdong/tmp/1 OK Time taken: 9.409 seconds thx. 2010/1/6 wd w...@wdicc.com hi, A single insert can extract data into '/tmp/out/1'. I can even see xxx rows loaded to '/tmp/out/0', xxx rows loaded to '/tmp/out/1', etc. in multi inserts, but there is no data in fact. Haven't tried the svn revision, will try it today. thx. 2010/1/5 Zheng Shao zsh...@gmail.com Looks like a bug. What is the svn revision of Hive? Did you verify that a single insert into '/tmp/out/1' produces non-empty files? Zheng On Tue, Jan 5, 2010 at 12:51 AM, wd w...@wdicc.com wrote: In the hive wiki: Hive extension (multiple inserts): FROM from_statement INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1 [INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ... I'm trying to use hive multi inserts to extract data from hive to local disk.
The following is the hql: from test_tbl INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/0' select * where id%10=0 INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/1' select * where id%10=1 INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/2' select * where id%10=2 This hql can execute, but only /tmp/out/0 has a datafile in it; the other directories are empty. Why does this happen? A bug? -- Yours, Zheng -- Best Regards Anty Rao -- Yours, Zheng -- Best Regards Anty Rao -- Yours, Zheng
Re: Speedup of test target
Unfortunately, the trunk does not run tests in parallel yet. The majority of the time is spent in TestCliDriver, which contains over 200 .q files. We will need to separate the working directories and metastore directories to make these .q files run in parallel. Zheng On Thu, Jan 7, 2010 at 11:46 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Since apache was granted a clover license, I was looking to add a clover target to hive. I know there was recently a jira issue on running ant tests in parallel. I have a modest Core 2 Duo laptop that takes quite a while on the test target. Does the trunk currently run tests in parallel by default, and if not, how can I enable this? Also, what are people out there using to run the test target hardware-wise, and how long does ant test take? Thanks, Edward -- Yours, Zheng
Re: hive multiple inserts
Looks like a bug. What is the svn revision of Hive? Did you verify that a single insert into '/tmp/out/1' produces non-empty files? Zheng On Tue, Jan 5, 2010 at 12:51 AM, wd w...@wdicc.com wrote: In the hive wiki: Hive extension (multiple inserts): FROM from_statement INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1 [INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ... I'm trying to use hive multi inserts to extract data from hive to local disk. The following is the hql: from test_tbl INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/0' select * where id%10=0 INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/1' select * where id%10=1 INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/2' select * where id%10=2 This hql can execute, but only /tmp/out/0 has a datafile in it; the other directories are empty. Why does this happen? A bug? -- Yours, Zheng
RE: Populating MAP type columns
Hi Saurabh, I think we can do it with the following 3 UDFs: make_map(trim(split(cookies, ",")), "="). ArrayList<String> split(String) - see http://issues.apache.org/jira/browse/HIVE-642. ArrayList<String> trim(ArrayList<String>) and HashMap<String,String> make_map(ArrayList<String>, String separator) - these last 2 need to be written; please open a JIRA for each. It will be great if you are interested in working on that. There are some examples in the contrib directory already (search for UDFExampleAdd). See http://wiki.apache.org/hadoop/Hive/HowToContribute Zheng From: Saurabh Nanda [mailto:saurabhna...@gmail.com] Sent: Tuesday, January 05, 2010 2:01 AM To: hive-user@hadoop.apache.org Subject: Populating MAP type columns From http://wiki.apache.org/hadoop/Hive/Tutorial#Map.28Associative_Arrays.29_Operations it seems that "Such structures can only be created programmatically currently." What does this mean exactly? Do I have to use the Java-based API to insert data into such columns? If that is the case, has someone written a UDF which lets me import weblog cookie data into a MAP column using only Hive QL? The cookie data is of the following format: cookie_name1=value; cookie_name2=value; cookie_name3=value If there is no such UDF available, would it be a good idea to include one in the standard Hive distribution? Thanks, Saurabh.
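To make the intent concrete, here is a hypothetical usage sketch once the proposed UDFs exist (split, the array-trim variant, and make_map are proposals above, not current built-ins); the weblog table, its columns, and the ';' separator are assumptions based on the cookie format shown:

CREATE TABLE weblog (
  request_id STRING,
  cookies STRING  -- e.g. 'cookie_name1=value; cookie_name2=value; cookie_name3=value'
);

SELECT request_id,
       make_map(trim(split(cookies, ';')), '=') AS cookie_map
FROM weblog;
-- cookie_map would be a MAP<STRING,STRING> such as
-- {'cookie_name1' -> 'value', 'cookie_name2' -> 'value', 'cookie_name3' -> 'value'}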
Re: Null values in hive output
Hi Eric, Most probably there are leading/trailing spaces in the columns that are defined as int. If Hive cannot parse the field successfully, the field will become null. You can try this to find out the rows: SELECT * FROM raw_facts WHERE year IS NULL; (A sketch of one way to clean the data follows at the end of this message.) Zheng On Mon, Jan 4, 2010 at 4:10 PM, Eric Sammer e...@lifeless.net wrote: All: I apologize in advance if this is common. I've searched and I can't find an explanation. I'm loading a plain text tab delimited file into a Hive (0.4.1-dev) table. This file is a small sample set of my full dataset and is the result of an M/R job, written by TextOutputFormat, if it matters. When I query the table, a small percentage (a few hundred out of a few million) of the rows contain null values, whereas the input file does not contain any null values. The number of null field records seems to grow proportionally to the total number of records at a relatively constant rate. It looks as if it's a SerDe error / misconfiguration of some kind, but I can't pinpoint anything that would cause the issue. To confirm, I've done an fs -cat of the file to local disk and used cut and sort to confirm all fields are properly formatted and populated. Below is the extended table description along with some additional information. Any help is greatly appreciated, as using Hive for simple aggregation is saving me a ton of time over hand-writing the M/R jobs myself. I'm sure there's something I've done wrong. Unfortunately, I'm in a situation where I can't deal with any portion of the records being dumped (part of a reporting system). Original create: hive create table raw_facts ( year int, month int, day int, application string, company_id int, country_code string, receiver_code_id int, keyword string, total int ) row format delimited fields terminated by '\t'; (I've also tried row format TEXTFORMAT or whatever it is; all fields were null - assumed it was because hive was expecting ^A delimited.)
Table: hive describe extended raw_facts; OK year int month int day int application string company_id int country_code string receiver_code_id int keyword string total int Detailed Table Information Table(tableName:raw_facts, dbName:default, owner:snip, createTime:1262631537, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:year, type:int, comment:null), FieldSchema(name:month, type:int, comment:null), FieldSchema(name:day, type:int, comment:null), FieldSchema(name:application, type:string, comment:null), FieldSchema(name:company_id, type:int, comment:null), FieldSchema(name:country_code, type:string, comment:null), FieldSchema(name:receiver_code_id, type:int, comment:null), FieldSchema(name:keyword, type:string, comment:null), FieldSchema(name:total, type:int, comment:null)], location:hdfs://snip/home/hive/warehouse/raw_facts, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=9,field.delim= }), bucketCols:[], sortCols:[], parameters:{}), partitionKeys:[], parameters:{}) Sample (real) rows: (these are tab separated in the file) 2009 12 01 f 98 US 171 test 222 2009 12 01 f 98 US 199 test 222 2009 12 01 f 98 US 220 test 222 Load command used: hive load data inpath 'hdfs://snip/some/path/out/part-r-0' overwrite into table raw_facts ; Some queries: hive select count(1) from raw_facts; OK 4723253 hive select count(1) from raw_facts where year is null; OK 277 hive select year,count(1) from raw_facts group by year; OK NULL 277 2009 4722976 Thanks in advance. -- Eric Sammer e...@lifless.net http://esammer.blogspot.com -- Yours, Zheng
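One way to confirm the whitespace theory and work around it, sketched here under the assumption that the suspect fields really do carry stray spaces; the staging table raw_facts_str is hypothetical:

-- Load the numeric fields as STRING first, then trim and cast.
CREATE TABLE raw_facts_str (
  year STRING, month STRING, day STRING, application STRING,
  company_id STRING, country_code STRING, receiver_code_id STRING,
  keyword STRING, total STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Rows whose year field carries extra whitespace:
SELECT * FROM raw_facts_str WHERE trim(year) <> year;

-- Clean insert into the typed table:
INSERT OVERWRITE TABLE raw_facts
SELECT CAST(trim(year) AS INT), CAST(trim(month) AS INT), CAST(trim(day) AS INT),
       application, CAST(trim(company_id) AS INT), country_code,
       CAST(trim(receiver_code_id) AS INT), keyword, CAST(trim(total) AS INT)
FROM raw_facts_str;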