BTW, JIRA HIVE-1047 subsumes HIVE-1039 in trunk. If you are using the 0.5.0 branch, HIVE-1039 is already there. If you are using 0.4 or an earlier release, you can apply either HIVE-1039 or HIVE-1047; both are very simple changes.
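If you want a quick way to tell whether a given build is affected, here is a minimal sketch reusing the netflix/t1/t2 names from Min Zhou's repro below; on an affected build the explain lists a Move Operator for t1 only:

  set hive.merge.mapfiles=false;
  explain
  from netflix
  insert overwrite table t1 select movie_id
  insert overwrite table t2 select user_id;
  -- Affected builds: STAGE PLANS contains a single Move Operator (t1 only).
  -- Patched builds: one Move Operator per destination table (t1 and t2).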
On Jan 21, 2010, at 9:56 AM, Namit Jain wrote:

> Which version are you using ?
>
> The bug mentioned was fixed by:
>
> https://issues.apache.org/jira/browse/HIVE-1039
>
> Thanks,
> -namit
>
> -----Original Message-----
> From: Min Zhou [mailto:[email protected]]
> Sent: Thursday, January 21, 2010 12:40 AM
> To: [email protected]
> Subject: Re: hive multiple inserts
>
> It looks like a bug in Hive. See below:
>
> hive> set hive.merge.mapfiles=true;
> hive> explain from netflix insert overwrite table t1 select movie_id
>     > insert overwrite table t2 select user_id;
> OK
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_TABREF netflix)) (TOK_INSERT
>   (TOK_DESTINATION (TOK_TAB t1)) (TOK_SELECT (TOK_SELEXPR
>   (TOK_TABLE_OR_COL movie_id)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB
>   t2)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL user_id)))))
>
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-5 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-5
>   Stage-8 depends on stages: Stage-2
>   Stage-1 depends on stages: Stage-8
>
> STAGE PLANS:
>   Stage: Stage-2
>     Map Reduce
>       Alias -> Map Operator Tree:
>         netflix
>           TableScan
>             alias: netflix
>             Select Operator
>               expressions:
>                     expr: movie_id
>                     type: string
>               outputColumnNames: _col0
>               File Output Operator
>                 compressed: true
>                 GlobalTableId: 1
>                 table:
>                     input format: org.apache.hadoop.mapred.TextInputFormat
>                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                     name: t1
>             Select Operator
>               expressions:
>                     expr: user_id
>                     type: string
>               outputColumnNames: _col0
>               File Output Operator
>                 compressed: true
>                 GlobalTableId: 2
>                 table:
>                     input format: org.apache.hadoop.mapred.TextInputFormat
>                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                     name: t2
>
>   Stage: Stage-5
>     Conditional Operator
>       list of dependent Tasks:
>           Move Operator
>             files:
>                 hdfs directory: true
>                 destination: hdfs://hdpnn.cm3:9000/group/tbdev/zhoumin/hive-tmp/794320195/10000
>           Map Reduce
>             Alias -> Map Operator Tree:
>               hdfs://hdpnn.cm3:9000/group/tbdev/zhoumin/hive-tmp/570535800/10004
>                   Reduce Output Operator
>                     sort order:
>                     Map-reduce partition columns:
>                           expr: rand()
>                           type: double
>                     tag: -1
>                     value expressions:
>                           expr: foo
>                           type: string
>             Reduce Operator Tree:
>               Extract
>                 File Output Operator
>                   compressed: true
>                   GlobalTableId: 0
>                   table:
>                       input format: org.apache.hadoop.mapred.TextInputFormat
>                       output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                       serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                       name: t1
>
>   Stage: Stage-0
>     Move Operator
>       tables:
>           replace: true
>           table:
>               input format: org.apache.hadoop.mapred.TextInputFormat
>               output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>               serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>               name: t1
>
>   Stage: Stage-8
>     Conditional Operator
>       list of dependent Tasks:
>           Move Operator
>             files:
>                 hdfs directory: true
>                 destination: hdfs://hdpnn.cm3:9000/group/tbdev/zhoumin/hive-tmp/794320195/10002
>           Map Reduce
>             Alias -> Map Operator Tree:
>               hdfs://hdpnn.cm3:9000/group/tbdev/zhoumin/hive-tmp/570535800/10005
>                   Reduce Output Operator
>                     sort order:
>                     Map-reduce partition columns:
>                           expr: rand()
>                           type: double
>                     tag: -1
>                     value expressions:
>                           expr: bar
>                           type: string
>             Reduce Operator Tree:
>               Extract
>                 File Output Operator
>                   compressed: true
>                   GlobalTableId: 0
>                   table:
>                       input format: org.apache.hadoop.mapred.TextInputFormat
>                       output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                       serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                       name: t2
>
>   Stage: Stage-1
>     Move Operator
>       tables:
>           replace: true
>           table:
>               input format: org.apache.hadoop.mapred.TextInputFormat
>               output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>               serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>               name: t2
>
> -----------------------------------------------------------------------------
>
> hive> set hive.merge.mapfiles=false;
> hive> explain from netflix insert overwrite table t1 select movie_id
>     > insert overwrite table t2 select user_id;
> OK
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_TABREF netflix)) (TOK_INSERT
>   (TOK_DESTINATION (TOK_TAB t1)) (TOK_SELECT (TOK_SELEXPR
>   (TOK_TABLE_OR_COL movie_id)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB
>   t2)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL user_id)))))
>
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-0 depends on stages: Stage-2
>
> STAGE PLANS:
>   Stage: Stage-2
>     Map Reduce
>       Alias -> Map Operator Tree:
>         netflix
>           TableScan
>             alias: netflix
>             Select Operator
>               expressions:
>                     expr: movie_id
>                     type: string
>               outputColumnNames: _col0
>               File Output Operator
>                 compressed: true
>                 GlobalTableId: 1
>                 table:
>                     input format: org.apache.hadoop.mapred.TextInputFormat
>                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                     name: t1
>             Select Operator
>               expressions:
>                     expr: user_id
>                     type: string
>               outputColumnNames: _col0
>               File Output Operator
>                 compressed: true
>                 GlobalTableId: 2
>                 table:
>                     input format: org.apache.hadoop.mapred.TextInputFormat
>                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                     name: t2
>
>   Stage: Stage-0
>     Move Operator
>       tables:
>           replace: true
>           table:
>               input format: org.apache.hadoop.mapred.TextInputFormat
>               output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>               serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>               name: t1
>
> From the explain output above, you can see that under the second setting
> the query generates only one Move operator (for t1). That is why t2 ends
> up empty.
>
> Regards,
> Min
>
> On Wed, Jan 13, 2010 at 5:01 PM, Zheng Shao <[email protected]> wrote:
>> https://issues.apache.org/jira/browse/HIVE-634
>> As far as I know there is nobody working on that right now. If you are
>> interested, we can work together on that.
>> Let's move the discussion to the JIRA.
>>
>> Zheng
>>
>> On Tue, Jan 12, 2010 at 3:27 AM, Anty <[email protected]> wrote:
>>>
>>> Thanks Zheng.
>>> We have used RegexSerDe in some use cases, but the speed is indeed
>>> slower, so we don't want to use regular expressions unless necessary.
>>> I found that HIVE-634 is what I need: it allows the user to specify a
>>> field delimiter in any format.
>>>
>>> INSERT OVERWRITE LOCAL DIRECTORY '/mnt/daily_timelines'
>>> [ ROW FORMAT DELIMITED | SERDE ... ]
>>> [ FILE FORMAT ...]
>>> SELECT * FROM daily_timelines;
>>>
>>> Is somebody still working on this feature?
>>>
>>> On Tue, Jan 12, 2010 at 2:28 PM, Zheng Shao <[email protected]> wrote:
>>>> Yes, we only support one-byte delimiters, for performance reasons.
>>>>
>>>> You can use the RegexSerDe in the contrib package for any row format
>>>> that allows a regular expression (including your case "<>"), but the
>>>> speed will be slower.
>>>>
>>>> Zheng
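A minimal sketch of the RegexSerDe suggestion above, assuming a two-column table (table and column names are hypothetical) and the "<>" delimiter from Anty's case; the contrib jar path will vary with your build:

  add jar /path/to/hive_contrib.jar;  -- adjust to your build's contrib jar

  CREATE TABLE angle_delim (col1 STRING, col2 STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
  WITH SERDEPROPERTIES (
    "input.regex" = "(.*?)<>(.*)",            -- split at the first "<>"
    "output.format.string" = "%1$s<>%2$s"     -- used when writing rows out
  )
  STORED AS TEXTFILE;

Note that the contrib RegexSerDe requires all columns to be STRING, and, as Zheng says, it is slower than LazySimpleSerDe.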
>>>> On Mon, Jan 11, 2010 at 5:54 PM, Anty <[email protected]> wrote:
>>>>>
>>>>> Thanks Zheng.
>>>>> It does work.
>>>>> I have another question: if the field delimiter is a string, e.g.
>>>>> "<>", it looks like LazySimpleSerDe can't handle it. Does
>>>>> LazySimpleSerDe not support string field delimiters, only one-byte
>>>>> control characters?
>>>>>
>>>>> On Tue, Jan 12, 2010 at 3:05 AM, Zheng Shao <[email protected]> wrote:
>>>>>> For your second question, currently we can do it with a little extra
>>>>>> work:
>>>>>> 1. Create an external table on the target directory with the field
>>>>>> delimiter you want;
>>>>>> 2. Run the query and insert overwrite the target external table.
>>>>>>
>>>>>> For the first question we can also do a similar thing (create a
>>>>>> bunch of external tables and then insert), but I think we should fix
>>>>>> the problem.
>>>>>>
>>>>>> Zheng
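A minimal sketch of the two steps Zheng describes, with hypothetical table names and paths; note that FIELDS TERMINATED BY still takes a single character (a tab here), in line with the one-byte limitation above:

  -- 1. External table over the target directory, declaring the delimiter.
  CREATE EXTERNAL TABLE export_target (movie_id STRING, user_id STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE
  LOCATION '/user/hive/export/netflix';

  -- 2. Insert overwrite the external table; the files under
  --    /user/hive/export/netflix come out tab-delimited.
  INSERT OVERWRITE TABLE export_target
  SELECT movie_id, user_id FROM netflix;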
>>>>>> On Mon, Jan 11, 2010 at 8:31 AM, Anty <[email protected]> wrote:
>>>>>>> HI:
>>>>>>> I came across the same problem: there is no data. I have one more
>>>>>>> question: can I specify the field delimiter for the output file,
>>>>>>> instead of just the default Ctrl-A field delimiter?
>>>>>>>
>>>>>>> On Fri, Jan 8, 2010 at 2:23 PM, wd <[email protected]> wrote:
>>>>>>>> hi,
>>>>>>>>
>>>>>>>> I've tried the Hive svn version; it seems this bug still exists.
>>>>>>>>
>>>>>>>> svn st -v
>>>>>>>>   896805   896744 namit  .
>>>>>>>>   896805   894292 namit  eclipse-templates
>>>>>>>>   896805   894292 namit  eclipse-templates/.classpath
>>>>>>>>   896805   765509 zshao  eclipse-templates/TestHive.launchtemplate
>>>>>>>>   896805   765509 zshao  eclipse-templates/TestMTQueries.l
>>>>>>>> ..........
>>>>>>>>
>>>>>>>> svn revision 896805?
>>>>>>>>
>>>>>>>> The execution log follows:
>>>>>>>>
>>>>>>>> hive> from test
>>>>>>>>     > INSERT OVERWRITE LOCAL DIRECTORY '/home/stefdong/tmp/0'
>>>>>>>>     > select * where a = 1
>>>>>>>>     > INSERT OVERWRITE LOCAL DIRECTORY '/home/stefdong/tmp/1'
>>>>>>>>     > select * where a = 3;
>>>>>>>> Total MapReduce jobs = 1
>>>>>>>> Launching Job 1 out of 1
>>>>>>>> Number of reduce tasks is set to 0 since there's no reduce operator
>>>>>>>> Starting Job = job_201001071716_4691, Tracking URL =
>>>>>>>> http://abc.com:50030/jobdetails.jsp?jobid=job_201001071716_4691
>>>>>>>> Kill Command = hadoop job -Dmapred.job.tracker=abc.com:9001 -kill
>>>>>>>> job_201001071716_4691
>>>>>>>> 2010-01-08 14:14:55,442 Stage-2 map = 0%, reduce = 0%
>>>>>>>> 2010-01-08 14:15:00,643 Stage-2 map = 100%, reduce = 0%
>>>>>>>> Ended Job = job_201001071716_4691
>>>>>>>> Copying data to local directory /home/stefdong/tmp/0
>>>>>>>> Copying data to local directory /home/stefdong/tmp/0
>>>>>>>> 13 Rows loaded to /home/stefdong/tmp/0
>>>>>>>> 9 Rows loaded to /home/stefdong/tmp/1
>>>>>>>> OK
>>>>>>>> Time taken: 9.409 seconds
>>>>>>>>
>>>>>>>> thx.
>>>>>>>>
>>>>>>>> 2010/1/6 wd <[email protected]>
>>>>>>>>>
>>>>>>>>> hi,
>>>>>>>>>
>>>>>>>>> A single insert can extract data into '/tmp/out/1'. I can even
>>>>>>>>> see "xxx rows loaded to '/tmp/out/0'", "xxx rows loaded to
>>>>>>>>> '/tmp/out/1'", etc. in multi inserts, but there is no data in
>>>>>>>>> fact.
>>>>>>>>>
>>>>>>>>> Haven't tried the svn revision; will try it today. thx.
>>>>>>>>>
>>>>>>>>> 2010/1/5 Zheng Shao <[email protected]>
>>>>>>>>>>
>>>>>>>>>> Looks like a bug.
>>>>>>>>>> What is the svn revision of Hive?
>>>>>>>>>>
>>>>>>>>>> Did you verify that a single insert into '/tmp/out/1' produces
>>>>>>>>>> non-empty files?
>>>>>>>>>>
>>>>>>>>>> Zheng
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 5, 2010 at 12:51 AM, wd <[email protected]> wrote:
>>>>>>>>>>> In the hive wiki:
>>>>>>>>>>>
>>>>>>>>>>> Hive extension (multiple inserts):
>>>>>>>>>>> FROM from_statement
>>>>>>>>>>> INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
>>>>>>>>>>> [INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2]
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>> I'm trying to use hive multi inserts to extract data from hive
>>>>>>>>>>> to local disk. The HQL follows:
>>>>>>>>>>>
>>>>>>>>>>> from test_tbl
>>>>>>>>>>> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/0' select * where id%10=0
>>>>>>>>>>> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/1' select * where id%10=1
>>>>>>>>>>> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/2' select * where id%10=2
>>>>>>>>>>>
>>>>>>>>>>> This HQL executes, but only /tmp/out/0 has a data file in it;
>>>>>>>>>>> the other directories are empty. Why does this happen? A bug?
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Yours,
>>>>>>>>>> Zheng
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards
>>>>>>> Anty Rao
>>>>>>
>>>>>> --
>>>>>> Yours,
>>>>>> Zheng
>>>>>
>>>>> --
>>>>> Best Regards
>>>>> Anty Rao
>>>>
>>>> --
>>>> Yours,
>>>> Zheng
>>>
>>> --
>>> Best Regards
>>> Anty Rao
>>
>> --
>> Yours,
>> Zheng
>
> --
> My research interests are distributed systems, parallel computing and
> bytecode-based virtual machines.
>
> My profile: http://www.linkedin.com/in/coderplay
> My blog: http://coderplay.javaeye.com
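For anyone stuck on an unpatched build: wd's observation above (a single insert does produce data) suggests a fallback of one insert per directory, sketched here with the table and predicates from his original mail. Each statement scans test_tbl again, so this is slower than a working multi insert:

  from test_tbl
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/0' select * where id%10=0;

  from test_tbl
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/1' select * where id%10=1;

  from test_tbl
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/2' select * where id%10=2;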
