JIRA_223.3.txt_UNIT_TEST_SUCCEEDED

2009-02-18 Thread Murli Varadachari

SUCCESS: BUILD AND UNIT TEST using PATCH 223.3.txt PASSED!!



[jira] Commented: (HIVE-223) when using map-side aggregates - perform single map-reduce group-by

2009-02-18 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674892#action_12674892
 ] 

Namit Jain commented on HIVE-223:
-

tested one big job for correctness

> when using map-side aggregates - perform single map-reduce group-by
> ---
>
> Key: HIVE-223
> URL: https://issues.apache.org/jira/browse/HIVE-223
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Joydeep Sen Sarma
>Assignee: Namit Jain
> Attachments: 223.2.txt, 223.3.txt, 223.patch1.txt
>
>
> Today, even when we do map-side aggregates, we do multiple map-reduce jobs. 
> However, the reason for doing multiple map-reduce group-bys (for single 
> group-bys) was the fear of skew. When we are doing map-side aggregates, 
> skew should not exist for the most part. There can be two reasons for skew:
> - a large number of entries for a single grouping set - map-side aggregates 
> should take care of this
> - badness in the hash function that sends too much data to one reducer - we 
> should be able to take care of this by having good hash functions (and prime 
> reducer counts)
> So I think we should be able to do a single-stage map-reduce job when doing 
> map-side aggregates.
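The single-stage plan being proposed can be sketched outside Hive (illustrative Python, not the Hive implementation): each mapper pre-aggregates its rows in a hash table, so a hot group contributes at most one record per mapper to the shuffle, and one reduce phase merges the partial results.

```python
from collections import defaultdict

def mapper(rows):
    """Map-side partial aggregation: one hash entry per group seen by this mapper."""
    partial = defaultdict(int)
    for key, value in rows:
        partial[key] += value          # e.g. a SUM aggregate
    return list(partial.items())       # at most one record per distinct key

def shuffle(partials, num_reducers):
    """Assign each partial result to a reducer by hashing the group key."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in partials:
        buckets[hash(key) % num_reducers].append((key, value))
    return buckets

def reducer(bucket):
    """Merge partial aggregates into final per-group results."""
    final = defaultdict(int)
    for key, value in bucket:
        final[key] += value
    return dict(final)

# Two mappers, one map-reduce round: because each mapper emits at most one
# record per group, a hot group no longer floods a single reducer.
m1 = mapper([("a", 1), ("a", 1), ("b", 2)])
m2 = mapper([("a", 3), ("c", 4)])
buckets = shuffle(m1 + m2, num_reducers=3)
result = {}
for b in buckets:
    result.update(reducer(b))
```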

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-291) [Hive] map-side aggregation should be automatically disabled at run-time if it is not turning out to be useful

2009-02-18 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674893#action_12674893
 ] 

Namit Jain commented on HIVE-291:
-

tested one big job for correctness

> [Hive] map-side aggregation should be automatically disabled at run-time if 
> it is not turning out to be useful
> --
>
> Key: HIVE-291
> URL: https://issues.apache.org/jira/browse/HIVE-291
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Namit Jain
> Attachments: 291.1.txt
>
>
> Map-side aggregation should be automatically disabled at run-time if it is 
> not turning out to be useful.
> If map-side aggregation is not reducing the number of output rows, it is a 
> drain on the mapper, since it is consuming memory and performing unnecessary 
> hash lookups.
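One way to implement this check can be sketched as follows (illustrative Python, not the actual patch; the `CHECK_INTERVAL` and `MIN_REDUCTION` names and values are invented for the example): after a trial number of input rows, compare the hash-table size to the rows consumed, and fall back to pass-through if the reduction is poor.

```python
from collections import defaultdict

CHECK_INTERVAL = 1000   # rows to consume before checking (illustrative value)
MIN_REDUCTION = 0.5     # require at least 2x fewer groups than input rows

def map_side_aggregate(rows):
    """Aggregate in a hash table, but disable itself if it isn't shrinking the data."""
    table = defaultdict(int)
    enabled = True
    seen = 0
    for key, value in rows:
        if enabled:
            table[key] += value
            seen += 1
            if seen == CHECK_INTERVAL and len(table) > MIN_REDUCTION * seen:
                # Nearly every row is a new group: hashing is pure overhead.
                # Flush what we have and emit the rest unaggregated.
                yield from table.items()
                table.clear()
                enabled = False
        else:
            yield (key, value)   # pass-through; the reducer will aggregate
    yield from table.items()

# Distinct keys only: aggregation disables itself after the trial interval.
out = list(map_side_aggregate(((i, 1) for i in range(2000))))
# One hot key: aggregation stays on and emits a single record.
hot = list(map_side_aggregate((("k", 1) for _ in range(2000))))
```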




[jira] Updated: (HIVE-223) when using map-side aggregates - perform single map-reduce group-by

2009-02-18 Thread Namit Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namit Jain updated HIVE-223:


Status: Open  (was: Patch Available)





[jira] Updated: (HIVE-223) when using map-side aggregates - perform single map-reduce group-by

2009-02-18 Thread Namit Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namit Jain updated HIVE-223:


Attachment: 223.3.txt





[jira] Updated: (HIVE-223) when using map-side aggregates - perform single map-reduce group-by

2009-02-18 Thread Namit Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namit Jain updated HIVE-223:


Status: Patch Available  (was: Open)





[jira] Updated: (HIVE-223) when using map-side aggregates - perform single map-reduce group-by

2009-02-18 Thread Namit Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namit Jain updated HIVE-223:


Status: Patch Available  (was: Open)

Fixed a small bug.





[jira] Updated: (HIVE-223) when using map-side aggregates - perform single map-reduce group-by

2009-02-18 Thread Namit Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namit Jain updated HIVE-223:


Status: Open  (was: Patch Available)





[jira] Resolved: (HIVE-276) input3_limit.q fails under 0.17

2009-02-18 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao resolved HIVE-276.
-

   Resolution: Fixed
Fix Version/s: 0.3.0
   0.2.0
 Release Note: HIVE-276. Fix input3_limit.q for hadoop 0.17. (zshao)
 Hadoop Flags: [Reviewed]

trunk: Committed revision 745721.
branch 0.2: Committed revision 745723.


> input3_limit.q fails under 0.17
> ---
>
> Key: HIVE-276
> URL: https://issues.apache.org/jira/browse/HIVE-276
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Fix For: 0.2.0, 0.3.0
>
> Attachments: HIVE-276.1.patch, HIVE-276.2.patch
>
>
> The plan ql/src/test/results/clientpositive/input3_limit.q.out shows that 
> there are two map-reduce jobs:
> The first one is distributed and sorted as specified by the query; the 
> reducer side applies LIMIT 20.
> The second one (a single-reducer job imposed by LIMIT 20) does not impose the 
> same sort order, so the final result is non-deterministic.
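The nondeterminism Zheng describes can be reproduced in miniature (illustrative Python, not Hive code): if the single-reducer LIMIT job consumes its sorted input files in arbitrary order instead of re-imposing the sort, the rows it keeps depend on that order.

```python
# Reducer outputs of the first (sorted, distributed) job: each file is
# internally sorted, but the LIMIT job may read the files in any order.
part_0 = [1, 4, 7]
part_1 = [2, 5, 8]

def limit_job(parts, n, resort):
    rows = [row for part in parts for row in part]
    if resort:
        rows.sort()        # re-imposing the sort order makes LIMIT deterministic
    return rows[:n]

# Without re-sorting, the answer depends on which part file is read first.
a = limit_job([part_0, part_1], 4, resort=False)   # [1, 4, 7, 2]
b = limit_job([part_1, part_0], 4, resort=False)   # [2, 5, 8, 1]
# With re-sorting, both read orders give the same result.
c = limit_job([part_0, part_1], 4, resort=True)
d = limit_job([part_1, part_0], 4, resort=True)
```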




[jira] Updated: (HIVE-279) Implement predicate push down for hive queries

2009-02-18 Thread Prasad Chakka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasad Chakka updated HIVE-279:
---

Attachment: hive-279.patch

This is a drop for initial review, since I suspect there will be a lot of comments 
:). It should work for all cases except multi-insert queries.

I have not enabled this by default, but added a new config param called 
hive.optimize.ppd to enable the feature. 

I have not modified existing test cases, but added a couple of new ones. I will 
add more when uploading the final patch.


> Implement predicate push down for hive queries
> --
>
> Key: HIVE-279
> URL: https://issues.apache.org/jira/browse/HIVE-279
> Project: Hadoop Hive
>  Issue Type: New Feature
>Affects Versions: 0.2.0
>Reporter: Prasad Chakka
>Assignee: Prasad Chakka
> Attachments: hive-279.patch
>
>
> Push predicates that are expressed in outer queries into inner queries where 
> possible, so that rows get filtered out sooner.
> e.g.
> select a.*, b.* from a join b on (a.uid = b.uid) where a.age = 20 and 
> a.gender = 'm'
> The current compiler generates the filter predicate in the reducer, after the 
> join, so all rows have to be passed from mapper to reducer. By pushing the 
> filter predicate into the mapper, query performance should improve.
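The optimization can be illustrated with plain Python (a sketch, not Hive's planner; the table contents are invented): evaluate the filter on table a's rows before the join, so fewer rows cross the map-to-reduce shuffle.

```python
a_rows = [
    {"uid": 1, "age": 20, "gender": "m"},
    {"uid": 2, "age": 31, "gender": "f"},
    {"uid": 3, "age": 20, "gender": "m"},
]
b_rows = [{"uid": 1, "city": "sf"}, {"uid": 2, "city": "ny"}, {"uid": 3, "city": "la"}]

def pred(row):
    # the WHERE clause from the example query
    return row["age"] == 20 and row["gender"] == "m"

def join(left, right):
    """Inner join on uid, merging the matched rows."""
    by_uid = {r["uid"]: r for r in right}
    return [{**l, **by_uid[l["uid"]]} for l in left if l["uid"] in by_uid]

# Without pushdown: join everything, then filter (3 rows of a cross the shuffle).
after = [r for r in join(a_rows, b_rows) if pred(r)]
# With pushdown: filter first, then join (only 2 rows of a cross the shuffle).
before = join([r for r in a_rows if pred(r)], b_rows)

assert after == before   # same answer, less data moved
```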




*UNIT TEST FAILURE for apache HIVE* Hadoop.Version=0.17.1 based on SVN Rev# 745710.54

2009-02-18 Thread Murli Varadachari
[junit] Test org.apache.hadoop.hive.cli.TestCliDriver FAILED
BUILD FAILED


[jira] Commented: (HIVE-276) input3_limit.q fails under 0.17

2009-02-18 Thread Raghotham Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674870#action_12674870
 ] 

Raghotham Murthy commented on HIVE-276:
---

+1

looks good.





[jira] Updated: (HIVE-276) input3_limit.q fails under 0.17

2009-02-18 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated HIVE-276:


Attachment: HIVE-276.2.patch

Incorporated Ashish's comments.






[jira] Updated: (HIVE-131) insert overwrite directory leaves behind uncommitted/tmp files from failed tasks

2009-02-18 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated HIVE-131:


   Resolution: Fixed
Fix Version/s: 0.3.0
   0.2.0
 Release Note: HIVE-131. Remove uncommitted files from failed tasks. 
(Joydeep Sen Sarma via zshao)
 Hadoop Flags: [Reviewed]
   Status: Resolved  (was: Patch Available)

trunk: Committed revision 745709.
branch-0.2: Committed revision 745710.



> insert overwrite directory leaves behind uncommitted/tmp files from failed 
> tasks
> 
>
> Key: HIVE-131
> URL: https://issues.apache.org/jira/browse/HIVE-131
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Joydeep Sen Sarma
>Assignee: Joydeep Sen Sarma
>Priority: Critical
> Fix For: 0.2.0, 0.3.0
>
> Attachments: HIVE-131.patch.1, hive-131.patch.2
>
>
> _tmp files are getting left behind on insert overwrite directory:
> /user/jssarma/ctst1/40422_m_000195_0.deflate   13285 2008-12-07 01:47  
> rw-r--r-- jssarma supergroup
> /user/jssarma/ctst1/40422_m_000196_0.deflate   3055  2008-12-07 01:46  
> rw-r--r-- jssarma supergroup
> /user/jssarma/ctst1/_tmp.40422_m_33_0  0 2008-12-07 01:53  rw-r--r-- 
> jssarma supergroup
> /user/jssarma/ctst1/_tmp.40422_m_37_1  0 2008-12-07 01:53  rw-r--r-- 
> jssarma supergroup
> This happened with speculative execution. The code looks good (in fact, in 
> this case many speculative tasks were launched, and only a couple caused 
> problems). It almost seems like these files did not appear in the namespace 
> until after the map-reduce job finished and the MoveTask did a listing of the 
> output dir.
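The cleanup that this fix performs can be sketched generically (illustrative Python, not the actual patch; `commit_output_dir` is a made-up name): after the job, list the output directory and remove anything still carrying the temporary prefix before declaring the output committed.

```python
import os
import tempfile

TMP_PREFIX = "_tmp."   # prefix used for in-progress task output in this sketch

def commit_output_dir(path):
    """Remove leftover uncommitted files (e.g. from failed or speculative tasks)."""
    removed = []
    for name in os.listdir(path):
        if name.startswith(TMP_PREFIX):
            os.remove(os.path.join(path, name))
            removed.append(name)
    return removed

# Simulate an output dir with two committed files and one stray _tmp file,
# mirroring the listing in the issue description.
d = tempfile.mkdtemp()
for name in ["40422_m_000195_0.deflate", "40422_m_000196_0.deflate", "_tmp.40422_m_33_0"]:
    open(os.path.join(d, name), "w").close()
removed = commit_output_dir(d)
```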




[jira] Created: (HIVE-294) Support MAP(a.*), REDUCE(a.*) and TRANSFORM(a.*)

2009-02-18 Thread Zheng Shao (JIRA)
Support MAP(a.*), REDUCE(a.*) and TRANSFORM(a.*)


 Key: HIVE-294
 URL: https://issues.apache.org/jira/browse/HIVE-294
 Project: Hadoop Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.2.0, 0.3.0
Reporter: Zheng Shao


The Hive language does not currently accept MAP(a.*), REDUCE(a.*), or 
TRANSFORM(a.*). We should support them.





Re: Need help on Hive.g and parser!

2009-02-18 Thread Shyam Sarkar
Thank you. I went through ANTLR. Just curious -- was there any comparison done 
between JavaCC and ANTLR? How does the quality of the code generated by ANTLR 
compare to JavaCC's? This could be an issue if, in the future, we want to embed 
XML or JavaScript inside Hive QL (not very important at this point). 
Advanced SQL syntax embeds XML and Java scripts.

Thanks,
Shyam


--- On Tue, 2/17/09, Zheng Shao  wrote:

> From: Zheng Shao 
> Subject: Re: Need help on Hive.g and parser!
> To: hive-dev@hadoop.apache.org, shyam_sar...@yahoo.com
> Date: Tuesday, February 17, 2009, 10:01 PM
> We are using antlr.
> 
> Basically, the rule checks the timestamp of HiveParser.java. If it's newer
> than Hive.g, then we don't need to regenerate HiveParser.java from Hive.g
> again.
> 
> Zheng
> 
> On Tue, Feb 17, 2009 at 12:15 PM, Shyam Sarkar
> wrote:
> 
> > Hello,
> >
> > Someone please explain the following build.xml spec for the grammar build
> > (required and not required):
> >
> > ===
> >  <uptodate property="grammarBuild.notRequired">
> >    <srcfiles dir="${src.dir}/org/apache/hadoop/hive/ql/parse"
> >              includes="**/*.g"/>
> >    <mapper type="merge"
> >            to="${build.dir.hive}/ql/gen-java/org/apache/hadoop/hive/ql/parse/HiveParser.java"/>
> >  </uptodate>
> >
> >  <target name="build-grammar" unless="grammarBuild.notRequired">
> >    <echo>Building Grammar ${src.dir}/org/apache/hadoop/hive/ql/parse/Hive.g</echo>
> >    <java classname="org.antlr.Tool" classpathref="classpath" fork="true">
> >      <arg value="-fo"/>
> >      <arg value="${build.dir.hive}/ql/gen-java/org/apache/hadoop/hive/ql/parse"/>
> >      <arg value="${src.dir}/org/apache/hadoop/hive/ql/parse/Hive.g"/>
> >    </java>
> >  </target>
> > =
> >
> > Also, can someone tell me which parser generator is used? I used JavaCC
> > in the past.
> >
> > Thanks,
> > shyam_sar...@yahoo.com
> >
> >
> >
> >
> >
> 
> 
> -- 
> Yours,
> Zheng
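The up-to-date rule Zheng describes amounts to a timestamp comparison, sketched here in Python (illustrative; Ant's uptodate task does this natively, and `grammar_build_required` is a made-up name):

```python
import os
import tempfile
import time

def grammar_build_required(grammar, generated):
    """True if the generated parser is missing or older than the grammar source."""
    if not os.path.exists(generated):
        return True
    return os.path.getmtime(grammar) > os.path.getmtime(generated)

# Demo: a fresh grammar with no generated parser requires a build; once the
# parser file is newer than the grammar, the build target can be skipped
# (which is what uptodate + unless="grammarBuild.notRequired" encodes).
d = tempfile.mkdtemp()
g = os.path.join(d, "Hive.g")
p = os.path.join(d, "HiveParser.java")
open(g, "w").close()
needs_first = grammar_build_required(g, p)            # no parser generated yet
open(p, "w").close()
os.utime(p, (time.time() + 100, time.time() + 100))   # parser newer than grammar
needs_second = grammar_build_required(g, p)
```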


  


Re: You are voted to be a Hive committer

2009-02-18 Thread Johan Oskarsson
Thanks guys, I'll keep the project alive even if California slides into the 
Pacific :)


/Johan

Jeff Hammerbacher wrote:

Congrats Johan!

On Wed, Feb 18, 2009 at 10:55 AM, Joydeep Sen Sarma <jssa...@facebook.com> wrote:


Congrats!

I guess this means Hive can now reliably survive a massive
earthquake in SF Bay Area.

-Original Message-
From: Dhruba Borthakur [mailto:dhr...@gmail.com]
Sent: Tuesday, February 17, 2009 10:45 PM
To: Johan Oskarsson
Cc: hive-dev@hadoop.apache.org 
Subject: You are voted to be a Hive committer

 Hi Johan,

The Hadoop PMC has voted to make you a committer for the Hive
subproject.
Please complete and sign the ICLA at
http://www.apache.org/licenses/icla.txt and fax it to the number
specified in the form. Once the form is processed,
you would be granted an apache account.

thanks,
dhruba






[jira] Commented: (HIVE-131) insert overwrite directory leaves behind uncommitted/tmp files from failed tasks

2009-02-18 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674776#action_12674776
 ] 

Joydeep Sen Sarma commented on HIVE-131:


Please commit this to 0.2 also, since it's a pretty severe bug.





[jira] Commented: (HIVE-74) Hive can use CombineFileInputFormat for when the input are many small files

2009-02-18 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-74?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674769#action_12674769
 ] 

Joydeep Sen Sarma commented on HIVE-74:
---

Where are the pools for the CombineFileInputFormat created (one per table)?

> Hive can use CombineFileInputFormat for when the input are many small files
> ---
>
> Key: HIVE-74
> URL: https://issues.apache.org/jira/browse/HIVE-74
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: dhruba borthakur
>Assignee: dhruba borthakur
> Fix For: 0.2.0
>
> Attachments: hiveCombineSplit.patch, hiveCombineSplit.patch
>
>
> There are cases when the input to a Hive job is thousands of small files. In 
> this case, there is a mapper for each file. Most of the overhead of spawning 
> all these mappers can be avoided if Hive uses the CombineFileInputFormat 
> introduced via HADOOP-4565.
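The saving can be sketched as a simple split-packing calculation (illustrative Python; the real CombineFileInputFormat also considers node and rack locality when grouping files):

```python
def combine_splits(file_sizes, max_split_bytes):
    """Greedily pack small files into combined splits of at most max_split_bytes."""
    splits, current, current_size = [], [], 0
    for name, size in file_sizes:
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits

# 1000 files of 1 MB each: one mapper per file without combining,
# but only 16 mappers with a 64 MB combined split size.
files = [(f"part-{i:05d}", 1 << 20) for i in range(1000)]
splits = combine_splits(files, 64 << 20)
```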




Re: You are voted to be a Hive committer

2009-02-18 Thread Dhruba Borthakur
Hi Edward,

You are absolutely right! Sorry for the confusion.

dhruba

On Wed, Feb 18, 2009 at 11:36 AM, Edward Capriolo wrote:

> Congrats Johan.
>
> The subject of the email always fools me. I see an email titled "You
> are voted to be a Hive committer" and I feel like I have won an
> Academy Award. Then I open the email to find someone else is getting
> one. Great sorrow. JK
>


Re: You are voted to be a Hive committer

2009-02-18 Thread Edward Capriolo
Congrats Johan.

The subject of the email always fools me. I see an email titled "You
are voted to be a Hive committer" and I feel like I have won an
Academy Award. Then I open the email to find someone else is getting
one. Great sorrow. JK


[jira] Commented: (HIVE-74) Hive can use CombineFileInputFormat for when the input are many small files

2009-02-18 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-74?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674742#action_12674742
 ] 

Joydeep Sen Sarma commented on HIVE-74:
---

Is it possible to do this in a way that Hive continues to compile against 
0.17/18/19? I think this is almost a hard requirement.

One possibility is to have a new version of HiveInputSplit that only compiles 
against 0.20, and include it conditionally in the code only for 0.20 and 
onwards. (For example, in HiveInputFormat.java there's a conditional tag 
(//[exclude_0_19]) that does some conditional code inclusion.) I am not sure 
how this was implemented.

But even this is less than ideal. How will we deploy this with 17 (with 
combinefilesplit and related patches), unless we are not using the open source 
version directly?





Re: You are voted to be a Hive committer

2009-02-18 Thread Jeff Hammerbacher
Congrats Johan!

On Wed, Feb 18, 2009 at 10:55 AM, Joydeep Sen Sarma wrote:

> Congrats!
>
> I guess this means Hive can now reliably survive a massive earthquake in SF
> Bay Area.
>
> -Original Message-
> From: Dhruba Borthakur [mailto:dhr...@gmail.com]
> Sent: Tuesday, February 17, 2009 10:45 PM
> To: Johan Oskarsson
> Cc: hive-dev@hadoop.apache.org
> Subject: You are voted to be a Hive committer
>
>  Hi Johan,
>
> The Hadoop PMC has voted to make you a committer for the Hive subproject.
> Please complete and sign the ICLA at
> http://www.apache.org/licenses/icla.txt and fax it to the number
> specified in the form. Once the form is processed,
> you would be granted an apache account.
>
> thanks,
> dhruba
>


RE: You are voted to be a Hive committer

2009-02-18 Thread Joydeep Sen Sarma
Congrats!

I guess this means Hive can now reliably survive a massive earthquake in SF Bay 
Area.

-Original Message-
From: Dhruba Borthakur [mailto:dhr...@gmail.com] 
Sent: Tuesday, February 17, 2009 10:45 PM
To: Johan Oskarsson
Cc: hive-dev@hadoop.apache.org
Subject: You are voted to be a Hive committer

 Hi Johan,

The Hadoop PMC has voted to make you a committer for the Hive subproject.
Please complete and sign the ICLA at
http://www.apache.org/licenses/icla.txt and fax it to the number
specified in the form. Once the form is processed,
you would be granted an apache account.

thanks,
dhruba


Build failed in Hudson: Hive-trunk-h0.19 #8

2009-02-18 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.19/8/changes

Changes:

[zshao] HIVE-270. Add a lazy-deserialized SerDe for efficient deserialization 
of rows with primitive types. (zshao)

--
[...truncated 18767 lines...]
[junit] diff 
http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.19/ws/hive/build/ql/test/logs/negative/unknown_column2.q.out
  
http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.19/ws/hive/ql/src/test/results/compiler/errors/unknown_column2.q.out
 
[junit] Done query: unknown_column2.q
[junit] Hive history 
file=http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.19/ws/hive/ql/../build/ql/tmp/hive_job_log_hudson_200902180724_474849503.txt
 
[junit] Begin query: unknown_column3.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table src
[junit] OK
[junit] diff 
http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.19/ws/hive/build/ql/test/logs/negative/unknown_column3.q.out
  
http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.19/ws/hive/ql/src/test/results/compiler/errors/unknown_column3.q.out
 
[junit] Done query: unknown_column3.q
[junit] Hive history 
file=http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.19/ws/hive/ql/../build/ql/tmp/hive_job_log_hudson_200902180724_1948005189.txt
 
[junit] Begin query: unknown_column4.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table src
[junit] OK
[junit] diff 
http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.19/ws/hive/build/ql/test/logs/negative/unknown_column4.q.out
  
http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.19/ws/hive/ql/src/test/results/compiler/errors/unknown_column4.q.out
 
[junit] Done query: unknown_column4.q
[junit] Hive history 
file=http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.19/ws/hive/ql/../build/ql/tmp/hive_job_log_hudson_200902180724_1836088980.txt
 
[junit] Begin query: unknown_column5.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table src
[junit] OK
[junit] diff 
http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.19/ws/hive/build/ql/test/logs/negative/unknown_column5.q.out
  
http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.19/ws/hive/ql/src/test/results/compiler/errors/unknown_column5.q.out
 
[junit] Done query: unknown_column5.q
[junit] Hive history 
file=http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.19/ws/hive/ql/../build/ql/tmp/hive_job_log_hudson_200902180724_1261990314.txt
 
[junit] Begin query: unknown_column6.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table src
[junit] OK
[junit] diff 
http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.19/ws/hive/build/ql/test/logs/negative/unknown_column6.q.out
  
http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.19/ws/hive/ql/src/test/results/compiler/errors/unknown_column6.q.out
 
[junit] Done query: unknown_column6.q
[junit] Hive history file=http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.19/ws/hive/ql/../build/ql/tmp/hive_job_log_hudson_200902180724_782453616.txt
 
[junit] Begin query: unknown_function1.q
[junit] Loa

Build failed in Hudson: Hive-trunk-h0.18 #9

2009-02-18 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.18/9/changes

Changes:

[zshao] HIVE-270. Add a lazy-deserialized SerDe for efficient deserialization 
of rows with primitive types. (zshao)

--
[...truncated 19148 lines...]
[junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.18/ws/hive/build/ql/test/logs/negative/unknown_column2.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.18/ws/hive/ql/src/test/results/compiler/errors/unknown_column2.q.out
 
[junit] Done query: unknown_column2.q
[junit] Hive history file=http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.18/ws/hive/ql/../build/ql/tmp/hive_job_log_hudson_200902180624_-1207805739.txt
 
[junit] Begin query: unknown_column3.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table src
[junit] OK
[junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.18/ws/hive/build/ql/test/logs/negative/unknown_column3.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.18/ws/hive/ql/src/test/results/compiler/errors/unknown_column3.q.out
 
[junit] Done query: unknown_column3.q
[junit] Hive history file=http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.18/ws/hive/ql/../build/ql/tmp/hive_job_log_hudson_200902180624_-355194879.txt
 
[junit] Begin query: unknown_column4.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table src
[junit] OK
[junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.18/ws/hive/build/ql/test/logs/negative/unknown_column4.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.18/ws/hive/ql/src/test/results/compiler/errors/unknown_column4.q.out
 
[junit] Done query: unknown_column4.q
[junit] Hive history file=http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.18/ws/hive/ql/../build/ql/tmp/hive_job_log_hudson_200902180624_238035720.txt
 
[junit] Begin query: unknown_column5.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table src
[junit] OK
[junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.18/ws/hive/build/ql/test/logs/negative/unknown_column5.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.18/ws/hive/ql/src/test/results/compiler/errors/unknown_column5.q.out
 
[junit] Done query: unknown_column5.q
[junit] Hive history file=http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.18/ws/hive/ql/../build/ql/tmp/hive_job_log_hudson_200902180624_1911369666.txt
 
[junit] Begin query: unknown_column6.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table src
[junit] OK
[junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.18/ws/hive/build/ql/test/logs/negative/unknown_column6.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.18/ws/hive/ql/src/test/results/compiler/errors/unknown_column6.q.out
 
[junit] Done query: unknown_column6.q
[junit] Hive history file=http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.18/ws/hive/ql/../build/ql/tmp/hive_job_log_hudson_200902180624_-295681886.txt
 
[junit] Begin query: unknown_function1.q
[junit] L

Build failed in Hudson: Hive-trunk-h0.17 #8

2009-02-18 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/8/changes

Changes:

[zshao] HIVE-270. Add a lazy-deserialized SerDe for efficient deserialization 
of rows with primitive types. (zshao)

--
[...truncated 16709 lines...]
[junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/build/ql/test/logs/negative/unknown_column2.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/ql/src/test/results/compiler/errors/unknown_column2.q.out
 
[junit] Done query: unknown_column2.q
[junit] Hive history file=http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/ql/../build/ql/tmp/hive_job_log_hudson_200902180523_68094412.txt
 
[junit] Begin query: unknown_column3.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table src
[junit] OK
[junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/build/ql/test/logs/negative/unknown_column3.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/ql/src/test/results/compiler/errors/unknown_column3.q.out
 
[junit] Done query: unknown_column3.q
[junit] Hive history file=http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/ql/../build/ql/tmp/hive_job_log_hudson_200902180523_576389447.txt
 
[junit] Begin query: unknown_column4.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table src
[junit] OK
[junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/build/ql/test/logs/negative/unknown_column4.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/ql/src/test/results/compiler/errors/unknown_column4.q.out
 
[junit] Done query: unknown_column4.q
[junit] Hive history file=http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/ql/../build/ql/tmp/hive_job_log_hudson_200902180523_-1227788210.txt
 
[junit] Begin query: unknown_column5.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table src
[junit] OK
[junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/build/ql/test/logs/negative/unknown_column5.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/ql/src/test/results/compiler/errors/unknown_column5.q.out
 
[junit] Done query: unknown_column5.q
[junit] Hive history file=http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/ql/../build/ql/tmp/hive_job_log_hudson_200902180523_538024551.txt
 
[junit] Begin query: unknown_column6.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] OK
[junit] Loading data to table src
[junit] OK
[junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/build/ql/test/logs/negative/unknown_column6.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/ql/src/test/results/compiler/errors/unknown_column6.q.out
 
[junit] Done query: unknown_column6.q
[junit] Hive history file=http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/ql/../build/ql/tmp/hive_job_log_hudson_200902180523_-1608223043.txt
 
[junit] Begin query: unknown_function1.q
[junit] Loa