[jira] [Commented] (HIVE-7633) Warehouse#getTablePath() doesn't handle external tables

2015-01-14 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277568#comment-14277568
 ] 

Yin Huai commented on HIVE-7633:


The changes in HIVE-1537 allow users to specify the location of a table, but 
they did not update Warehouse to correctly return the location of the table. 

 Warehouse#getTablePath() doesn't handle external tables
 ---

 Key: HIVE-7633
 URL: https://issues.apache.org/jira/browse/HIVE-7633
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.8.1, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0, 0.14.0, 
 0.13.1
Reporter: Joey Echeverria
Priority: Critical

 Warehouse#getTablePath() takes a DB and a table name. This means it will 
 generate the wrong path for external tables. This can cause a problem if you 
 have an external table on the local file system and HDFS is not currently 
 running when trying to gather statistics.
 getTablePath() should take in the table and see if it's external and has a 
 location before just assuming it's a managed table.
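The suggested fix can be sketched as follows. This is a hypothetical, simplified illustration, not Hive's actual metastore code: the `Table` and `Warehouse` classes below are stubs, and the point is only that the lookup should start from the table's own metadata rather than deriving a path from the DB and table names.

{code:java}
import java.net.URI;

// Stub of a table's metadata (not Hive's real Table class).
class Table {
    String dbName;
    String tableName;
    String location;   // explicit location (e.g. for external tables); may be null
}

class Warehouse {
    private final String warehouseDir;

    Warehouse(String warehouseDir) {
        this.warehouseDir = warehouseDir;
    }

    // Old behavior: always assume the managed-table warehouse layout.
    URI getTablePath(String dbName, String tableName) {
        return URI.create(warehouseDir + "/" + dbName + ".db/" + tableName);
    }

    // Suggested behavior: consult the table itself before assuming it is managed.
    URI getTablePath(Table t) {
        if (t.location != null) {
            return URI.create(t.location);   // external (or custom-located) table
        }
        return getTablePath(t.dbName, t.tableName);
    }
}
{code}

The real patch may differ in detail; the key change is that an external table's declared location wins over the derived warehouse path.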



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-7633) Warehouse#getTablePath() doesn't handle external tables

2015-01-14 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-7633:
---
Component/s: Metastore

 Warehouse#getTablePath() doesn't handle external tables
 ---

 Key: HIVE-7633
 URL: https://issues.apache.org/jira/browse/HIVE-7633
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Affects Versions: 0.8.1, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0, 0.14.0, 
 0.13.1
Reporter: Joey Echeverria
Priority: Critical






[jira] [Updated] (HIVE-7633) Warehouse#getTablePath() doesn't handle external tables

2015-01-14 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-7633:
---
Affects Version/s: 0.8.1
                   0.9.0
                   0.10.0
                   0.11.0
                   0.12.0
                   0.14.0
                   0.13.1

 Warehouse#getTablePath() doesn't handle external tables
 ---

 Key: HIVE-7633
 URL: https://issues.apache.org/jira/browse/HIVE-7633
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.8.1, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0, 0.14.0, 
 0.13.1
Reporter: Joey Echeverria
Priority: Critical






[jira] [Updated] (HIVE-7633) Warehouse#getTablePath() doesn't handle external tables

2015-01-14 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-7633:
---
Priority: Critical  (was: Major)

 Warehouse#getTablePath() doesn't handle external tables
 ---

 Key: HIVE-7633
 URL: https://issues.apache.org/jira/browse/HIVE-7633
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.13.0
Reporter: Joey Echeverria
Priority: Critical






[jira] [Commented] (HIVE-6137) Hive should report that the file/path doesn’t exist when it doesn’t

2015-01-14 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277228#comment-14277228
 ] 

Yin Huai commented on HIVE-6137:


[~hsubramaniyan] What are the affected version(s) of this bug?

 Hive should report that the file/path doesn’t exist when it doesn’t
 ---

 Key: HIVE-6137
 URL: https://issues.apache.org/jira/browse/HIVE-6137
 Project: Hive
  Issue Type: Bug
Reporter: Hari Sankar Sivarama Subramaniyan
Assignee: Hari Sankar Sivarama Subramaniyan
 Attachments: HIVE-6137.1.patch, HIVE-6137.2.patch, HIVE-6137.3.patch, 
 HIVE-6137.4.patch, HIVE-6137.5.patch, HIVE-6137.6.patch


 Hive should report that the file/path doesn’t exist when it doesn’t (it now 
 reports SocketTimeoutException):
 Execute a Hive DDL query with a reference to a non-existent blob (such as 
 CREATE EXTERNAL TABLE...) and check Hive logs (stderr):
 FAILED: Execution Error, return code 1 from 
 org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: 
 java.io.IOException)
 This error message is not detailed enough. If a file doesn't exist, Hive 
 should report that it received an error while trying to locate the file.





[jira] [Commented] (HIVE-7205) Wrong results when union all of grouping followed by group by with correlation optimization

2014-10-16 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173804#comment-14173804
 ] 

Yin Huai commented on HIVE-7205:


[~navis] Can you update the review board? I will take a look. Thank you.

 Wrong results when union all of grouping followed by group by with 
 correlation optimization
 ---

 Key: HIVE-7205
 URL: https://issues.apache.org/jira/browse/HIVE-7205
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.12.0, 0.13.0, 0.13.1
Reporter: dima machlin
Assignee: Navis
Priority: Critical
 Attachments: HIVE-7205.1.patch.txt, HIVE-7205.2.patch.txt, 
 HIVE-7205.3.patch.txt, HIVE-7205.4.patch.txt


 use case :
 table TBL (a string,b string) contains single row : 'a','a'
 the following query :
 {code:sql}
 select b, sum(cc) from (
 select b,count(1) as cc from TBL group by b
 union all
 select a as b,count(1) as cc from TBL group by a
 ) z
 group by b
 {code}
 returns
 a 1
 a 1
 when hive.optimize.correlation=true;
 if we set hive.optimize.correlation=false,
 it returns the correct result: a 2
 The plan with correlation optimization :
 {code:sql}
 ABSTRACT SYNTAX TREE:
   (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_UNION (TOK_QUERY (TOK_FROM 
 (TOK_TABREF (TOK_TABNAME DB TBL))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR 
 TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR 
 (TOK_FUNCTION count 1) cc)) (TOK_GROUPBY (TOK_TABLE_OR_COL b (TOK_QUERY 
 (TOK_FROM (TOK_TABREF (TOK_TABNAME DB TBL))) (TOK_INSERT (TOK_DESTINATION 
 (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL a) b) 
 (TOK_SELEXPR (TOK_FUNCTION count 1) cc)) (TOK_GROUPBY (TOK_TABLE_OR_COL 
 a) z)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT 
 (TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR (TOK_FUNCTION sum 
 (TOK_TABLE_OR_COL cc (TOK_GROUPBY (TOK_TABLE_OR_COL b
 STAGE DEPENDENCIES:
   Stage-1 is a root stage
   Stage-0 is a root stage
 STAGE PLANS:
   Stage: Stage-1
 Map Reduce
   Alias -> Map Operator Tree:
 null-subquery1:z-subquery1:TBL 
   TableScan
 alias: TBL
 Select Operator
   expressions:
 expr: b
 type: string
   outputColumnNames: b
   Group By Operator
 aggregations:
   expr: count(1)
 bucketGroup: false
 keys:
   expr: b
   type: string
 mode: hash
 outputColumnNames: _col0, _col1
 Reduce Output Operator
   key expressions:
 expr: _col0
 type: string
   sort order: +
   Map-reduce partition columns:
 expr: _col0
 type: string
   tag: 0
   value expressions:
 expr: _col1
 type: bigint
 null-subquery2:z-subquery2:TBL 
   TableScan
 alias: TBL
 Select Operator
   expressions:
 expr: a
 type: string
   outputColumnNames: a
   Group By Operator
 aggregations:
   expr: count(1)
 bucketGroup: false
 keys:
   expr: a
   type: string
 mode: hash
 outputColumnNames: _col0, _col1
 Reduce Output Operator
   key expressions:
 expr: _col0
 type: string
   sort order: +
   Map-reduce partition columns:
 expr: _col0
 type: string
   tag: 1
   value expressions:
 expr: _col1
 type: bigint
   Reduce Operator Tree:
 Demux Operator
   Group By Operator
 aggregations:
   expr: count(VALUE._col0)
 bucketGroup: false
 keys:
   expr: KEY._col0
   type: string
 mode: mergepartial
 outputColumnNames: _col0, _col1
 Select Operator
   expressions:
 expr: _col0
 type: string
 expr: _col1
 type: bigint
   outputColumnNames: _col0, _col1
   Union
 Select Operator
   expressions:
   

[jira] [Commented] (HIVE-7205) Wrong results when union all of grouping followed by group by with correlation optimization

2014-08-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093129#comment-14093129
 ] 

Yin Huai commented on HIVE-7205:


Yeah, fixing a correctness bug is very important. 

However, the current patch also introduces a significant refactoring of the 
query evaluation path. I am not sure that this refactoring will not break other 
things. [~navis] Can you post a summary of how those operators work with your 
refactoring?

 Wrong results when union all of grouping followed by group by with 
 correlation optimization
 ---

 Key: HIVE-7205
 URL: https://issues.apache.org/jira/browse/HIVE-7205
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.12.0, 0.13.0, 0.13.1
Reporter: dima machlin
Assignee: Navis
Priority: Critical
 Attachments: HIVE-7205.1.patch.txt, HIVE-7205.2.patch.txt, 
 HIVE-7205.3.patch.txt



[jira] [Commented] (HIVE-7205) Wrong results when union all of grouping followed by group by with correlation optimization

2014-08-03 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084158#comment-14084158
 ] 

Yin Huai commented on HIVE-7205:


My main concern is that because we use the rightmost table as the streamed 
table, if hive.join.emit.interval is small, we can generate wrong results if we 
do not have endGroupIfNecessary.

 Wrong results when union all of grouping followed by group by with 
 correlation optimization
 ---

 Key: HIVE-7205
 URL: https://issues.apache.org/jira/browse/HIVE-7205
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.12.0, 0.13.0, 0.13.1
Reporter: dima machlin
Assignee: Navis
Priority: Critical
 Attachments: HIVE-7205.1.patch.txt, HIVE-7205.2.patch.txt, 
 HIVE-7205.3.patch.txt



[jira] [Commented] (HIVE-7205) Wrong results when union all of grouping followed by group by with correlation optimization

2014-08-03 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084215#comment-14084215
 ] 

Yin Huai commented on HIVE-7205:


Oh, I see. In the current patch, isLastInput takes special care of 
MuxOperator, so we will not generate wrong results. However, with this version, 
if my understanding is correct, we have to buffer rows from all tables in the 
reduce-side join operator for cases like the last query in 
correlationoptimizer15.q (the rightmost table will not be streamable and we 
will have a higher memory footprint). I am not sure we want this behavior.

I think one thing we may want to investigate is the minimal change that would 
just fix the bug. I totally agree with improving the logic of 
startGroup()/endGroup()/flush(). I guess we need to have a clear plan first.

[~ashutoshc] [~navis] I may not be able to come up with a patch soon. When will 
our next release be?

 Wrong results when union all of grouping followed by group by with 
 correlation optimization
 ---

 Key: HIVE-7205
 URL: https://issues.apache.org/jira/browse/HIVE-7205
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.12.0, 0.13.0, 0.13.1
Reporter: dima machlin
Assignee: Navis
Priority: Critical
 Attachments: HIVE-7205.1.patch.txt, HIVE-7205.2.patch.txt, 
 HIVE-7205.3.patch.txt



[jira] [Commented] (HIVE-7205) Wrong results when union all of grouping followed by group by with correlation optimization

2014-08-01 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14082384#comment-14082384
 ] 

Yin Huai commented on HIVE-7205:


Not yet. I will try to find some time during the weekend.

 Wrong results when union all of grouping followed by group by with 
 correlation optimization
 ---

 Key: HIVE-7205
 URL: https://issues.apache.org/jira/browse/HIVE-7205
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.12.0, 0.13.0, 0.13.1
Reporter: dima machlin
Assignee: Navis
Priority: Critical
 Attachments: HIVE-7205.1.patch.txt, HIVE-7205.2.patch.txt, 
 HIVE-7205.3.patch.txt



[jira] [Commented] (HIVE-7205) Wrong results when union all of grouping followed by group by with correlation optimization

2014-07-14 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060383#comment-14060383
 ] 

Yin Huai commented on HIVE-7205:


[~navis] Thank you for the patch. I have left some comments on the review 
board. In general, I feel that the logic around startGroup and endGroup is not 
very clear (my original implementation is not very clear either...). Can you 
explain the logic, so I can better understand your change? Thanks.

 Wrong results when union all of grouping followed by group by with 
 correlation optimization
 ---

 Key: HIVE-7205
 URL: https://issues.apache.org/jira/browse/HIVE-7205
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.12.0, 0.13.0, 0.13.1
Reporter: dima machlin
Assignee: Navis
Priority: Critical
 Attachments: HIVE-7205.1.patch.txt, HIVE-7205.2.patch.txt, 
 HIVE-7205.3.patch.txt



[jira] [Commented] (HIVE-5130) Document Correlation Optimizer in Hive wiki

2014-07-13 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060292#comment-14060292
 ] 

Yin Huai commented on HIVE-5130:


Design doc in Hive wiki: 
https://cwiki.apache.org/confluence/display/Hive/Correlation+Optimizer


 Document Correlation Optimizer in Hive wiki
 ---

 Key: HIVE-5130
 URL: https://issues.apache.org/jira/browse/HIVE-5130
 Project: Hive
  Issue Type: Sub-task
  Components: Documentation
Reporter: Yin Huai
Assignee: Yin Huai





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-5130) Document Correlation Optimizer in Hive wiki

2014-07-13 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060355#comment-14060355
 ] 

Yin Huai commented on HIVE-5130:


Thanks, [~leftylev]. Let's put it in the Completed section. 

 Document Correlation Optimizer in Hive wiki
 ---

 Key: HIVE-5130
 URL: https://issues.apache.org/jira/browse/HIVE-5130
 Project: Hive
  Issue Type: Sub-task
  Components: Documentation
Reporter: Yin Huai
Assignee: Yin Huai







[jira] [Created] (HIVE-7362) Enabling Correlation Optimizer by default.

2014-07-07 Thread Yin Huai (JIRA)
Yin Huai created HIVE-7362:
--

 Summary: Enabling Correlation Optimizer by default.
 Key: HIVE-7362
 URL: https://issues.apache.org/jira/browse/HIVE-7362
 Project: Hive
  Issue Type: Task
  Components: Query Processor
Reporter: Yin Huai
Assignee: Yin Huai








[jira] [Commented] (HIVE-7205) Wrong results when union all of grouping followed by group by with correlation optimization

2014-07-07 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054303#comment-14054303
 ] 

Yin Huai commented on HIVE-7205:


Sure. I will take a look at it.

It seems the issue is that the MuxOperator for the last GroupByOperator cannot 
correctly determine when to call flush/endGroup/processGroup on the 
GroupByOperator, because the UnionOperator creates a merging point of two 
branches in the operator tree.
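
For reference, a per-session workaround consistent with the issue description is to disable the optimizer before running the affected query (table and column names as in the repro above):

{code:sql}
-- Workaround until the operator flush logic is fixed: disable the
-- correlation optimizer for this session.
set hive.optimize.correlation=false;

select b, sum(cc) from (
  select b, count(1) as cc from TBL group by b
  union all
  select a as b, count(1) as cc from TBL group by a
) z
group by b;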


 Wrong results when union all of grouping followed by group by with 
 correlation optimization
 ---

 Key: HIVE-7205
 URL: https://issues.apache.org/jira/browse/HIVE-7205
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.12.0, 0.13.0, 0.13.1
Reporter: dima machlin
Assignee: Navis
Priority: Critical
 Attachments: HIVE-7205.1.patch.txt, HIVE-7205.2.patch.txt, 
 HIVE-7205.3.patch.txt


 use case :
 table TBL (a string,b string) contains single row : 'a','a'
 the following query :
 {code:sql}
 select b, sum(cc) from (
 select b,count(1) as cc from TBL group by b
 union all
 select a as b,count(1) as cc from TBL group by a
 ) z
 group by b
 {code}
 returns 
 a 1
 a 1
 while set hive.optimize.correlation=true;
 if we change set hive.optimize.correlation=false;
 it returns correct results : a 2
 The plan with correlation optimization :
 {code:sql}
 ABSTRACT SYNTAX TREE:
   (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_UNION (TOK_QUERY (TOK_FROM 
 (TOK_TABREF (TOK_TABNAME DB TBL))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR 
 TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR 
 (TOK_FUNCTION count 1) cc)) (TOK_GROUPBY (TOK_TABLE_OR_COL b (TOK_QUERY 
 (TOK_FROM (TOK_TABREF (TOK_TABNAME DB TBL))) (TOK_INSERT (TOK_DESTINATION 
 (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL a) b) 
 (TOK_SELEXPR (TOK_FUNCTION count 1) cc)) (TOK_GROUPBY (TOK_TABLE_OR_COL 
 a) z)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT 
 (TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR (TOK_FUNCTION sum 
 (TOK_TABLE_OR_COL cc (TOK_GROUPBY (TOK_TABLE_OR_COL b
 STAGE DEPENDENCIES:
   Stage-1 is a root stage
   Stage-0 is a root stage
 STAGE PLANS:
   Stage: Stage-1
 Map Reduce
   Alias -> Map Operator Tree:
 null-subquery1:z-subquery1:TBL 
   TableScan
 alias: TBL
 Select Operator
   expressions:
 expr: b
 type: string
   outputColumnNames: b
   Group By Operator
 aggregations:
   expr: count(1)
 bucketGroup: false
 keys:
   expr: b
   type: string
 mode: hash
 outputColumnNames: _col0, _col1
 Reduce Output Operator
   key expressions:
 expr: _col0
 type: string
   sort order: +
   Map-reduce partition columns:
 expr: _col0
 type: string
   tag: 0
   value expressions:
 expr: _col1
 type: bigint
 null-subquery2:z-subquery2:TBL 
   TableScan
 alias: TBL
 Select Operator
   expressions:
 expr: a
 type: string
   outputColumnNames: a
   Group By Operator
 aggregations:
   expr: count(1)
 bucketGroup: false
 keys:
   expr: a
   type: string
 mode: hash
 outputColumnNames: _col0, _col1
 Reduce Output Operator
   key expressions:
 expr: _col0
 type: string
   sort order: +
   Map-reduce partition columns:
 expr: _col0
 type: string
   tag: 1
   value expressions:
 expr: _col1
 type: bigint
   Reduce Operator Tree:
 Demux Operator
   Group By Operator
 aggregations:
   expr: count(VALUE._col0)
 bucketGroup: false
 keys:
   expr: KEY._col0
   type: string
 mode: mergepartial
 outputColumnNames: _col0, _col1
 Select Operator
   expressions:
 expr: _col0
 type: string
 expr: 

[jira] [Commented] (HIVE-7205) Wrong results when union all of grouping followed by group by with correlation optimization

2014-07-07 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054454#comment-14054454
 ] 

Yin Huai commented on HIVE-7205:


[~navis] Simplifying interactions between operators is good. Let me spend 
some time to understand the patch. My recent schedule is quite tight. I hope I 
can get back to you late this week. Just want to double check: we will not have 
our next release for a while, right?

 Wrong results when union all of grouping followed by group by with 
 correlation optimization
 ---

 Key: HIVE-7205
 URL: https://issues.apache.org/jira/browse/HIVE-7205
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.12.0, 0.13.0, 0.13.1
Reporter: dima machlin
Assignee: Navis
Priority: Critical
 Attachments: HIVE-7205.1.patch.txt, HIVE-7205.2.patch.txt, 
 HIVE-7205.3.patch.txt


 use case:
 table TBL (a string, b string) contains a single row: 'a','a'
 the following query:
 {code:sql}
 select b, sum(cc) from (
 select b,count(1) as cc from TBL group by b
 union all
 select a as b,count(1) as cc from TBL group by a
 ) z
 group by b
 {code}
 returns
 a 1
 a 1
 when hive.optimize.correlation=true.
 If we set hive.optimize.correlation=false,
 it returns the correct result: a 2
 The plan with correlation optimization :
 {code:sql}
 ABSTRACT SYNTAX TREE:
   (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_UNION (TOK_QUERY (TOK_FROM 
 (TOK_TABREF (TOK_TABNAME DB TBL))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR 
 TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR 
 (TOK_FUNCTION count 1) cc)) (TOK_GROUPBY (TOK_TABLE_OR_COL b (TOK_QUERY 
 (TOK_FROM (TOK_TABREF (TOK_TABNAME DB TBL))) (TOK_INSERT (TOK_DESTINATION 
 (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL a) b) 
 (TOK_SELEXPR (TOK_FUNCTION count 1) cc)) (TOK_GROUPBY (TOK_TABLE_OR_COL 
 a) z)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT 
 (TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR (TOK_FUNCTION sum 
 (TOK_TABLE_OR_COL cc (TOK_GROUPBY (TOK_TABLE_OR_COL b
 STAGE DEPENDENCIES:
   Stage-1 is a root stage
   Stage-0 is a root stage
 STAGE PLANS:
   Stage: Stage-1
 Map Reduce
   Alias -> Map Operator Tree:
 null-subquery1:z-subquery1:TBL 
   TableScan
 alias: TBL
 Select Operator
   expressions:
 expr: b
 type: string
   outputColumnNames: b
   Group By Operator
 aggregations:
   expr: count(1)
 bucketGroup: false
 keys:
   expr: b
   type: string
 mode: hash
 outputColumnNames: _col0, _col1
 Reduce Output Operator
   key expressions:
 expr: _col0
 type: string
   sort order: +
   Map-reduce partition columns:
 expr: _col0
 type: string
   tag: 0
   value expressions:
 expr: _col1
 type: bigint
 null-subquery2:z-subquery2:TBL 
   TableScan
 alias: TBL
 Select Operator
   expressions:
 expr: a
 type: string
   outputColumnNames: a
   Group By Operator
 aggregations:
   expr: count(1)
 bucketGroup: false
 keys:
   expr: a
   type: string
 mode: hash
 outputColumnNames: _col0, _col1
 Reduce Output Operator
   key expressions:
 expr: _col0
 type: string
   sort order: +
   Map-reduce partition columns:
 expr: _col0
 type: string
   tag: 1
   value expressions:
 expr: _col1
 type: bigint
   Reduce Operator Tree:
 Demux Operator
   Group By Operator
 aggregations:
   expr: count(VALUE._col0)
 bucketGroup: false
 keys:
   expr: KEY._col0
   type: string
 mode: mergepartial
 outputColumnNames: _col0, _col1
 Select Operator
   expressions:
 expr: _col0
 type: string
 expr: _col1
 

[jira] [Commented] (HIVE-7222) Support timestamp column statistics in ORC and extend PPD for timestamp

2014-06-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14028769#comment-14028769
 ] 

Yin Huai commented on HIVE-7222:


[~prasanth_j] Unfortunately, I am not working on that. 

 Support timestamp column statistics in ORC and extend PPD for timestamp
 ---

 Key: HIVE-7222
 URL: https://issues.apache.org/jira/browse/HIVE-7222
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.14.0
Reporter: Prasanth J
  Labels: orcfile

 Add column statistics for timestamp columns in ORC. Also extend predicate 
 pushdown to support timestamp column evaluation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6716) ORC struct throws NPE for tables with inner structs having null values

2014-03-21 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13943839#comment-13943839
 ] 

Yin Huai commented on HIVE-6716:


[~prasanth_j] It is the same bug as the one I mentioned in 
https://issues.apache.org/jira/browse/HIVE-6631, right? If so, I will mark that 
one as a duplicate.

 ORC struct throws NPE for tables with inner structs having null values 
 ---

 Key: HIVE-6716
 URL: https://issues.apache.org/jira/browse/HIVE-6716
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.13.0, 0.14.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: orcfile
 Attachments: HIVE-6716.1.patch


 ORCStruct should return null when object passed to 
 getStructFieldsDataAsList(Object obj) is null.
 {code}
 public List<Object> getStructFieldsDataAsList(Object object) {
   OrcStruct struct = (OrcStruct) object;
   List<Object> result = new ArrayList<Object>(struct.fields.length);
 {code}
 In the above code, struct.fields will throw an NPE if struct is null.
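The intended guard can be sketched as follows (a Python analogue of the proposed Java fix; the class and function names here are illustrative, not Hive's API): check for null before dereferencing the struct's fields.

```python
# Python analogue of the proposed fix: return None (Java null) instead of
# dereferencing a missing struct. Names are illustrative, not Hive's API.
class OrcStructLike:
    def __init__(self, fields):
        self.fields = fields

def get_struct_fields_data_as_list(obj):
    if obj is None:
        # Without this guard, obj.fields below raises -- the NPE in Java.
        return None
    return list(obj.fields)

print(get_struct_fields_data_as_list(None))                     # None
print(get_struct_fields_data_as_list(OrcStructLike(["x", 3])))  # ['x', 3]
```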





[jira] [Resolved] (HIVE-6631) NPE when select a field of a struct from a table stored by ORC

2014-03-21 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved HIVE-6631.


Resolution: Duplicate

 NPE when select a field of a struct from a table stored by ORC
 --

 Key: HIVE-6631
 URL: https://issues.apache.org/jira/browse/HIVE-6631
 Project: Hive
  Issue Type: Bug
  Components: Query Processor, Serializers/Deserializers
Affects Versions: 0.13.0, 0.14.0
Reporter: Yin Huai

 I have a table like this ...
 {code:sql}
 create table lineitem_orc_cg
 (
 CG1 STRUCT<L_PARTKEY:INT,
L_SUPPKEY:INT,
L_COMMITDATE:STRING,
L_RECEIPTDATE:STRING,
L_SHIPINSTRUCT:STRING,
L_SHIPMODE:STRING,
L_COMMENT:STRING,
L_TAX:float,
L_RETURNFLAG:STRING,
L_LINESTATUS:STRING,
L_LINENUMBER:INT,
L_ORDERKEY:INT>,
 CG2 STRUCT<L_QUANTITY:float,
L_EXTENDEDPRICE:float,
L_DISCOUNT:float,
L_SHIPDATE:STRING>
 )
 row format serde 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
 stored as orc tblproperties ("orc.compress"="NONE");
 {code}
 When I want to select a field from a struct by using
 {code:sql}
 select cg1.l_comment from lineitem_orc_cg limit 1;
 {code}
 I got 
 {code}
 Caused by: java.lang.NullPointerException
   at 
 org.apache.hadoop.hive.ql.exec.ExprNodeFieldEvaluator.initialize(ExprNodeFieldEvaluator.java:61)
   at 
 org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:928)
   at 
 org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:954)
   at 
 org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:65)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:459)
   at 
 org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:415)
   at 
 org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:189)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
   at 
 org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:409)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
   at 
 org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:133)
   ... 22 more
 {code}





[jira] [Commented] (HIVE-6716) ORC struct throws NPE for tables with inner structs having null values

2014-03-21 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13943843#comment-13943843
 ] 

Yin Huai commented on HIVE-6716:


OK, I have marked that one as a duplicate. Thanks.

 ORC struct throws NPE for tables with inner structs having null values 
 ---

 Key: HIVE-6716
 URL: https://issues.apache.org/jira/browse/HIVE-6716
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.13.0, 0.14.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: orcfile
 Attachments: HIVE-6716.1.patch


 ORCStruct should return null when object passed to 
 getStructFieldsDataAsList(Object obj) is null.
 {code}
 public List<Object> getStructFieldsDataAsList(Object object) {
   OrcStruct struct = (OrcStruct) object;
   List<Object> result = new ArrayList<Object>(struct.fields.length);
 {code}
 In the above code, struct.fields will throw an NPE if struct is null.





[jira] [Commented] (HIVE-6432) Remove deprecated methods in HCatalog

2014-03-15 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13936175#comment-13936175
 ] 

Yin Huai commented on HIVE-6432:


I tried to generate the tarball with 
{code}
mvn clean package -DskipTests -Phadoop-1 -Pdist
{code}
and got the following error
{code}
[ERROR] Failed to execute goal on project hive-packaging: Could not resolve 
dependencies for project org.apache.hive:hive-packaging:pom:0.14.0-SNAPSHOT: 
Failure to find 
org.apache.hive.hcatalog:hive-hcatalog-hbase-storage-handler:jar:0.14.0-SNAPSHOT
 in http://repository.apache.org/snapshots was cached in the local repository, 
resolution will not be reattempted until the update interval of 
apache.snapshots has elapsed or updates are forced - [Help 1]
{code}

I removed this entry 
(https://github.com/apache/hive/blob/trunk/packaging/pom.xml#L135) and this 
entry 
(https://github.com/apache/hive/blob/trunk/packaging/src/main/assembly/bin.xml#L57)
 to make the packaging work. Is there any other update needed?

 Remove deprecated methods in HCatalog
 -

 Key: HIVE-6432
 URL: https://issues.apache.org/jira/browse/HIVE-6432
 Project: Hive
  Issue Type: Task
  Components: HCatalog
Affects Versions: 0.14.0
Reporter: Sushanth Sowmyan
Assignee: Sushanth Sowmyan
 Fix For: 0.14.0

 Attachments: HIVE-6432.patch, HIVE-6432.wip.1.patch, 
 HIVE-6432.wip.2.patch, hcat.6432.test.out


 There are a lot of methods in HCatalog that have been deprecated in HCatalog 
 0.5, and some that were recently deprecated in Hive 0.11 (joint release with 
 HCatalog).
 The goal for HCatalog deprecation is that in general, after something has 
 been deprecated, it is expected to stay around for 2 releases, which means 
 hive-0.13 will be the last release to ship with all the methods that were 
 deprecated in hive-0.11 (the org.apache.hcatalog.* files should all be 
 removed afterwards), and it is also good for us to clean out and nuke all 
 other older deprecated methods.
 We should take this on early in a dev/release cycle to allow us time to 
 resolve all fallout, so I propose that we remove all HCatalog deprecated 
 methods after we branch out 0.13 and 0.14 becomes trunk.





[jira] [Created] (HIVE-6668) When auto join convert is on and noconditionaltask is off, ConditionalResolverCommonJoin fails to resolve map joins.

2014-03-14 Thread Yin Huai (JIRA)
Yin Huai created HIVE-6668:
--

 Summary: When auto join convert is on and noconditionaltask is 
off, ConditionalResolverCommonJoin fails to resolve map joins.
 Key: HIVE-6668
 URL: https://issues.apache.org/jira/browse/HIVE-6668
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.13.0, 0.14.0
Reporter: Yin Huai
Priority: Blocker
 Fix For: 0.13.0


I tried the following query today ...
{code:sql}
set mapred.job.map.memory.mb=2048;
set mapred.job.reduce.memory.mb=2048;
set mapred.map.child.java.opts=-server -Xmx3072m 
-Djava.net.preferIPv4Stack=true;
set mapred.reduce.child.java.opts=-server -Xmx3072m 
-Djava.net.preferIPv4Stack=true;

set mapred.reduce.tasks=60;

set hive.stats.autogather=false;
set hive.exec.parallel=false;
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.map.aggr=true;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.mapred.reduce.tasks.speculative.execution=false;
set hive.auto.convert.join=true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=false;
set hive.auto.convert.join.noconditionaltask=false;
set hive.auto.convert.join.noconditionaltask.size=1;
set hive.optimize.reducededuplication=true;
set hive.optimize.reducededuplication.min.reducer=1;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set hive.mapjoin.smalltable.filesize=4500;

set hive.optimize.index.filter=false;
set hive.vectorized.execution.enabled=false;
set hive.optimize.correlation=false;
select
   i_item_id,
   s_state,
   avg(ss_quantity) agg1,
   avg(ss_list_price) agg2,
   avg(ss_coupon_amt) agg3,
   avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
   cd_gender = 'F' and
   cd_marital_status = 'U' and
   cd_education_status = 'Primary' and
   d_year = 2002 and
   s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by i_item_id, s_state with rollup
order by
   i_item_id,
   s_state
limit 100;
{code}

The log shows ...
{code}
14/03/14 17:05:02 INFO plan.ConditionalResolverCommonJoin: Failed to resolve 
driver alias (threshold : 4500, length mapping : {store=94175, 
store_sales=48713909726, item=39798667, customer_demographics=1660831, 
date_dim=2275902})
Stage-27 is filtered out by condition resolver.
14/03/14 17:05:02 INFO exec.Task: Stage-27 is filtered out by condition 
resolver.
Stage-28 is filtered out by condition resolver.
14/03/14 17:05:02 INFO exec.Task: Stage-28 is filtered out by condition 
resolver.
Stage-3 is selected by condition resolver.
{code}
Stage-3 is a reduce join. Actually, the resolver should pick the map join.





[jira] [Commented] (HIVE-6668) When auto join convert is on and noconditionaltask is off, ConditionalResolverCommonJoin fails to resolve map joins.

2014-03-14 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935290#comment-13935290
 ] 

Yin Huai commented on HIVE-6668:


I guess it was broken by HIVE-6403 or HIVE-6144.

 When auto join convert is on and noconditionaltask is off, 
 ConditionalResolverCommonJoin fails to resolve map joins.
 

 Key: HIVE-6668
 URL: https://issues.apache.org/jira/browse/HIVE-6668
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.13.0, 0.14.0
Reporter: Yin Huai
Priority: Blocker
 Fix For: 0.13.0


 I tried the following query today ...
 {code:sql}
 set mapred.job.map.memory.mb=2048;
 set mapred.job.reduce.memory.mb=2048;
 set mapred.map.child.java.opts=-server -Xmx3072m 
 -Djava.net.preferIPv4Stack=true;
 set mapred.reduce.child.java.opts=-server -Xmx3072m 
 -Djava.net.preferIPv4Stack=true;
 set mapred.reduce.tasks=60;
 set hive.stats.autogather=false;
 set hive.exec.parallel=false;
 set hive.enforce.bucketing=true;
 set hive.enforce.sorting=true;
 set hive.map.aggr=true;
 set hive.optimize.bucketmapjoin=true;
 set hive.optimize.bucketmapjoin.sortedmerge=true;
 set hive.mapred.reduce.tasks.speculative.execution=false;
 set hive.auto.convert.join=true;
 set hive.auto.convert.sortmerge.join=true;
 set hive.auto.convert.sortmerge.join.noconditionaltask=false;
 set hive.auto.convert.join.noconditionaltask=false;
 set hive.auto.convert.join.noconditionaltask.size=1;
 set hive.optimize.reducededuplication=true;
 set hive.optimize.reducededuplication.min.reducer=1;
 set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
 set hive.mapjoin.smalltable.filesize=4500;
 set hive.optimize.index.filter=false;
 set hive.vectorized.execution.enabled=false;
 set hive.optimize.correlation=false;
 select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
 FROM store_sales
 JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
 JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
 JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
 customer_demographics.cd_demo_sk)
 JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
 where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
 group by i_item_id, s_state with rollup
 order by
i_item_id,
s_state
 limit 100;
 {code}
 The log shows ...
 {code}
 14/03/14 17:05:02 INFO plan.ConditionalResolverCommonJoin: Failed to resolve 
 driver alias (threshold : 4500, length mapping : {store=94175, 
 store_sales=48713909726, item=39798667, customer_demographics=1660831, 
 date_dim=2275902})
 Stage-27 is filtered out by condition resolver.
 14/03/14 17:05:02 INFO exec.Task: Stage-27 is filtered out by condition 
 resolver.
 Stage-28 is filtered out by condition resolver.
 14/03/14 17:05:02 INFO exec.Task: Stage-28 is filtered out by condition 
 resolver.
 Stage-3 is selected by condition resolver.
 {code}
 Stage-3 is a reduce join. Actually, the resolver should pick the map join.





[jira] [Commented] (HIVE-6668) When auto join convert is on and noconditionaltask is off, ConditionalResolverCommonJoin fails to resolve map joins.

2014-03-14 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935397#comment-13935397
 ] 

Yin Huai commented on HIVE-6668:


Seems the aliases returned from this line 
(https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java#L178)
 are an empty set.
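For context, driver-alias resolution can be sketched roughly like this (a simplified, illustrative Python model, not the actual ConditionalResolverCommonJoin logic; the threshold value below is made up for the example, not the query's 4500). With an empty candidate alias set, resolution always fails and the plan falls back to the reduce-side join, regardless of table sizes:

```python
# Simplified sketch of driver-alias resolution (illustrative only, not the
# Hive implementation): among candidate "big" aliases, pick the largest one
# whose remaining tables together fit under the small-table size threshold.
def resolve_driver_alias(candidates, sizes, threshold):
    total = sum(sizes.values())
    best = None
    for alias in candidates:
        others = total - sizes[alias]
        if others <= threshold and (best is None or sizes[alias] > sizes[best]):
            best = alias
    return best  # None means: fall back to the common (reduce-side) join

# Sizes taken from the log above; the threshold is a hypothetical value.
sizes = {"store_sales": 48713909726, "item": 39798667,
         "customer_demographics": 1660831, "date_dim": 2275902, "store": 94175}

# With an empty candidate set -- the symptom described above -- resolution
# always fails, no matter how the sizes compare to the threshold.
print(resolve_driver_alias(set(), sizes, threshold=45000000))      # None
print(resolve_driver_alias(set(sizes), sizes, threshold=45000000)) # store_sales
```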

 When auto join convert is on and noconditionaltask is off, 
 ConditionalResolverCommonJoin fails to resolve map joins.
 

 Key: HIVE-6668
 URL: https://issues.apache.org/jira/browse/HIVE-6668
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.13.0, 0.14.0
Reporter: Yin Huai
Priority: Blocker
 Fix For: 0.13.0


 I tried the following query today ...
 {code:sql}
 set mapred.job.map.memory.mb=2048;
 set mapred.job.reduce.memory.mb=2048;
 set mapred.map.child.java.opts=-server -Xmx3072m 
 -Djava.net.preferIPv4Stack=true;
 set mapred.reduce.child.java.opts=-server -Xmx3072m 
 -Djava.net.preferIPv4Stack=true;
 set mapred.reduce.tasks=60;
 set hive.stats.autogather=false;
 set hive.exec.parallel=false;
 set hive.enforce.bucketing=true;
 set hive.enforce.sorting=true;
 set hive.map.aggr=true;
 set hive.optimize.bucketmapjoin=true;
 set hive.optimize.bucketmapjoin.sortedmerge=true;
 set hive.mapred.reduce.tasks.speculative.execution=false;
 set hive.auto.convert.join=true;
 set hive.auto.convert.sortmerge.join=true;
 set hive.auto.convert.sortmerge.join.noconditionaltask=false;
 set hive.auto.convert.join.noconditionaltask=false;
 set hive.auto.convert.join.noconditionaltask.size=1;
 set hive.optimize.reducededuplication=true;
 set hive.optimize.reducededuplication.min.reducer=1;
 set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
 set hive.mapjoin.smalltable.filesize=4500;
 set hive.optimize.index.filter=false;
 set hive.vectorized.execution.enabled=false;
 set hive.optimize.correlation=false;
 select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
 FROM store_sales
 JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
 JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
 JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
 customer_demographics.cd_demo_sk)
 JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
 where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
 group by i_item_id, s_state with rollup
 order by
i_item_id,
s_state
 limit 100;
 {code}
 The log shows ...
 {code}
 14/03/14 17:05:02 INFO plan.ConditionalResolverCommonJoin: Failed to resolve 
 driver alias (threshold : 4500, length mapping : {store=94175, 
 store_sales=48713909726, item=39798667, customer_demographics=1660831, 
 date_dim=2275902})
 Stage-27 is filtered out by condition resolver.
 14/03/14 17:05:02 INFO exec.Task: Stage-27 is filtered out by condition 
 resolver.
 Stage-28 is filtered out by condition resolver.
 14/03/14 17:05:02 INFO exec.Task: Stage-28 is filtered out by condition 
 resolver.
 Stage-3 is selected by condition resolver.
 {code}
 Stage-3 is a reduce join. Actually, the resolver should pick the map join.





[jira] [Commented] (HIVE-6668) When auto join convert is on and noconditionaltask is off, ConditionalResolverCommonJoin fails to resolve map joins.

2014-03-14 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935455#comment-13935455
 ] 

Yin Huai commented on HIVE-6668:


TestConditionalResolverCommonJoin cannot catch this bug.

 When auto join convert is on and noconditionaltask is off, 
 ConditionalResolverCommonJoin fails to resolve map joins.
 

 Key: HIVE-6668
 URL: https://issues.apache.org/jira/browse/HIVE-6668
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.13.0, 0.14.0
Reporter: Yin Huai
Priority: Blocker
 Fix For: 0.13.0


 I tried the following query today ...
 {code:sql}
 set mapred.job.map.memory.mb=2048;
 set mapred.job.reduce.memory.mb=2048;
 set mapred.map.child.java.opts=-server -Xmx3072m 
 -Djava.net.preferIPv4Stack=true;
 set mapred.reduce.child.java.opts=-server -Xmx3072m 
 -Djava.net.preferIPv4Stack=true;
 set mapred.reduce.tasks=60;
 set hive.stats.autogather=false;
 set hive.exec.parallel=false;
 set hive.enforce.bucketing=true;
 set hive.enforce.sorting=true;
 set hive.map.aggr=true;
 set hive.optimize.bucketmapjoin=true;
 set hive.optimize.bucketmapjoin.sortedmerge=true;
 set hive.mapred.reduce.tasks.speculative.execution=false;
 set hive.auto.convert.join=true;
 set hive.auto.convert.sortmerge.join=true;
 set hive.auto.convert.sortmerge.join.noconditionaltask=false;
 set hive.auto.convert.join.noconditionaltask=false;
 set hive.auto.convert.join.noconditionaltask.size=1;
 set hive.optimize.reducededuplication=true;
 set hive.optimize.reducededuplication.min.reducer=1;
 set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
 set hive.mapjoin.smalltable.filesize=4500;
 set hive.optimize.index.filter=false;
 set hive.vectorized.execution.enabled=false;
 set hive.optimize.correlation=false;
 select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
 FROM store_sales
 JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
 JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
 JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
 customer_demographics.cd_demo_sk)
 JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
 where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
 group by i_item_id, s_state with rollup
 order by
i_item_id,
s_state
 limit 100;
 {code}
 The log shows ...
 {code}
 14/03/14 17:05:02 INFO plan.ConditionalResolverCommonJoin: Failed to resolve 
 driver alias (threshold : 4500, length mapping : {store=94175, 
 store_sales=48713909726, item=39798667, customer_demographics=1660831, 
 date_dim=2275902})
 Stage-27 is filtered out by condition resolver.
 14/03/14 17:05:02 INFO exec.Task: Stage-27 is filtered out by condition 
 resolver.
 Stage-28 is filtered out by condition resolver.
 14/03/14 17:05:02 INFO exec.Task: Stage-28 is filtered out by condition 
 resolver.
 Stage-3 is selected by condition resolver.
 {code}
 Stage-3 is a reduce join. Actually, the resolver should pick the map join.





[jira] [Created] (HIVE-6631) NPE when select a field of a struct from a table stored by ORC

2014-03-12 Thread Yin Huai (JIRA)
Yin Huai created HIVE-6631:
--

 Summary: NPE when select a field of a struct from a table stored 
by ORC
 Key: HIVE-6631
 URL: https://issues.apache.org/jira/browse/HIVE-6631
 Project: Hive
  Issue Type: Bug
Reporter: Yin Huai


I have two tables lineitem_orc_cg
{code:sql}
create table lineitem_orc_cg
(
CG1 STRUCT<L_PARTKEY:INT,
   L_SUPPKEY:INT,
   L_COMMITDATE:STRING,
   L_RECEIPTDATE:STRING,
   L_SHIPINSTRUCT:STRING,
   L_SHIPMODE:STRING,
   L_COMMENT:STRING,
   L_TAX:float,
   L_RETURNFLAG:STRING,
   L_LINESTATUS:STRING,
   L_LINENUMBER:INT,
   L_ORDERKEY:INT>,
CG2 STRUCT<L_QUANTITY:float,
   L_EXTENDEDPRICE:float,
   L_DISCOUNT:float,
   L_SHIPDATE:STRING>
)
row format serde 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
stored as orc tblproperties ("orc.compress"="NONE");
{code}
When I want to select a field from a struct by using
{code:sql}
select cg1.l_comment from lineitem_orc_cg limit 1;
{code}

I got 
{code}
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.exec.ExprNodeFieldEvaluator.initialize(ExprNodeFieldEvaluator.java:61)
at 
org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:928)
at 
org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:954)
at 
org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:65)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:459)
at 
org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:415)
at 
org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:189)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at 
org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:409)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at 
org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:133)
... 22 more
{code}





[jira] [Updated] (HIVE-6631) NPE when select a field of a struct from a table stored by ORC

2014-03-12 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-6631:
---

Component/s: Serializers/Deserializers
 Query Processor

 NPE when select a field of a struct from a table stored by ORC
 --

 Key: HIVE-6631
 URL: https://issues.apache.org/jira/browse/HIVE-6631
 Project: Hive
  Issue Type: Bug
  Components: Query Processor, Serializers/Deserializers
Affects Versions: 0.13.0, 0.14.0
Reporter: Yin Huai

 I have two tables lineitem_orc_cg
 {code:sql}
 create table lineitem_orc_cg
 (
 CG1 STRUCT<L_PARTKEY:INT,
L_SUPPKEY:INT,
L_COMMITDATE:STRING,
L_RECEIPTDATE:STRING,
L_SHIPINSTRUCT:STRING,
L_SHIPMODE:STRING,
L_COMMENT:STRING,
L_TAX:float,
L_RETURNFLAG:STRING,
L_LINESTATUS:STRING,
L_LINENUMBER:INT,
L_ORDERKEY:INT>,
 CG2 STRUCT<L_QUANTITY:float,
L_EXTENDEDPRICE:float,
L_DISCOUNT:float,
L_SHIPDATE:STRING>
 )
 row format serde 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
 stored as orc tblproperties ("orc.compress"="NONE");
 {code}
 When I want to select a field from a struct by using
 {code:sql}
 select cg1.l_comment from lineitem_orc_cg limit 1;
 {code}
 I got 
 {code}
 Caused by: java.lang.NullPointerException
   at 
 org.apache.hadoop.hive.ql.exec.ExprNodeFieldEvaluator.initialize(ExprNodeFieldEvaluator.java:61)
   at 
 org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:928)
   at 
 org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:954)
   at 
 org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:65)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:459)
   at 
 org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:415)
   at 
 org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:189)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
   at 
 org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:409)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
   at 
 org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:133)
   ... 22 more
 {code}





[jira] [Updated] (HIVE-6631) NPE when select a field of a struct from a table stored by ORC

2014-03-12 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-6631:
---

Affects Version/s: 0.14.0
   0.13.0

 NPE when select a field of a struct from a table stored by ORC
 --

 Key: HIVE-6631
 URL: https://issues.apache.org/jira/browse/HIVE-6631
 Project: Hive
  Issue Type: Bug
  Components: Query Processor, Serializers/Deserializers
Affects Versions: 0.13.0, 0.14.0
Reporter: Yin Huai

 I have two tables lineitem_orc_cg
 {code:sql}
 create table lineitem_orc_cg
 (
 CG1 STRUCT<L_PARTKEY:INT,
L_SUPPKEY:INT,
L_COMMITDATE:STRING,
L_RECEIPTDATE:STRING,
L_SHIPINSTRUCT:STRING,
L_SHIPMODE:STRING,
L_COMMENT:STRING,
L_TAX:float,
L_RETURNFLAG:STRING,
L_LINESTATUS:STRING,
L_LINENUMBER:INT,
L_ORDERKEY:INT>,
 CG2 STRUCT<L_QUANTITY:float,
L_EXTENDEDPRICE:float,
L_DISCOUNT:float,
L_SHIPDATE:STRING>
 )
 row format serde 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
 stored as orc tblproperties (orc.compress=NONE);
 {code}
 When I want to select a field from a struct by using
 {code:sql}
 select cg1.l_comment from lineitem_orc_cg limit 1;
 {code}
 I got 
 {code}
 Caused by: java.lang.NullPointerException
   at 
 org.apache.hadoop.hive.ql.exec.ExprNodeFieldEvaluator.initialize(ExprNodeFieldEvaluator.java:61)
   at 
 org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:928)
   at 
 org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:954)
   at 
 org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:65)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:459)
   at 
 org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:415)
   at 
 org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:189)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
   at 
 org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:409)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
   at 
 org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:133)
   ... 22 more
 {code}





[jira] [Updated] (HIVE-6631) NPE when select a field of a struct from a table stored by ORC

2014-03-12 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-6631:
---

Description: 
I have a table like this ...
{code:sql}
create table lineitem_orc_cg
(
CG1 STRUCT<L_PARTKEY:INT,
           L_SUPPKEY:INT,
           L_COMMITDATE:STRING,
           L_RECEIPTDATE:STRING,
           L_SHIPINSTRUCT:STRING,
           L_SHIPMODE:STRING,
           L_COMMENT:STRING,
           L_TAX:float,
           L_RETURNFLAG:STRING,
           L_LINESTATUS:STRING,
           L_LINENUMBER:INT,
           L_ORDERKEY:INT>,
CG2 STRUCT<L_QUANTITY:float,
           L_EXTENDEDPRICE:float,
           L_DISCOUNT:float,
           L_SHIPDATE:STRING>
)
row format serde 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
stored as orc tblproperties ("orc.compress"="NONE");
{code}
When I want to select a field from a struct by using
{code:sql}
select cg1.l_comment from lineitem_orc_cg limit 1;
{code}

I got 
{code}
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.exec.ExprNodeFieldEvaluator.initialize(ExprNodeFieldEvaluator.java:61)
at 
org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:928)
at 
org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:954)
at 
org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:65)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:459)
at 
org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:415)
at 
org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:189)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at 
org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:409)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at 
org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:133)
... 22 more
{code}

  was:
I have two tables lineitem_orc_cg
{code:sql}
create table lineitem_orc_cg
(
CG1 STRUCT<L_PARTKEY:INT,
           L_SUPPKEY:INT,
           L_COMMITDATE:STRING,
           L_RECEIPTDATE:STRING,
           L_SHIPINSTRUCT:STRING,
           L_SHIPMODE:STRING,
           L_COMMENT:STRING,
           L_TAX:float,
           L_RETURNFLAG:STRING,
           L_LINESTATUS:STRING,
           L_LINENUMBER:INT,
           L_ORDERKEY:INT>,
CG2 STRUCT<L_QUANTITY:float,
           L_EXTENDEDPRICE:float,
           L_DISCOUNT:float,
           L_SHIPDATE:STRING>
)
row format serde 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
stored as orc tblproperties ("orc.compress"="NONE");
{code}
When I want to select a field from a struct by using
{code:sql}
select cg1.l_comment from lineitem_orc_cg limit 1;
{code}

I got 
{code}
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.exec.ExprNodeFieldEvaluator.initialize(ExprNodeFieldEvaluator.java:61)
at 
org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:928)
at 
org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:954)
at 
org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:65)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:459)
at 
org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:415)
at 
org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:189)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at 
org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:409)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at 
org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:133)
... 22 more
{code}


 NPE when select a field of a struct from a table stored by ORC
 --

 Key: HIVE-6631
 URL: https://issues.apache.org/jira/browse/HIVE-6631
 Project: Hive
  Issue Type: Bug
  Components: Query Processor, Serializers/Deserializers
Affects Versions: 0.13.0, 0.14.0
Reporter: Yin Huai

 I have a table like this ...
 {code:sql}
 create table lineitem_orc_cg
 (
 CG1 STRUCT<L_PARTKEY:INT,
            L_SUPPKEY:INT,
            L_COMMITDATE:STRING,
            L_RECEIPTDATE:STRING,
            L_SHIPINSTRUCT:STRING,
            L_SHIPMODE:STRING,
            L_COMMENT:STRING,
            L_TAX:float,
            L_RETURNFLAG:STRING,
            L_LINESTATUS:STRING,
            L_LINENUMBER:INT,
            L_ORDERKEY:INT>,
 CG2 

[jira] [Created] (HIVE-6632) ORC should be able to only read needed fields in a complex column

2014-03-12 Thread Yin Huai (JIRA)
Yin Huai created HIVE-6632:
--

 Summary: ORC should be able to only read needed fields in a 
complex column
 Key: HIVE-6632
 URL: https://issues.apache.org/jira/browse/HIVE-6632
 Project: Hive
  Issue Type: Improvement
Reporter: Yin Huai


Currently, we use a string of ids to record the needed columns. However, this 
string cannot record the needed fields of a complex column. Although ORC decomposes 
a complex column into multiple sub-columns, it still has to load the entire complex 
column even if only a single field of it is needed.
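The limitation described above can be illustrated with a minimal sketch. This is a hypothetical helper, not the actual Hive code (Hive's real mechanism is a comma-separated column-id configuration string): a flat list of top-level ids like "0,3,7" has no way to say "only field L_COMMENT inside struct column 0", so the whole struct column must be read.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch: parse the flat, comma-separated id string that records
// which top-level columns are needed. Nothing in this representation can
// point at an individual field inside a struct column.
public class NeededColumns {
    static List<Integer> parseNeededColumnIds(String ids) {
        List<Integer> cols = new ArrayList<>();
        if (ids == null || ids.isEmpty()) {
            return cols;
        }
        for (String s : ids.split(",")) {
            cols.add(Integer.parseInt(s.trim()));
        }
        return cols;
    }
}
```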





[jira] [Commented] (HIVE-6632) ORC should be able to only read needed fields in a complex column

2014-03-12 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13931978#comment-13931978
 ] 

Yin Huai commented on HIVE-6632:


Does Parquet have the same issue?

 ORC should be able to only read needed fields in a complex column
 -

 Key: HIVE-6632
 URL: https://issues.apache.org/jira/browse/HIVE-6632
 Project: Hive
  Issue Type: Improvement
Reporter: Yin Huai

 Currently, we use a string of ids to record the needed columns. However, this 
 string cannot record the needed fields of a complex column. Although ORC 
 decomposes a complex column into multiple sub-columns, it still has to load the 
 entire complex column even if only a single field of it is needed.





[jira] [Commented] (HIVE-6163) OrcOutputFormat#getRecordWriter creates OrcRecordWriter with relative path

2014-02-05 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13892067#comment-13892067
 ] 

Yin Huai commented on HIVE-6163:


I think we should also check all the other OutputFormats and make sure they all 
have consistent behavior when creating file paths. 

 OrcOutputFormat#getRecordWriter creates OrcRecordWriter with relative path
 --

 Key: HIVE-6163
 URL: https://issues.apache.org/jira/browse/HIVE-6163
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Affects Versions: 0.12.0
Reporter: Branky Shao

 Hi,
 OrcOutputFormat#getRecordWriter actually creates the OrcRecordWriter instance 
 using a file with a relative path.
 return new OrcRecordWriter(new Path(name), OrcFile.writerOptions(conf));
 https://github.com/apache/hive/blob/7263b3bb1632b1a7c6ef5d2363e58020e1fdd756/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcOutputFormat.java#L114
 The fix should be very simple: just as in RCFileOutputFormat#getRecordWriter, 
 append the work output path as the parent:
 Path outputPath = getWorkOutputPath(job);
 Path file = new Path(outputPath, name);
 https://github.com/apache/hive/blob/d85eea2dc5decbf23e8f4010b32f1817cf057ea0/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileOutputFormat.java#L78
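The proposed fix amounts to resolving the writer's file name against the task's work output directory instead of using the bare relative name. A minimal sketch of that resolution rule follows (hypothetical simplification; the real code uses org.apache.hadoop.fs.Path and FileOutputFormat.getWorkOutputPath):

```java
// Hedged sketch: mimic new Path(parent, child) semantics — a relative
// child is resolved against the parent directory, while an absolute
// child is kept unchanged.
public class OutputPathFix {
    static String resolve(String workOutputPath, String name) {
        if (name.startsWith("/")) {
            return name; // already absolute, use as-is
        }
        return workOutputPath.endsWith("/")
                ? workOutputPath + name
                : workOutputPath + "/" + name;
    }
}
```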





[jira] [Updated] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2014-01-16 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5945:
---

   Resolution: Fixed
Fix Version/s: 0.13.0
 Release Note: Committed to trunk. Thanks, Navis!
   Status: Resolved  (was: Patch Available)

 ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those 
 tables which are not used in the child of this conditional task.
 -

 Key: HIVE-5945
 URL: https://issues.apache.org/jira/browse/HIVE-5945
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Navis
Priority: Critical
 Fix For: 0.13.0

 Attachments: HIVE-5945.1.patch.txt, HIVE-5945.2.patch.txt, 
 HIVE-5945.3.patch.txt, HIVE-5945.4.patch.txt, HIVE-5945.5.patch.txt, 
 HIVE-5945.6.patch.txt, HIVE-5945.7.patch.txt, HIVE-5945.8.patch.txt


 Here is an example
 {code}
 select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
 FROM store_sales
 JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
 JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
 JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
 customer_demographics.cd_demo_sk)
 JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
 where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
 group by
i_item_id,
s_state
 order by
i_item_id,
s_state
 limit 100;
 {code}
 I turned off noconditionaltask, so I expected that there would be 4 map-only 
 jobs for this query. However, I got 1 map-only job (joining store_sales and 
 date_dim) and 3 MR jobs (for the reduce-side joins).
 So, I checked the conditional task determining the plan of the join involving 
 item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
 aliasToFileSizeMap contains all input tables used in this query and the 
 intermediate table generated by joining store_sales and date_dim. So, when we 
 sum the size of all small tables, the size of store_sales (which is around 
 45GB in my test) will also be counted.
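The faulty size check described above can be sketched as follows. This is a hedged simplification, not the actual ConditionalResolverCommonJoin code: the resolver sums every alias except the candidate big table and compares the sum against the map-join threshold, so if the size map also carries tables that are not inputs of this particular join, the sum is inflated and the map-join is wrongly rejected.

```java
import java.util.Map;

// Hedged sketch: sum the sizes of all aliases except the candidate big
// table. If aliasToSize contains aliases that do not participate in this
// join (e.g. store_sales for a later join), they are counted anyway —
// which is the bug reported here.
public class SmallTableSum {
    static long sumExcept(Map<String, Long> aliasToSize, String bigTableAlias) {
        long total = 0L;
        for (Map.Entry<String, Long> e : aliasToSize.entrySet()) {
            if (!e.getKey().equals(bigTableAlias)) {
                total += e.getValue();
            }
        }
        return total;
    }
}
```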





[jira] [Updated] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2014-01-16 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5945:
---

Release Note:   (was: Committed to trunk. Thanks, Navis!)

 ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those 
 tables which are not used in the child of this conditional task.
 -

 Key: HIVE-5945
 URL: https://issues.apache.org/jira/browse/HIVE-5945
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Navis
Priority: Critical
 Fix For: 0.13.0

 Attachments: HIVE-5945.1.patch.txt, HIVE-5945.2.patch.txt, 
 HIVE-5945.3.patch.txt, HIVE-5945.4.patch.txt, HIVE-5945.5.patch.txt, 
 HIVE-5945.6.patch.txt, HIVE-5945.7.patch.txt, HIVE-5945.8.patch.txt


 Here is an example
 {code}
 select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
 FROM store_sales
 JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
 JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
 JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
 customer_demographics.cd_demo_sk)
 JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
 where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
 group by
i_item_id,
s_state
 order by
i_item_id,
s_state
 limit 100;
 {code}
 I turned off noconditionaltask, so I expected that there would be 4 map-only 
 jobs for this query. However, I got 1 map-only job (joining store_sales and 
 date_dim) and 3 MR jobs (for the reduce-side joins).
 So, I checked the conditional task determining the plan of the join involving 
 item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
 aliasToFileSizeMap contains all input tables used in this query and the 
 intermediate table generated by joining store_sales and date_dim. So, when we 
 sum the size of all small tables, the size of store_sales (which is around 
 45GB in my test) will also be counted.





[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2014-01-16 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13873656#comment-13873656
 ] 

Yin Huai commented on HIVE-5945:


Committed to trunk. Thanks, Navis!

 ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those 
 tables which are not used in the child of this conditional task.
 -

 Key: HIVE-5945
 URL: https://issues.apache.org/jira/browse/HIVE-5945
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Navis
Priority: Critical
 Fix For: 0.13.0

 Attachments: HIVE-5945.1.patch.txt, HIVE-5945.2.patch.txt, 
 HIVE-5945.3.patch.txt, HIVE-5945.4.patch.txt, HIVE-5945.5.patch.txt, 
 HIVE-5945.6.patch.txt, HIVE-5945.7.patch.txt, HIVE-5945.8.patch.txt


 Here is an example
 {code}
 select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
 FROM store_sales
 JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
 JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
 JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
 customer_demographics.cd_demo_sk)
 JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
 where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
 group by
i_item_id,
s_state
 order by
i_item_id,
s_state
 limit 100;
 {code}
 I turned off noconditionaltask, so I expected that there would be 4 map-only 
 jobs for this query. However, I got 1 map-only job (joining store_sales and 
 date_dim) and 3 MR jobs (for the reduce-side joins).
 So, I checked the conditional task determining the plan of the join involving 
 item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
 aliasToFileSizeMap contains all input tables used in this query and the 
 intermediate table generated by joining store_sales and date_dim. So, when we 
 sum the size of all small tables, the size of store_sales (which is around 
 45GB in my test) will also be counted.





[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2014-01-15 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13872582#comment-13872582
 ] 

Yin Huai commented on HIVE-5945:


+1

 ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those 
 tables which are not used in the child of this conditional task.
 -

 Key: HIVE-5945
 URL: https://issues.apache.org/jira/browse/HIVE-5945
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Navis
Priority: Critical
 Attachments: HIVE-5945.1.patch.txt, HIVE-5945.2.patch.txt, 
 HIVE-5945.3.patch.txt, HIVE-5945.4.patch.txt, HIVE-5945.5.patch.txt, 
 HIVE-5945.6.patch.txt, HIVE-5945.7.patch.txt, HIVE-5945.8.patch.txt


 Here is an example
 {code}
 select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
 FROM store_sales
 JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
 JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
 JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
 customer_demographics.cd_demo_sk)
 JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
 where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
 group by
i_item_id,
s_state
 order by
i_item_id,
s_state
 limit 100;
 {code}
 I turned off noconditionaltask, so I expected that there would be 4 map-only 
 jobs for this query. However, I got 1 map-only job (joining store_sales and 
 date_dim) and 3 MR jobs (for the reduce-side joins).
 So, I checked the conditional task determining the plan of the join involving 
 item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
 aliasToFileSizeMap contains all input tables used in this query and the 
 intermediate table generated by joining store_sales and date_dim. So, when we 
 sum the size of all small tables, the size of store_sales (which is around 
 45GB in my test) will also be counted.





[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2014-01-06 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13863047#comment-13863047
 ] 

Yin Huai commented on HIVE-5945:


Thanks Navis for the change. date_dim is a native table. Actually, I think the 
problem is 
org.apache.hadoop.hive.ql.plan.ConditionalResolverCommonJoin.getParticipants. 
It uses ctx.getAliasToTask() to get all aliases. However, these aliases do not 
include the aliases appearing in the MapLocalWork (those small tables). So, for a 
query like 
{code}
set hive.auto.convert.join.noconditionaltask=false;
select
   i_item_id
FROM store_sales
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
limit 10;
{code}

The plan is 
{code}
STAGE DEPENDENCIES:
  Stage-5 is a root stage , consists of Stage-6, Stage-1
  Stage-6 has a backup stage: Stage-1
  Stage-3 depends on stages: Stage-6
  Stage-1
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-5
Conditional Operator

  Stage: Stage-6
Map Reduce Local Work
  Alias - Map Local Tables:
item 
  Fetch Operator
limit: -1
  Alias - Map Local Operator Tree:
item 
  TableScan
alias: item
HashTable Sink Operator
  condition expressions:
0 
1 {i_item_id}
  handleSkewJoin: false
  keys:
0 [Column[ss_item_sk]]
1 [Column[i_item_sk]]
  Position of Big Table: 0

  Stage: Stage-3
Map Reduce
  Alias - Map Operator Tree:
store_sales 
  TableScan
alias: store_sales
Map Join Operator
  condition map:
   Inner Join 0 to 1
  condition expressions:
0 
1 {i_item_id}
  handleSkewJoin: false
  keys:
0 [Column[ss_item_sk]]
1 [Column[i_item_sk]]
  outputColumnNames: _col26
  Position of Big Table: 0
  Select Operator
expressions:
  expr: _col26
  type: string
outputColumnNames: _col0
Limit
  File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Local Work:
Map Reduce Local Work

  Stage: Stage-1
Map Reduce
  Alias - Map Operator Tree:
item 
  TableScan
alias: item
Reduce Output Operator
  key expressions:
expr: i_item_sk
type: int
  sort order: +
  Map-reduce partition columns:
expr: i_item_sk
type: int
  tag: 1
  value expressions:
expr: i_item_id
type: string
store_sales 
  TableScan
alias: store_sales
Reduce Output Operator
  key expressions:
expr: ss_item_sk
type: int
  sort order: +
  Map-reduce partition columns:
expr: ss_item_sk
type: int
  tag: 0
  Reduce Operator Tree:
Join Operator
  condition map:
   Inner Join 0 to 1
  condition expressions:
0 
1 {VALUE._col1}
  handleSkewJoin: false
  outputColumnNames: _col26
  Select Operator
expressions:
  expr: _col26
  type: string
outputColumnNames: _col0
Limit
  File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
Fetch Operator
  limit: 10
{code}
The alias of item will not be in the set returned by getParticipants. Thus, 
the input of sumOfExcept will be 
{code}
aliasToSize: {store_sales=388445409, item=5051899}
aliases: [store_sales]
except: store_sales
{code}
and then we get 0 for the size of small tables.

I think that in getParticipants, we can check the type of a task and, if it is a 
MapRedTask, use getWork().getMapWork().getMapLocalWork() to get the local 
task. Then, we can get the aliases of those small tables through aliasToWork.
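The suggested change can be sketched as follows. This is a hypothetical simplification, not the actual Hive code (the real fix would inspect MapRedTask objects and their map-local work): collect the aliases of the top-level tasks and, when a task carries map-local work, also collect the small-table aliases recorded there, so an alias like "item" is no longer lost.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hedged sketch: given a mapping from each task's driver alias to the
// small-table aliases of its map-local work, return the union of both —
// the full set of join participants.
public class Participants {
    static Set<String> participants(Map<String, List<String>> taskAliasToLocalAliases) {
        Set<String> aliases = new HashSet<>(taskAliasToLocalAliases.keySet());
        for (List<String> localAliases : taskAliasToLocalAliases.values()) {
            aliases.addAll(localAliases); // small tables from the local work
        }
        return aliases;
    }
}
```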

Another minor comment. Can you add a comment 

[jira] [Updated] (HIVE-6083) User provided table properties are not assigned to the TableDesc of the FileSinkDesc in a CTAS query

2013-12-31 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-6083:
---

Status: Open  (was: Patch Available)

 User provided table properties are not assigned to the TableDesc of the 
 FileSinkDesc in a CTAS query
 

 Key: HIVE-6083
 URL: https://issues.apache.org/jira/browse/HIVE-6083
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-6083.1.patch.txt, HIVE-6083.2.patch.txt


 I was trying to use a CTAS query to create a table stored with ORC and 
 orc.compress was set to SNAPPY. However, the table was still compressed as 
 ZLIB (although the result of DESCRIBE still shows that this table is 
 compressed by SNAPPY). For a CTAS query, SemanticAnalyzer.genFileSinkPlan 
 uses CreateTableDesc to generate the TableDesc for the FileSinkDesc by 
 calling PlanUtils.getTableDesc. However, in PlanUtils.getTableDesc, I do not 
 see user-provided table properties being assigned to the returned TableDesc 
 (CreateTableDesc.getTblProps is not called in this method).
 btw, I only checked the code of 0.12 and trunk.
 Two examples:
 * Snappy compression
 {code}
 create table web_sales_wrong_orc_snappy
 stored as orc tblproperties ("orc.compress"="SNAPPY")
 as select * from web_sales;
 {code}
 {code}
 describe formatted web_sales_wrong_orc_snappy;
 
 Location: 
 hdfs://localhost:54310/user/hive/warehouse/web_sales_wrong_orc_snappy
 Table Type:   MANAGED_TABLE
 Table Parameters:  
   COLUMN_STATS_ACCURATE   true
   numFiles                1
   numRows                 719384
   orc.compress            SNAPPY
   rawDataSize             97815412
   totalSize               40625243
   transient_lastDdlTime   1387566015
    
 {code}
 {code}
 bin/hive --orcfiledump 
 /user/hive/warehouse/web_sales_wrong_orc_snappy/00_0
 Rows: 719384
 Compression: ZLIB
 Compression size: 262144
 ...
 {code}
 * No compression
 {code}
 create table web_sales_wrong_orc_none
 stored as orc tblproperties ("orc.compress"="NONE")
 as select * from web_sales;
 {code}
 {code}
 describe formatted web_sales_wrong_orc_none;
 
 Location: 
 hdfs://localhost:54310/user/hive/warehouse/web_sales_wrong_orc_none  
 Table Type:   MANAGED_TABLE
 Table Parameters:  
   COLUMN_STATS_ACCURATE   true
   numFiles                1
   numRows                 719384
   orc.compress            NONE
   rawDataSize             97815412
   totalSize               40625243
   transient_lastDdlTime   1387566064
    
 {code}
 {code}
 bin/hive --orcfiledump /user/hive/warehouse/web_sales_wrong_orc_none/00_0
 Rows: 719384
 Compression: ZLIB
 Compression size: 262144
 ...
 {code}





[jira] [Updated] (HIVE-6083) User provided table properties are not assigned to the TableDesc of the FileSinkDesc in a CTAS query

2013-12-31 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-6083:
---

Status: Patch Available  (was: Open)

 User provided table properties are not assigned to the TableDesc of the 
 FileSinkDesc in a CTAS query
 

 Key: HIVE-6083
 URL: https://issues.apache.org/jira/browse/HIVE-6083
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-6083.1.patch.txt, HIVE-6083.2.patch.txt


 I was trying to use a CTAS query to create a table stored with ORC and 
 orc.compress was set to SNAPPY. However, the table was still compressed as 
 ZLIB (although the result of DESCRIBE still shows that this table is 
 compressed by SNAPPY). For a CTAS query, SemanticAnalyzer.genFileSinkPlan 
 uses CreateTableDesc to generate the TableDesc for the FileSinkDesc by 
 calling PlanUtils.getTableDesc. However, in PlanUtils.getTableDesc, I do not 
 see user-provided table properties being assigned to the returned TableDesc 
 (CreateTableDesc.getTblProps is not called in this method).
 btw, I only checked the code of 0.12 and trunk.
 Two examples:
 * Snappy compression
 {code}
 create table web_sales_wrong_orc_snappy
 stored as orc tblproperties ("orc.compress"="SNAPPY")
 as select * from web_sales;
 {code}
 {code}
 describe formatted web_sales_wrong_orc_snappy;
 
 Location: 
 hdfs://localhost:54310/user/hive/warehouse/web_sales_wrong_orc_snappy
 Table Type:   MANAGED_TABLE
 Table Parameters:  
   COLUMN_STATS_ACCURATE   true
   numFiles                1
   numRows                 719384
   orc.compress            SNAPPY
   rawDataSize             97815412
   totalSize               40625243
   transient_lastDdlTime   1387566015
    
 {code}
 {code}
 bin/hive --orcfiledump 
 /user/hive/warehouse/web_sales_wrong_orc_snappy/00_0
 Rows: 719384
 Compression: ZLIB
 Compression size: 262144
 ...
 {code}
 * No compression
 {code}
 create table web_sales_wrong_orc_none
 stored as orc tblproperties ("orc.compress"="NONE")
 as select * from web_sales;
 {code}
 {code}
 describe formatted web_sales_wrong_orc_none;
 
 Location: 
 hdfs://localhost:54310/user/hive/warehouse/web_sales_wrong_orc_none  
 Table Type:   MANAGED_TABLE
 Table Parameters:  
   COLUMN_STATS_ACCURATE   true
   numFiles                1
   numRows                 719384
   orc.compress            NONE
   rawDataSize             97815412
   totalSize               40625243
   transient_lastDdlTime   1387566064
    
 {code}
 {code}
 bin/hive --orcfiledump /user/hive/warehouse/web_sales_wrong_orc_none/00_0
 Rows: 719384
 Compression: ZLIB
 Compression size: 262144
 ...
 {code}





[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2013-12-30 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13859114#comment-13859114
 ] 

Yin Huai commented on HIVE-5945:


Thanks, Navis :) I played with your patch and found an issue, which I commented 
on at the review board. I am also attaching more info here. For the query in the 
description, we can have 4 map-joins. There will be 3 different intermediate 
tables called $INTNAME. The current patch does not update the size of $INTNAME.

Here are logs.
{code}
13/12/30 16:48:25 INFO ql.Driver: MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 12.76 sec   HDFS Read: 388445624 HDFS Write: 
20815654 SUCCESS
13/12/30 16:48:25 INFO ql.Driver: Job 0: Map: 1   Cumulative CPU: 12.76 sec   
HDFS Read: 388445624 HDFS Write: 20815654 SUCCESS
Job 1: Map: 1   Cumulative CPU: 9.18 sec   HDFS Read: 20816111 HDFS Write: 
28593993 SUCCESS
13/12/30 16:48:25 INFO ql.Driver: Job 1: Map: 1   Cumulative CPU: 9.18 sec   
HDFS Read: 20816111 HDFS Write: 28593993 SUCCESS
Job 2: Map: 1   Cumulative CPU: 17.38 sec   HDFS Read: 80660331 HDFS Write: 
378063 SUCCESS
13/12/30 16:48:25 INFO ql.Driver: Job 2: Map: 1   Cumulative CPU: 17.38 sec   
HDFS Read: 80660331 HDFS Write: 378063 SUCCESS
Job 3: Map: 1   Cumulative CPU: 2.06 sec   HDFS Read: 378520 HDFS Write: 96 
SUCCESS
13/12/30 16:48:25 INFO ql.Driver: Job 3: Map: 1   Cumulative CPU: 2.06 sec   
HDFS Read: 378520 HDFS Write: 96 SUCCESS
Job 4: Map: 1  Reduce: 1   Cumulative CPU: 2.45 sec   HDFS Read: 553 HDFS 
Write: 96 SUCCESS
13/12/30 16:48:25 INFO ql.Driver: Job 4: Map: 1  Reduce: 1   Cumulative CPU: 
2.45 sec   HDFS Read: 553 HDFS Write: 96 SUCCESS
Job 5: Map: 1  Reduce: 1   Cumulative CPU: 2.33 sec   HDFS Read: 553 HDFS 
Write: 0 SUCCESS
13/12/30 16:48:25 INFO ql.Driver: Job 5: Map: 1  Reduce: 1   Cumulative CPU: 
2.33 sec   HDFS Read: 553 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 46 seconds 160 msec
{code}

{code}
Map-join1:
plan.ConditionalResolverCommonJoin: Driver alias is store_sales with size 
388445409 (total size of others : 0, threshold : 2500)
Stage-28 is selected by condition resolver.

Map-join2:
plan.ConditionalResolverCommonJoin: Driver alias is $INTNAME with size 20815654 
(total size of others : 5051899, threshold : 2500)
Stage-26 is selected by condition resolver.

Map-join3:
 plan.ConditionalResolverCommonJoin: Driver alias is customer_demographics with 
size 80660096 (total size of others : 20815654, threshold : 2500)
Stage-24 is filtered out by condition resolver.

Map-join4:
plan.ConditionalResolverCommonJoin: Driver alias is $INTNAME with size 20815654 
(total size of others : 3155, threshold : 2500)
Stage-22 is selected by condition resolver.
{code}


btw, a minor question: why does the log of map-join 1 show the size of others as 0?

 ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those 
 tables which are not used in the child of this conditional task.
 -

 Key: HIVE-5945
 URL: https://issues.apache.org/jira/browse/HIVE-5945
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Navis
Priority: Critical
 Attachments: HIVE-5945.1.patch.txt, HIVE-5945.2.patch.txt, 
 HIVE-5945.3.patch.txt, HIVE-5945.4.patch.txt, HIVE-5945.5.patch.txt


 Here is an example
 {code}
 select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
 FROM store_sales
 JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
 JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
 JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
 customer_demographics.cd_demo_sk)
 JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
 where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
 group by
i_item_id,
s_state
 order by
i_item_id,
s_state
 limit 100;
 {code}
 I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
 jobs for this query. However, I got 1 Map-only job (joining store_sales and 
 date_dim) and 3 MR jobs (for reduce joins).
 So, I checked the conditional task determining the plan of the join involving 
 item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
 aliasToFileSizeMap contains all input tables used in this query and the 
 intermediate table generated by joining store_sales and date_dim. So, when we 
 sum the size of all small tables, the size of store_sales (which is around 
 45GB in my test) will also be counted.
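The bug described above can be illustrated with a toy sketch (Python, illustrative names only, not Hive's actual code): summing over every alias in aliasToFileSizeMap counts tables that the conditional task's child join never reads, so the small-table total blows past the threshold and the map join is wrongly rejected.

```python
def sum_small_tables_buggy(alias_to_size, driver):
    # BUG: iterates over ALL aliases known to the query, including tables
    # (like store_sales here) that are not inputs to this join's child task.
    return sum(s for a, s in alias_to_size.items() if a != driver)

def sum_small_tables_fixed(alias_to_size, driver, participants):
    # FIX: only sum aliases that actually feed this conditional task's child.
    return sum(alias_to_size[a] for a in participants if a != driver)

sizes = {"store_sales": 45_000_000_000, "item": 5_000_000, "$INTNAME": 20_000_000}
# The join of item with the intermediate table only involves these two aliases:
participants = {"item", "$INTNAME"}
assert sum_small_tables_buggy(sizes, "$INTNAME") > 25_000_000   # map join rejected
assert sum_small_tables_fixed(sizes, "$INTNAME", participants) == 5_000_000
```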

[jira] [Updated] (HIVE-6083) User provided table properties are not assigned to the TableDesc of the FileSinkDesc in a CTAS query

2013-12-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-6083:
---

Status: Open  (was: Patch Available)

 User provided table properties are not assigned to the TableDesc of the 
 FileSinkDesc in a CTAS query
 

 Key: HIVE-6083
 URL: https://issues.apache.org/jira/browse/HIVE-6083
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-6083.1.patch.txt


 I was trying to use a CTAS query to create a table stored as ORC with 
 orc.compress set to SNAPPY. However, the table was still compressed with 
 ZLIB (although the result of DESCRIBE still shows that this table is 
 compressed with SNAPPY). For a CTAS query, SemanticAnalyzer.genFileSinkPlan 
 uses CreateTableDesc to generate the TableDesc for the FileSinkDesc by 
 calling PlanUtils.getTableDesc. However, in PlanUtils.getTableDesc, I do not 
 see user-provided table properties being assigned to the returned TableDesc 
 (CreateTableDesc.getTblProps is not called in this method).
 btw, I only checked the code of 0.12 and trunk.
 Two examples:
 * Snappy compression
 {code}
 create table web_sales_wrong_orc_snappy
 stored as orc tblproperties ("orc.compress"="SNAPPY")
 as select * from web_sales;
 {code}
 {code}
 describe formatted web_sales_wrong_orc_snappy;
 
 Location: 
 hdfs://localhost:54310/user/hive/warehouse/web_sales_wrong_orc_snappy
 Table Type:   MANAGED_TABLE
 Table Parameters:  
   COLUMN_STATS_ACCURATE   true
   numFiles                1
   numRows                 719384
   orc.compress            SNAPPY
   rawDataSize             97815412
   totalSize               40625243
   transient_lastDdlTime   1387566015
    
 {code}
 {code}
 bin/hive --orcfiledump 
 /user/hive/warehouse/web_sales_wrong_orc_snappy/00_0
 Rows: 719384
 Compression: ZLIB
 Compression size: 262144
 ...
 {code}
 * No compression
 {code}
 create table web_sales_wrong_orc_none
 stored as orc tblproperties ("orc.compress"="NONE")
 as select * from web_sales;
 {code}
 {code}
 describe formatted web_sales_wrong_orc_none;
 
 Location: 
 hdfs://localhost:54310/user/hive/warehouse/web_sales_wrong_orc_none  
 Table Type:   MANAGED_TABLE
 Table Parameters:  
   COLUMN_STATS_ACCURATE   true
   numFiles                1
   numRows                 719384
   orc.compress            NONE
   rawDataSize             97815412
   totalSize               40625243
   transient_lastDdlTime   1387566064
    
 {code}
 {code}
 bin/hive --orcfiledump /user/hive/warehouse/web_sales_wrong_orc_none/00_0
 Rows: 719384
 Compression: ZLIB
 Compression size: 262144
 ...
 {code}
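The fix idea can be sketched in a few lines (Python, with illustrative names; the real code lives in Java's PlanUtils.getTableDesc): when building the TableDesc for a CTAS FileSinkDesc, the user-supplied TBLPROPERTIES must be merged into the descriptor's properties, otherwise the writer (e.g. the ORC writer reading orc.compress) falls back to its defaults.

```python
def get_table_desc(base_props, create_table_tbl_props):
    """Merge user TBLPROPERTIES over the generated storage-format defaults."""
    props = dict(base_props)                    # defaults from the storage format
    props.update(create_table_tbl_props or {})  # the step missing in the bug report
    return props

# User asked for SNAPPY; without the update() above, the default ZLIB would win.
desc = get_table_desc({"orc.compress": "ZLIB"}, {"orc.compress": "SNAPPY"})
assert desc["orc.compress"] == "SNAPPY"
```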



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HIVE-6083) User provided table properties are not assigned to the TableDesc of the FileSinkDesc in a CTAS query

2013-12-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-6083:
---

Attachment: HIVE-6083.2.patch.txt

Let me trigger HiveQA again.

 User provided table properties are not assigned to the TableDesc of the 
 FileSinkDesc in a CTAS query
 

 Key: HIVE-6083
 URL: https://issues.apache.org/jira/browse/HIVE-6083
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-6083.1.patch.txt, HIVE-6083.2.patch.txt


 I was trying to use a CTAS query to create a table stored as ORC with 
 orc.compress set to SNAPPY. However, the table was still compressed with 
 ZLIB (although the result of DESCRIBE still shows that this table is 
 compressed with SNAPPY). For a CTAS query, SemanticAnalyzer.genFileSinkPlan 
 uses CreateTableDesc to generate the TableDesc for the FileSinkDesc by 
 calling PlanUtils.getTableDesc. However, in PlanUtils.getTableDesc, I do not 
 see user-provided table properties being assigned to the returned TableDesc 
 (CreateTableDesc.getTblProps is not called in this method).
 btw, I only checked the code of 0.12 and trunk.
 Two examples:
 * Snappy compression
 {code}
 create table web_sales_wrong_orc_snappy
 stored as orc tblproperties ("orc.compress"="SNAPPY")
 as select * from web_sales;
 {code}
 {code}
 describe formatted web_sales_wrong_orc_snappy;
 
 Location: 
 hdfs://localhost:54310/user/hive/warehouse/web_sales_wrong_orc_snappy
 Table Type:   MANAGED_TABLE
 Table Parameters:  
   COLUMN_STATS_ACCURATE   true
   numFiles                1
   numRows                 719384
   orc.compress            SNAPPY
   rawDataSize             97815412
   totalSize               40625243
   transient_lastDdlTime   1387566015
    
 {code}
 {code}
 bin/hive --orcfiledump 
 /user/hive/warehouse/web_sales_wrong_orc_snappy/00_0
 Rows: 719384
 Compression: ZLIB
 Compression size: 262144
 ...
 {code}
 * No compression
 {code}
 create table web_sales_wrong_orc_none
 stored as orc tblproperties ("orc.compress"="NONE")
 as select * from web_sales;
 {code}
 {code}
 describe formatted web_sales_wrong_orc_none;
 
 Location: 
 hdfs://localhost:54310/user/hive/warehouse/web_sales_wrong_orc_none  
 Table Type:   MANAGED_TABLE
 Table Parameters:  
   COLUMN_STATS_ACCURATE   true
   numFiles                1
   numRows                 719384
   orc.compress            NONE
   rawDataSize             97815412
   totalSize               40625243
   transient_lastDdlTime   1387566064
    
 {code}
 {code}
 bin/hive --orcfiledump /user/hive/warehouse/web_sales_wrong_orc_none/00_0
 Rows: 719384
 Compression: ZLIB
 Compression size: 262144
 ...
 {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HIVE-6083) User provided table properties are not assigned to the TableDesc of the FileSinkDesc in a CTAS query

2013-12-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-6083:
---

Status: Patch Available  (was: Open)

 User provided table properties are not assigned to the TableDesc of the 
 FileSinkDesc in a CTAS query
 

 Key: HIVE-6083
 URL: https://issues.apache.org/jira/browse/HIVE-6083
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-6083.1.patch.txt, HIVE-6083.2.patch.txt


 I was trying to use a CTAS query to create a table stored as ORC with 
 orc.compress set to SNAPPY. However, the table was still compressed with 
 ZLIB (although the result of DESCRIBE still shows that this table is 
 compressed with SNAPPY). For a CTAS query, SemanticAnalyzer.genFileSinkPlan 
 uses CreateTableDesc to generate the TableDesc for the FileSinkDesc by 
 calling PlanUtils.getTableDesc. However, in PlanUtils.getTableDesc, I do not 
 see user-provided table properties being assigned to the returned TableDesc 
 (CreateTableDesc.getTblProps is not called in this method).
 btw, I only checked the code of 0.12 and trunk.
 Two examples:
 * Snappy compression
 {code}
 create table web_sales_wrong_orc_snappy
 stored as orc tblproperties ("orc.compress"="SNAPPY")
 as select * from web_sales;
 {code}
 {code}
 describe formatted web_sales_wrong_orc_snappy;
 
 Location: 
 hdfs://localhost:54310/user/hive/warehouse/web_sales_wrong_orc_snappy
 Table Type:   MANAGED_TABLE
 Table Parameters:  
   COLUMN_STATS_ACCURATE   true
   numFiles                1
   numRows                 719384
   orc.compress            SNAPPY
   rawDataSize             97815412
   totalSize               40625243
   transient_lastDdlTime   1387566015
    
 {code}
 {code}
 bin/hive --orcfiledump 
 /user/hive/warehouse/web_sales_wrong_orc_snappy/00_0
 Rows: 719384
 Compression: ZLIB
 Compression size: 262144
 ...
 {code}
 * No compression
 {code}
 create table web_sales_wrong_orc_none
 stored as orc tblproperties ("orc.compress"="NONE")
 as select * from web_sales;
 {code}
 {code}
 describe formatted web_sales_wrong_orc_none;
 
 Location: 
 hdfs://localhost:54310/user/hive/warehouse/web_sales_wrong_orc_none  
 Table Type:   MANAGED_TABLE
 Table Parameters:  
   COLUMN_STATS_ACCURATE   true
   numFiles                1
   numRows                 719384
   orc.compress            NONE
   rawDataSize             97815412
   totalSize               40625243
   transient_lastDdlTime   1387566064
    
 {code}
 {code}
 bin/hive --orcfiledump /user/hive/warehouse/web_sales_wrong_orc_none/00_0
 Rows: 719384
 Compression: ZLIB
 Compression size: 262144
 ...
 {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HIVE-6083) User provided table properties are not assigned to the TableDesc of the FileSinkDesc in a CTAS query

2013-12-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-6083:
---

Summary: User provided table properties are not assigned to the TableDesc 
of the FileSinkDesc in a CTAS query  (was: User provided table properties are 
not assigned to the TableDesc of the FileSinkDesc in a CTAS)

 User provided table properties are not assigned to the TableDesc of the 
 FileSinkDesc in a CTAS query
 

 Key: HIVE-6083
 URL: https://issues.apache.org/jira/browse/HIVE-6083
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-6083.1.patch.txt


 I was trying to use a CTAS query to create a table stored as ORC with 
 orc.compress set to SNAPPY. However, the table was still compressed with 
 ZLIB (although the result of DESCRIBE still shows that this table is 
 compressed with SNAPPY). For a CTAS query, SemanticAnalyzer.genFileSinkPlan 
 uses CreateTableDesc to generate the TableDesc for the FileSinkDesc by 
 calling PlanUtils.getTableDesc. However, in PlanUtils.getTableDesc, I do not 
 see user-provided table properties being assigned to the returned TableDesc 
 (CreateTableDesc.getTblProps is not called in this method).
 btw, I only checked the code of 0.12 and trunk.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2013-12-18 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13851756#comment-13851756
 ] 

Yin Huai commented on HIVE-5945:


I left two minor comments on the review board.

Two additional comments:
When we find
{code}
bigTableFileAlias != null
{code}
can we also log sumOfOthers and the small-table size threshold? The log entry 
will then show the size of the big table, the total size of the other small 
tables, and the threshold for the size of small tables.
Also, can you add a unit test?

Thanks :)

 ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those 
 tables which are not used in the child of this conditional task.
 -

 Key: HIVE-5945
 URL: https://issues.apache.org/jira/browse/HIVE-5945
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Navis
Priority: Critical
 Attachments: HIVE-5945.1.patch.txt, HIVE-5945.2.patch.txt, 
 HIVE-5945.3.patch.txt


 Here is an example
 {code}
 select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
 FROM store_sales
 JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
 JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
 JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
 customer_demographics.cd_demo_sk)
 JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
 where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
 group by
i_item_id,
s_state
 order by
i_item_id,
s_state
 limit 100;
 {code}
 I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
 jobs for this query. However, I got 1 Map-only job (joining store_sales and 
 date_dim) and 3 MR jobs (for reduce joins).
 So, I checked the conditional task determining the plan of the join involving 
 item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
 aliasToFileSizeMap contains all input tables used in this query and the 
 intermediate table generated by joining store_sales and date_dim. So, when we 
 sum the size of all small tables, the size of store_sales (which is around 
 45GB in my test) will also be counted.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2013-12-18 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5945:
---

Status: Open  (was: Patch Available)

 ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those 
 tables which are not used in the child of this conditional task.
 -

 Key: HIVE-5945
 URL: https://issues.apache.org/jira/browse/HIVE-5945
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.12.0, 0.11.0, 0.10.0, 0.9.0, 0.8.0, 0.13.0
Reporter: Yin Huai
Assignee: Navis
Priority: Critical
 Attachments: HIVE-5945.1.patch.txt, HIVE-5945.2.patch.txt, 
 HIVE-5945.3.patch.txt


 Here is an example
 {code}
 select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
 FROM store_sales
 JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
 JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
 JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
 customer_demographics.cd_demo_sk)
 JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
 where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
 group by
i_item_id,
s_state
 order by
i_item_id,
s_state
 limit 100;
 {code}
 I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
 jobs for this query. However, I got 1 Map-only job (joining store_sales and 
 date_dim) and 3 MR jobs (for reduce joins).
 So, I checked the conditional task determining the plan of the join involving 
 item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
 aliasToFileSizeMap contains all input tables used in this query and the 
 intermediate table generated by joining store_sales and date_dim. So, when we 
 sum the size of all small tables, the size of store_sales (which is around 
 45GB in my test) will also be counted.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (HIVE-6043) Document incompatible changes in Hive 0.12 and trunk

2013-12-17 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-6043:
---

Description: 
We need to document incompatible changes. For example

* HIVE-5372 changed the object inspector hierarchy, breaking most if not all 
custom serdes
* HIVE-1511/HIVE-5263 serialize ObjectInspectors with Kryo, breaking all custom 
serdes (fixed by HIVE-5380)
* Hive 0.12 separates MapredWork into MapWork and ReduceWork, which are used by 
serdes
* HIVE-5411 serializes expressions with Kryo, which are used by custom serdes
* HIVE-4827 removed the flag hive.optimize.mapjoin.mapreduce (this flag 
was introduced in Hive 0.11 by HIVE-3952).


  was:
We need to document incompatible changes. For example

* HIVE-5372 changed object inspector hierarchy breaking most if not all custom 
serdes
* HIVE-1511/HIVE-5263 serializes ObjectInspectors with Kryo so all custom 
serdes (fixed by HIVE-5380)
* Hive 0.12 separates MapredWork into MapWork and ReduceWork which is used by 
Serdes
* HIVE-5411 serializes expressions with Kryo which are used by custom serdes



 Document incompatible changes in Hive 0.12 and trunk
 

 Key: HIVE-6043
 URL: https://issues.apache.org/jira/browse/HIVE-6043
 Project: Hive
  Issue Type: Task
Reporter: Brock Noland
Priority: Blocker

 We need to document incompatible changes. For example
 * HIVE-5372 changed the object inspector hierarchy, breaking most if not all 
 custom serdes
 * HIVE-1511/HIVE-5263 serialize ObjectInspectors with Kryo, breaking all 
 custom serdes (fixed by HIVE-5380)
 * Hive 0.12 separates MapredWork into MapWork and ReduceWork, which are used by 
 serdes
 * HIVE-5411 serializes expressions with Kryo, which are used by custom serdes
 * HIVE-4827 removed the flag hive.optimize.mapjoin.mapreduce (this flag 
 was introduced in Hive 0.11 by HIVE-3952).



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (HIVE-6043) Document incompatible changes in Hive 0.12 and trunk

2013-12-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13850759#comment-13850759
 ] 

Yin Huai commented on HIVE-6043:


I added HIVE-4827, which removed the flag hive.optimize.mapjoin.mapreduce.

 Document incompatible changes in Hive 0.12 and trunk
 

 Key: HIVE-6043
 URL: https://issues.apache.org/jira/browse/HIVE-6043
 Project: Hive
  Issue Type: Task
Reporter: Brock Noland
Priority: Blocker

 We need to document incompatible changes. For example
 * HIVE-5372 changed the object inspector hierarchy, breaking most if not all 
 custom serdes
 * HIVE-1511/HIVE-5263 serialize ObjectInspectors with Kryo, breaking all 
 custom serdes (fixed by HIVE-5380)
 * Hive 0.12 separates MapredWork into MapWork and ReduceWork, which are used by 
 serdes
 * HIVE-5411 serializes expressions with Kryo, which are used by custom serdes
 * HIVE-4827 removed the flag hive.optimize.mapjoin.mapreduce (this flag 
 was introduced in Hive 0.11 by HIVE-3952).



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (HIVE-5891) Alias conflict when merging multiple mapjoin tasks into their common child mapred task

2013-12-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13851007#comment-13851007
 ] 

Yin Huai commented on HIVE-5891:


[~sunrui] Sorry for getting back late.

I just took a look at QB. It seems to use aliasToSubq to store the mapping from 
aliases to subquery expressions (QBExpr). A QBExpr in turn stores a QB 
representing the subquery. In this recursive way, the QBs for all levels of the 
query are stored, so parseCtx.getQB() only gets the main query block, whose id 
is null. I am not sure if we can get the right QB (the QB for a subquery) from 
GenMapRedUtils.splitTasks... Can you take a quick look to see if it is easy to 
get the correct QB? If so, we can use the id of a QB to replace INTNAME. If 
not, let's use joinTree.getId for those JoinOperators. It seems we do not need 
to take special care with DemuxOperator. Can you create a review request for 
your patch? I can leave comments on the review board.

Also, since QBJoinTree.getJoinStreamDesc is not used, let's delete it.

 Alias conflict when merging multiple mapjoin tasks into their common child 
 mapred task
 --

 Key: HIVE-5891
 URL: https://issues.apache.org/jira/browse/HIVE-5891
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: Sun Rui
Assignee: Sun Rui
 Attachments: HIVE-5891.1.patch


 Use the following test case with HIVE 0.12:
 {quote}
 create table src(key int, value string);
 load data local inpath 'src/data/files/kv1.txt' overwrite into table src;
 select * from (
   select c.key from
 (select a.key from src a join src b on a.key=b.key group by a.key) tmp
 join src c on tmp.key=c.key
   union all
   select c.key from
 (select a.key from src a join src b on a.key=b.key group by a.key) tmp
 join src c on tmp.key=c.key
 ) x;
 {quote}
 We will get a NullPointerException from Union Operator:
 {quote}
 java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: 
 Hive Runtime Error while processing row {_col0:0}
   at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:175)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
   at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
 Error while processing row {_col0:0}
   at 
 org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:544)
   at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:157)
   ... 4 more
 Caused by: java.lang.NullPointerException
   at 
 org.apache.hadoop.hive.ql.exec.UnionOperator.processOp(UnionOperator.java:120)
   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
   at 
 org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
   at 
 org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:652)
   at 
 org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:655)
   at 
 org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:758)
   at 
 org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:220)
   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
   at 
 org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:91)
   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
   at 
 org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
   ... 5 more
 {quote}
   
 The root cause is in 
 CommonJoinTaskDispatcher.mergeMapJoinTaskIntoItsChildMapRedTask().
   +--+  +--+
   | MapJoin task |  | MapJoin task |
   +--+  +--+
  \ /
   \   /
  +--+
  |  Union task  |
  +--+
  
 CommonJoinTaskDispatcher merges the two MapJoin tasks into their common 
 child: Union task. The two MapJoin tasks have the same alias name for their 
 big 
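The overwrite described above can be shown with a toy sketch (Python, illustrative names only, not Hive's actual data structures): each MapJoin task registers its inputs under an alias in the merged child task's alias-to-work map, and when both tasks use the same alias the second registration silently clobbers the first, so one branch of the union loses its plan and later fails at runtime.

```python
def merge_aliases(child_map, task_aliases):
    """Merge one task's alias->plan entries into the child's map.

    Returns the list of aliases that already existed (i.e. conflicts whose
    previous plan is overwritten by dict.update below).
    """
    conflicts = [a for a in task_aliases if a in child_map]
    child_map.update(task_aliases)
    return conflicts

child = {}
merge_aliases(child, {"$INTNAME": "plan-of-mapjoin-1"})       # first task: no conflict
lost = merge_aliases(child, {"$INTNAME": "plan-of-mapjoin-2"})  # second task: conflict
assert lost == ["$INTNAME"]                       # conflict detected
assert child["$INTNAME"] == "plan-of-mapjoin-2"   # first task's plan was clobbered
```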

[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2013-12-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13851260#comment-13851260
 ] 

Yin Huai commented on HIVE-5945:


Thanks [~navis] :) I left a few comments on the review board. I think the 
conditional task in the current trunk is not well tested. With a .q test file, 
we cannot test whether a conditional task picks the right execution plan, 
because the output of a .q file only shows the plan and the query result. I 
think it is necessary to add a JUnit test for the decision made by 
resolveMapJoinTask. Also, let's add some logs in resolveMapJoinTask. Right now, 
we only have "xx is filtered out by condition resolver." and "xx is selected by 
condition resolver." in ConditionalTask. From these two logs, we cannot tell 
why an execution plan was selected. In resolveMapJoinTask, we can first log the 
sizes of the tables used in the next task and then log why a path was selected.

 ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those 
 tables which are not used in the child of this conditional task.
 -

 Key: HIVE-5945
 URL: https://issues.apache.org/jira/browse/HIVE-5945
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Navis
Priority: Critical
 Attachments: HIVE-5945.1.patch.txt, HIVE-5945.2.patch.txt


 Here is an example
 {code}
 select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
 FROM store_sales
 JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
 JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
 JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
 customer_demographics.cd_demo_sk)
 JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
 where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
 group by
i_item_id,
s_state
 order by
i_item_id,
s_state
 limit 100;
 {code}
 I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
 jobs for this query. However, I got 1 Map-only job (joining store_sales and 
 date_dim) and 3 MR jobs (for reduce joins).
 So, I checked the conditional task determining the plan of the join involving 
 item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
 aliasToFileSizeMap contains all input tables used in this query and the 
 intermediate table generated by joining store_sales and date_dim. So, when we 
 sum the size of all small tables, the size of store_sales (which is around 
 45GB in my test) will also be counted.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2013-12-17 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5945:
---

Status: Open  (was: Patch Available)

 ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those 
 tables which are not used in the child of this conditional task.
 -

 Key: HIVE-5945
 URL: https://issues.apache.org/jira/browse/HIVE-5945
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.12.0, 0.11.0, 0.10.0, 0.9.0, 0.8.0, 0.13.0
Reporter: Yin Huai
Assignee: Navis
Priority: Critical
 Attachments: HIVE-5945.1.patch.txt, HIVE-5945.2.patch.txt


 Here is an example
 {code}
 select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
 FROM store_sales
 JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
 JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
 JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
 customer_demographics.cd_demo_sk)
 JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
 where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
 group by
i_item_id,
s_state
 order by
i_item_id,
s_state
 limit 100;
 {code}
 I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
 jobs for this query. However, I got 1 Map-only job (joining store_sales and 
 date_dim) and 3 MR jobs (for reduce joins).
 So, I checked the conditional task determining the plan of the join involving 
 item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
 aliasToFileSizeMap contains all input tables used in this query and the 
 intermediate table generated by joining store_sales and date_dim. So, when we 
 sum the size of all small tables, the size of store_sales (which is around 
 45GB in my test) will also be counted.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Created] (HIVE-6007) Make the output of the reduce side plan optimized by the correlation optimizer more reader-friendly.

2013-12-11 Thread Yin Huai (JIRA)
Yin Huai created HIVE-6007:
--

 Summary: Make the output of the reduce side plan optimized by the 
correlation optimizer more reader-friendly.
 Key: HIVE-6007
 URL: https://issues.apache.org/jira/browse/HIVE-6007
 Project: Hive
  Issue Type: Sub-task
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Minor


Because a MuxOperator can have multiple parents, the printed plan can show the 
sub-plan rooted at this MuxOperator multiple times, which makes the 
reduce-side plan confusing. An example is shown in 
https://mail-archives.apache.org/mod_mbox/hive-user/201312.mbox/%3CCAO0ZKSjniR0z%2BOt4KWouq236fKXo%3D5nE_Oih7A87e3HiuBsG9w%40mail.gmail.com%3E.




--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2013-12-10 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13844968#comment-13844968
 ] 

Yin Huai commented on HIVE-5945:


Thanks [~navis] for taking this issue. Can you attach the link to the review 
board? Also, I saw 
{code}
+// todo: should nullify summary for non-native tables,
+// not to be selected as a mapjoin target
{code}
in your patch. Does a non-native table mean an intermediate table? If so, I 
think for a conditional task, it's better to keep the option to use the 
intermediate table as the small table.

 ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those 
 tables which are not used in the child of this conditional task.
 -

 Key: HIVE-5945
 URL: https://issues.apache.org/jira/browse/HIVE-5945
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Navis
Priority: Critical
 Attachments: HIVE-5945.1.patch.txt


 Here is an example
 {code}
 select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
 FROM store_sales
 JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
 JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
 JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
 customer_demographics.cd_demo_sk)
 JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
 where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
 group by
i_item_id,
s_state
 order by
i_item_id,
s_state
 limit 100;
 {code}
 I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
 jobs for this query. However, I got 1 Map-only job (joining store_sales and 
 date_dim) and 3 MR jobs (for reduce joins).
 So, I checked the conditional task determining the plan of the join involving 
 item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
 aliasToFileSizeMap contains all input tables used in this query and the 
 intermediate table generated by joining store_sales and date_dim. So, when we 
 sum the size of all small tables, the size of store_sales (which is around 
 45GB in my test) will be also counted.  
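
The fix implied above is to sum only the tables that actually feed the child of the conditional task. Below is a minimal standalone sketch of that intended behavior, not Hive's actual resolver code; the names sumSmallTableSizes and participatingAliases are hypothetical, and only aliasToFileSizeMap mirrors the code under discussion.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class ResolveSketch {
    // Sum the sizes of candidate small tables, counting only aliases that
    // actually participate in the child task of this conditional task.
    static long sumSmallTableSizes(Map<String, Long> aliasToFileSize,
                                   Set<String> participatingAliases,
                                   String bigTableAlias) {
        long total = 0;
        for (Map.Entry<String, Long> e : aliasToFileSize.entrySet()) {
            if (!participatingAliases.contains(e.getKey())) {
                continue; // skip tables not used in the child task
            }
            if (e.getKey().equals(bigTableAlias)) {
                continue; // the big table is streamed, not held in memory
            }
            total += e.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = new HashMap<>();
        sizes.put("store_sales", 45L * 1024 * 1024 * 1024); // ~45GB fact table
        sizes.put("item", 30L * 1024 * 1024);
        sizes.put("$INTNAME", 2L * 1024 * 1024 * 1024);     // intermediate join result
        // Only the intermediate result and item feed the next join stage,
        // so store_sales must not be counted toward the small-table sum.
        long sum = sumSmallTableSizes(sizes, Set.of("$INTNAME", "item"), "$INTNAME");
        System.out.println(sum); // 31457280 (item only)
    }
}
```

With the buggy behavior, store_sales's 45GB would be added to this sum and the map-join condition could never hold.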



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' sizes including those tables which are not used in the child of this conditional task.

2013-12-04 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5945:
---

Affects Version/s: 0.13.0

 ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' 
 sizes including those tables which are not used in the child of this 
 conditional task.
 

 Key: HIVE-5945
 URL: https://issues.apache.org/jira/browse/HIVE-5945
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Yin Huai





--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' sizes including those tables which are not used in the child of this conditional task.

2013-12-04 Thread Yin Huai (JIRA)
Yin Huai created HIVE-5945:
--

 Summary: ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask 
sums all tables' sizes including those tables which are not used in the child 
of this conditional task.
 Key: HIVE-5945
 URL: https://issues.apache.org/jira/browse/HIVE-5945
 Project: Hive
  Issue Type: Bug
Reporter: Yin Huai






--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' sizes including those tables which are not used in the child of this conditional task.

2013-12-04 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5945:
---

Component/s: Query Processor

 ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' 
 sizes including those tables which are not used in the child of this 
 conditional task.
 

 Key: HIVE-5945
 URL: https://issues.apache.org/jira/browse/HIVE-5945
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Yin Huai





--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' sizes including those tables which are not used in the child of this conditional task.

2013-12-04 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5945:
---

Description: 
Here is an example
{code}
select
   i_item_id,
   s_state,
   avg(ss_quantity) agg1,
   avg(ss_list_price) agg2,
   avg(ss_coupon_amt) agg3,
   avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
   cd_gender = 'F' and
   cd_marital_status = 'U' and
   cd_education_status = 'Primary' and
   d_year = 2002 and
   s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
   i_item_id,
   s_state
order by
   i_item_id,
   s_state
limit 100;
{code}
I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
jobs for this query. However, I got 1 Map-only job (joining store_sales and 
date_dim) and 3 MR jobs (for reduce joins).

So, I checked the conditional task determining the plan of the join involving 
item. In 

 ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' 
 sizes including those tables which are not used in the child of this 
 conditional task.
 

 Key: HIVE-5945
 URL: https://issues.apache.org/jira/browse/HIVE-5945
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Yin Huai

 Here is an example
 {code}
 select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
 FROM store_sales
 JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
 JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
 JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
 customer_demographics.cd_demo_sk)
 JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
 where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
 group by
i_item_id,
s_state
 order by
i_item_id,
s_state
 limit 100;
 {code}
 I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
 jobs for this query. However, I got 1 Map-only job (joining store_sales and 
 date_dim) and 3 MR jobs (for reduce joins).
 So, I checked the conditional task determining the plan of the join involving 
 item. In 



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' sizes including those tables which are not used in the child of this conditional task.

2013-12-04 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5945:
---

Description: 
Here is an example
{code}
select
   i_item_id,
   s_state,
   avg(ss_quantity) agg1,
   avg(ss_list_price) agg2,
   avg(ss_coupon_amt) agg3,
   avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
   cd_gender = 'F' and
   cd_marital_status = 'U' and
   cd_education_status = 'Primary' and
   d_year = 2002 and
   s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
   i_item_id,
   s_state
order by
   i_item_id,
   s_state
limit 100;
{code}
I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
jobs for this query. However, I got 1 Map-only job (joining store_sales and 
date_dim) and 3 MR jobs (for reduce joins).

So, I checked the conditional task determining the plan of the join involving 
item. In 
ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, aliasToFileSizeMap 
contains all input tables used in this query and the intermediate table 
generated by joining store_sales and date_dim. So, when we sum the size of all 
small tables, the size of store_sales (which is around 45GB in my test) will be 
also counted.  

  was:
Here is an example
{code}
select
   i_item_id,
   s_state,
   avg(ss_quantity) agg1,
   avg(ss_list_price) agg2,
   avg(ss_coupon_amt) agg3,
   avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
   cd_gender = 'F' and
   cd_marital_status = 'U' and
   cd_education_status = 'Primary' and
   d_year = 2002 and
   s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
   i_item_id,
   s_state
order by
   i_item_id,
   s_state
limit 100;
{code}
I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
jobs for this query. However, I got 1 Map-only job (joining strore_sales and 
date_dim) and 3 MR jobs (for reduce joins).

So, I checked the conditional task determining the plan of the join involving 
item. In 


 ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' 
 sizes including those tables which are not used in the child of this 
 conditional task.
 

 Key: HIVE-5945
 URL: https://issues.apache.org/jira/browse/HIVE-5945
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Yin Huai

 Here is an example
 {code}
 select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
 FROM store_sales
 JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
 JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
 JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
 customer_demographics.cd_demo_sk)
 JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
 where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
 group by
i_item_id,
s_state
 order by
i_item_id,
s_state
 limit 100;
 {code}
 I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
 jobs for this query. However, I got 1 Map-only job (joining store_sales and 
 date_dim) and 3 MR jobs (for reduce joins).
 So, I checked the conditional task determining the plan of the join involving 
 item. In 
 ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, aliasToFileSizeMap 
 contains all input tables used in this query and the intermediate table 
 generated by joining store_sales and date_dim. So, when we sum the size of 
 all small tables, the size of store_sales (which is around 45GB in my test) 
 will be also counted.  



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' sizes including those tables which are not used in the child of this conditional task.

2013-12-04 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5945:
---

Description: 
Here is an example
{code}
select
   i_item_id,
   s_state,
   avg(ss_quantity) agg1,
   avg(ss_list_price) agg2,
   avg(ss_coupon_amt) agg3,
   avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
   cd_gender = 'F' and
   cd_marital_status = 'U' and
   cd_education_status = 'Primary' and
   d_year = 2002 and
   s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
   i_item_id,
   s_state
order by
   i_item_id,
   s_state
limit 100;
{code}
I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
jobs for this query. However, I got 1 Map-only job (joining store_sales and 
date_dim) and 3 MR jobs (for reduce joins).

So, I checked the conditional task determining the plan of the join involving 
item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
aliasToFileSizeMap contains all input tables used in this query and the 
intermediate table generated by joining store_sales and date_dim. So, when we 
sum the size of all small tables, the size of store_sales (which is around 45GB 
in my test) will be also counted.  

  was:
Here is an example
{code}
select
   i_item_id,
   s_state,
   avg(ss_quantity) agg1,
   avg(ss_list_price) agg2,
   avg(ss_coupon_amt) agg3,
   avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
   cd_gender = 'F' and
   cd_marital_status = 'U' and
   cd_education_status = 'Primary' and
   d_year = 2002 and
   s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
   i_item_id,
   s_state
order by
   i_item_id,
   s_state
limit 100;
{code}
I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
jobs for this query. However, I got 1 Map-only job (joining store_sales and 
date_dim) and 3 MR jobs (for reduce joins).

So, I checked the conditional task determining the plan of the join involving 
item. In 
ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, aliasToFileSizeMap 
contains all input tables used in this query and the intermediate table 
generated by joining store_sales and date_dim. So, when we sum the size of all 
small tables, the size of store_sales (which is around 45GB in my test) will be 
also counted.  


 ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' 
 sizes including those tables which are not used in the child of this 
 conditional task.
 

 Key: HIVE-5945
 URL: https://issues.apache.org/jira/browse/HIVE-5945
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Yin Huai

 Here is an example
 {code}
 select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
 FROM store_sales
 JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
 JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
 JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
 customer_demographics.cd_demo_sk)
 JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
 where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
 group by
i_item_id,
s_state
 order by
i_item_id,
s_state
 limit 100;
 {code}
 I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
 jobs for this query. However, I got 1 Map-only job (joining store_sales and 
 date_dim) and 3 MR jobs (for reduce joins).
 So, I checked the conditional task determining the plan of the join involving 
 item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
 aliasToFileSizeMap contains all input tables used in this query and the 
 intermediate table generated by joining store_sales and date_dim. So, when we 
 sum the size of all small tables, the size of store_sales (which is around 
 45GB in my test) will be also counted.  



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' sizes including those tables which are not used in the child of this conditional task.

2013-12-04 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13838991#comment-13838991
 ] 

Yin Huai commented on HIVE-5945:


aliasToFileSizeMap should contain only the aliases used in the next stage, not 
all tables. 
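
That selection step can be sketched as follows; this is a hedged illustration under the assumption of a single small-table size threshold, and pickBigTable is a hypothetical helper, not Hive's actual API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BigTablePickSketch {
    // Pick the big-table candidate among the aliases of the next stage and
    // decide whether the remaining tables fit under the map-join threshold.
    static String pickBigTable(Map<String, Long> nextStageAliasSizes, long threshold) {
        String biggest = null;
        long biggestSize = -1;
        long total = 0;
        for (Map.Entry<String, Long> e : nextStageAliasSizes.entrySet()) {
            total += e.getValue();
            if (e.getValue() > biggestSize) {
                biggestSize = e.getValue();
                biggest = e.getKey();
            }
        }
        // A map join is viable only if everything except the big table is small.
        return (total - biggestSize) <= threshold ? biggest : null;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("$INTNAME", 2_000_000_000L); // intermediate join result
        sizes.put("item", 30_000_000L);        // genuinely small dimension table
        System.out.println(pickBigTable(sizes, 100_000_000L));
    }
}
```

Because the map holds only the next stage's aliases, an unrelated 45GB input like store_sales can no longer poison the sum.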

 ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' 
 sizes including those tables which are not used in the child of this 
 conditional task.
 

 Key: HIVE-5945
 URL: https://issues.apache.org/jira/browse/HIVE-5945
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Yin Huai

 Here is an example
 {code}
 select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
 FROM store_sales
 JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
 JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
 JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
 customer_demographics.cd_demo_sk)
 JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
 where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
 group by
i_item_id,
s_state
 order by
i_item_id,
s_state
 limit 100;
 {code}
 I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
 jobs for this query. However, I got 1 Map-only job (joining store_sales and 
 date_dim) and 3 MR jobs (for reduce joins).
 So, I checked the conditional task determining the plan of the join involving 
 item. In 
 ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, aliasToFileSizeMap 
 contains all input tables used in this query and the intermediate table 
 generated by joining store_sales and date_dim. So, when we sum the size of 
 all small tables, the size of store_sales (which is around 45GB in my test) 
 will be also counted.  



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' sizes including those tables which are not used in the child of this conditional task.

2013-12-04 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5945:
---

Affects Version/s: 0.8.0
   0.9.0
   0.10.0
   0.11.0
   0.12.0

 ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' 
 sizes including those tables which are not used in the child of this 
 conditional task.
 

 Key: HIVE-5945
 URL: https://issues.apache.org/jira/browse/HIVE-5945
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
Reporter: Yin Huai

 Here is an example
 {code}
 select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
 FROM store_sales
 JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
 JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
 JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
 customer_demographics.cd_demo_sk)
 JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
 where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
 group by
i_item_id,
s_state
 order by
i_item_id,
s_state
 limit 100;
 {code}
 I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
 jobs for this query. However, I got 1 Map-only job (joining store_sales and 
 date_dim) and 3 MR jobs (for reduce joins).
 So, I checked the conditional task determining the plan of the join involving 
 item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
 aliasToFileSizeMap contains all input tables used in this query and the 
 intermediate table generated by joining store_sales and date_dim. So, when we 
 sum the size of all small tables, the size of store_sales (which is around 
 45GB in my test) will be also counted.  



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' sizes including those tables which are not used in the child of this conditional task.

2013-12-04 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13839008#comment-13839008
 ] 

Yin Huai commented on HIVE-5945:


Seems this bug was introduced by HIVE-2095. I am marking all affected versions.

 ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' 
 sizes including those tables which are not used in the child of this 
 conditional task.
 

 Key: HIVE-5945
 URL: https://issues.apache.org/jira/browse/HIVE-5945
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
Reporter: Yin Huai

 Here is an example
 {code}
 select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
 FROM store_sales
 JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
 JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
 JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
 customer_demographics.cd_demo_sk)
 JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
 where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
 group by
i_item_id,
s_state
 order by
i_item_id,
s_state
 limit 100;
 {code}
 I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
 jobs for this query. However, I got 1 Map-only job (joining store_sales and 
 date_dim) and 3 MR jobs (for reduce joins).
 So, I checked the conditional task determining the plan of the join involving 
 item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
 aliasToFileSizeMap contains all input tables used in this query and the 
 intermediate table generated by joining store_sales and date_dim. So, when we 
 sum the size of all small tables, the size of store_sales (which is around 
 45GB in my test) will be also counted.  



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2013-12-04 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5945:
---

Summary: ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums 
those tables which are not used in the child of this conditional task.  (was: 
ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' sizes 
including those tables which are not used in the child of this conditional 
task.)

 ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those 
 tables which are not used in the child of this conditional task.
 -

 Key: HIVE-5945
 URL: https://issues.apache.org/jira/browse/HIVE-5945
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
Reporter: Yin Huai

 Here is an example
 {code}
 select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
 FROM store_sales
 JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
 JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
 JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
 customer_demographics.cd_demo_sk)
 JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
 where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
 group by
i_item_id,
s_state
 order by
i_item_id,
s_state
 limit 100;
 {code}
 I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
 jobs for this query. However, I got 1 Map-only job (joining store_sales and 
 date_dim) and 3 MR jobs (for reduce joins).
 So, I checked the conditional task determining the plan of the join involving 
 item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
 aliasToFileSizeMap contains all input tables used in this query and the 
 intermediate table generated by joining store_sales and date_dim. So, when we 
 sum the size of all small tables, the size of store_sales (which is around 
 45GB in my test) will be also counted.  



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (HIVE-5922) In orc.InStream.CompressedStream, the desired position passed to seek can equal offsets[i] + bytes[i].remaining() when ORC predicate pushdown is enabled

2013-12-02 Thread Yin Huai (JIRA)
Yin Huai created HIVE-5922:
--

 Summary: In orc.InStream.CompressedStream, the desired position 
passed to seek can equal offsets[i] + bytes[i].remaining() when ORC predicate 
pushdown is enabled
 Key: HIVE-5922
 URL: https://issues.apache.org/jira/browse/HIVE-5922
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Yin Huai


Two stack traces ...
{code}
java.io.IOException: IO error in map input file 
hdfs://10.38.55.204:8020/user/hive/warehouse/ssdb_bin_compress_orc_large_0_13.db/cycle/04_0
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:210)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.io.IOException: java.io.IOException: Seek outside of data in 
compressed stream Stream for column 9 kind DATA position: 21496054 length: 
33790900 range: 2 offset: 1048588 limit: 1048588 range 0 = 13893791 to 1048588; 
 range 1 = 17039555 to 1310735;  range 2 = 20447466 to 1048588;  range 3 = 
23855377 to 1048588;  range 4 = 27263288 to 1048588;  range 5 = 30409052 to 
1310735 uncompressed: 262144 to 262144 to 21496054
at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
at 
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:276)
at 
org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
at 
org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
at 
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:230)
... 9 more
Caused by: java.io.IOException: Seek outside of data in compressed stream 
Stream for column 9 kind DATA position: 21496054 length: 33790900 range: 2 
offset: 1048588 limit: 1048588 range 0 = 13893791 to 1048588;  range 1 = 
17039555 to 1310735;  range 2 = 20447466 to 1048588;  range 3 = 23855377 to 
1048588;  range 4 = 27263288 to 1048588;  range 5 = 30409052 to 1310735 
uncompressed: 262144 to 262144 to 21496054
at 
org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.seek(InStream.java:328)
at 
org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:161)
at 
org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:205)
at 
org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readInts(SerializationUtils.java:450)
at 
org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readDirectValues(RunLengthIntegerReaderV2.java:240)
at 
org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:53)
at 
org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:288)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$IntTreeReader.next(RecordReaderImpl.java:510)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.next(RecordReaderImpl.java:1581)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:2707)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:110)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:86)
at 
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
... 13 more
{code}

{code}
java.io.IOException: IO error in map input file 
hdfs://10.38.55.204:8020/user/hive/warehouse/ssdb_bin_compress_orc_large_0_13.db/cycle/95_0
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:210)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
at 

[jira] [Commented] (HIVE-5922) In orc.InStream.CompressedStream, the desired position passed to seek can equal offsets[i] + bytes[i].remaining() when ORC predicate pushdown is enabled

2013-12-02 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13837275#comment-13837275
 ] 

Yin Huai commented on HIVE-5922:


For the first trace, the desired position is 21496054, which is exactly the end 
of range 2 = 20447466 to 1048588 (20447466 + 1048588 = 21496054). For the second 
trace, the desired position is 20447466, which is exactly the end of range 6 = 
18612437 to 1835029 (18612437 + 1835029 = 20447466). 

When I turned off predicate pushdown or I used predicate pushdown with 
uncompressed data, I did not see this problem.
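
The arithmetic above suggests the desired position lands exactly on a range's exclusive end. A self-contained sketch of that boundary condition, using the offsets and lengths from the first trace; findRange is a hypothetical helper, not Hive's actual InStream code.

```java
public class SeekSketch {
    // Each range i covers [offsets[i], offsets[i] + lengths[i]).
    // The reported bug: with ORC predicate pushdown the desired position can
    // land exactly on offsets[i] + lengths[i]; a strict '<' comparison then
    // rejects a position that is still a legal resume point.
    static int findRange(long[] offsets, long[] lengths, long desired) {
        for (int i = 0; i < offsets.length; i++) {
            // '<=' (not '<') accepts a position at the exclusive end of range i
            if (desired >= offsets[i] && desired <= offsets[i] + lengths[i]) {
                return i;
            }
        }
        return -1; // would correspond to the "Seek outside of data" IOException
    }

    public static void main(String[] args) {
        // Offsets and lengths of ranges 0-2 from the first stack trace.
        long[] offsets = {13893791L, 17039555L, 20447466L};
        long[] lengths = {1048588L, 1310735L, 1048588L};
        // 21496054 == 20447466 + 1048588: exactly the end of range 2.
        System.out.println(findRange(offsets, lengths, 21496054L)); // 2
    }
}
```

With a strict upper bound this lookup would return -1 for 21496054, matching the exception seen only when predicate pushdown skips ahead to a range boundary.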

 In orc.InStream.CompressedStream, the desired position passed to seek can 
 equal offsets[i] + bytes[i].remaining() when ORC predicate pushdown is enabled
 

 Key: HIVE-5922
 URL: https://issues.apache.org/jira/browse/HIVE-5922
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Yin Huai

 Two stack traces ...
 {code}
 java.io.IOException: IO error in map input file 
 hdfs://10.38.55.204:8020/user/hive/warehouse/ssdb_bin_compress_orc_large_0_13.db/cycle/04_0
   at 
 org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
   at 
 org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:210)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
   at org.apache.hadoop.mapred.Child.main(Child.java:249)
 Caused by: java.io.IOException: java.io.IOException: Seek outside of data in 
 compressed stream Stream for column 9 kind DATA position: 21496054 length: 
 33790900 range: 2 offset: 1048588 limit: 1048588 range 0 = 13893791 to 
 1048588;  range 1 = 17039555 to 1310735;  range 2 = 20447466 to 1048588;  
 range 3 = 23855377 to 1048588;  range 4 = 27263288 to 1048588;  range 5 = 
 30409052 to 1310735 uncompressed: 262144 to 262144 to 21496054
   at 
 org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
   at 
 org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
   at 
 org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:276)
   at 
 org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
   at 
 org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
   at 
 org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
   at 
 org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:230)
   ... 9 more
 Caused by: java.io.IOException: Seek outside of data in compressed stream 
 Stream for column 9 kind DATA position: 21496054 length: 33790900 range: 2 
 offset: 1048588 limit: 1048588 range 0 = 13893791 to 1048588;  range 1 = 
 17039555 to 1310735;  range 2 = 20447466 to 1048588;  range 3 = 23855377 to 
 1048588;  range 4 = 27263288 to 1048588;  range 5 = 30409052 to 1310735 
 uncompressed: 262144 to 262144 to 21496054
   at 
 org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.seek(InStream.java:328)
   at 
 org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:161)
   at 
 org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:205)
   at 
 org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readInts(SerializationUtils.java:450)
   at 
 org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readDirectValues(RunLengthIntegerReaderV2.java:240)
   at 
 org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:53)
   at 
 org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:288)
   at 
 org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$IntTreeReader.next(RecordReaderImpl.java:510)
   at 
 org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.next(RecordReaderImpl.java:1581)
   at 
 org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:2707)
   at 
 org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:110)
   at 
 org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:86)
   at 
 

[jira] [Moved] (HIVE-5910) In HiveConf, the name of mapred.min.split.size.per.rack is MAPREDMINSPLITSIZEPERNODE and the name of mapred.min.split.size.per.node is MAPREDMINSPLITSIZEPERRACK

2013-11-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai moved MAPREDUCE-5659 to HIVE-5910:
---

Key: HIVE-5910  (was: MAPREDUCE-5659)
Project: Hive  (was: Hadoop Map/Reduce)

 In HiveConf, the name of mapred.min.split.size.per.rack is 
 MAPREDMINSPLITSIZEPERNODE and the name of mapred.min.split.size.per.node is 
 MAPREDMINSPLITSIZEPERRACK
 

 Key: HIVE-5910
 URL: https://issues.apache.org/jira/browse/HIVE-5910
 Project: Hive
  Issue Type: Bug
Reporter: Yin Huai

 In HiveConf.java ...
 {code}
 MAPREDMINSPLITSIZEPERNODE("mapred.min.split.size.per.rack", 1L),
 MAPREDMINSPLITSIZEPERRACK("mapred.min.split.size.per.node", 1L),
 {code}
 Then, in ExecDriver.java ...
 {code}
 if (mWork.getMinSplitSizePerNode() != null) {
   HiveConf.setLongVar(job, HiveConf.ConfVars.MAPREDMINSPLITSIZEPERNODE, 
 mWork.getMinSplitSizePerNode().longValue());
 }
  if (mWork.getMinSplitSizePerRack() != null) {
   HiveConf.setLongVar(job, HiveConf.ConfVars.MAPREDMINSPLITSIZEPERRACK, 
 mWork.getMinSplitSizePerRack().longValue());
 }
 {code}
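For reference, the fix would presumably swap the two property strings so each enum constant carries its matching mapred.* key. A standalone sketch (hypothetical class name, not the real HiveConf) with a small consistency check:

```java
// Sketch of the corrected pairing (assumption: the fix swaps the two
// property-name strings). ConfVarCheck is a stand-in, not Hive's HiveConf.
public class ConfVarCheck {
    enum ConfVars {
        // corrected: PERNODE carries per.node, PERRACK carries per.rack
        MAPREDMINSPLITSIZEPERNODE("mapred.min.split.size.per.node", 1L),
        MAPREDMINSPLITSIZEPERRACK("mapred.min.split.size.per.rack", 1L);

        final String varname;
        final long defaultLongVal;

        ConfVars(String varname, long defaultLongVal) {
            this.varname = varname;
            this.defaultLongVal = defaultLongVal;
        }
    }

    /** True when each enum constant's suffix matches its property suffix. */
    static boolean isConsistent() {
        for (ConfVars v : ConfVars.values()) {
            String expected = v.name().endsWith("PERNODE") ? ".per.node" : ".per.rack";
            if (!v.varname.endsWith(expected)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isConsistent() ? "consistent" : "swapped");
    }
}
```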



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-5910) In HiveConf, the name of mapred.min.split.size.per.rack is MAPREDMINSPLITSIZEPERNODE and the name of mapred.min.split.size.per.node is MAPREDMINSPLITSIZEPERRACK

2013-11-30 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13835768#comment-13835768
 ] 

Yin Huai commented on HIVE-5910:


my bad... did not notice the project when I created it... It has been moved to 
Hive. Thanks for letting me know, Ted :)

 In HiveConf, the name of mapred.min.split.size.per.rack is 
 MAPREDMINSPLITSIZEPERNODE and the name of mapred.min.split.size.per.node is 
 MAPREDMINSPLITSIZEPERRACK
 

 Key: HIVE-5910
 URL: https://issues.apache.org/jira/browse/HIVE-5910
 Project: Hive
  Issue Type: Bug
Reporter: Yin Huai

 In HiveConf.java ...
 {code}
 MAPREDMINSPLITSIZEPERNODE("mapred.min.split.size.per.rack", 1L),
 MAPREDMINSPLITSIZEPERRACK("mapred.min.split.size.per.node", 1L),
 {code}
 Then, in ExecDriver.java ...
 {code}
 if (mWork.getMinSplitSizePerNode() != null) {
   HiveConf.setLongVar(job, HiveConf.ConfVars.MAPREDMINSPLITSIZEPERNODE, 
 mWork.getMinSplitSizePerNode().longValue());
 }
  if (mWork.getMinSplitSizePerRack() != null) {
   HiveConf.setLongVar(job, HiveConf.ConfVars.MAPREDMINSPLITSIZEPERRACK, 
 mWork.getMinSplitSizePerRack().longValue());
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-5910) In HiveConf, the name of mapred.min.split.size.per.rack is MAPREDMINSPLITSIZEPERNODE and the name of mapred.min.split.size.per.node is MAPREDMINSPLITSIZEPERRACK

2013-11-30 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13835921#comment-13835921
 ] 

Yin Huai commented on HIVE-5910:


[~leftylev] Actually, these two are MapReduce configurations that seem to be 
used internally by Hive. I am not sure whether we need to add them to our conf 
template.

 In HiveConf, the name of mapred.min.split.size.per.rack is 
 MAPREDMINSPLITSIZEPERNODE and the name of mapred.min.split.size.per.node is 
 MAPREDMINSPLITSIZEPERRACK
 

 Key: HIVE-5910
 URL: https://issues.apache.org/jira/browse/HIVE-5910
 Project: Hive
  Issue Type: Bug
Reporter: Yin Huai

 In HiveConf.java ...
 {code}
 MAPREDMINSPLITSIZEPERNODE("mapred.min.split.size.per.rack", 1L),
 MAPREDMINSPLITSIZEPERRACK("mapred.min.split.size.per.node", 1L),
 {code}
 Then, in ExecDriver.java ...
 {code}
 if (mWork.getMinSplitSizePerNode() != null) {
   HiveConf.setLongVar(job, HiveConf.ConfVars.MAPREDMINSPLITSIZEPERNODE, 
 mWork.getMinSplitSizePerNode().longValue());
 }
  if (mWork.getMinSplitSizePerRack() != null) {
   HiveConf.setLongVar(job, HiveConf.ConfVars.MAPREDMINSPLITSIZEPERRACK, 
 mWork.getMinSplitSizePerRack().longValue());
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-5891) Alias conflict when merging multiple mapjoin tasks into their common child mapred task

2013-11-27 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13833857#comment-13833857
 ] 

Yin Huai commented on HIVE-5891:


I think the main problem is that mergeMapJoinTaskIntoItsChildMapRedTask happens 
in the physical optimization phase, which is after we break the plan using 
GenMapRedUtils. In this case
{code}
while (cplan.getMapWork().getAliasToWork().get(streamDesc) != null) {
  streamDesc = origStreamDesc.concat(String.valueOf(++pos));
}
{code}
will not help because those MapJoins were ReduceJoins and they were in 
different MR jobs. Also, it seems the pattern that triggers the bug looks like this...
{code}
         Union or Join
         /           \
    MapJoin1       MapJoin2
    /      \       /      \
  MR1   small1   MR2   small2
{code}
Here, MR1 and MR2 are two MapReduce jobs which generate intermediate 
datasets, and small1 and small2 are two small tables. When 
mergeMapJoinTaskIntoItsChildMapRedTask attaches MapJoin1 and MapJoin2 to the 
map phase of the job for the Union or Join, MR1 and MR2 have the same alias... 
Actually, I am thinking that the id of a QB may be a good alias for an 
intermediate dataset. Thoughts?
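The collision described above can be sketched with a plain map keyed by alias (class and method names here are hypothetical, not Hive code): when two merged tasks both register their big-table operator tree under the same $INTNAME alias, the second put silently replaces the first.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch (hypothetical names): aliasToWork maps alias -> operator
// tree. Registering two trees under the same alias drops the first one,
// which is the alias conflict this jira describes.
public class AliasCollision {
    static Map<String, String> mergeAliases(String alias1, String work1,
                                            String alias2, String work2) {
        Map<String, String> aliasToWork = new LinkedHashMap<>();
        aliasToWork.put(alias1, work1);   // first MapJoin task's tree
        aliasToWork.put(alias2, work2);   // same key: overwrites the first
        return aliasToWork;
    }

    public static void main(String[] args) {
        Map<String, String> merged =
            mergeAliases("$INTNAME", "MapJoin1-tree", "$INTNAME", "MapJoin2-tree");
        // Only one entry survives, so the merged task sees one parent tree.
        System.out.println(merged.size() + " entry: " + merged.get("$INTNAME"));
    }
}
```

Giving each intermediate dataset a unique alias (for example, one derived from the QB id as suggested above) would keep both map entries distinct.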

I think your change will not affect DemuxOperator because before GenMapRedUtils 
starts to work, Correlation Optimizer (HIVE-2206) has already generated the 
optimized plan. But let's give it a try. Can you try this query and see if 
there is anything wrong?
{code:sql}
set hive.optimize.correlation=true;
SELECT tmp1.key
FROM (SELECT key, value
      FROM src
      GROUP BY key, value) tmp1
JOIN (SELECT key, value
      FROM src
      GROUP BY key, value) tmp2
ON (tmp1.key = tmp2.key)
JOIN (SELECT key
      FROM src
      GROUP BY key) tmp3
ON (tmp2.key = tmp3.key)
GROUP BY tmp1.key
{code}
The plan should have three MR jobs. The first one is used to evaluate tmp1. The 
second is used to evaluate tmp2. And the third one is used to evaluate the join 
of tmp1, tmp2, and tmp3, and gby.


 Alias conflict when merging multiple mapjoin tasks into their common child 
 mapred task
 --

 Key: HIVE-5891
 URL: https://issues.apache.org/jira/browse/HIVE-5891
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: Sun Rui
Assignee: Sun Rui
 Attachments: HIVE-5891.1.patch


 Use the following test case with HIVE 0.12:
 {quote}
 create table src(key int, value string);
 load data local inpath 'src/data/files/kv1.txt' overwrite into table src;
 select * from (
   select c.key from
 (select a.key from src a join src b on a.key=b.key group by a.key) tmp
 join src c on tmp.key=c.key
   union all
   select c.key from
 (select a.key from src a join src b on a.key=b.key group by a.key) tmp
 join src c on tmp.key=c.key
 ) x;
 {quote}
 We will get a NullPointerException from Union Operator:
 {quote}
 java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: 
 Hive Runtime Error while processing row {_col0:0}
   at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:175)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
   at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
 Error while processing row {_col0:0}
   at 
 org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:544)
   at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:157)
   ... 4 more
 Caused by: java.lang.NullPointerException
   at 
 org.apache.hadoop.hive.ql.exec.UnionOperator.processOp(UnionOperator.java:120)
   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
   at 
 org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
   at 
 org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:652)
   at 
 org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:655)
   at 
 org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:758)
   at 
 org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:220)
   at 

[jira] [Commented] (HIVE-5891) Alias conflict when merging multiple mapjoin tasks into their common child mapred task

2013-11-26 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13832779#comment-13832779
 ] 

Yin Huai commented on HIVE-5891:


[~sunrui] What will the plan of the query in the description look like with 
your patch? Will the MapJoins and the Union be executed in the same job? It 
seems those two tmps appearing in the same position in those MapJoins triggered 
the bug. I was wondering whether the ids in those two QBJoinTrees are the same. 
If so, the aliases of those two tables are probably still the same. 0.11 does 
not have this bug because it does not use a single job to evaluate those 
MapJoins and the Union.

I do not think it will affect Demux since Demux is at the reducer side.

btw, I also think $INTNAME is confusing... It seems to be used to represent 
those intermediate results. I'd like a name with a meaningful part that 
describes how these intermediate results are generated and a unique part to 
address the issue shown in this jira.

 Alias conflict when merging multiple mapjoin tasks into their common child 
 mapred task
 --

 Key: HIVE-5891
 URL: https://issues.apache.org/jira/browse/HIVE-5891
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: Sun Rui
Assignee: Sun Rui
 Attachments: HIVE-5891.1.patch


 Use the following test case with HIVE 0.12:
 {quote}
 create table src(key int, value string);
 load data local inpath 'src/data/files/kv1.txt' overwrite into table src;
 select * from (
   select c.key from
 (select a.key from src a join src b on a.key=b.key group by a.key) tmp
 join src c on tmp.key=c.key
   union all
   select c.key from
 (select a.key from src a join src b on a.key=b.key group by a.key) tmp
 join src c on tmp.key=c.key
 ) x;
 {quote}
 We will get a NullPointerException from Union Operator:
 {quote}
 java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: 
 Hive Runtime Error while processing row {_col0:0}
   at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:175)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
   at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
 Error while processing row {_col0:0}
   at 
 org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:544)
   at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:157)
   ... 4 more
 Caused by: java.lang.NullPointerException
   at 
 org.apache.hadoop.hive.ql.exec.UnionOperator.processOp(UnionOperator.java:120)
   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
   at 
 org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
   at 
 org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:652)
   at 
 org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:655)
   at 
 org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:758)
   at 
 org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:220)
   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
   at 
 org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:91)
   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
   at 
 org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
   ... 5 more
 {quote}
   
 The root cause is in 
 CommonJoinTaskDispatcher.mergeMapJoinTaskIntoItsChildMapRedTask().
   +--------------+  +--------------+
   | MapJoin task |  | MapJoin task |
   +--------------+  +--------------+
           \              /
            \            /
           +--------------+
           |  Union task  |
           +--------------+
  
 CommonJoinTaskDispatcher merges the two MapJoin tasks into their common 
 child: Union task. The two MapJoin tasks have the same alias name for their 
 big tables: $INTNAME, which is the name of the temporary table of a join 
 

[jira] [Commented] (HIVE-5891) Alias conflict when merging multiple mapjoin tasks into their common child mapred task

2013-11-26 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13833409#comment-13833409
 ] 

Yin Huai commented on HIVE-5891:


Thanks [~sunrui] for confirming the plan. Will JOIN_INTERMEDIATE give the 
impression that the dataset is an intermediate dataset produced during the 
join instead of an input dataset?

Also, I am sorry that I did not get your question about DemuxOperator. Why is 
DemuxOperator related to this issue? I think Demux is not affected by your 
change since it is an operator at the reducer side. 

 Alias conflict when merging multiple mapjoin tasks into their common child 
 mapred task
 --

 Key: HIVE-5891
 URL: https://issues.apache.org/jira/browse/HIVE-5891
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: Sun Rui
Assignee: Sun Rui
 Attachments: HIVE-5891.1.patch


 Use the following test case with HIVE 0.12:
 {quote}
 create table src(key int, value string);
 load data local inpath 'src/data/files/kv1.txt' overwrite into table src;
 select * from (
   select c.key from
 (select a.key from src a join src b on a.key=b.key group by a.key) tmp
 join src c on tmp.key=c.key
   union all
   select c.key from
 (select a.key from src a join src b on a.key=b.key group by a.key) tmp
 join src c on tmp.key=c.key
 ) x;
 {quote}
 We will get a NullPointerException from Union Operator:
 {quote}
 java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: 
 Hive Runtime Error while processing row {_col0:0}
   at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:175)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
   at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
 Error while processing row {_col0:0}
   at 
 org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:544)
   at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:157)
   ... 4 more
 Caused by: java.lang.NullPointerException
   at 
 org.apache.hadoop.hive.ql.exec.UnionOperator.processOp(UnionOperator.java:120)
   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
   at 
 org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
   at 
 org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:652)
   at 
 org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:655)
   at 
 org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:758)
   at 
 org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:220)
   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
   at 
 org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:91)
   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
   at 
 org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
   ... 5 more
 {quote}
   
 The root cause is in 
 CommonJoinTaskDispatcher.mergeMapJoinTaskIntoItsChildMapRedTask().
   +--------------+  +--------------+
   | MapJoin task |  | MapJoin task |
   +--------------+  +--------------+
           \              /
            \            /
           +--------------+
           |  Union task  |
           +--------------+
  
 CommonJoinTaskDispatcher merges the two MapJoin tasks into their common 
 child: Union task. The two MapJoin tasks have the same alias name for their 
 big tables: $INTNAME, which is the name of the temporary table of a join 
 stream. The aliasToWork map uses alias as key, so eventually only the MapJoin 
 operator tree of one MapJoin task is saved into the aliasToWork map of the 
 Union task, while the MapJoin operator tree of another MapJoin task is lost. 
 As a result, the Union operator won't be initialized because not all of its 
 parents get initialized (the Union operator itself indicates it has two 
 parents, but actually it has only 1 

[jira] [Updated] (HIVE-5697) Correlation Optimizer may generate wrong plans for cases involving outer join

2013-11-09 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5697:
---

Attachment: (was: HIVE-5697.2.patch)

 Correlation Optimizer may generate wrong plans for cases involving outer join
 -

 Key: HIVE-5697
 URL: https://issues.apache.org/jira/browse/HIVE-5697
 Project: Hive
  Issue Type: Sub-task
Affects Versions: 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-5697.1.patch, HIVE-5697.2.patch


 For example,
 {code:sql}
 select x.key, y.value, count(*) from src x right outer join src1 y on 
 (x.key=y.key and x.value=y.value) group by x.key, y.value; 
 {code}
 Correlation optimizer will determine that a single MR job is enough for this 
 query. However, the group-by keys come from both the left and right tables of 
 the right outer join. 
 We will have a wrong result like
 {code}
 NULL  4
 NULL  val_165 1
 NULL  val_193 1
 NULL  val_265 1
 NULL  val_27  1
 NULL  val_409 1
 NULL  val_484 1
 NULL  1
 146   val_146 2
 150   val_150 1
 213   val_213 2
 NULL  1
 238   val_238 2
 255   val_255 2
 273   val_273 3
 278   val_278 2
 311   val_311 3
 NULL  1
 401   val_401 5
 406   val_406 4
 66val_66  1
 98val_98  2
 {code}
 Rows where both x.key and y.value are null may not be grouped.
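The missed grouping can be illustrated with a standalone sketch (illustrative only, not Hive code): if each reducer groups its own partition and no final shuffle on the full group-by key follows, the same key — here the all-NULL key produced by NULL-padded outer-join rows — can appear once per partition instead of being merged into a single group.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: group-by evaluated independently per partition. Without a final
// shuffle on the group-by key, a key present in several partitions yields
// several partial groups rather than one merged group.
public class PartialGroupBy {
    static List<Map<String, Long>> groupPerPartition(List<List<String>> partitions) {
        List<Map<String, Long>> out = new ArrayList<>();
        for (List<String> partition : partitions) {
            Map<String, Long> counts = new LinkedHashMap<>();
            for (String key : partition) {
                counts.merge(key, 1L, Long::sum);   // count rows per group key
            }
            out.add(counts);
        }
        return out;
    }

    public static void main(String[] args) {
        // Two reducers each see NULL-keyed rows (e.g. NULL-padded join output).
        List<Map<String, Long>> result = groupPerPartition(
            List.of(List.of("NULL", "146"), List.of("NULL")));
        // Both partial results contain a "NULL" group: duplicated NULL rows.
        System.out.println(result);
    }
}
```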



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HIVE-5697) Correlation Optimizer may generate wrong plans for cases involving outer join

2013-11-09 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5697:
---

Attachment: HIVE-5697.2.patch

reuploading patch .2

 Correlation Optimizer may generate wrong plans for cases involving outer join
 -

 Key: HIVE-5697
 URL: https://issues.apache.org/jira/browse/HIVE-5697
 Project: Hive
  Issue Type: Sub-task
Affects Versions: 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-5697.1.patch, HIVE-5697.2.patch


 For example,
 {code:sql}
 select x.key, y.value, count(*) from src x right outer join src1 y on 
 (x.key=y.key and x.value=y.value) group by x.key, y.value; 
 {code}
 Correlation optimizer will determine that a single MR job is enough for this 
 query. However, the group-by keys come from both the left and right tables of 
 the right outer join. 
 We will have a wrong result like
 {code}
 NULL  4
 NULL  val_165 1
 NULL  val_193 1
 NULL  val_265 1
 NULL  val_27  1
 NULL  val_409 1
 NULL  val_484 1
 NULL  1
 146   val_146 2
 150   val_150 1
 213   val_213 2
 NULL  1
 238   val_238 2
 255   val_255 2
 273   val_273 3
 278   val_278 2
 311   val_311 3
 NULL  1
 401   val_401 5
 406   val_406 4
 66val_66  1
 98val_98  2
 {code}
 Rows where both x.key and y.value are null may not be grouped.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HIVE-5697) Correlation Optimizer may generate wrong plans for cases involving outer join

2013-10-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5697:
---

Issue Type: Sub-task  (was: Bug)
Parent: HIVE-3667

 Correlation Optimizer may generate wrong plans for cases involving outer join
 -

 Key: HIVE-5697
 URL: https://issues.apache.org/jira/browse/HIVE-5697
 Project: Hive
  Issue Type: Sub-task
Affects Versions: 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Yin Huai

 For example,
 {code:sql}
 select x.key, y.value, count(*) from src x right outer join src1 y on 
 (x.key=y.key and x.value=y.value) group by x.key, y.value; 
 {code}
 Correlation optimizer will determine that a single MR job is enough for this 
 query. However, the group-by keys come from both the left and right tables of 
 the right outer join. 
 We will have a wrong result like
 {code}
 NULL  4
 NULL  val_165 1
 NULL  val_193 1
 NULL  val_265 1
 NULL  val_27  1
 NULL  val_409 1
 NULL  val_484 1
 NULL  1
 146   val_146 2
 150   val_150 1
 213   val_213 2
 NULL  1
 238   val_238 2
 255   val_255 2
 273   val_273 3
 278   val_278 2
 311   val_311 3
 NULL  1
 401   val_401 5
 406   val_406 4
 66val_66  1
 98val_98  2
 {code}
 Rows where both x.key and y.value are null may not be grouped.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (HIVE-5697) Correlation Optimizer may generate wrong plans for cases involving outer join

2013-10-30 Thread Yin Huai (JIRA)
Yin Huai created HIVE-5697:
--

 Summary: Correlation Optimizer may generate wrong plans for cases 
involving outer join
 Key: HIVE-5697
 URL: https://issues.apache.org/jira/browse/HIVE-5697
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Yin Huai


For example,
{code:sql}
select x.key, y.value, count(*) from src x right outer join src1 y on 
(x.key=y.key and x.value=y.value) group by x.key, y.value; 
{code}
Correlation optimizer will determine that a single MR job is enough for this 
query. However, the group-by keys come from both the left and right tables of 
the right outer join. 

We will have a wrong result like
{code}
NULL4
NULLval_165 1
NULLval_193 1
NULLval_265 1
NULLval_27  1
NULLval_409 1
NULLval_484 1
NULL1
146 val_146 2
150 val_150 1
213 val_213 2
NULL1
238 val_238 2
255 val_255 2
273 val_273 3
278 val_278 2
311 val_311 3
NULL1
401 val_401 5
406 val_406 4
66  val_66  1
98  val_98  2
{code}
Rows where both x.key and y.value are null may not be grouped.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HIVE-5697) Correlation Optimizer may generate wrong plans for cases involving outer join

2013-10-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5697:
---

Attachment: HIVE-5697.1.patch

Will add test later.

 Correlation Optimizer may generate wrong plans for cases involving outer join
 -

 Key: HIVE-5697
 URL: https://issues.apache.org/jira/browse/HIVE-5697
 Project: Hive
  Issue Type: Sub-task
Affects Versions: 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-5697.1.patch


 For example,
 {code:sql}
 select x.key, y.value, count(*) from src x right outer join src1 y on 
 (x.key=y.key and x.value=y.value) group by x.key, y.value; 
 {code}
 Correlation optimizer will determine that a single MR job is enough for this 
 query. However, the group-by keys come from both the left and right tables of 
 the right outer join. 
 We will have a wrong result like
 {code}
 NULL  4
 NULL  val_165 1
 NULL  val_193 1
 NULL  val_265 1
 NULL  val_27  1
 NULL  val_409 1
 NULL  val_484 1
 NULL  1
 146   val_146 2
 150   val_150 1
 213   val_213 2
 NULL  1
 238   val_238 2
 255   val_255 2
 273   val_273 3
 278   val_278 2
 311   val_311 3
 NULL  1
 401   val_401 5
 406   val_406 4
 66val_66  1
 98val_98  2
 {code}
 Rows where both x.key and y.value are null may not be grouped.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HIVE-5697) Correlation Optimizer may generate wrong plans for cases involving outer join

2013-10-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5697:
---

Status: Patch Available  (was: Open)

 Correlation Optimizer may generate wrong plans for cases involving outer join
 -

 Key: HIVE-5697
 URL: https://issues.apache.org/jira/browse/HIVE-5697
 Project: Hive
  Issue Type: Sub-task
Affects Versions: 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-5697.1.patch, HIVE-5697.2.patch


 For example,
 {code:sql}
 select x.key, y.value, count(*) from src x right outer join src1 y on 
 (x.key=y.key and x.value=y.value) group by x.key, y.value; 
 {code}
 Correlation optimizer will determine that a single MR job is enough for this 
 query. However, the group-by keys come from both the left and right tables of 
 the right outer join. 
 We will have a wrong result like
 {code}
 NULL  4
 NULL  val_165 1
 NULL  val_193 1
 NULL  val_265 1
 NULL  val_27  1
 NULL  val_409 1
 NULL  val_484 1
 NULL  1
 146   val_146 2
 150   val_150 1
 213   val_213 2
 NULL  1
 238   val_238 2
 255   val_255 2
 273   val_273 3
 278   val_278 2
 311   val_311 3
 NULL  1
 401   val_401 5
 406   val_406 4
 66val_66  1
 98val_98  2
 {code}
 Rows where both x.key and y.value are null may not be grouped.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HIVE-5697) Correlation Optimizer may generate wrong plans for cases involving outer join

2013-10-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5697:
---

Attachment: HIVE-5697.2.patch

added a test query

 Correlation Optimizer may generate wrong plans for cases involving outer join
 -

 Key: HIVE-5697
 URL: https://issues.apache.org/jira/browse/HIVE-5697
 Project: Hive
  Issue Type: Sub-task
Affects Versions: 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-5697.1.patch, HIVE-5697.2.patch


 For example,
 {code:sql}
 select x.key, y.value, count(*) from src x right outer join src1 y on 
 (x.key=y.key and x.value=y.value) group by x.key, y.value; 
 {code}
 Correlation optimizer will determine that a single MR job is enough for this 
 query. However, the group-by keys come from both the left and right tables of 
 the right outer join. 
 We will have a wrong result like
 {code}
 NULL  4
 NULL  val_165 1
 NULL  val_193 1
 NULL  val_265 1
 NULL  val_27  1
 NULL  val_409 1
 NULL  val_484 1
 NULL  1
 146   val_146 2
 150   val_150 1
 213   val_213 2
 NULL  1
 238   val_238 2
 255   val_255 2
 273   val_273 3
 278   val_278 2
 311   val_311 3
 NULL  1
 401   val_401 5
 406   val_406 4
 66val_66  1
 98val_98  2
 {code}
 Rows where both x.key and y.value are null may not be grouped.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HIVE-5697) Correlation Optimizer may generate wrong plans for cases involving outer join

2013-10-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5697:
---

Attachment: (was: HIVE-5697.2.patch)

 Correlation Optimizer may generate wrong plans for cases involving outer join
 -

 Key: HIVE-5697
 URL: https://issues.apache.org/jira/browse/HIVE-5697
 Project: Hive
  Issue Type: Sub-task
Affects Versions: 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-5697.1.patch


 For example,
 {code:sql}
 select x.key, y.value, count(*) from src x right outer join src1 y on 
 (x.key=y.key and x.value=y.value) group by x.key, y.value; 
 {code}
 Correlation optimizer will determine that a single MR job is enough for this 
 query. However, the group-by keys come from both the left and right tables of 
 the right outer join. 
 We will have a wrong result like
 {code}
 NULL  4
 NULL  val_165 1
 NULL  val_193 1
 NULL  val_265 1
 NULL  val_27  1
 NULL  val_409 1
 NULL  val_484 1
 NULL  1
 146   val_146 2
 150   val_150 1
 213   val_213 2
 NULL  1
 238   val_238 2
 255   val_255 2
 273   val_273 3
 278   val_278 2
 311   val_311 3
 NULL  1
 401   val_401 5
 406   val_406 4
 66val_66  1
 98val_98  2
 {code}
 Rows where both x.key and y.value are null may not be grouped.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HIVE-5697) Correlation Optimizer may generate wrong plans for cases involving outer join

2013-10-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5697:
---

Status: Open  (was: Patch Available)

 Correlation Optimizer may generate wrong plans for cases involving outer join
 -

 Key: HIVE-5697
 URL: https://issues.apache.org/jira/browse/HIVE-5697
 Project: Hive
  Issue Type: Sub-task
Affects Versions: 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-5697.1.patch







[jira] [Updated] (HIVE-5697) Correlation Optimizer may generate wrong plans for cases involving outer join

2013-10-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5697:
---

Attachment: HIVE-5697.2.patch

 Correlation Optimizer may generate wrong plans for cases involving outer join
 -

 Key: HIVE-5697
 URL: https://issues.apache.org/jira/browse/HIVE-5697
 Project: Hive
  Issue Type: Sub-task
Affects Versions: 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-5697.1.patch, HIVE-5697.2.patch







[jira] [Updated] (HIVE-5697) Correlation Optimizer may generate wrong plans for cases involving outer join

2013-10-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5697:
---

Status: Patch Available  (was: Open)

 Correlation Optimizer may generate wrong plans for cases involving outer join
 -

 Key: HIVE-5697
 URL: https://issues.apache.org/jira/browse/HIVE-5697
 Project: Hive
  Issue Type: Sub-task
Affects Versions: 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-5697.1.patch, HIVE-5697.2.patch







[jira] [Commented] (HIVE-5610) Merge maven branch into trunk

2013-10-24 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13804341#comment-13804341
 ] 

Yin Huai commented on HIVE-5610:


I'm not an expert on maven. Here is what I tried...
I first tried 
{code}
mvn clean package -DskipTests
{code}
Then, I got the following error when maven was compiling Hive common
{code}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on 
project hive-common: Compilation failure: Compilation failure:
[ERROR] 
/home/yhuai/Projects/Hive/hive-trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:[43,36]
 package org.apache.hadoop.hive.shims does not exist
[ERROR] 
/home/yhuai/Projects/Hive/hive-trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:[1027,5]
 cannot find symbol
[ERROR] symbol  : variable ShimLoader
[ERROR] location: class org.apache.hadoop.hive.conf.HiveConf
[ERROR] 
/home/yhuai/Projects/Hive/hive-trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:[1271,34]
 cannot find symbol
[ERROR] symbol  : variable ShimLoader
[ERROR] location: class org.apache.hadoop.hive.conf.HiveConf
[ERROR] - [Help 1]
{code} 
After I checked the shims jars, I found the classes were not packed into those 
jars because of the directory structure. So, I set the source dirs in the shims 
pom files, e.g. 
{code}
<build>
  <sourceDirectory>${basedir}/../src/common-secure/java</sourceDirectory>
  <testSourceDirectory>${basedir}/../src/common-secure/test</testSourceDirectory>
</build>
{code}
Then, I got errors when maven was compiling tests of common-secure. For example,
{code}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.1:testCompile 
(default-testCompile) on project common-secure: Compilation failure: 
Compilation failure:
[ERROR] 
/home/yhuai/Projects/Hive/hive-trunk/shims/common-secure/../src/common-secure/test/org/apache/hadoop/hive/thrift/TestDBTokenStore.java:[26,54]
 package org.apache.hadoop.hive.metastore.HiveMetaStore does not exist
{code}
So, I asked maven to not compile tests 
{code}
mvn clean install -Dmaven.test.skip=true
{code}
Then, I got 
{code}
[ERROR] Failed to execute goal on project hive-service: Could not resolve 
dependencies for project org.apache.hive:hive-service:jar:0.13.0-SNAPSHOT: 
Could not find artifact org.apache.hive:hive-exec:jar:tests:0.13.0-SNAPSHOT - 
[Help 1]
{code}
It seems the scope of hive-exec:jar:tests:0.13.0-SNAPSHOT in hive-service is test. 
Why did maven still try to resolve this dependency?

 Merge maven branch into trunk
 -

 Key: HIVE-5610
 URL: https://issues.apache.org/jira/browse/HIVE-5610
 Project: Hive
  Issue Type: Sub-task
Reporter: Brock Noland
Assignee: Brock Noland

 With HIVE-5566 nearing completion we will be nearly ready to merge the maven 
 branch to trunk. The following tasks will be done post-merge:
 * HIVE-5611 - Add assembly (i.e.) tar creation to pom
 * HIVE-5612 - Add ability to re-generate generated code stored in source 
 control
 The merge process will be as follows:
 1) svn merge ^/hive/branches/maven
 2) Commit result
 3) Modify the following line in maven-rollforward.sh:
 {noformat}
   mv $source $target
 {noformat}
 to
 {noformat}
   svn mv $source $target
 {noformat}
 4) Execute maven-rollforward.sh
 5) Commit result 
 6) Update trunk-mr1.properties and trunk-mr2.properties on the ptesting host, 
 adding the following:
 {noformat}
 mavenEnvOpts = -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128 
 testCasePropertyName = test
 buildTool = maven
 unitTests.directories = ./
 {noformat}
 Notes:
 * To build everything you must:
 {noformat}
 $ mvn clean install -DskipTests
 $ cd itests
 $ mvn clean install -DskipTests
 {noformat}
 because itests (any tests that have cyclical dependencies or require that the 
 packages be built) is not part of the root reactor build.





[jira] [Commented] (HIVE-5610) Merge maven branch into trunk

2013-10-24 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13804379#comment-13804379
 ] 

Yin Huai commented on HIVE-5610:


my bad... I did not notice that... 

Tried again. The build worked great. Thanks Brock :)

 Merge maven branch into trunk
 -

 Key: HIVE-5610
 URL: https://issues.apache.org/jira/browse/HIVE-5610
 Project: Hive
  Issue Type: Sub-task
Reporter: Brock Noland
Assignee: Brock Noland






[jira] [Updated] (HIVE-5592) Add an option to convert enum as struct<value:int> as of Hive 0.8

2013-10-21 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5592:
---

Description: 
HIVE-3323 introduced the incompatible change: Hive handling of enum types has 
been changed to always return the string value rather than struct<value:int>. 
But it didn't add the option hive.data.convert.enum.to.string as planned and 
thus broke all Enum usage prior to 0.10.


  was:
HIVE-3222 introduced the incompatible change: Hive handling of enum types has 
been changed to always return the string value rather than struct<value:int>. 
But it didn't add the option hive.data.convert.enum.to.string as planned and 
thus broke all Enum usage prior to 0.10.



 Add an option to convert enum as struct<value:int> as of Hive 0.8
 -

 Key: HIVE-5592
 URL: https://issues.apache.org/jira/browse/HIVE-5592
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.10.0, 0.11.0, 0.12.0
Reporter: Jie Li

 HIVE-3323 introduced the incompatible change: Hive handling of enum types has 
 been changed to always return the string value rather than struct<value:int>. 
 But it didn't add the option hive.data.convert.enum.to.string as planned 
 and thus broke all Enum usage prior to 0.10.
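A toy Python sketch (hypothetical enum, not Hive's actual serde code) of the two representations in question: pre-change, an enum column surfaced as a struct<value:int>; post-HIVE-3323, as its string name.

```python
# Hypothetical enum used only for illustration; Hive's real behaviour is
# driven by its Thrift serde, not Python's enum module.
from enum import Enum

class Color(Enum):
    RED = 1
    GREEN = 2

def enum_as_struct(e):
    """Old (Hive 0.8) style: enum surfaces as struct<value:int>."""
    return {"value": e.value}

def enum_as_string(e):
    """Post-HIVE-3323 style: enum surfaces as its string name."""
    return e.name
```

Queries written against the old struct shape (e.g. accessing `col.value`) break once the column becomes a plain string, which is why a compatibility option was requested.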





[jira] [Created] (HIVE-5546) A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-HIVE-5391

2013-10-15 Thread Yin Huai (JIRA)
Yin Huai created HIVE-5546:
--

 Summary: A change in ORCInputFormat made by HIVE-4113 was reverted 
by HIVE-HIVE-5391
 Key: HIVE-5546
 URL: https://issues.apache.org/jira/browse/HIVE-5546
 Project: Hive
  Issue Type: Bug
Reporter: Yin Huai
Assignee: Yin Huai


{code}
2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
included column ids = 
2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
included columns names = 
2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
No ORC pushdown predicate
2013-10-15 10:49:49,834 INFO 
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file 
hdfs://localhost:54310/user/hive/warehouse/web_sales_orc/00_0
2013-10-15 10:49:49,834 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2013-10-15 10:49:49,840 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2013-10-15 10:49:49,968 INFO org.apache.hadoop.mapred.TaskLogsTruncater: 
Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: 
Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: Got 
UserName yhuai for UID 1000 from the native implementation
2013-10-15 10:49:49,996 FATAL org.apache.hadoop.mapred.Child: Error running 
child : java.lang.OutOfMemoryError: Java heap space
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:949)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
{code}

If includedColumnIds is an empty list, we do not need to read any columns.





[jira] [Updated] (HIVE-5546) A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391

2013-10-15 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5546:
---

Summary: A change in ORCInputFormat made by HIVE-4113 was reverted by 
HIVE-5391  (was: A change in ORCInputFormat made by HIVE-4113 was reverted by 
HIVE-HIVE-5391)

 A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391
 --

 Key: HIVE-5546
 URL: https://issues.apache.org/jira/browse/HIVE-5546
 Project: Hive
  Issue Type: Bug
Reporter: Yin Huai
Assignee: Yin Huai






[jira] [Updated] (HIVE-5546) A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391

2013-10-15 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5546:
---

Description: 
{code}
2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
included column ids = 
2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
included columns names = 
2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
No ORC pushdown predicate
2013-10-15 10:49:49,834 INFO 
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file 
hdfs://localhost:54310/user/hive/warehouse/web_sales_orc/00_0
2013-10-15 10:49:49,834 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2013-10-15 10:49:49,840 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2013-10-15 10:49:49,968 INFO org.apache.hadoop.mapred.TaskLogsTruncater: 
Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: 
Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: Got 
UserName yhuai for UID 1000 from the native implementation
2013-10-15 10:49:49,996 FATAL org.apache.hadoop.mapred.Child: Error running 
child : java.lang.OutOfMemoryError: Java heap space
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:949)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
{code}

If includedColumnIds is an empty list, we do not need to read any column. But, 
in 
{code}
if (ColumnProjectionUtils.isReadAllColumns(conf) ||
  includedStr == null || includedStr.trim().length() == 0) {
  return null;
} 
{code}

  was:
{code}
2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
included column ids = 
2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
included columns names = 
2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
No ORC pushdown predicate
2013-10-15 10:49:49,834 INFO 
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file 
hdfs://localhost:54310/user/hive/warehouse/web_sales_orc/00_0
2013-10-15 10:49:49,834 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2013-10-15 10:49:49,840 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2013-10-15 10:49:49,968 INFO org.apache.hadoop.mapred.TaskLogsTruncater: 
Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: 
Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: Got 
UserName yhuai for UID 1000 from the native implementation
2013-10-15 10:49:49,996 FATAL org.apache.hadoop.mapred.Child: Error running 
child : java.lang.OutOfMemoryError: Java heap space
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:949)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
{code}

If includedColumnIds is an empty list, we do not need to read any column


 A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391
 --

 Key: HIVE-5546
 URL: https://issues.apache.org/jira/browse/HIVE-5546
 Project: Hive
  Issue Type: Bug
Reporter: Yin Huai
Assignee: Yin Huai


[jira] [Updated] (HIVE-5546) A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391

2013-10-15 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5546:
---

Description: 
{code}
2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
included column ids = 
2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
included columns names = 
2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
No ORC pushdown predicate
2013-10-15 10:49:49,834 INFO 
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file 
hdfs://localhost:54310/user/hive/warehouse/web_sales_orc/00_0
2013-10-15 10:49:49,834 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2013-10-15 10:49:49,840 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2013-10-15 10:49:49,968 INFO org.apache.hadoop.mapred.TaskLogsTruncater: 
Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: 
Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: Got 
UserName yhuai for UID 1000 from the native implementation
2013-10-15 10:49:49,996 FATAL org.apache.hadoop.mapred.Child: Error running 
child : java.lang.OutOfMemoryError: Java heap space
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:949)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
{code}

If includedColumnIds is an empty list, we do not need to read any columns. But, 
right now, in OrcInputFormat.findIncludedColumns, we have ...
{code}
if (ColumnProjectionUtils.isReadAllColumns(conf) ||
  includedStr == null || includedStr.trim().length() == 0) {
  return null;
} 
{code}
If includedStr is an empty string, the code assumes that we need all columns, 
which is not correct.

  was:
{code}
2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
included column ids = 
2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
included columns names = 
2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
No ORC pushdown predicate
2013-10-15 10:49:49,834 INFO 
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file 
hdfs://localhost:54310/user/hive/warehouse/web_sales_orc/00_0
2013-10-15 10:49:49,834 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2013-10-15 10:49:49,840 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2013-10-15 10:49:49,968 INFO org.apache.hadoop.mapred.TaskLogsTruncater: 
Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: 
Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: Got 
UserName yhuai for UID 1000 from the native implementation
2013-10-15 10:49:49,996 FATAL org.apache.hadoop.mapred.Child: Error running 
child : java.lang.OutOfMemoryError: Java heap space
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:949)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
{code}

If includedColumnIds is an empty list, we do not need to read any column. But, 
in 
{code}
if (ColumnProjectionUtils.isReadAllColumns(conf) ||
  includedStr == null || includedStr.trim().length() == 0) {
  return null;
} 
{code}


 A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391
 --

 Key: HIVE-5546
 URL: https://issues.apache.org/jira/browse/HIVE-5546
 Project: Hive
  Issue Type: Bug
Reporter: Yin Huai
Assignee: Yin Huai


[jira] [Commented] (HIVE-5546) A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391

2013-10-15 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13795280#comment-13795280
 ] 

Yin Huai commented on HIVE-5546:


Based on my understanding, I think that includedStr in 
OrcInputFormat.findIncludedColumns(List<Type>, Configuration) is null if and 
only if ColumnProjectionUtils.isReadAllColumns(conf) returns true. 
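A hedged Python sketch (not the actual Java in OrcInputFormat) of the distinction the fix needs: a null projection string means "read all columns", while an explicitly empty one means "read no columns".

```python
def find_included_columns(read_all_columns, included_str, num_columns):
    """Return None for 'read all columns', else a boolean include mask.

    Sketch only: mirrors the intent of OrcInputFormat.findIncludedColumns,
    not its real signature or types.
    """
    if read_all_columns or included_str is None:
        return None                           # no projection configured: read everything
    ids = [int(s) for s in included_str.split(",") if s.strip()]
    include = [False] * num_columns
    for i in ids:
        include[i] = True
    return include                            # "" -> all-False mask: read no columns

# The buggy check also returned None for "", silently reading every column.
```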

 A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391
 --

 Key: HIVE-5546
 URL: https://issues.apache.org/jira/browse/HIVE-5546
 Project: Hive
  Issue Type: Bug
Reporter: Yin Huai
Assignee: Yin Huai






[jira] [Updated] (HIVE-5546) A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391

2013-10-15 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5546:
---

Attachment: HIVE-5546.1.patch

Tried both
{code}
select count(1) from web_sales_orc;
{code}
and 
{code}
select count(*) from web_sales_orc;
{code}

Here are the results on a sf=1 TPC-DS dataset.
{code}
MapReduce Jobs Launched: 
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 3.96 sec   HDFS Read: 17112 HDFS 
Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 960 msec
{code}

 A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391
 --

 Key: HIVE-5546
 URL: https://issues.apache.org/jira/browse/HIVE-5546
 Project: Hive
  Issue Type: Bug
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-5546.1.patch







[jira] [Updated] (HIVE-5546) A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391

2013-10-15 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5546:
---

Status: Patch Available  (was: Open)

[~sershe] [~ashutoshc] Can you take a look?

 A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391
 --

 Key: HIVE-5546
 URL: https://issues.apache.org/jira/browse/HIVE-5546
 Project: Hive
  Issue Type: Bug
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-5546.1.patch


 which is not correct.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-2419) CREATE TABLE AS SELECT should create warehouse directory

2013-10-15 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13795317#comment-13795317
 ] 

Yin Huai commented on HIVE-2419:


Seems MoveTask.moveFile(Path, Path, boolean) throws this exception when it is 
trying to rename the path.

 CREATE TABLE AS SELECT should create warehouse directory
 

 Key: HIVE-2419
 URL: https://issues.apache.org/jira/browse/HIVE-2419
 Project: Hive
  Issue Type: Bug
Reporter: David Phillips
 Attachments: HIVE-2419.1.patch


 If you run a CTAS statement on a fresh Hive install without a warehouse 
 directory (as is the case with Amazon EMR), it runs the query but errors out 
 at the end:
 {quote}
 hive> create table foo as select * from t_message limit 1;
 Total MapReduce jobs = 1
 Launching Job 1 out of 1
 ...
 Ended Job = job_201108301753_0001
 Moving data to: 
 hdfs://ip-10-202-22-194.ec2.internal:9000/mnt/hive_07_1/warehouse/foo
 Failed with exception Unable to rename: 
 hdfs://ip-10-202-22-194.ec2.internal:9000/mnt/var/lib/hive_07_1/tmp/scratch/hive_2011-08-30_18-04-36_809_6130923980133666976/-ext-10001
  to: hdfs://ip-10-202-22-194.ec2.internal:9000/mnt/hive_07_1/warehouse/foo
 FAILED: Execution Error, return code 1 from 
 org.apache.hadoop.hive.ql.exec.MoveTask
 {quote}
 This is different behavior from a simple CREATE TABLE, which creates the 
 warehouse directory.
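A minimal standalone sketch of the fix direction (using java.nio for self-containment; the real MoveTask would use Hadoop's FileSystem mkdirs/rename, and all names here are illustrative): create the destination's parent, i.e. the warehouse directory, before renaming, so the move cannot fail merely because the warehouse directory does not exist yet:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SafeMove {
    // Sketch: before renaming scratch output to its final location,
    // make sure the parent ("warehouse") directory exists. The real
    // fix would use Hadoop's FileSystem.mkdirs()/rename(), not java.nio.
    static void moveToWarehouse(Path src, Path dest) throws IOException {
        Files.createDirectories(dest.getParent()); // create warehouse dir if missing
        Files.move(src, dest, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("hive-sketch");
        Path scratch = Files.createDirectory(tmp.resolve("scratch"));
        Files.write(scratch.resolve("part-0"), "data".getBytes());
        // destination parent ("warehouse") does not exist yet
        Path dest = tmp.resolve("warehouse").resolve("foo");
        moveToWarehouse(scratch, dest);
        System.out.println(Files.exists(dest.resolve("part-0"))); // prints true
    }
}
```

Without the `createDirectories` call, the rename fails exactly as in the quoted error, because the destination's parent directory is missing on a fresh install.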



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HIVE-5546) A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391

2013-10-15 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5546:
---

Affects Version/s: 0.13.0

 A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391
 --

 Key: HIVE-5546
 URL: https://issues.apache.org/jira/browse/HIVE-5546
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.13.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-5546.1.patch


 {code}
 2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
 included column ids = 
 2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
 included columns names = 
 2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
 No ORC pushdown predicate
 2013-10-15 10:49:49,834 INFO 
 org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file 
 hdfs://localhost:54310/user/hive/warehouse/web_sales_orc/00_0
 2013-10-15 10:49:49,834 INFO org.apache.hadoop.mapred.MapTask: 
 numReduceTasks: 1
 2013-10-15 10:49:49,840 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 
 100
 2013-10-15 10:49:49,968 INFO org.apache.hadoop.mapred.TaskLogsTruncater: 
 Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
 2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: 
 Initialized cache for UID to User mapping with a cache timeout of 14400 
 seconds.
 2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: Got 
 UserName yhuai for UID 1000 from the native implementation
 2013-10-15 10:49:49,996 FATAL org.apache.hadoop.mapred.Child: Error running 
 child : java.lang.OutOfMemoryError: Java heap space
   at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:949)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
   at org.apache.hadoop.mapred.Child.main(Child.java:249)
 {code}
 If includedColumnIds is an empty list, we do not need to read any column. 
 But, right now, in OrcInputFormat.findIncludedColumns, we have ...
 {code}
 if (ColumnProjectionUtils.isReadAllColumns(conf) ||
   includedStr == null || includedStr.trim().length() == 0) {
   return null;
 } 
 {code}
 If includedStr is an empty string, the code assumes that we need all columns, 
 which is not correct.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HIVE-5546) A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391

2013-10-15 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5546:
---

Attachment: HIVE-5546.2.patch

Sure. I have removed includedStr (I kept the log line "included column ids ="). 
Thanks Sergey :)

 A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391
 --

 Key: HIVE-5546
 URL: https://issues.apache.org/jira/browse/HIVE-5546
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.13.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-5546.1.patch, HIVE-5546.2.patch


 {code}
 2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
 included column ids = 
 2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
 included columns names = 
 2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
 No ORC pushdown predicate
 2013-10-15 10:49:49,834 INFO 
 org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file 
 hdfs://localhost:54310/user/hive/warehouse/web_sales_orc/00_0
 2013-10-15 10:49:49,834 INFO org.apache.hadoop.mapred.MapTask: 
 numReduceTasks: 1
 2013-10-15 10:49:49,840 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 
 100
 2013-10-15 10:49:49,968 INFO org.apache.hadoop.mapred.TaskLogsTruncater: 
 Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
 2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: 
 Initialized cache for UID to User mapping with a cache timeout of 14400 
 seconds.
 2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: Got 
 UserName yhuai for UID 1000 from the native implementation
 2013-10-15 10:49:49,996 FATAL org.apache.hadoop.mapred.Child: Error running 
 child : java.lang.OutOfMemoryError: Java heap space
   at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:949)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
   at org.apache.hadoop.mapred.Child.main(Child.java:249)
 {code}
 If includedColumnIds is an empty list, we do not need to read any column. 
 But, right now, in OrcInputFormat.findIncludedColumns, we have ...
 {code}
 if (ColumnProjectionUtils.isReadAllColumns(conf) ||
   includedStr == null || includedStr.trim().length() == 0) {
   return null;
 } 
 {code}
 If includedStr is an empty string, the code assumes that we need all columns, 
 which is not correct.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-5245) hive create table as select(CTAS) can not work(not support) with join on operator

2013-10-12 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13793439#comment-13793439
 ] 

Yin Huai commented on HIVE-5245:


I meant that you can try Hive trunk and see if the error also exists there. If 
it does, we need to find a way to reproduce it.

 hive create table as select(CTAS) can not work(not support) with join on 
 operator
 -

 Key: HIVE-5245
 URL: https://issues.apache.org/jira/browse/HIVE-5245
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2
Affects Versions: 0.11.0
Reporter: jeff little
  Labels: CTAS, hive
   Original Estimate: 96h
  Remaining Estimate: 96h

 Hello everyone, recently I came across a Hive problem, as below:
 hive (test)> create table test_09 as
 select a.* from test_01 a
 join test_02 b
 on (a.id=b.id);
 Automatically selecting local only mode for query
 Total MapReduce jobs = 2
 setting HADOOP_USER_NAME hadoop
 13/09/09 17:22:36 WARN conf.Configuration: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10008/jobconf.xml:a
  attempt to override final parameter: mapred.system.dir;  Ignoring.
 13/09/09 17:22:36 WARN conf.Configuration: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10008/jobconf.xml:a
  attempt to override final parameter: mapred.local.dir;  Ignoring.
 Execution log at: /tmp/hadoop/.log
 2013-09-09 05:22:36 Starting to launch local task to process map join;
   maximum memory = 932118528
 2013-09-09 05:22:37 Processing rows:4   Hashtable size: 4 
   Memory usage:   113068056   rate:   0.121
 2013-09-09 05:22:37 Dump the hashtable into file: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10005/HashTable-Stage-6/MapJoin-mapfile90--.hashtable
 2013-09-09 05:22:37 Upload 1 File to: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10005/HashTable-Stage-6/MapJoin-mapfile90--.hashtable
  File size: 788
 2013-09-09 05:22:37 End of local task; Time Taken: 0.444 sec.
 Execution completed successfully
 Mapred Local Task Succeeded . Convert the Join into MapJoin
 Mapred Local Task Succeeded . Convert the Join into MapJoin
 Launching Job 1 out of 2
 Number of reduce tasks is set to 0 since there's no reduce operator
 13/09/09 17:22:38 WARN conf.Configuration: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10009/jobconf.xml:a
  attempt to override final parameter: mapred.system.dir;  Ignoring.
 13/09/09 17:22:38 WARN conf.Configuration: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10009/jobconf.xml:a
  attempt to override final parameter: mapred.local.dir;  Ignoring.
 Execution log at: /tmp/hadoop/.log
 Job running in-process (local Hadoop)
 Hadoop job information for null: number of mappers: 0; number of reducers: 0
 2013-09-09 17:22:41,807 null map = 0%,  reduce = 0%
 2013-09-09 17:22:44,814 null map = 100%,  reduce = 0%
 Ended Job = job_local_0001
 Execution completed successfully
 Mapred Local Task Succeeded . Convert the Join into MapJoin
 Stage-7 is filtered out by condition resolver.
 OK
 Time taken: 13.138 seconds
 hive (test)> select * from test_09;
 FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'test_09'
 hive (test)>
 Problem:
 I can't get the created table; this CTAS did not work, and the table was not 
 created by this HQL statement at all. Who can explain this for me? Thanks.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-5245) hive create table as select(CTAS) can not work(not support) with join on operator

2013-10-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13792648#comment-13792648
 ] 

Yin Huai commented on HIVE-5245:


I think that error log is fine: because it is a CTAS query and test_10 did not 
yet exist in your db, the first attempt to get the table object for test_10 was 
bound to fail.

It seems the stages for the move operator and the create table operator were not 
executed. Can you try the trunk?

 hive create table as select(CTAS) can not work(not support) with join on 
 operator
 -

 Key: HIVE-5245
 URL: https://issues.apache.org/jira/browse/HIVE-5245
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2
Affects Versions: 0.11.0
Reporter: jeff little
  Labels: CTAS, hive
   Original Estimate: 96h
  Remaining Estimate: 96h

 hello everyone, recently i came across one hive problem as below:
 hive (test)> create table test_09 as
 select a.* from test_01 a
 join test_02 b
 on (a.id=b.id);
 Automatically selecting local only mode for query
 Total MapReduce jobs = 2
 setting HADOOP_USER_NAME hadoop
 13/09/09 17:22:36 WARN conf.Configuration: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10008/jobconf.xml:a
  attempt to override final parameter: mapred.system.dir;  Ignoring.
 13/09/09 17:22:36 WARN conf.Configuration: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10008/jobconf.xml:a
  attempt to override final parameter: mapred.local.dir;  Ignoring.
 Execution log at: /tmp/hadoop/.log
 2013-09-09 05:22:36 Starting to launch local task to process map join;
   maximum memory = 932118528
 2013-09-09 05:22:37 Processing rows:4   Hashtable size: 4 
   Memory usage:   113068056   rate:   0.121
 2013-09-09 05:22:37 Dump the hashtable into file: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10005/HashTable-Stage-6/MapJoin-mapfile90--.hashtable
 2013-09-09 05:22:37 Upload 1 File to: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10005/HashTable-Stage-6/MapJoin-mapfile90--.hashtable
  File size: 788
 2013-09-09 05:22:37 End of local task; Time Taken: 0.444 sec.
 Execution completed successfully
 Mapred Local Task Succeeded . Convert the Join into MapJoin
 Mapred Local Task Succeeded . Convert the Join into MapJoin
 Launching Job 1 out of 2
 Number of reduce tasks is set to 0 since there's no reduce operator
 13/09/09 17:22:38 WARN conf.Configuration: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10009/jobconf.xml:a
  attempt to override final parameter: mapred.system.dir;  Ignoring.
 13/09/09 17:22:38 WARN conf.Configuration: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10009/jobconf.xml:a
  attempt to override final parameter: mapred.local.dir;  Ignoring.
 Execution log at: /tmp/hadoop/.log
 Job running in-process (local Hadoop)
 Hadoop job information for null: number of mappers: 0; number of reducers: 0
 2013-09-09 17:22:41,807 null map = 0%,  reduce = 0%
 2013-09-09 17:22:44,814 null map = 100%,  reduce = 0%
 Ended Job = job_local_0001
 Execution completed successfully
 Mapred Local Task Succeeded . Convert the Join into MapJoin
 Stage-7 is filtered out by condition resolver.
 OK
 Time taken: 13.138 seconds
 hive (test)> select * from test_09;
 FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'test_09'
 hive (test)>
 Problem:
 I can't get the created table; this CTAS did not work, and the table was not 
 created by this HQL statement at all. Who can explain this for me? Thanks.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-5245) hive create table as select(CTAS) can not work(not support) with join on operator

2013-10-10 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13791546#comment-13791546
 ] 

Yin Huai commented on HIVE-5245:


[~jeff_little] I tried both trunk and 0.11 with tables used in our unit tests 
(src and src1) with the query 
{code:sql}
create table test_10 as
select a.* from src a join src1 b on (a.key=b.key);
{code}
I did not see the error in your post. 

Since "Stage-7 is filtered out by condition resolver." appears in your log, it 
seems hive.auto.convert.join.noconditionaltask was false in your test. Can you 
post the results of EXPLAIN with hive.auto.convert.join.noconditionaltask=true 
and with hive.auto.convert.join.noconditionaltask=false?

Because it is a CTAS query, the query plan should contain two extra stages 
besides the stages for the select query: one for the Move Operator and another 
for the Create Table Operator.

 hive create table as select(CTAS) can not work(not support) with join on 
 operator
 -

 Key: HIVE-5245
 URL: https://issues.apache.org/jira/browse/HIVE-5245
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2
Affects Versions: 0.11.0
Reporter: jeff little
  Labels: CTAS, hive
   Original Estimate: 96h
  Remaining Estimate: 96h

 Hello everyone, recently I came across a Hive problem, as below:
 hive (test)> create table test_09 as
 select a.* from test_01 a
 join test_02 b
 on (a.id=b.id);
 Automatically selecting local only mode for query
 Total MapReduce jobs = 2
 setting HADOOP_USER_NAME hadoop
 13/09/09 17:22:36 WARN conf.Configuration: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10008/jobconf.xml:a
  attempt to override final parameter: mapred.system.dir;  Ignoring.
 13/09/09 17:22:36 WARN conf.Configuration: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10008/jobconf.xml:a
  attempt to override final parameter: mapred.local.dir;  Ignoring.
 Execution log at: /tmp/hadoop/.log
 2013-09-09 05:22:36 Starting to launch local task to process map join;
   maximum memory = 932118528
 2013-09-09 05:22:37 Processing rows:4   Hashtable size: 4 
   Memory usage:   113068056   rate:   0.121
 2013-09-09 05:22:37 Dump the hashtable into file: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10005/HashTable-Stage-6/MapJoin-mapfile90--.hashtable
 2013-09-09 05:22:37 Upload 1 File to: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10005/HashTable-Stage-6/MapJoin-mapfile90--.hashtable
  File size: 788
 2013-09-09 05:22:37 End of local task; Time Taken: 0.444 sec.
 Execution completed successfully
 Mapred Local Task Succeeded . Convert the Join into MapJoin
 Mapred Local Task Succeeded . Convert the Join into MapJoin
 Launching Job 1 out of 2
 Number of reduce tasks is set to 0 since there's no reduce operator
 13/09/09 17:22:38 WARN conf.Configuration: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10009/jobconf.xml:a
  attempt to override final parameter: mapred.system.dir;  Ignoring.
 13/09/09 17:22:38 WARN conf.Configuration: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10009/jobconf.xml:a
  attempt to override final parameter: mapred.local.dir;  Ignoring.
 Execution log at: /tmp/hadoop/.log
 Job running in-process (local Hadoop)
 Hadoop job information for null: number of mappers: 0; number of reducers: 0
 2013-09-09 17:22:41,807 null map = 0%,  reduce = 0%
 2013-09-09 17:22:44,814 null map = 100%,  reduce = 0%
 Ended Job = job_local_0001
 Execution completed successfully
 Mapred Local Task Succeeded . Convert the Join into MapJoin
 Stage-7 is filtered out by condition resolver.
 OK
 Time taken: 13.138 seconds
 hive (test)> select * from test_09;
 FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'test_09'
 hive (test)>
 Problem:
 I can't get the created table; this CTAS did not work, and the table was not 
 created by this HQL statement at all. Who can explain this for me? Thanks.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-5245) hive create table as select(CTAS) can not work(not support) with join on operator

2013-10-10 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13791578#comment-13791578
 ] 

Yin Huai commented on HIVE-5245:


Can you check your log and see if you can find something like the following 
(after the line "INFO exec.ExecDriver: Execution completed successfully") ...
{code}
13/10/10 11:10:23 INFO exec.Task: Moving data to: ...
13/10/10 11:10:23 INFO exec.DDLTask: Default to LazySimpleSerDe for table 
test_10
13/10/10 11:10:23 INFO metastore.HiveMetaStore: 0: create_table: 
Table(tableName:test_10, ...
{code}

You can direct the log to your console by using 
{code}
bin/hive -hiveconf hive.root.logger=INFO,console 
{code}

What version of hive are you using? 0.11?

 hive create table as select(CTAS) can not work(not support) with join on 
 operator
 -

 Key: HIVE-5245
 URL: https://issues.apache.org/jira/browse/HIVE-5245
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2
Affects Versions: 0.11.0
Reporter: jeff little
  Labels: CTAS, hive
   Original Estimate: 96h
  Remaining Estimate: 96h

 Hello everyone, recently I came across a Hive problem, as below:
 hive (test)> create table test_09 as
 select a.* from test_01 a
 join test_02 b
 on (a.id=b.id);
 Automatically selecting local only mode for query
 Total MapReduce jobs = 2
 setting HADOOP_USER_NAME hadoop
 13/09/09 17:22:36 WARN conf.Configuration: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10008/jobconf.xml:a
  attempt to override final parameter: mapred.system.dir;  Ignoring.
 13/09/09 17:22:36 WARN conf.Configuration: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10008/jobconf.xml:a
  attempt to override final parameter: mapred.local.dir;  Ignoring.
 Execution log at: /tmp/hadoop/.log
 2013-09-09 05:22:36 Starting to launch local task to process map join;
   maximum memory = 932118528
 2013-09-09 05:22:37 Processing rows:4   Hashtable size: 4 
   Memory usage:   113068056   rate:   0.121
 2013-09-09 05:22:37 Dump the hashtable into file: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10005/HashTable-Stage-6/MapJoin-mapfile90--.hashtable
 2013-09-09 05:22:37 Upload 1 File to: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10005/HashTable-Stage-6/MapJoin-mapfile90--.hashtable
  File size: 788
 2013-09-09 05:22:37 End of local task; Time Taken: 0.444 sec.
 Execution completed successfully
 Mapred Local Task Succeeded . Convert the Join into MapJoin
 Mapred Local Task Succeeded . Convert the Join into MapJoin
 Launching Job 1 out of 2
 Number of reduce tasks is set to 0 since there's no reduce operator
 13/09/09 17:22:38 WARN conf.Configuration: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10009/jobconf.xml:a
  attempt to override final parameter: mapred.system.dir;  Ignoring.
 13/09/09 17:22:38 WARN conf.Configuration: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10009/jobconf.xml:a
  attempt to override final parameter: mapred.local.dir;  Ignoring.
 Execution log at: /tmp/hadoop/.log
 Job running in-process (local Hadoop)
 Hadoop job information for null: number of mappers: 0; number of reducers: 0
 2013-09-09 17:22:41,807 null map = 0%,  reduce = 0%
 2013-09-09 17:22:44,814 null map = 100%,  reduce = 0%
 Ended Job = job_local_0001
 Execution completed successfully
 Mapred Local Task Succeeded . Convert the Join into MapJoin
 Stage-7 is filtered out by condition resolver.
 OK
 Time taken: 13.138 seconds
 hive (test)> select * from test_09;
 FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'test_09'
 hive (test)>
 Problem:
 I can't get the created table; this CTAS did not work, and the table was not 
 created by this HQL statement at all. Who can explain this for me? Thanks.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-5245) hive create table as select(CTAS) can not work(not support) with join on operator

2013-10-10 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13791604#comment-13791604
 ] 

Yin Huai commented on HIVE-5245:


It seems those logs are runtime logs from the map tasks. I meant the log from 
the Hive driver.

Can you use 
{code}
bin/hive -hiveconf hive.root.logger=INFO,console 
{code}

 hive create table as select(CTAS) can not work(not support) with join on 
 operator
 -

 Key: HIVE-5245
 URL: https://issues.apache.org/jira/browse/HIVE-5245
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2
Affects Versions: 0.11.0
Reporter: jeff little
  Labels: CTAS, hive
   Original Estimate: 96h
  Remaining Estimate: 96h

 Hello everyone, recently I came across a Hive problem, as below:
 hive (test)> create table test_09 as
 select a.* from test_01 a
 join test_02 b
 on (a.id=b.id);
 Automatically selecting local only mode for query
 Total MapReduce jobs = 2
 setting HADOOP_USER_NAME hadoop
 13/09/09 17:22:36 WARN conf.Configuration: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10008/jobconf.xml:a
  attempt to override final parameter: mapred.system.dir;  Ignoring.
 13/09/09 17:22:36 WARN conf.Configuration: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10008/jobconf.xml:a
  attempt to override final parameter: mapred.local.dir;  Ignoring.
 Execution log at: /tmp/hadoop/.log
 2013-09-09 05:22:36 Starting to launch local task to process map join;
   maximum memory = 932118528
 2013-09-09 05:22:37 Processing rows:4   Hashtable size: 4 
   Memory usage:   113068056   rate:   0.121
 2013-09-09 05:22:37 Dump the hashtable into file: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10005/HashTable-Stage-6/MapJoin-mapfile90--.hashtable
 2013-09-09 05:22:37 Upload 1 File to: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10005/HashTable-Stage-6/MapJoin-mapfile90--.hashtable
  File size: 788
 2013-09-09 05:22:37 End of local task; Time Taken: 0.444 sec.
 Execution completed successfully
 Mapred Local Task Succeeded . Convert the Join into MapJoin
 Mapred Local Task Succeeded . Convert the Join into MapJoin
 Launching Job 1 out of 2
 Number of reduce tasks is set to 0 since there's no reduce operator
 13/09/09 17:22:38 WARN conf.Configuration: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10009/jobconf.xml:a
  attempt to override final parameter: mapred.system.dir;  Ignoring.
 13/09/09 17:22:38 WARN conf.Configuration: 
 file:/tmp/hadoop/hive_2013-09-09_17-22-34_848_1629553341892012305/-local-10009/jobconf.xml:a
  attempt to override final parameter: mapred.local.dir;  Ignoring.
 Execution log at: /tmp/hadoop/.log
 Job running in-process (local Hadoop)
 Hadoop job information for null: number of mappers: 0; number of reducers: 0
 2013-09-09 17:22:41,807 null map = 0%,  reduce = 0%
 2013-09-09 17:22:44,814 null map = 100%,  reduce = 0%
 Ended Job = job_local_0001
 Execution completed successfully
 Mapred Local Task Succeeded . Convert the Join into MapJoin
 Stage-7 is filtered out by condition resolver.
 OK
 Time taken: 13.138 seconds
 hive (test)> select * from test_09;
 FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'test_09'
 hive (test)>
 Problem:
 I can't get the created table; this CTAS did not work, and the table was not 
 created by this HQL statement at all. Who can explain this for me? Thanks.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

