[jira] [Created] (HIVE-12232) Creating an external table fails when StorageBasedAuthorization is enabled

2015-10-22 Thread WangMeng (JIRA)
WangMeng created HIVE-12232:
---

 Summary: Creating an external table fails when StorageBasedAuthorization 
is enabled
 Key: HIVE-12232
 URL: https://issues.apache.org/jira/browse/HIVE-12232
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.2.1
Reporter: WangMeng
Assignee: WangMeng


As the stack trace below shows, when StorageBasedAuthorization is enabled, 
creating an external table fails with a write-permission error on the default 
warehouse path "/user/hive/warehouse":

> CREATE EXTERNAL TABLE test(id int) LOCATION '/tmp/wangmeng/test'  ;
Error: Error while compiling statement: FAILED: HiveException 
java.security.AccessControlException: Permission denied: user=wangmeng, 
access=WRITE, inode="/user/hive/warehouse":hive:hive:drwxr-x--t.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-12231) StorageBasedAuthorization requires write permission on the default warehouse path when executing "CREATE DATABASE $Name LOCATION '$ExternalPath'"

2015-10-22 Thread WangMeng (JIRA)
WangMeng created HIVE-12231:
---

 Summary: StorageBasedAuthorization requires write permission on the 
default warehouse path when executing "CREATE DATABASE $Name LOCATION 
'$ExternalPath'"
 Key: HIVE-12231
 URL: https://issues.apache.org/jira/browse/HIVE-12231
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.2.1
Reporter: WangMeng


As the stack trace below shows, when StorageBasedAuthorization is enabled and I 
specify an external location while creating a database, Hive still checks write 
permission on the default warehouse "/user/hive/warehouse":
> create  database test  location '/tmp/wangmeng/test'  ;
Error: Error while compiling statement: FAILED: HiveException 
java.security.AccessControlException: Permission denied: user=wangmeng, 
access=WRITE, inode="/user/hive/warehouse":hive:hive:drwxr-x--t
at 
org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:255)
at 
org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:236)
at 
org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:151)





[jira] [Created] (HIVE-12085) When using HiveServer2 JDBC, creating a Hive database fails if a database location is set

2015-10-09 Thread WangMeng (JIRA)
WangMeng created HIVE-12085:
---

 Summary: When using HiveServer2 JDBC, creating a Hive database fails 
if a database location is set
 Key: HIVE-12085
 URL: https://issues.apache.org/jira/browse/HIVE-12085
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.2.1
Reporter: WangMeng








[jira] [Created] (HIVE-11880) IndexOutOfBoundsException when executing a query with a filter condition on a type-incompatible column over UNION ALL data in which a constant union column has a type incompatible with the corresponding column

2015-09-18 Thread WangMeng (JIRA)
WangMeng created HIVE-11880:
---

 Summary: IndexOutOfBoundsException when executing a query with a 
filter condition on a type-incompatible column over UNION ALL data in which a 
constant union column has a type incompatible with the corresponding column
 Key: HIVE-11880
 URL: https://issues.apache.org/jira/browse/HIVE-11880
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 1.2.1
Reporter: WangMeng
Assignee: WangMeng


In a Hive UNION ALL, a union column can be a constant (column a) whose type is 
incompatible with the corresponding column A. A query with a filter condition 
on that type-incompatible column over the UNION ALL result then throws an 
IndexOutOfBoundsException.

For example, over the TPC-H table orders:
{code}
CREATE VIEW `view_orders` AS
select `oo`.`o_orderkey`, `oo`.`o_custkey`
from (
  select `rcfileorders`.`o_orderkey`, `rcfileorders`.`o_custkey`
  from `tpch270g`.`rcfileorders`
  union all
  select `textfileorders`.`o_orderkey`, 0L as `o_custkey`
  from `tpch270g`.`textfileorders`
) `oo`;
{code}

The type of "o_custkey" is INT, while the corresponding constant column 0L is 
BIGINT. The following query (with a filter on the type-incompatible column 
o_custkey) then fails with java.lang.IndexOutOfBoundsException:
select count(1) from view_orders where o_custkey < 10;
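A workaround I would suggest (my own sketch, not a fix confirmed in this thread) is to cast the constant to the type of the corresponding column, so that both union branches agree:

{code}
-- hypothetical corrected view: the constant is cast to INT to match
-- the INT type of o_custkey in the other union branch
CREATE VIEW `view_orders` AS
select `oo`.`o_orderkey`, `oo`.`o_custkey`
from (
  select `rcfileorders`.`o_orderkey`, `rcfileorders`.`o_custkey`
  from `tpch270g`.`rcfileorders`
  union all
  select `textfileorders`.`o_orderkey`, cast(0 as int) as `o_custkey`
  from `tpch270g`.`textfileorders`
) `oo`;
{code}

With matching column types in both branches, a filter such as o_custkey < 10 should no longer hit the type-mismatch path.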





[jira] [Created] (HIVE-11695) HQL "INSERT OVERWRITE LOCAL DIRECTORY" does not throw an exception when the Hive user lacks write permission on the directory

2015-08-31 Thread WangMeng (JIRA)
WangMeng created HIVE-11695:
---

 Summary: HQL "INSERT OVERWRITE LOCAL DIRECTORY" does not throw an 
exception when the Hive user lacks write permission on the directory
 Key: HIVE-11695
 URL: https://issues.apache.org/jira/browse/HIVE-11695
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 1.2.1, 1.1.0, 1.2.0, 1.0.0, 0.14.0, 0.13.0
Reporter: WangMeng
Assignee: WangMeng


When a Hive user does not have write permission on a LOCAL DIRECTORY such as 
"/data/wangmeng/", executing the HQL statement "insert overwrite LOCAL 
DIRECTORY '/data/wangmeng/hiveserver2' ..." does not throw any exception and 
appears to finish successfully.





[jira] [Created] (HIVE-11149) Thread-unsafe HashMap in PerfLogger.java can hang in a multi-threaded environment

2015-06-30 Thread WangMeng (JIRA)
WangMeng created HIVE-11149:
---

 Summary: Thread-unsafe HashMap in PerfLogger.java can hang in a 
multi-threaded environment
 Key: HIVE-11149
 URL: https://issues.apache.org/jira/browse/HIVE-11149
 Project: Hive
  Issue Type: Bug
  Components: Logging
Affects Versions: 1.2.0
Reporter: WangMeng
Assignee: WangMeng
 Fix For: 1.2.0


In a multi-threaded environment, the thread-unsafe HashMap in PerfLogger.java 
can hang and consume large amounts of unnecessary CPU and memory.





[jira] [Created] (HIVE-10971) count(*) with count(distinct) gives wrong results when hive.groupby.skewindata=true

2015-06-09 Thread wangmeng (JIRA)
wangmeng created HIVE-10971:
---

 Summary: count(*) with count(distinct) gives wrong results when 
hive.groupby.skewindata=true
 Key: HIVE-10971
 URL: https://issues.apache.org/jira/browse/HIVE-10971
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 1.2.0
Reporter: wangmeng
Assignee: wangmeng


When hive.groupby.skewindata=true, the following query based on TPC-H gives 
wrong results:

{code}
set hive.groupby.skewindata=true;

select l_returnflag, count(*), count(distinct l_linestatus)
from lineitem
group by l_returnflag
limit 10;
{code}

The query plan shows that only one MapReduce job is generated instead of the 
two that hive.groupby.skewindata=true dictates.

The problem arises only when {noformat}count(*){noformat} and 
{noformat}count(distinct){noformat} appear together.
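Until the planner is fixed, one workaround (my own sketch, not verified against this bug) is to compute the two aggregates in separate subqueries and join the results, so that each aggregation can get the skew rewrite independently:

{code}
set hive.groupby.skewindata=true;

select a.l_returnflag, a.cnt_all, d.cnt_distinct
from (select l_returnflag, count(*) as cnt_all
      from lineitem
      group by l_returnflag) a
join (select l_returnflag, count(distinct l_linestatus) as cnt_distinct
      from lineitem
      group by l_returnflag) d
on a.l_returnflag = d.l_returnflag
limit 10;
{code}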





[jira] [Commented] (HIVE-7822) how to merge two hive metastores' metadata stored in different databases (such as mysql)

2014-09-13 Thread wangmeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133095#comment-14133095
 ] 

wangmeng commented on HIVE-7822:


Thanks for your advice!

(Sent from NetEase Mail for mobile. On 2014-09-14 11:21, Xuefu Zhang (JIRA) wrote, 
https://issues.apache.org/jira/browse/HIVE-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133064#comment-14133064
 :)

Xuefu Zhang commented on HIVE-7822:
[~wangmeng] JIRA is for reporting issues or requesting features. Questions like 
the one you presented are better sent to the user list. I'm closing this JIRA.


> how to merge two hive metastores' metadata stored in different databases 
> (such as mysql)
> --
>
> Key: HIVE-7822
> URL: https://issues.apache.org/jira/browse/HIVE-7822
> Project: Hive
>  Issue Type: Improvement
>Reporter: wangmeng
>
> Hi, what is a good way to merge the metadata of two Hive metastores stored 
> in different databases (such as MySQL)?
> Is there any way to get all historical HQL statements from the metastore? I 
> think I would need to run those statements again against the other metastore 
> database.
> Thanks





[jira] [Commented] (HIVE-3421) Column Level Top K Values Statistics

2014-08-22 Thread wangmeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106624#comment-14106624
 ] 

wangmeng commented on HIVE-3421:


This is very useful! I am looking forward to the coming version.

> Column Level Top K Values Statistics
> 
>
> Key: HIVE-3421
> URL: https://issues.apache.org/jira/browse/HIVE-3421
> Project: Hive
>  Issue Type: New Feature
>Reporter: Feng Lu
>Assignee: Feng Lu
> Attachments: HIVE-3421.patch.1.txt, HIVE-3421.patch.2.txt, 
> HIVE-3421.patch.3.txt, HIVE-3421.patch.4.txt, HIVE-3421.patch.5.txt, 
> HIVE-3421.patch.6.txt, HIVE-3421.patch.7.txt, HIVE-3421.patch.8.txt, 
> HIVE-3421.patch.9.txt, HIVE-3421.patch.txt
>
>
> Compute (estimate) top k values statistics for each column, and put the most 
> skewed column into skewed info, if user hasn't specified skew.
> This feature depends on ListBucketing (create table skewed on) 
> https://cwiki.apache.org/Hive/listbucketing.html.
> All column topk can be added to skewed info, if in the future skewed info 
> supports multiple independent columns.
> The TopK algorithm is based on this paper:
> http://www.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf





[jira] [Created] (HIVE-7822) how to merge two hive metastores' metadata stored in different databases (such as mysql)

2014-08-20 Thread wangmeng (JIRA)
wangmeng created HIVE-7822:
--

 Summary: how to merge two  hive metastores' metadata  stored in 
different databases (such as mysql)
 Key: HIVE-7822
 URL: https://issues.apache.org/jira/browse/HIVE-7822
 Project: Hive
  Issue Type: Improvement
Reporter: wangmeng


Hi, what is a good way to merge the metadata of two Hive metastores stored in 
different databases (such as MySQL)?

Is there any way to get all historical HQL statements from the metastore? I 
think I would need to run those statements again against the other metastore 
database.

Thanks





[jira] [Commented] (HIVE-7292) Hive on Spark

2014-07-22 Thread wangmeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070044#comment-14070044
 ] 

wangmeng commented on HIVE-7292:


This is a very valuable project!

> Hive on Spark
> -
>
> Key: HIVE-7292
> URL: https://issues.apache.org/jira/browse/HIVE-7292
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: Hive-on-Spark.pdf
>
>
> Spark as an open-source data analytics cluster computing framework has gained 
> significant momentum recently. Many Hive users already have Spark installed 
> as their computing backbone. To take advantage of Hive, they still need to 
> have either MapReduce or Tez on their cluster. This initiative will provide 
> users a new alternative so that they can consolidate their backends. 
> Secondly, providing such an alternative further increases Hive's adoption, as 
> it exposes Spark users to a viable, feature-rich, de facto standard SQL tool 
> on Hadoop.
> Finally, allowing Hive to run on Spark also has performance benefits. Hive 
> queries, especially those involving multiple reducer stages, will run faster, 
> thus improving the user experience, as Tez does.
> This is an umbrella JIRA which will cover many coming subtasks. The design 
> doc will be attached here shortly, and will be on the wiki as well. Feedback 
> from the community is greatly appreciated!





[jira] [Updated] (HIVE-7469) skew join keys when two joined tables have the same big skew key

2014-07-22 Thread wangmeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangmeng updated HIVE-7469:
---

Description: 
In https://issues.apache.org/jira/browse/HIVE-964 I outlined a general idea of 
how to deal with skewed join keys; the key is to use a map join for the skewed 
keys. One case troubles me, though: what if the two joined tables have the same 
heavily skewed key value? For example, in select * from A join B on 
A.id = B.id, both table A and table B have many rows with id = 1; if we use a 
map join for the skewed key id = 1, it may run out of memory (OOM).
So how can this case be handled? Will it fall back to a common join? Thanks.

  was:
In https://issues.apache.org/jira/browse/HIVE-964 I outlined a general idea of 
how to deal with skewed join keys. One case troubles me, though: what if the 
two joined tables have the same heavily skewed key value? For example, in 
select * from A join B on A.id = B.id, both table A and table B have many rows 
with id = 1; if we use a map join for the skewed key id = 1, it may run out of 
memory (OOM).
So how can this case be handled? Will it fall back to a common join? Thanks.


> skew join keys when two joined tables have the same big skew key
> -
>
> Key: HIVE-7469
> URL: https://issues.apache.org/jira/browse/HIVE-7469
> Project: Hive
>  Issue Type: Improvement
>Reporter: wangmeng
>
> In https://issues.apache.org/jira/browse/HIVE-964 I outlined a general idea 
> of how to deal with skewed join keys; the key is to use a map join for the 
> skewed keys. One case troubles me, though: what if the two joined tables 
> have the same heavily skewed key value? For example, in select * from A join 
> B on A.id = B.id, both table A and table B have many rows with id = 1; if we 
> use a map join for the skewed key id = 1, it may run out of memory (OOM).
> So how can this case be handled? Will it fall back to a common join? Thanks.





[jira] [Updated] (HIVE-7469) skew join keys when two joined tables have the same big skew key

2014-07-22 Thread wangmeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangmeng updated HIVE-7469:
---

Description: 
In https://issues.apache.org/jira/browse/HIVE-964 I outlined a general idea of 
how to deal with skewed join keys; the key point is to use a map join for the 
skewed keys. One case troubles me, though: what if the two joined tables have 
the same heavily skewed key value? For example, in select * from A join B on 
A.id = B.id, both table A and table B have many rows with id = 1; if we use a 
map join for the skewed key id = 1, it may run out of memory (OOM).
So how can this case be handled? Will it fall back to a common join? Thanks.

  was:
In https://issues.apache.org/jira/browse/HIVE-964 I outlined a general idea of 
how to deal with skewed join keys; the key is to use a map join for the skewed 
keys. One case troubles me, though: what if the two joined tables have the same 
heavily skewed key value? For example, in select * from A join B on 
A.id = B.id, both table A and table B have many rows with id = 1; if we use a 
map join for the skewed key id = 1, it may run out of memory (OOM).
So how can this case be handled? Will it fall back to a common join? Thanks.


> skew join keys when two joined tables have the same big skew key
> -
>
> Key: HIVE-7469
> URL: https://issues.apache.org/jira/browse/HIVE-7469
> Project: Hive
>  Issue Type: Improvement
>Reporter: wangmeng
>
> In https://issues.apache.org/jira/browse/HIVE-964 I outlined a general idea 
> of how to deal with skewed join keys; the key point is to use a map join for 
> the skewed keys. One case troubles me, though: what if the two joined tables 
> have the same heavily skewed key value? For example, in select * from A join 
> B on A.id = B.id, both table A and table B have many rows with id = 1; if we 
> use a map join for the skewed key id = 1, it may run out of memory (OOM).
> So how can this case be handled? Will it fall back to a common join? Thanks.





[jira] [Updated] (HIVE-7469) skew join keys when two joined tables have the same big skew key

2014-07-22 Thread wangmeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangmeng updated HIVE-7469:
---

Description: 
In https://issues.apache.org/jira/browse/HIVE-964 I outlined a general idea of 
how to deal with skewed join keys. One case troubles me, though: what if the 
two joined tables have the same heavily skewed key value? For example, in 
select * from A join B on A.id = B.id, both table A and table B have many rows 
with id = 1; if we use a map join for the skewed key id = 1, it may run out of 
memory (OOM).
So how can this case be handled? Will it fall back to a common join? Thanks.

  was:
In https://issues.apache.org/jira/browse/HIVE-964 I outlined a general idea of 
how to deal with skewed join keys. One case troubles me, though: what if the 
two joined tables have the same heavily skewed key value? For example, in 
select * from A join B on A.id = B.id, both table A and table B have many rows 
with id = 1; if we use a map join for the skewed key id = 1, it may run out of 
memory (OOM).
So how can this case be handled? Will it fall back to a common join? Thanks.


> skew join keys when two joined tables have the same big skew key
> -
>
> Key: HIVE-7469
> URL: https://issues.apache.org/jira/browse/HIVE-7469
> Project: Hive
>  Issue Type: Improvement
>Reporter: wangmeng
>
> In https://issues.apache.org/jira/browse/HIVE-964 I outlined a general idea 
> of how to deal with skewed join keys. One case troubles me, though: what if 
> the two joined tables have the same heavily skewed key value? For example, 
> in select * from A join B on A.id = B.id, both table A and table B have many 
> rows with id = 1; if we use a map join for the skewed key id = 1, it may run 
> out of memory (OOM).
> So how can this case be handled? Will it fall back to a common join? Thanks.





[jira] [Created] (HIVE-7469) skew join keys when two joined tables have the same big skew key

2014-07-22 Thread wangmeng (JIRA)
wangmeng created HIVE-7469:
--

 Summary: skew join keys when two joined tables have the same big 
skew key
 Key: HIVE-7469
 URL: https://issues.apache.org/jira/browse/HIVE-7469
 Project: Hive
  Issue Type: Improvement
Reporter: wangmeng


In https://issues.apache.org/jira/browse/HIVE-964 I outlined a general idea of 
how to deal with skewed join keys. One case troubles me, though: what if the 
two joined tables have the same heavily skewed key value? For example, in 
select * from A join B on A.id = B.id, both table A and table B have many rows 
with id = 1; if we use a map join for the skewed key id = 1, it may run out of 
memory (OOM).
So how can this case be handled? Will it fall back to a common join? Thanks.
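For context, Hive's existing runtime skew handling from HIVE-964 is driven by configuration such as the following (a sketch of the standard settings, not a resolution from this thread):

{code}
-- enable runtime skew-join handling (HIVE-964)
set hive.optimize.skewjoin=true;
-- number of rows per key above which the key is treated as skewed
set hive.skewjoin.key=100000;

-- skewed keys are written aside and joined in a follow-up job
select * from A join B on A.id = B.id;
{code}

Whether that follow-up map join can itself fall back to a common join when both sides of the join are skewed on the same key is exactly the question raised here.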





[jira] [Commented] (HIVE-964) handle skewed keys for a join in a separate job

2014-07-22 Thread wangmeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069997#comment-14069997
 ] 

wangmeng commented on HIVE-964:
---

If the two joined tables have the same heavily skewed key value (for example, 
in select * from A join B on A.id = B.id, both table A and table B have many 
rows with id = 1, so a map join would run out of memory), how can this case be 
handled? Will it fall back to a common join?

> handle skewed keys for a join in a separate job
> ---
>
> Key: HIVE-964
> URL: https://issues.apache.org/jira/browse/HIVE-964
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: He Yongqiang
> Fix For: 0.6.0
>
> Attachments: hive-964-2009-12-17.txt, hive-964-2009-12-28-2.patch, 
> hive-964-2009-12-29-4.patch, hive-964-2010-01-08.patch, 
> hive-964-2010-01-13-2.patch, hive-964-2010-01-14-3.patch, 
> hive-964-2010-01-15-4.patch
>
>
> The skewed keys can be written to a temporary table or file, and a followup 
> conditional task can be used to perform the join on those keys.
> As a first step, JDBM can be used for those keys





[jira] [Commented] (HIVE-7296) big data approximate processing at a very low cost based on hive sql

2014-07-05 Thread wangmeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052834#comment-14052834
 ] 

wangmeng commented on HIVE-7296:


Yes, I like it.












> big data approximate processing  at a very  low cost  based on hive sql 
> 
>
> Key: HIVE-7296
> URL: https://issues.apache.org/jira/browse/HIVE-7296
> Project: Hive
>  Issue Type: New Feature
>Reporter: wangmeng
>
> For big data analysis, we often need the following queries and statistics:
> 1. Cardinality estimation: count the number of distinct elements in a 
> collection, such as unique visitors (UV).
> Today's Hive query: select distinct(id) from TestTable;
> 2. Frequency estimation: estimate how many times an element repeats, such as 
> the number of site visits by one user.
> Hive query: select count(1) from TestTable where name = "wangmeng";
> 3. Heavy hitters (top-k elements): such as the top 100 shops.
> Hive query: select count(1), name from TestTable group by name; (a UDF is 
> also needed)
> 4. Range query: for example, find the number of users aged between 20 and 30.
> Hive query: select count(1) from TestTable where age > 20 and age < 30;
> 5. Membership query: for example, is a user name already registered?
> Given Hive's implementation mechanism, such queries cost a large amount of 
> memory and a long query time.
> However, in many cases we do not need very accurate results, and a small 
> error can be tolerated. In such cases, approximate processing can greatly 
> improve time and space efficiency.
> Now, based on some theoretical analysis materials, I would very much like to 
> work on these new features if possible.
> So, is there anything I can do? Many thanks.
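One cheap approximation already available in stock Hive (offered here as an illustrative sketch, not part of the proposal above) is block sampling, which scans only a fraction of the input and scales the result up:

{code}
-- approximate frequency estimation via block sampling:
-- scan roughly 1% of the input blocks, then scale the count by 100
select count(1) * 100
from TestTable tablesample(1 percent) t
where t.name = "wangmeng";
{code}

The sketch-based structures the proposal hints at (HyperLogLog for cardinality, Count-Min for frequency, Bloom filters for membership) would give tighter error guarantees than naive sampling.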





[jira] [Commented] (HIVE-7296) big data approximate processing at a very low cost based on hive sql

2014-06-26 Thread wangmeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045604#comment-14045604
 ] 

wangmeng commented on HIVE-7296:


Sorry, they are different features.

> big data approximate processing  at a very  low cost  based on hive sql 
> 
>
> Key: HIVE-7296
> URL: https://issues.apache.org/jira/browse/HIVE-7296
> Project: Hive
>  Issue Type: New Feature
>Reporter: wangmeng
>
> For big data analysis, we often need the following queries and statistics:
> 1. Cardinality estimation: count the number of distinct elements in a 
> collection, such as unique visitors (UV).
> Today's Hive query: select distinct(id) from TestTable;
> 2. Frequency estimation: estimate how many times an element repeats, such as 
> the number of site visits by one user.
> Hive query: select count(1) from TestTable where name = "wangmeng";
> 3. Heavy hitters (top-k elements): such as the top 100 shops.
> Hive query: select count(1), name from TestTable group by name; (a UDF is 
> also needed)
> 4. Range query: for example, find the number of users aged between 20 and 30.
> Hive query: select count(1) from TestTable where age > 20 and age < 30;
> 5. Membership query: for example, is a user name already registered?
> Given Hive's implementation mechanism, such queries cost a large amount of 
> memory and a long query time.
> However, in many cases we do not need very accurate results, and a small 
> error can be tolerated. In such cases, approximate processing can greatly 
> improve time and space efficiency.
> Now, based on some theoretical analysis materials, I would very much like to 
> work on these new features if possible.
> I am familiar with Hive and Hadoop, and I have implemented an efficient 
> storage format based on Hive 
> (https://github.com/sjtufighter/Data---Storage--).
> So, is there anything I can do? Many thanks.





[jira] [Updated] (HIVE-7296) big data approximate processing at a very low cost based on hive sql

2014-06-26 Thread wangmeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangmeng updated HIVE-7296:
---

Description: 
For big data analysis, we often need the following queries and statistics:

1. Cardinality estimation: count the number of distinct elements in a 
collection, such as unique visitors (UV).

Today's Hive query:
select distinct(id) from TestTable;

2. Frequency estimation: estimate how many times an element repeats, such as 
the number of site visits by one user.

Hive query: select count(1) from TestTable where name = "wangmeng";

3. Heavy hitters (top-k elements): such as the top 100 shops.

Hive query: select count(1), name from TestTable group by name; (a UDF is also 
needed)

4. Range query: for example, find the number of users aged between 20 and 30.

Hive query: select count(1) from TestTable where age > 20 and age < 30;

5. Membership query: for example, is a user name already registered?

Given Hive's implementation mechanism, such queries cost a large amount of 
memory and a long query time.

However, in many cases we do not need very accurate results, and a small error 
can be tolerated. In such cases, approximate processing can greatly improve 
time and space efficiency.

Now, based on some theoretical analysis materials, I would very much like to 
work on these new features if possible.

So, is there anything I can do? Many thanks.


  was:
For big data analysis, we often need the following queries and statistics:

1. Cardinality estimation: count the number of distinct elements in a 
collection, such as unique visitors (UV).

Today's Hive query:
select distinct(id) from TestTable;

2. Frequency estimation: estimate how many times an element repeats, such as 
the number of site visits by one user.

Hive query: select count(1) from TestTable where name = "wangmeng";

3. Heavy hitters (top-k elements): such as the top 100 shops.

Hive query: select count(1), name from TestTable group by name; (a UDF is also 
needed)

4. Range query: for example, find the number of users aged between 20 and 30.

Hive query: select count(1) from TestTable where age > 20 and age < 30;

5. Membership query: for example, is a user name already registered?

Given Hive's implementation mechanism, such queries cost a large amount of 
memory and a long query time.

However, in many cases we do not need very accurate results, and a small error 
can be tolerated. In such cases, approximate processing can greatly improve 
time and space efficiency.

Now, based on some theoretical analysis materials, I would very much like to 
work on these new features if possible.

I am familiar with Hive and Hadoop, and I have implemented an efficient 
storage format based on Hive 
(https://github.com/sjtufighter/Data---Storage--).

So, is there anything I can do? Many thanks.



> big data approximate processing  at a very  low cost  based on hive sql 
> 
>
> Key: HIVE-7296
> URL: https://issues.apache.org/jira/browse/HIVE-7296
> Project: Hive
>  Issue Type: New Feature
>Reporter: wangmeng
>
> For big data analysis, we often need the following queries and statistics:
> 1. Cardinality estimation: count the number of distinct elements in a 
> collection, such as unique visitors (UV).
> Today's Hive query: select distinct(id) from TestTable;
> 2. Frequency estimation: estimate how many times an element repeats, such as 
> the number of site visits by one user.
> Hive query: select count(1) from TestTable where name = "wangmeng";
> 3. Heavy hitters (top-k elements): such as the top 100 shops.
> Hive query: select count(1), name from TestTable group by name; (a UDF is 
> also needed)
> 4. Range query: for example, find the number of users aged between 20 and 30.
> Hive query: select count(1) from TestTable where age > 20 and age < 30;
> 5. Membership query: for example, is a user name already registered?
> Given Hive's implementation mechanism, such queries cost a large amount of 
> memory and a long query time.
> However, in many cases we do not need very accurate results, and a small 
> error can be tolerated. In such cases, approximate processing can greatly 
> improve time and space efficiency.
> Now, based on some theoretical analysis materials, I would very much like to 
> work on these new features if possible.
> So, is there anything I can do? Many thanks.





[jira] [Updated] (HIVE-7296) big data approximate processing at a very low cost based on hive sql

2014-06-26 Thread wangmeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangmeng updated HIVE-7296:
---

Description: 
For big data analysis, we often need the following kinds of queries and
statistics:

1. Cardinality estimation: count the number of distinct elements in a
collection (such as Unique Visitors, UV).
Hive query: select count(distinct id) from TestTable;

2. Frequency estimation: estimate how many times an element is repeated, such
as the site visits of a single user.
Hive query: select count(1) from TestTable where name = 'wangmeng';

3. Heavy hitters (top-k elements): such as the top 100 shops.
Hive query: select count(1), name from TestTable group by name; (a UDF may be
needed)

4. Range query: for example, find the number of users aged between 20 and 30.
Hive query: select count(1) from TestTable where age > 20 and age < 30;

5. Membership query: for example, is a given user name already registered?

With Hive's current implementation, such queries cost a large amount of
memory and a long query time.

However, in many cases we do not need very accurate results, and a small
error can be tolerated. In such cases, we can use approximate processing to
greatly improve time and space efficiency.

Based on some theoretical analysis materials, I would very much like to work
on these new features if possible.

I am familiar with Hive and Hadoop, and I have implemented an efficient
storage format based on Hive
(https://github.com/sjtufighter/Data---Storage--).

So, is there anything I can do? Many thanks.
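For the membership query (point 5), the standard approximate structure is a Bloom filter: it answers "definitely not registered" or "possibly registered" in constant space. A minimal Python sketch for illustration (the class name and sizes are mine, not part of Hive):

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter over m bits with k salted hash functions.
    No false negatives; false positives possible at a tunable rate."""
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
for name in ["wangmeng", "alice", "bob"]:
    bf.add(name)

print(bf.might_contain("wangmeng"))  # True
print(bf.might_contain("carol"))     # False (with high probability)
```

A registration check then only needs to consult the full table when the filter says "possibly present", which is the usual way such a structure would cut query time.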





> big data approximate processing  at a very  low cost  based on hive sql 
> 
>
> Key: HIVE-7296
> URL: https://issues.apache.org/jira/browse/HIVE-7296
> Project: Hive
>  Issue Type: New Feature
>Reporter: wangmeng
>



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HIVE-7296) big data approximate processing at a very low cost based on hive sql

2014-06-25 Thread wangmeng (JIRA)
wangmeng created HIVE-7296:
--

 Summary: big data approximate processing  at a very  low cost  
based on hive sql 
 Key: HIVE-7296
 URL: https://issues.apache.org/jira/browse/HIVE-7296
 Project: Hive
  Issue Type: New Feature
Reporter: wangmeng


For big data analysis, we often need the following kinds of queries and
statistics:

1. Cardinality estimation: count the number of distinct elements in a
collection (such as Unique Visitors, UV).
Hive query: select count(distinct id) from TestTable;

2. Frequency estimation: estimate how many times an element is repeated, such
as the site visits of a single user.
Hive query: select count(1) from TestTable where name = 'wangmeng';

3. Heavy hitters (top-k elements): such as the top 100 shops.
Hive query: select count(1), name from TestTable group by name; (a UDF may be
needed)

4. Range query: for example, find the number of users aged between 20 and 30.
Hive query: select count(1) from TestTable where age > 20 and age < 30;

5. Membership query: for example, is a given user name already registered?

With Hive's current implementation, such queries cost a large amount of
memory and a long query time.

However, in many cases we do not need very accurate results, and a small
error can be tolerated. In such cases, we can use approximate processing to
greatly improve time and space efficiency.

Based on some theoretical analysis materials, I would very much like to work
on these new features if possible.

I am familiar with Hive and Hadoop, and I have implemented an efficient
storage format based on Hive
(https://github.com/sjtufighter/Data---Storage--).

So, is there anything I can do? Many thanks.
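For cardinality estimation (point 1), a well-known approximate algorithm is HyperLogLog. The sketch below is my own illustrative Python, not Hive's implementation, and the register count (p=10) is an arbitrary choice for the example:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog: estimates distinct-count using 2**p small registers,
    with relative error roughly 1.04 / sqrt(2**p)."""
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant (valid for m >= 128).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        x = int(hashlib.sha1(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = x >> (64 - self.p)                    # high p bits pick a register
        w = x & ((1 << (64 - self.p)) - 1)          # remaining 64-p bits
        rank = (64 - self.p) - w.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        inv_sum = sum(2.0 ** -r for r in self.registers)
        e = self.alpha * self.m * self.m / inv_sum
        # Small-range correction: fall back to linear counting.
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:
            e = self.m * math.log(self.m / zeros)
        return int(e)

hll = HyperLogLog()
for i in range(10000):
    hll.add(i)
print(hll.estimate())  # approximately 10000, within a few percent
```

The whole state is about 1 KB here, versus the full set of ids that an exact distinct count must materialize; register arrays also merge trivially across map tasks, which is what makes the approach attractive for a Hive UDAF.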




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7277) how to decide reduce numbers according to the input size of reduce stage rather than the input size of map stage?

2014-06-23 Thread wangmeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041761#comment-14041761
 ] 

wangmeng commented on HIVE-7277:


Well, the MapReduce API cannot fit this logical plan to generate the
physical plan.





--

Best  Regards
HomePage:http://wangmeng.us/
Name:Wang Meng---Data structures and Algorithms,Java,Jvm, Linux, Shell, 
Distributed system , Hadoop  Hive , Performancse Optimization and Debug 
,Spark/Shark 
Major: Software Engineering
Degree:  Master
E-mail:   sjtufigh...@163.com   sjtufigh...@sjtu.edu.cn
GitHub:https://github.com/sjtufighter







> how to decide reduce numbers   according  to  the input size of reduce stage 
> rather than the  input size of  map stage?
> ---
>
> Key: HIVE-7277
> URL: https://issues.apache.org/jira/browse/HIVE-7277
> Project: Hive
>  Issue Type: New Feature
>Reporter: wangmeng
> Fix For: 0.13.0
>
>
> As we know, Hive currently decides the number of reducers simply by:
> (input size of the map stage) / hive.exec.reducers.bytes.per.reducer
> (default 1 GB).
> But the output size of the map stage may differ greatly from the original
> input size, so I think this strategy for deciding the reducer count can be
> inappropriate.
> So, is there any feature that can decide the reducer count according to the
> output of the map stage instead? Thanks.
> As far as I know, the reduce stage actually begins after some map tasks
> have finished, rather than waiting until the whole map stage has finished,
> so it is also impractical to decide the reducer count only after the whole
> map stage has finished.
> As someone pointed out, we could estimate the total reducer count from the
> output size of the earliest map tasks to finish. However, Hive now uses
> filter pushdown (WHERE clauses), which can cause the output of each map
> task to differ greatly, so this estimation is also inaccurate.
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7277) how to decide reduce numbers according to the input size of reduce stage rather than the input size of map stage?

2014-06-23 Thread wangmeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041742#comment-14041742
 ] 

wangmeng commented on HIVE-7277:




As far as I know, Tez is a new compute engine, different from MapReduce. Is
there any solution based on the MapReduce engine?










> how to decide reduce numbers   according  to  the input size of reduce stage 
> rather than the  input size of  map stage?
> ---
>
> Key: HIVE-7277
> URL: https://issues.apache.org/jira/browse/HIVE-7277
> Project: Hive
>  Issue Type: New Feature
>Reporter: wangmeng
> Fix For: 0.13.0
>
>



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7277) how to decide reduce numbers according to the input size of reduce stage rather than the input size of map stage?

2014-06-23 Thread wangmeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041741#comment-14041741
 ] 

wangmeng commented on HIVE-7277:


As far as I know, Tez is a new compute engine, different from MapReduce. Is
there any solution based on the MapReduce engine?

> how to decide reduce numbers   according  to  the input size of reduce stage 
> rather than the  input size of  map stage?
> ---
>
> Key: HIVE-7277
> URL: https://issues.apache.org/jira/browse/HIVE-7277
> Project: Hive
>  Issue Type: New Feature
>Reporter: wangmeng
> Fix For: 0.13.0
>
>



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HIVE-7277) how to decide reduce numbers according to the input size of reduce stage rather than the input size of map stage?

2014-06-23 Thread wangmeng (JIRA)
wangmeng created HIVE-7277:
--

 Summary: how to decide reduce numbers according to the input size of
the reduce stage rather than the input size of the map stage?
 Key: HIVE-7277
 URL: https://issues.apache.org/jira/browse/HIVE-7277
 Project: Hive
  Issue Type: New Feature
Reporter: wangmeng
 Fix For: 0.13.0


As we know, Hive currently decides the number of reducers simply by:
(input size of the map stage) / hive.exec.reducers.bytes.per.reducer
(default 1 GB).

But the output size of the map stage may differ greatly from the original
input size, so I think this strategy for deciding the reducer count can be
inappropriate.

So, is there any feature that can decide the reducer count according to the
output of the map stage instead? Thanks.

As far as I know, the reduce stage actually begins after some map tasks have
finished, rather than waiting until the whole map stage has finished, so it
is also impractical to decide the reducer count only after the whole map
stage has finished.

As someone pointed out, we could estimate the total reducer count from the
output size of the earliest map tasks to finish. However, Hive now uses
filter pushdown (WHERE clauses), which can cause the output of each map task
to differ greatly, so this estimation is also inaccurate.

Thanks.
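To make the current behavior concrete, the rule described above can be sketched as follows. This is a simplified model, not Hive's actual code: the real planner also caps the count by hive.exec.reducers.max, whose default varies by version (999 is assumed here purely for illustration):

```python
import math

def estimate_reducers(map_input_bytes,
                      bytes_per_reducer=1 << 30,  # hive.exec.reducers.bytes.per.reducer (1 GB default)
                      max_reducers=999):          # hive.exec.reducers.max (version-dependent; assumed)
    """Simplified model of Hive's reducer-count heuristic:
    one reducer per bytes_per_reducer of *map input*, clamped to [1, max]."""
    reducers = math.ceil(map_input_bytes / bytes_per_reducer)
    return max(1, min(reducers, max_reducers))

print(estimate_reducers(10 * (1 << 30)))  # 10 reducers for 10 GB of map input
```

The complaint in this issue is that map_input_bytes is the wrong variable: after a selective WHERE filter, the map output feeding the reducers may be far smaller than the map input, so the clamp above systematically over-provisions reducers.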




--
This message was sent by Atlassian JIRA
(v6.2#6252)