[jira] [Created] (HIVE-12232) Creating an external table fails when StorageBasedAuthorization is enabled
WangMeng created HIVE-12232:
---
Summary: Creating an external table fails when StorageBasedAuthorization is enabled
Key: HIVE-12232
URL: https://issues.apache.org/jira/browse/HIVE-12232
Project: Hive
Issue Type: Bug
Affects Versions: 1.2.1
Reporter: WangMeng
Assignee: WangMeng

Please look at the stack trace: when StorageBasedAuthorization is enabled, creating an external table fails with a write-permission error on the default warehouse path "/user/hive/warehouse":

> CREATE EXTERNAL TABLE test(id int) LOCATION '/tmp/wangmeng/test';

Error: Error while compiling statement: FAILED: HiveException java.security.AccessControlException: Permission denied: user=wangmeng, access=WRITE, inode="/user/hive/warehouse":hive:hive:drwxr-x--t

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12231) StorageBasedAuthorization requires write permission on the default warehouse path when executing "CREATE DATABASE $Name LOCATION '$ExternalPath'"
WangMeng created HIVE-12231:
---
Summary: StorageBasedAuthorization requires write permission on the default warehouse path when executing "CREATE DATABASE $Name LOCATION '$ExternalPath'"
Key: HIVE-12231
URL: https://issues.apache.org/jira/browse/HIVE-12231
Project: Hive
Issue Type: Bug
Affects Versions: 1.2.1
Reporter: WangMeng

Please look at the stack trace: when StorageBasedAuthorization is enabled and an external location is set while creating a database, Hive still checks write permission on the default warehouse "/user/hive/warehouse":

> create database test location '/tmp/wangmeng/test';

Error: Error while compiling statement: FAILED: HiveException java.security.AccessControlException: Permission denied: user=wangmeng, access=WRITE, inode="/user/hive/warehouse":hive:hive:drwxr-x--t
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:255)
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:236)
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:151)
[jira] [Created] (HIVE-12085) When using HiveServer2 JDBC, creating a Hive database fails if a database location is set
WangMeng created HIVE-12085:
---
Summary: When using HiveServer2 JDBC, creating a Hive database fails if a database location is set
Key: HIVE-12085
URL: https://issues.apache.org/jira/browse/HIVE-12085
Project: Hive
Issue Type: Bug
Affects Versions: 1.2.1
Reporter: WangMeng
[jira] [Created] (HIVE-11880) IndexOutOfBoundsException when executing a query with a filter condition on a type-incompatible column over a UNION ALL in which a constant union column has a type incompatible with the corresponding column
WangMeng created HIVE-11880:
---
Summary: IndexOutOfBoundsException when executing a query with a filter condition on a type-incompatible column over a UNION ALL in which a constant union column has a type incompatible with the corresponding column
Key: HIVE-11880
URL: https://issues.apache.org/jira/browse/HIVE-11880
Project: Hive
Issue Type: Bug
Components: Query Processor
Affects Versions: 1.2.1
Reporter: WangMeng
Assignee: WangMeng

In a Hive UNION ALL, suppose one union column is a constant (column a) whose type is incompatible with the corresponding column A. A query with a filter condition on the type-incompatible column over the UNION ALL result throws an IndexOutOfBoundsException. For example, with the TPC-H table orders:

CREATE VIEW `view_orders` AS
select `oo`.`o_orderkey`, `oo`.`o_custkey`
from (
  select `rcfileorders`.`o_orderkey`, `rcfileorders`.`o_custkey` from `tpch270g`.`rcfileorders`
  union all
  select `textfileorders`.`o_orderkey`, 0L as `o_custkey` from `tpch270g`.`textfileorders`
) `oo`;

The type of "o_custkey" is INT, while the type of the corresponding constant column 0L is BIGINT. The following query (filtering on the incompatible column o_custkey) then fails with java.lang.IndexOutOfBoundsException:

select count(1) from view_orders where o_custkey < 10;
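The failure pattern above can be reduced to a small sketch (table names here are illustrative, not taken from the report); the usual workaround is to cast the constant to the declared column type:

```sql
-- Hypothetical minimal repro: a BIGINT constant (0L) unioned with an INT column.
CREATE VIEW view_orders AS
SELECT oo.o_orderkey, oo.o_custkey
FROM (
  SELECT o_orderkey, o_custkey FROM rcfileorders           -- o_custkey is INT
  UNION ALL
  SELECT o_orderkey, 0L AS o_custkey FROM textfileorders   -- constant is BIGINT
) oo;

-- Filtering on the type-mismatched union column triggers the exception:
SELECT count(1) FROM view_orders WHERE o_custkey < 10;

-- Workaround: make the constant's type match the column explicitly.
-- SELECT o_orderkey, CAST(0 AS INT) AS o_custkey FROM textfileorders
```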
[jira] [Created] (HIVE-11695) HQL "write to LOCAL DIRECTORY" does not throw an exception when the Hive user lacks write permission on the DIRECTORY
WangMeng created HIVE-11695:
---
Summary: HQL "write to LOCAL DIRECTORY" does not throw an exception when the Hive user lacks write permission on the DIRECTORY
Key: HIVE-11695
URL: https://issues.apache.org/jira/browse/HIVE-11695
Project: Hive
Issue Type: Bug
Components: Query Processor
Affects Versions: 1.2.1, 1.1.0, 1.2.0, 1.0.0, 0.14.0, 0.13.0
Reporter: WangMeng
Assignee: WangMeng

For a Hive user who does not have write permission on a LOCAL DIRECTORY such as "/data/wangmeng/", the query "insert overwrite LOCAL DIRECTORY '/data/wangmeng/hiveserver2' ..." does not throw any exception and appears to finish successfully.
[jira] [Created] (HIVE-11149) Thread-unsafe HashMap in PerfLogger.java hangs in a multi-threaded environment
WangMeng created HIVE-11149:
---
Summary: Thread-unsafe HashMap in PerfLogger.java hangs in a multi-threaded environment
Key: HIVE-11149
URL: https://issues.apache.org/jira/browse/HIVE-11149
Project: Hive
Issue Type: Bug
Components: Logging
Affects Versions: 1.2.0
Reporter: WangMeng
Assignee: WangMeng
Fix For: 1.2.0

In a multi-threaded environment, the thread-unsafe HashMap in PerfLogger.java can hang and waste large amounts of CPU and memory.
[jira] [Created] (HIVE-10971) count(*) with count(distinct) gives wrong results when hive.groupby.skewindata=true
wangmeng created HIVE-10971:
---
Summary: count(*) with count(distinct) gives wrong results when hive.groupby.skewindata=true
Key: HIVE-10971
URL: https://issues.apache.org/jira/browse/HIVE-10971
Project: Hive
Issue Type: Bug
Components: Hive
Affects Versions: 1.2.0
Reporter: wangmeng
Assignee: wangmeng

When hive.groupby.skewindata=true, the following query based on TPC-H gives wrong results:

{code}
set hive.groupby.skewindata=true;
select l_returnflag, count(*), count(distinct l_linestatus)
from lineitem
group by l_returnflag
limit 10;
{code}

The query plan shows that it generates only one MapReduce job instead of the two dictated by hive.groupby.skewindata=true. The problem arises only when {noformat}count(*){noformat} and {noformat}count(distinct){noformat} appear together.
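One rewrite that sidesteps the single-job plan is to pre-aggregate the distinct column in a subquery; it produces the same answer without relying on the skew setting (a sketch using the TPC-H names above):

```sql
-- The inner query groups by (l_returnflag, l_linestatus), so the outer query
-- recovers count(*) as sum(cnt) and count(distinct l_linestatus) as count(*).
select l_returnflag,
       sum(cnt) as total_rows,         -- equivalent to count(*)
       count(*) as distinct_statuses   -- equivalent to count(distinct l_linestatus)
from (
  select l_returnflag, l_linestatus, count(*) as cnt
  from lineitem
  group by l_returnflag, l_linestatus
) t
group by l_returnflag;
```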
[jira] [Commented] (HIVE-7822) how to merge two hive metastores' metadata stored in different databases (such as mysql)
[ https://issues.apache.org/jira/browse/HIVE-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133095#comment-14133095 ] wangmeng commented on HIVE-7822:
Thanks for your advice!
(Sent from NetEase Mail mobile. On 2014-09-14 11:21, Xuefu Zhang (JIRA) wrote:)
[ https://issues.apache.org/jira/browse/HIVE-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133064#comment-14133064 ] Xuefu Zhang commented on HIVE-7822:
---
[~wangmeng] JIRA is for reporting issues or requesting features. Questions like the one you presented are better sent to the user list. I'm closing this JIRA.
> how to merge two hive metastores' metadata stored in different databases (such as mysql)
> ---
> Key: HIVE-7822
> URL: https://issues.apache.org/jira/browse/HIVE-7822
> Project: Hive
> Issue Type: Improvement
> Reporter: wangmeng
>
> Hi, what is a good way to merge the metadata of two Hive metastores stored in different databases (such as MySQL)?
> Is there any way to get all historical HQLs from the metastore? I think I need to run those HQLs against the other Hive metastore database again.
> Thanks
[jira] [Commented] (HIVE-3421) Column Level Top K Values Statistics
[ https://issues.apache.org/jira/browse/HIVE-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106624#comment-14106624 ] wangmeng commented on HIVE-3421:
This is very useful! I am looking forward to the coming version.
> Column Level Top K Values Statistics
> ---
> Key: HIVE-3421
> URL: https://issues.apache.org/jira/browse/HIVE-3421
> Project: Hive
> Issue Type: New Feature
> Reporter: Feng Lu
> Assignee: Feng Lu
> Attachments: HIVE-3421.patch.1.txt, HIVE-3421.patch.2.txt, HIVE-3421.patch.3.txt, HIVE-3421.patch.4.txt, HIVE-3421.patch.5.txt, HIVE-3421.patch.6.txt, HIVE-3421.patch.7.txt, HIVE-3421.patch.8.txt, HIVE-3421.patch.9.txt, HIVE-3421.patch.txt
>
> Compute (estimate) top-k value statistics for each column, and put the most skewed column into skewed info if the user hasn't specified skew.
> This feature depends on ListBucketing (create table skewed on): https://cwiki.apache.org/Hive/listbucketing.html.
> All column top-k values can be added to skewed info if, in the future, skewed info supports multiple independent columns.
> The top-k algorithm is based on this paper: http://www.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf
[jira] [Created] (HIVE-7822) how to merge two hive metastores' metadata stored in different databases (such as mysql)
wangmeng created HIVE-7822:
---
Summary: how to merge two hive metastores' metadata stored in different databases (such as mysql)
Key: HIVE-7822
URL: https://issues.apache.org/jira/browse/HIVE-7822
Project: Hive
Issue Type: Improvement
Reporter: wangmeng

Hi, what is a good way to merge the metadata of two Hive metastores stored in different databases (such as MySQL)?
Is there any way to get all historical HQLs from the metastore? I think I need to run those HQLs against the other Hive metastore database again.
Thanks
[jira] [Commented] (HIVE-7292) Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-7292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070044#comment-14070044 ] wangmeng commented on HIVE-7292:
This is a very valuable project!
> Hive on Spark
> ---
> Key: HIVE-7292
> URL: https://issues.apache.org/jira/browse/HIVE-7292
> Project: Hive
> Issue Type: Improvement
> Components: Spark
> Reporter: Xuefu Zhang
> Assignee: Xuefu Zhang
> Attachments: Hive-on-Spark.pdf
>
> Spark, as an open-source data analytics cluster computing framework, has gained significant momentum recently. Many Hive users already have Spark installed as their computing backbone. To take advantage of Hive, they still need to have either MapReduce or Tez on their cluster. This initiative will provide users a new alternative so that they can consolidate their backend.
> Secondly, providing such an alternative further increases Hive's adoption, as it exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
> Finally, allowing Hive to run on Spark also has performance benefits. Hive queries, especially those involving multiple reducer stages, will run faster, thus improving the user experience, as Tez does.
> This is an umbrella JIRA which will cover many coming subtasks. A design doc will be attached here shortly, and will be on the wiki as well. Feedback from the community is greatly appreciated!
[jira] [Updated] (HIVE-7469) skew join keys when two join tables have the same big skew key
[ https://issues.apache.org/jira/browse/HIVE-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangmeng updated HIVE-7469:
---
Description:
From https://issues.apache.org/jira/browse/HIVE-964 I have a general idea of how skewed join keys are handled: the key is to use a map join for the skewed keys. But there is a case which troubles me: what if the two join tables have the same big skewed value? For example, in select * from A join B on A.id = B.id, both A and B have many rows with id=1. If we use a map join for the skewed key id=1, it may OOM. How is this case handled? Will it fall back to a common join? Thanks.

was: In https://issues.apache.org/jira/browse/HIVE-964, I have a general idea about how to deal with skew join key ,but there has a case which troubles me: if the two join tables have the same big skew key on one value : for example , select * from table A join B on A.id=b.id, both table A and B have a lot of keys on id=1, in this case , if we use map join to deal with the skew key id=1 ,maybe itwill OOM. so ,how to fix this case? Will it rollback to common join ? Thanks.

> skew join keys when two join tables have the same big skew key
> ---
> Key: HIVE-7469
> URL: https://issues.apache.org/jira/browse/HIVE-7469
> Project: Hive
> Issue Type: Improvement
> Reporter: wangmeng
>
> From https://issues.apache.org/jira/browse/HIVE-964 I have a general idea of how skewed join keys are handled: the key is to use a map join for the skewed keys. But there is a case which troubles me: what if the two join tables have the same big skewed value? For example, in select * from A join B on A.id = B.id, both A and B have many rows with id=1. If we use a map join for the skewed key id=1, it may OOM. How is this case handled? Will it fall back to a common join? Thanks.
[jira] [Updated] (HIVE-7469) skew join keys when two join tables have the same big skew key
[ https://issues.apache.org/jira/browse/HIVE-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangmeng updated HIVE-7469:
---
Description:
From https://issues.apache.org/jira/browse/HIVE-964 I have a general idea of how skewed join keys are handled: the key point is to use a map join for the skewed keys. But there is a case which troubles me: what if the two join tables have the same big skewed value? For example, in select * from A join B on A.id = B.id, both A and B have many rows with id=1. If we use a map join for the skewed key id=1, it may OOM. How is this case handled? Will it fall back to a common join? Thanks.

was: In https://issues.apache.org/jira/browse/HIVE-964, I have a general idea about how to deal with skew join key , the key is that use mapjoin to deal with skew key, but there has a case which troubles me: if the two join tables have the same big skew key on one value : for example , select * from table A join B on A.id=b.id, both table A and B have a lot of keys on id=1, in this case , if we use map join to deal with the skew key id=1 ,maybe itwill OOM. so ,how to fix this case? Will it rollback to common join ? Thanks.

> skew join keys when two join tables have the same big skew key
> ---
> Key: HIVE-7469
> URL: https://issues.apache.org/jira/browse/HIVE-7469
> Project: Hive
> Issue Type: Improvement
> Reporter: wangmeng
>
> From https://issues.apache.org/jira/browse/HIVE-964 I have a general idea of how skewed join keys are handled: the key point is to use a map join for the skewed keys. But there is a case which troubles me: what if the two join tables have the same big skewed value? For example, in select * from A join B on A.id = B.id, both A and B have many rows with id=1. If we use a map join for the skewed key id=1, it may OOM. How is this case handled? Will it fall back to a common join? Thanks.
[jira] [Updated] (HIVE-7469) skew join keys when two join tables have the same big skew key
[ https://issues.apache.org/jira/browse/HIVE-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangmeng updated HIVE-7469:
---
Description:
From https://issues.apache.org/jira/browse/HIVE-964 I have a general idea of how skewed join keys are handled, but there is a case which troubles me: what if the two join tables have the same big skewed value? For example, in select * from A join B on A.id = B.id, both A and B have many rows with id=1. If we use a map join for the skewed key id=1, it may OOM. How is this case handled? Will it fall back to a common join? Thanks.

was: In https://issues.apache.org/jira/browse/HIVE-964, I have an general idea about how to deal with skew join key ,but there has a case which troubles me: if the two join tables have the same big skew key on one value : for example , select * from table A join B on A.id=b.id, both table A and B have a lot of keys on id=1, in this case , if we use map join to deal with the skew key id=1 ,maybe itwill OOM. so ,how to fix this case? Will it rollback to common join ? Thanks.

> skew join keys when two join tables have the same big skew key
> ---
> Key: HIVE-7469
> URL: https://issues.apache.org/jira/browse/HIVE-7469
> Project: Hive
> Issue Type: Improvement
> Reporter: wangmeng
>
> From https://issues.apache.org/jira/browse/HIVE-964 I have a general idea of how skewed join keys are handled, but there is a case which troubles me: what if the two join tables have the same big skewed value? For example, in select * from A join B on A.id = B.id, both A and B have many rows with id=1. If we use a map join for the skewed key id=1, it may OOM. How is this case handled? Will it fall back to a common join? Thanks.
[jira] [Created] (HIVE-7469) skew join keys when two join tables have the same big skew key
wangmeng created HIVE-7469:
---
Summary: skew join keys when two join tables have the same big skew key
Key: HIVE-7469
URL: https://issues.apache.org/jira/browse/HIVE-7469
Project: Hive
Issue Type: Improvement
Reporter: wangmeng

From https://issues.apache.org/jira/browse/HIVE-964 I have a general idea of how skewed join keys are handled, but there is a case which troubles me: what if the two join tables have the same big skewed value? For example, in select * from A join B on A.id = B.id, both A and B have many rows with id=1. If we use a map join for the skewed key id=1, it may OOM. How is this case handled? Will it fall back to a common join? Thanks.
[jira] [Commented] (HIVE-964) handle skewed keys for a join in a separate job
[ https://issues.apache.org/jira/browse/HIVE-964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069997#comment-14069997 ] wangmeng commented on HIVE-964:
---
If the two join tables have the same big skewed key on one value (for example, select * from table A join B on A.id=B.id, where both A and B have many rows with id=1, so the map join will OOM), how is this case handled? Will it fall back to a common join?
> handle skewed keys for a join in a separate job
> ---
> Key: HIVE-964
> URL: https://issues.apache.org/jira/browse/HIVE-964
> Project: Hive
> Issue Type: Improvement
> Components: Query Processor
> Reporter: Namit Jain
> Assignee: He Yongqiang
> Fix For: 0.6.0
>
> Attachments: hive-964-2009-12-17.txt, hive-964-2009-12-28-2.patch, hive-964-2009-12-29-4.patch, hive-964-2010-01-08.patch, hive-964-2010-01-13-2.patch, hive-964-2010-01-14-3.patch, hive-964-2010-01-15-4.patch
>
> The skewed keys can be written to a temporary table or file, and a follow-up conditional task can be used to perform the join on those keys.
> As a first step, JDBM can be used for those keys.
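For context, the skew-join handling discussed in this thread is driven by a few Hive settings (a sketch; the threshold value below is illustrative, not a recommendation):

```sql
-- Runtime skew join (the HIVE-964 mechanism): keys whose row count exceeds
-- hive.skewjoin.key are set aside and joined in a follow-up job.
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;   -- illustrative threshold

select * from A join B on A.id = B.id;
```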
[jira] [Commented] (HIVE-7296) big data approximate processing at a very low cost based on hive sql
[ https://issues.apache.org/jira/browse/HIVE-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052834#comment-14052834 ] wangmeng commented on HIVE-7296:
Yes, I like it.
> big data approximate processing at a very low cost based on hive sql
> ---
> Key: HIVE-7296
> URL: https://issues.apache.org/jira/browse/HIVE-7296
> Project: Hive
> Issue Type: New Feature
> Reporter: wangmeng
>
> For big data analysis, we often need the following queries and statistics:
> 1. Cardinality estimation: count the number of distinct elements in a collection (such as unique visitors, UV). Hive query: select count(distinct id) from TestTable;
> 2. Frequency estimation: estimate how many times an element occurs, such as the site visits of a user. Hive query: select count(1) from TestTable where name='wangmeng';
> 3. Heavy hitters (top-k elements): such as the top 100 shops. Hive query: select count(1), name from TestTable group by name; (needs a UDF)
> 4. Range query: for example, find the number of users aged between 20 and 30. Hive query: select count(1) from TestTable where age > 20 and age < 30;
> 5. Membership query: for example, is a user name already registered?
> Given Hive's implementation mechanism, these queries cost large amounts of memory and long query times. However, in many cases we do not need very accurate results, and a small error can be tolerated. In such cases, approximate processing can greatly improve time and space efficiency.
> Now, based on some theoretical analysis materials, I would like to work on these new features if possible.
> So, is there anything I can do? Many thanks.
[jira] [Commented] (HIVE-7296) big data approximate processing at a very low cost based on hive sql
[ https://issues.apache.org/jira/browse/HIVE-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045604#comment-14045604 ] wangmeng commented on HIVE-7296:
Sorry, they are different features.
> big data approximate processing at a very low cost based on hive sql
> ---
> Key: HIVE-7296
> URL: https://issues.apache.org/jira/browse/HIVE-7296
> Project: Hive
> Issue Type: New Feature
> Reporter: wangmeng
>
> For big data analysis, we often need the following queries and statistics:
> 1. Cardinality estimation: count the number of distinct elements in a collection (such as unique visitors, UV). Hive query: select count(distinct id) from TestTable;
> 2. Frequency estimation: estimate how many times an element occurs, such as the site visits of a user. Hive query: select count(1) from TestTable where name='wangmeng';
> 3. Heavy hitters (top-k elements): such as the top 100 shops. Hive query: select count(1), name from TestTable group by name; (needs a UDF)
> 4. Range query: for example, find the number of users aged between 20 and 30. Hive query: select count(1) from TestTable where age > 20 and age < 30;
> 5. Membership query: for example, is a user name already registered?
> Given Hive's implementation mechanism, these queries cost large amounts of memory and long query times. However, in many cases we do not need very accurate results, and a small error can be tolerated. In such cases, approximate processing can greatly improve time and space efficiency.
> Now, based on some theoretical analysis materials, I would like to work on these new features if possible.
> I am familiar with Hive and Hadoop, and I have implemented an efficient storage format based on Hive (https://github.com/sjtufighter/Data---Storage--).
> So, is there anything I can do? Many thanks.
[jira] [Updated] (HIVE-7296) big data approximate processing at a very low cost based on hive sql
[ https://issues.apache.org/jira/browse/HIVE-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangmeng updated HIVE-7296:
---
Description:
For big data analysis, we often need the following queries and statistics:
1. Cardinality estimation: count the number of distinct elements in a collection (such as unique visitors, UV). Hive query: select count(distinct id) from TestTable;
2. Frequency estimation: estimate how many times an element occurs, such as the site visits of a user. Hive query: select count(1) from TestTable where name='wangmeng';
3. Heavy hitters (top-k elements): such as the top 100 shops. Hive query: select count(1), name from TestTable group by name; (needs a UDF)
4. Range query: for example, find the number of users aged between 20 and 30. Hive query: select count(1) from TestTable where age > 20 and age < 30;
5. Membership query: for example, is a user name already registered?
Given Hive's implementation mechanism, these queries cost large amounts of memory and long query times. However, in many cases we do not need very accurate results, and a small error can be tolerated. In such cases, approximate processing can greatly improve time and space efficiency.
Now, based on some theoretical analysis materials, I would like to work on these new features if possible.
So, is there anything I can do? Many thanks.
was: For big data analysis, we often need to do the following query and statistics: 1.Cardinality Estimation, count the number of different elements in the collection, such as Unique Visitor ,UV) Now we can use hive-query: Select distinct(id) from TestTable ; 2.Frequency Estimation: estimate number of an element is repeated, such as the site visits of a user 。 Hive query: select count(1) from TestTable where name=”wangmeng” 3.Heavy Hitters, top-k elements: such as top-100 shops Hive query: select count(1), name from TestTable group by name ; need UDF…… 4.Range Query: for example, to find out the number of users between 20 to 30 Hive query : select count(1) from TestTable where age>20 and age <30 5.Membership Query : for example, whether the user name is already registered? According to the implementation mechanism of hive , it will cost too large memory space and a long query time. However ,in many cases, we do not need very accurate results and a small error can be tolerated. In such case , we can use approximate processing to greatly improve the time and space efficiency. Now , based on some theoretical analysis materials ,I want to do some for these new features so much if possible. . I am familiar with hive and hadoop , and I have implemented an efficient storage format based on hive.( https://github.com/sjtufighter/Data---Storage--). So, is there anything I can do ? Many Thanks. 
> big data approximate processing at a very low cost based on hive sql
> ---
> Key: HIVE-7296
> URL: https://issues.apache.org/jira/browse/HIVE-7296
> Project: Hive
> Issue Type: New Feature
> Reporter: wangmeng
>
> For big data analysis, we often need the following queries and statistics:
> 1. Cardinality estimation: count the number of distinct elements in a collection (such as unique visitors, UV). Hive query: select count(distinct id) from TestTable;
> 2. Frequency estimation: estimate how many times an element occurs, such as the site visits of a user. Hive query: select count(1) from TestTable where name='wangmeng';
> 3. Heavy hitters (top-k elements): such as the top 100 shops. Hive query: select count(1), name from TestTable group by name; (needs a UDF)
> 4. Range query: for example, find the number of users aged between 20 and 30. Hive query: select count(1) from TestTable where age > 20 and age < 30;
> 5. Membership query: for example, is a user name already registered?
> Given Hive's implementation mechanism, these queries cost large amounts of memory and long query times. However, in many cases we do not need very accurate results, and a small error can be tolerated. In such cases, approximate processing can greatly improve time and space efficiency.
> Now, based on some theoretical analysis materials, I would like to work on these new features if possible.
> So, is there anything I can do? Many thanks.
[jira] [Updated] (HIVE-7296) big data approximate processing at a very low cost based on hive sql
[ https://issues.apache.org/jira/browse/HIVE-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangmeng updated HIVE-7296:
---
Description:
For big data analysis, we often need the following queries and statistics:
1. Cardinality estimation: count the number of distinct elements in a collection (such as unique visitors, UV). Hive query: select count(distinct id) from TestTable;
2. Frequency estimation: estimate how many times an element occurs, such as the site visits of a user. Hive query: select count(1) from TestTable where name='wangmeng';
3. Heavy hitters (top-k elements): such as the top 100 shops. Hive query: select count(1), name from TestTable group by name; (needs a UDF)
4. Range query: for example, find the number of users aged between 20 and 30. Hive query: select count(1) from TestTable where age > 20 and age < 30;
5. Membership query: for example, is a user name already registered?
Given Hive's implementation mechanism, these queries cost large amounts of memory and long query times. However, in many cases we do not need very accurate results, and a small error can be tolerated. In such cases, approximate processing can greatly improve time and space efficiency.
Now, based on some theoretical analysis materials, I would like to work on these new features if possible.
I am familiar with Hive and Hadoop, and I have implemented an efficient storage format based on Hive (https://github.com/sjtufighter/Data---Storage--).
So, is there anything I can do? Many thanks.
was: For big data analysis, we often need to do the following query and statistics: 1.Cardinality Estimation, count the number of different elements in the collection, such as Unique Visitor ,UV) Now we can use hive-query: Select distinct(id) from TestTable ; 2.Frequency Estimation: estimate number of an element is repeated, such as the site visits of a user 。 Hive query: select count(1) from TestTable where name=”wangmeng” 3.Heavy Hitters, top-k elements: such as top-100 shops Hive query: select count(1), name from TestTable group by name ; need UDF…… 4.Range Query: for example, to find out the number of users between 20 to 30 Hive query : select count(1) from TestTable where age>20 and age <30 5.Membership Query : for example, whether the user name is already registered? According to the implementation mechanism of hive , it will cost too large memory space and a long query time. However ,in many cases, we do not need very accurate results and a small error can be tolerated. In such case , we can use approximate processing to greatly improve the time and space efficiency. Now , based on some theoretical analysis materials ,I want to do some for these new features so much I am familiar with hive and hadoop , and I have implemented an efficient storage format based on hive.( https://github.com/sjtufighter/Data---Storage--). So, is there anything I can do ? Many Thanks. 
> big data approximate processing at a very low cost based on hive sql
>
> Key: HIVE-7296
> URL: https://issues.apache.org/jira/browse/HIVE-7296
> Project: Hive
> Issue Type: New Feature
> Reporter: wangmeng
>
> For big data analysis, we often need the following queries and statistics:
> 1. Cardinality estimation: count the number of distinct elements in a collection, such as Unique Visitors (UV). Hive query: SELECT DISTINCT id FROM TestTable;
> 2. Frequency estimation: estimate how many times an element occurs, such as the number of site visits by one user. Hive query: SELECT COUNT(1) FROM TestTable WHERE name = 'wangmeng';
> 3. Heavy hitters (top-k elements): for example, the top-100 shops. Hive query: SELECT COUNT(1), name FROM TestTable GROUP BY name; (a UDF is needed to keep only the top k)
> 4. Range query: for example, the number of users aged between 20 and 30. Hive query: SELECT COUNT(1) FROM TestTable WHERE age > 20 AND age < 30;
> 5. Membership query: for example, is a given user name already registered?
> Given how Hive executes these queries, they cost too much memory and take too long. However, in many cases we do not need exact results and a small error can be tolerated; in such cases, approximate processing can greatly improve time and space efficiency. Based on some theoretical analysis materials, I would very much like to work on these new features if possible. I am familiar with Hive and Hadoop, and I have implemented an efficient storage format based on Hive (https://github.com/sjtufighter/Data---Storage--). So, is there anything I can do? Many thanks.
-- This message was sent by Atlassian JIRA (v6.2#6252)
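The membership query in item 5 above is the textbook use case for a Bloom filter: a fixed-size bit array that never produces false negatives and yields false positives at a tunable, small rate. Below is a minimal self-contained sketch in Python; the class, the parameter choices, and the sample names are purely illustrative (this is not an existing Hive API — in Hive it would have to be wrapped in a UDF):

```python
import hashlib

class BloomFilter:
    """Probabilistic set membership: no false negatives, small false-positive rate."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from one MD5 digest (16 bytes = 4 chunks of 4).
        digest = hashlib.md5(item.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Illustrative registration check for the "is this user name taken?" query.
bf = BloomFilter()
for name in ["wangmeng", "alice", "bob"]:
    bf.add(name)
```

After this, `bf.might_contain("wangmeng")` is True, while a name that was never added almost certainly returns False; with 8192 bits and only a handful of entries the false-positive probability is negligible.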
[jira] [Updated] (HIVE-7296) big data approximate processing at a very low cost based on hive sql
[ https://issues.apache.org/jira/browse/HIVE-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangmeng updated HIVE-7296:
---
Description:
For big data analysis, we often need the following queries and statistics:
1. Cardinality estimation: count the number of distinct elements in a collection, such as Unique Visitors (UV). Hive query: SELECT DISTINCT id FROM TestTable;
2. Frequency estimation: estimate how many times an element occurs, such as the number of site visits by one user. Hive query: SELECT COUNT(1) FROM TestTable WHERE name = 'wangmeng';
3. Heavy hitters (top-k elements): for example, the top-100 shops. Hive query: SELECT COUNT(1), name FROM TestTable GROUP BY name; (a UDF is needed to keep only the top k)
4. Range query: for example, the number of users aged between 20 and 30. Hive query: SELECT COUNT(1) FROM TestTable WHERE age > 20 AND age < 30;
5. Membership query: for example, is a given user name already registered?
Given how Hive executes these queries, they cost too much memory and take too long. However, in many cases we do not need exact results and a small error can be tolerated; in such cases, approximate processing can greatly improve time and space efficiency. Based on some theoretical analysis materials, I would very much like to work on these new features. I am familiar with Hive and Hadoop, and I have implemented an efficient storage format based on Hive (https://github.com/sjtufighter/Data---Storage--). So, is there anything I can do? Many thanks.
was:
For big data analysis, we often need the following queries and statistics:
1. Cardinality estimation: count the number of distinct elements in a collection, such as Unique Visitors (UV). Hive query: SELECT DISTINCT id FROM TestTable;
2. Frequency estimation: estimate how many times an element occurs, such as the number of site visits by one user. Hive query: SELECT COUNT(1) FROM TestTable WHERE name = 'wangmeng';
3. Heavy hitters (top-k elements): for example, the top-100 shops. Hive query: SELECT COUNT(1), name FROM TestTable GROUP BY name; (a UDF is needed to keep only the top k)
4. Range query: for example, the number of users aged between 20 and 30. Hive query: SELECT COUNT(1) FROM TestTable WHERE age > 20 AND age < 30;
5. Membership query: for example, is a given user name already registered?
Given how Hive executes these queries, they cost too much memory and take too long. However, in many cases we do not need exact results and a small error can be tolerated; in such cases, approximate processing can greatly improve time and space efficiency. Based on some theoretical analysis materials, I would very much like to work on these new features. I am familiar with Hive and Hadoop, and I have implemented an efficient storage format based on Hive (https://github.com/sjtufighter/Data---Storage--). So, is there anything I can do? Many thanks.
> big data approximate processing at a very low cost based on hive sql
>
> Key: HIVE-7296
> URL: https://issues.apache.org/jira/browse/HIVE-7296
> Project: Hive
> Issue Type: New Feature
> Reporter: wangmeng
>
> For big data analysis, we often need the following queries and statistics:
> 1. Cardinality estimation: count the number of distinct elements in a collection, such as Unique Visitors (UV). Hive query: SELECT DISTINCT id FROM TestTable;
> 2. Frequency estimation: estimate how many times an element occurs, such as the number of site visits by one user. Hive query: SELECT COUNT(1) FROM TestTable WHERE name = 'wangmeng';
> 3. Heavy hitters (top-k elements): for example, the top-100 shops. Hive query: SELECT COUNT(1), name FROM TestTable GROUP BY name; (a UDF is needed to keep only the top k)
> 4. Range query: for example, the number of users aged between 20 and 30. Hive query: SELECT COUNT(1) FROM TestTable WHERE age > 20 AND age < 30;
> 5. Membership query: for example, is a given user name already registered?
> Given how Hive executes these queries, they cost too much memory and take too long. However, in many cases we do not need exact results and a small error can be tolerated; in such cases, approximate processing can greatly improve time and space efficiency. Based on some theoretical analysis materials, I would very much like to work on these new features. I am familiar with Hive and Hadoop, and I have implemented an efficient storage format based on Hive (https://github.com/sjtufighter/Data---Storage--). So, is there anything I can do? Many thanks.
-- This message was sent by Atlassian JIRA (v6.2#6252)
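Cardinality estimation (item 1 above) can likewise be done approximately in fixed memory. The sketch below uses linear (probabilistic) counting, a simpler relative of the HyperLogLog family usually cited for this feature: hash every element to one bit of a bitmap, then invert the expected fraction of zero bits. The function name and parameters are illustrative, and it assumes MD5 mixes keys well enough:

```python
import hashlib
import math

def approx_distinct(items, num_bits=4096):
    """Linear counting: hash each item to one bit, then estimate the number of
    distinct items as -m * ln(zero_bits / m)."""
    bitmap = bytearray(num_bits // 8)
    for item in items:
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:8], "big")
        pos = h % num_bits
        bitmap[pos // 8] |= 1 << (pos % 8)
    zero_bits = num_bits - sum(bin(b).count("1") for b in bitmap)
    if zero_bits == 0:
        # Bitmap saturated: the estimate is unreliable; report the capacity.
        return num_bits
    return round(-num_bits * math.log(zero_bits / num_bits))

# 1000 distinct ids, each repeated 5 times: memory stays at 4096 bits
# no matter how many rows stream through.
estimate = approx_distinct(i % 1000 for i in range(5000))
```

With 4096 bits and 1000 distinct keys the standard error is on the order of 1-2%, so `estimate` lands very close to 1000 while the exact `SELECT DISTINCT` answer would require memory proportional to the number of distinct keys.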
[jira] [Created] (HIVE-7296) big data approximate processing at a very low cost based on hive sql
wangmeng created HIVE-7296:
--
Summary: big data approximate processing at a very low cost based on hive sql
Key: HIVE-7296
URL: https://issues.apache.org/jira/browse/HIVE-7296
Project: Hive
Issue Type: New Feature
Reporter: wangmeng

For big data analysis, we often need the following queries and statistics:
1. Cardinality estimation: count the number of distinct elements in a collection, such as Unique Visitors (UV). Hive query: SELECT DISTINCT id FROM TestTable;
2. Frequency estimation: estimate how many times an element occurs, such as the number of site visits by one user. Hive query: SELECT COUNT(1) FROM TestTable WHERE name = 'wangmeng';
3. Heavy hitters (top-k elements): for example, the top-100 shops. Hive query: SELECT COUNT(1), name FROM TestTable GROUP BY name; (a UDF is needed to keep only the top k)
4. Range query: for example, the number of users aged between 20 and 30. Hive query: SELECT COUNT(1) FROM TestTable WHERE age > 20 AND age < 30;
5. Membership query: for example, is a given user name already registered?
Given how Hive executes these queries, they cost too much memory and take too long. However, in many cases we do not need exact results and a small error can be tolerated; in such cases, approximate processing can greatly improve time and space efficiency. Based on some theoretical analysis materials, I would very much like to work on these new features. I am familiar with Hive and Hadoop, and I have implemented an efficient storage format based on Hive (https://github.com/sjtufighter/Data---Storage--). So, is there anything I can do? Many thanks.
-- This message was sent by Atlassian JIRA (v6.2#6252)
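Frequency estimation (item 2 above) is commonly served by a Count-Min sketch: a small 2-D counter table indexed by several hash functions. Its estimates never undercount and overcount only by a bounded amount due to collisions. A hypothetical standalone sketch, not an existing Hive feature (class name, sizes, and sample keys are all illustrative):

```python
import hashlib

class CountMinSketch:
    """Frequency estimation in sublinear space: estimates are always >= the
    true count, and exceed it only through hash collisions."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent-ish hash per row, derived by salting with the row id.
        digest = hashlib.md5(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def estimate(self, item):
        # The minimum across rows is the least-collided, tightest estimate.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

# Illustrative stream: one heavy user plus 1000 light ones.
cms = CountMinSketch()
for _ in range(42):
    cms.add("wangmeng")
for i in range(1000):
    cms.add(f"user-{i}")
```

Here `cms.estimate("wangmeng")` is at least 42 and, with this width, almost exactly 42: the 1000 other keys add under half a count per cell on average, and taking the minimum over four rows suppresses even that.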
[jira] [Commented] (HIVE-7277) how to decide reduce numbers according to the input size of reduce stage rather than the input size of map stage?
[ https://issues.apache.org/jira/browse/HIVE-7277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041761#comment-14041761 ] wangmeng commented on HIVE-7277:
Well, the MapReduce API cannot accommodate this logical plan when generating the physical plan.
--
Best Regards
HomePage: http://wangmeng.us/
Name: Wang Meng --- Data Structures and Algorithms, Java, JVM, Linux, Shell, Distributed Systems, Hadoop, Hive, Performance Optimization and Debugging, Spark/Shark
Major: Software Engineering
Degree: Master
E-mail: sjtufigh...@163.com sjtufigh...@sjtu.edu.cn
GitHub: https://github.com/sjtufighter

> how to decide reduce numbers according to the input size of reduce stage
> rather than the input size of map stage?
> ---
>
> Key: HIVE-7277
> URL: https://issues.apache.org/jira/browse/HIVE-7277
> Project: Hive
> Issue Type: New Feature
> Reporter: wangmeng
> Fix For: 0.13.0
>
> As we know, Hive currently decides the number of reducers simply as the input size of the map stage divided by hive.exec.reducers.bytes.per.reducer (default 1 GB).
> But the output size of the map stage may differ greatly from the original input size, so this strategy for deciding the number of reducers may be improper.
> Is there any feature that can decide the number of reducers according to the output of the map stage? Thanks.
> Actually, the reduce stage begins as soon as some map tasks have finished, rather than after the whole map stage has finished, so it also seems improper to decide the number of reducers only once the whole map stage has finished.
> As someone pointed out, we could estimate the total number of reducers from the output size of the earliest finished map tasks. However, Hive now uses filter pushdown (WHERE clauses), which can make the output of each map task differ greatly, so that estimation is improper too.
> Thanks.
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-7277) how to decide reduce numbers according to the input size of reduce stage rather than the input size of map stage?
[ https://issues.apache.org/jira/browse/HIVE-7277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041742#comment-14041742 ] wangmeng commented on HIVE-7277:
As I understand it, Tez is a new compute engine, different from MapReduce. Is there any solution based on the MapReduce engine?

> how to decide reduce numbers according to the input size of reduce stage
> rather than the input size of map stage?
> ---
>
> Key: HIVE-7277
> URL: https://issues.apache.org/jira/browse/HIVE-7277
> Project: Hive
> Issue Type: New Feature
> Reporter: wangmeng
> Fix For: 0.13.0
>
> As we know, Hive currently decides the number of reducers simply as the input size of the map stage divided by hive.exec.reducers.bytes.per.reducer (default 1 GB).
> But the output size of the map stage may differ greatly from the original input size, so this strategy for deciding the number of reducers may be improper.
> Is there any feature that can decide the number of reducers according to the output of the map stage? Thanks.
> Actually, the reduce stage begins as soon as some map tasks have finished, rather than after the whole map stage has finished, so it also seems improper to decide the number of reducers only once the whole map stage has finished.
> As someone pointed out, we could estimate the total number of reducers from the output size of the earliest finished map tasks. However, Hive now uses filter pushdown (WHERE clauses), which can make the output of each map task differ greatly, so that estimation is improper too.
> Thanks.
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-7277) how to decide reduce numbers according to the input size of reduce stage rather than the input size of map stage?
[ https://issues.apache.org/jira/browse/HIVE-7277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041741#comment-14041741 ] wangmeng commented on HIVE-7277:
As I understand it, Tez is a new compute engine, different from MapReduce. Is there any solution based on the MapReduce engine?

> how to decide reduce numbers according to the input size of reduce stage
> rather than the input size of map stage?
> ---
>
> Key: HIVE-7277
> URL: https://issues.apache.org/jira/browse/HIVE-7277
> Project: Hive
> Issue Type: New Feature
> Reporter: wangmeng
> Fix For: 0.13.0
>
> As we know, Hive currently decides the number of reducers simply as the input size of the map stage divided by hive.exec.reducers.bytes.per.reducer (default 1 GB).
> But the output size of the map stage may differ greatly from the original input size, so this strategy for deciding the number of reducers may be improper.
> Is there any feature that can decide the number of reducers according to the output of the map stage? Thanks.
> Actually, the reduce stage begins as soon as some map tasks have finished, rather than after the whole map stage has finished, so it also seems improper to decide the number of reducers only once the whole map stage has finished.
> As someone pointed out, we could estimate the total number of reducers from the output size of the earliest finished map tasks. However, Hive now uses filter pushdown (WHERE clauses), which can make the output of each map task differ greatly, so that estimation is improper too.
> Thanks.
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HIVE-7277) how to decide reduce numbers according to the input size of reduce stage rather than the input size of map stage?
wangmeng created HIVE-7277:
--
Summary: how to decide reduce numbers according to the input size of reduce stage rather than the input size of map stage?
Key: HIVE-7277
URL: https://issues.apache.org/jira/browse/HIVE-7277
Project: Hive
Issue Type: New Feature
Reporter: wangmeng
Fix For: 0.13.0

As we know, Hive currently decides the number of reducers simply as the input size of the map stage divided by hive.exec.reducers.bytes.per.reducer (default 1 GB).
But the output size of the map stage may differ greatly from the original input size, so this strategy for deciding the number of reducers may be improper.
Is there any feature that can decide the number of reducers according to the output of the map stage? Thanks.
Actually, the reduce stage begins as soon as some map tasks have finished, rather than after the whole map stage has finished, so it also seems improper to decide the number of reducers only once the whole map stage has finished.
As someone pointed out, we could estimate the total number of reducers from the output size of the earliest finished map tasks. However, Hive now uses filter pushdown (WHERE clauses), which can make the output of each map task differ greatly, so that estimation is improper too.
Thanks.
-- This message was sent by Atlassian JIRA (v6.2#6252)
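The heuristic criticized in this issue (reducers = map-stage input size / hive.exec.reducers.bytes.per.reducer) can be sketched as simple arithmetic. The function below illustrates only that formula; Hive's real logic also consults other settings such as hive.exec.reducers.max and mapred.reduce.tasks, and the cap of 999 used here is an assumption, not a quote of Hive's code:

```python
import math

def estimate_reducers(input_bytes, bytes_per_reducer=1 << 30, max_reducers=999):
    """One reducer per bytes_per_reducer of map-stage INPUT (the behavior this
    issue criticizes: map OUTPUT size is never consulted), clamped to
    [1, max_reducers]."""
    reducers = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(reducers, max_reducers))

# 10 GB of map input with the 1 GB default yields 10 reducers,
# regardless of how much data the map stage actually emits.
n = estimate_reducers(10 * (1 << 30))
```

The issue's point follows directly: a highly selective WHERE pushdown can shrink 10 GB of input to a few megabytes of map output, yet this formula still schedules 10 reducers.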