This is an automated email from the ASF dual-hosted git repository.

zabetak pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/hive-site.git

The following commit(s) were added to refs/heads/main by this push:
     new 8c21608  HIVE-29140: Drop comment and save sections from documentation pages (#61)
8c21608 is described below

commit 8c2160824c476a53b74aee27bedaac7595b8157e
Author: Raghav Aggarwal <raghavaggarwal03...@gmail.com>
AuthorDate: Mon Sep 1 19:08:57 2025 +0530

    HIVE-29140: Drop comment and save sections from documentation pages (#61)
---
 .../desingdocs/column-statistics-in-hive.md        |  20 --
 .../desingdocs/hivereplicationdevelopment.md       |  16 --
 content/Development/desingdocs/links.md            | 260 ---------------------
 content/Development/desingdocs/listbucketing.md    |  92 --------
 content/Development/desingdocs/llap.md             |  16 --
 .../desingdocs/skewed-join-optimization.md         |  19 --
 content/community/resources/hive-apis-overview.md  |  14 --
 .../community/resources/unit-testing-hive-sql.md   |  21 --
 .../docs/latest/admin/adminmanual-configuration.md |  10 -
 .../admin/adminmanual-metastore-administration.md  |   6 -
 .../latest/admin/hive-on-spark-getting-started.md  |  15 --
 content/docs/latest/admin/replication.md           |   8 -
 ...nhanced-aggregation-cube-grouping-and-rollup.md |  29 ---
 content/docs/latest/language/languagemanual-ddl.md |  12 -
 .../docs/latest/language/languagemanual-types.md   |   8 -
 content/docs/latest/language/reflectudf.md         |  14 --
 content/docs/latest/language/supported-features.md |  41 ----
 content/docs/latest/user/Hive-Transactions-ACID.md |  10 -
 .../docs/latest/user/configuration-properties.md   |  88 -------
 content/docs/latest/user/hive-transactions.md      |  10 -
 content/docs/latest/user/hiveserver2-clients.md    |  16 --
 content/docs/latest/user/multidelimitserde.md      |  21 --
 content/docs/latest/user/serde.md                  |  52 -----
 23 files changed, 798 deletions(-)

diff --git a/content/Development/desingdocs/column-statistics-in-hive.md b/content/Development/desingdocs/column-statistics-in-hive.md
index ee79af6..e3562e7 100644
--- a/content/Development/desingdocs/column-statistics-in-hive.md
+++ b/content/Development/desingdocs/column-statistics-in-hive.md
@@ -212,23 +212,3 @@ Note that delete_column_statistics is needed to remove the entries from the meta
 
 Note that in V1 of the project, we will support only scalar statistics. Furthermore, we will support only static partitions, i.e., both the partition key and partition value should be specified in the analyze command. In a following version, we will add support for height balanced histograms as well as support for dynamic partitions in the analyze command for column level statistics.
 
-## Comments:
-
-| |
-| --- |
-|
-Shreepadma, is there a jira for this ? Is this ready for review, or is it a initial design ?
-Also, can you go over <https://issues.apache.org/jira/browse/HIVE-3421> and see how the two are related ?
-
- Posted by namit.jain at Sep 14, 2012 00:51
- |
-|
-Namit, This patch is ready for review. There is already a JIRA for this - HIVE-1362. I've the patch on both JIRA and reviewboard. Please note that this goes beyond HIVE-3421 - this patch adds the stats specified on both this wiki and the JIRA page. Thanks.
-
- Posted by shreepadma at Oct 03, 2012 00:46
- |
-
-
-
-
-
diff --git a/content/Development/desingdocs/hivereplicationdevelopment.md b/content/Development/desingdocs/hivereplicationdevelopment.md
index 61ae593..627d7b8 100644
--- a/content/Development/desingdocs/hivereplicationdevelopment.md
+++ b/content/Development/desingdocs/hivereplicationdevelopment.md
@@ -190,19 +190,3 @@ In the future, additional work should be done in the following areas:
 
 [1] Kemme, B., et al., "Database Replication: A Tutorial," in *Replication: Theory and Practice*, B. Charron-Bost et al., Eds. Berlin, Germany: Springer, 2010, pp. 219-252.
 
-Save - -Save - -Save - -Save - -Save - -Save - - - - - diff --git a/content/Development/desingdocs/links.md b/content/Development/desingdocs/links.md index a49de09..1780ab2 100644 --- a/content/Development/desingdocs/links.md +++ b/content/Development/desingdocs/links.md @@ -111,263 +111,3 @@ By definition, there isn't a one-one mapping between a view partition and a tabl The above notes make it clear that what we are trying to build is a very special case of a degenerate view, and it would be cleaner to introduce a new concept in Hive to model these ‘imports’. -## Comments: - -| | -| --- | -| -Questions from Ashutosh Chauhan (with inline responses): -* What exactly is contained in tracking capacity usage. One is disk space. That I presume you are going to track via summing size under database directory. Are you also thinking of tracking resource usage in terms of CPU/memory/network utilization for different teams? -*Right now the capacity usage in Hive we will track is the disk space (managed tables that belong to the namespace + imported tables). We will track the mappers and reducers that the namepace utilizes directly from Hadoop.* - -* Each namespace (ns) will have exactly one database. If so, then users are not allowed to create/use databases in such deployment? Not necessarily a problem, just trying to understand design. -*This is correct – this is a limitation of the design. Introducing a new concept seemed heavyweight, so we re-used databases for namespaces. But this approach means that a given namespace cannot have sub-databases in it.* - -* How are you going to keep metadata consistent across two ns? If metadata gets updated in remote ns, will it get automatically updated in user's local ns? If yes, how will this be implemented? If no, then every time user need to use data from remote ns, she has to bring metadata uptodate in her ns. How will she do it? -*Metadata will be kept in sync for linked tables. 
We will make alter table on the remote table (source of the link) cause an update to the target of the link. Note that from a Hive perspective, the metadata for the source and target of a link is in the same metastore.* - -* Is it even required that metadata of two linked tables to be consistent? Seems like user has to run "alter link add partition" herself for each partition. She can choose only to add few partitions. In this case, tables in two ns have different number of partitions and thus data. -*What you say above is true for static links. For dynamic links, add and drop partition on the source of the link will cause the target to get those partitions as well (we trap alter table add/drop partition to provide this behavior).* - -* Who is allowed to create links? -*Any user on the database who has create/all privileges on the database. We could potentially create a new privilege for this, but I think create privilege should suffice. We can similarly map alter, drop privileges to the appropriate operations.* - -* Once user creates a link, who can use it? If everyone is allowed to access, then I don't see how is it different from the problem that you are outlining in first alternative design option, wherein user having an access to two ns via roles has access to data on both ns. -*The link creates metadata in the target database. So you can only access data that has been linked into this database (access is via the T@Y or Y.T syntax depending on the chosen design option). Note that this is different than having a role that a user maps to since in that case, there is no local metadata in the target database specifying if the imported data is accessible from this database.* - -* If links are first class concepts, then authorization model also needs to understand them? I don't see any mention of that. 
-*Yes, we need to account for the authorization model.* - -* I see there is a hdfs jira for implementing hard links of files in hdfs layer, so that takes care of linking physical data on hdfs. What about tables whose data is stored in external systems. For example, hbase. Does hbase also needs to implement feature of hard-linking their table for hive to make use of this feature? What about other storage handlers like cassandra, mongodb etc. -*The link does not create a link on HDFS. It just points to the source table/partitions. One can think of it as a Hive-level link so there is no need for any changes/features from the other storage handlers.* - -* Migration will involve two step process of distcp'ing data from one cluster to another and then replicating one mysql instance to another. Are there any other steps? Do you plan to (later) build tools to automate this process of migration. -*We will be building tools to enable migration of a namespace to another cluster. Migration will involve replicating the metadata and the data as you mention above.* - -* When migrating ns from one datacenter to another, will links be dropped or they are also preserved? -*We will preserve them – by copying the data for the links to the other datacenter.* - - Posted by sambavi at May 22, 2012 02:10 - | -| -The first draft of this proposal is very hard to decipher because it relies on terms that aren't well defined. For example, here's the second sentence from the motivations section: -Growth beyond a single warehouse (or) separation of capacity usage and allocation requires the creation of multiple physical warehouses, i.e., separate Hive instances. -What's the difference between a *warehouse* and a *physical warehouse*? How do you define a *Hive instance*? In the requirements section the term *virtual warehouse* is introduced and equated to a namespace, but clearly it's more than that because otherwise DBs/Schemas would suffice. 
Can you please update the proposal to include definitions of these terms? - - Posted by cwsteinbach at May 22, 2012 18:35 - | -| -Prevent access using two part name syntax (Y.T) if namespaces feature is "on" in a Hive instance. This ensures the database is self-contained. -The cross-namespace Hiveconf ACL proposed in HIVE-3016 doesn't prevent anyone from doing anything because there is no way to keep users from disabling it. I'm surprised to see this ticket mentioned here since three committers have already gone on record saying that this is the wrong approach, and one committer even -1'd it. If preventing cross-db references in queries is a requirement for this project, then I think Hive's authorization mechanism will need to be extended to support this p [...] - - Posted by cwsteinbach at May 22, 2012 18:48 - | -| -From the design section: -We are building a namespace service external to Hive that has metadata on namespace location across the Hive instances, and allows importing data across Hive instances using replication. -Does the work proposed in HIVE-2989 also include adding this Db/Table replication infrastructure to Hive? - - Posted by cwsteinbach at May 22, 2012 18:53 - | -| -We mention the JIRA here for the sake of completeness. We are implementing this as a pre-execution hook for now, but support for namespaces will be incomplete without this control (since you can't guarantee self-contained namespaces unless you prevent two-part name access). -What extensions to the authorization system are you thinking of? One idea would be to set role for a session (corresponding to the namespace the user is operating in), so that a user operating in the context of that role can only see the data available to that role. - - Posted by sambavi at May 22, 2012 19:42 - | -| -What extensions to the authorization system are you thinking of? 
-Add a new privilege named something like "select_cross_db" and GRANT it to specific users as follows: -GRANT select_cross_db ON DATABASE db TO USER x; -GRANT select_cross_db ON DATABASE db to ROLE x; -This privilege would be provided by default, but if absent, then the user would be prevented from referencing DBs outside of 'db' while using 'db' as the primary database. - - Posted by cwsteinbach at May 22, 2012 20:01 - | -| -Thanks Carl - this is an interesting suggestion - for installations with namespaces, we would need to turn this privilege off by default and have no users. groups or roles be granted this privilege. We'll discuss internally. - - Posted by sambavi at May 22, 2012 20:25 - | -| -I've edited the main page to include a definition for physical warehouse and removed the term Hive instance to reduce ambiguity. - - Posted by sambavi at May 22, 2012 20:26 - | -| -No Hive-2989 will not include the replication infrastructure. We plan to provide replication in the second half of the year. - - Posted by sambavi at May 23, 2012 10:30 - | -| -I have opened a JIRA for adding a new privilege for cross database commands and resolved HIVE-3016. Thanks for the suggestion! - - Posted by sambavi at May 23, 2012 12:36 - | -| -This proposal describes the DDL and metadata changes that are necessary to support DB Links, but it doesn't include any details about the mechanics of replicating data across clusters (it's probably a little more complicated than just running distcp). I think the proposal needs to include these details before it can be considered complete. -No Hive-2989 will not include the replication infrastructure. We plan to provide replication in the second half of the year. -The metadata extensions described in this proposal will require users to run metastore upgrade scripts, and the DDL extensions will become part of the public API. 
The former imposes a burden on users, and the latter constitutes a continuing maintenance burden on the people who contribute to this project. Taking this into account I think we need to be able to demonstrate that the new code tangibly benefits users before it appears in a Hive release. I don't think it will be possible to dem [...] - - Posted by cwsteinbach at May 25, 2012 19:21 - | -| -Some thoughts on this from our team: -1. Even in a single physical warehouse, namespaces allow better quota management and isolation between independent team's data/workload. This is independent of security and replication considerations. -2. Further, replication as a feature is pretty big and will take a while to be built out. We can hide the table link feature behind a config parameter so that its not exposed to users who don't need it until its completed. The only piece we cannot hide is the metastore changes, but the upgrade script for the metastore will just add few columns in few tables, and should not take more than a few minutes even for a pretty large warehouse (few thousand tables + ~100,000 partitions). In the m [...] -If there was links support to start with in hive, we would have used it from the very beginning, and not gotten into the mess of one large warehouse with difficulty in dealing with multi-tenancy. We seriously believe that this is the right direction for the community, and all new users can design the warehouse in the right way from the very start, and learn from Facebook's experience. - - Posted by sambavi at May 25, 2012 21:37 - | -| -Even in a single physical warehouse, namespaces allow better quota management and isolation between independent team's data/workload. This is independent of security and replication considerations. -Hive already provides support for namespaces in the form of databases/schemas. 
As far as I can tell the table link feature proposed here actually weakens the isolation guarantees provided by databases/schemas, and consequently will make quota and workload management between teams more complicated. In order to resolve this disagreement I think it would help if you provided some concrete examples of how the table link feature improves this situation. - - Posted by cwsteinbach at May 29, 2012 13:54 - | -| -Suppose there are 2 teams which want to use the same physical warehouse. -Team 1 wants to use the following: (let us say that each table is partitioned by date) -T1 (all partitions) -T2 (all partitions) -T3 (all partitions) -Team 2 wants to use the following: -T1 (partitions for the last 3 days) -T2 (partition for a fixed day: say May 29' 2012) -T4 (all partitions) -Using the current hive architecture, we can perform the following: -* Use a single database and have scripts for quota -* Use 2 databases and copy the data in both the databases (say, the databases are DB1 and DB2 respectively) -* Use 2 databases, and use views in database 2 (to be used to team 2). - -The problems with these approaches is as follows: -* Table Discovery etc.becomes very messy. You can do that via tags, but then, all the functionality that is provided by databases - can also be provided via tags. -* Duplication of data -* The user will have to perform the management himself. When a partition gets added to DB1.T1, the corresponding partition needs to be added to DB2.View1, and the 3 day old partition from DB2.View1 needs to be dropped. This has to be done outside hive, and makes -the task of maintaining these partitions very difficult - how do you make sure this is atomic etc. User has to do lot more scripting. - -Links is a degenerate case of views. With links, the above use case can be solved very easily. -This is a real use case at Facebook today, and I think, there will be similar use cases for other users. 
Maybe, they are not solving it in the most optimal manner currently. - - Posted by namit.jain at May 29, 2012 18:33 - | -| -Furthermore, databases don't provide isolation today since two part name access is unrestricted. By introducing a trackable way of accessing content outside the current database (using table links), we get isolation for a namespace using Hive databases. - - Posted by sambavi at May 30, 2012 01:22 - | -| -@Namit: Thanks for providing an example. I have a couple followup questions: -Use a single database and have scripts for quota -Can you provide some more details about how quota management works? For example, Team 1 and 2 both need access to a single partition in table T2, so who pays for the space occupied by this partition? Are both teams charged for it? -Use 2 databases and copy the data in both the databases (say, the databases are DB1 and DB2 respectively) -I'm not sure why this is listed as an option. What does this actually accomplish? -Use 2 databases, and use views in database 2 (to be used to team 2). -This seems like the most reasonable approach given my current understanding of your use case. -You listed three problems with these approaches. Most of them don't seem applicable to views: -Table Discovery etc.becomes very messy. You can do that via tags, but then, all the functionality that is provided by databases can also be provided via tags. -It's hard for me to evaluate this claim since I'm not totally sure what is meant by "table discovery". Can you please provide an example? However, my guess is that this is not a differentiator if you're comparing the table links approach to views. -Duplication of data -Not applicable for views and table links. -The user will have to perform the management himself. When a partition gets added to DB1.T1, the corresponding partition needs to be added to DB2.View1, and the 3 day old partition from DB2.View1 needs to be dropped. 
-Based on the description of table links in HIVE-2989 it sounds like the user will still have to perform manual management even with table links, e.g. dropping the link that points to the partition from four days ago and adding a new link that points to the most recent partition. In this case views may actually work better since you can embed the filter condition (last three days) in the view definition instead of relying on external tools to update the table links. -This has to be done outside hive, and makes the task of maintaining these partitions very difficult - how do you make sure this is atomic etc. User has to do lot more scripting. -I don't think table links make this process atomic, and as I mentioned above the process of maintaining this linked set of partitions actually seems easier if you use views instead. -Links is a degenerate case of views. -I agree that table links are a degenerate case of views. Since that's the case, why is it necessary to implement table links? Why not leverage the functionality that is already provided with views? - - Posted by cwsteinbach at May 31, 2012 16:45 - | -| -Furthermore, databases don't provide isolation today since two part name access is unrestricted. -DBs in conjunction with the authorization system provide strong isolation between different namespaces. Also, it should be possible to extend the authorization system to the two-part-name-access case that you described above (e.g. [HIVE-3047](https://issues.apache.org/jira/browse/HIVE-3047)). -By introducing a trackable way of accessing content outside the current database (using table links), we get isolation for a namespace using Hive databases. -I think you already get that by using views. If I'm wrong can you please explain how the view approach falls short? Thanks. - - Posted by cwsteinbach at May 31, 2012 16:59 - | -| -Carl: I've addressed your questions below. ->>Can you provide some more details about how quota management works? 
For example, Team 1 and 2 both need access to a single partition in table T2, so who pays for the space occupied by this partition? Are both teams charged for it? -If the partition is shared, it is accounted towards both their quota (base quota for the team that owns the partition, and imported quota for the team that imports it via a link). The reason for this is that when a namespace is moved to another datacenter, we have to account for all the quota (both imported and base) as belonging to the namespace (the data can no longer be shared directly via a link, and we will need to replicate it). ->>Use 2 databases and copy the data in both the databases (say, the databases are DB1 and DB2 respectively) ->> I'm not sure why this is listed as an option. What does this actually accomplish? -It was just one way of achieving the same data being available in the two namespaces. You can ignore this one (smile) ->>It's hard for me to evaluate this claim since I'm not totally sure what is meant by "table discovery". Can you please provide an example? However, my guess is that this is not a differentiator if you're comparing the table links approach to views. -I think Namit meant this in reference to the design option of using a single database and using scripts for quota management. In the case of views, due to views being opaque, it will be hard to see which tables are imported into the namespace. ->>Based on the description of table links in HIVE-2989 it sounds like the user will still have to perform manual management even with table links, e.g. dropping the link that points to the partition from four days ago and adding a new link that points to the most recent partition. In this case views may actually work better since you can embed the filter condition (last three days) in the view definition instead of relying on external tools to update the table links. -Maybe the description was unclear. Table links have two types: static and dynamic. 
Static links behave the way you describe, but dynamic links will have addition and drop of partitions when the source table (of the link) has partitions added or removed from it. ->>I don't think table links make this process atomic, and as I mentioned above the process of maintaining this linked set of partitions actually seems easier if you use views instead. -Addressed this above - table links do keep the links updated when the source of the link has partitions added or dropped. This will be atomic since it is done in one metastore operation during an ALTER TABLE ADD/DROP PARTITION command. ->>I agree that table links are a degenerate case of views. Since that's the case, why is it necessary to implement table links? Why not leverage the functionality that is already provided with views? -Table links allow for better accounting of imported data (views are opaque), single instancing of imports and partition pruning when the imports only have some of the partitions of the source table of the link. Given this, it seems ideal to introduce table links as a concept rather than overload views. - - Posted by sambavi at May 31, 2012 18:09 - | -| -Hi Carl, I explained how views fall short in the post below (response to your comments on Namit's post). Please add any more questions you have - I can explain further if unclear. - - Posted by sambavi at May 31, 2012 18:11 - | -| -I think Namit meant this in reference to the design option of using a single database and using scripts for quota management. In the case of views, due to views being opaque, it will be hard to see which tables are imported into the namespace. -Views are not opaque. 
DESCRIBE FORMATTED currently includes the following information: - -``` -# View Information -View Original Text: SELECT value FROM src WHERE key=86 -View Expanded Text: SELECT `src`.`value` FROM `src` WHERE `src`.`key`=86 - -``` - -Currently the metastore only tracks the original and expanded text view query, but it would be straightforward to also extract and store the list of source tables that are referenced in the query when the view is created (in fact, there's already a JIRA ticket for this ([HIVE-1073](https://issues.apache.org/jira/browse/HIVE-1073)), and the information is already readily available internally as described [here](https://cwiki.apache.org/confluence/display/Hive/PartitionedViews#PartitionedV [...] -Maybe the description was unclear. Table links have two types: static and dynamic. Static links behave the way you describe, but dynamic links will have addition and drop of partitions when the source table (of the link) has partitions added or removed from it. -I don't think dynamic table links satisfy the use case covered by Team 2's access requirements for table T1. Team 2 wants to see only the most recent three partitions in table T1, and my understanding of dynamic table links is that once the link is created, Team 2 will subsequently see every new partition that is added to the source table. In order to satisfy Team 2's requirements I think you're going to have to manually add and drop partitions from the link using the ALTER LINK ADD/DROP [...] -The functionality provided by dynamic links does make sense in some contexts, but the same is true for dynamic partitioned views. Why not extend the partitioned view feature to support dynamic partitions? -Table links allow for better accounting of imported data (views are opaque), single instancing of imports and partition pruning when the imports only have some of the partitions of the source table of the link. 
Given this, it seems ideal to introduce table links as a concept rather than overload views. -I addressed the "views are opaque" argument above. I'm having trouble following the rest of the sentence. What does "single instancing of imports" mean? If possible can you provide an example in terms of table links and partitioned views? - - Posted by cwsteinbach at May 31, 2012 22:27 - | -| -Going back to my example: -Suppose there are 2 teams which want to use the same physical warehouse. -Team 1 wants to use the following: (let us say that each table is partitioned by date) -T1 (all partitions) -T2 (all partitions) -T3 (all partitions) -Team 2 wants to use the following: -T1 (partitions for the last 3 days) -T2 (partition for a fixed day: say May 29' 2012) -T4 (all partitions) -Using the current hive architecture, we can perform the following: -* Use a single database and have scripts for quota -* Use 2 databases and copy the data in both the databases (say, the databases are DB1 and DB2 respectively) -* Use 2 databases, and use views in database 2 (to be used to team 2). - -We have discarded the first 2 approaches above, so let us discuss how will we use approach 3 (specifically for T1). -Team 2 will create the view: create view V1T1 as select * from DB1.T1 -Now, whenever a partition gets added in DB1.T1, someone (a hook or something - outside hive) needs to add the corresponding partition in V1T1. -That extra layer needs to make sure that the new partition in V1T1 is part of the inputs (may be necessary for auditing etc.) -Hive metastore has no knowledge of this dependency (view partition -> table partition), and it is maintained in multiple places (for possibly -different teams). -The same argument applies when a partition gets dropped from DB1.T1. -By design, there is no one-to-one dependency between a table partition and a view partition, and we do not want to create such a dependency. -The view may depend on multiple tables/partitions. 
The views in hive are not updatable. -By design, the schema of the view and the underlying table(s) can be different. -Links provide the above functionality. If I understand right, you are proposing to extend views to support the above functionality. We will end up -with a very specific model for a specific type of views, which are not like normal hive views. That would be more confusing, in my opinion. - - Posted by namit.jain at Jun 01, 2012 14:25 - | -| -Please comment - we haven't gotten any updates on the wiki as well as the jira <https://issues.apache.org/jira/browse/HIVE-2989> - - Posted by namit.jain at Jun 02, 2012 19:35 - | - - - - - diff --git a/content/Development/desingdocs/listbucketing.md b/content/Development/desingdocs/listbucketing.md index f9d91e2..87fd2aa 100644 --- a/content/Development/desingdocs/listbucketing.md +++ b/content/Development/desingdocs/listbucketing.md @@ -215,95 +215,3 @@ List bucketing was added in Hive 0.10.0 and 0.11.0. For more information, see [Skewed Tables in the DDL document]({{< ref "#skewed-tables-in-the-ddl-document" >}}). -## Comments: - -| | -| --- | -| -Does this feature require any changes to the metastore? If so can you please describe them? Thanks. - - Posted by cwsteinbach at Jun 11, 2012 15:13 - | -| -Please also describe any changes that will be made to public APIs including the following: -* The metastore and/or HiveServer Thrift interfaces (note that this includes overloading functions that are already included in the current Thrift interfaces, as well as modifying or adding new Thrift structs/objects). -* Hive Query Language, including new commands, extensions to existing commands, or changes to the output generated by commands (e.g. DESCRIBE FORMATTED TABLE). -* New configuration properties. 
-* Modifications to any of the public plugin APIs including SerDes and Hook/Listener interfaces, - -Also, if this feature requires any changes to the Metastore schema, those changes should be described in this document. -Finally, please describe your plan for implementing this feature and getting it committed. Will it go in as a single patch or be split into several different patches. - - Posted by cwsteinbach at Jun 12, 2012 01:47 - | -| -Yes, it requires metastore change. -We want to store the following information in metastore: -1. skewed column names -2. skewed column values -3. mappings from skewed column value to directories. -The above 3 will be added to MStorageDescriptor.java etc - - Posted by gangtimliu at Jun 14, 2012 12:47 - | -| -Yes, I will update document with any changes in the areas you mention. -Here is plan: -1. Implement End-to-end feature for single skewed column (DDL+DML) and go in as a single patch. -2. Implement End-to-end feature for multiple skewed columns (DDL+DML) and go in as a single patch. -3. Implement follow-ups and go in as a single patch. -The #3 is a slot for those not critical but nice to have and not in #1 & #2 due to resource constraints etc. - - Posted by gangtimliu at Jun 14, 2012 12:55 - | -| -It wasn't clear to me from this wiki page what the benefit is of storing the skewed values "as directories" over just storing them as files as regular skew tables do? Tim, could you please elaborate on that? - - Posted by mgro...@oanda.com at Nov 07, 2012 11:23 - | -| -Different terms but refer to the same thing: create sub directory for skewed value and store record in file. -Note that regular skew table doesn't create sub directory. It's different from non-skewed table because it has meta-data of skewed column name and values so that feature like skewed join can leverage it. -Only list bucketing table creates sub directory for skewed-value. We use "stored as directories" to mark it. -Hope it helps. 
- - Posted by gangtimliu at Nov 07, 2012 12:49 - | -| -Tim, thanks for responding but I am still missing something. I re-read the wiki page and here is my understanding. Please correct me if I am wrong. -Let's take a hand-wavy example. -Skewed table: -create table t1 (x string) skewed by (error) on ('a', 'b') partitioned by dt location '/user/hive/warehouse/t1'; -will create the following files: -/user/hive/warehouse/t1/dt=something/x=a.txt -/user/hive/warehouse/t1/dt=something/x=b.txt -/user/hive/warehouse/t1/dt=something/default -List bucketing table: -create table t2 (x string) skewed by (error) on ('a', 'b') partitioned by dt location '/user/hive/warehouse/t2' ; -will create the following files: -/user/hive/warehouse/t2/dt=something/x=a/data.txt -/user/hive/warehouse/t2/dt=something/x=b/data.txt -/user/hive/warehouse/t2/dt=something/default/data.txt -Is that correct? -In that case, why would a user ever choose to create sub-directories? Skewed joins would perform just well for regular skewed tables or list bucketing tables. Given that list bucketing introduces sub-directories it imposes restrictions on what other things users can and cannot do while regular skewed tables don't. So what would be someone's motivation to choose list bucketing over skewed tables? - - Posted by mgro...@oanda.com at Nov 09, 2012 00:11 - | -| -sorry for confusion. wiki requires polish to make it clear. -I assume t2 has stored as directories. -t1 doesn't have sub-directories but t2 has sub-directories. Directory structure looks like: -/user/hive/warehouse/t1/dt=something/data.txt -/user/hive/warehouse/t2/dt=something/x=a/data.txt -/user/hive/warehouse/t2/dt=something/x=b/data.txt -/user/hive/warehouse/t2/dt=something/default/data.txt -"stored as directories" tells hive to create sub-directories. -what's use case of t1? t1 can be used for skewed join since t1 has skewed column and value information. 
- - Posted by gangtimliu at Nov 09, 2012 01:55 - | - - - - - diff --git a/content/Development/desingdocs/llap.md b/content/Development/desingdocs/llap.md index dcda0ee..d00fc33 100644 --- a/content/Development/desingdocs/llap.md +++ b/content/Development/desingdocs/llap.md @@ -190,22 +190,6 @@ The watch and running nodes options were added in release 2.2.0 with [HIVE-15217 [Hive Contributor Meetup Presentation](https://cwiki.apache.org/confluence/download/attachments/27362054/LLAP-Meetup-Nov.ppsx?version=1&modificationDate=1447885307000&api=v2) -Save - -Save - -Save - -Save - -Save - -Save - -Save - -Save - ## Attachments:  diff --git a/content/Development/desingdocs/skewed-join-optimization.md b/content/Development/desingdocs/skewed-join-optimization.md index 1db084e..95d3995 100644 --- a/content/Development/desingdocs/skewed-join-optimization.md +++ b/content/Development/desingdocs/skewed-join-optimization.md @@ -49,22 +49,3 @@ The assumption is that B has few rows with keys which are skewed in A. So these *Implementation:* Starting in Hive 0.10.0, tables can be created as skewed or altered to be skewed (in which case partitions created after the ALTER statement will be skewed). In addition, skewed tables can use the list bucketing feature by specifying the STORED AS DIRECTORIES option. See the DDL documentation for details: [Create Table]({{< ref "#create-table" >}}), [Skewed Tables]({{< ref "#skewed-tables" >}}), and [Alter Table Skewed or Stored as Directories]({{< ref "#alter-table-sk [...] -## Comments: - -| | -| --- | -| -Is this proposal ready for review? 
- - Posted by cwsteinbach at May 31, 2012 21:27 - | -| -yes - - Posted by namit.jain at Jun 01, 2012 21:07 - | - - - - - diff --git a/content/community/resources/hive-apis-overview.md b/content/community/resources/hive-apis-overview.md index 0c7f21a..ff33217 100644 --- a/content/community/resources/hive-apis-overview.md +++ b/content/community/resources/hive-apis-overview.md @@ -67,17 +67,3 @@ Operation based Java API focused on mutating (insert/update/delete) records into JDBC API supported by Hive. It supports most of the functionality in JDBC spec. -## Comments: - -| | -| --- | -| -Page created after [an interesting discussion](https://issues.apache.org/jira/browse/HIVE-12285?focusedCommentId=14981551&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14981551). - - Posted by teabot at Oct 30, 2015 17:09 - | - - - - - diff --git a/content/community/resources/unit-testing-hive-sql.md b/content/community/resources/unit-testing-hive-sql.md index 1dc15d8..eb27414 100644 --- a/content/community/resources/unit-testing-hive-sql.md +++ b/content/community/resources/unit-testing-hive-sql.md @@ -114,24 +114,3 @@ The following Hive specific practices can be used to make processes more amenabl Although not specifically related to Hive SQL, tooling exists for the testing of other aspects of the Hive ecosystem. In particular the [BeeJU](https://github.com/HotelsDotCom/beeju) project provides JUnit rules to simplify the testing of integrations with the Hive Metastore and HiveServer2 services. These are useful, if for example, you are developing alternative data processing frameworks or tools that aim to leverage Hive's metadata features. - - -## Comments: - -| | -| --- | -| -Disclosure: The tools are listed according to level of experience I have with each tool, HiveRunner being the tool that I have used the most. Furthermore, I have previously contributed to the HiveRunner project. I've also been involved with the BeeJU project. 
- - Posted by teabot at Nov 11, 2015 10:20 - | -| -Where does [Capybara](https://cwiki.apache.org/confluence/download/attachments/27362054/CapybaraHiveMeetupNov2015.pptx) fit into this (it at all)? - - Posted by teabot at Dec 03, 2015 11:30 - | - - - - - diff --git a/content/docs/latest/admin/adminmanual-configuration.md b/content/docs/latest/admin/adminmanual-configuration.md index 94237f6..e92a597 100644 --- a/content/docs/latest/admin/adminmanual-configuration.md +++ b/content/docs/latest/admin/adminmanual-configuration.md @@ -212,13 +212,3 @@ For Hive releases prior to 0.11.0, see the "Thrift Server Setup" section in the For information about configuring WebHCat, see [WebHCat Configuration]({{< ref "webhcat-configure" >}}). - - - - -Save - - - - - diff --git a/content/docs/latest/admin/adminmanual-metastore-administration.md b/content/docs/latest/admin/adminmanual-metastore-administration.md index f8f1bd3..5cd9394 100644 --- a/content/docs/latest/admin/adminmanual-metastore-administration.md +++ b/content/docs/latest/admin/adminmanual-metastore-administration.md @@ -203,9 +203,3 @@ To suppress the schema check and allow the metastore to implicitly modify the sc Starting in release 0.12, Hive also includes an off-line schema tool to initialize and upgrade the metastore schema. Please refer to the details [here]({{< ref "hive-schema-tool" >}}). -Save - - - - - diff --git a/content/docs/latest/admin/hive-on-spark-getting-started.md b/content/docs/latest/admin/hive-on-spark-getting-started.md index 396224d..fb6f64f 100644 --- a/content/docs/latest/admin/hive-on-spark-getting-started.md +++ b/content/docs/latest/admin/hive-on-spark-getting-started.md @@ -238,19 +238,4 @@ See [Spark section of configuration page](https://cwiki.apache.org/confluence/di  [attachments/44302539/53575687.pdf](/attachments/44302539/53575687.pdf) (application/pdf) - - -## Comments: - -| | -| --- | -| -Spark has its own property to control whether to merge small files. 
Set hive.merge.sparkfiles=true to merge small files. - - Posted by lirui at Jan 15, 2015 01:34 - | - - - - diff --git a/content/docs/latest/admin/replication.md b/content/docs/latest/admin/replication.md index 0d91db4..64f5329 100644 --- a/content/docs/latest/admin/replication.md +++ b/content/docs/latest/admin/replication.md @@ -80,11 +80,3 @@ At this time it is not possible to replicate to tables on EMR that have a path l </property> ``` - - -Save - - - - - diff --git a/content/docs/latest/language/enhanced-aggregation-cube-grouping-and-rollup.md b/content/docs/latest/language/enhanced-aggregation-cube-grouping-and-rollup.md index bed85aa..13f7cca 100644 --- a/content/docs/latest/language/enhanced-aggregation-cube-grouping-and-rollup.md +++ b/content/docs/latest/language/enhanced-aggregation-cube-grouping-and-rollup.md @@ -172,32 +172,3 @@ For the first row, none of the columns are being selected. For the second row, only the first column is being selected, which explains the count of 2. For the third row, both the columns are being selected (and the second column happens to be null), which explains the count of 1. -## Comments: - -| | -| --- | -| -Is there really much value-add in the grouping sets grammar? If I think about the plan for generating a CUBE/ROLLUP (), it's pretty much as efficient as generating the CUBE and then sub-selecting what you need from it. -Can we just provide CUBE and ROLLUP and not provide the additional syntax? - - Posted by sambavi at Sep 21, 2012 15:32 - | -| -Depends on what the use case is. -By sub-selecting for the right grouping set, we would be passing more data across the map-reduce boundaries. -I have started a prototype implementation, and the work for grouping set should not be substantially more than -a cube or a rollup. We can stage it, and implement GROUPING_ID later, on demand. - - Posted by namit.jain at Sep 25, 2012 06:50 - | -| -I can only implement CUBE and ROLLUP first, but keep the execution layer general. 
-It will only require parser changes to plug in grouping sets, if need be, later. - - Posted by namit.jain at Sep 25, 2012 07:16 - | - - - - - diff --git a/content/docs/latest/language/languagemanual-ddl.md b/content/docs/latest/language/languagemanual-ddl.md index 700d7e1..1ddcd6e 100644 --- a/content/docs/latest/language/languagemanual-ddl.md +++ b/content/docs/latest/language/languagemanual-ddl.md @@ -2298,15 +2298,3 @@ For information about DDL in HCatalog and WebHCat, see: * [HCatalog DDL]({{< ref "#hcatalog-ddl" >}}) in the [HCatalog manual]({{< ref "hcatalog-base" >}}) * [WebHCat DDL Resources]({{< ref "webhcat-reference-allddl" >}}) in the [WebHCat manual]({{< ref "webhcat-base" >}}) - - -Save - -Save - - - - - - - diff --git a/content/docs/latest/language/languagemanual-types.md b/content/docs/latest/language/languagemanual-types.md index b92424d..568efaf 100644 --- a/content/docs/latest/language/languagemanual-types.md +++ b/content/docs/latest/language/languagemanual-types.md @@ -411,11 +411,3 @@ When [hive.metastore.disallow.incompatible.col.type.changes]({{< ref "#hive-meta | date to | false | false | false | false | false | false | false | false | false | true | true | false | true | false | | binary to | false | false | false | false | false | false | false | false | false | false | false | false | false | true | - - -Save - - - - - diff --git a/content/docs/latest/language/reflectudf.md b/content/docs/latest/language/reflectudf.md index 428075e..6341294 100644 --- a/content/docs/latest/language/reflectudf.md +++ b/content/docs/latest/language/reflectudf.md @@ -29,17 +29,3 @@ As of Hive 0.9.0, java_method() is a synonym for reflect(). See [Misc. Functions Note that Reflect UDF is non-deterministic since there is no guarantee what a specific method will return given the same parameters. So be cautious when using Reflect on the WHERE clause because that may invalidate Predicate Pushdown optimization. 
-## Comments: - -| | -| --- | -| -This doc comes from the Hive xdocs, with minor edits. It is included here because the xdocs are currently unavailable (Feb. 2013). - - Posted by leftyl at Feb 21, 2013 09:30 - | - - - - - diff --git a/content/docs/latest/language/supported-features.md b/content/docs/latest/language/supported-features.md index abe0277..13466c2 100644 --- a/content/docs/latest/language/supported-features.md +++ b/content/docs/latest/language/supported-features.md @@ -245,44 +245,3 @@ This table covers all mandatory features from [SQL:2016](https://en.wikipedia.o | T624 | Common logarithm functions | Yes | Optional | | | T631 | IN predicate with one list element | Yes | Mandatory | -## Comments: - -| | | | | | | | -| --- | --- | --- | --- | --- | --- | --- | -| -[Alan Gates](https://cwiki.apache.org/confluence/display/~alangates) Following features are supported in 3.1: - -| E061-09 | Subqueries in comparison predicate | - -| E141-06 | CHECK constraints | No | Mandatory | - - Posted by vgarg at Nov 29, 2018 19:18 - | -| -*No need to declare NOT NULL with PRIMARY KEY or UNIQUE* - I think this is not true. NOT NULL is not inferred on UNIQUE and needs to be explicitly declared. - - Posted by vgarg at Nov 29, 2018 19:20 - | -| - -| E121-02 | ORDER BY columns need not be in select list | No | Mandatory | - - Looks like this feature is partially supported. Hive allows this if there is not aggregate. - - Posted by vgarg at Nov 29, 2018 19:26 - | -| -IIUC the requirement isn't that you don't need to declare not null and it is inferred, but rather that it can support unique/pk indices with nulls in them. - - Posted by alangates at Nov 29, 2018 20:57 - | -| -Agreed, I missed this one. Feel free to edit it. I'll be circling back on this and a few others shortly to fix it. 
- - Posted by alangates at Nov 29, 2018 20:57 - | - - - - - diff --git a/content/docs/latest/user/Hive-Transactions-ACID.md b/content/docs/latest/user/Hive-Transactions-ACID.md index eee624f..f56ee81 100644 --- a/content/docs/latest/user/Hive-Transactions-ACID.md +++ b/content/docs/latest/user/Hive-Transactions-ACID.md @@ -279,13 +279,3 @@ DataWorks Summit 2018, San Jose, CA, USA - Covers Hive 3 and ACID V2 features * [Slides](https://www.slideshare.net/Hadoop_Summit/transactional-operations-in-apache-hive-present-and-future-102803358) * [Video](https://www.youtube.com/watch?v=GyzU9wG0cFQ&t=834s) - - -Save - -Save - - - - - diff --git a/content/docs/latest/user/configuration-properties.md b/content/docs/latest/user/configuration-properties.md index bf23ecb..e0f4155 100644 --- a/content/docs/latest/user/configuration-properties.md +++ b/content/docs/latest/user/configuration-properties.md @@ -5083,91 +5083,3 @@ Jobs submitted to HCatalog can specify configuration properties that affect stor For WebHCat configuration, see [Configuration Variables]({{< ref "#configuration-variables" >}}) in the WebHCat manual. 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Save - -Save - -Save - -Save - -Save - -Save - -Save - -Save - -Save - -Save - -Save - -Save - -Save - -Save - -Save - -Save - -Save - -Save - -Save - -Save - -Save - -Save - - - - - - - diff --git a/content/docs/latest/user/hive-transactions.md b/content/docs/latest/user/hive-transactions.md index 59363b2..7f9b08f 100644 --- a/content/docs/latest/user/hive-transactions.md +++ b/content/docs/latest/user/hive-transactions.md @@ -260,13 +260,3 @@ DataWorks Summit 2018, San Jose, CA, USA - Covers Hive 3 and ACID V2 features * [Slides](https://www.slideshare.net/Hadoop_Summit/transactional-operations-in-apache-hive-present-and-future-102803358) * [Video](https://www.youtube.com/watch?v=GyzU9wG0cFQ&t=834s) - - -Save - -Save - - - - - diff --git a/content/docs/latest/user/hiveserver2-clients.md b/content/docs/latest/user/hiveserver2-clients.md index 606eb2a..274661c 100644 --- a/content/docs/latest/user/hiveserver2-clients.md +++ b/content/docs/latest/user/hiveserver2-clients.md @@ -1061,19 +1061,3 @@ JDBC connection URL: When the above URL is specified, Beeline will call underlying requests to add HTTP cookie in the request header, and will set it to *<name1>*=*<value1>* and *<name2>*=*<value2>*. - - - - - - - - - - -Save - - - - - diff --git a/content/docs/latest/user/multidelimitserde.md b/content/docs/latest/user/multidelimitserde.md index 1cba29f..01f5152 100644 --- a/content/docs/latest/user/multidelimitserde.md +++ b/content/docs/latest/user/multidelimitserde.md @@ -38,24 +38,3 @@ where field.delim is the field delimiter, collection.delim and mapkey.delim * Nested complex type is not supported, e.g. an Array<Array>. * To use MultiDelimitSerDe prior to Hive release 4.0.0, you have to add the hive-contrib jar to the class path, e.g. with the add jar command. 
- - -## Comments: - -| | -| --- | -| -Thank you [Lefty Leverenz](https://cwiki.apache.org/confluence/display/~leftyl) - - Posted by afan at Oct 05, 2018 06:18 - | -| -And thanks for your contributions [Alice Fan](https://cwiki.apache.org/confluence/display/~afan). - - Posted by leftyl at Oct 05, 2018 06:27 - | - - - - - diff --git a/content/docs/latest/user/serde.md b/content/docs/latest/user/serde.md index b1d90c4..0b3b079 100644 --- a/content/docs/latest/user/serde.md +++ b/content/docs/latest/user/serde.md @@ -68,55 +68,3 @@ In short, Hive will automatically convert objects so that Integer will be conver Between map and reduce, Hive uses LazyBinarySerDe and BinarySortableSerDe 's serialize methods. SerDe can serialize an object that is created by another serde, using ObjectInspector. -## Comments: - -| | -| --- | -| -I noticed that there are '!'s in the text, but didn't figure out why. - - Posted by xuefu at Feb 22, 2014 20:17 - | -| -The exclamation marks also appear in two sections of the Developer Guide:* [Hive SerDe]({{< ref "#hive-serde" >}}) -* [ObjectInspector]({{< ref "#objectinspector" >}}) - -I asked about them in a comment on [HIVE-5380](https://issues.apache.org/jira/browse/HIVE-5380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895544#comment-13895544). If they aren't escape characters, could they be leftovers from a previous formatting style? - - Posted by leftyl at Feb 23, 2014 08:47 - | -| -Yes, they are artifacts of the old MoinMoin Wiki syntax and can be removed. - - Posted by larsfrancke at Feb 23, 2014 09:09 - | -| -And they're gone, gone, solid gone. Thanks Lars. - - Posted by leftyl at Feb 25, 2014 09:19 - | -| -[Lefty Leverenz](https://cwiki.apache.org/confluence/display/~leftyl) I added JsonSerDe to the list of built-in serdes and created new page for Json Serde. Can you review it? - - Posted by apivovarov at Dec 15, 2015 01:43 - | -| -Great! 
Thanks [Alexander Pivovarov](https://cwiki.apache.org/confluence/display/~apivovarov), I'll just make a few minor edits. - - Posted by leftyl at Jan 06, 2016 03:17 - | -| -[Alexander Pivovarov](https://cwiki.apache.org/confluence/display/~apivovarov), in the Json SerDe doc you have a code box with the title "Create table, specify CSV properties" but I don't see anything about CSV in the code – should it be "Create table, specify JsonSerDe" instead? - - Posted by leftyl at Jan 07, 2016 08:31 - | -| -[Alexander Pivovarov](https://cwiki.apache.org/confluence/display/~apivovarov), pinging about "CSV" in the Json SerDe doc's code box (see my reply to your comment on the SerDe doc). - - Posted by leftyl at Mar 19, 2016 08:33 - | - - - - -