http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_authorization.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_authorization.xml b/docs/topics/impala_authorization.xml
index 12cdb98..4b009ce 100644
--- a/docs/topics/impala_authorization.xml
+++ b/docs/topics/impala_authorization.xml
@@ -3,7 +3,18 @@
 <concept rev="1.1" id="authorization">
 <title>Enabling Sentry Authorization for Impala</title>
-
+ <prolog>
+ <metadata>
+ <data name="Category" value="Security"/>
+ <data name="Category" value="Sentry"/>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="Configuring"/>
+ <data name="Category" value="Starting and Stopping"/>
+ <data name="Category" value="Users"/>
+ <data name="Category" value="Groups"/>
+ <data name="Category" value="Administrators"/>
+ </metadata>
+ </prolog>
 <conbody id="sentry">
@@ -17,7 +28,1591 @@
 other client program, and associates various privileges with each user.
 </p>
-
+ <note>
+ Sentry is typically used in conjunction with Kerberos authentication, which defines which hosts are allowed
+ to connect to each server. Using the combination of Sentry and Kerberos prevents malicious users from being
+ able to connect by creating a named account on an untrusted machine. See
+ <xref href="impala_kerberos.xml#kerberos"/> for details about Kerberos authentication.
+ </note>
+
+ <p audience="PDF" outputclass="toc inpage">
+ See the following sections for details about using the Impala authorization features:
+ </p>
+ </conbody>
+
+ <concept id="sentry_priv_model">
+
+ <title>The Sentry Privilege Model</title>
+
+ <conbody>
+
+ <p>
+ Privileges can be granted on different objects in the schema. Any privilege that can be granted is
+ associated with a level in the object hierarchy. If a privilege is granted on a container object in the
+ hierarchy, the child object automatically inherits it. This is the same privilege model as Hive and other
+ database systems such as MySQL.
+ </p>
+
+ <p rev="2.3.0 collevelauth">
+ The object hierarchy for Impala covers Server, URI, Database, Table, and Column. (The Table privileges apply to views as well;
+ anywhere you specify a table name, you can specify a view name instead.)
+ Column-level authorization is available in CDH 5.5 / Impala 2.3 and higher, as described in <xref href="sg_hive_sql.xml#concept_c2q_4qx_p4/col_level_auth_sentry"/>;
+ previously, you constructed views to query specific columns and assigned privileges
+ based on the views rather than the base tables.
+ </p>
+
+ <p>
+ A restricted set of privileges determines what you can do with each object:
+ </p>
+
+ <dl>
+ <dlentry id="select_priv">
+
+ <dt>
+ SELECT privilege
+ </dt>
+
+ <dd>
+ Lets you read data from a table or view, for example with the <codeph>SELECT</codeph> statement, the
+ <codeph>INSERT...SELECT</codeph> syntax, or <codeph>CREATE TABLE...LIKE</codeph>. Also required to
+ issue the <codeph>DESCRIBE</codeph> statement or the <codeph>EXPLAIN</codeph> statement for a query
+ against a particular table. Only objects for which a user has this privilege are shown in the output
+ for <codeph>SHOW DATABASES</codeph> and <codeph>SHOW TABLES</codeph> statements. The
+ <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> statements only access
+ metadata for tables for which the user has this privilege.
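+ For example, a policy file rule along the following lines (the <codeph>read_only</codeph> role name and
+ the database and table names are illustrative) grants this privilege on a single table:
+<codeblock>read_only = server=server1->db=sales->table=orders->action=SELECT
+</codeblock>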
+ </dd>
+
+ </dlentry>
+
+ <dlentry id="insert_priv">
+
+ <dt>
+ INSERT privilege
+ </dt>
+
+ <dd>
+ Lets you write data to a table. Applies to the <codeph>INSERT</codeph> and <codeph>LOAD DATA</codeph>
+ statements.
+ </dd>
+
+ </dlentry>
+
+ <dlentry id="all_priv">
+
+ <dt>
+ ALL privilege
+ </dt>
+
+ <dd>
+ Lets you create or modify the object. Required to run DDL statements such as <codeph>CREATE
+ TABLE</codeph>, <codeph>ALTER TABLE</codeph>, or <codeph>DROP TABLE</codeph> for a table,
+ <codeph>CREATE DATABASE</codeph> or <codeph>DROP DATABASE</codeph> for a database, or <codeph>CREATE
+ VIEW</codeph>, <codeph>ALTER VIEW</codeph>, or <codeph>DROP VIEW</codeph> for a view. Also required for
+ the URI of the <q>location</q> parameter for the <codeph>CREATE EXTERNAL TABLE</codeph> and
+ <codeph>LOAD DATA</codeph> statements.
+<!-- Have to think about the best wording, how often to repeat, how best to conref this caveat.
+ You do not actually code the keyword <codeph>ALL</codeph> in the policy file; instead you use
+ <codeph>action=*</codeph> or shorten the right-hand portion of the rule.
+ -->
+ </dd>
+
+ </dlentry>
+ </dl>
+
+ <p>
+ Privileges can be specified for a table or view before that object actually exists. If you do not have
+ sufficient privilege to perform an operation, the error message does not disclose whether the object
+ exists.
+ </p>
+
+ <p>
+ Originally, privileges were encoded in a policy file, stored in HDFS. This mode of operation is still an
+ option, but the emphasis of privilege management is moving towards being SQL-based. Impala can make use of
+ privileges assigned through <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements issued through
+ Hive, and in Impala 2.0 and higher through Impala itself. The mode of operation with
+ <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements instead of
+ the policy file requires that a special Sentry service be enabled; this service stores, retrieves, and
+ manipulates privilege information stored inside the metastore database.
+ </p>
+ </conbody>
+ </concept>
+
+ <concept id="secure_startup">
+
+ <title>Starting the impalad Daemon with Sentry Authorization Enabled</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Starting and Stopping"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ To run the <cmdname>impalad</cmdname> daemon with authorization enabled, you add one or more options to the
+ <codeph>IMPALA_SERVER_ARGS</codeph> declaration in the <filepath>/etc/default/impala</filepath>
+ configuration file:
+ </p>
+
+ <ul>
+ <li>
+ The <codeph>-server_name</codeph> option turns on Sentry authorization for Impala. The authorization
+ rules refer to a symbolic server name, and you specify the name to use as the argument to the
+ <codeph>-server_name</codeph> option.
+ </li>
+
+ <li rev="1.4.0">
+ If you specify just <codeph>-server_name</codeph>, Impala uses the Sentry service for authorization,
+ relying on the results of <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements issued through
+ Hive. (This mode of operation is available in Impala 1.4.0 and higher.) Prior to Impala 1.4.0, or if you
+ want to continue storing privilege rules in the policy file, also specify the
+ <codeph>-authorization_policy_file</codeph> option as in the following item.
+ </li>
+
+ <li>
+ Specifying the <codeph>-authorization_policy_file</codeph> option in addition to
+ <codeph>-server_name</codeph> makes Impala read privilege information from a policy file, rather than
+ from the metastore database. The argument to the <codeph>-authorization_policy_file</codeph> option
+ specifies the HDFS path to the policy file that defines the privileges on different schema objects.
+ </li>
+ </ul>
+
+ <p rev="1.4.0">
+ For example, you might adapt your <filepath>/etc/default/impala</filepath> configuration to contain lines
+ like the following. To use the Sentry service rather than the policy file:
+ </p>
+
+<codeblock rev="1.4.0">IMPALA_SERVER_ARGS=" \
+-server_name=server1 \
+...
+</codeblock>
+
+ <p>
+ Or to use the policy file, as in releases prior to Impala 1.4:
+ </p>
+
+<codeblock>IMPALA_SERVER_ARGS=" \
+-authorization_policy_file=/user/hive/warehouse/auth-policy.ini \
+-server_name=server1 \
+...
+</codeblock>
+
+ <p>
+ The preceding examples set up a symbolic name of <codeph>server1</codeph> to refer to the current instance
+ of Impala. This symbolic name is used in the following ways:
+ </p>
+
+ <ul>
+ <li>
+ <p>
+ In an environment managed by Cloudera Manager, the server name is specified through
+ <menucascade><uicontrol>Impala (Service-Wide)</uicontrol><uicontrol>Category</uicontrol><uicontrol>Advanced</uicontrol><uicontrol>Sentry Service</uicontrol></menucascade> and
+ <menucascade><uicontrol>Hive</uicontrol><uicontrol>Service-Wide</uicontrol><uicontrol>Advanced</uicontrol><uicontrol>Sentry Service</uicontrol></menucascade>.
+ The values must be the same for both, so that Impala and Hive can share the privilege rules. Restart
+ the Impala and Hive services after setting or changing this value.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ In an environment not managed by Cloudera Manager, you specify this value for the
+ <codeph>sentry.hive.server</codeph> property in the <filepath>sentry-site.xml</filepath> configuration
+ file for Hive, as well as in the <codeph>-server_name</codeph> option for <cmdname>impalad</cmdname>.
+ </p>
+ <p>
+ If the <cmdname>impalad</cmdname> daemon is not already running, start it as described in
+ <xref href="impala_processes.xml#processes"/>. If it is already running, restart it with the command
+ <codeph>sudo /etc/init.d/impala-server restart</codeph>. Run the appropriate commands on all the nodes
+ where <cmdname>impalad</cmdname> normally runs.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ If you use the policy file mode of operation, the rules in the <codeph>[roles]</codeph>
+ section of the policy file refer to this same <codeph>server1</codeph> name. For example, the following
+ rule sets up a role <codeph>report_generator</codeph> that lets users with that role query any table in
+ a database named <codeph>reporting_db</codeph> on a node where the <cmdname>impalad</cmdname> daemon
+ was started with the <codeph>-server_name=server1</codeph> option:
+ </p>
+<codeblock>[roles]
+report_generator = server=server1->db=reporting_db->table=*->action=SELECT
+</codeblock>
+ </li>
+ </ul>
+
+ <p>
+ When <cmdname>impalad</cmdname> is started with one or both of the <codeph>-server_name=server1</codeph>
+ and <codeph>-authorization_policy_file</codeph> options, Impala authorization is enabled. If Impala detects
+ any errors or inconsistencies in the authorization settings or the policy file, the daemon refuses to
+ start.
+ </p>
+ </conbody>
+ </concept>
+
+ <concept id="sentry_service">
+
+ <title>Using Impala with the Sentry Service (CDH 5.1 or higher only)</title>
+
+ <conbody>
+
+ <p>
+ When you use the Sentry service rather than the policy file, you set up privileges through
+ <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements in either Impala or Hive, and then both components
+ use those same privileges automatically. (Impala added the <codeph>GRANT</codeph> and
+ <codeph>REVOKE</codeph> statements in Impala 2.0.0 / CDH 5.2.0.)
+ </p>
+
+ <p>
+ Hive already had <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements prior to CDH 5.1, but those
+ statements were not production-ready. CDH 5.1 is the first release where those statements use the Sentry
+ framework and are considered GA level. If you used the Hive <codeph>GRANT</codeph> and
+ <codeph>REVOKE</codeph> statements prior to CDH 5.1, you must set up these privileges with the CDH 5.1
+ versions of <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> to take advantage of Sentry authorization.
+ </p>
+
+ <p>
+ For information about using the updated Hive <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements,
+ see the
+<!-- Original URL: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Security-Guide/cdh_sg_sentry_service.html -->
+ <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_sg_sentry_service.html" scope="external" format="html">Sentry
+ service</xref> topic in the <cite>CDH 5 Security Guide</cite>.
+ </p>
+ </conbody>
+ </concept>
+
+ <concept id="security_policy_file">
+
+ <title>Using Impala with the Sentry Policy File</title>
+
+ <conbody>
+
+ <p>
+ The policy file is a text file that you put in a designated location in HDFS; it is read during the startup of
+ the <cmdname>impalad</cmdname> daemon when you specify both the <codeph>-server_name</codeph> and
+ <codeph>-authorization_policy_file</codeph> startup options. It controls which objects (databases, tables,
+ and HDFS directory paths) can be accessed by the user who connects to <cmdname>impalad</cmdname>, and what
+ operations that user can perform on the objects.
+ </p>
+
+ <note rev="1.4.0">
+ <p rev="1.4.0">
+ In CDH 5 and higher, Cloudera recommends
+ managing privileges through SQL statements, as described in
+ <xref href="impala_authorization.xml#sentry_service"/>. If you are still using policy files, plan to
+ migrate to the new approach some time in the future.
+ </p>
+ </note>
+
+ <p>
+ The location of the policy file is listed in the <filepath>auth-site.xml</filepath> configuration file. To
+ minimize overhead, the security information from this file is cached by each <cmdname>impalad</cmdname>
+ daemon and refreshed automatically, with a default interval of 5 minutes. After making a substantial change
+ to security policies, restart all Impala daemons to pick up the changes immediately.
+ </p>
+
+ <p outputclass="toc inpage"/>
+ </conbody>
+
+ <concept id="security_policy_file_details">
+
+ <title>Policy File Location and Format</title>
+
+ <conbody>
+
+ <p>
+ The policy file uses the familiar <codeph>.ini</codeph> format, divided into the major sections
+ <codeph>[groups]</codeph> and <codeph>[roles]</codeph>. There is also an optional
+ <codeph>[databases]</codeph> section, which lets you designate a separate policy file for a particular
+ database, as explained in <xref href="#security_multiple_policy_files"/>.
+ Another optional section,
+ <codeph>[users]</codeph>, allows you to override the OS-level mapping of users to groups; that is an
+ advanced technique primarily for testing and debugging, and is beyond the scope of this document.
+ </p>
+
+ <p>
+ In the <codeph>[groups]</codeph> section, you define various categories of users and select which roles
+ are associated with each category. The group and usernames correspond to Linux groups and users on the
+ server where the <cmdname>impalad</cmdname> daemon runs.
+ </p>
+
+ <p>
+ When you access Impala through the
+ <cmdname>impala-shell</cmdname> interpreter, for purposes of authorization, the user is the logged-in Linux
+ user and the groups are the Linux groups that user is a member of. When you access Impala through the
+ ODBC or JDBC interfaces, the user and password specified through the connection string are used as login
+ credentials for the Linux server, and authorization is based on that username and the associated Linux
+ group membership.
+ </p>
+
+ <p>
+ In the <codeph>[roles]</codeph> section, you define a set of roles. For each role, you specify precisely
+ which privileges are available: that is, which objects users with that role can access, and what operations
+ they can perform on those objects. This is the lowest-level category of security information; the other
+ sections in the policy file map the privileges to higher-level divisions of groups and users.
+ The privileges are specified using patterns like:
+<codeblock>server=<varname>server_name</varname>->db=<varname>database_name</varname>->table=<varname>table_name</varname>->action=SELECT
+server=<varname>server_name</varname>->db=<varname>database_name</varname>->table=<varname>table_name</varname>->action=CREATE
+server=<varname>server_name</varname>->db=<varname>database_name</varname>->table=<varname>table_name</varname>->action=ALL
+</codeblock>
+ For the <varname>server_name</varname> value, substitute the same symbolic name you specify with the
+ <cmdname>impalad</cmdname> <codeph>-server_name</codeph> option. You can use <codeph>*</codeph> wildcard
+ characters at each level of the privilege specification to allow access to all such objects. For example:
+<codeblock>server=impala-host.example.com->db=default->table=t1->action=SELECT
+server=impala-host.example.com->db=*->table=*->action=CREATE
+server=impala-host.example.com->db=*->table=audit_log->action=SELECT
+server=impala-host.example.com->db=default->table=t1->action=*
+</codeblock>
+ </p>
+
+ <p>
+ When authorization is enabled, Impala uses the policy file as a <i>whitelist</i>, representing every
+ privilege available to any user on any object. That is, only operations specified for the appropriate
+ combination of object, role, group, and user are allowed; all other operations are not allowed. If a
+ group or role is defined multiple times in the policy file, the last definition takes precedence.
+ </p>
+
+ <p>
+ To understand the notion of whitelisting, set up a minimal policy file that does not provide any
+ privileges for any object.
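+ For example, a minimal policy file might contain only the section headers, or define a role that is not
+ assigned to any group (the <codeph>no_access</codeph> role name here is illustrative; because no group
+ maps to it, it grants nothing):
+<codeblock>[groups]
+
+[roles]
+no_access = server=server1->db=no_such_db
+</codeblock>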
+ When you connect to an Impala node where this policy file is in effect, you
+ get no results for <codeph>SHOW DATABASES</codeph>, and an error when you issue any <codeph>SHOW
+ TABLES</codeph>, <codeph>USE <varname>database_name</varname></codeph>, <codeph>DESCRIBE
+ <varname>table_name</varname></codeph>, <codeph>SELECT</codeph>, or other statements that expect to
+ access databases or tables, even if the corresponding databases and tables exist.
+ </p>
+
+ <p>
+ The contents of the policy file are cached, to avoid a performance penalty for each query. The policy
+ file is re-checked by each <cmdname>impalad</cmdname> node every 5 minutes. When you make a
+ non-time-sensitive change such as adding new privileges or new users, you can let the change take effect
+ automatically a few minutes later. If you remove or reduce privileges, and want the change to take effect
+ immediately, restart the <cmdname>impalad</cmdname> daemon on all nodes, again specifying the
+ <codeph>-server_name</codeph> and <codeph>-authorization_policy_file</codeph> options so that the rules
+ from the updated policy file are applied.
+ </p>
+ </conbody>
+ </concept>
+
+ <concept id="security_examples">
+
+ <title>Examples of Policy File Rules for Security Scenarios</title>
+
+ <conbody>
+
+ <p>
+ The following examples show rules that might go in the policy file to deal with various
+ authorization-related scenarios. For illustration purposes, this section shows several very small policy
+ files with only a few rules each. In your environment, typically you would define many roles to cover all
+ the scenarios involving your own databases, tables, and applications, and a smaller number of groups,
+ whose members are given the privileges from one or more roles.
+ </p>
+
+ <example id="sec_ex_unprivileged">
+
+ <title>A User with No Privileges</title>
+
+ <p>
+ If a user has no privileges at all, that user cannot access any schema objects in the system. The error
+ messages do not disclose the names or existence of objects that the user is not authorized to read.
+ </p>
+
+ <p>
+<!-- This example demonstrates the lack of privileges using a blank policy file, so no users have any privileges. -->
+ This is the experience you want a user to have if they somehow log into a system where they are not an
+ authorized Impala user. In a real deployment with a filled-in policy file, a user might have no
+ privileges because they are not a member of any of the relevant groups mentioned in the policy file.
+ </p>
+
+<!-- Have the raw material but not formatted into easily digestible example. Do for first 1.1 doc refresh.
+<codeblock></codeblock> -->
+
+ </example>
+
+ <example id="sec_ex_superuser">
+
+ <title>Examples of Privileges for Administrative Users</title>
+
+ <p>
+ When an administrative user has broad access to tables or databases, the associated rules in the
+ <codeph>[roles]</codeph> section typically use wildcards and/or inheritance. For example, in the
+ following sample policy file, <codeph>db=*</codeph> refers to all databases and
+ <codeph>db=*->table=*</codeph> refers to all tables in all databases.
+ </p>
+
+ <p>
+ Omitting the rightmost portion of a rule means that the privileges apply to all the objects that could
+ be specified there. For example, in the following sample policy file, the
+ <codeph>all_databases</codeph> role has all privileges for all tables in all databases, while the
+ <codeph>one_database</codeph> role has all privileges for all tables in one specific database.
The + <codeph>all_databases</codeph> role does not grant privileges on URIs, so a group with that role could + not issue a <codeph>CREATE TABLE</codeph> statement with a <codeph>LOCATION</codeph> clause. The + <codeph>entire_server</codeph> role has all privileges on both databases and URIs within the server. + </p> + +<codeblock>[groups] +supergroup = all_databases + +[roles] +read_all_tables = server=server1->db=*->table=*->action=SELECT +all_tables = server=server1->db=*->table=* +all_databases = server=server1->db=* +one_database = server=server1->db=test_db +entire_server = server=server1 +</codeblock> + + </example> + + <example id="sec_ex_detailed"> + + <title>A User with Privileges for Specific Databases and Tables</title> + + <p> + If a user has privileges for specific tables in specific databases, the user can access those things + but nothing else. They can see the tables and their parent databases in the output of <codeph>SHOW + TABLES</codeph> and <codeph>SHOW DATABASES</codeph>, <codeph>USE</codeph> the appropriate databases, + and perform the relevant actions (<codeph>SELECT</codeph> and/or <codeph>INSERT</codeph>) based on the + table privileges. To actually create a table requires the <codeph>ALL</codeph> privilege at the + database level, so you might define separate roles for the user that sets up a schema and other users + or applications that perform day-to-day operations on the tables. + </p> + + <p> + The following sample policy file shows some of the syntax that is appropriate as the policy file grows, + such as the <codeph>#</codeph> comment syntax, <codeph>\</codeph> continuation syntax, and comma + separation for roles assigned to groups or privileges assigned to roles. + </p> + +<codeblock>[groups] +cloudera = training_sysadmin, instructor +visitor = student + +[roles] +training_sysadmin = server=server1->db=training, \ +server=server1->db=instructor_private, \ +server=server1->db=lesson_development +instructor = server=server1->db=training->table=*->action=*, \ +server=server1->db=instructor_private->table=*->action=*, \ +server=server1->db=lesson_development->table=lesson* +# This particular course is all about queries, so the students can SELECT but not INSERT or CREATE/DROP. +student = server=server1->db=training->table=lesson_*->action=SELECT +</codeblock> + + </example> + +<!-- +<example id="sec_ex_superuser_single_db"> +<title>A User with Full Privileges for a Specific Database</title> +<p> + +</p> +<codeblock></codeblock> +</example> + +<example id="sec_ex_readonly_single_db"> +<title>A User with Read-Only Privileges for a Specific Database</title> +<p> + +</p> +<codeblock></codeblock> + <p> + If a user has <codeph>SELECT</codeph> privilege for a database, they can issue a <codeph>USE</codeph> statement + for that database. Whether or not they can access tables within the database depends on further privileges + defined at the table level. + </p> + +<codeblock></codeblock> + +</example> + +<example id="sec_ex_superuser_single_table"> +<title>A User with Full Privileges for a Specific Table</title> + <p> + If a user has <codeph>SELECT</codeph> privilege for a table, they can query, describe, or explain queries for + that table. + </p> + +<codeblock></codeblock> +</example> + +<example id="sec_ex_load_data"> +<title>A User with Privileges to Load Data but not Read Data</title> + + <p> + If a user has <codeph>INSERT</codeph> privilege for a table, they can write to the table if it already exists. 
+ They cannot create or alter the table; those operations require the <codeph>ALL</codeph> privilege.
+ </p>
+<codeblock></codeblock>
+</example>
+-->
+
+ <example id="sec_ex_external_files">
+
+ <title>Privileges for Working with External Data Files</title>
+
+ <p>
+ When data is being inserted through the <codeph>LOAD DATA</codeph> statement, or is referenced from an
+ HDFS location outside the normal Impala database directories, the user also needs appropriate
+ permissions on the URIs corresponding to those HDFS locations.
+ </p>
+
+ <p>
+ In this sample policy file:
+ </p>
+
+ <ul>
+ <li>
+ The <codeph>external_table</codeph> role lets us insert into and query the Impala table
+ <codeph>external_table.sample</codeph>.
+ </li>
+
+ <li>
+ The <codeph>staging_dir</codeph> role lets us specify the HDFS path
+ <filepath>/user/cloudera/external_data</filepath> with the <codeph>LOAD DATA</codeph> statement.
+ Remember, when Impala queries or loads data files, it operates on all the files in that directory,
+ not just a single file, so any Impala <codeph>LOCATION</codeph> parameters refer to a directory
+ rather than an individual file.
+ </li>
+
+ <li>
+ We included the IP address and port of the Hadoop NameNode in the HDFS URI of the
+ <codeph>staging_dir</codeph> rule. We found those details in
+ <filepath>/etc/hadoop/conf/core-site.xml</filepath>, under the <codeph>fs.default.name</codeph>
+ element. That is what we use in any roles that specify URIs (that is, the locations of directories in
+ HDFS).
+ </li>
+
+ <li>
+ We start this example after the table <codeph>external_table.sample</codeph> is already created. In
+ the policy file for the example, we have already taken away the <codeph>external_table_admin</codeph>
+ role from the <codeph>cloudera</codeph> group, and replaced it with the lesser-privileged
+ <codeph>external_table</codeph> role.
+ </li>
+
+ <li>
+ We assign privileges to a subdirectory underneath <filepath>/user/cloudera</filepath> in HDFS,
+ because such privileges also apply to any subdirectories underneath. If we had assigned privileges to
+ the parent directory <filepath>/user/cloudera</filepath>, it would be too easy to affect other
+ files accidentally by specifying a wrong location.
+ </li>
+
+ <li>
+ The <codeph>cloudera</codeph> entry under the <codeph>[groups]</codeph> section refers to the
+ <codeph>cloudera</codeph> group. (In the demo VM used for this example, there is a
+ <codeph>cloudera</codeph> user that is a member of a <codeph>cloudera</codeph> group.)
+ </li>
+ </ul>
+
+ <p>
+ Policy file:
+ </p>
+
+<codeblock>[groups]
+cloudera = external_table, staging_dir
+
+[roles]
+external_table_admin = server=server1->db=external_table
+external_table = server=server1->db=external_table->table=sample->action=*
+staging_dir = server=server1->uri=hdfs://127.0.0.1:8020/user/cloudera/external_data->action=*
+</codeblock>
+
+ <p>
+ <cmdname>impala-shell</cmdname> session:
+ </p>
+
+<codeblock>[localhost:21000] > use external_table;
+Query: use external_table
+[localhost:21000] > show tables;
+Query: show tables
+Query finished, fetching results ...
++--------+
+| name |
++--------+
+| sample |
++--------+
+Returned 1 row(s) in 0.02s
+
+[localhost:21000] > select * from sample;
+Query: select * from sample
+Query finished, fetching results ...
++-----+
+| x |
++-----+
+| 1 |
+| 5 |
+| 150 |
++-----+
+Returned 3 row(s) in 1.04s
+
+[localhost:21000] > load data inpath '/user/cloudera/external_data' into table sample;
+Query: load data inpath '/user/cloudera/external_data' into table sample
+Query finished, fetching results ...
++----------------------------------------------------------+
+| summary |
++----------------------------------------------------------+
+| Loaded 1 file(s). Total files in destination location: 2 |
++----------------------------------------------------------+
+Returned 1 row(s) in 0.26s
+[localhost:21000] > select * from sample;
+Query: select * from sample
+Query finished, fetching results ...
++-------+
+| x |
++-------+
+| 2 |
+| 4 |
+| 6 |
+| 8 |
+| 64738 |
+| 49152 |
+| 1 |
+| 5 |
+| 150 |
++-------+
+Returned 9 row(s) in 0.22s
+
+[localhost:21000] > load data inpath '/user/cloudera/unauthorized_data' into table sample;
+Query: load data inpath '/user/cloudera/unauthorized_data' into table sample
+ERROR: AuthorizationException: User 'cloudera' does not have privileges to access: hdfs://127.0.0.1:8020/user/cloudera/unauthorized_data
+</codeblock>
+
+ </example>
+
+ <example audience="Cloudera" id="sec_ex_views" rev="2.3.0 collevelauth">
+
+ <title>Controlling Access at the Column Level through Views</title>
+
+ <p>
+ If a user has <codeph>SELECT</codeph> privilege for a view, they can query the view, even if they do
+ not have any privileges on the underlying table. To see the details about the underlying table through
+ <codeph>EXPLAIN</codeph> or <codeph>DESCRIBE FORMATTED</codeph> statements on the view, the user must
+ also have <codeph>SELECT</codeph> privilege for the underlying table.
+ </p>
+
+ <note type="important">
+ <p>
+ The types of data that are considered sensitive and confidential differ depending on the jurisdiction,
+ the type of industry, or both. For fine-grained access controls, set up appropriate privileges based
+ on all applicable laws and regulations.
+ </p>
+ <p>
+ Be careful using the <codeph>ALTER VIEW</codeph> statement to point an existing view at a different
+ base table or a new set of columns that includes sensitive or restricted data. Make sure that any
+ users who have <codeph>SELECT</codeph> privilege on the view do not gain access to any additional
+ information they are not authorized to see.
+ </p>
+ </note>
+
+ <p>
+ The following example shows how a system administrator could set up a table containing some columns
+ with sensitive information, then create a view that only exposes the non-confidential columns.
+ </p>
+
+<codeblock>[localhost:21000] > create table sensitive_info
+ > (
+ > name string,
+ > address string,
+ > credit_card string,
+ > taxpayer_id string
+ > );
+[localhost:21000] > create view name_address_view as select name, address from sensitive_info;
+</codeblock>
+
+ <p>
+ Then the following policy file specifies read-only privilege for that view, without authorizing access
+ to the underlying table:
+ </p>
+
+<codeblock>[groups]
+cloudera = view_only_privs
+
+[roles]
+view_only_privs = server=server1->db=reports->table=name_address_view->action=SELECT
+</codeblock>
+
+ <p>
+ Thus, a user with the <codeph>view_only_privs</codeph> role could use Impala queries to access the
+ basic information but not the sensitive information, even if both kinds of information were part of the
+ same data file.
+ </p>
+
+ <p>
+ You might define other views to allow users from different groups to query different sets of columns.
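+ For example, a second view and role along the following lines (the names are illustrative) could give
+ a billing group access to the payment-related columns of the same table, again without granting any
+ privileges on the underlying table:
+<codeblock>[localhost:21000] > create view billing_view as select name, credit_card from sensitive_info;
+</codeblock>
+<codeblock>[groups]
+billing = billing_only_privs
+
+[roles]
+billing_only_privs = server=server1->db=reports->table=billing_view->action=SELECT
+</codeblock>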
+ </p>
+
+ </example>
+
+ <example id="sec_sysadmin">
+
+ <title>Separating Administrator Responsibility from Read and Write Privileges</title>
+
+ <p>
+ Remember that to create a database requires full privilege on that database, while day-to-day
+ operations on tables within that database can be performed with lower levels of privilege on specific
+ tables. Thus, you might set up separate roles for each database or application: an administrative one
+ that could create or drop the database, and a user-level one that can access only the relevant tables.
+ </p>
+
+ <p>
+ For example, this policy file divides responsibilities among users in three different groups:
+ </p>
+
+ <ul>
+ <li>
+ Members of the <codeph>supergroup</codeph> group have the <codeph>training_sysadmin</codeph> role and
+ so can set up a database named <codeph>training</codeph>.
+ </li>
+
+ <li>
+ Members of the <codeph>cloudera</codeph> group have the <codeph>instructor</codeph> role and so can
+ create, insert into, and query any tables in the <codeph>training</codeph> database, but cannot
+ create or drop the database itself.
+ </li>
+
+ <li>
+ Members of the <codeph>visitor</codeph> group have the <codeph>student</codeph> role and so can query
+ those tables in the <codeph>training</codeph> database.
+ </li>
+ </ul>
+
+<codeblock>[groups]
+supergroup = training_sysadmin
+cloudera = instructor
+visitor = student
+
+[roles]
+training_sysadmin = server=server1->db=training
+instructor = server=server1->db=training->table=*->action=*
+student = server=server1->db=training->table=*->action=SELECT
+</codeblock>
+
+ </example>
+ </conbody>
+ </concept>
+
+ <concept id="security_multiple_policy_files">
+
+ <title>Using Multiple Policy Files for Different Databases</title>
+
+ <conbody>
+
+ <p>
+ For an Impala cluster with many databases being accessed by many users and applications, it might be
+ cumbersome to update the security policy file for each privilege change or each new database, table, or
+ view. You can allow security to be managed separately for individual databases by setting up a separate
+ policy file for each database:
+ </p>
+
+ <ul>
+ <li>
+ Add the optional <codeph>[databases]</codeph> section to the main policy file.
+ </li>
+
+ <li>
+ Add entries in the <codeph>[databases]</codeph> section for each database that has its own policy file.
+ </li>
+
+ <li>
+ For each listed database, specify the HDFS path of the appropriate policy file.
+ </li>
+ </ul>
+
+ <p>
+ For example:
+ </p>
+
+<codeblock>[databases]
+# Defines the location of the per-DB policy files for the 'customers' and 'sales' databases.
+customers = hdfs://ha-nn-uri/etc/access/customers.ini
+sales = hdfs://ha-nn-uri/etc/access/sales.ini
+</codeblock>
+
+ <p>
+ To enable URIs in per-DB policy files, add the following string in the Cloudera Manager field
+ <uicontrol>Impala Service Environment Advanced Configuration Snippet (Safety Valve)</uicontrol>:
+ </p>
+
+<codeblock>JAVA_TOOL_OPTIONS="-Dsentry.allow.uri.db.policyfile=true"
+</codeblock>
+
+ <note type="important">
+ Enabling URIs in per-DB policy files introduces a security risk by allowing the owner of the db-level
+ policy file to grant themselves load privileges to anything the <codeph>impala</codeph> user has
+ read permissions for in HDFS (including data in other databases controlled by different db-level policy
+ files).
+ </note> + </conbody> + </concept> + </concept> + + <concept id="security_schema"> + + <title>Setting Up Schema Objects for a Secure Impala Deployment</title> + + <conbody> + + <p> + Remember that in your role definitions, you specify privileges at the level of individual databases and + tables, or all databases or all tables within a database. To simplify the structure of these rules, plan + ahead of time how to name your schema objects so that data with different authorization requirements is + divided into separate databases. + </p> + + <p> + If you are adding security on top of an existing Impala deployment, remember that you can rename tables or + even move them between databases using the <codeph>ALTER TABLE</codeph> statement. In Impala, creating new + databases is a relatively inexpensive operation, basically just creating a new directory in HDFS. + </p> + + <p> + You can also plan the security scheme and set up the policy file before the actual schema objects named in + the policy file exist. Because the authorization capability is based on whitelisting, a user can only + create a new database or table if the required privilege is already in the policy file: either by listing + the exact name of the object being created, or a <codeph>*</codeph> wildcard to match all the applicable + objects within the appropriate container. + </p> + </conbody> + </concept> + + <concept id="security_privileges"> + + <title>Privilege Model and Object Hierarchy</title> + + <conbody> + + <p> + Privileges can be granted on different objects in the schema. Any privilege that can be granted is + associated with a level in the object hierarchy. If a privilege is granted on a container object in the + hierarchy, the child object automatically inherits it. This is the same privilege model as Hive and other + database systems such as MySQL. + </p> + + <p> + The kinds of objects in the schema hierarchy are: + </p> + +<codeblock>Server +URI +Database + Table +</codeblock> + + <p> + The server name is specified by the <codeph>-server_name</codeph> option when <cmdname>impalad</cmdname> + starts. Specify the same name for all <cmdname>impalad</cmdname> nodes in the cluster. + </p> + + <p> + URIs represent the HDFS paths you specify as part of statements such as <codeph>CREATE EXTERNAL + TABLE</codeph> and <codeph>LOAD DATA</codeph>. Typically, you specify what look like UNIX paths, but these + locations can also be prefixed with <codeph>hdfs://</codeph> to make clear that they are really URIs. To + set privileges for a URI, specify the name of a directory, and the privilege applies to all the files in + that directory and any directories underneath it. + </p> + + <p rev="2.3.0 collevelauth"> + In CDH 5.5 / Impala 2.3 and higher, you can specify privileges for individual columns, + as described in <xref href="sg_hive_sql.xml#concept_c2q_4qx_p4/col_level_auth_sentry"/>. Formerly, to specify + read privileges at this level, you created a view that queried specific columns and/or partitions from a base + table, and gave <codeph>SELECT</codeph> privilege on the view but not the underlying table. + </p> + + <p> + URIs must start with either <codeph>hdfs://</codeph> or <codeph>file://</codeph>. If a URI starts with + anything else, it will cause an exception and the policy file will be invalid. When defining URIs for HDFS, + you must also specify the NameNode. 
For example:
+<codeblock>data_read = server=server1->uri=file:///path/to/dir, \
+server=server1->uri=hdfs://namenode:port/path/to/dir
+</codeblock>
+ <note type="warning">
+ <p>
+ Because the NameNode host and port must be specified, Cloudera strongly recommends you use High
+ Availability (HA). This ensures that the URI will remain constant even if the NameNode changes.
+ </p>
+<codeblock>data_read = server=server1->uri=file:///path/to/dir,\
+server=server1->uri=hdfs://ha-nn-uri/path/to/dir
+</codeblock>
+ </note>
+ </p>
+
+<!-- Experiment with replacing my original copied table with a conref'ed version of Ambreen's from the Security Guide. -->
+
+<!--
+ <table>
+ <title>Sentry privilege types and objects they apply to</title>
+ <tgroup cols="2">
+ <colspec colnum="1" colname="col1"/>
+ <colspec colnum="2" colname="col2"/>
+ <tbody>
+ <row>
+ <entry>Privilege</entry>
+ <entry>Object</entry>
+ </row>
+ <row>
+ <entry>INSERT</entry>
+ <entry>TABLE, URI</entry>
+ </row>
+ <row>
+ <entry>SELECT</entry>
+ <entry>TABLE, VIEW, URI</entry>
+ </row>
+ <row>
+ <entry>ALL</entry>
+ <entry>SERVER, DB, URI</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+-->
+
+ <table conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/table_stm_53p_wk">
+ <tgroup cols="2">
+ <colspec colnum="1" colname="col1" colwidth="1*"/>
+<!-- <colspec colnum="2" colname="col2" colwidth="2*"/> -->
+ <tbody>
+ <row>
+ <entry/>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <note>
+ <p>
+ Although this document refers to the <codeph>ALL</codeph> privilege, if you use the policy file
+ mode you do not use the actual keyword <codeph>ALL</codeph> in the policy file. When you code role
+ entries in the policy file:
+ </p>
+ <ul>
+ <li>
+ To specify the <codeph>ALL</codeph> privilege for a server, use a role like
+ <codeph>server=<varname>server_name</varname></codeph>.
+ </li>
+
+ <li>
+ To specify the <codeph>ALL</codeph> privilege for a database, use a role like
+ <codeph>server=<varname>server_name</varname>->db=<varname>database_name</varname></codeph>.
+ </li>
+
+ <li>
+ To specify the <codeph>ALL</codeph> privilege for a table, use a role like
+ <codeph>server=<varname>server_name</varname>->db=<varname>database_name</varname>->table=<varname>table_name</varname>->action=*</codeph>.
+ </li> + </ul> + </note> + + <draft-comment author="ambreen.kazi" translate="no">The Privilege Model lives at: +https://wiki.cloudera.com/pages/viewpage.action?pageId=24919851</draft-comment> + + <table> + <tgroup cols="4"> + <colspec colnum="1" colname="col1" colwidth="1.31*"/> + <colspec colnum="2" colname="col2" colwidth="1.17*"/> + <colspec colnum="3" colname="col3" colwidth="1*"/> + <colspec colname="newCol4" colnum="4" colwidth="1*"/> + <thead> + <row> + <entry> + Operation + </entry> + <entry> + Scope + </entry> + <entry> + Privileges + </entry> + <entry> + URI + </entry> + </row> + </thead> + <tbody> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/explain_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/load_data_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/create_database_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/drop_database_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/create_table_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/drop_table_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/describe_table_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/alter_table_add_columns_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/alter_table_replace_columns_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/alter_table_change_column_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/alter_table_rename_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/alter_table_set_tblproperties_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/alter_table_set_fileformat_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/alter_table_set_location_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/alter_table_add_partition_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/alter_table_add_partition_location_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/alter_table_drop_partition_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/alter_table_partition_set_fileformat_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/alter_table_set_serdeproperties_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/create_view_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/drop_view_privs"> + <entry/> + </row> + <row id="alter_view_privs"> + <entry> + ALTER VIEW + </entry> + <entry rev="2.3.0 collevelauth"> + You need <codeph>ALL</codeph> privilege on the named view <ph rev="1.4.0">and the parent + database</ph>, plus <codeph>SELECT</codeph> privilege for any tables or views referenced by the + view query. Once the view is created or altered by a high-privileged system administrator, it can + be queried by a lower-privileged user who does not have full query privileges for the base tables. 
+ </entry> + <entry> + ALL, SELECT + </entry> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/alter_table_set_location_privs"> + <entry/> + </row> + <row id="create_external_table_privs"> + <entry> + CREATE EXTERNAL TABLE + </entry> + <entry> + Database (ALL), URI (SELECT) + </entry> + <entry> + ALL, SELECT + </entry> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/select_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/use_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/create_function_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/drop_function_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/refresh_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/invalidate_metadata_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/invalidate_metadata_table_privs"> + <entry/> + </row> + <row conref="cdh_sg_sentry.xml#concept_yk2_j3p_wk/compute_stats_privs"> + <entry/> + </row> + <row id="show_table_stats_privs"> + <entry> + SHOW TABLE STATS, SHOW PARTITIONS + </entry> + <entry> + TABLE + </entry> + <entry> + SELECT/INSERT + </entry> + <entry/> + </row> + <row> + <entry id="show_column_stats_privs"> + SHOW COLUMN STATS + </entry> + <entry> + TABLE + </entry> + <entry> + SELECT/INSERT + </entry> + <entry/> + </row> + <row> + <entry id="show_functions_privs"> + SHOW FUNCTIONS + </entry> + <entry> + DATABASE + </entry> + <entry> + SELECT + </entry> + <entry/> + </row> + <row id="show_tables_privs"> + <entry> + SHOW TABLES + </entry> + <entry/> + <entry> + No special privileges needed to issue the statement, but only shows objects you are authorized for + </entry> + <entry/> + </row> + <row id="show_databases_privs"> + <entry> + SHOW DATABASES, SHOW SCHEMAS + </entry> + <entry/> + <entry> + No special privileges needed to issue the statement, but only shows objects you are authorized for + </entry> + <entry/> + </row> + </tbody> + </tgroup> + </table> + +<!-- Original table, now replaced by content conref'ed from Security Guide +<table> +<title>Sentry privileges for Impala SQL operations</title> + <tgroup cols="3"> + <colspec colnum="1" colname="col1" colwidth="*"/> + <colspec colnum="2" colname="col2" colwidth="*"/> + <colspec colnum="3" colname="col3" colwidth="4*"/> +<thead> +<row> +<entry>Impala SQL Operation</entry> +<entry>Privileges Required</entry> +<entry>Object on which Privileges Required</entry> +</row> +</thead> +<tbody> +<row> +<entry>EXPLAIN</entry> +<entry>SELECT</entry> +<entry>Table</entry> +</row> +<row> +<entry>LOAD DATA</entry> +<entry>INSERT, SELECT</entry> +<entry>Table (INSERT) and URI (ALL); write privilege is required on the URI because the data files are physically moved from there</entry> +</row> +<row> +<entry>CREATE DATABASE</entry> +<entry>ALL</entry> +<entry><ph rev="1.4.0">Server</ph></entry> +</row> +<row> +<entry>DROP DATABASE</entry> +<entry>ALL</entry> +<entry><ph rev="1.4.0">Server</ph></entry> +</row> +<row> +<entry>CREATE TABLE</entry> +<entry>ALL</entry> +<entry><ph rev="1.4.0">Database</ph></entry> +</row> +<row> +<entry>DROP TABLE</entry> +<entry>ALL</entry> +<entry><ph rev="1.4.0">Database</ph></entry> +</row> +<row> +<entry>DESCRIBE TABLE</entry> +<entry>SELECT or INSERT</entry> +<entry>Table</entry> +</row> +<row> +<entry>ALTER TABLE</entry> +<entry>ALL</entry> +<entry><ph rev="1.4.0">Database</ph></entry> +</row> +<row> 
+<entry>SHOW DATABASES</entry> +<entry>Any privilege</entry> +<entry>Any object in the database; only databases for which the user has privileges on some object are shown</entry> +</row> +<row> +<entry>SHOW TABLES</entry> +<entry>Any privilege</entry> +<entry>Any table in the database; only tables for which the user has privileges are shown</entry> +</row> +<row> +<entry>CREATE VIEW</entry> +<entry>ALL, SELECT</entry> +<entry rev="2.3.0 collevelauth"> +You need <codeph>ALL</codeph> privilege on the named view <ph rev="1.4.0">and the parent database</ph>, plus <codeph>SELECT</codeph> privilege for any tables or views referenced by the view query. +Once the view is created or altered by a high-privileged system administrator, +it can be queried by a lower-privileged user who does not have full query privileges for the base tables. +</entry> +</row> +<row> +<entry>DROP VIEW</entry> +<entry>ALL</entry> +<entry><ph rev="1.4.0">Database</ph></entry> +</row> +<row> +<entry>ALTER VIEW</entry> +<entry>ALL, SELECT</entry> +<entry rev="2.3.0 collevelauth"> +You need <codeph>ALL</codeph> privilege on the named view <ph rev="1.4.0">and the parent database</ph>, plus <codeph>SELECT</codeph> privilege for any tables or views referenced by the view query. +Once the view is created or altered by a high-privileged system administrator, +it can be queried by a lower-privileged user who does not have full query privileges for the base tables. +</entry> +</row> +<row> +<entry>ALTER TABLE LOCATION</entry> +<entry>ALL</entry> +<entry>Table, URI</entry> +</row> +<row> +<entry>CREATE EXTERNAL TABLE</entry> +<entry>ALL, SELECT</entry> +<entry>Database (ALL), URI (SELECT)</entry> +</row> +<row> +<entry>SELECT</entry> +<entry>SELECT</entry> +<entry rev="2.3.0 collevelauth">Table, View; you can have SELECT privilege for a view without having SELECT privilege +for the underlying tables. See <xref href="sg_hive_sql.xml#concept_c2q_4qx_p4/col_level_auth_sentry"/> for details +about column-level authorization. +Before column-level authorization was added in CDH 5.5 / Impala 2.3, the way to implement +column-level security was by creating views that referenced particular sets of columns.</entry> +</row> +<row> +<entry>USE</entry> +<entry>Any privilege</entry> +<entry>Any object in the database</entry> +</row> +<row rev="1.2"> +<entry>CREATE FUNCTION</entry> +<entry>ALL</entry> +<entry>Server</entry> +</row> +<row rev="1.2"> +<entry>DROP FUNCTION</entry> +<entry>ALL</entry> +<entry>Server</entry> +</row> +<row> +<entry>REFRESH</entry> +<entry>ALL</entry> +<entry>Table</entry> +</row> +<row> +<entry>INVALIDATE METADATA</entry> +<entry>ALL</entry> +<entry>Server</entry> +</row> +<row rev="1.2.2"> +<entry>COMPUTE STATS</entry> +<entry>ALL</entry> +<entry>Table</entry> +</row> +<row> +<entry>SHOW TABLE STATS, SHOW PARTITIONS</entry> +<entry>?</entry> +<entry>?</entry> +</row> +<row> +<entry>SHOW COLUMN STATS</entry> +<entry>?</entry> +<entry>?</entry> +</row> +<row> +<entry>SHOW FUNCTIONS</entry> +<entry>?</entry> +<entry>?</entry> +</row> +<row> +<entry>SHOW DATABASES, SHOW SCHEMAS</entry> +<entry>?</entry> +<entry>?</entry> +</row> +</tbody> +</tgroup> +</table> +--> + +<!-- Here are pieces from the original Hive table that do not apply to Impala because of differences in the SQL syntax. + Might resurrect some of these later as features are added to Impala, in particular CREATE TABLE AS SELECT. 
+<row> +<entry>EXPORT</entry> +<entry>SELECT, INSERT@URI</entry> +<entry>Table</entry> +</row> +<row> +<entry>IMPORT</entry> +<entry>ALL, SELECT@URI</entry> +<entry>Database</entry> +</row> +<row> +<entry>ANALYZE</entry> +<entry>INSERT, SELECT</entry> +<entry>Table</entry> +</row> +<row> +<entry>SWITCH DATABASE</entry> +<entry>Any</entry> +<entry>Any Table, View in the DB</entry> +</row> +<row> +<entry>DESCRIBE FUNCTION*</entry> +<entry>Not Restricted</entry> +</row> +<row> +<entry>MSCK</entry> +<entry>ALL</entry> +<entry>Database</entry> +</row> +<row> +<entry>SHOW COLUMNS</entry> +<entry>SELECT or INSERT</entry> +<entry>Table</entry> +</row> +<row> +<entry>SHOW TABLE STATUS</entry> +<entry>SELECT or INSERT</entry> +<entry>Table</entry> +</row> +<row> +<entry>SHOW TABLE PROPERTIES</entry> +<entry>SELECT or INSERT</entry> +<entry>Table</entry> +</row> +<row> +<entry>SHOW CREATE TABLE</entry> +<entry>SELECT or INSERT</entry> +<entry>Table</entry> +</row> +<row> +<entry>SHOW FUNCTIONS*</entry> +<entry>Not restricted</entry> +</row> +<row> +<entry>SHOW PARTITIONS</entry> +<entry>SELECT or INSERT</entry> +<entry>Table</entry> +</row> +<row> +<entry>SHOW INDEXES</entry> +<entry>SELECT or INSERT</entry> +<entry>Table</entry> +</row> +<row> +<entry>SHOW LOCKS</entry> +<entry>Not Restricted</entry> +</row> +<row> +<entry>CREATE FUNCTION</entry> +<entry>Any Privilege</entry> +<entry>Any Object</entry> +</row> +<row> +<entry>CREATE INDEX</entry> +<entry>ALL</entry> +<entry>Database</entry> +</row> +<row> +<entry>DROP FUNCTION</entry> +<entry>Any Privilege</entry> +<entry>Any Object</entry> +</row> +<row> +<entry>DROP INDEX</entry> +<entry>ALL</entry> +<entry>Database</entry> +</row> +<row> +<entry>ALTER INDEX REBUILD</entry> +<entry>ALL</entry> +<entry>Database</entry> +</row> +<row> +<entry>ALTER VIEW PROPERTIES</entry> +<entry>ALL</entry> +<entry>Database</entry> +</row> +<row> +<entry>GRANT PRIVILEGE</entry> +<entry>Allowed, but has no effect on HS2 auth</entry> +</row> +<row> +<entry>REVOKE PRIVILEGE</entry> +<entry>Allowed, but has no effect on HS2 auth</entry> +</row> +<row> +<entry>SHOW GRANTS</entry> +<entry>Allowed, but has no effect on HS2 auth</entry> +</row> +<row> +<entry>ALTER TABLE PROTECT MODE</entry> +<entry>ALL</entry> +<entry>Database</entry> +</row> +<row> +<entry>ALTER TABLE FILE FORMAT</entry> +<entry>ALL</entry> +<entry>Database</entry> +</row> +<row> +<entry>ALTER PARTITION PROTECT MODE</entry> +<entry>ALL</entry> +<entry>Database</entry> +</row> +<row> +<entry>ALTER PARTITION FILE FORMAT</entry> +<entry>ALL</entry> +<entry>Database</entry> +</row> +<row> +<entry>ALTER PARTITION LOCATION*</entry> +<entry>ALL</entry> +<entry>Server</entry> +</row> +<row> +<entry>CREATE TABLE AS SELECT</entry> +<entry>ALL, SELECT</entry> +<entry>Database, Table/View</entry> +</row> +<row> +<entry>ALTER INDEX PROP</entry> +<entry>ALL</entry> +<entry>Database</entry> +</row> +<row> +<entry>ALTER DATABASE</entry> +<entry>ALL</entry> +<entry>Database</entry> +</row> +<row> +<entry>DESCRIBE DATABASE</entry> +<entry>ALL</entry> +<entry>Database</entry> +</row> +<row> +<entry>ALTER TABLE MERGE FILE</entry> +<entry>ALL</entry> +<entry>Database</entry> +</row> +<row> +<entry>ALTER PARTITION MERGE FILE</entry> +<entry>ALL</entry> +<entry>Database</entry> +</row> +<row> +<entry>ALTER TABLE SKEWED</entry> +<entry>ALL</entry> +<entry>Database</entry> +</row> +<row> +<entry>ALTER TABLE PARTITION SKEWED LOCATION</entry> +<entry>ALL</entry> +<entry>Database</entry> +</row> +<row> +<entry>ADD JAR*</entry> 
+<entry>Restricted unless hive.server2.authorization.external.exec = true</entry>
+</row>
+<row>
+<entry>TRANSFORM*</entry>
+<entry>ALL</entry>
+<entry>Server</entry>
+</row>
+-->
+ </conbody>
+ </concept>
+
+ <concept id="sentry_debug">
+
+ <title><ph conref="../shared/impala_common.xml#common/title_sentry_debug"/></title>
+
+ <conbody>
+
+ <p conref="../shared/impala_common.xml#common/sentry_debug"/>
+ </conbody>
+ </concept>
+
+ <concept id="sentry_cm">
+
+ <title>Managing Sentry for Impala through Cloudera Manager</title>
+
+ <conbody>
+
+ <p>
+ To enable the Sentry service for Impala and Hive, set the <uicontrol>Hive/Impala > Service-Wide > Sentry
+ Service</uicontrol> parameter to the Sentry service. Then restart Impala and Hive. Simply adding the Sentry
+ service as a dependency and restarting enables Impala and Hive to use the Sentry service.
+ </p>
+
+ <p>
+ To set the server name to use when granting server-level privileges, set the <uicontrol>Hive >
+ Service-Wide > Advanced > Server Name for Sentry Authorization</uicontrol> parameter. When using
+ Sentry with the Hive Metastore, you can specify the list of users that are allowed to bypass Sentry
+ Authorization in Hive Metastore using <uicontrol>Hive > Service-Wide > Security > Bypass Sentry
+ Authorization Users</uicontrol>. These are usually service users that already ensure all activity has been
+ authorized.
+ </p>
+
+ <note>
+ The <uicontrol>Hive/Impala > Service-Wide > Policy File Based Sentry</uicontrol> tab contains
+ parameters only relevant to configuring Sentry using policy files. In particular, make sure that the
+ <uicontrol>Enable Sentry Authorization using Policy Files</uicontrol> parameter is unchecked when using the
+ Sentry service. Cloudera Manager throws a validation error if you attempt to configure the Sentry service
+ and policy file at the same time.
+ </note>
 </conbody>
 </concept>
+ <concept id="sec_ex_default">
+
+ <title>The DEFAULT Database in a Secure Deployment</title>
+
+ <conbody>
+
+ <p>
+ Because of the extra emphasis on granular access controls in a secure deployment, you should move any
+ important or sensitive information out of the <codeph>DEFAULT</codeph> database into a named database whose
+ privileges are specified in the policy file. Sometimes you might need to give privileges on the
+ <codeph>DEFAULT</codeph> database for administrative reasons; for example, as a place you can reliably
+ specify with a <codeph>USE</codeph> statement when preparing to drop a database.
+ </p>
+
+<!-- Maybe have an example later, but not for initial 1.1 release.
+<codeblock></codeblock>
+-->
+ </conbody>
+ </concept>
+</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_avg.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_avg.xml b/docs/topics/impala_avg.xml index 26f5450..d95b2d5 100644 --- a/docs/topics/impala_avg.xml +++ b/docs/topics/impala_avg.xml @@ -3,7 +3,7 @@ <concept id="avg"> <title>AVG Function</title> - <titlealts><navtitle>AVG</navtitle></titlealts> + <titlealts audience="PDF"><navtitle>AVG</navtitle></titlealts> <prolog> <metadata> <data name="Category" value="Impala"/> @@ -12,6 +12,8 @@ <data name="Category" value="Analytic Functions"/> <data name="Category" value="Aggregate Functions"/> <data name="Category" value="Querying"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> </metadata> </prolog> @@ -44,7 +46,7 @@ <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> <p conref="../shared/impala_common.xml#common/complex_types_aggregation_explanation"/> - + <p conref="../shared/impala_common.xml#common/complex_types_aggregation_example"/> <p conref="../shared/impala_common.xml#common/example_blurb"/> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_avro.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_avro.xml b/docs/topics/impala_avro.xml index ae19192..d89d782 100644 --- a/docs/topics/impala_avro.xml +++ b/docs/topics/impala_avro.xml @@ -4,7 +4,15 @@ <title>Using the Avro File Format with Impala Tables</title> <titlealts audience="PDF"><navtitle>Avro Data Files</navtitle></titlealts> - + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="File Formats"/> + <data name="Category" value="Avro"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> <conbody> @@ -15,9 +23,534 @@ insert operations, use Hive, then switch back to Impala to run queries. </p> - + <table> + <title>Avro Format Support in Impala</title> + <tgroup cols="5"> + <colspec colname="1" colwidth="10*"/> + <colspec colname="2" colwidth="10*"/> + <colspec colname="3" colwidth="20*"/> + <colspec colname="4" colwidth="30*"/> + <colspec colname="5" colwidth="30*"/> + <thead> + <row> + <entry> + File Type + </entry> + <entry> + Format + </entry> + <entry> + Compression Codecs + </entry> + <entry> + Impala Can CREATE? + </entry> + <entry> + Impala Can INSERT? + </entry> + </row> + </thead> + <tbody> + <row conref="impala_file_formats.xml#file_formats/avro_support"> + <entry/> + </row> + </tbody> + </tgroup> + </table> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="avro_create_table"> + + <title>Creating Avro Tables</title> + + <conbody> + + <p> + To create a new table using the Avro file format, issue the <codeph>CREATE TABLE</codeph> statement through + Impala with the <codeph>STORED AS AVRO</codeph> clause, or through Hive. If you create the table through + Impala, you must include column definitions that match the fields specified in the Avro schema. With Hive, + you can omit the columns and just specify the Avro schema. + </p> + + <p rev="2.3.0"> + In CDH 5.5 / Impala 2.3 and higher, the <codeph>CREATE TABLE</codeph> for Avro tables can include + SQL-style column definitions rather than specifying Avro notation through the <codeph>TBLPROPERTIES</codeph> + clause. 
Impala issues warning messages if there are any mismatches between the types specified in the + SQL column definitions and the underlying types; for example, any <codeph>TINYINT</codeph> or + <codeph>SMALLINT</codeph> columns are treated as <codeph>INT</codeph> in the underlying Avro files, + and therefore are displayed as <codeph>INT</codeph> in any <codeph>DESCRIBE</codeph> or + <codeph>SHOW CREATE TABLE</codeph> output. + </p> + + <note> + <p conref="../shared/impala_common.xml#common/avro_no_timestamp"/> + </note> + + <!-- + To do: Expand these examples to show switching between impala-shell and Hive, loading some data, and then + doing DESCRIBE and querying the table. + --> + + <p> + The following examples demonstrate creating an Avro table in Impala, using either an inline column + specification or one taken from a JSON file stored in HDFS: + </p> + +<codeblock><![CDATA[ +[localhost:21000] > CREATE TABLE avro_only_sql_columns + > ( + > id INT, + > bool_col BOOLEAN, + > tinyint_col TINYINT, /* Gets promoted to INT */ + > smallint_col SMALLINT, /* Gets promoted to INT */ + > int_col INT, + > bigint_col BIGINT, + > float_col FLOAT, + > double_col DOUBLE, + > date_string_col STRING, + > string_col STRING + > ) + > STORED AS AVRO; + +[localhost:21000] > CREATE TABLE impala_avro_table + > (bool_col BOOLEAN, int_col INT, long_col BIGINT, float_col FLOAT, double_col DOUBLE, string_col STRING, nullable_int INT) + > STORED AS AVRO + > TBLPROPERTIES ('avro.schema.literal'='{ + > "name": "my_record", + > "type": "record", + > "fields": [ + > {"name":"bool_col", "type":"boolean"}, + > {"name":"int_col", "type":"int"}, + > {"name":"long_col", "type":"long"}, + > {"name":"float_col", "type":"float"}, + > {"name":"double_col", "type":"double"}, + > {"name":"string_col", "type":"string"}, + > {"name": "nullable_int", "type": ["null", "int"]}]}'); + +[localhost:21000] > CREATE TABLE avro_examples_of_all_types ( + > id INT, + > bool_col BOOLEAN, + > tinyint_col TINYINT, + > smallint_col SMALLINT, + > int_col INT, + > bigint_col BIGINT, + > float_col FLOAT, + > double_col DOUBLE, + > date_string_col STRING, + > string_col STRING + > ) + > STORED AS AVRO + > TBLPROPERTIES ('avro.schema.url'='hdfs://localhost:8020/avro_schemas/alltypes.json'); +]]> +</codeblock> + + <p> + The following example demonstrates creating an Avro table in Hive: + </p> + +<codeblock><![CDATA[ +hive> CREATE TABLE hive_avro_table + > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' + > STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' + > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' + > TBLPROPERTIES ('avro.schema.literal'='{ + > "name": "my_record", + > "type": "record", + > "fields": [ + > {"name":"bool_col", "type":"boolean"}, + > {"name":"int_col", "type":"int"}, + > {"name":"long_col", "type":"long"}, + > {"name":"float_col", "type":"float"}, + > {"name":"double_col", "type":"double"}, + > {"name":"string_col", "type":"string"}, + > {"name": "nullable_int", "type": ["null", "int"]}]}'); +]]> +</codeblock> + + <p> + Each field of the record becomes a column of the table. Note that any other information, such as the record + name, is ignored. + </p> + +<!-- Have not got a working example of this syntax yet from Lenni. +<p> +The schema can be specified either through the <codeph>TBLPROPERTIES</codeph> clause or the +<codeph>WITH SERDEPROPERTIES</codeph> clause. 
+For best compatibility with future versions of Hive, use the <codeph>WITH SERDEPROPERTIES</codeph> clause
+for this information.
+</p>
+-->
+
+      <note>
+        For nullable Avro columns, make sure to put the <codeph>"null"</codeph> entry before the actual type
+        name. In Impala, all columns are nullable; Impala currently does not have a <codeph>NOT NULL</codeph>
+        clause. Any non-nullable property is enforced only on the Avro side.
+      </note>
+
+      <p>
+        Most column types map directly from Avro to Impala under the same names. These are the exceptions and
+        special cases to consider:
+      </p>
+
+      <ul>
+        <li>
+          The <codeph>DECIMAL</codeph> type is defined in Avro as a <codeph>BYTES</codeph> type with the
+          <codeph>logicalType</codeph> property set to <codeph>"decimal"</codeph> and a specified precision
+          and scale. Use <codeph>DECIMAL</codeph> in Avro tables only under CDH 5. The infrastructure and
+          components under CDH 4 do not have reliable <codeph>DECIMAL</codeph> support.
+        </li>
+
+        <li>
+          The Avro <codeph>long</codeph> type maps to <codeph>BIGINT</codeph> in Impala.
+        </li>
+      </ul>
+
+      <p>
+        If you create the table through Hive, switch back to <cmdname>impala-shell</cmdname> and issue an
+        <codeph>INVALIDATE METADATA <varname>table_name</varname></codeph> statement. Then you can run queries
+        for that table through <cmdname>impala-shell</cmdname>.
+      </p>
+
+      <p rev="2.3.0">
+        In rare instances, a mismatch could occur between the Avro schema and the column definitions in the
+        metastore database. In CDH 5.5 / Impala 2.3 and higher, Impala checks for such inconsistencies during
+        a <codeph>CREATE TABLE</codeph> statement and each time it loads the metadata for a table (for example,
+        after <codeph>INVALIDATE METADATA</codeph>). Impala uses the following rules to determine how to treat
+        mismatching columns, a process known as <term>schema reconciliation</term>:
+        <ul>
+          <li>
+            If there is a mismatch in the number of columns, Impala uses the column
+            definitions from the Avro schema.
+          </li>
+          <li>
+            If there is a mismatch in column name or type, Impala uses the column definition from the Avro
+            schema. Because a <codeph>CHAR</codeph> or <codeph>VARCHAR</codeph> column in Impala maps to an
+            Avro <codeph>STRING</codeph>, this case is not considered a mismatch and the column is preserved
+            as <codeph>CHAR</codeph> or <codeph>VARCHAR</codeph> in the reconciled schema.
+            <ph rev="2.7.0 IMPALA-3687 CDH-43731">Prior to CDH 5.9 / Impala 2.7, the column name and comment
+            for such <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph> columns were also taken from the SQL
+            column definition. In CDH 5.9 / Impala 2.7 and higher, the column name and comment from the Avro
+            schema file take precedence for such columns, and only the <codeph>CHAR</codeph> or
+            <codeph>VARCHAR</codeph> type is preserved from the SQL column definition.</ph>
+          </li>
+          <li>
+            An Impala <codeph>TIMESTAMP</codeph> column definition maps to an Avro <codeph>STRING</codeph> and
+            is presented as a <codeph>STRING</codeph> in the reconciled schema, because Avro has no binary
+            <codeph>TIMESTAMP</codeph> representation. As a result, no Avro table can have a
+            <codeph>TIMESTAMP</codeph> column; this restriction is the same as in earlier CDH and Impala
+            releases.
+          </li>
+        </ul>
+      </p>
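+      <p>
+        The following is a minimal sketch of these reconciliation rules; the <codeph>reconcile_demo</codeph>
+        table name is hypothetical:
+      </p>
+
+<codeblock>-- The SQL definition declares ID as STRING, but the Avro schema declares it as "int".
+CREATE TABLE reconcile_demo (id STRING)
+  STORED AS AVRO
+  TBLPROPERTIES ('avro.schema.literal'='{
+    "name": "demo_record",
+    "type": "record",
+    "fields": [ {"name":"id", "type":"int"} ]}');
+
+-- Impala warns about the mismatch during CREATE TABLE. Because name and type mismatches
+-- resolve in favor of the Avro schema, DESCRIBE reports the column as INT.
+DESCRIBE reconcile_demo;
+</codeblock>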
+
+      <p conref="../shared/impala_common.xml#common/complex_types_unsupported_filetype"/>
+
+    </conbody>
+  </concept>
+
+  <concept id="avro_map_table">
+
+    <title>Using a Hive-Created Avro Table in Impala</title>
+
+    <conbody>
+
+      <p>
+        If you have an Avro table created through Hive, you can use it in Impala as long as it contains only
+        Impala-compatible data types. It cannot contain:
+        <ul>
+          <li>
+            Complex types: <codeph>array</codeph>, <codeph>map</codeph>, <codeph>record</codeph>,
+            <codeph>struct</codeph>, <codeph>union</codeph> other than
+            <codeph>[<varname>supported_type</varname>,null]</codeph> or
+            <codeph>[null,<varname>supported_type</varname>]</codeph>
+          </li>
+
+          <li>
+            The Avro-specific types <codeph>enum</codeph>, <codeph>bytes</codeph>, and <codeph>fixed</codeph>
+          </li>
+
+          <li>
+            Any scalar type other than those listed in <xref href="impala_datatypes.xml#datatypes"/>
+          </li>
+        </ul>
+        Because Impala and Hive share the same metastore database, Impala can directly access the table
+        definitions and data for tables that were created in Hive.
+      </p>
+
+      <p>
+        If you create an Avro table in Hive, issue an <codeph>INVALIDATE METADATA</codeph> statement the next
+        time you connect to Impala through <cmdname>impala-shell</cmdname>. This is a one-time operation to
+        make Impala aware of the new table. You can issue the statement while connected to any Impala node,
+        and the catalog service broadcasts the change to all other Impala nodes.
+      </p>
+
+      <p>
+        If you load new data into an Avro table through Hive, either through a Hive <codeph>LOAD DATA</codeph>
+        or <codeph>INSERT</codeph> statement, or by manually copying or moving files into the data directory
+        for the table, issue a <codeph>REFRESH <varname>table_name</varname></codeph> statement the next time
+        you connect to Impala through <cmdname>impala-shell</cmdname>. You can issue the statement while
+        connected to any Impala node, and the catalog service broadcasts the change to all other Impala nodes.
+        If you issue the <codeph>LOAD DATA</codeph> statement through Impala, you do not need a
+        <codeph>REFRESH</codeph> afterward.
+      </p>
+
+      <p>
+        Impala supports only fields of type <codeph>boolean</codeph>, <codeph>int</codeph>,
+        <codeph>long</codeph>, <codeph>float</codeph>, <codeph>double</codeph>, and <codeph>string</codeph>,
+        or unions of these types with null; for example, <codeph>["string", "null"]</codeph>. Unions with
+        <codeph>null</codeph> essentially create a nullable type.
+      </p>
+    </conbody>
+  </concept>
+
+  <concept id="avro_json">
+
+    <title>Specifying the Avro Schema through JSON</title>
+
+    <conbody>
+
+      <p>
+        While you can embed a schema directly in your <codeph>CREATE TABLE</codeph> statement, as shown above,
+        column width restrictions in the Hive metastore limit the length of the schema you can specify. If you
+        encounter problems with long schema literals, try storing your schema as a JSON file in HDFS instead.
+        Specify your schema in HDFS using table properties similar to the following:
+      </p>
+
+<codeblock>tblproperties ('avro.schema.url'='hdfs://your-name-node:port/path/to/schema.json');</codeblock>
+    </conbody>
+  </concept>
+
+  <concept id="avro_load_data">
+
+    <title>Loading Data into an Avro Table</title>
+    <prolog>
+      <metadata>
+        <data name="Category" value="ETL"/>
+        <data name="Category" value="Ingest"/>
+      </metadata>
+    </prolog>
+
+    <conbody>
+
+      <p rev="DOCS-1523">
+        Currently, Impala cannot write Avro data files. Therefore, an Avro table cannot be used as the
+        destination of an Impala <codeph>INSERT</codeph> statement or <codeph>CREATE TABLE AS SELECT</codeph>.
+      </p>
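+      <p>
+        As a minimal illustration of this restriction (the <codeph>some_text_table</codeph> name is
+        hypothetical), both of the following statements fail against an Avro table:
+      </p>
+
+<codeblock>-- Both statements are rejected, because Impala cannot write Avro data files.
+INSERT INTO avro_only_sql_columns SELECT * FROM some_text_table;
+
+CREATE TABLE avro_copy STORED AS AVRO
+  AS SELECT * FROM some_text_table;
+</codeblock>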
+
+      <p>
+        To copy data from another table, issue any <codeph>INSERT</codeph> statements through Hive. For
+        information about loading data into Avro tables through Hive, see the
+        <xref href="https://cwiki.apache.org/confluence/display/Hive/AvroSerDe" scope="external" format="html">Avro
+        page on the Hive wiki</xref>.
+      </p>
+
+      <p>
+        If you already have data files in Avro format, you can also issue <codeph>LOAD DATA</codeph> in either
+        Impala or Hive. Impala can move existing Avro data files into an Avro table; it just cannot create new
+        Avro data files.
+      </p>
+
+    </conbody>
+  </concept>
+
+  <concept id="avro_compression">
+
+    <title>Enabling Compression for Avro Tables</title>
+    <prolog>
+      <metadata>
+        <data name="Category" value="Compression"/>
+        <data name="Category" value="Snappy"/>
+      </metadata>
+    </prolog>
+
+    <conbody>
+
+      <p>
+        <indexterm audience="Cloudera">compression</indexterm>
+        To enable compression for Avro tables, specify settings in the Hive shell to enable compression and to
+        specify a codec, then issue a <codeph>CREATE TABLE</codeph> statement as in the preceding examples.
+        Impala supports the <codeph>snappy</codeph> and <codeph>deflate</codeph> codecs for Avro tables.
+      </p>
+
+      <p>
+        For example:
+      </p>
+
+<codeblock>hive> set hive.exec.compress.output=true;
+hive> set avro.output.codec=snappy;</codeblock>
+    </conbody>
+  </concept>
+
+  <concept rev="1.1" id="avro_schema_evolution">
+
+    <title>How Impala Handles Avro Schema Evolution</title>
+    <prolog>
+      <metadata>
+        <data name="Category" value="Concepts"/>
+      </metadata>
+    </prolog>
+
+    <conbody>
+
+      <p>
+        Starting in Impala 1.1, Impala can deal with Avro data files that employ <term>schema
+        evolution</term>, where different data files within the same table use slightly different type
+        definitions. (You would perform the schema evolution operation by issuing an <codeph>ALTER
+        TABLE</codeph> statement in the Hive shell.) The old and new types for any changed columns must be
+        compatible; for example, a column might start as an <codeph>int</codeph> and later change to a
+        <codeph>bigint</codeph> or <codeph>float</codeph>.
+      </p>
+
+      <p>
+        As with any other table whose definition is changed or whose data is added outside of the current
+        <cmdname>impalad</cmdname> node, ensure that Impala loads the latest metadata for the table if the
+        Avro schema is modified through Hive. Issue a <codeph>REFRESH <varname>table_name</varname></codeph>
+        or <codeph>INVALIDATE METADATA <varname>table_name</varname></codeph> statement.
+        <codeph>REFRESH</codeph> reloads the metadata immediately; <codeph>INVALIDATE METADATA</codeph>
+        reloads it the next time the table is accessed.
+      </p>
+
+      <p>
+        When Avro data files or columns are not consulted during a query, Impala does not check for
+        consistency. Thus, if you issue <codeph>SELECT c1, c2 FROM t1</codeph>, Impala does not return any
+        error if the column <codeph>c3</codeph> changed in an incompatible way. If a query retrieves data from
+        some partitions but not others, Impala does not check the data files for the unused partitions.
+      </p>
+
+      <p>
+        In Hive DDL statements, you can specify an <codeph>avro.schema.literal</codeph> table property (if the
+        schema definition is short) or an <codeph>avro.schema.url</codeph> property (if the schema definition
+        is long, or to allow convenient editing of the definition).
+      </p>
+
+      <p>
+        For example, running the following SQL code in the Hive shell creates a table using the Avro file
+        format and puts some sample data into it:
+      </p>
+
+<codeblock>CREATE TABLE avro_table (a string, b string)
+ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
+STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
+OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
+TBLPROPERTIES (
+  'avro.schema.literal'='{
+    "type": "record",
+    "name": "my_record",
+    "fields": [
+      {"name": "a", "type": "int"},
+      {"name": "b", "type": "string"}
+    ]}');
+
+INSERT OVERWRITE TABLE avro_table SELECT 1, "avro" FROM functional.alltypes LIMIT 1;
+</codeblock>
+
+      <p>
+        Once the Avro table is created and contains data, you can query it through the
+        <cmdname>impala-shell</cmdname> command (after an <codeph>INVALIDATE METADATA</codeph> statement,
+        because the table was created through Hive):
+      </p>
+
+<codeblock>[localhost:21000] > select * from avro_table;
++---+------+
+| a | b    |
++---+------+
+| 1 | avro |
++---+------+
+</codeblock>
+
+      <p>
+        Now in the Hive shell, you change the type of a column and add a new column with a default value:
+      </p>
+
+<codeblock>-- Promote column "a" from INT to FLOAT (no need to update Avro schema)
+ALTER TABLE avro_table CHANGE A A FLOAT;
+
+-- Add column "c" with default
+ALTER TABLE avro_table ADD COLUMNS (c int);
+ALTER TABLE avro_table SET TBLPROPERTIES (
+  'avro.schema.literal'='{
+    "type": "record",
+    "name": "my_record",
+    "fields": [
+      {"name": "a", "type": "int"},
+      {"name": "b", "type": "string"},
+      {"name": "c", "type": "int", "default": 10}
+    ]}');
+</codeblock>
+
+      <p>
+        Once again in <cmdname>impala-shell</cmdname>, you can query the Avro table based on its latest schema
+        definition. Because the table metadata was changed outside of Impala, you issue a
+        <codeph>REFRESH</codeph> statement first so that Impala has up-to-date metadata for the table.
+      </p>
+
+<codeblock>[localhost:21000] > refresh avro_table;
+[localhost:21000] > select * from avro_table;
++---+------+----+
+| a | b    | c  |
++---+------+----+
+| 1 | avro | 10 |
++---+------+----+
+</codeblock>
+    </conbody>
+  </concept>
+
+  <concept id="avro_data_types">
+
+    <title>Data Type Considerations for Avro Tables</title>
+
+    <conbody>
+
+      <p>
+        The Avro format defines a set of data types whose names differ from the names of the corresponding
+        Impala data types. If you are preparing Avro files using other Hadoop components such as Pig or
+        MapReduce, you might need to work with the type names defined by Avro. The following listing shows the
+        Avro-defined types and the equivalent types in Impala:
+      </p>
+
+<codeblock><![CDATA[Primitive Types (Avro -> Impala)
+--------------------------------
+STRING -> STRING
+STRING -> CHAR
+STRING -> VARCHAR
+INT -> INT
+BOOLEAN -> BOOLEAN
+LONG -> BIGINT
+FLOAT -> FLOAT
+DOUBLE -> DOUBLE
+
+Logical Types
+-------------
+BYTES + logicalType = "decimal" -> DECIMAL
+
+Avro Types with No Impala Equivalent
+------------------------------------
+RECORD, MAP, ARRAY, UNION, ENUM, FIXED, NULL
+
+Impala Types with No Avro Equivalent
+------------------------------------
+TIMESTAMP
+]]>
+</codeblock>
+
+      <p conref="../shared/impala_common.xml#common/avro_2gb_strings"/>
+    </conbody>
+  </concept>
+
+  <concept id="avro_performance">
+
+    <title>Query Performance for Impala Avro Tables</title>
+
+    <conbody>
+
+      <p>
+        In general, expect query performance with Avro tables to be faster than with tables using text data,
+        but slower than with Parquet tables. See <xref href="impala_parquet.xml#parquet"/> for information
+        about using the Parquet file format for high-performance analytic queries.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/s3_block_splitting"/>
+
+    </conbody>
+  </concept>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_batch_size.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_batch_size.xml b/docs/topics/impala_batch_size.xml
index 13a4b18..afd3c17 100644
--- a/docs/topics/impala_batch_size.xml
+++ b/docs/topics/impala_batch_size.xml
@@ -3,10 +3,15 @@
 <concept id="batch_size">
 
   <title>BATCH_SIZE Query Option</title>
+  <titlealts audience="PDF"><navtitle>BATCH_SIZE</navtitle></titlealts>
   <prolog>
     <metadata>
       <data name="Category" value="Impala"/>
       <data name="Category" value="Impala Query Options"/>
+      <data name="Category" value="Developers"/>
+      <data name="Category" value="Data Analysts"/>
+      <data name="Category" value="Querying"/>
+      <data name="Category" value="Performance"/>
     </metadata>
   </prolog>
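+
+<!-- A hedged usage sketch, not part of the original change: BATCH_SIZE is set per session
+     through the standard SET syntax in impala-shell; the value and table name below are
+     illustrative only. -->
+<codeblock>set batch_size=2048;   -- rows processed per batch for this session
+select count(*) from some_table;
+set batch_size=0;      -- 0 restores the built-in default
+</codeblock>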