http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_array.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_array.xml b/docs/topics/impala_array.xml
new file mode 100644
index 0000000..f9519c1
--- /dev/null
+++ b/docs/topics/impala_array.xml
@@ -0,0 +1,269 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="array">
+
+  <title>ARRAY Complex Type (<keyword keyref="impala23"/> or higher only)</title>
+
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Impala Data Types"/>
+      <data name="Category" value="SQL"/>
+      <data name="Category" value="Developers"/>
+      <data name="Category" value="Data Analysts"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      A complex data type that can represent an arbitrary number of ordered elements.
+      The elements can be scalars or another complex type (<codeph>ARRAY</codeph>,
+      <codeph>STRUCT</codeph>, or <codeph>MAP</codeph>).
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<!-- To do: make sure there is sufficient syntax info under the SELECT statement to understand how to query all the complex types. -->
+
+<codeblock><varname>column_name</varname> ARRAY &lt; <varname>type</varname> &gt;
+
+type ::= <varname>primitive_type</varname> | <varname>complex_type</varname>
+</codeblock>
+
+      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+      <p conref="../shared/impala_common.xml#common/complex_types_combo"/>
+
+      <p>
+        The elements of the array have no names. You refer to the value of the array item using the
+        <codeph>ITEM</codeph> pseudocolumn, or its position in the array with the <codeph>POS</codeph>
+        pseudocolumn. See <xref href="impala_complex_types.xml#item"/> for information about
+        these pseudocolumns.
+      </p>
+
+<!-- Array is a frequently used idiom; don't recommend MAP right up front, since that is more rarely used. STRUCT has all different considerations.
+      <p>
+        If it would be logical to have a fixed number of elements and give each one a name, consider using a
+        <codeph>MAP</codeph> (when all elements are of the same type) or a <codeph>STRUCT</codeph> (if different
+        elements have different types) instead of an <codeph>ARRAY</codeph>.
+      </p>
+-->
+
+    <p>
+      Each row can have a different number of elements (including none) in the array for that row.
+    </p>
+
+<!-- Since you don't use numeric indexes, this assertion and advice doesn't make sense.
+      <p>
+        If you attempt to refer to a non-existent array element, the result is <codeph>NULL</codeph>. Therefore,
+        when using operations such as addition or string concatenation involving array elements, you might use
+        conditional functions to substitute default values such as 0 or <codeph>""</codeph> in the place of missing
+        array elements.
+      </p>
+-->
+
+      <p>
+        When an array contains items of scalar types, you can use aggregation functions on the array elements without using join notation. For
+        example, you can find the <codeph>COUNT()</codeph>, <codeph>AVG()</codeph>, <codeph>SUM()</codeph>, and so on of numeric array
+        elements, or the <codeph>MAX()</codeph> and <codeph>MIN()</codeph> of any scalar array elements by referring to
+        <codeph><varname>table_name</varname>.<varname>array_column</varname></codeph> in the <codeph>FROM</codeph> clause of the query. When
+        you need to cross-reference values from the array with scalar values from the same row, such as by including a <codeph>GROUP
+        BY</codeph> clause to produce a separate aggregated result for each row, then the join clause is required.
+      </p>
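+
+      <p>
+        As a quick sketch of both forms (using the <codeph>ARRAY_DEMO</codeph> table defined later in this
+        topic; sample output not shown), the first query below aggregates over all array elements by
+        referring to the array column directly in the <codeph>FROM</codeph> clause, while the second uses
+        the join notation and a <codeph>GROUP BY</codeph> clause to produce one aggregated result per row:
+      </p>
+
+<codeblock><![CDATA[-- Aggregate over all array elements in the table; no join notation needed.
+SELECT COUNT(item) FROM array_demo.pets;
+
+-- Cross-reference array elements with a scalar column from the same row:
+-- the join notation plus GROUP BY produces one count per person.
+SELECT name, COUNT(pets.item)
+  FROM array_demo, array_demo.pets
+GROUP BY name;
+]]>
+</codeblock>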
+
+      <p>
+        A common usage pattern with complex types is to have an array as the top-level type for the column:
+        an array of structs, an array of maps, or an array of arrays.
+        For example, you can model a denormalized table by creating a column that is an <codeph>ARRAY</codeph>
+        of <codeph>STRUCT</codeph> elements; each item in the array represents a row from a table that would
+        normally be used in a join query. This kind of data structure lets you essentially denormalize tables by
+        associating multiple rows from one table with the matching row in another table.
+      </p>
+
+      <p>
+        You typically do not create more than one top-level <codeph>ARRAY</codeph> column, because when there is
+        some relationship between the elements of multiple arrays, it is more convenient to model the data as
+        an array of another complex type element (either <codeph>STRUCT</codeph> or <codeph>MAP</codeph>).
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/complex_types_describe"/>
+
+      <p conref="../shared/impala_common.xml#common/added_in_230"/>
+
+      <p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
+
+      <ul conref="../shared/impala_common.xml#common/complex_types_restrictions">
+        <li/>
+      </ul>
+
+      <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+      <note conref="../shared/impala_common.xml#common/complex_type_schema_pointer"/>
+
+      <p>
+        The following example shows how to construct a table with various kinds of <codeph>ARRAY</codeph> columns,
+        both at the top level and nested within other complex types.
+        Whenever the <codeph>ARRAY</codeph> consists of a scalar value, such as in the <codeph>PETS</codeph>
+        column or the <codeph>CHILDREN</codeph> field, you can see that future expansion is limited.
+        For example, you could not easily evolve the schema to record the kind of pet or the child's birthday alongside the name.
+        Therefore, it is more common to use an <codeph>ARRAY</codeph> whose elements are of <codeph>STRUCT</codeph> type,
+        to associate multiple fields with each array element.
+      </p>
+
+      <note>
+        Practice the <codeph>CREATE TABLE</codeph> and query notation for complex type columns
+        using empty tables, until you can visualize a complex data structure and construct corresponding SQL statements reliably.
+      </note>
+
+<!-- To do: verify and flesh out this example. -->
+
+<codeblock><![CDATA[CREATE TABLE array_demo
+(
+  id BIGINT,
+  name STRING,
+-- An ARRAY of scalar type as a top-level column.
+  pets ARRAY <STRING>,
+
+-- An ARRAY with elements of complex type (STRUCT).
+  places_lived ARRAY < STRUCT <
+    place: STRING,
+    start_year: INT
+  >>,
+
+-- An ARRAY as a field (CHILDREN) within a STRUCT.
+-- (The STRUCT is inside another ARRAY, because it is rare
+-- for a STRUCT to be a top-level column.)
+  marriages ARRAY < STRUCT <
+    spouse: STRING,
+    children: ARRAY <STRING>
+  >>,
+
+-- An ARRAY as the value part of a MAP.
+-- The first MAP field (the key) would be a value such as
+-- 'Parent' or 'Grandparent', and the corresponding array would
+-- represent 2 parents, 4 grandparents, and so on.
+  ancestors MAP < STRING, ARRAY <STRING> >
+)
+STORED AS PARQUET;
+]]>
+</codeblock>
+
+    <p>
+      The following example shows how to examine the structure of a table containing one or more <codeph>ARRAY</codeph> columns by using the
+      <codeph>DESCRIBE</codeph> statement. You can visualize each <codeph>ARRAY</codeph> as its own two-column table, with columns
+      <codeph>ITEM</codeph> and <codeph>POS</codeph>.
+    </p>
+
+<!-- To do: extend the examples to include MARRIAGES and ANCESTORS columns, or get rid of those columns. -->
+
+<codeblock><![CDATA[DESCRIBE array_demo;
++--------------+---------------------------+
+| name         | type                      |
++--------------+---------------------------+
+| id           | bigint                    |
+| name         | string                    |
+| pets         | array<string>             |
+| marriages    | array<struct<             |
+|              |   spouse:string,          |
+|              |   children:array<string>  |
+|              | >>                        |
+| places_lived | array<struct<             |
+|              |   place:string,           |
+|              |   start_year:int          |
+|              | >>                        |
+| ancestors    | map<string,array<string>> |
++--------------+---------------------------+
+
+DESCRIBE array_demo.pets;
++------+--------+
+| name | type   |
++------+--------+
+| item | string |
+| pos  | bigint |
++------+--------+
+
+DESCRIBE array_demo.marriages;
++------+--------------------------+
+| name | type                     |
++------+--------------------------+
+| item | struct<                  |
+|      |   spouse:string,         |
+|      |   children:array<string> |
+|      | >                        |
+| pos  | bigint                   |
++------+--------------------------+
+
+DESCRIBE array_demo.places_lived;
++------+------------------+
+| name | type             |
++------+------------------+
+| item | struct<          |
+|      |   place:string,  |
+|      |   start_year:int |
+|      | >                |
+| pos  | bigint           |
++------+------------------+
+
+DESCRIBE array_demo.ancestors;
++-------+---------------+
+| name  | type          |
++-------+---------------+
+| key   | string        |
+| value | array<string> |
++-------+---------------+
+]]>
+</codeblock>
+
+    <p>
+      The following example shows queries involving <codeph>ARRAY</codeph> columns containing elements of scalar or complex types. You
+      <q>unpack</q> each <codeph>ARRAY</codeph> column by referring to it in a join query, as if it were a separate table with
+      <codeph>ITEM</codeph> and <codeph>POS</codeph> columns. If the array element is a scalar type, you refer to its value using the
+      <codeph>ITEM</codeph> pseudocolumn. If the array element is a <codeph>STRUCT</codeph>, you refer to the <codeph>STRUCT</codeph> fields
+      using dot notation and the field names. If the array element is another <codeph>ARRAY</codeph> or a <codeph>MAP</codeph>, you use
+      another level of join to unpack the nested collection elements.
+    </p>
+
+<!-- To do: have some sample output to show for these queries. -->
+
+<codeblock><![CDATA[-- Array of scalar values.
+-- Each array element represents a single string, plus we know its position in the array.
+SELECT id, name, pets.pos, pets.item FROM array_demo, array_demo.pets;
+
+-- Array of structs.
+-- Now each array element has named fields, possibly of different types.
+-- You can consider an ARRAY of STRUCT to represent a table inside another table.
+SELECT id, name, places_lived.pos, places_lived.item.place, places_lived.item.start_year
+FROM array_demo, array_demo.places_lived;
+
+-- The .ITEM name is optional for array elements that are structs.
+-- The following query is equivalent to the previous one, with .ITEM
+-- removed from the column references.
+SELECT id, name, places_lived.pos, places_lived.place, places_lived.start_year
+  FROM array_demo, array_demo.places_lived;
+
+-- To filter specific items from the array, do comparisons against the .POS or .ITEM
+-- pseudocolumns, or names of struct fields, in the WHERE clause.
+SELECT id, name, pets.item FROM array_demo, array_demo.pets
+  WHERE pets.pos in (0, 1, 3);
+
+SELECT id, name, pets.item FROM array_demo, array_demo.pets
+  WHERE pets.item LIKE 'Mr. %';
+
+SELECT id, name, places_lived.pos, places_lived.place, places_lived.start_year
+  FROM array_demo, array_demo.places_lived
+  WHERE places_lived.place LIKE '%California%';
+]]>
+</codeblock>
+
+    <p conref="../shared/impala_common.xml#common/related_info"/>
+
+    <p>
+      <xref href="impala_complex_types.xml#complex_types"/>,
+<!-- <xref href="impala_array.xml#array"/>, -->
+      <xref href="impala_struct.xml#struct"/>, <xref 
href="impala_map.xml#map"/>
+    </p>
+
+  </conbody>
+
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_auditing.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_auditing.xml b/docs/topics/impala_auditing.xml
new file mode 100644
index 0000000..6332957
--- /dev/null
+++ b/docs/topics/impala_auditing.xml
@@ -0,0 +1,260 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="auditing">
+
+  <title>Auditing Impala Operations</title>
+  <titlealts audience="PDF"><navtitle>Auditing</navtitle></titlealts>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Auditing"/>
+      <data name="Category" value="Governance"/>
+      <data name="Category" value="Navigator"/>
+      <data name="Category" value="Security"/>
+      <data name="Category" value="Administrators"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      To monitor how Impala data is being used within your organization, ensure that your Impala authorization and
+      authentication policies are effective, and detect attempts at intrusion or unauthorized access to Impala
+      data, you can use the auditing feature in Impala 1.2.1 and higher:
+    </p>
+
+    <ul>
+      <li>
+        Enable auditing by including the option <codeph>-audit_event_log_dir=<varname>directory_path</varname></codeph>
+        in your <cmdname>impalad</cmdname> startup options for a cluster not managed by Cloudera Manager, or
+        <xref audience="integrated" href="cn_iu_audit_log.xml#xd_583c10bfdbd326ba--6eed2fb8-14349d04bee--7d6f/section_v25_lmy_bn">configuring Impala Daemon logging in Cloudera Manager</xref><xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/cn_iu_service_audit.html" scope="external" format="html">configuring Impala Daemon logging in Cloudera Manager</xref>.
+        The log directory must be a local directory on the
+        server, not an HDFS directory.
+      </li>
+
+      <li>
+        Decide how many queries will be represented in each log file. By default, Impala starts a new log file
+        every 5000 queries. To specify a different number, <ph
+          audience="standalone"
+          >include
+        the option <codeph>-max_audit_event_log_file_size=<varname>number_of_queries</varname></codeph> in the
+        <cmdname>impalad</cmdname> startup
+        options</ph><xref
+          href="cn_iu_audit_log.xml#xd_583c10bfdbd326ba--6eed2fb8-14349d04bee--7d6f/section_v25_lmy_bn"
+            audience="integrated"
+            >configure
+        Impala Daemon logging in Cloudera Manager</xref>.
+        See the example startup options after this list.
+      </li>
+
+      <li> Configure Cloudera Navigator to collect and consolidate the audit
+        logs from all the hosts in the cluster. </li>
+
+      <li>
+        Use Cloudera Navigator or Cloudera Manager to filter, visualize, and produce reports based on the audit
+        data. (The Impala auditing feature works with Cloudera Manager 4.7 to 5.1 and Cloudera Navigator 2.1 and
+        higher.) Check the audit data to ensure that all activity is authorized and detect attempts at
+        unauthorized access.
+      </li>
+    </ul>
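+
+    <p>
+      As a minimal sketch of the startup options described above (for a cluster not managed by
+      Cloudera Manager), the following line shows how the auditing options might look. The directory
+      path is only an illustration; substitute a local directory that exists on each host, and adjust
+      the file size limit as needed:
+    </p>
+
+<codeblock>-audit_event_log_dir=/var/log/impala/audit -max_audit_event_log_file_size=5000</codeblock>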
+
+    <p outputclass="toc inpage"/>
+  </conbody>
+
+  <concept id="auditing_performance">
+
+    <title>Durability and Performance Considerations for Impala Auditing</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Performance"/>
+    </metadata>
+  </prolog>
+
+    <conbody>
+
+      <p>
+        The auditing feature only imposes performance overhead while auditing is enabled.
+      </p>
+
+      <p>
+        Because any Impala host can process a query, enable auditing on all hosts where the
+        <ph audience="standalone"><cmdname>impalad</cmdname> daemon</ph>
+        <ph audience="integrated">Impala Daemon role</ph> runs. Each host stores its own log
+        files, in a directory in the local filesystem. The log data is periodically flushed to disk (through an
+        <codeph>fsync()</codeph> system call) to avoid loss of audit data in case of a crash.
+      </p>
+
+      <p> The runtime overhead of auditing applies to whichever host serves as the coordinator for the query, that is, the host you connect to when you issue the query. This might be the same host for all queries, or different applications or users might connect to and issue queries through different hosts. </p>
+
+      <p> To avoid excessive I/O overhead on busy coordinator hosts, Impala syncs the audit log data (using the <codeph>fsync()</codeph> system call) periodically rather than after every query. Currently, the <codeph>fsync()</codeph> calls are issued at a fixed interval, every 5 seconds. </p>
+
+      <p>
+        By default, Impala avoids losing any audit log data in the case of an error during a logging operation
+        (such as a disk full error), by immediately shutting down
+        <cmdname audience="standalone">impalad</cmdname><ph audience="integrated">the Impala
+        Daemon role</ph> on the host where the auditing problem occurred.
+        <ph audience="standalone">You can override this setting by specifying the option
+        <codeph>-abort_on_failed_audit_event=false</codeph> in the <cmdname>impalad</cmdname> startup options.</ph>
+      </p>
+    </conbody>
+  </concept>
+
+  <concept id="auditing_format">
+
+    <title>Format of the Audit Log Files</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Logs"/>
+    </metadata>
+  </prolog>
+
+    <conbody>
+
+      <p> The audit log files represent the query information in JSON format, one query per line. Typically, rather than looking at the log files themselves, you use the Cloudera Navigator product to consolidate the log data from all Impala hosts and filter and visualize the results in useful ways. (If you do examine the raw log data, you might run the files through a JSON pretty-printer first.) </p>
+
+      <p>
+        All the information about schema objects accessed by the query is encoded in a single nested record on the
+        same line. For example, the audit log for an <codeph>INSERT ... SELECT</codeph> statement records that a
+        select operation occurs on the source table and an insert operation occurs on the destination table. The
+        audit log for a query against a view records the base table accessed by the view, or multiple base tables
+        in the case of a view that includes a join query. Every Impala operation that corresponds to a SQL
+        statement is recorded in the audit logs, whether the operation succeeds or fails. Impala records more
+        information for a successful operation than for a failed one, because an unauthorized query is stopped
+        immediately, before all the query planning is completed.
+      </p>
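+
+      <p>
+        For illustration, consider a hypothetical statement such as the following (the table names are
+        placeholders). Its single audit log entry covers both the read side and the write side of the
+        operation:
+      </p>
+
+<codeblock>-- The audit entry for this statement records an INSERT operation on
+-- SALES_ARCHIVE and a SELECT operation on SALES, all within one nested
+-- record on a single line of the audit log.
+INSERT INTO sales_archive SELECT * FROM sales WHERE year = 2014;
+</codeblock>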
+
+<!-- Opportunity to conref at the phrase level here... the content of this paragraph is the same as part
+     of a list bullet earlier on. -->
+
+      <p>
+        The information logged for each query includes:
+      </p>
+
+      <ul>
+        <li>
+          Client session state:
+          <ul>
+            <li>
+              Session ID
+            </li>
+
+            <li>
+              User name
+            </li>
+
+            <li>
+              Network address of the client connection
+            </li>
+          </ul>
+        </li>
+
+        <li>
+          SQL statement details:
+          <ul>
+            <li>
+              Query ID
+            </li>
+
+            <li>
+              Statement Type - DML, DDL, and so on
+            </li>
+
+            <li>
+              SQL statement text
+            </li>
+
+            <li>
+              Execution start time, in local time
+            </li>
+
+            <li>
+              Execution Status - Details on any errors that were encountered
+            </li>
+
+            <li>
+              Target Catalog Objects:
+              <ul>
+                <li>
+                  Object Type - Table, View, or Database
+                </li>
+
+                <li>
+                  Fully qualified object name
+                </li>
+
+                <li>
+                  Privilege - How the object is being used (<codeph>SELECT</codeph>, <codeph>INSERT</codeph>,
+                  <codeph>CREATE</codeph>, and so on)
+                </li>
+              </ul>
+            </li>
+          </ul>
+        </li>
+      </ul>
+
+<!-- Delegating actual examples to the Cloudera Navigator doc for the moment.
+<p>
+Here is an excerpt from a sample audit log file:
+</p>
+<codeblock></codeblock>
+-->
+    </conbody>
+  </concept>
+
+  <concept id="auditing_exceptions">
+
+    <title>Which Operations Are Audited</title>
+
+    <conbody>
+
+      <p>
+        The kinds of SQL queries represented in the audit log are:
+      </p>
+
+      <ul>
+        <li>
+          Queries that are prevented due to lack of authorization.
+        </li>
+
+        <li>
+          Queries that Impala can analyze and parse to determine that they are authorized. The audit data is
+          recorded immediately after Impala finishes its analysis, before the query is actually executed.
+        </li>
+      </ul>
+
+      <p>
+        The audit log does not contain entries for queries that could not be parsed and analyzed. For example, a
+        query that fails due to a syntax error is not recorded in the audit log. The audit log also does not
+        contain queries that fail due to a reference to a table that does not exist, if you would be authorized to
+        access the table if it did exist.
+      </p>
+
+      <p>
+        Certain statements in the <cmdname>impala-shell</cmdname> interpreter, such as <codeph>CONNECT</codeph>,
+        <codeph rev="1.4.0">SUMMARY</codeph>, <codeph>PROFILE</codeph>, <codeph>SET</codeph>, and
+        <codeph>QUIT</codeph>, do not correspond to actual SQL queries, and these statements are not reflected in
+        the audit log.
+      </p>
+    </conbody>
+  </concept>
+
+  <concept id="auditing_reviewing">
+
+    <title>Reviewing the Audit Logs</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Logs"/>
+    </metadata>
+  </prolog>
+
+    <conbody>
+
+      <p>
+        You typically do not review the audit logs in raw form. The Cloudera Manager Agent periodically transfers
+        the log information into a back-end database where it can be examined in consolidated form. See
+        <ph audience="standalone">the <xref href="http://www.cloudera.com/content/cloudera-content/cloudera-docs/Navigator/latest/Cloudera-Navigator-Installation-and-User-Guide/Cloudera-Navigator-Installation-and-User-Guide.html"
+            scope="external" format="html">Cloudera Navigator documentation</xref> for details</ph>
+            <xref href="cn_iu_audits.xml#cn_topic_7" audience="integrated" />.
+      </p>
+    </conbody>
+  </concept>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_authentication.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_authentication.xml b/docs/topics/impala_authentication.xml
new file mode 100644
index 0000000..7200e5f
--- /dev/null
+++ b/docs/topics/impala_authentication.xml
@@ -0,0 +1,39 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="authentication">
+
+  <title>Impala Authentication</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Security"/>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Authentication"/>
+      <data name="Category" value="Administrators"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      Authentication is the mechanism to ensure that only specified hosts and users can connect to Impala. It also
+      verifies that when clients connect to Impala, they are connected to a legitimate server. This feature
+      prevents spoofing such as <term>impersonation</term> (setting up a phony client system with the same account
+      and group names as a legitimate user) and <term>man-in-the-middle attacks</term> (intercepting application
+      requests before they reach Impala and eavesdropping on sensitive information in the requests or the results).
+    </p>
+
+    <p>
+      Impala supports authentication using either Kerberos or LDAP.
+    </p>
+
+    <note conref="../shared/impala_common.xml#common/authentication_vs_authorization"/>
+
+    <p outputclass="toc"/>
+
+    <p>
+      Once you are finished setting up authentication, move on to authorization, which involves specifying what
+      databases, tables, HDFS directories, and so on can be accessed by particular users when they connect through
+      Impala. See <xref href="impala_authorization.xml#authorization"/> for details.
+    </p>
+  </conbody>
+</concept>
