http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_txtfile.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_txtfile.xml b/docs/topics/impala_txtfile.xml index ec8c059..543e2ff 100644 --- a/docs/topics/impala_txtfile.xml +++ b/docs/topics/impala_txtfile.xml @@ -4,7 +4,15 @@ <title>Using Text Data Files with Impala Tables</title> <titlealts audience="PDF"><navtitle>Text Data Files</navtitle></titlealts> - + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="File Formats"/> + <data name="Category" value="Tables"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> <conbody> @@ -15,10 +23,790 @@ text files, such as CSV or TSV with commas or tabs for delimiters. </p> - + <p> + Text files are also very flexible in their column definitions. For example, a text file could have more + fields than the Impala table, and those extra fields are ignored during queries; or it could have fewer + fields than the Impala table, and those missing fields are treated as <codeph>NULL</codeph> values in + queries. You could have fields that were treated as numbers or timestamps in a table, then use <codeph>ALTER + TABLE ... REPLACE COLUMNS</codeph> to switch them to strings, or the reverse. + </p> + + <table> + <title>Text Format Support in Impala</title> + <tgroup cols="5"> + <colspec colname="1" colwidth="10*"/> + <colspec colname="2" colwidth="10*"/> + <colspec colname="3" colwidth="20*"/> + <colspec colname="4" colwidth="30*"/> + <colspec colname="5" colwidth="30*"/> + <thead> + <row> + <entry> + File Type + </entry> + <entry> + Format + </entry> + <entry> + Compression Codecs + </entry> + <entry> + Impala Can CREATE? + </entry> + <entry> + Impala Can INSERT? + </entry> + </row> + </thead> + <tbody> + <row conref="impala_file_formats.xml#file_formats/txtfile_support"> + <entry/> + </row> + </tbody> + </tgroup> + </table> + + <p outputclass="toc inpage"/> + + </conbody> + + <concept id="text_performance"> + + <title>Query Performance for Impala Text Tables</title> + <prolog> + <metadata> + <data name="Category" value="Performance"/> + </metadata> + </prolog> + + <conbody> + + <p> + Data stored in text format is relatively bulky, and not as efficient to query as binary formats such as + Parquet. You typically use text tables with Impala if that is the format you receive the data and you do + not have control over that process, or if you are a relatively new Hadoop user and not familiar with + techniques to generate files in other formats. (Because the default format for <codeph>CREATE + TABLE</codeph> is text, you might create your first Impala tables as text without giving performance much + thought.) Either way, look for opportunities to use more efficient file formats for the tables used in your + most performance-critical queries. + </p> + + <p> + For frequently queried data, you might load the original text data files into one Impala table, then use an + <codeph>INSERT</codeph> statement to transfer the data to another table that uses the Parquet file format; + the data is converted automatically as it is stored in the destination table. + </p> + + <p> + For more compact data, consider using LZO compression for the text files. 
LZO is the only compression codec + that Impala supports for text data, because the <q>splittable</q> nature of LZO data files lets different + nodes work on different parts of the same file in parallel. See <xref href="impala_txtfile.xml#lzo"/> for + details. + </p> + + <p rev="2.0.0"> + In Impala 2.0 and later, you can also use text data compressed in the gzip, bzip2, or Snappy formats. + Because these compressed formats are not <q>splittable</q> in the way that LZO is, there is less + opportunity for Impala to parallelize queries on them. Therefore, use these types of compressed data only + for convenience if that is the format in which you receive the data. Prefer to use LZO compression for text + data if you have the choice, or convert the data to Parquet using an <codeph>INSERT ... SELECT</codeph> + statement to copy the original data into a Parquet table. + </p> + + <note rev="2.2.0"> + <p> + Impala supports bzip files created by the <codeph>bzip2</codeph> command, but not bzip files with + multiple streams created by the <codeph>pbzip2</codeph> command. Impala decodes only the data from the + first part of such files, leading to incomplete results. + </p> + + <p> + The maximum size that Impala can accommodate for an individual bzip file is 1 GB (after uncompression). + </p> + </note> + + <p conref="../shared/impala_common.xml#common/s3_block_splitting"/> + + </conbody> + + </concept> + + <concept id="text_ddl"> + + <title>Creating Text Tables</title> + + <conbody> + + <p> + <b>To create a table using text data files:</b> + </p> + + <p> + If the exact format of the text data files (such as the delimiter character) is not significant, use the + <codeph>CREATE TABLE</codeph> statement with no extra clauses at the end to create a text-format table. For + example: + </p> + +<codeblock>create table my_table(id int, s string, n int, t timestamp, b boolean); +</codeblock> + + <p> + The data files created by any <codeph>INSERT</codeph> statements will use the Ctrl-A character (hex 01) as + a separator between each column value. + </p> + + <p> + A common use case is to import existing text files into an Impala table. The syntax is more verbose; the + significant part is the <codeph>FIELDS TERMINATED BY</codeph> clause, which must be preceded by the + <codeph>ROW FORMAT DELIMITED</codeph> clause. The statement can end with a <codeph>STORED AS + TEXTFILE</codeph> clause, but that clause is optional because text format tables are the default. For + example: + </p> + +<codeblock>create table csv(id int, s string, n int, t timestamp, b boolean) + row format delimited + <ph id="csv">fields terminated by ',';</ph> + +create table tsv(id int, s string, n int, t timestamp, b boolean) + row format delimited + <ph id="tsv">fields terminated by '\t';</ph> + +create table pipe_separated(id int, s string, n int, t timestamp, b boolean) + row format delimited + <ph id="psv">fields terminated by '|'</ph> + stored as textfile; +</codeblock> + + <p> + You can create tables with specific separator characters to import text files in familiar formats such as + CSV, TSV, or pipe-separated. You can also use these tables to produce output data files, by copying data + into them through the <codeph>INSERT ... SELECT</codeph> syntax and then extracting the data files from the + Impala data directory. 
+ </p> + + <p rev="1.3.1"> + In Impala 1.3.1 and higher, you can specify a delimiter character <codeph>'\</codeph><codeph>0'</codeph> to + use the ASCII 0 (<codeph>nul</codeph>) character for text tables: + </p> + +<codeblock rev="1.3.1">create table nul_separated(id int, s string, n int, t timestamp, b boolean) + row format delimited + fields terminated by '\0' + stored as textfile; +</codeblock> + + <note> + <p> + Do not surround string values with quotation marks in text data files that you construct. If you need to + include the separator character inside a field value, for example to put a string value with a comma + inside a CSV-format data file, specify an escape character on the <codeph>CREATE TABLE</codeph> statement + with the <codeph>ESCAPED BY</codeph> clause, and insert that character immediately before any separator + characters that need escaping. + </p> + </note> + +<!-- + <p> + In the <cmdname>impala-shell</cmdname> interpreter, issue a command similar to: + </p> + +<codeblock>create table textfile_table (<varname>column_specs</varname>) stored as textfile; +/* If the STORED AS clause is omitted, the default is a TEXTFILE with hex 01 characters as the delimiter. */ +create table default_table (<varname>column_specs</varname>); +/* Some optional clauses in the CREATE TABLE statement apply only to Text tables. */ +create table csv_table (<varname>column_specs</varname>) row format delimited fields terminated by ','; +create table tsv_table (<varname>column_specs</varname>) row format delimited fields terminated by '\t'; +create table dos_table (<varname>column_specs</varname>) lines terminated by '\r';</codeblock> +--> + + <p> + Issue a <codeph>DESCRIBE FORMATTED <varname>table_name</varname></codeph> statement to see the details of + how each table is represented internally in Impala. + </p> + + <p conref="../shared/impala_common.xml#common/complex_types_unsupported_filetype"/> + + </conbody> + + </concept> + + <concept id="text_data_files"> + + <title>Data Files for Text Tables</title> + + <conbody> + + <p> + When Impala queries a table with data in text format, it consults all the data files in the data directory + for that table, with some exceptions: + </p> + + <ul rev="2.2.0"> + <li> + <p> + Impala ignores any hidden files, that is, files whose names start with a dot or an underscore. + </p> + </li> + + <li> + <p conref="../shared/impala_common.xml#common/ignore_file_extensions"/> + </li> + + <li> +<!-- Copied and slightly adapted text from later on in this same file. Turn into a conref. --> + <p> + Impala uses suffixes to recognize when text data files are compressed text. For Impala to recognize the + compressed text files, they must have the appropriate file extension corresponding to the compression + codec, either <codeph>.gz</codeph>, <codeph>.bz2</codeph>, or <codeph>.snappy</codeph>. The extensions + can be in uppercase or lowercase. + </p> + </li> + + <li> + Otherwise, the file names are not significant. When you put files into an HDFS directory through ETL + jobs, or point Impala to an existing HDFS directory with the <codeph>CREATE EXTERNAL TABLE</codeph> + statement, or move data files under external control with the <codeph>LOAD DATA</codeph> statement, + Impala preserves the original filenames. + </li> + </ul> + + <p> + Filenames for data produced through Impala <codeph>INSERT</codeph> statements are given unique names to + avoid filename conflicts. + </p> + + <p> + An <codeph>INSERT ... 
SELECT</codeph> statement produces one data file from each node that processes the + <codeph>SELECT</codeph> part of the statement. An <codeph>INSERT ... VALUES</codeph> statement produces a + separate data file for each statement; because Impala is more efficient querying a small number of huge + files than a large number of tiny files, the <codeph>INSERT ... VALUES</codeph> syntax is not recommended + for loading a substantial volume of data. If you find yourself with a table that is inefficient due to too + many small data files, reorganize the data into a few large files by doing <codeph>INSERT ... + SELECT</codeph> to transfer the data to a new table. + </p> + + <p> + <b>Special values within text data files:</b> + </p> + + <ul> + <li rev="1.4.0"> + <p> + Impala recognizes the literal strings <codeph>inf</codeph> for infinity and <codeph>nan</codeph> for + <q>Not a Number</q>, for <codeph>FLOAT</codeph> and <codeph>DOUBLE</codeph> columns. + </p> + </li> + + <li> + <p> + Impala recognizes the literal string <codeph>\N</codeph> to represent <codeph>NULL</codeph>. When using + Sqoop, specify the options <codeph>--null-non-string</codeph> and <codeph>--null-string</codeph> to + ensure all <codeph>NULL</codeph> values are represented correctly in the Sqoop output files. By default, + Sqoop writes <codeph>NULL</codeph> values using the string <codeph>null</codeph>, which causes a + conversion error when such rows are evaluated by Impala. (A workaround for existing tables and data files + is to change the table properties through <codeph>ALTER TABLE <varname>name</varname> SET + TBLPROPERTIES("serialization.null.format"="null")</codeph>.) + </p> + </li> + + <li> + <p conref="../shared/impala_common.xml#common/skip_header_lines"/> + </li> + </ul> + + </conbody> + + </concept> + + <concept id="text_etl"> + + <title>Loading Data into Impala Text Tables</title> + <prolog> + <metadata> + <data name="Category" value="ETL"/> + <data name="Category" value="Ingest"/> + </metadata> + </prolog> + + <conbody> + + <p> + To load an existing text file into an Impala text table, use the <codeph>LOAD DATA</codeph> statement and + specify the path of the file in HDFS. That file is moved into the appropriate Impala data directory. + </p> + + <p> + To load multiple existing text files into an Impala text table, use the <codeph>LOAD DATA</codeph> + statement and specify the HDFS path of the directory containing the files. All non-hidden files are moved + into the appropriate Impala data directory. + </p> + + <p> + To convert data to text from any other file format supported by Impala, use a SQL statement such as: + </p> + +<codeblock>-- Text table with default delimiter, the hex 01 character. +CREATE TABLE text_table AS SELECT * FROM other_file_format_table; + +-- Text table with user-specified delimiter. Currently, you cannot specify +-- the delimiter as part of CREATE TABLE LIKE or CREATE TABLE AS SELECT. +-- But you can change an existing text table to have a different delimiter. +CREATE TABLE csv LIKE other_file_format_table; +ALTER TABLE csv SET SERDEPROPERTIES ('serialization.format'=',', 'field.delim'=','); +INSERT INTO csv SELECT * FROM other_file_format_table;</codeblock> + + <p> + This can be a useful technique to see how Impala represents special values within a text-format data file. 
+      Use the <codeph>DESCRIBE FORMATTED</codeph> statement to see the HDFS directory where the data files are
+      stored, then use Linux commands such as <codeph>hdfs dfs -ls <varname>hdfs_directory</varname></codeph> and
+      <codeph>hdfs dfs -cat <varname>hdfs_file</varname></codeph> to display the contents of an Impala-created
+      text file.
+    </p>
+
+    <p>
+      To create a few rows in a text table for test purposes, you can use the <codeph>INSERT ... VALUES</codeph>
+      syntax:
+    </p>
+
+<codeblock>INSERT INTO <varname>text_table</varname> VALUES ('string_literal',100,hex('hello world'));</codeblock>
+
+    <note>
+      Because Impala and the HDFS infrastructure are optimized for multi-megabyte files, avoid the <codeph>INSERT
+      ... VALUES</codeph> notation when you are inserting many rows. Each <codeph>INSERT ... VALUES</codeph>
+      statement produces a new tiny file, leading to fragmentation and reduced performance. When creating any
+      substantial volume of new data, use one of the bulk loading techniques such as <codeph>LOAD DATA</codeph>
+      or <codeph>INSERT ... SELECT</codeph>. Or, <xref href="impala_hbase.xml#impala_hbase">use an HBase
+      table</xref> for single-row <codeph>INSERT</codeph> operations, because HBase tables are not subject to the
+      same fragmentation issues as tables stored on HDFS.
+    </note>
+
+    <p>
+      When you create a text file for use with an Impala text table, specify <codeph>\N</codeph> to represent a
+      <codeph>NULL</codeph> value. For the differences between <codeph>NULL</codeph> and empty strings, see
+      <xref href="impala_literals.xml#null"/>.
+    </p>
+
+    <p>
+      If a text file has fewer fields than the columns in the corresponding Impala table, the columns that have
+      no corresponding fields are set to <codeph>NULL</codeph> when the data in that file is read by an Impala
+      query.
+    </p>
+
+    <p>
+      If a text file has more fields than the columns in the corresponding Impala table, the extra fields are
+      ignored when the data in that file is read by an Impala query.
+    </p>
+
+    <p>
+      You can also use manual HDFS operations such as <codeph>hdfs dfs -put</codeph> or <codeph>hdfs dfs
+      -cp</codeph> to put data files in the data directory for an Impala table. When you copy or move new data
+      files into the HDFS directory for the Impala table, issue a <codeph>REFRESH
+      <varname>table_name</varname></codeph> statement in <cmdname>impala-shell</cmdname> before issuing the next
+      query against that table, to make Impala recognize the newly added files.
+    </p>
+
+  </conbody>
+
+  </concept>
+
+  <concept id="lzo">
+
+    <title>Using LZO-Compressed Text Files</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="LZO"/>
+      <data name="Category" value="Compression"/>
+    </metadata>
+  </prolog>
+
+    <conbody>
+
+    <p>
+      <indexterm audience="Cloudera">LZO support in Impala</indexterm>
+
+      <indexterm audience="Cloudera">compression</indexterm>
+      Impala supports using text data files that employ LZO compression. Cloudera recommends compressing
+      text data files when practical. Impala queries are usually I/O-bound; reducing the amount of data read from
+      disk typically speeds up a query, despite the extra CPU work to uncompress the data in memory.
+    </p>
+
+    <p>
+      LZO-compressed text files are preferable to text files compressed by other codecs, because
+      LZO-compressed files are <q>splittable</q>, meaning that different portions of a file can be uncompressed
+      and processed independently by different nodes.
+    </p>
+
+    <p>
+      Impala does not currently support writing LZO-compressed text files.
+ </p> + + <p> + Because Impala can query LZO-compressed files but currently cannot write them, you use Hive to do the + initial <codeph>CREATE TABLE</codeph> and load the data, then switch back to Impala to run queries. For + instructions on setting up LZO compression for Hive <codeph>CREATE TABLE</codeph> and + <codeph>INSERT</codeph> statements, see + <xref href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO" scope="external" format="html">the + LZO page on the Hive wiki</xref>. Once you have created an LZO text table, you can also manually add + LZO-compressed text files to it, produced by the + <xref href="http://www.lzop.org/" scope="external" format="html"> <cmdname>lzop</cmdname></xref> command + or similar method. + </p> + + <section id="lzo_setup"> + + <title>Preparing to Use LZO-Compressed Text Files</title> + + <p> + Before using LZO-compressed tables in Impala, do the following one-time setup for each machine in the + cluster. Install the necessary packages using either the Cloudera public repository, a private repository + you establish, or by using packages. You must do these steps manually, whether or not the cluster is + managed by the Cloudera Manager product. + </p> + + <ol> + <li> + <b>Prepare your systems to work with LZO using Cloudera repositories:</b> + <p> + <b>On systems managed by Cloudera Manager using parcels:</b> + </p> + + <p> + See the setup instructions for the LZO parcel in the Cloudera Manager documentation for + <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_install_gpl_extras.html" scope="external" format="html">Cloudera + Manager 5</xref>. + </p> + + <p> + <b>On systems managed by Cloudera Manager using packages, or not managed by Cloudera Manager:</b> + </p> + + <p> + Download and install the appropriate file to each machine on which you intend to use LZO with Impala. + These files all come from the Cloudera + <xref href="https://archive.cloudera.com/gplextras/redhat/5/x86_64/gplextras/" scope="external" format="html">GPL + extras</xref> download site. Install the: + </p> + <ul> + <li> + <xref href="https://archive.cloudera.com/gplextras/redhat/5/x86_64/gplextras/cloudera-gplextras4.repo" scope="external" format="repo">Red + Hat 5 repo file</xref> to <filepath>/etc/yum.repos.d/</filepath>. + </li> + + <li> + <xref href="https://archive.cloudera.com/gplextras/redhat/6/x86_64/gplextras/cloudera-gplextras4.repo" scope="external" format="repo">Red + Hat 6 repo file</xref> to <filepath>/etc/yum.repos.d/</filepath>. + </li> + + <li> + <xref href="https://archive.cloudera.com/gplextras/sles/11/x86_64/gplextras/cloudera-gplextras4.repo" scope="external" format="repo">SUSE + repo file</xref> to <filepath>/etc/zypp/repos.d/</filepath>. + </li> + + <li> + <xref href="https://archive.cloudera.com/gplextras/ubuntu/lucid/amd64/gplextras/cloudera.list" scope="external" format="list">Ubuntu + 10.04 list file</xref> to <filepath>/etc/apt/sources.list.d/</filepath>. + </li> + + <li> + <xref href="https://archive.cloudera.com/gplextras/ubuntu/precise/amd64/gplextras/cloudera.list" scope="external" format="list">Ubuntu + 12.04 list file</xref> to <filepath>/etc/apt/sources.list.d/</filepath>. + </li> + + <li> + <xref href="https://archive.cloudera.com/gplextras/debian/squeeze/amd64/gplextras/cloudera.list" scope="external" format="list">Debian + list file</xref> to <filepath>/etc/apt/sources.list.d/</filepath>. 
</li>
+            </ul>
+          </li>
+
+          <li>
+            <b>Configure Impala to use LZO:</b>
+            <p>
+              Use <b>one</b> of the following sets of commands to refresh your package management system's
+              repository information, install the base LZO support for Hadoop, and install the LZO support for
+              Impala.
+            </p>
+
+            <note rev="1.2.0">
+              <p rev="1.2.0">
+                The name of the Hadoop LZO package changed between CDH 4 and CDH 5. In CDH 4, the package name was
+                <codeph>hadoop-lzo-cdh4</codeph>. In CDH 5 and higher, the package name is <codeph>hadoop-lzo</codeph>.
+              </p>
+            </note>
+
+            <p>
+              <b>For RHEL/CentOS systems:</b>
+            </p>
+<codeblock>$ sudo yum update
+$ sudo yum install hadoop-lzo
+$ sudo yum install impala-lzo</codeblock>
+            <p>
+              <b>For SUSE systems:</b>
+            </p>
+<codeblock rev="1.2">$ sudo zypper update
+$ sudo zypper install hadoop-lzo
+$ sudo zypper install impala-lzo</codeblock>
+            <p>
+              <b>For Debian/Ubuntu systems:</b>
+            </p>
+<codeblock>$ sudo apt-get update
+$ sudo apt-get install hadoop-lzo
+$ sudo apt-get install impala-lzo</codeblock>
+            <note>
+              <p>
+                The level of the <codeph>impala-lzo</codeph> package is closely tied to the version of Impala
+                you use. Any time you upgrade Impala, re-do the installation command for
+                <codeph>impala-lzo</codeph> on each applicable machine to make sure you have the appropriate
+                version of that package.
+              </p>
+            </note>
+          </li>
+
+          <li>
+            For <codeph>core-site.xml</codeph> on the client <b>and</b> server (that is, in the configuration
+            directories for both Impala and Hadoop), append <codeph>com.hadoop.compression.lzo.LzopCodec</codeph>
+            to the comma-separated list of codecs. For example:
+<codeblock><property>
+  <name>io.compression.codecs</name>
+  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,
+org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,
+org.apache.hadoop.io.compress.SnappyCodec,com.hadoop.compression.lzo.LzopCodec</value>
+</property></codeblock>
+            <note>
+              <p>
+                If this is the first time you have edited the Hadoop <filepath>core-site.xml</filepath> file, note
+                that the <filepath>/etc/hadoop/conf</filepath> directory is typically a symbolic link, so the
+                canonical <filepath>core-site.xml</filepath> might reside in a different directory:
+              </p>
+<codeblock>$ ls -l /etc/hadoop
+total 8
+lrwxrwxrwx. 1 root root   29 Feb 26  2013 conf -> /etc/alternatives/hadoop-conf
+lrwxrwxrwx. 1 root root   10 Feb 26  2013 conf.dist -> conf.empty
+drwxr-xr-x. 2 root root 4096 Feb 26  2013 conf.empty
+drwxr-xr-x. 2 root root 4096 Oct 28 15:46 conf.pseudo</codeblock>
+              <p>
+                If the <codeph>io.compression.codecs</codeph> property is missing from
+                <filepath>core-site.xml</filepath>, only add <codeph>com.hadoop.compression.lzo.LzopCodec</codeph>
+                to the new property value, not all the names from the preceding example.
+              </p>
+            </note>
+          </li>
+
+          <li>
+            <!-- To do:
+            Link to CM or other doc where that procedure is explained.
+            Run through the procedure in CM and cite the relevant safety valves to put the XML into.
+            -->
+            Restart the MapReduce and Impala services.
+ </li> + </ol> + + </section> + + <section id="lzo_create_table"> + + <title>Creating LZO Compressed Text Tables</title> + + <p> + A table containing LZO-compressed text files must be created in Hive with the following storage clause: + </p> + +<codeblock>STORED AS + INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat' + OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'</codeblock> + +<!-- + <p> + In Hive, when writing LZO compressed text tables, you must include the following specification: + </p> + +<codeblock>hive> SET hive.exec.compress.output=true; +hive> SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;</codeblock> +--> + + <p> + Also, certain Hive settings need to be in effect. For example: + </p> + +<codeblock>hive> SET mapreduce.output.fileoutputformat.compress=true; +hive> SET hive.exec.compress.output=true; +hive> SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec; +hive> CREATE TABLE lzo_t (s string) STORED AS + > INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat' + > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'; +hive> INSERT INTO TABLE lzo_t SELECT col1, col2 FROM uncompressed_text_table;</codeblock> + + <p> + Once you have created LZO-compressed text tables, you can convert data stored in other tables (regardless + of file format) by using the <codeph>INSERT ... SELECT</codeph> statement in Hive. + </p> + + <p> + Files in an LZO-compressed table must use the <codeph>.lzo</codeph> extension. Examine the files in the + HDFS data directory after doing the <codeph>INSERT</codeph> in Hive, to make sure the files have the + right extension. If the required settings are not in place, you end up with regular uncompressed files, + and Impala cannot access the table because it finds data files with the wrong (uncompressed) format. + </p> + + <p> + After loading data into an LZO-compressed text table, index the files so that they can be split. You + index the files by running a Java class, + <codeph>com.hadoop.compression.lzo.DistributedLzoIndexer</codeph>, through the Linux command line. This + Java class is included in the <codeph>hadoop-lzo</codeph> package. + </p> + + <p> + Run the indexer using a command like the following: + </p> + +<codeblock>$ hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar + com.hadoop.compression.lzo.DistributedLzoIndexer /hdfs_location_of_table/</codeblock> + + <note> + If the path of the JAR file in the preceding example is not recognized, do a <cmdname>find</cmdname> + command to locate <filepath>hadoop-lzo-*-gplextras.jar</filepath> and use that path. + </note> + + <p> + Indexed files have the same name as the file they index, with the <codeph>.index</codeph> extension. If + the data files are not indexed, Impala queries still work, but the queries read the data from remote + DataNodes, which is very inefficient. + </p> + + <!-- To do: + Here is the place to put some end-to-end examples once I have it + all working. Or at least the final step with Impala queries. + Have never actually gotten this part working yet due to mismatches + between the levels of Impala and LZO packages. + --> + + <p> + Once the LZO-compressed tables are created, and data is loaded and indexed, you can query them through + Impala. As always, the first time you start <cmdname>impala-shell</cmdname> after creating a table in + Hive, issue an <codeph>INVALIDATE METADATA</codeph> statement so that Impala recognizes the new table. 
+ (In Impala 1.2 and higher, you only have to run <codeph>INVALIDATE METADATA</codeph> on one node, rather + than on all the Impala nodes.) + </p> + + </section> + + </conbody> + + </concept> + + <concept rev="2.0.0" id="gzip"> + + <title>Using gzip, bzip2, or Snappy-Compressed Text Files</title> + <prolog> + <metadata> + <data name="Category" value="Snappy"/> + <data name="Category" value="Gzip"/> + <data name="Category" value="Compression"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">gzip support in Impala</indexterm> + + <indexterm audience="Cloudera">bzip2 support in Impala</indexterm> + + <indexterm audience="Cloudera">Snappy support in Impala</indexterm> + + <indexterm audience="Cloudera">compression</indexterm> + In Impala 2.0 and later, Impala supports using text data files that employ gzip, bzip2, or Snappy + compression. These compression types are primarily for convenience within an existing ETL pipeline rather + than maximum performance. Although it requires less I/O to read compressed text than the equivalent + uncompressed text, files compressed by these codecs are not <q>splittable</q> and therefore cannot take + full advantage of the Impala parallel query capability. + </p> + + <p> + As each bzip2- or Snappy-compressed text file is processed, the node doing the work reads the entire file + into memory and then decompresses it. Therefore, the node must have enough memory to hold both the + compressed and uncompressed data from the text file. The memory required to hold the uncompressed data is + difficult to estimate in advance, potentially causing problems on systems with low memory limits or with + resource management enabled. <ph rev="2.1.0">In Impala 2.1 and higher, this memory overhead is reduced for + gzip-compressed text files. The gzipped data is decompressed as it is read, rather than all at once.</ph> + </p> + +<!-- + <p> + Impala can work with LZO-compressed text files but not GZip-compressed text. + LZO-compressed files are <q>splittable</q>, meaning that different portions of a file + can be uncompressed and processed independently by different nodes. GZip-compressed + files are not splittable, making them unsuitable for Impala-style distributed queries. + </p> +--> + + <p> + To create a table to hold gzip, bzip2, or Snappy-compressed text, create a text table with no special + compression options. Specify the delimiter and escape character if required, using the <codeph>ROW + FORMAT</codeph> clause. + </p> + + <p> + Because Impala can query compressed text files but currently cannot write them, produce the compressed text + files outside Impala and use the <codeph>LOAD DATA</codeph> statement, manual HDFS commands to move them to + the appropriate Impala data directory. (Or, you can use <codeph>CREATE EXTERNAL TABLE</codeph> and point + the <codeph>LOCATION</codeph> attribute at a directory containing existing compressed text files.) + </p> + + <p> + For Impala to recognize the compressed text files, they must have the appropriate file extension + corresponding to the compression codec, either <codeph>.gz</codeph>, <codeph>.bz2</codeph>, or + <codeph>.snappy</codeph>. The extensions can be in uppercase or lowercase. 
+ </p> + + <p> + The following example shows how you can create a regular text table, put different kinds of compressed and + uncompressed files into it, and Impala automatically recognizes and decompresses each one based on their + file extensions: + </p> + +<codeblock>create table csv_compressed (a string, b string, c string) + row format delimited fields terminated by ","; + +insert into csv_compressed values + ('one - uncompressed', 'two - uncompressed', 'three - uncompressed'), + ('abc - uncompressed', 'xyz - uncompressed', '123 - uncompressed'); +...make equivalent .gz, .bz2, and .snappy files and load them into same table directory... + +select * from csv_compressed; ++--------------------+--------------------+----------------------+ +| a | b | c | ++--------------------+--------------------+----------------------+ +| one - snappy | two - snappy | three - snappy | +| one - uncompressed | two - uncompressed | three - uncompressed | +| abc - uncompressed | xyz - uncompressed | 123 - uncompressed | +| one - bz2 | two - bz2 | three - bz2 | +| abc - bz2 | xyz - bz2 | 123 - bz2 | +| one - gzip | two - gzip | three - gzip | +| abc - gzip | xyz - gzip | 123 - gzip | ++--------------------+--------------------+----------------------+ + +$ hdfs dfs -ls 'hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/'; +...truncated for readability... +75 hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/csv_compressed.snappy +79 hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/csv_compressed_bz2.csv.bz2 +80 hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/csv_compressed_gzip.csv.gz +116 hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/dd414df64d67d49b_data.0. +</codeblock> </conbody> </concept> + <concept audience="Cloudera" id="txtfile_data_types"> + + <title>Data Type Considerations for Text Tables</title> + + <conbody> + + <p></p> + + </conbody> + + </concept> +</concept>
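  <p>
    As a minimal sketch of the text-to-Parquet conversion described in the <xref
    href="impala_txtfile.xml#text_performance"/> section above (the table names and HDFS path here are
    hypothetical placeholders, not part of the original topic):
  </p>
<codeblock>-- Point an external text table at CSV data files that already exist in HDFS.
CREATE EXTERNAL TABLE csv_original (id INT, s STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/user/etl/csv_original';

-- Copy the data into a Parquet table for the performance-critical queries;
-- the data is converted to Parquet automatically during the INSERT.
CREATE TABLE csv_parquet LIKE csv_original STORED AS PARQUET;
INSERT INTO csv_parquet SELECT * FROM csv_original;</codeblock>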
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_udf.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_udf.xml b/docs/topics/impala_udf.xml index 53dd8eb..2d2f3b5 100644 --- a/docs/topics/impala_udf.xml +++ b/docs/topics/impala_udf.xml @@ -8,6 +8,8 @@ <data name="Category" value="Impala"/> <data name="Category" value="Impala Functions"/> <data name="Category" value="UDFs"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> </metadata> </prolog> @@ -169,9 +171,11 @@ select real_words(letters) from word_games;</codeblock> </li> <li> - The return type must be a <q>Writable</q> type such as <codeph>Text</codeph> or + Prior to CDH 5.7 / Impala 2.5, the return type must be a <q>Writable</q> type such as <codeph>Text</codeph> or <codeph>IntWritable</codeph>, rather than a Java primitive type such as <codeph>String</codeph> or - <codeph>int</codeph>. Otherwise, the UDF will return <codeph>NULL</codeph>. + <codeph>int</codeph>. Otherwise, the UDF returns <codeph>NULL</codeph>. + <ph rev="2.5.0">In CDH 5.7 / Impala 2.5 and higher, this restriction is lifted, and both + UDF arguments and return values can be Java primitive types.</ph> </li> <li> @@ -182,6 +186,12 @@ select real_words(letters) from word_games;</codeblock> Typically, a Java UDF will execute several times slower in Impala than the equivalent native UDF written in C++. </li> + <li rev="2.5.0 IMPALA-2843 CDH-39148"> + In CDH 5.7 / Impala 2.5 and higher, you can transparently call Hive Java UDFs through Impala, + or call Impala Java UDFs through Hive. This feature does not apply to built-in Hive functions. + Any Impala Java UDFs created with older versions must be re-created using new <codeph>CREATE FUNCTION</codeph> + syntax, without any signature for arguments or the return value. + </li> </ul> <p> @@ -254,6 +264,7 @@ select real_words(letters) from word_games;</codeblock> <codeph>WHERE</codeph> clause), directly on a column, and on the results of a string expression: </p> +<!-- To do: adapt for signatureless syntax per CDH-39148 / IMPALA-2843. --> <codeblock>[localhost:21000] > create database udfs; [localhost:21000] > use udfs; localhost:21000] > create function lower(string) returns string location '/user/hive/udfs/hive.jar' symbol='org.apache.hadoop.hive.ql.udf.UDFLower'; @@ -385,8 +396,9 @@ and other examples demonstrating this technique in <conbody> - <p> - To develop UDFs for Impala, download and install the <codeph>impala-udf-devel</codeph> package containing + <p rev="CDH-37080"> + To develop UDFs for Impala, download and install the <codeph>impala-udf-devel</codeph> package (RHEL-based + distributions) or <codeph>impala-udf-dev</codeph> (Ubuntu and Debian). This package contains header files, sample source, and build configuration files. </p> @@ -403,9 +415,10 @@ and other examples demonstrating this technique in <codeph>.repo</codeph> file for CDH 4 on RHEL 6</xref>. </li> - <li> + <li rev="CDH-37080"> Use the familiar <codeph>yum</codeph>, <codeph>zypper</codeph>, or <codeph>apt-get</codeph> commands - depending on your operating system, with <codeph>impala-udf-devel</codeph> for the package name. + depending on your operating system. For the package name, specify <codeph>impala-udf-devel</codeph> + (RHEL-based distributions) or <codeph>impala-udf-dev</codeph> (Ubuntu and Debian). 
</li> </ol> @@ -480,10 +493,12 @@ and other examples demonstrating this technique in <p> For the basic declarations needed to write a scalar UDF, see the header file - <filepath>udf-sample.h</filepath> within the sample build environment, which defines a simple function + <xref href="https://github.com/cloudera/impala-udf-samples/blob/master/udf-sample.h" scope="external" format="html"><filepath>udf-sample.h</filepath></xref> + within the sample build environment, which defines a simple function named <codeph>AddUdf()</codeph>: </p> +<!-- Downloadable version of this file: https://raw.githubusercontent.com/cloudera/impala-udf-samples/master/udf-sample.h --> <codeblock>#ifndef IMPALA_UDF_SAMPLE_UDF_H #define IMPALA_UDF_SAMPLE_UDF_H @@ -493,13 +508,15 @@ using namespace impala_udf; IntVal AddUdf(FunctionContext* context, const IntVal& arg1, const IntVal& arg2); -#endif</codeblock> +#endif +</codeblock> <p> For sample C++ code for a simple function named <codeph>AddUdf()</codeph>, see the source file <filepath>udf-sample.cc</filepath> within the sample build environment: </p> +<!-- Downloadable version of this file: https://raw.githubusercontent.com/cloudera/impala-udf-samples/master/udf-sample.cc --> <codeblock>#include "udf-sample.h" // In this sample we are declaring a UDF that adds two ints and returns an int. @@ -522,7 +539,7 @@ IntVal AddUdf(FunctionContext* context, const IntVal& arg1, const IntVal& Each value that a user-defined function can accept as an argument or return as a result value must map to a SQL data type that you could specify for a table column. </p> - + <p conref="../shared/impala_common.xml#common/udfs_no_complex_types"/> <p> @@ -921,10 +938,10 @@ within UDAs, you can return without specifying a value. </p> <p> - <draft-comment translate="no"> -Need an example to demonstrate exactly what tokens are used for init, merge, finalize in -this substitution. -</draft-comment> + <!-- To do: + Need an example to demonstrate exactly what tokens are used for init, merge, finalize in + this substitution. + --> For convenience, you can use a naming convention for the underlying functions and Impala automatically recognizes those entry points. Specify the <codeph>UPDATE_FN</codeph> clause, using an entry point name containing the string <codeph>update</codeph> or <codeph>Update</codeph>. When you omit the other @@ -943,56 +960,134 @@ this substitution. <filepath>uda-sample.h</filepath>: </p> - <p> - See this file online at: - <xref href="https://github.com/cloudera/impala-udf-samples/blob/master/uda-sample.cc" scope="external" format="html"/> - </p> + <p> See this file online at: <xref + href="https://github.com/cloudera/impala-udf-samples/blob/master/uda-sample.h" + scope="external" format="html" /></p> -<codeblock audience="Cloudera">#ifndef IMPALA_UDF_SAMPLE_UDA_H -#define IMPALA_UDF_SAMPLE_UDA_H +<codeblock audience="Cloudera">#ifndef SAMPLES_UDA_H +#define SAMPLES_UDA_H #include <impala_udf/udf.h> using namespace impala_udf; // This is an example of the COUNT aggregate function. 
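+// A UDA supplies Init, Update, Merge, and Finalize entry points: Init sets up the intermediate
+// value, Update folds in one input row at a time, Merge combines partial results from different
+// nodes, and Finalize produces the final return value.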
+// +// Usage: > create aggregate function my_count(int) returns bigint +// location '/user/cloudera/libudasample.so' update_fn='CountUpdate'; +// > select my_count(col) from tbl; + void CountInit(FunctionContext* context, BigIntVal* val); -void CountUpdate(FunctionContext* context, const AnyVal& input, BigIntVal* val); +void CountUpdate(FunctionContext* context, const IntVal& input, BigIntVal* val); void CountMerge(FunctionContext* context, const BigIntVal& src, BigIntVal* dst); BigIntVal CountFinalize(FunctionContext* context, const BigIntVal& val); + // This is an example of the AVG(double) aggregate function. This function needs to // maintain two pieces of state, the current sum and the count. We do this using -// the BufferVal intermediate type. When this UDA is registered, it would specify +// the StringVal intermediate type. When this UDA is registered, it would specify // 16 bytes (8 byte sum + 8 byte count) as the size for this buffer. -void AvgInit(FunctionContext* context, BufferVal* val); -void AvgUpdate(FunctionContext* context, const DoubleVal& input, BufferVal* val); -void AvgMerge(FunctionContext* context, const BufferVal& src, BufferVal* dst); -DoubleVal AvgFinalize(FunctionContext* context, const BufferVal& val); +// +// Usage: > create aggregate function my_avg(double) returns string +// location '/user/cloudera/libudasample.so' update_fn='AvgUpdate'; +// > select cast(my_avg(col) as double) from tbl; + +void AvgInit(FunctionContext* context, StringVal* val); +void AvgUpdate(FunctionContext* context, const DoubleVal& input, StringVal* val); +void AvgMerge(FunctionContext* context, const StringVal& src, StringVal* dst); +const StringVal AvgSerialize(FunctionContext* context, const StringVal& val); +StringVal AvgFinalize(FunctionContext* context, const StringVal& val); + // This is a sample of implementing the STRING_CONCAT aggregate function. -// Example: select string_concat(string_col, ",") from table +// +// Usage: > create aggregate function string_concat(string, string) returns string +// location '/user/cloudera/libudasample.so' update_fn='StringConcatUpdate'; +// > select string_concat(string_col, ",") from table; + void StringConcatInit(FunctionContext* context, StringVal* val); void StringConcatUpdate(FunctionContext* context, const StringVal& arg1, const StringVal& arg2, StringVal* val); void StringConcatMerge(FunctionContext* context, const StringVal& src, StringVal* dst); +const StringVal StringConcatSerialize(FunctionContext* context, const StringVal& val); StringVal StringConcatFinalize(FunctionContext* context, const StringVal& val); + +// This is a example of the variance aggregate function. +// +// Usage: > create aggregate function var(double) returns string +// location '/user/cloudera/libudasample.so' update_fn='VarianceUpdate'; +// > select cast(var(col) as double) from tbl; + +void VarianceInit(FunctionContext* context, StringVal* val); +void VarianceUpdate(FunctionContext* context, const DoubleVal& input, StringVal* val); +void VarianceMerge(FunctionContext* context, const StringVal& src, StringVal* dst); +const StringVal VarianceSerialize(FunctionContext* context, const StringVal& val); +StringVal VarianceFinalize(FunctionContext* context, const StringVal& val); + + +// An implementation of the Knuth online variance algorithm, which is also single pass and +// more numerically stable. 
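+// (The algorithm keeps a running count, mean, and sum of squared differences from the mean,
+// so the variance can be computed in a single pass without buffering the input values.)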
+// +// Usage: > create aggregate function knuth_var(double) returns string +// location '/user/cloudera/libudasample.so' update_fn='KnuthVarianceUpdate'; +// > select cast(knuth_var(col) as double) from tbl; + +void KnuthVarianceInit(FunctionContext* context, StringVal* val); +void KnuthVarianceUpdate(FunctionContext* context, const DoubleVal& input, StringVal* val); +void KnuthVarianceMerge(FunctionContext* context, const StringVal& src, StringVal* dst); +const StringVal KnuthVarianceSerialize(FunctionContext* context, const StringVal& val); +StringVal KnuthVarianceFinalize(FunctionContext* context, const StringVal& val); + + +// The different steps of the UDA are composable. In this case, we'the UDA will use the +// other steps from the Knuth variance computation. +// +// Usage: > create aggregate function stddev(double) returns string +// location '/user/cloudera/libudasample.so' update_fn='KnuthVarianceUpdate' +// finalize_fn="StdDevFinalize"; +// > select cast(stddev(col) as double) from tbl; + +StringVal StdDevFinalize(FunctionContext* context, const StringVal& val); + + +// Utility function for serialization to StringVal +template <typename T> +StringVal ToStringVal(FunctionContext* context, const T& val); + #endif</codeblock> <p> <filepath>uda-sample.cc</filepath>: </p> - <p> - See this file online at: - <xref href="https://github.com/cloudera/impala-udf-samples/blob/master/uda-sample.h" scope="external" format="html"/> + <p> See this file online at: <xref + href="https://github.com/cloudera/impala-udf-samples/blob/master/uda-sample.cc" + scope="external" format="html" /> </p> <codeblock audience="Cloudera">#include "uda-sample.h" #include <assert.h> +#include <sstream> using namespace impala_udf; +using namespace std; + +template <typename T> +StringVal ToStringVal(FunctionContext* context, const T& val) { + stringstream ss; + ss << val; + string str = ss.str(); + StringVal string_val(context, str.size()); + memcpy(string_val.ptr, str.c_str(), str.size()); + return string_val; +} + +template <> +StringVal ToStringVal<DoubleVal>(FunctionContext* context, const DoubleVal& val) { + if (val.is_null) return StringVal::null(); + return ToStringVal(context, val.val); +} // --------------------------------------------------------------------------- // This is a sample of implementing a COUNT aggregate function. @@ -1002,7 +1097,7 @@ void CountInit(FunctionContext* context, BigIntVal* val) { val->val = 0; } -void CountUpdate(FunctionContext* context, const AnyVal& input, BigIntVal* val) { +void CountUpdate(FunctionContext* context, const IntVal& input, BigIntVal* val) { if (input.is_null) return; ++val->val; } @@ -1016,61 +1111,99 @@ BigIntVal CountFinalize(FunctionContext* context, const BigIntVal& val) { } // --------------------------------------------------------------------------- -// This is a sample of implementing an AVG aggregate function. +// This is a sample of implementing a AVG aggregate function. 
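+// The intermediate state is an AvgStruct (8-byte sum plus 8-byte count) carried in the bytes
+// of a StringVal buffer; AvgFinalize() divides the accumulated sum by the count.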
// --------------------------------------------------------------------------- struct AvgStruct { double sum; int64_t count; }; -void AvgInit(FunctionContext* context, BufferVal* val) { - assert(sizeof(AvgStruct) == 16); - memset(*val, 0, sizeof(AvgStruct)); +// Initialize the StringVal intermediate to a zero'd AvgStruct +void AvgInit(FunctionContext* context, StringVal* val) { + val->is_null = false; + val->len = sizeof(AvgStruct); + val->ptr = context->Allocate(val->len); + memset(val->ptr, 0, val->len); } -void AvgUpdate(FunctionContext* context, const DoubleVal& input, BufferVal* val) { +void AvgUpdate(FunctionContext* context, const DoubleVal& input, StringVal* val) { if (input.is_null) return; - AvgStruct* avg = reinterpret_cast<AvgStruct*>(*val); + assert(!val->is_null); + assert(val->len == sizeof(AvgStruct)); + AvgStruct* avg = reinterpret_cast<AvgStruct*>(val->ptr); avg->sum += input.val; ++avg->count; } -void AvgMerge(FunctionContext* context, const BufferVal& src, BufferVal* dst) { - if (src == NULL) return; - const AvgStruct* src_struct = reinterpret_cast<const AvgStruct*>(src); - AvgStruct* dst_struct = reinterpret_cast<AvgStruct*>(*dst); - dst_struct->sum += src_struct->sum; - dst_struct->count += src_struct->count; +void AvgMerge(FunctionContext* context, const StringVal& src, StringVal* dst) { + if (src.is_null) return; + const AvgStruct* src_avg = reinterpret_cast<const AvgStruct*>(src.ptr); + AvgStruct* dst_avg = reinterpret_cast<AvgStruct*>(dst->ptr); + dst_avg->sum += src_avg->sum; + dst_avg->count += src_avg->count; } -DoubleVal AvgFinalize(FunctionContext* context, const BufferVal& val) { - if (val == NULL) return DoubleVal::null(); - AvgStruct* val_struct = reinterpret_cast<AvgStruct*>(val); - return DoubleVal(val_struct->sum / val_struct->count); +// A serialize function is necesary to free the intermediate state allocation. We use the +// StringVal constructor to allocate memory owned by Impala, copy the intermediate state, +// and free the original allocation. Note that memory allocated by the StringVal ctor is +// not necessarily persisted across UDA function calls, which is why we don't use it in +// AvgInit(). +const StringVal AvgSerialize(FunctionContext* context, const StringVal& val) { + assert(!val.is_null); + StringVal result(context, val.len); + memcpy(result.ptr, val.ptr, val.len); + context->Free(val.ptr); + return result; +} + +StringVal AvgFinalize(FunctionContext* context, const StringVal& val) { + assert(!val.is_null); + assert(val.len == sizeof(AvgStruct)); + AvgStruct* avg = reinterpret_cast<AvgStruct*>(val.ptr); + StringVal result; + if (avg->count == 0) { + result = StringVal::null(); + } else { + // Copies the result to memory owned by Impala + result = ToStringVal(context, avg->sum / avg->count); + } + context->Free(val.ptr); + return result; } // --------------------------------------------------------------------------- // This is a sample of implementing the STRING_CONCAT aggregate function. // Example: select string_concat(string_col, ",") from table // --------------------------------------------------------------------------- +// Delimiter to use if the separator is NULL. 
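+// (StringConcatUpdate() below substitutes this ", " value whenever the separator argument is NULL.)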
+static const StringVal DEFAULT_STRING_CONCAT_DELIM((uint8_t*)", ", 2); + void StringConcatInit(FunctionContext* context, StringVal* val) { val->is_null = true; } -void StringConcatUpdate(FunctionContext* context, const StringVal& arg1, - const StringVal& arg2, StringVal* val) { - if (val->is_null) { - val->is_null = false; - *val = StringVal(context, arg1.len); - memcpy(val->ptr, arg1.ptr, arg1.len); - } else { - int new_len = val->len + arg1.len + arg2.len; - StringVal new_val(context, new_len); - memcpy(new_val.ptr, val->ptr, val->len); - memcpy(new_val.ptr + val->len, arg2.ptr, arg2.len); - memcpy(new_val.ptr + val->len + arg2.len, arg1.ptr, arg1.len); - *val = new_val; +void StringConcatUpdate(FunctionContext* context, const StringVal& str, + const StringVal& separator, StringVal* result) { + if (str.is_null) return; + if (result->is_null) { + // This is the first string, simply set the result to be the value. + uint8_t* copy = context->Allocate(str.len); + memcpy(copy, str.ptr, str.len); + *result = StringVal(copy, str.len); + return; } + + const StringVal* sep_ptr = separator.is_null ? &DEFAULT_STRING_CONCAT_DELIM : + &separator; + + // We need to grow the result buffer and then append the new string and + // separator. + int new_size = result->len + sep_ptr->len + str.len; + result->ptr = context->Reallocate(result->ptr, new_size); + memcpy(result->ptr + result->len, sep_ptr->ptr, sep_ptr->len); + result->len += sep_ptr->len; + memcpy(result->ptr + result->len, str.ptr, str.len); + result->len += str.len; } void StringConcatMerge(FunctionContext* context, const StringVal& src, StringVal* dst) { @@ -1078,13 +1211,31 @@ void StringConcatMerge(FunctionContext* context, const StringVal& src, Strin StringConcatUpdate(context, src, ",", dst); } +// A serialize function is necesary to free the intermediate state allocation. We use the +// StringVal constructor to allocate memory owned by Impala, copy the intermediate +// StringVal, and free the intermediate's memory. Note that memory allocated by the +// StringVal ctor is not necessarily persisted across UDA function calls, which is why we +// don't use it in StringConcatUpdate(). +const StringVal StringConcatSerialize(FunctionContext* context, const StringVal& val) { + if (val.is_null) return val; + StringVal result(context, val.len); + memcpy(result.ptr, val.ptr, val.len); + context->Free(val.ptr); + return result; +} + +// Same as StringConcatSerialize(). StringVal StringConcatFinalize(FunctionContext* context, const StringVal& val) { - return val; + if (val.is_null) return val; + StringVal result(context, val.len); + memcpy(result.ptr, val.ptr, val.len); + context->Free(val.ptr); + return result; }</codeblock> </conbody> </concept> - <concept audience="Cloudera" id="udf_intermediate"> + <concept rev="2.3.0 IMPALA-1829 CDH-30572" id="udf_intermediate"> <title>Intermediate Results for UDAs</title> @@ -1105,6 +1256,16 @@ StringVal StringConcatFinalize(FunctionContext* context, const StringVal& va specify the type name as <codeph>CHAR(<varname>n</varname>)</codeph>, with <varname>n</varname> representing the number of bytes in the intermediate result buffer. </p> + + <p> + For an example of this technique, see the <codeph>trunc_sum()</codeph> aggregate function, which accumulates + intermediate results of type <codeph>DOUBLE</codeph> and returns <codeph>BIGINT</codeph> at the end. 
+ View + <xref href="https://github.com/cloudera/Impala/blob/cdh5-trunk/tests/query_test/test_udfs.py" scope="external" format="html">the <codeph>CREATE FUNCTION</codeph> statement</xref> + and + <xref href="http://github.com/Cloudera/Impala/blob/cdh5-trunk/be/src/testutil/test-udas.cc" scope="external" format="html">the implementation of the underlying TruncSum*() functions</xref> + on Github. + </p> </conbody> </concept> </concept> @@ -1157,15 +1318,21 @@ StringVal StringConcatFinalize(FunctionContext* context, const StringVal& va <note> <p conref="../shared/impala_common.xml#common/udf_persistence_restriction"/> + <p> + See <xref href="impala_create_function.xml#create_function"/> and <xref href="impala_drop_function.xml#drop_function"/> + for the new syntax for the persistent Java UDFs. + </p> </note> <p> Prerequisites for the build environment are: </p> -<codeblock># Use the appropriate package installation command for your Linux distribution. +<codeblock rev="CDH-37080"># Use the appropriate package installation command for your Linux distribution. sudo yum install gcc-c++ cmake boost-devel -sudo yum install impala-udf-devel</codeblock> +sudo yum install impala-udf-devel +# The package name on Ubuntu and Debian is impala-udf-dev. +</codeblock> <p> Then, unpack the sample code in <filepath>udf_samples.tar.gz</filepath> and use that as a template to set @@ -1730,6 +1897,10 @@ Returned 2 row(s) in 0.43s</codeblock> </li> <li> + <p conref="../shared/impala_common.xml#common/current_user_caveat"/> + </li> + + <li> All Impala UDFs must be deterministic, that is, produce the same output each time when passed the same argument values. For example, an Impala UDF must not call functions such as <codeph>rand()</codeph> to produce different values for each invocation. It must not retrieve data from external sources, such as @@ -1740,9 +1911,12 @@ Returned 2 row(s) in 0.43s</codeblock> An Impala UDF must not spawn other threads or processes. </li> - <li> - When the <cmdname>catalogd</cmdname> process is restarted, all UDFs become undefined and must be - reloaded. + <li rev="2.5.0 IMPALA-2843"> + Prior to CDH 5.7 / Impala 2.5, when the <cmdname>catalogd</cmdname> process is restarted, + all UDFs become undefined and must be reloaded. In CDH 5.7 / Impala 2.5 and higher, this + limitation only applies to older Java UDFs. Re-create those UDFs using the new + <codeph>CREATE FUNCTION</codeph> syntax for Java UDFs, which excludes the function signature, + to remove the limitation entirely. 
</li> <li> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_union.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_union.xml b/docs/topics/impala_union.xml index 29a0b45..ff4529f 100644 --- a/docs/topics/impala_union.xml +++ b/docs/topics/impala_union.xml @@ -8,6 +8,8 @@ <data name="Category" value="Impala"/> <data name="Category" value="SQL"/> <data name="Category" value="Querying"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> </metadata> </prolog> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_update.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_update.xml b/docs/topics/impala_update.xml index 3b9e330..a083c48 100644 --- a/docs/topics/impala_update.xml +++ b/docs/topics/impala_update.xml @@ -2,8 +2,8 @@ <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> <concept id="update"> - <title>UPDATE Statement (CDH 5.5 and higher only)</title> - <titlealts><navtitle>UPDATE</navtitle></titlealts> + <title>UPDATE Statement (CDH 5.10 or higher only)</title> + <titlealts audience="PDF"><navtitle>UPDATE</navtitle></titlealts> <prolog> <metadata> <data name="Category" value="Impala"/> @@ -12,6 +12,7 @@ <data name="Category" value="ETL"/> <data name="Category" value="Ingest"/> <data name="Category" value="DML"/> + <data name="Category" value="Developers"/> <data name="Category" value="Data Analysts"/> </metadata> </prolog> @@ -31,7 +32,7 @@ <codeblock> </codeblock> - <p rev="kudu" audience="impala_next"> + <p rev="kudu"> Normally, an <codeph>UPDATE</codeph> operation for a Kudu table fails if some partition key columns are not found, due to their being deleted or changed by a concurrent <codeph>UPDATE</codeph> or <codeph>DELETE</codeph> operation. http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_upgrading.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_upgrading.xml b/docs/topics/impala_upgrading.xml index 6fef62e..0baaae6 100644 --- a/docs/topics/impala_upgrading.xml +++ b/docs/topics/impala_upgrading.xml @@ -3,7 +3,13 @@ <concept id="upgrading"> <title>Upgrading Impala</title> - + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Upgrading"/> + <data name="Category" value="Administrators"/> + </metadata> + </prolog> <conbody> @@ -12,7 +18,361 @@ tool to upgrade Impala to the latest version, and then restarting Impala services. </p> - + <note> + <ul> + <li> + Each version of CDH 5 has an associated version of Impala, When you upgrade from CDH 4 to CDH 5, you get + whichever version of Impala comes with the associated level of CDH. Depending on the version of Impala + you were running on CDH 4, this could install a lower level of Impala on CDH 5. For example, if you + upgrade to CDH 5.0 from CDH 4 plus Impala 1.4, the CDH 5.0 installation comes with Impala 1.3. Always + check the associated level of Impala before upgrading to a specific version of CDH 5. Where practical, + upgrade from CDH 4 to the latest CDH 5, which also has the latest Impala. + </li> + + <li rev="ver"> + When you upgrade Impala, also upgrade Cloudera Manager if necessary: + <ul> + <li> + Users running Impala on CDH 5 must upgrade to Cloudera Manager 5.0.0 or higher. 
+ </li> + + <li> + Users running Impala on CDH 4 must upgrade to Cloudera Manager 4.8 or higher. Cloudera Manager 4.8 + includes management support for the Impala catalog service, and is the minimum Cloudera Manager + version you can use. + </li> + + <li> + Cloudera Manager is continually updated with configuration settings for features introduced in the + latest Impala releases. + </li> + </ul> + </li> + + <li> + If you are upgrading from CDH 5 beta to CDH 5.0 production, make sure you are using the appropriate CDH 5 + repositories shown on the +<!-- Original URL: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH-Version-and-Packaging-Information/CDH-Version-and-Packaging-Information.html --> + <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/rg_vd.html" scope="external" format="html">CDH + version and packaging</xref> page, then follow the procedures throughout the rest of this section. + </li> + + <li> + Every time you upgrade to a new major or minor Impala release, see + <xref href="impala_incompatible_changes.xml#incompatible_changes"/> in the <cite>Release Notes</cite> for + any changes needed in your source code, startup scripts, and so on. + </li> + + <li> + Also check <xref href="impala_known_issues.xml#known_issues"/> in the <cite>Release Notes</cite> for any + issues or limitations that require workarounds. + </li> + + </ul> + </note> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="upgrade_cm_parcels"> + + <title>Upgrading Impala through Cloudera Manager - Parcels</title> + <prolog> + <metadata> + <data name="Category" value="Cloudera Manager"/> + <data name="Category" value="Parcels"/> + </metadata> + </prolog> + + <conbody> + + <p> + Parcels are an alternative binary distribution format available in Cloudera Manager 4.5 and higher. + </p> + + <note type="important"> + In CDH 5, there is not a separate Impala parcel; Impala is part of the main CDH 5 parcel. Each level of CDH + 5 has a corresponding version of Impala, and you upgrade Impala by upgrading CDH. See the + <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_upgrading_cdh.html" scope="external" format="html">CDH + 5 upgrade instructions</xref> and choose the instructions for parcels. The remainder of this section only covers parcel upgrades for + Impala under CDH 4. + </note> + + <p> + To upgrade Impala for CDH 4 in a Cloudera Managed environment, using parcels: + </p> + + <ol> + <li> + <p> + If you originally installed using packages and now are switching to parcels, remove all the + Impala-related packages first. You can check which packages are installed using one of the following + commands, depending on your operating system: + </p> +<codeblock>rpm -qa # RHEL, Oracle Linux, CentOS, Debian +dpkg --get-selections # Debian</codeblock> + and then remove the packages using one of the following commands: +<codeblock>sudo yum remove <varname>pkg_names</varname> # RHEL, Oracle Linux, CentOS +sudo zypper remove <varname>pkg_names</varname> # SLES +sudo apt-get purge <varname>pkg_names</varname> # Ubuntu, Debian</codeblock> + </li> + + <li> + <p> + Connect to the Cloudera Manager Admin Console. + </p> + </li> + + <li> + <p> + Go to the <menucascade><uicontrol>Hosts</uicontrol><uicontrol>Parcels</uicontrol></menucascade> tab. + You should see a parcel with a newer version of Impala that you can upgrade to. + </p> + </li> + + <li> + <p> + Click <uicontrol>Download</uicontrol>, then <uicontrol>Distribute</uicontrol>. 
(The button changes as + each step completes.) + </p> + </li> + + <li> + <p> + Click <uicontrol>Activate</uicontrol>. + </p> + </li> + + <li> + <p> + When prompted, click <uicontrol>Restart</uicontrol> to restart the Impala service. + </p> + </li> + </ol> + </conbody> + </concept> + + <concept id="upgrade_cm_pkgs"> + + <title>Upgrading Impala through Cloudera Manager - Packages</title> + <prolog> + <metadata> + <data name="Category" value="Packages"/> + <data name="Category" value="Cloudera Manager"/> + </metadata> + </prolog> + + <conbody> + + <p> + To upgrade Impala in a Cloudera Managed environment, using packages: + </p> + + <ol> + <li> + Connect to the Cloudera Manager Admin Console. + </li> + + <li> + In the <b>Services</b> tab, click the <b>Impala</b> service. + </li> + + <li> + Click <b>Actions</b> and click <b>Stop</b>. + </li> + + <li> + Use <b>one</b> of the following sets of commands to update Impala on each Impala node in your cluster: + <p> + <b>For RHEL, Oracle Linux, or CentOS systems:</b> + </p> +<codeblock rev="1.2">$ sudo yum update impala +$ sudo yum update hadoop-lzo-cdh4 # Optional; if this package is already installed +</codeblock> + <p> + <b>For SUSE systems:</b> + </p> +<codeblock rev="1.2">$ sudo zypper update impala +$ sudo zypper update hadoop-lzo-cdh4 # Optional; if this package is already installed +</codeblock> + <p> + <b>For Debian or Ubuntu systems:</b> + </p> +<codeblock rev="1.2">$ sudo apt-get install impala +$ sudo apt-get install hadoop-lzo-cdh4 # Optional; if this package is already installed +</codeblock> + </li> + + <li> + Use <b>one</b> of the following sets of commands to update Impala shell on each node on which it is + installed: + <p> + <b>For RHEL, Oracle Linux, or CentOS systems:</b> + </p> +<codeblock>$ sudo yum update impala-shell</codeblock> + <p> + <b>For SUSE systems:</b> + </p> +<codeblock>$ sudo zypper update impala-shell</codeblock> + <p> + <b>For Debian or Ubuntu systems:</b> + </p> +<codeblock>$ sudo apt-get install impala-shell</codeblock> + </li> + + <li> + Connect to the Cloudera Manager Admin Console. + </li> + + <li> + In the <b>Services</b> tab, click the Impala service. + </li> + + <li> + Click <b>Actions</b> and click <b>Start</b>. + </li> + </ol> </conbody> </concept> + <concept id="upgrade_noncm"> + + <title>Upgrading Impala without Cloudera Manager</title> + <prolog> + <metadata> + <!-- Fill in relevant metatag(s) when we decide how to mark non-CM topics. --> + </metadata> + </prolog> + + <conbody> + + <p> + To upgrade Impala on a cluster not managed by Cloudera Manager, run these Linux commands on the appropriate + hosts in your cluster: + </p> + + <ol> + <li> + Stop Impala services. + <ol> + <li> + Stop <codeph>impalad</codeph> on each Impala node in your cluster: +<codeblock>$ sudo service impala-server stop</codeblock> + </li> + + <li> + Stop any instances of the state store in your cluster: +<codeblock>$ sudo service impala-state-store stop</codeblock> + </li> + + <li rev="1.2"> + Stop any instances of the catalog service in your cluster: +<codeblock>$ sudo service impala-catalog stop</codeblock> + </li> + </ol> + </li> + + <li> + Check if there are new recommended or required configuration settings to put into place in the + configuration files, typically under <filepath>/etc/impala/conf</filepath>. See + <xref href="impala_config_performance.xml#config_performance"/> for settings related to performance and + scalability. 
+ </li> + + <li> + Use <b>one</b> of the following sets of commands to update Impala on each Impala node in your cluster: + <p> + <b>For RHEL, Oracle Linux, or CentOS systems:</b> + </p> +<codeblock>$ sudo yum update impala-server +$ sudo yum update hadoop-lzo-cdh4 # Optional; if this package is already installed +$ sudo yum update impala-catalog # New in Impala 1.2; do yum install when upgrading from 1.1. +</codeblock> + <p> + <b>For SUSE systems:</b> + </p> +<codeblock>$ sudo zypper update impala-server +$ sudo zypper update hadoop-lzo-cdh4 # Optional; if this package is already installed +$ sudo zypper update impala-catalog # New in Impala 1.2; do zypper install when upgrading from 1.1. +</codeblock> + <p> + <b>For Debian or Ubuntu systems:</b> + </p> +<codeblock>$ sudo apt-get install impala-server +$ sudo apt-get install hadoop-lzo-cdh4 # Optional; if this package is already installed +$ sudo apt-get install impala-catalog # New in Impala 1.2. +</codeblock> + </li> + + <li> + Use <b>one</b> of the following sets of commands to update Impala shell on each node on which it is + installed: + <p> + <b>For RHEL, Oracle Linux, or CentOS systems:</b> + </p> +<codeblock>$ sudo yum update impala-shell</codeblock> + <p> + <b>For SUSE systems:</b> + </p> +<codeblock>$ sudo zypper update impala-shell</codeblock> + <p> + <b>For Debian or Ubuntu systems:</b> + </p> +<codeblock>$ sudo apt-get install impala-shell</codeblock> + </li> + + <li rev="alternatives"> + Depending on which release of Impala you are upgrading from, you might find that the symbolic links + <filepath>/etc/impala/conf</filepath> and <filepath>/usr/lib/impala/sbin</filepath> are missing. If so, + see <xref href="impala_known_issues.xml#known_issues"/> for the procedure to work around this + problem. + </li> + + <li> + Restart Impala services: + <ol> + <li> + Restart the Impala state store service on the desired nodes in your cluster. Expect to see a process + named <codeph>statestored</codeph> if the service started successfully. +<codeblock>$ sudo service impala-state-store start +$ ps ax | grep [s]tatestored + 6819 ? Sl 0:07 /usr/lib/impala/sbin/statestored -log_dir=/var/log/impala -state_store_port=24000 +</codeblock> + <p> + Restart the state store service <i>before</i> the Impala server service to avoid <q>Not + connected</q> errors when you run <codeph>impala-shell</codeph>. + </p> + </li> + + <li rev="1.2"> + Restart the Impala catalog service on whichever host it runs on in your cluster. Expect to see a + process named <codeph>catalogd</codeph> if the service started successfully. +<codeblock>$ sudo service impala-catalog restart +$ ps ax | grep [c]atalogd + 6068 ? Sl 4:06 /usr/lib/impala/sbin/catalogd +</codeblock> + </li> + + <li> + Restart the Impala daemon service on each node in your cluster. Expect to see a process named + <codeph>impalad</codeph> if the service started successfully. +<codeblock>$ sudo service impala-server start +$ ps ax | grep [i]mpalad + 7936 ? Sl 0:12 /usr/lib/impala/sbin/impalad -log_dir=/var/log/impala -state_store_port=24000 -use_statestore +-state_store_host=127.0.0.1 -be_port=22000 +</codeblock> + </li> + </ol> + </li> + </ol> + + <note> + <p> + If the services did not start successfully (even though the <codeph>sudo service</codeph> command might + display <codeph>[OK]</codeph>), check for errors in the Impala log file, typically in + <filepath>/var/log/impala</filepath>. 
+ </p> + </note> + </conbody> + </concept> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_use.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_use.xml b/docs/topics/impala_use.xml index 9e0b654..5ffcdeb 100644 --- a/docs/topics/impala_use.xml +++ b/docs/topics/impala_use.xml @@ -3,12 +3,14 @@ <concept id="use"> <title>USE Statement</title> - <titlealts><navtitle>USE</navtitle></titlealts> + <titlealts audience="PDF"><navtitle>USE</navtitle></titlealts> <prolog> <metadata> <data name="Category" value="Impala"/> <data name="Category" value="SQL"/> <data name="Category" value="Databases"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> </metadata> </prolog> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_v_cpu_cores.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_v_cpu_cores.xml b/docs/topics/impala_v_cpu_cores.xml index 41be3af..8091f3a 100644 --- a/docs/topics/impala_v_cpu_cores.xml +++ b/docs/topics/impala_v_cpu_cores.xml @@ -3,6 +3,7 @@ <concept rev="1.2" id="v_cpu_cores"> <title>V_CPU_CORES Query Option (CDH 5 only)</title> + <titlealts audience="PDF"><navtitle>V_CPU_CORES</navtitle></titlealts> <prolog> <metadata> <data name="Category" value="Impala"/> @@ -10,16 +11,19 @@ <data name="Category" value="YARN"/> <data name="Category" value="Llama"/> <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> </metadata> </prolog> <conbody> + <note conref="../shared/impala_common.xml#common/llama_query_options_obsolete"/> + <p> <indexterm audience="Cloudera">V_CPU_CORES query option</indexterm> The number of per-host virtual CPU cores to request from YARN. If set, the query option overrides the automatic estimate from Impala. -<!-- This sentence is used in a few places and could be conref'ed. --> Used in conjunction with the Impala resource management feature in Impala 1.2 and higher and CDH 5. </p> @@ -31,7 +35,5 @@ <b>Default:</b> 0 (use automatic estimates) </p> -<!-- Worth adding a couple of related info links here. --> - </conbody> </concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_varchar.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_varchar.xml b/docs/topics/impala_varchar.xml index 32db4ae..8b05149 100644 --- a/docs/topics/impala_varchar.xml +++ b/docs/topics/impala_varchar.xml @@ -3,7 +3,7 @@ <concept id="varchar" rev="2.0.0"> <title>VARCHAR Data Type (CDH 5.2 or higher only)</title> - <titlealts><navtitle>VARCHAR (CDH 5.2 or higher only)</navtitle></titlealts> + <titlealts audience="PDF"><navtitle>VARCHAR</navtitle></titlealts> <prolog> <metadata> <data name="Category" value="Impala"/> @@ -17,7 +17,7 @@ <conbody> - <p> + <p rev="2.0.0"> <indexterm audience="Cloudera">VARCHAR data type</indexterm> A variable-length character type, truncated during processing if necessary to fit within the specified length. @@ -80,6 +80,9 @@ prefer to use an integer data type with sufficient range (<codeph>INT</codeph>, Impala processes those values during a query. 
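+        As an illustration of how a <codeph>VARCHAR</codeph> column truncates longer values to its
+        declared length (the table and column names here are placeholders, shown only as a sketch):
+<codeblock>-- Placeholder table; VARCHAR(5) keeps at most 5 characters.
+CREATE TABLE varchar_demo (s VARCHAR(5));
+INSERT INTO varchar_demo VALUES (CAST('hello world' AS VARCHAR(5)));
+SELECT s FROM varchar_demo;   -- Returns 'hello'.
+</codeblock>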
</p> + <p><b>Avro considerations:</b></p> + <p conref="../shared/impala_common.xml#common/avro_2gb_strings"/> + <p conref="../shared/impala_common.xml#common/schema_evolution_blurb"/> <p> @@ -98,8 +101,7 @@ prefer to use an integer data type with sufficient range (<codeph>INT</codeph>, <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> <p> - This type is available using Impala 2.0 or higher under CDH 4, or with Impala on CDH 5.2 or higher. There are - no compatibility issues with other components when exchanging data files or running Impala on CDH 4. + This type is available on CDH 5.2 or higher. </p> <p conref="../shared/impala_common.xml#common/internals_min_bytes"/> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_variance.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_variance.xml b/docs/topics/impala_variance.xml index e0c5d02..4ce2eaf 100644 --- a/docs/topics/impala_variance.xml +++ b/docs/topics/impala_variance.xml @@ -3,7 +3,7 @@ <concept rev="1.4" id="variance"> <title>VARIANCE, VARIANCE_SAMP, VARIANCE_POP, VAR_SAMP, VAR_POP Functions</title> - <titlealts><navtitle>VARIANCE, VARIANCE_SAMP, VARIANCE_POP, VAR_SAMP, VAR_POP</navtitle></titlealts> + <titlealts audience="PDF"><navtitle>VARIANCE, VARIANCE_SAMP, VARIANCE_POP, VAR_SAMP, VAR_POP</navtitle></titlealts> <prolog> <metadata> <data name="Category" value="Impala"/> @@ -11,6 +11,8 @@ <data name="Category" value="Impala Functions"/> <data name="Category" value="Aggregate Functions"/> <data name="Category" value="Querying"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> </metadata> </prolog> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_views.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_views.xml b/docs/topics/impala_views.xml index 78288b3..0b2154c 100644 --- a/docs/topics/impala_views.xml +++ b/docs/topics/impala_views.xml @@ -3,7 +3,7 @@ <concept rev="1.1" id="views"> <title>Overview of Impala Views</title> - <titlealts><navtitle>Views</navtitle></titlealts> + <titlealts audience="PDF"><navtitle>Views</navtitle></titlealts> <prolog> <metadata> <data name="Category" value="Impala"/> @@ -13,6 +13,7 @@ <data name="Category" value="Querying"/> <data name="Category" value="Tables"/> <data name="Category" value="Schemas"/> + <data name="Category" value="Views"/> </metadata> </prolog> @@ -93,9 +94,9 @@ select * from report;</codeblock> <li rev="2.3.0 collevelauth"> Set up fine-grained security where a user can query some columns from a table but not other columns. Because CDH 5.5 / Impala 2.3 and higher support column-level authorization, this technique is no longer - required. <!--If you formerly implemented column-level security through views, see + required. If you formerly implemented column-level security through views, see <xref href="sg_hive_sql.xml#concept_c2q_4qx_p4/col_level_auth_sentry"/> for details about the - column-level authorization feature.--> + column-level authorization feature. <!-- See <xref href="impala_authorization.xml#security_examples/sec_ex_views"/> for details. 
--> </li> </ul> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_with.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_with.xml b/docs/topics/impala_with.xml index 8d1001c..acc0f80 100644 --- a/docs/topics/impala_with.xml +++ b/docs/topics/impala_with.xml @@ -8,6 +8,8 @@ <data name="Category" value="Impala"/> <data name="Category" value="SQL"/> <data name="Category" value="Querying"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> </metadata> </prolog>
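+        To illustrate the view-based column-restriction technique mentioned in the impala_views.xml
+        change above (all table, view, and role names below are placeholder examples, not from the
+        original docs):
+<codeblock>-- Expose only non-sensitive columns through a view; names are placeholders.
+CREATE VIEW customers_public AS SELECT c_id, c_name FROM customers;
+-- With Sentry authorization enabled, access can be granted on the view rather than the base table.
+GRANT SELECT ON TABLE customers_public TO ROLE analysts;
+</codeblock>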
