Repository: tajo Updated Branches: refs/heads/branch-0.10.1 b729a49c8 -> 8173bc1f4
TAJO-1492: Replace CSV examples into TEXT examples in docs. Signed-off-by: Jihoon Son <[email protected]> Project: http://git-wip-us.apache.org/repos/asf/tajo/repo Commit: http://git-wip-us.apache.org/repos/asf/tajo/commit/8173bc1f Tree: http://git-wip-us.apache.org/repos/asf/tajo/tree/8173bc1f Diff: http://git-wip-us.apache.org/repos/asf/tajo/diff/8173bc1f Branch: refs/heads/branch-0.10.1 Commit: 8173bc1f491f45050b3611868d1d8d0099056d0c Parents: b729a49 Author: Dongjoon Hyun <[email protected]> Authored: Sat Apr 4 19:00:31 2015 +0900 Committer: Jihoon Son <[email protected]> Committed: Sat Apr 4 19:00:31 2015 +0900 ---------------------------------------------------------------------- CHANGES | 3 + .../main/sphinx/backup_and_restore/catalog.rst | 2 +- tajo-docs/src/main/sphinx/getting_started.rst | 2 +- tajo-docs/src/main/sphinx/sql_language/ddl.rst | 2 +- .../src/main/sphinx/table_management/csv.rst | 115 ------------------- .../sphinx/table_management/file_formats.rst | 2 +- .../sphinx/table_management/table_overview.rst | 6 +- .../src/main/sphinx/table_management/text.rst | 115 +++++++++++++++++++ 8 files changed, 125 insertions(+), 122 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/tajo/blob/8173bc1f/CHANGES ---------------------------------------------------------------------- diff --git a/CHANGES b/CHANGES index 3f7880e..08ac729 100644 --- a/CHANGES +++ b/CHANGES @@ -69,6 +69,9 @@ Release 0.10.1 - unreleased TASKS + TAJO-1462: Replace CSV examples into TEXT examples in docs. + (Contributed by Dongjoon Hyun, Committed by jihoon) + TAJO-1439: Some method name is written wrongly. (Contributed by Jongyoung Park. 
Committed by jihoon) http://git-wip-us.apache.org/repos/asf/tajo/blob/8173bc1f/tajo-docs/src/main/sphinx/backup_and_restore/catalog.rst ---------------------------------------------------------------------- diff --git a/tajo-docs/src/main/sphinx/backup_and_restore/catalog.rst b/tajo-docs/src/main/sphinx/backup_and_restore/catalog.rst index 200aa85..1c2b709 100644 --- a/tajo-docs/src/main/sphinx/backup_and_restore/catalog.rst +++ b/tajo-docs/src/main/sphinx/backup_and_restore/catalog.rst @@ -28,7 +28,7 @@ For example, if you want to backup a table customer, you should type a command a -- Name: customer; Type: TABLE; Storage: CSV -- Path: file:/home/hyunsik/tpch/customer -- - CREATE EXTERNAL TABLE customer (c_custkey INT8, c_name TEXT, c_address TEXT, c_nationkey INT8, c_phone TEXT, c_acctbal FLOAT8, c_mktsegment TEXT, c_comment TEXT) USING CSV LOCATION 'file:/home/hyunsik/tpch/customer'; + CREATE EXTERNAL TABLE customer (c_custkey INT8, c_name TEXT, c_address TEXT, c_nationkey INT8, c_phone TEXT, c_acctbal FLOAT8, c_mktsegment TEXT, c_comment TEXT) USING TEXT LOCATION 'file:/home/hyunsik/tpch/customer'; If you want to restore the catalog from the SQL dump file, please type the below command: :: http://git-wip-us.apache.org/repos/asf/tajo/blob/8173bc1f/tajo-docs/src/main/sphinx/getting_started.rst ---------------------------------------------------------------------- diff --git a/tajo-docs/src/main/sphinx/getting_started.rst b/tajo-docs/src/main/sphinx/getting_started.rst index eaf6973..e30c3fe 100644 --- a/tajo-docs/src/main/sphinx/getting_started.rst +++ b/tajo-docs/src/main/sphinx/getting_started.rst @@ -135,7 +135,7 @@ Here, we assume the schema as (int, text, float, text). :: name text, score float, type text) - using csv with ('text.delimiter'='|') location 'file:/home/x/table1'; + using text with ('text.delimiter'='|') location 'file:/home/x/table1'; To load an external table, you need to use “create external table” statement.
In the location clause, you should use the absolute directory path with an appropriate scheme. http://git-wip-us.apache.org/repos/asf/tajo/blob/8173bc1f/tajo-docs/src/main/sphinx/sql_language/ddl.rst ---------------------------------------------------------------------- diff --git a/tajo-docs/src/main/sphinx/sql_language/ddl.rst b/tajo-docs/src/main/sphinx/sql_language/ddl.rst index 60b7190..662ccff 100644 --- a/tajo-docs/src/main/sphinx/sql_language/ddl.rst +++ b/tajo-docs/src/main/sphinx/sql_language/ddl.rst @@ -56,7 +56,7 @@ If you want to add an external table that contains compressed data, you should g ... L_COMMENT text) - USING csv WITH ('text.delimiter'='|','compression.codec'='org.apache.hadoop.io.compress.DeflateCodec') + USING TEXT WITH ('text.delimiter'='|','compression.codec'='org.apache.hadoop.io.compress.DeflateCodec') LOCATION 'hdfs://localhost:9010/tajo/warehouse/lineitem_100_snappy'; `compression.codec` parameter can have one of the following compression codecs: http://git-wip-us.apache.org/repos/asf/tajo/blob/8173bc1f/tajo-docs/src/main/sphinx/table_management/csv.rst ---------------------------------------------------------------------- diff --git a/tajo-docs/src/main/sphinx/table_management/csv.rst b/tajo-docs/src/main/sphinx/table_management/csv.rst deleted file mode 100644 index 53c6e1d..0000000 --- a/tajo-docs/src/main/sphinx/table_management/csv.rst +++ /dev/null @@ -1,115 +0,0 @@ -************************************* -CSV (TextFile) -************************************* - -A character-separated values (CSV) file represents a tabular data set consisting of rows and columns. -Each row is a plan-text line. A line is usually broken by a character line feed ``\n`` or carriage-return ``\r``. -The line feed ``\n`` is the default delimiter in Tajo. Each record consists of multiple fields, separated by -some other character or string, most commonly a literal vertical bar ``|``, comma ``,`` or tab ``\t``. 
-The vertical bar is used as the default field delimiter in Tajo. - -========================================= -How to Create a CSV Table ? -========================================= - -If you are not familiar with the ``CREATE TABLE`` statement, please refer to the Data Definition Language :doc:`/sql_language/ddl`. - -In order to specify a certain file format for your table, you need to use the ``USING`` clause in your ``CREATE TABLE`` -statement. The below is an example statement for creating a table using CSV files. - -.. code-block:: sql - - CREATE TABLE - table1 ( - id int, - name text, - score float, - type text - ) USING CSV; - -========================================= -Physical Properties -========================================= - -Some table storage formats provide parameters for enabling or disabling features and adjusting physical parameters. -The ``WITH`` clause in the CREATE TABLE statement allows users to set those parameters. - -Now, the CSV storage format provides the following physical properties. - -* ``text.delimiter``: delimiter character. ``|`` or ``\u0001`` is usually used, and the default field delimiter is ``|``. -* ``text.null``: NULL character. The default NULL character is an empty string ``''``. Hive's default NULL character is ``'\\N'``. -* ``compression.codec``: Compression codec. You can enable compression feature and set specified compression algorithm. The compression algorithm used to compress files. The compression codec name should be the fully qualified class name inherited from `org.apache.hadoop.io.compress.CompressionCodec <https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html>`_. By default, compression is disabled. -* ``csvfile.serde`` (deprecated): custom (De)serializer class. ``org.apache.tajo.storage.TextSerializerDeserializer`` is the default (De)serializer class. -* ``timezone``: the time zone that the table uses for writting. 
When table rows are read or written, ```timestamp``` and ```time``` column values are adjusted by this timezone if it is set. Time zone can be an abbreviation form like 'PST' or 'DST'. Also, it accepts an offset-based form like 'UTC+9' or a location-based form like 'Asia/Seoul'. -* ``text.error-tolerance.max-num``: the maximum number of permissible parsing errors. This value should be an integer value. By default, ``text.error-tolerance.max-num`` is ``0``. According to the value, parsing errors will be handled in different ways. - * If ``text.error-tolerance.max-num < 0``, all parsing errors are ignored. - * If ``text.error-tolerance.max-num == 0``, any parsing error is not allowed. If any error occurs, the query will be failed. (default) - * If ``text.error-tolerance.max-num > 0``, the given number of parsing errors in each task will be pemissible. - -The following example is to set a custom field delimiter, NULL character, and compression codec: - -.. code-block:: sql - - CREATE TABLE table1 ( - id int, - name text, - score float, - type text - ) USING CSV WITH('text.delimiter'='\u0001', - 'text.null'='\\N', - 'compression.codec'='org.apache.hadoop.io.compress.SnappyCodec'); - -.. warning:: - - Be careful when using ``\n`` as the field delimiter because CSV uses ``\n`` as the line delimiter. - At the moment, Tajo does not provide a way to specify the line delimiter. - -========================================= -Custom (De)serializer -========================================= - -The CSV storage format not only provides reading and writing interfaces for CSV data but also allows users to process custom -plan-text file formats with user-defined (De)serializer classes. -For example, with custom (de)serializers, Tajo can process JSON file formats or any specialized plan-text file formats. - -In order to specify a custom (De)serializer, set a physical property ``csvfile.serde``. -The property value should be a fully qualified class name. - -For example: - -.. 
code-block:: sql - - CREATE TABLE table1 ( - id int, - name text, - score float, - type text - ) USING CSV WITH ('csvfile.serde'='org.my.storage.CustomSerializerDeserializer') - - -========================================= -Null Value Handling Issues -========================================= -In default, NULL character in CSV files is an empty string ``''``. -In other words, an empty field is basically recognized as a NULL value in Tajo. -If a field domain is ``TEXT``, an empty field is recognized as a string value ``''`` instead of NULL value. -Besides, You can also use your own NULL character by specifying a physical property ``text.null``. - -========================================= -Compatibility Issues with Apache Hive™ -========================================= - -CSV files generated in Tajo can be processed directly by Apache Hive™ without further processing. -In this section, we explain some compatibility issue for users who use both Hive and Tajo. - -If you set a custom field delimiter, the CSV tables cannot be directly used in Hive. -In order to specify the custom field delimiter in Hive, you need to use ``ROW FORMAT DELIMITED FIELDS TERMINATED BY`` -clause in a Hive's ``CREATE TABLE`` statement as follows: - -.. code-block:: sql - - CREATE TABLE table1 (id int, name string, score float, type string) - ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' - STORED AS TEXT - -To the best of our knowledge, there is not way to specify a custom NULL character in Hive.
http://git-wip-us.apache.org/repos/asf/tajo/blob/8173bc1f/tajo-docs/src/main/sphinx/table_management/file_formats.rst ---------------------------------------------------------------------- diff --git a/tajo-docs/src/main/sphinx/table_management/file_formats.rst b/tajo-docs/src/main/sphinx/table_management/file_formats.rst index c15dd3f..0579497 100644 --- a/tajo-docs/src/main/sphinx/table_management/file_formats.rst +++ b/tajo-docs/src/main/sphinx/table_management/file_formats.rst @@ -7,7 +7,7 @@ Currently, Tajo provides four file formats as follows: .. toctree:: :maxdepth: 1 - csv + text rcfile parquet sequencefile \ No newline at end of file http://git-wip-us.apache.org/repos/asf/tajo/blob/8173bc1f/tajo-docs/src/main/sphinx/table_management/table_overview.rst ---------------------------------------------------------------------- diff --git a/tajo-docs/src/main/sphinx/table_management/table_overview.rst b/tajo-docs/src/main/sphinx/table_management/table_overview.rst index 3d933c2..3985e19 100644 --- a/tajo-docs/src/main/sphinx/table_management/table_overview.rst +++ b/tajo-docs/src/main/sphinx/table_management/table_overview.rst @@ -29,9 +29,9 @@ The following example is to set a custom field delimiter, NULL character, and co name text, score float, type text - ) USING CSV WITH('text.delimiter'='\u0001', - 'text.null'='\\N', - 'compression.codec'='org.apache.hadoop.io.compress.SnappyCodec'); + ) USING TEXT WITH('text.delimiter'='\u0001', + 'text.null'='\\N', + 'compression.codec'='org.apache.hadoop.io.compress.SnappyCodec'); Each physical table layout has its own specialized properties. They will be addressed in :doc:`/table_management/file_formats`. 
http://git-wip-us.apache.org/repos/asf/tajo/blob/8173bc1f/tajo-docs/src/main/sphinx/table_management/text.rst ---------------------------------------------------------------------- diff --git a/tajo-docs/src/main/sphinx/table_management/text.rst b/tajo-docs/src/main/sphinx/table_management/text.rst new file mode 100644 index 0000000..3727b03 --- /dev/null +++ b/tajo-docs/src/main/sphinx/table_management/text.rst @@ -0,0 +1,115 @@ +************************************* +TEXT +************************************* + +A character-separated values file is a plain-text file that represents a tabular data set consisting of rows and columns. +Each row is a plain-text line. A line is usually terminated by a line feed ``\n`` or a carriage return ``\r``. +The line feed ``\n`` is the default line delimiter in Tajo. Each record consists of multiple fields, separated by +some other character or string, most commonly a literal vertical bar ``|``, comma ``,``, or tab ``\t``. +The vertical bar is used as the default field delimiter in Tajo. + +========================================= +How to Create a TEXT Table? +========================================= + +If you are not familiar with the ``CREATE TABLE`` statement, please refer to the Data Definition Language :doc:`/sql_language/ddl`. + +In order to specify a certain file format for your table, you need to use the ``USING`` clause in your ``CREATE TABLE`` +statement. Below is an example statement that creates a table using the *TEXT* format. + +.. code-block:: sql + + CREATE TABLE + table1 ( + id int, + name text, + score float, + type text + ) USING TEXT; + +========================================= +Physical Properties +========================================= + +Some table storage formats provide parameters for enabling or disabling features and adjusting physical parameters. +The ``WITH`` clause in the ``CREATE TABLE`` statement allows users to set those parameters. + +The *TEXT* format provides the following physical properties.
+ +* ``text.delimiter``: delimiter character. ``|`` or ``\u0001`` is usually used, and the default field delimiter is ``|``. +* ``text.null``: ``NULL`` character. The default ``NULL`` character is an empty string ``''``. Hive's default ``NULL`` character is ``'\\N'``. +* ``compression.codec``: Compression codec. Enables compression and specifies the algorithm used to compress files. The compression codec name should be the fully qualified class name inherited from `org.apache.hadoop.io.compress.CompressionCodec <https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html>`_. By default, compression is disabled. +* ``text.serde``: custom (De)serializer class. ``org.apache.tajo.storage.text.CSVLineSerDe`` is the default (De)serializer class. +* ``timezone``: the time zone that the table uses for writing. When table rows are read or written, ``timestamp`` and ``time`` column values are adjusted by this time zone if it is set. A time zone can be given as an abbreviation like 'PST' or 'DST', an offset-based form like 'UTC+9', or a location-based form like 'Asia/Seoul'. +* ``text.error-tolerance.max-num``: the maximum number of permissible parsing errors. This value should be an integer. By default, ``text.error-tolerance.max-num`` is ``0``. Parsing errors are handled in different ways according to this value. + * If ``text.error-tolerance.max-num < 0``, all parsing errors are ignored. + * If ``text.error-tolerance.max-num == 0``, no parsing error is allowed; if any error occurs, the query fails. (default) + * If ``text.error-tolerance.max-num > 0``, up to the given number of parsing errors per task is permitted. + +The following example sets a custom field delimiter, ``NULL`` character, and compression codec: + +.. 
code-block:: sql + + CREATE TABLE table1 ( + id int, + name text, + score float, + type text + ) USING TEXT WITH('text.delimiter'='\u0001', + 'text.null'='\\N', + 'compression.codec'='org.apache.hadoop.io.compress.SnappyCodec'); + +.. warning:: + + Be careful when using ``\n`` as the field delimiter because *TEXT* format tables use ``\n`` as the line delimiter. + At the moment, Tajo does not provide a way to specify the line delimiter. + +========================================= +Custom (De)serializer +========================================= + +The *TEXT* format not only provides reading and writing interfaces for text data but also allows users to process custom +plain-text file formats with user-defined (De)serializer classes. +For example, with custom (de)serializers, Tajo can process JSON or any specialized plain-text file format. + +In order to specify a custom (De)serializer, set the physical property ``text.serde``. +The property value should be a fully qualified class name. + +For example: + +.. code-block:: sql + + CREATE TABLE table1 ( + id int, + name text, + score float, + type text + ) USING TEXT WITH ('text.serde'='org.my.storage.CustomSerializerDeserializer') + + +========================================= +Null Value Handling Issues +========================================= +By default, the ``NULL`` character in the *TEXT* format is an empty string ``''``. +In other words, an empty field is recognized as a ``NULL`` value in Tajo. +If a field domain is ``TEXT``, an empty field is recognized as a string value ``''`` instead of a ``NULL`` value. +You can also use your own ``NULL`` character by specifying the physical property ``text.null``. + +========================================= +Compatibility Issues with Apache Hive™ +========================================= + +*TEXT* tables generated in Tajo can be processed directly by Apache Hive™ without further processing.
+In this section, we explain some compatibility issues for users who use both Hive and Tajo. + +If you set a custom field delimiter, the *TEXT* tables cannot be directly used in Hive. +In order to specify the custom field delimiter in Hive, you need to use the ``ROW FORMAT DELIMITED FIELDS TERMINATED BY`` +clause in Hive's ``CREATE TABLE`` statement as follows: + +.. code-block:: sql + + CREATE TABLE table1 (id int, name string, score float, type string) + ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' + STORED AS TEXTFILE + +To the best of our knowledge, there is no way to specify a custom ``NULL`` character in Hive.
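The delimiter and ``NULL``-handling rules documented in the new text.rst can be sketched as a small Python helper. This is purely illustrative (Tajo's actual TEXT reader is implemented in Java); the function name and the simplified custom-``NULL`` behavior are assumptions, not Tajo code.

```python
# Illustrative sketch of Tajo TEXT-format field parsing (NOT Tajo's implementation).
# Rules from the docs: fields are split on 'text.delimiter' (default '|');
# a field equal to the NULL character (default '') is NULL, except that for
# TEXT columns an empty field stays the empty string ''.

def parse_line(line, schema, delimiter="|", null_char=""):
    """Split one TEXT-format row and apply per-column NULL handling."""
    values = []
    for field, col_type in zip(line.split(delimiter), schema):
        # A field matching the NULL character becomes NULL, unless the NULL
        # character is the default '' and the column type is TEXT.
        if field == null_char and (null_char != "" or col_type != "TEXT"):
            values.append(None)
        else:
            values.append(field)
    return values

print(parse_line("|x||", ["INT", "TEXT", "FLOAT", "TEXT"]))
# → [None, 'x', None, '']  (empty INT/FLOAT fields are NULL; empty TEXT is '')
```

With a custom ``text.null`` such as ``\\N`` (Hive's default), the explicit NULL character maps to NULL even for TEXT columns in this sketch.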
