Repository: tajo
Updated Branches:
  refs/heads/master 95f708ac9 -> e5b30e542


TAJO-1486: Text file should support to skip header rows when creating external 
table. (Contributed by Jongyoung Park. Committed by jinho)

Closes #611

Signed-off-by: Jinho Kim <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/tajo/repo
Commit: http://git-wip-us.apache.org/repos/asf/tajo/commit/e5b30e54
Tree: http://git-wip-us.apache.org/repos/asf/tajo/tree/e5b30e54
Diff: http://git-wip-us.apache.org/repos/asf/tajo/diff/e5b30e54

Branch: refs/heads/master
Commit: e5b30e542a409ec0378a787c76f6387fd3ca84a9
Parents: 95f708a
Author: Jongyoung Park <[email protected]>
Authored: Wed Jul 22 14:01:16 2015 +0900
Committer: Jinho Kim <[email protected]>
Committed: Wed Jul 22 14:02:35 2015 +0900

----------------------------------------------------------------------
 CHANGES                                         |  3 ++
 .../apache/tajo/storage/StorageConstants.java   |  3 ++
 .../src/main/sphinx/table_management/text.rst   | 27 +++++-----
 .../tajo/storage/text/DelimitedTextFile.java    | 24 ++++++---
 .../tajo/storage/TestDelimitedTextFile.java     | 53 ++++++++++++++++++++
 .../TestDelimitedTextFile/testNormal.json       |  6 +++
 .../dataset/TestDelimitedTextFile/testSkip.txt  |  7 +++
 7 files changed, 105 insertions(+), 18 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/tajo/blob/e5b30e54/CHANGES
----------------------------------------------------------------------
diff --git a/CHANGES b/CHANGES
index 1c01e2a..6001893 100644
--- a/CHANGES
+++ b/CHANGES
@@ -4,6 +4,9 @@ Release 0.11.0 - unreleased
 
   NEW FEATURES
 
+    TAJO-1486: Text file should support to skip header rows when creating 
+    external table. (Contributed by Jongyoung Park. Committed by jinho)
+
     TAJO-1661: Implement CORR function. (jihoon)
 
     TAJO-1537: Implement a virtual table for sessions. 

http://git-wip-us.apache.org/repos/asf/tajo/blob/e5b30e54/tajo-common/src/main/java/org/apache/tajo/storage/StorageConstants.java
----------------------------------------------------------------------
diff --git a/tajo-common/src/main/java/org/apache/tajo/storage/StorageConstants.java b/tajo-common/src/main/java/org/apache/tajo/storage/StorageConstants.java
index 16cf51d..f68e138 100644
--- a/tajo-common/src/main/java/org/apache/tajo/storage/StorageConstants.java
+++ b/tajo-common/src/main/java/org/apache/tajo/storage/StorageConstants.java
@@ -52,6 +52,9 @@ public class StorageConstants {
   public static final String TEXT_NULL = "text.null";
   public static final String TEXT_SERDE_CLASS = "text.serde";
   public static final String DEFAULT_TEXT_SERDE_CLASS = "org.apache.tajo.storage.text.CSVLineSerDe";
+
+  public static final String TEXT_SKIP_HEADER_LINE = "text.skip.headerlines";
+
   /**
    * It's the maximum number of parsing error torrence.
    *

http://git-wip-us.apache.org/repos/asf/tajo/blob/e5b30e54/tajo-docs/src/main/sphinx/table_management/text.rst
----------------------------------------------------------------------
diff --git a/tajo-docs/src/main/sphinx/table_management/text.rst b/tajo-docs/src/main/sphinx/table_management/text.rst
index 3727b03..4755334 100644
--- a/tajo-docs/src/main/sphinx/table_management/text.rst
+++ b/tajo-docs/src/main/sphinx/table_management/text.rst
@@ -1,6 +1,6 @@
-*************************************
+****
 TEXT
-*************************************
+****
 
 A character-separated values plain-text file represents a tabular data set consisting of rows and columns.
 Each row is a plan-text line. A line is usually broken by a character line feed ``\n`` or carriage-return ``\r``.
@@ -8,9 +8,9 @@ The line feed ``\n`` is the default delimiter in Tajo. Each record consists of m
 some other character or string, most commonly a literal vertical bar ``|``, comma ``,`` or tab ``\t``.
 The vertical bar is used as the default field delimiter in Tajo.
 
-=========================================
+============================
 How to Create a TEXT Table ?
-=========================================
+============================
 
 If you are not familiar with the ``CREATE TABLE`` statement, please refer to the Data Definition Language :doc:`/sql_language/ddl`.
 
@@ -27,9 +27,9 @@ statement. The below is an example statement for creating a table using *TEXT* f
     type text
   ) USING TEXT;
 
-=========================================
+===================
 Physical Properties
-=========================================
+===================
 
 Some table storage formats provide parameters for enabling or disabling features and adjusting physical parameters.
 The ``WITH`` clause in the CREATE TABLE statement allows users to set those parameters.
@@ -42,10 +42,13 @@ The ``WITH`` clause in the CREATE TABLE statement allows users to set those para
 * ``text.serde``: custom (De)serializer class. ``org.apache.tajo.storage.text.CSVLineSerDe`` is the default (De)serializer class.
 * ``timezone``: the time zone that the table uses for writting. When table rows are read or written, ```timestamp``` and ```time``` column values are adjusted by this timezone if it is set. Time zone can be an abbreviation form like 'PST' or 'DST'. Also, it accepts an offset-based form like 'UTC+9' or a location-based form like 'Asia/Seoul'.
 * ``text.error-tolerance.max-num``: the maximum number of permissible parsing errors. This value should be an integer value. By default, ``text.error-tolerance.max-num`` is ``0``. According to the value, parsing errors will be handled in different ways.
+
   * If ``text.error-tolerance.max-num < 0``, all parsing errors are ignored.
   * If ``text.error-tolerance.max-num == 0``, any parsing error is not allowed. If any error occurs, the query will be failed. (default)
   * If ``text.error-tolerance.max-num > 0``, the given number of parsing errors in each task will be pemissible.
 
+* ``text.skip.headerlines``: Number of header lines to be skipped. Some text files often have a header which has a kind of metadata(e.g.: column names), thus this option can be useful.
+
 The following example is to set a custom field delimiter, ``NULL`` character, and compression codec:
 
 .. code-block:: sql
@@ -64,9 +67,9 @@ The following example is to set a custom field delimiter, ``NULL`` character, an
   Be careful when using ``\n`` as the field delimiter because *TEXT* format tables use ``\n`` as the line delimiter.
   At the moment, Tajo does not provide a way to specify the line delimiter.
 
-=========================================
+=====================
 Custom (De)serializer
-=========================================
+=====================
 
 The *TEXT* format not only provides reading and writing interfaces for text data but also allows users to process custom
 plan-text file formats with user-defined (De)serializer classes.
@@ -87,17 +90,17 @@ For example:
  ) USING TEXT WITH ('text.serde'='org.my.storage.CustomSerializerDeserializer')
 
 
-=========================================
+==========================
 Null Value Handling Issues
-=========================================
+==========================
 In default, ``NULL`` character in *TEXT* format is an empty string ``''``.
 In other words, an empty field is basically recognized as a ``NULL`` value in Tajo.
 If a field domain is ``TEXT``, an empty field is recognized as a string value ``''`` instead of ``NULL`` value.
 Besides, You can also use your own ``NULL`` character by specifying a physical property ``text.null``.
 
-=========================================
+======================================
 Compatibility Issues with Apache Hive™
-=========================================
+======================================
 
 *TEXT* tables generated in Tajo can be processed directly by Apache Hive™ without further processing.
 In this section, we explain some compatibility issue for users who use both Hive and Tajo.

http://git-wip-us.apache.org/repos/asf/tajo/blob/e5b30e54/tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java
----------------------------------------------------------------------
diff --git a/tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java b/tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java
index 2aa6707..fdeba4e 100644
--- a/tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java
+++ b/tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java
@@ -48,7 +48,6 @@ import java.io.BufferedOutputStream;
 import java.io.DataOutputStream;
 import java.io.FileNotFoundException;
 import java.io.IOException;
-import java.util.Arrays;
 import java.util.Map;
 import java.util.concurrent.ConcurrentHashMap;
 
@@ -327,8 +326,23 @@ public class DelimitedTextFile {
         LOG.debug("DelimitedTextFileScanner open:" + fragment.getPath() + "," + startOffset + "," + endOffset);
       }
 
+      // skip first line if it reads from middle of file
       if (startOffset > 0) {
-        reader.readLine();  // skip first line;
+        reader.readLine();
+      } else { // skip header lines if it is defined
+
+        // initialization for skipping header(max 20)
+        int headerLineNum = Math.min(Integer.parseInt(meta.getOption(StorageConstants.TEXT_SKIP_HEADER_LINE, "0")), 20);
+        if (headerLineNum > 0) {
+          LOG.info(String.format("Skip %d header lines", headerLineNum));
+          for (int i = 0; i < headerLineNum; i++) {
+            if (!reader.isReadable()) {
+              return;
+            }
+
+            reader.readLine();
+          }
+        }
       }
 
       deserializer = getLineSerde().createDeserializer(schema, meta, targets);
@@ -391,7 +405,7 @@ public class DelimitedTextFile {
 
           try {
             deserializer.deserialize(buf, tuple);
-            // if a line is read normally, it exists this loop.
+            // if a line is read normally, it exits this loop.
             break;
 
           } catch (TextLineParsingError tae) {
@@ -400,7 +414,7 @@ public class DelimitedTextFile {
 
             // suppress too many log prints, which probably cause performance degradation
             if (errorNum < errorPrintOutMaxNum) {
-              LOG.warn("Ignore JSON Parse Error (" + errorNum + "): ", tae);
+              LOG.warn("Ignore Text Parse Error (" + errorNum + "): ", tae);
             }
 
             // Only when the maximum error torrence limit is set (i.e., errorTorrenceMaxNum >= 0),
@@ -409,9 +423,7 @@ public class DelimitedTextFile {
             if (errorTorrenceMaxNum >= 0 && errorNum > errorTorrenceMaxNum) {
               throw tae;
             }
-            continue;
           }
-
         } while (reader.isReadable()); // continue until EOS
 
         // recordCount means the number of actual read records. We increment the count here.

http://git-wip-us.apache.org/repos/asf/tajo/blob/e5b30e54/tajo-storage/tajo-storage-hdfs/src/test/java/org/apache/tajo/storage/TestDelimitedTextFile.java
----------------------------------------------------------------------
diff --git a/tajo-storage/tajo-storage-hdfs/src/test/java/org/apache/tajo/storage/TestDelimitedTextFile.java b/tajo-storage/tajo-storage-hdfs/src/test/java/org/apache/tajo/storage/TestDelimitedTextFile.java
index ba3a5a8..90bec65 100644
--- a/tajo-storage/tajo-storage-hdfs/src/test/java/org/apache/tajo/storage/TestDelimitedTextFile.java
+++ b/tajo-storage/tajo-storage-hdfs/src/test/java/org/apache/tajo/storage/TestDelimitedTextFile.java
@@ -179,4 +179,57 @@ public class TestDelimitedTextFile {
       scanner.close();
     }
   }
+
+  @Test
+  public void testSkippingHeaderWithJson() throws IOException {
+    TableMeta meta = CatalogUtil.newTableMeta("JSON");
+    meta.putOption(StorageConstants.TEXT_SKIP_HEADER_LINE, "2");
+    FileFragment fragment = getFileFragment("testNormal.json");
+    Scanner scanner = TablespaceManager.getLocalFs().getScanner(meta, schema, fragment);
+
+    scanner.init();
+
+    int lines = 0;
+
+    try {
+      while (true) {
+        Tuple tuple = scanner.next();
+        if (tuple != null) {
+          assertEquals(19+lines, tuple.getInt2(2));
+          lines++;
+        }
+        else break;
+      }
+    } finally {
+      assertEquals(4, lines);
+      scanner.close();
+    }
+  }
+
+  @Test
+  public void testSkippingHeaderWithText() throws IOException {
+    TableMeta meta = CatalogUtil.newTableMeta("TEXT");
+    meta.putOption(StorageConstants.TEXT_SKIP_HEADER_LINE, "1");
+    meta.putOption(StorageConstants.TEXT_DELIMITER, ",");
+    FileFragment fragment = getFileFragment("testSkip.txt");
+    Scanner scanner = TablespaceManager.getLocalFs().getScanner(meta, schema, fragment);
+    
+    scanner.init();
+
+    int lines = 0;
+
+    try {
+      while (true) {
+        Tuple tuple = scanner.next();
+        if (tuple != null) {
+          assertEquals(17+lines, tuple.getInt2(2));
+          lines++;
+        }
+        else break;
+      }
+    } finally {
+      assertEquals(6, lines);
+      scanner.close();
+    }
+  }
 }

http://git-wip-us.apache.org/repos/asf/tajo/blob/e5b30e54/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testNormal.json
----------------------------------------------------------------------
diff --git a/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testNormal.json b/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testNormal.json
new file mode 100644
index 0000000..69fcc37
--- /dev/null
+++ b/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testNormal.json
@@ -0,0 +1,6 @@
+{"col1": "true", "col2": "hyunsik", "col3": 17, "col4": 59, "col5": 23, "col6": 77.9, "col7": 271.9, "col8": "hyunsik", "col9": "aHl1bnNpaw==", "col10": "192.168.0.1"}
+{"col1": "true", "col2": "hyunsik", "col3": 18, "col4": 59, "col5": 23, "col6": 77.9, "col7": 271.9, "col8": "hyunsik", "col9": "aHl1bnNpaw==", "col10": "192.168.0.1"}
+{"col1": "true", "col2": "hyunsik", "col3": 19, "col4": 59, "col5": 23, "col6": 77.9, "col7": 271.9, "col8": "hyunsik", "col9": "aHl1bnNpaw==", "col10": "192.168.0.1"}
+{"col1": "true", "col2": "hyunsik", "col3": 20, "col4": 59, "col5": 23, "col6": 77.9, "col7": 271.9, "col8": "hyunsik", "col9": "aHl1bnNpaw==", "col10": "192.168.0.1"}
+{"col1": "true", "col2": "hyunsik", "col3": 21, "col4": 59, "col5": 23, "col6": 77.9, "col7": 271.9, "col8": "hyunsik", "col9": "aHl1bnNpaw==", "col10": "192.168.0.1"}
+{"col1": "true", "col2": "hyunsik", "col3": 22, "col4": 59, "col5": 23, "col6": 77.9, "col7": 271.9, "col8": "hyunsik", "col9": "aHl1bnNpaw==", "col10": "192.168.0.1"}
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/tajo/blob/e5b30e54/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testSkip.txt
----------------------------------------------------------------------
diff --git a/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testSkip.txt b/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testSkip.txt
new file mode 100644
index 0000000..02714bd
--- /dev/null
+++ b/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testSkip.txt
@@ -0,0 +1,7 @@
+col1,col2,col3,col4,col5,col6,col7,col8,col9,col10
+true,hyunsik,17,59,23,77.9,271.9,hyunsik,aH1bnNpaw==,192.168.0.1
+true,hyunsik,18,59,23,77.9,271.9,hyunsik,aH1bnNpaw==,192.168.0.1
+true,hyunsik,19,59,23,77.9,271.9,hyunsik,aH1bnNpaw==,192.168.0.1
+true,hyunsik,20,59,23,77.9,271.9,hyunsik,aH1bnNpaw==,192.168.0.1
+true,hyunsik,21,59,23,77.9,271.9,hyunsik,aH1bnNpaw==,192.168.0.1
+true,hyunsik,22,59,23,77.9,271.9,hyunsik,aH1bnNpaw==,192.168.0.1
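
Illustration only, not part of the commit: a minimal standalone sketch of the header-skipping behaviour the patch adds to the scanner, for readers who want to try the logic outside Tajo. It assumes a plain java.io.BufferedReader in place of Tajo's internal line reader; the class name HeaderSkipSketch and the method readSkippingHeader are hypothetical. The cap of 20 lines and the "0" default mirror the patch, and stopping early on a short file approximates the patch's reader.isReadable() check.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class HeaderSkipSketch {
  // Upper bound mirrored from the patch, which clamps the option value to 20.
  static final int MAX_HEADER_LINES = 20;

  /**
   * Reads all remaining lines after discarding up to optionValue header lines.
   * optionValue plays the role of the 'text.skip.headerlines' table option
   * (hypothetical helper, not Tajo API).
   */
  static List<String> readSkippingHeader(BufferedReader reader, String optionValue) {
    try {
      int headerLineNum = Math.min(Integer.parseInt(optionValue), MAX_HEADER_LINES);
      for (int i = 0; i < headerLineNum; i++) {
        // Stop early if the input has fewer lines than the requested header count.
        if (reader.readLine() == null) {
          break;
        }
      }
      List<String> rows = new ArrayList<>();
      String line;
      while ((line = reader.readLine()) != null) {
        rows.add(line);
      }
      return rows;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Same shape as testSkip.txt: one header line followed by data rows.
    String data = "col1,col2,col3\ntrue,hyunsik,17\ntrue,hyunsik,18\n";
    List<String> rows = readSkippingHeader(new BufferedReader(new StringReader(data)), "1");
    System.out.println(rows);
  }
}
```

As in the patch, a zero or negative option value skips nothing, and a value above 20 is treated as 20.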
