GitHub user jianqiao opened a pull request:

    https://github.com/apache/incubator-quickstep/pull/19

    Improve text scan operator

    This PR updates the `TextScanOperator` to improve its performance.
    
    There are three main changes:
    (1) Pass `text_offset` and `text_segment_size` as parameters to each 
`TextScanWorkOrder` instead of loading the data first. Each 
`TextScanWorkOrder` then reads its own piece of data directly from disk.
    (2) Avoid extra string copying by passing `const char **` buffer pointers 
into `parseRow()` and `extractFieldString()`.
    (3) Use `ColumnVectorsValueAccessor` as the temporary container to store 
the parsed tuples. Then call `output_destination_->bulkInsertTuples()` to bulk 
insert the tuples.
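
As an illustration of change (1), the following is a minimal sketch of reading one segment of a text file given only an offset and a size. The helper name `readTextSegment` and its signature are hypothetical and are not taken from the Quickstep code; only the parameter names `text_offset` and `text_segment_size` come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not the Quickstep API): read the bytes in
// [text_offset, text_offset + text_segment_size) of a file, clamped at EOF.
std::string readTextSegment(const std::string &filename,
                            const std::size_t text_offset,
                            const std::size_t text_segment_size) {
  std::ifstream file(filename, std::ios::binary);
  if (!file) {
    throw std::runtime_error("Cannot open " + filename);
  }
  // Jump straight to the start of this segment; nothing before it is read.
  file.seekg(static_cast<std::streamoff>(text_offset), std::ios::beg);
  std::string segment(text_segment_size, '\0');
  file.read(&segment[0], static_cast<std::streamsize>(text_segment_size));
  // gcount() is smaller than requested when the segment runs past EOF.
  segment.resize(static_cast<std::size_t>(file.gcount()));
  return segment;
}
```

With this shape, a scheduler only needs to hand out (offset, size) pairs, and each worker performs its own disk read.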
    
    **Note 1:** This updated version follows the semantics of the old 
`TextScanOperator` except that it does not support the backslash + newline 
escaping, e.g.
    (a)
    ```
    aaaa\
    bbbb
    ```
    which is semantically equivalent to
    (b)
    ```
    aaaa\nbbbb
    ```
    The updated version supports (b) but not (a), because (a) requires extra 
logic that complicates the code. Moreover, format (a) seems to be specific to 
PostgreSQL, and the 
[documentation](http://www.postgresql.org/docs/9.6/static/sql-copy.html) of 
PostgreSQL 9.6 says:
    _It is strongly recommended that applications generating COPY data convert 
data newlines and carriage returns to the \n and \r sequences respectively. At 
present it is possible to represent a data carriage return by a backslash and 
carriage return, and to represent a data newline by a backslash and newline. 
However, these representations might not be accepted in future releases. They 
are also highly vulnerable to corruption if the COPY file is transferred across 
different machines (for example, from Unix to Windows or vice versa)._
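
To make the distinction between (a) and (b) concrete, here is a minimal sketch of field extraction that advances a `const char **` cursor, as in change (2), and decodes the two-character `\n` sequence of form (b). The function `extractFieldSketch`, its signature, and the handling of `\t`, `\r`, and other escaped characters are assumptions made for illustration; this is not the actual `extractFieldString()`. In this sketch a literal newline always ends the row, so form (a) is split into two rows.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch (not the Quickstep implementation): extract one field
// starting at *field_ptr and advance *field_ptr past the field terminator.
// *is_end_of_row is set to true when a newline or the end of the buffer
// terminated the field, and to false when the delimiter did.
std::string extractFieldSketch(const char **field_ptr,
                               const char *buffer_end,
                               const char delimiter,
                               bool *is_end_of_row) {
  std::string field;
  const char *cur = *field_ptr;
  *is_end_of_row = true;  // Holds unless a delimiter is found below.
  while (cur != buffer_end) {
    const char c = *cur++;
    if (c == '\\' && cur != buffer_end && *cur != '\n') {
      // Two-character escape sequence such as "\n" (form (b)).
      const char escaped = *cur++;
      switch (escaped) {
        case 'n': field.push_back('\n'); break;
        case 't': field.push_back('\t'); break;
        case 'r': field.push_back('\r'); break;
        default:  field.push_back(escaped); break;  // e.g. "\\" or "\|"
      }
    } else if (c == delimiter) {
      *is_end_of_row = false;
      break;
    } else if (c == '\n') {
      break;  // A literal newline always ends the row (form (a) unsupported).
    } else {
      field.push_back(c);
    }
  }
  *field_ptr = cur;  // The caller continues from here without any copying.
  return field;
}
```

Because the cursor is passed by pointer, the caller can invoke the function repeatedly on the same buffer to walk through a row field by field.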
    
    **Note 2:** This PR relies on the fix from #18 to work correctly for 
loading `TYPE compressed_columnstore` tables.
    
    **Note 3:** Using 40 workers, the expected loading times on CloudLab 
machines with the current SQL-benchmark settings are ~465s for SSB SF100 and 
~1050s for TPCH SF100.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/incubator-quickstep improve-text-scan-operator-column-vectors

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-quickstep/pull/19.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19
    
----
commit 55b06fab1bd336f2cc7ee4bd557d3328a428e4ab
Author: Jianqiao Zhu <jianq...@cs.wisc.edu>
Date:   2016-06-09T08:18:37Z

    Improve text scan operator

----

