GitHub user jianqiao opened a pull request: https://github.com/apache/incubator-quickstep/pull/19
Improve text scan operator

This PR updates the `TextScanOperator` to improve its performance. There are three main changes:

(1) Pass `text_offset` and `text_segment_size` as parameters to each `TextScanWorkOrder` instead of loading the data up front. Each `TextScanWorkOrder` then reads its own piece of data directly from disk.

(2) Avoid extra string copying by passing `const char **` buffer pointers into `parseRow()` and `extractFieldString()`.

(3) Use `ColumnVectorsValueAccessor` as the temporary container for the parsed tuples, then call `output_destination_->bulkInsertTuples()` to insert them in bulk.

**Note 1:** This updated version follows the semantics of the old `TextScanOperator`, except that it does not support backslash + newline escaping, e.g.

(a)
```
aaaa\
bbbb
```
which is semantically equivalent to

(b)
```
aaaa\nbbbb
```

The updated version supports (b) but not (a), because (a) requires extra logic that complicates the code. Meanwhile, format (a) seems to be specific to PostgreSQL, and the [documentation](http://www.postgresql.org/docs/9.6/static/sql-copy.html) of PostgreSQL 9.6 says:

_It is strongly recommended that applications generating COPY data convert data newlines and carriage returns to the \n and \r sequences respectively. At present it is possible to represent a data carriage return by a backslash and carriage return, and to represent a data newline by a backslash and newline. However, these representations might not be accepted in future releases. They are also highly vulnerable to corruption if the COPY file is transferred across different machines (for example, from Unix to Windows or vice versa)._

**Note 2:** This PR relies on the fix from #18 to work correctly when loading `TYPE compressed_columnstore` tables.

**Note 3:** Using 40 workers, the expected loading times on cloudlab machines with the current SQL-benchmark settings are ~465s for SSB SF100 and ~1050s for TPCH SF100.
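As a rough illustration of changes (1) and (2), here is a minimal, self-contained C++ sketch. It is not the code in this PR: `SegmentSpec`, `planSegments()`, `readSegment()`, and the simplified `extractField()` / `parseRow()` signatures are hypothetical names chosen for the example. The sketch also ignores aligning segments to row boundaries, and treating a raw newline as always ending the row is an assumption made here to mirror Note 1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for the (text_offset, text_segment_size) pair that
// each work order receives instead of a pre-loaded blob of text.
struct SegmentSpec {
  std::size_t text_offset;
  std::size_t text_segment_size;
};

// Plan the segments of a file of 'file_size' bytes. No data is read here;
// only offsets and sizes are computed.
std::vector<SegmentSpec> planSegments(std::size_t file_size,
                                      std::size_t segment_size) {
  assert(segment_size > 0);
  std::vector<SegmentSpec> specs;
  for (std::size_t offset = 0; offset < file_size; offset += segment_size) {
    const std::size_t remaining = file_size - offset;
    specs.push_back({offset, remaining < segment_size ? remaining : segment_size});
  }
  return specs;
}

// Executed by the worker: read only this segment's bytes from disk.
// Returns an empty string if the file cannot be opened.
std::string readSegment(const char *filename, const SegmentSpec &spec) {
  std::FILE *file = std::fopen(filename, "rb");
  if (file == nullptr) {
    return std::string();
  }
  std::string buffer(spec.text_segment_size, '\0');
  std::fseek(file, static_cast<long>(spec.text_offset), SEEK_SET);
  const std::size_t bytes_read = std::fread(&buffer[0], 1, buffer.size(), file);
  std::fclose(file);
  buffer.resize(bytes_read);
  return buffer;
}

// Extract one field from a NUL-terminated buffer, advancing *cursor in place
// so the row never has to be copied out first. Returns true if the field was
// the last one in its row (raw newline or end of buffer).
bool extractField(const char **cursor, char delimiter, std::string *field) {
  const char *p = *cursor;
  field->clear();
  bool end_of_row = true;
  while (*p != '\0') {
    char c = *p++;
    if (c == delimiter) {
      end_of_row = false;
      break;
    }
    if (c == '\n') {
      break;  // Assumption: a raw newline always terminates the row.
    }
    // Escape sequences such as "\n" (two characters) are decoded, but a
    // backslash directly before a raw newline is not treated as an escape.
    if (c == '\\' && *p != '\0' && *p != '\n') {
      const char escaped = *p++;
      switch (escaped) {
        case 'n': c = '\n'; break;
        case 't': c = '\t'; break;
        default:  c = escaped; break;
      }
    }
    field->push_back(c);
  }
  *cursor = p;
  return end_of_row;
}

// Parse one row starting at *cursor; on return *cursor points at the next row.
std::vector<std::string> parseRow(const char **cursor, char delimiter) {
  std::vector<std::string> fields;
  std::string field;
  bool end_of_row = false;
  while (!end_of_row) {
    end_of_row = extractField(cursor, delimiter, &field);
    fields.push_back(field);
  }
  return fields;
}
```

In this sketch, a caller would plan the segments once, hand one `SegmentSpec` to each worker, and have the worker call `readSegment()` followed by repeated `parseRow(&cursor, delimiter)` calls until `*cursor` reaches the terminating NUL.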
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/incubator-quickstep improve-text-scan-operator-column-vectors

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-quickstep/pull/19.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19

----
commit 55b06fab1bd336f2cc7ee4bd557d3328a428e4ab
Author: Jianqiao Zhu <jianq...@cs.wisc.edu>
Date:   2016-06-09T08:18:37Z

    Improve text scan operator

----