http://git-wip-us.apache.org/repos/asf/incubator-trafodion/blob/33d796d6/docs/cqd_reference/src/asciidoc/_chapters/schema_controls.adoc ---------------------------------------------------------------------- diff --git a/docs/cqd_reference/src/asciidoc/_chapters/schema_controls.adoc b/docs/cqd_reference/src/asciidoc/_chapters/schema_controls.adoc new file mode 100644 index 0000000..0112b9a --- /dev/null +++ b/docs/cqd_reference/src/asciidoc/_chapters/schema_controls.adoc @@ -0,0 +1,49 @@ +//// +/** +* @@@ START COPYRIGHT @@@ +* +* Licensed to the Apache Software Foundation (ASF) under one +* or more contributor license agreements. See the NOTICE file +* distributed with this work for additional information +* regarding copyright ownership. The ASF licenses this file +* to you under the Apache License, Version 2.0 (the +* "License"); you may not use this file except in compliance +* with the License. You may obtain a copy of the License at +* +* http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, +* software distributed under the License is distributed on an +* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +* KIND, either express or implied. See the License for the +* specific language governing permissions and limitations +* under the License. +* +* @@@ END COPYRIGHT @@@ + */ +//// + +[[schema-controls]] += Schema Controls + +This section describes CQDs that are used for schema controls. + +[[schema]] +== Schema + +[cols="25%h,75%"] +|=== +| *Category* | Influencing Query Plans +| *Description* | Sets the default schema for the session. +| *Values* | SQL identifier. + + + +The default is *'TRAFODION'*. +| *Usage* | A SET SCHEMA statement or a CONTROL QUERY DEFAULT SCHEMA statement can be used to override the default schema name. +| *Production Usage* | It is a convenience so you do not have to type in two-part names. +| *Impact* | Not applicable. +| *Level* | Any. +| *Conflicts/Synergies* | Alternately you can use the SET SCHEMA statement. +| *Real Problem Addressed* | Not applicable. +| *Introduced In Release* | Trafodion 1.3.0. +| *Deprecated In Release* | Not applicable. +|===
http://git-wip-us.apache.org/repos/asf/incubator-trafodion/blob/33d796d6/docs/cqd_reference/src/asciidoc/_chapters/transactions.adoc ---------------------------------------------------------------------- diff --git a/docs/cqd_reference/src/asciidoc/_chapters/transactions.adoc b/docs/cqd_reference/src/asciidoc/_chapters/transactions.adoc new file mode 100644 index 0000000..5f590fb --- /dev/null +++ b/docs/cqd_reference/src/asciidoc/_chapters/transactions.adoc @@ -0,0 +1,94 @@ +//// +/** +* @@@ START COPYRIGHT @@@ +* +* Licensed to the Apache Software Foundation (ASF) under one +* or more contributor license agreements. See the NOTICE file +* distributed with this work for additional information +* regarding copyright ownership. The ASF licenses this file +* to you under the Apache License, Version 2.0 (the +* "License"); you may not use this file except in compliance +* with the License. You may obtain a copy of the License at +* +* http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, +* software distributed under the License is distributed on an +* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +* KIND, either express or implied. See the License for the +* specific language governing permissions and limitations +* under the License. +* +* @@@ END COPYRIGHT @@@ + */ +//// + +[[transaction-control-and-locking]] += Transaction Control and Locking + +This section describes CQDs that are used for transaction control and locking. + +[[isolation-level]] +== ISOLATION_LEVEL + +[cols="25%h,75%"] +|=== +| *Category* | Transaction Control and Locking +| *Description* | Specifies the default transaction isolation level that queries use. +| *Values* | *'READ UNCOMMITTED'*, *'READ COMMITTED'*, *'REPEATABLE READ'*, or *'SERIALIZABLE'*. + + + +The default value is *'READ COMMITTED'* (ANSI). +| *Usage* | If you use uncommitted access (reading "dirty" data when queries are accessing data that is +being simultaneously updated), then you can set the default isolation level as READ UNCOMMITTED. The default isolation level +of READ COMMITTED can cause concurrency issues because reads would wait on locked rows. If rows are locked by long-running +transactions with infrequent commits, this can cause severe concurrency issues for SELECT queries. See Conflicts/Synergies. +| *Production Usage* | Not applicable. +| *Impact* | Using this CQD has implications on locking and concurrency. + + + +If set to READ UNCOMMITTED, then select queries read through locks and don't have to wait on locks. But they won't see committed consistent data. + + + +If set to READ COMMITTED (the default setting), then the reads wait on locked rows before they proceed with the scan. +The read can proceed only when the rows locked by another transaction are released after that transaction commits. The reader does not lock rows. + + + +If set to REPEATABLE READ or SERIALIZABLE, then it has severe implications on concurrency because every row read is also locked. +| *Level* | While you can use this at a query or a service level, the most common use is a system-wide setting. +If query tools are being used, then the query level setting cannot be used. + + + +A service level setting may provide uncommitted access to certain users while providing the default committed access to the other users, +depending which users need to see consistent data. + + + +If however, access to tables during updates is well controlled and read uncommitted is acceptable, then this can be set at the system level. +| *Conflicts/Synergies* | The problem with using READ UNCOMMITTED as the isolation level default value is that in a SET TRANSACTION statement, +the only possible access mode is READ ONLY. Any query that attempts to update the database would fail. + + + +To facilitate updates and DDL statements while the isolation level is set to READ UNCOMMITTED, a new default attribute ISOLATION_LEVEL_FOR_UPDATES +is provided. This default attribute specifies the isolation level for update and DDL statements. If not specified, or if not present in the +SYSTEM_DEFAULTS table, the default value is the same as the ISOLATION_LEVEL default attribute. However, if specified or present in the SYSTEM_DEFAULTS table, +then its value is used as the isolation level for updates and DDL statements. UPDATE in ISOLATION_LEVEL_FOR_UPDATES refers to INSERT, UPDATE, and DELETE statements. +| *Real Problem Addressed* | Not applicable. +| *Introduced In Release* | Trafodion 1.3.0. +| *Deprecated In Release* | Not applicable. +|=== + +<<< +[[isolation-level-for-updates]] +== ISOLATION_LEVEL_FOR_UPDATES + +[cols="25%h,75%"] +|=== +| *Category* | Transaction Control and Locking +| *Description* | Specifies the default transaction isolation level for these update operations: INSERT, UPDATE, or DELETE. +| *Values* | *'READ UNCOMMITTED'*, *'READ COMMITTED'*, *'REPEATABLE READ'*, or *'SERIALIZABLE'*. + + + +The default value is *'READ COMMITTED'* (ANSI). +| *Usage* | Set this CQD to READ UNCOMMITTED to prevent users from performing any updates. +| *Production Usage* | Not applicable. +| *Impact* | When set, this CQD prevents users from doing any of these update operations: INSERT, UPDATE, or DELETE. +| *Level* | Service. +| *Conflicts/Synergies* | Works with the ISOLATION_LEVEL setting. Both settings are READ COMMITTED by default. +ISOLATION_LEVEL can be set to READ UNCOMMITTED. This CQD still remains READ COMMITTED. You can change it to READ UNCOMMITED to prevent queries +running at the service level to not perform any updates. +| *Real Problem Addressed* | Not applicable. +| *Introduced In Release* | Trafodion 1.3.0. +| *Deprecated In Release* | Not applicable. +|=== http://git-wip-us.apache.org/repos/asf/incubator-trafodion/blob/33d796d6/docs/cqd_reference/src/asciidoc/_chapters/update_stats.adoc ---------------------------------------------------------------------- diff --git a/docs/cqd_reference/src/asciidoc/_chapters/update_stats.adoc b/docs/cqd_reference/src/asciidoc/_chapters/update_stats.adoc new file mode 100644 index 0000000..51c0204 --- /dev/null +++ b/docs/cqd_reference/src/asciidoc/_chapters/update_stats.adoc @@ -0,0 +1,92 @@ +//// +/** +* @@@ START COPYRIGHT @@@ +* +* Licensed to the Apache Software Foundation (ASF) under one +* or more contributor license agreements. See the NOTICE file +* distributed with this work for additional information +* regarding copyright ownership. The ASF licenses this file +* to you under the Apache License, Version 2.0 (the +* "License"); you may not use this file except in compliance +* with the License. You may obtain a copy of the License at +* +* http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, +* software distributed under the License is distributed on an +* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +* KIND, either express or implied. See the License for the +* specific language governing permissions and limitations +* under the License. +* +* @@@ END COPYRIGHT @@@ + */ +//// + +[[update-statistics-and-reorg]] += Update Statistics and Reorg + +[[ustat-max-read-age-in-min]] +== USTAT_MAX_READ_AGE_IN_MIN + +[cols="25%h,75%"] +|=== +| *Category* | Update Statistics and Reorg +| *Description* | When performing update statistics with the NECESSARY keyword or with automation, this is +the number of minutes that are allowed to have elapsed since a histogram was marked as read for it to be regenerated. + + + +Histograms that were marked more than USTAT_MAX_READ_AGE_IN_MIN minutes ago are not regenerated. +| *Values* | 0 through max unsigned integer. + + + +Setting this CQD to a value less than 2 * USTAT_AUTOMATION_INTERVAL is silently ignored and +2 * USTAT_AUTOMATION_INTERVAL (default value of 2880) is used. + + + +The default value is 5760 (4 days). +| *Usage* | Influences how frequently the histograms for a table are regenerated. +If a table is being used frequently, then chances are that its histograms are also be considered for update frequently. +However, if a table is not used frequently, then this CQD influences how frequently the histograms for that table are updated. + + + +A smaller setting reduces the number of histograms being updated if there are many tables that have not been used within that interval. +A larger setting updates histogram for many more tables that are not being accessed that often. +| *Production Usage* | Not applicable. +| *Impact* | Influences the number of histograms that need to be regenerated and therefore the time it takes for +update statistics automation to regenerate histograms for all the tables that so qualify. +| *Level* | System. +| *Conflicts/Synergies* | USTAT_AUTOMATION_INTERVAL sets a lower bound on this CQD. + + + +It is influenced by USTAT_AUTO_READTIME_UPDATE_INTERVAL, which influences how often READ_TIME is updated for the histogram. +This CQD qualifies the histogram to be regenerated; it is the maximum time since READ_TIME was updated. $$$ +| *Real Problem Addressed* | Not applicable. +| *Introduced In Release* | Trafodion 1.3.0. +| *Deprecated In Release* | Not applicable. +|=== + +<<< +[[ustat-min-rowcount-for-sample]] +== USTAT_MIN_ROWCOUNT_FOR_SAMPLE + +[cols="25%h,75%"] +|=== +| *Category* | Update Statistics and Reorg +| *Description* | Sets the minimum rows that need to be in a table before sampling is used to update statistics for that table. +If a table has a fewer rows than the value of this CQD, then the SAMPLE option is silently ignored when performing update statistics. +| *Values* | 1 through max unsigned integer. + + + +The default value is 10000. +| *Usage* | Influences for what tables sampling is used for histogram statistics generation. If the setting is smaller, +then more tables qualify for sampling. If the setting is larger, then fewer tables qualify for sampling. Sampling can result in faster update +statistics run times. But for smaller tables, it can also result in poor histogram accuracy which could result in poor query plans. +| *Production Usage* | Not applicable. +| *Impact* | Setting this CQD to a smaller value means that sampling is used for tables with fewer rows, when the SAMPLE option is +specified as part of update statistics. This can result in less accurate histograms and poor query plans, because the sample size may be too small to +generate good estimates for histograms. + + + +Setting this CQD to a larger value can result in sampling not being used for many tables and therefore longer update statistics run times. +However, these tables may also have more accurate histograms. +| *Level* | System. +| *Conflicts/Synergies* | Not applicable. +| *Real Problem Addressed* | Not applicable. +| *Introduced In Release* | Trafodion 1.3.0. +| *Deprecated In Release* | Not applicable. +|=== + http://git-wip-us.apache.org/repos/asf/incubator-trafodion/blob/33d796d6/docs/cqd_reference/src/asciidoc/index.adoc ---------------------------------------------------------------------- diff --git a/docs/cqd_reference/src/asciidoc/index.adoc b/docs/cqd_reference/src/asciidoc/index.adoc new file mode 100644 index 0000000..f67e482 --- /dev/null +++ b/docs/cqd_reference/src/asciidoc/index.adoc @@ -0,0 +1,69 @@ +//// +* @@@ START COPYRIGHT @@@ +* +* Licensed to the Apache Software Foundation (ASF) under one +* or more contributor license agreements. See the NOTICE file +* distributed with this work for additional information +* regarding copyright ownership. The ASF licenses this file +* to you under the Apache License, Version 2.0 (the +* "License"); you may not use this file except in compliance +* with the License. You may obtain a copy of the License at +* +* http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, +* software distributed under the License is distributed on an +* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +* KIND, either express or implied. See the License for the +* specific language governing permissions and limitations +* under the License. +* +* @@@ END COPYRIGHT @@@ +//// + += Control Query Default (CQD) Reference Guide +:doctype: book +:numbered: +:toc: left +:toclevels: 3 +:toc-title: Table of Contents +:icons: font +:iconsdir: icons +:experimental: +:source-language: text +:revnumber: {project-version} +:title-logo-image: ../../../shared/trafodion-logo.jpg + +:images: ../images +:leveloffset: 1 + +// The directory is called _chapters because asciidoctor skips direct +// processing of files found in directories starting with an _. This +// prevents each chapter being built as its own book. + +**License Statement** + +Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file +distributed with this work for additional information regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the +specific language governing permissions and limitations under the License. + +[cols="2",options="header"] +|=== +| Version | Date +| 1.3.0 | January, 2016 +|=== + +include::asciidoc/_chapters/about.adoc[] +include::asciidoc/_chapters/introduction.adoc[] +include::asciidoc/_chapters/query_plans.adoc[] +include::asciidoc/_chapters/histograms.adoc[] +include::asciidoc/_chapters/transactions.adoc[] +include::asciidoc/_chapters/runtime_controls.adoc[] +include::asciidoc/_chapters/schema_controls.adoc[] +include::asciidoc/_chapters/update_stats.adoc[] +include::asciidoc/_chapters/operational_controls.adoc[] http://git-wip-us.apache.org/repos/asf/incubator-trafodion/blob/33d796d6/docs/load_transform/pom.xml ---------------------------------------------------------------------- diff --git a/docs/load_transform/pom.xml b/docs/load_transform/pom.xml new file mode 100644 index 0000000..ef00fd5 --- /dev/null +++ b/docs/load_transform/pom.xml @@ -0,0 +1,289 @@ +<?xml version="1.0" encoding="UTF-8"?> +<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" + xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd"> + <!-- +* @@@ START COPYRIGHT @@@ +* +* Licensed to the Apache Software Foundation (ASF) under one +* or more contributor license agreements. See the NOTICE file +* distributed with this work for additional information +* regarding copyright ownership. The ASF licenses this file +* to you under the Apache License, Version 2.0 (the +* "License"); you may not use this file except in compliance +* with the License. You may obtain a copy of the License at +* +* http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, +* software distributed under the License is distributed on an +* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +* KIND, either express or implied. See the License for the +* specific language governing permissions and limitations +* under the License. +* +* @@@ END COPYRIGHT @@@ +--> + <modelVersion>4.0.0</modelVersion> + <groupId>org.apache.trafodion</groupId> + <artifactId>load-transform-guide</artifactId> + <version>${env.TRAFODION_VER}</version> + <packaging>pom</packaging> + <name>Trafodion Load and Transform Guide</name> + <description>This guide describes how to load and transform data into Trafodion.</description> + <url>http://trafodion.incubator.apache.org</url> + <inceptionYear>2015</inceptionYear> + + <parent> + <groupId>org.apache.trafodion</groupId> + <artifactId>trafodion</artifactId> + <relativePath>../../pom.xml</relativePath> + <version>1.3.0</version> + </parent> + + + <licenses> + <license> + <name>The Apache Software License, Version 2.0</name> + <url>http://www.apache.org/licenses/LICENSE-2.0.txt</url> + <distribution>repo</distribution> + <comments>A business-friendly OSS license</comments> + </license> + </licenses> + + <organization> + <name>Apache Software Foundation</name> + <url>http://www.apache.org</url> + </organization> + + <issueManagement> + <system>JIRA</system> + <url>http://issues.apache.org/jira/browse/TRAFODION</url> + </issueManagement> + + <scm> + <connection>scm:git:http://git-wip-us.apache.org/repos/asf/incubator-trafodion.git</connection> + <developerConnection>scm:git:https://git-wip-us.apache.org/repos/asf/incubator-trafodion.git</developerConnection> + <url>https://git-wip-us.apache.org/repos/asf?p=incubator-trafodion.git</url> + <tag>HEAD</tag> + </scm> + + <ciManagement> + <system>Jenkins</system> + <url>https://jenkins.esgyn.com</url> + </ciManagement> + + <properties> + <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> + <asciidoctor.maven.plugin.version>1.5.2.1</asciidoctor.maven.plugin.version> + <asciidoctorj.pdf.version>1.5.0-alpha.11</asciidoctorj.pdf.version> + <asciidoctorj.version>1.5.4</asciidoctorj.version> + <rubygems.prawn.version>2.0.2</rubygems.prawn.version> + <jruby.version>9.0.4.0</jruby.version> + </properties> + + <repositories> + <repository> + <id>rubygems-proxy-releases</id> + <name>RubyGems.org Proxy (Releases)</name> + <url>http://rubygems-proxy.torquebox.org/releases</url> + <releases> + <enabled>true</enabled> + </releases> + <snapshots> + <enabled>false</enabled> + </snapshots> + </repository> + </repositories> + + <dependencies> + <dependency> + <groupId>rubygems</groupId> + <artifactId>prawn</artifactId> + <version>${rubygems.prawn.version}</version> + <type>gem</type> + <scope>provided</scope> + </dependency> + <dependency> + <groupId>org.jruby</groupId> + <artifactId>jruby-complete</artifactId> + <version>${jruby.version}</version> + </dependency> + <dependency> + <groupId>org.asciidoctor</groupId> + <artifactId>asciidoctorj</artifactId> + <version>${asciidoctorj.version}</version> + </dependency> + </dependencies> + + <build> + <plugins> + <plugin> + <groupId>de.saumya.mojo</groupId> + <artifactId>gem-maven-plugin</artifactId> + <version>1.0.10</version> + <configuration> + <!-- align JRuby version with AsciidoctorJ to avoid redundant downloading --> + <jrubyVersion>${jruby.version}</jrubyVersion> + <gemHome>${project.build.directory}/gems</gemHome> + <gemPath>${project.build.directory}/gems</gemPath> + </configuration> + <executions> + <execution> + <goals> + <goal>initialize</goal> + </goals> + </execution> + </executions> + </plugin> + <plugin> + <groupId>org.apache.maven.plugins</groupId> + <artifactId>maven-resources-plugin</artifactId> + <version>2.7</version> + <configuration> + <encoding>UTF-8</encoding> + <attributes> + <generateReports>false</generateReports> + </attributes> + </configuration> + </plugin> + <plugin> + <groupId>org.asciidoctor</groupId> + <artifactId>asciidoctor-maven-plugin</artifactId> + <version>${asciidoctor.maven.plugin.version}</version> + <dependencies> + <dependency> + <groupId>org.asciidoctor</groupId> + <artifactId>asciidoctorj-pdf</artifactId> + <version>${asciidoctorj.pdf.version}</version> + </dependency> + <dependency> + <groupId>org.asciidoctor</groupId> + <artifactId>asciidoctorj</artifactId> + <version>${asciidoctorj.version}</version> + </dependency> + </dependencies> + <configuration> + <sourceDirectory>${basedir}/src</sourceDirectory> + </configuration> + <executions> + <execution> + <id>generate-html-doc</id> + <goals> + <goal>process-asciidoc</goal> + </goals> + <phase>site</phase> + <configuration> + <doctype>book</doctype> + <backend>html5</backend> + <sourceHighlighter>coderay</sourceHighlighter> + <outputDirectory>${basedir}/target/site</outputDirectory> + <requires> + <require>${basedir}/../shared/google-analytics-postprocessor.rb</require> + </requires> + <attributes> + <!-- Location of centralized stylesheet --> + <stylesheet>${basedir}/../shared/trafodion-manuals.css</stylesheet> + <project-version>${env.TRAFODION_VER}</project-version> + <build-date>${maven.build.timestamp}</build-date> + <google-analytics-account>UA-72491210-1</google-analytics-account> + </attributes> + </configuration> + </execution> + <execution> + <id>generate-pdf-doc</id> + <phase>site</phase> + <goals> + <goal>process-asciidoc</goal> + </goals> + <configuration> + <doctype>book</doctype> + <backend>pdf</backend> + <sourceHighlighter>coderay</sourceHighlighter> + <outputDirectory>${basedir}/target</outputDirectory> + <attributes> + <project-version>${env.TRAFODION_VER}</project-version> + <build-date>${maven.build.timestamp}</build-date> + <pdf-stylesdir>${basedir}/../shared</pdf-stylesdir> + <pdf-style>trafodion</pdf-style> + <icons>font</icons> + <pagenums/> + <toc/> + <idprefix/> + <idseparator>-</idseparator> + </attributes> + </configuration> + </execution> + </executions> + </plugin> + <!-- Rename target/site/index.pdf to client-install-guide.pdf --> + <plugin> + <groupId>org.apache.maven.plugins</groupId> + <artifactId>maven-antrun-plugin</artifactId> + <version>1.8</version> + <inherited>false</inherited> + <executions> + <execution> + <id>populate-release-directories</id> + <phase>post-site</phase> + <configuration> + <target name="Populate Release Directories"> + <!-- The website uses the following organization for the docs/target/docs directory: + - To ensure a known location, the base directory contains the LATEST version of the web book and the PDF files. + - The know location is docs/target/docs/<document> + - target/docs/<version>/<document> contains version-specific renderings of the documents. + - target/docs/<version>/<document> contains the PDF version and the web book. The web book is named index.html + --> + <!-- Copy the PDF file to its target directories --> + <copy file="${basedir}/target/index.pdf" tofile="${basedir}/../target/docs/load_transform_guide/Trafodion_Load_Transform_Guide.pdf" /> + <copy file="${basedir}/target/index.pdf" tofile="${basedir}/../target/docs/${project.version}/load_transform_guide/Trafodion_Load_Transform_Guide.pdf" /> + <!-- Copy the Web Book files to their target directories --> + <copy todir="${basedir}/../target/docs/load_transform_guide"> + <fileset dir="${basedir}/target/site"> + <include name="**/*.*"/> <!--All sub-directories, too--> + </fileset> + </copy> + <copy todir="${basedir}/../target/docs/${project.version}/load_transform_guide"> + <fileset dir="${basedir}/target/site"> + <include name="**/*.*"/> <!--All sub-directories, too--> + </fileset> + </copy> + </target> + </configuration> + <goals> + <goal>run</goal> + </goals> + </execution> + </executions> + </plugin> + </plugins> + </build> + + <!-- Included because this is required. No reports are generated. --> + <reporting> + <excludeDefaults>true</excludeDefaults> + <plugins> + <plugin> + <groupId>org.apache.maven.plugins</groupId> + <artifactId>maven-project-info-reports-plugin</artifactId> + <version>2.8</version> + <reportSets> + <reportSet> + <reports> + </reports> + </reportSet> + </reportSets> + </plugin> + </plugins> + </reporting> + + <distributionManagement> + <site> + <id>trafodion.incubator.apache.org</id> + <name>Trafodion Website at incubator.apache.org</name> + <!-- On why this is the tmp dir and not trafodion.incubator.apache.org, see + https://issues.apache.org/jira/browse/HBASE-7593?focusedCommentId=13555866&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13555866 + --> + <url>file:///tmp</url> + </site> + </distributionManagement> +</project> http://git-wip-us.apache.org/repos/asf/incubator-trafodion/blob/33d796d6/docs/load_transform/src/asciidoc/_chapters/about.adoc ---------------------------------------------------------------------- diff --git a/docs/load_transform/src/asciidoc/_chapters/about.adoc b/docs/load_transform/src/asciidoc/_chapters/about.adoc new file mode 100644 index 0000000..4ef4bd0 --- /dev/null +++ b/docs/load_transform/src/asciidoc/_chapters/about.adoc @@ -0,0 +1,186 @@ +//// +/** +* @@@ START COPYRIGHT @@@ +* +* Licensed to the Apache Software Foundation (ASF) under one +* or more contributor license agreements. See the NOTICE file +* distributed with this work for additional information +* regarding copyright ownership. The ASF licenses this file +* to you under the Apache License, Version 2.0 (the +* "License"); you may not use this file except in compliance +* with the License. You may obtain a copy of the License at +* +* http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, +* software distributed under the License is distributed on an +* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +* KIND, either express or implied. See the License for the +* specific language governing permissions and limitations +* under the License. +* +* @@@ END COPYRIGHT @@@ + */ +//// + += About This Document +This guide describes how to load and transform data into a Trafodion database. + +The information herein is complementary to the following Trafodion documentation: + +* http://trafodion.apache.org/docs/sql_reference/index.html[Trafodion SQL Reference Manual] +* http://trafodion.apache.org/docs/odb/index.html[Trafodion odb User Guide] + +== Intended Audience +This guide targets anyone wanting to load data into a Trafodion database. You need to have the +following skill sets to make full of the information in this guide: + +* SQL DDL and DML. +* Install and configure Linux software. +* Trafodion administration. +* Depending on your data source, Java and/or Hadoop ecosystem usage. + +== New and Changed Information +This is a new guide. + +== Notation Conventions +This list summarizes the notation conventions for syntax presentation in this manual. + +* UPPERCASE LETTERS ++ +Uppercase letters indicate keywords and reserved words. Type these items exactly as shown. Items not enclosed in brackets are required. ++ +``` +SELECT +``` + +* lowercase letters ++ +Lowercase letters, regardless of font, indicate variable items that you supply. Items not enclosed in brackets are required. ++ +``` +file-name +``` + +* [ ] Brackets ++ +Brackets enclose optional syntax items. ++ +``` +DATETIME [start-field TO] end-field +``` ++ +A group of items enclosed in brackets is a list from which you can choose one item or none. ++ +The items in the list can be arranged either vertically, with aligned brackets on each side of the list, or horizontally, enclosed in a pair of brackets and separated by vertical lines. ++ +For example: ++ +``` +DROP SCHEMA schema [CASCADE] +DROP SCHEMA schema [ CASCADE | RESTRICT ] +``` + +* { } Braces ++ +Braces enclose required syntax items. ++ +``` +FROM { grantee [, grantee ] ... } +``` ++ +A group of items enclosed in braces is a list from which you are required to choose one item. ++ +The items in the list can be arranged either vertically, with aligned braces on each side of the list, or horizontally, enclosed in a pair of braces and separated by vertical lines. ++ +For example: ++ +``` +INTERVAL { start-field TO end-field } +{ single-field } +INTERVAL { start-field TO end-field | single-field } +``` + +* | Vertical Line ++ +A vertical line separates alternatives in a horizontal list that is enclosed in brackets or braces. ++ +``` +{expression | NULL} +``` + +* … Ellipsis ++ +An ellipsis immediately following a pair of brackets or braces indicates that you can repeat the enclosed sequence of syntax items any number of times. ++ +``` +ATTRIBUTE[S] attribute [, attribute] ... +{, sql-expression } ... +``` ++ +An ellipsis immediately following a single syntax item indicates that you can repeat that syntax item any number of times. ++ +For example: ++ +``` +expression-n ... +``` + +* Punctuation ++ +Parentheses, commas, semicolons, and other symbols not previously described must be typed as shown. ++ +``` +DAY (datetime-expression) +@script-file +``` ++ +Quotation marks around a symbol such as a bracket or brace indicate the symbol is a required character that you must type as shown. ++ +For example: ++ +``` +"{" module-name [, module-name] ... "}" +``` + +* Item Spacing ++ +Spaces shown between items are required unless one of the items is a punctuation symbol such as a parenthesis or a comma. ++ +``` +DAY (datetime-expression) DAY(datetime-expression) +``` ++ +If there is no space between two items, spaces are not permitted. In this example, no spaces are permitted between the period and any other items: ++ +``` +myfile.sh +``` + +* Line Spacing ++ +If the syntax of a command is too long to fit on a single line, each continuation line is indented three spaces and is separated from the preceding line by a blank line. ++ +This spacing distinguishes items in a continuation line from items in a vertical list of selections. ++ +``` +match-value [NOT] LIKE _pattern + [ESCAPE esc-char-expression] +``` + +== Publishing History +[cols="2*",options="header"] +|=== +| Product Version | Publication Date +| Trafodion Release 1.3.0 | January 2016. +|=== + +== Comments Encouraged +The Trafodion community encourages your comments concerning this document. We are committed to providing documentation that meets your +needs. Send any errors found, suggestions for improvement, or compliments to: + [email protected] + +Include the document title and any comment, error found, or suggestion for improvement you have concerning this document. Or, even +better, join our community and help us improve our documentation. Please refer to +http://trafodion.incubator.apache.org/contributing_redirect.html[Trafodion Contributor Guide] for details. http://git-wip-us.apache.org/repos/asf/incubator-trafodion/blob/33d796d6/docs/load_transform/src/asciidoc/_chapters/bulk_load.adoc ---------------------------------------------------------------------- diff --git a/docs/load_transform/src/asciidoc/_chapters/bulk_load.adoc b/docs/load_transform/src/asciidoc/_chapters/bulk_load.adoc new file mode 100644 index 0000000..03f1385 --- /dev/null +++ b/docs/load_transform/src/asciidoc/_chapters/bulk_load.adoc @@ -0,0 +1,291 @@ +//// +/** +* @@@ START COPYRIGHT @@@ +* +* Licensed to the Apache Software Foundation (ASF) under one +* or more contributor license agreements. See the NOTICE file +* distributed with this work for additional information +* regarding copyright ownership. The ASF licenses this file +* to you under the Apache License, Version 2.0 (the +* "License"); you may not use this file except in compliance +* with the License. You may obtain a copy of the License at +* +* http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, +* software distributed under the License is distributed on an +* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +* KIND, either express or implied. See the License for the +* specific language governing permissions and limitations +* under the License. +* +* @@@ END COPYRIGHT @@@ + */ +//// + +[[bulk-load]] += Bulk Load + +The LOAD statement enables batch loading large volumes of data efficiently in a scalable manner. + +See the http://trafodion.apache.org/docs/sql_reference/index.html#load_statement[Trafodion SQL Reference Manual] +for a full description of this SQL statement. + +You can bulk-load data using one of the following methods: + +* <<bulk-load-data-from-trafodion-tables,Load Data From Trafodion Tables>> +* <<bulk-load-data-from-hdfs-files, Load Data From HDFS Files>> +* <<bulk-load-data-from-hive-tables,Load Data From Hive Tables>> +* <<bulk-load-data-from-external-databases,Load Data From External Databases>> + +[[bulk-load-from-trafodion-tables]] +== Load Data From Trafodion Tables + +You copy data between two Trafodion tables by using the appropriate SELECT statement in the LOAD command. + +=== Example + +``` +``` + +[[bulk-load-data-from-hdfs-files]] +== Load Data From HDFS Files + +You can copy your data into an HDFS folder and then create an external Hive tables with the correct fields that +you point to the HDFS folder containing the data. You can also specify a WHERE clause on the source data as a filter. +See the https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ExternalTables[External Tables] +page on the https://cwiki.apache.org/confluence/display/Hive[Hive Wiki] for more information. + +When Hive tables are accessed through Trafodion columns with integral, string, and char types are accessible. +See the https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types[LanguageManual Types] +page on the https://cwiki.apache.org/confluence/display/Hive[Hive Wiki] for the data types available in Hive. + +Overall, you do the following: + +1. Export the data on the remote cluster. +2. Transport files to Trafodion cluster via FTP, scp, or some other method. +3. Use LOAD referencing HIVE external tables. + +=== Example + + +NEED A DESCRIPTION OF THE USE CASE, THE LOAD FILE, AND ITS LAYOUT. + +Do the following: + +1. Using trafci, define the Trafodion table where you want to load the data. ++ +``` +CREATE TABLE customer_demographics_salt +( + cd_demo_sk INT NOT NULL + , cd_gender CHAR(1) + , cd_marital_status CHAR(1) + , cd_education_status CHAR(20) + , cd_purchase_estimate INT + , cd_credit_rating CHAR(10) + , cd_dep_count INT + , cd_dep_employed_count INT + , cd_dep_college_count INT + , PRIMARY KEY (cd_demo_sk) +) +SALT USING 4 PARTITIONS ON (cd_demo_sk) +; +``` + +2. Copy the data into HDFS ++ +``` +hadoop fs -copyFromLocal $HOME/data/customer_demographics /hive/tpcds/customer_demographics +``` + +3. Using the Hive shell, create an external Hive table: ++ +``` +CREATE EXTERNAL TABLE customer_demographics +( + cd_demo_sk INT + , cd_gender STRING + , cd_marital_status STRING + , cd_education_status STRING + , cd_purchase_estimate INT + , cd_credit_rating STRING + , cd_dep_count INT + , cd_dep_employed_count INT + , cd_dep_college_count INT +) +ROW FORMAT DELIMITED +FIELDS TERMINATED BY '|' +LOCATION '/hive/tpcds/customer_demographics' +; +``` + +4. Using trafci, load the Trafodion `customer_demographics_salt` table from the Hive table named +`hive.hive.customer_demographics`: ++ +``` +>>LOAD INTO customer_demographics_salt ++>SELECT * FROM hive.hive.customer_demographics WHERE cd_demo_sk <= 5000; +Task: LOAD Status: Started Object: TRAFODION.HBASE.CUSTOMER_DEMOGRAPHICS_SALT +Task: DISABLE INDEX Status: Started Object: TRAFODION.HBASE.CUSTOMER_DEMOGRAPHICS_SALT +Task: DISABLE INDEX Status: Ended Object: TRAFODION.HBASE.CUSTOMER_DEMOGRAPHICS_SALT +Task: PREPARATION Status: Started Object: TRAFODION.HBASE.CUSTOMER_DEMOGRAPHICS_SALT + Rows Processed: 5000 +Task: PREPARATION Status: Ended ET: 00:00:03.199 +Task: COMPLETION Status: Started Object: TRAFODION.HBASE.CUSTOMER_DEMOGRAPHICS_SALT +Task: COMPLETION Status: Ended ET: 00:00:00.331 +Task: POPULATE INDEX Status: Started Object: TRAFODION.HBASE.CUSTOMER_DEMOGRAPHICS_SALT +Task: POPULATE INDEX Status: Ended ET: 00:00:05.262 +``` + +[[bulk-load-data-from-hive-tables]] +== Load Data From Hive Tables + +You can import data from Hive using the TrafCI or sqlci command interface. Do the following: + +1. Set these required Control Query Defaults (CQDs) to improve load performance: ++ +``` +CQD HIVE_MAX_STRING_LENGTH '1000'; // if the widest column is 1KB +``` ++ +This setting is required if there are time-related column types in the target Trafodion table. ++ +``` +CQD ALLOW_INCOMPATIBLE_ASSIGNMENT 'on'; +``` + +2. Issue the LOAD statement to load data into Trafodion tables from Hive. For example: ++ +``` +LOAD WITH NO POPULATE INDEXES INTO trafodion.sch.demo SELECT * FROM hive.hive.demo; +``` + +See the http://trafodion.incubator.apache.org/docs/sql_reference/index.html#load_statement[Trafodion SQL Reference Manual]. +for the completed syntax of the LOAD statement. + +If multiple invocations of the LOAD statement are used to incrementally load sets of data to a single target table, then several HFiles +for each partition of the target table are created. This causes inefficient access during SELECT queries and may also cause a compaction +to be triggered based on the policies configured in the HBase settings. + +To avoid this issue, it is good practice to perform a major compaction on a table that has been the target of more than two LOAD statements +in a short period of time. To perform compaction, use this `hbase shell` command: + +``` +major_compact 'TRAFODION.SCH.DEMO' +``` + +NOTE: The `major_compact` command returns immediately since it's not waited. Typically, compaction of a large table takes a long time +(several minutes to hours) to complete. You can monitor the progress of compaction from the HBase Master Web user interface. + +=== Example + +``` +>> CQD HIVE_MAX_STRING_LENGTH '1000' ; +>> CQD ALLOW_INCOMPATIBLE_ASSIGNMENT 'on' ; +>> LOAD WITH NO POPULATE INDEXES INTO trafodion.sch.demo SELECT * FROM hive.hive.demo ; + +<NEED MORE HERE, INCLUDING CREATING THE TRAFODION TABLE AND WHAT TO DO ABOUT INDEXES> +``` + +[[bulk-load-data-from-external-databases]] +== Load Data From External Databases + +You need to import data into Hive when loading data from external databases. +Use http://sqoop.apache.org/[Apache Sqoop], an open-source tools to move the data from the external database +into Hive tables on the Trafodion cluster. + +Source data can be in the following formats: + +[cols="40%,60%", options="header"] +|=== +| Format | Examples +| *Structured* | Relational databases such as Oracle or MySQL. +| *Semi-Structured* | Cassandra or HBase +| *Unstructured* | HDFS +|=== + +You use the Sqoop command-line shell for interactive commands and basic scripting. + +Sqoop basics: + +* Generic JDBC Connector: supports JDBC T-4 Driver. +* Configuration Language for FROM/TO jobs that specify in SQL terms. +* Partitioner: Divide/parallelize the data streams; uses primary key by default. +* Extractor: Uses FROM configuration for SQL statements, plus partitioner information to query data subsets. +* Loader: Uses TO job configuration; INSERT INTO could be generated from col list or explicitly specified. +* Destroyer: Copies staging table to final table and deletes staging table. + +See the http://sqoop.apache.org/docs/1.99.6/Sqoop5MinutesDemo.html[Sqoop 5 Minutes Demo] for a quick +introduction to Sqoop. + + +=== Install Required Software + +By default, Sqoop is not installed on Trafodion clusters. Do the following: + +* Install and start Sqoop on the Trafodion cluster using either the Ambari or Cloudera Manager GUI. +See the http://sqoop.apache.org/docs/1.99.3/Installation.html[Sqoop installation instructions]. + +* Install http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html[JDK 1.8] + +* Install the http://www.oracle.com/technetwork/database/features/jdbc/index-091264.html[Oracle JDBC driver] + +* Set the following environment variables: ++ +``` +export JAVA_HOME=/opt/java/jdk1.8.0_11 +export JAVA_OPTIONS=-Dmapred.child.java.opts=\-Djava.security.egd=file:/dev/urandom+ +``` + +=== Sample Sqoop Commands + +==== List All Oracle Tables + +``` +sqoop list-tables --driver oracle.jdbc.OracleDriver --connect jdbc:oracle:thin:@<Oracle host name>:<port>/<database> +--username <user-name> --password <password> +``` + +==== Import Data to Hive + +*Syntax* + +``` +sqoop import --connect jdbc:oracle:thin:@<Oracle host name:port>/<database> +--username <user-name> --password <password> --table <tablename> +--split-by <column-name> --hive-import --create-hive-table +--hive-table <hive-table-name> --hive-overwrite --null-string '' +--null-non-string '' --hive-drop-import-delims--verbose +``` + +[cols="40%,60%",options="header"] +|=== +| Parameter | Guidelines +| `--split-by <column-name>` +| By default, if not specified, sqoop uses the primary key column as a splitting column, which is not optimal most of the time. +If the table does not contain a primary key, then you must manually specify the splitting column. +| `--null-string <null-string>` +| This is the string to be written for a null value in a string column. +| `--null-non-string <null-string>` +| This is the string to be written for a null value in a non-string column. +| `--hive-drop-import-delims` +| This drops `\n`, `\r`, and `\01` string fields when importing to Hive. + + + +*NOTE*: If the data contains \n or \r and if you do not use the hive-drop-import-delims option, then data is truncated. +You need to use additional Sqoop options during migration by specifying the delimiter that you would like to use, +which does not exist in the data itself. +|=== + +=== Example + +ADD EXAMPLE HERE. + +``` +``` + + + + + + http://git-wip-us.apache.org/repos/asf/incubator-trafodion/blob/33d796d6/docs/load_transform/src/asciidoc/_chapters/bulk_unload.adoc ---------------------------------------------------------------------- diff --git a/docs/load_transform/src/asciidoc/_chapters/bulk_unload.adoc b/docs/load_transform/src/asciidoc/_chapters/bulk_unload.adoc new file mode 100644 index 0000000..5ba2349 --- /dev/null +++ b/docs/load_transform/src/asciidoc/_chapters/bulk_unload.adoc @@ -0,0 +1,73 @@ +//// +/** +* @@@ START COPYRIGHT @@@ +* +* Licensed to the Apache Software Foundation (ASF) under one +* or more contributor license agreements. See the NOTICE file +* distributed with this work for additional information +* regarding copyright ownership. The ASF licenses this file +* to you under the Apache License, Version 2.0 (the +* "License"); you may not use this file except in compliance +* with the License. You may obtain a copy of the License at +* +* http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, +* software distributed under the License is distributed on an +* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +* KIND, either express or implied. See the License for the +* specific language governing permissions and limitations +* under the License. +* +* @@@ END COPYRIGHT @@@ + */ +//// + +[[bulk-unload]] += Bulk Unload + +The http://trafodion.apache.org/docs/sql_reference/index.html#unload_statement[UNLOAD] statement is a +Trafodion extension that exports data from Trafodion tables into an HDFS location that you specify. +Refer to the http://trafodion.apache.org/docs/sql_reference/index.html#unload_statement[Trafodion SQL Reference Manual] +for complete documentation. + +The extracted data can be either compressed or uncompressed based on what you choose. +UNLOAD performs multiple steps based on the options you give; for example: + +* If using snapshot scan: +** Get list of Trafodion tables from the query plan. +** Create/verify snapshots. + +* Purge target location, if specified + +* Extract: +** Copy table data to data files. +** Non-compressed is straight copy. +** Compressed means compression takes place while writing data (no extra step) + +* Merge Data Files if specified + +*Example* + +This example shows how the UNLOAD statement extracts data from a +Trafodion table, `TRAFODION.HBASE.CUSTOMER_DEMOGRAPHICS`, into an HDFS +folder, `/bulkload/customer_demographics`: + +``` +>>UNLOAD ++>WITH PURGEDATA FROM TARGET ++>MERGE FILE 'merged_customer_demogs.gz' OVERWRITE ++>COMPRESSION GZIP ++>INTO '/bulkload/customer_demographics' ++>SELECT * FROM trafodion.hbase.customer_demographics ++><<+ cardinality 10e10 ,+ cardinality 10e10 >>; +Task: UNLOAD Status: Started +Task: EMPTY TARGET Status: Started +Task: EMPTY TARGET Status: Ended ET: 00:00:00.014 +Task: EXTRACT Status: Started + Rows Processed: 200000 +Task: EXTRACT Status: Ended ET: 00:00:04.743 Task: MERGE FILES Status: Started +Task: MERGE FILES Status: Ended ET: 00:00:00.063 + +--- 200000 row(s) unloaded. +``` http://git-wip-us.apache.org/repos/asf/incubator-trafodion/blob/33d796d6/docs/load_transform/src/asciidoc/_chapters/introduction.adoc ---------------------------------------------------------------------- diff --git a/docs/load_transform/src/asciidoc/_chapters/introduction.adoc b/docs/load_transform/src/asciidoc/_chapters/introduction.adoc new file mode 100644 index 0000000..2bbc3e1 --- /dev/null +++ b/docs/load_transform/src/asciidoc/_chapters/introduction.adoc @@ -0,0 +1,117 @@ +//// +/** +* @@@ START COPYRIGHT @@@ +* +* Licensed to the Apache Software Foundation (ASF) under one +* or more contributor license agreements. See the NOTICE file +* distributed with this work for additional information +* regarding copyright ownership. The ASF licenses this file +* to you under the Apache License, Version 2.0 (the +* "License"); you may not use this file except in compliance +* with the License. You may obtain a copy of the License at +* +* http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, +* software distributed under the License is distributed on an +* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +* KIND, either express or implied. See the License for the +* specific language governing permissions and limitations +* under the License. +* +* @@@ END COPYRIGHT @@@ + */ +//// + +[[introduction]] += Introduction + +[[introduction-load-methods]] +== Load Methods +There are two methods used to load data into a Trafodion table. Both methods can run while the database is concurrently queried: + +[cols="20%,40%,40%",options="header"] +|=== +| Type | Description | Methods/Tools +| *Bulk Load* | Large data volumes + +Stage data and load in the batches | Trafodion Bulk Loader +| *Trickle Load* | Small data volumes + +Insert data at it arrives | odb tool + +ETL tool or custom ODBC/JDBC application +|=== + +These two methods use four types of SQL insert statements + +* *Bulk Load* +** http://trafodion.apache.org/docs/sql_reference/index.html#load_statement[LOAD] + +* *Trickle Load* +** http://trafodion.incubator.apache.org/docs/sql_reference/index.html#insert_statement[INSERT] +** http://trafodion.apache.org/docs/sql_reference/index.html#upsert_statement[UPSERT] +** http://trafodion.apache.org/docs/sql_reference/index.html#upsert_statement[UPSERT USING LOAD] + +The http://trafodion.incubator.apache.org/docs/sql_reference/index.html[Trafodion SQL Reference Manual] +provides syntax descriptions for these statements. + +The data source defines what type of load approach and method you use: + +* *Bulk Load* (LOAD statement) +** _Text Files_: Map an external Hive table. +** _JDBC-Compliant Database_: Load into Hive on the Trafodion cluster using `sqoop`. +** _Hive Tables_: Direct load. +** _Native HBase Tables_: Direct load. +** _Disparate Data Source_: Write Java/C++ UDF to read data from source and pass rows to LOAD. + +* *Trickle Load* (odb utility) +** _Text Files_: Direct access +** _pipes_: Via `stdin` +** _ODBC-Compliant Database_: odb COPY command, no intermediate storage + +For more information, refer to: + +* <<bulk-load,Bulk Load>> +* <<trickle-load, Trickle Load>> + +[[introduction-insert-types]] +=== Insert Types + +The following insert types are supported in Trafodion: + +* `INSERT INTO T …` +* `UPSERT INTO T …` +* `UPSERT USING LOAD INTO T …` +* `LOAD INTO T …` + +The following table compares the different insert types: + +[cols="20%,20%,20%,20%,20%",options="header"] +|=== +| Characteristic | INSERT | UPSERT | UPSERT USING LOAD | LOAD +| *Transaction* | Yes | Yes | No, uses HBase WAL for recovery | No, uses snapshot for recovery +| *Method of Operation* | Uses the standard HBase write path through its `CheckAndPut` call. Rows are held in transaction co-processor memory until the transaction is committed. | Uses the standard HBase write path through its `Put` call. | Uses the standard HBase write path through its `Put` call. | Uses the HBase bulk load write path and creates HFiles directly, bypassing HBase RegionServers for most of its operation. +| *Uniqueness Constraint* | Enforced | Not enforced. New row with the same key value overwrites previous row. | Not enforced. New row with same key value overwrites the previous row. | Enforced only within the set of rows in a single statement. Not enforced with rows already in the table. +| *Index* | Can be used on a table with an index. | Can be used on a table with an index. | When used on a table with an index, it reverts to UPSERT. | Can be used on a table with an index. Index is off-line during the LOAD. +| *Max Size/Invocation* | 10,000 * n^1^ rows | 10,000 * n^1^ rows | 5 million * n^1^ rows | 2 billion * n^1^ rows +| *Min Size/Invocation* | 1 row | 1 row | 1 row | Suitable for greater than 1 million * n^1^ rows +| *Speed* | Slowest | Faster than INSERT | Faster than UPSERT | Fastest +|=== + +^1^ *n* is the number of nodes in each invocation. + +Throughput, max/min sizes depends on multiple factors: + +* Format of rows in Trafodion table (aligned format or not). +* Length of row. +* Number of columns in row. +* Data type of columns. +* Network between nodes in cluster. +* WAL setting. +* Number of clients. +* Use of rowsets. + +== Unload + +The Trafodion UNLOAD statement exports data from Trafodion tables into an HDFS directory. Refer to <<bulk-unload,Bulk Unload>> for more information. + + + http://git-wip-us.apache.org/repos/asf/incubator-trafodion/blob/33d796d6/docs/load_transform/src/asciidoc/_chapters/monitor.adoc ---------------------------------------------------------------------- diff --git a/docs/load_transform/src/asciidoc/_chapters/monitor.adoc b/docs/load_transform/src/asciidoc/_chapters/monitor.adoc new file mode 100644 index 0000000..ab66bf8 --- /dev/null +++ b/docs/load_transform/src/asciidoc/_chapters/monitor.adoc @@ -0,0 +1,77 @@ +//// +/** +* @@@ START COPYRIGHT @@@ +* +* Licensed to the Apache Software Foundation (ASF) under one +* or more contributor license agreements. See the NOTICE file +* distributed with this work for additional information +* regarding copyright ownership. The ASF licenses this file +* to you under the Apache License, Version 2.0 (the +* "License"); you may not use this file except in compliance +* with the License. You may obtain a copy of the License at +* +* http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, +* software distributed under the License is distributed on an +* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +* KIND, either express or implied. See the License for the +* specific language governing permissions and limitations +* under the License. +* +* @@@ END COPYRIGHT @@@ + */ +//// + +[[monitor-progress]] += Monitor Progress + +== INSERT and UPSERT + +For an INSERT statement, rows are written to the HBase table that represents the Trafodion table when the transaction commits. +It is more difficult to see query progress here. + +== UPSERT USING LOAD + +For an UPSERT USING LOAD statement, rows added are visible in the Trafodion table after each `ListOfPut` call succeeds. +You can use a `SELECT COUNT(*)` statement to monitor progress. That way, you know how many rows are already in the table when the +statement starts executing. + +``` +SELECT COUNT(*) FROM trafodion.sch.demo ; +``` + +== LOAD +For LOAD, query progress goes through a few phases, which sometimes overlap: + +1. Hive scan. +2. Sort. +3. Create prep HFiles in HDFS bulkload staging directory (`/bulkload` by default). +4. Move HFiles into HBase. + +You can monitor progress in step 2, sort, with this shell command: + +``` +lsof +L1 | grep SCR | wc -l +``` + +This command returns a count of the number of overflow files for sort. Each file is 2GB in size. +You need to have an approximate idea of the volume of data being loaded to know how much more +data needs to be sorted. On a cluster, sort is done on all nodes with a pdsh-like utility. +Trafodion data volume can also be larger than Hive data volume by a factor of 2 or 3. + +In step 3, create prep HFiles, use the following command to monitor the volume of data written +out to the staging directory: + +``` +hadoop fs -dus /bulkload +``` + +The `hadoop fs` command must be run from one node and does not have to be repeated across the cluster. + +If compression and encoding are used, then the size should be similar to the Hive source data volume. +There may be some remnant data in the staging directory from previous commands, so we have to +take that into account. This step will start only when sort has completed. + +Step 4 is usually the shortest and typically does not exceed a few minutes. + http://git-wip-us.apache.org/repos/asf/incubator-trafodion/blob/33d796d6/docs/load_transform/src/asciidoc/_chapters/tables_indexes.adoc ---------------------------------------------------------------------- diff --git a/docs/load_transform/src/asciidoc/_chapters/tables_indexes.adoc b/docs/load_transform/src/asciidoc/_chapters/tables_indexes.adoc new file mode 100644 index 0000000..71ea7f7 --- /dev/null +++ b/docs/load_transform/src/asciidoc/_chapters/tables_indexes.adoc @@ -0,0 +1,208 @@ +//// +/** +* @@@ START COPYRIGHT @@@ +* +* Licensed to the Apache Software Foundation (ASF) under one +* or more contributor license agreements. See the NOTICE file +* distributed with this work for additional information +* regarding copyright ownership. The ASF licenses this file +* to you under the Apache License, Version 2.0 (the +* "License"); you may not use this file except in compliance +* with the License. You may obtain a copy of the License at +* +* http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, +* software distributed under the License is distributed on an +* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +* KIND, either express or implied. See the License for the +* specific language governing permissions and limitations +* under the License. +* +* @@@ END COPYRIGHT @@@ + */ +//// + +[[tables-indexes]] += Tables and Indexes + +The following guidance helps you set up your tables and indexes for better load performance. + +[[choose-primary-key]] +== Choose Primary Key + +The primary key for a Trafodion table must be chosen based on the workload that accesses the table. + +Keyed access to Trafodion tables is very efficient since HBase is a key-value store. You need to analyze the queries +that are used to access the tables to understand their predicates and join conditions. Once identified, you can +choose a primary key that ensures that the leading key columns have highly selective predicates applied to them. + +This technique limits the number of rows that need to scanned in the HBase. Trafodion uses MDAM (Multi Dimensional Access Method) to limit +the rows scanned when predicates are present to only trailing key columns and not the leading key column. MDAM works best when the +unique entry count of leading key columns (on which predicates are absent) is low. + +[[salting]] +== Salting + +With range partitioned data in some workloads, certain key ranges of data may see more access than other key ranges. This can lead to an +unbalanced usage pattern with some HBase RegionServers handling most of the load. This behavior is referred to as "hot-spotting." + +With Native HBase tables, hot-spotting is often addressed by designing appropriate keys. In Trafodion, once you choose the key to a table, as +discussed in <<choose-primary-key,Choose Primary Key>>, you can use *salting* to distribute the data evenly. Salting applies a +hash function to the salt keys and distributes data to partitions based on this hash value. The hash value is physically stored in the +table as the leading key value. Each split of the table will have only one salt key value. + +The salting key can be any subset (including the whole set) of the primary key. It is a good practice to keep the salting key as small +as possible. The key should provide an even distribution of data, which can be achieved when the key values have a large unique entry +count and no significant skew. + +The number of partitions must also be specified during table creation. You choose the number of partition depending on the size of the +cluster and the expected size of the table. A salted table can split if more data is added to it than initially estimated. If this +happens, then more than one partition having rows with the same salt value, which may result in suboptimal execution plans for the table. + +You can also choose not to salt Trafodion tables. This is similar to range partitioning in a traditional database. The number of partitions +grows with the size of the table, and range boundaries are determined by HBase based on the specified split policy. + +[[compression-encoding]] +== Compression and Encoding + +Large Trafodion tables must be encoded and compressed. Trafodion tables that have a large key or several columns grow in size to 10X or more +when compared to a Hive table with equivalent data since HBase stores the key separately for every column in a row. + +HBase provides several types of encoding to avoid storing the same key value to disk for every column in the row. HBase also supports various +types of compression of the entire data block, regardless whether it is encoded or not. + +See http://hbase.apache.org/book.html#compression[Appendix E: Compression and Data Block Encoding In HBase] in the +http://hbase.apache.org/book.html[Apache HBase Reference Guide] for a comparison of various compression and encoding algorithms. + +For a discussion on the space savings achieved by various types of encoding and compression schemes, please see the +https://blogs.apache.org/hbase/category/General["The Effect of ColumnFamily, RowKey and KeyValue Design on HFile Size"] HBase blog entry, +dated Friday April 11, 2014. + +The performance of an algorithm is also a relevant factor when deciding on the type of compression. For more information, see slides 6, 7, +and 8 of Govind Kamat's and Sumeet Singh's presentation, +http://www.slideshare.net/Hadoop_Summit/kamat-singh-june27425pmroom210cv2["Compression Options in Hadoop - A Tale of Tradeoffs"]. + +[[create-trafodion-tables-and-indexes]] +== Create Tables and Indexes + +Create Trafodion tables using the CREATE TABLE statements with the `SALT USING <num> PARTITIONS` clause for salting and +the `HBASE_OPTIONS` clause for compression and encoding. + +*Example* + +``` +CREATE TABLE trafodion.sch.demo +( demo_sk INT NOT NULL +, name VARCHAR(100) +, PRIMARY KEY (demo_sk) +) +HBASE_OPTIONS +( DATA_BLOCK_ENCODING = 'FAST_DIFF' +, COMPRESSION = 'SNAPPY' +, MEMSTORE_FLUSH_SIZE = '1073741824' +) +SALT USING 8 PARTITIONS ON (demo_sk); +``` + +ANY indexes on the table may be salted or not. However, if they are salted, their salting key and number of partitions must be the same as the table. + +*Example* + +``` +CREATE INDEX demo_ix ON sch.demo(name) +HBASE_OPTIONS +( DATA_BLOCK_ENCODING = 'FAST_DIFF' +, COMPRESSION = 'GZ' +) +SALT LIKE TABLE; +``` + +== Update Statistics + +To generate good plans that allow queries to execute quickly and use resources wisely, the Trafodion Optimizer must have a good idea about how the +values of columns are distributed, the number of distinct values, and so on. Trafodion supplies this information to the optimizer in the +form of histograms generated by executing the UPDATE STATISTICS statement. See the +http://trafodion.apache.org/docs/sql_reference/index.html#update_statistics_statement[Trafodion SQL Reference Manual] for a full +description of this statement. + +=== Default Sampling +While accurate statistics are important, the time required to generate them by reading every row in the table may be prohibitive and is +usually unnecessary. Random sampling of the rows of the table can give adequate results in a fraction of the time required to read all +the values in the table. For most situations, the best option is to simply specify SAMPLE at the end of the UPDATE STATISTICS statement, +which will use the default sampling protocol. For example, to use default sampling in the construction of histograms for each column of +table T1, you would execute the following statement: + +``` +UPDATE STATISTICS FOR TABLE t1 ON EVERY COLUMN SAMPLE; +``` + +This default sampling protocol uses a high sampling rate for small tables, reducing the rate with a steep gradient until hitting 1% and +capping the sample size at one million rows. The specific details of default sampling are as follows: + +* Use the full table for tables up to 10,000 rows. +* For table sizes from 10,000 up to a million rows, 10,000 rows are randomly sampled. In effect, this causes the sampling rate to decline +from 100% to 1% as a function of increasing table size. +* For tables with one million to 100 million rows, use a 1% random sample. +* For tables exceeding 100 million rows, the sampling rate is calculated as 1 million divided by the number of rows in the table. +This limits the overall sample size to 1 million rows while ensuring uniform random sampling across the entire table. + +== Generate Single-Column and Multi-Column Histograms From One Statement + +If you use the ON EVERY COLUMN syntax in an UPDATE STATISTICS statement, then it is important to realize that multi-column histograms can be +requested in the same statement. For example, if you wanted to generate a histogram for each single column of table T1, as well as +multi-column histograms for column sets (c1, c2) and (c5, c6, c7), then you could use the following statement: + +``` +UPDATE STATISTICS FOR TABLE t1 ON EVERY COLUMN, (c1,c2), (c5,c6,c7) SAMPLE; +``` + +In terms of the end result, this is equivalent to the following pair of statements: + +``` +UPDATE STATISTICS FOR TABLE t1 ON EVERY COLUMN SAMPLE; +UPDATE STATISTICS FOR TABLE t1 ON (c1, c2), (c5, c6, c7) SAMPLE; +``` + +However, the performance is superior when they are combined into a single statement because a multi-column histogram depends +on the single-column histograms of its component columns. Therefore, separating the generation of single-column and multi-column histograms +for a table into two statements leads to redundantly calculating some of the single-column histograms. Even though the +relevant single-column histograms already exist, they are recomputed at the time the multi-column histograms are generated. + +=== Enable Update Statistics Automation + +If a standard set of queries is run on a regular basis, then one way to generate only those histograms that are needed for efficient execution +of those queries is to enable update statistics automation, and then PREPARE each of the queries: + +``` +CONTROL QUERY DEFAULT USTAT_AUTOMATION_INTERVAL '1440'; +PREPARE s FROM SELECT...; +``` + +The value of the CQD USTAT_AUTOMATION_INTERVAL is intended to determine the automation interval (in minutes) for update statistics +automation. The PREPARE statement causes the Trafodion Compiler to compile and optimize a query without executing it. In the process +of doing so with automation enabled, any histograms needed by the optimizer that are missing causes those columns to be marked +as needing histograms. Then, the following UPDATE STATISTICS statement can be run against each table to generate the needed histograms: + +``` +UPDATE STATISTICS FOR TABLE <table-name> ON NECESSARY COLUMNS SAMPLE; +``` + +=== Regenerate Histograms + +Histograms can become "stale" as the underlying data changes and possibly reflects a different distribution of values, although +it is possible that data turnover or accumulation can be high while maintaining the same distribution. To ensure that statistics +remain accurate, you should regenerate histograms for a table once significant changes have been made to that table since its +histograms were last generated. To refresh existing histograms without adding new ones, use the following statement: + +``` +UPDATE STATISTICS FOR TABLE <table-name> ON EXISTING COLUMNS SAMPLE; +``` + +The critical set of histograms that were previously generated with the ON NECESSARY COLUMNS syntax can be periodically regenerated +using ON EXISTING COLUMNS. Note that using ON NECESSARY COLUMNS will only identify those columns that have been previously +requested by the optimizer but do not exist. The current implementation of automation does not know which existing histograms might be stale. + + + +LocalWords: c7 http://git-wip-us.apache.org/repos/asf/incubator-trafodion/blob/33d796d6/docs/load_transform/src/asciidoc/_chapters/trickle_load.adoc ---------------------------------------------------------------------- diff --git a/docs/load_transform/src/asciidoc/_chapters/trickle_load.adoc b/docs/load_transform/src/asciidoc/_chapters/trickle_load.adoc new file mode 100644 index 0000000..3d3bd1b --- /dev/null +++ b/docs/load_transform/src/asciidoc/_chapters/trickle_load.adoc @@ -0,0 +1,496 @@ +//// +/** +* @@@ START COPYRIGHT @@@ +* +* Licensed to the Apache Software Foundation (ASF) under one +* or more contributor license agreements. See the NOTICE file +* distributed with this work for additional information +* regarding copyright ownership. The ASF licenses this file +* to you under the Apache License, Version 2.0 (the +* "License"); you may not use this file except in compliance +* with the License. You may obtain a copy of the License at +* +* http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, +* software distributed under the License is distributed on an +* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +* KIND, either express or implied. See the License for the +* specific language governing permissions and limitations +* under the License. +* +* @@@ END COPYRIGHT @@@ + */ +//// + +[[trickle-load]] += Trickle Load + +Trafodion Trickle Load allows data to be committed in batches, with sizes ranging from 1 row to a several +thousand rows in each commit. Trickle Load uses the following SQL statements (defined in the +http://trafodion.incubator.apache.org/docs/sql_reference/index.html[Trafodion SQL Reference Manual]: + +* http://trafodion.incubator.apache.org/docs/sql_reference/index.html#insert_statement[INSERT] +* http://trafodion.apache.org/docs/sql_reference/index.html#upsert_statement[UPSERT] +* http://trafodion.apache.org/docs/sql_reference/index.html#upsert_statement[UPSERT USING LOAD] + +Contrary to <<bulk-load,Bulk Load>>, committed rows are immediately visible from other transactions +thereby leading to minimal latency in making newly ingested rows visible to applications and end users. + +You use Trickle Load in the following situations: + +* Inserting and/or updating data on an ongoing basis. Typically, you create a custom JDBC or ODBC +application for this approach. + +* You want to migrate a smaller amount of data (a few millions rows). Typically, you use JDBC- or +ODBC-based ETL tools for this approach; for example: +** <<trickle-load-odb,Trafodion odb^1^>> +** http://squirrel-sql.sourceforge.net[SQuirrel-SQL] +** http://www.pentaho.com/[Pentaho] +** http://www.informatica.com/us/[Informatica]. + +^1^ Trafodion obd typically achieves better load throughput than third-party ETL tools. + +[[trickle-load-improving-throughput]] +== Improving Throughput + +Trickle Load uses the HBase write path, with every row being written to the WAL (Write-Ahead Log) and HBase MemStore. +When memstore is full data is flushed to HStorefile in background. + +Throughput can be improved by use of: + +* Rowsets or Batch Updates. +* UPSERT instead of INSERT statements, if applicable. +* Multiple simultaneous connections. + +In addition, when using INSERT or UPSERT USING LOAD with the objective of maximizing data ingestion throughput, +increasing the HBase table attribute MEMSTORE_FLUSHSIZE from its default value helps. + +The actual value you use depends on the heap size allocated to each Region Server, the concurrent query workload, and the +number of tables for which simultaneous fast data ingestion is needed. With a heap size of 31 GB for each Region Server in +an environment with heavy concurrent query workload, setting this attribute 1 GB gives good performance. + +You can specify this attribute in the HBASE_OPTIONS clause when creating the table. Alternatively, you can also set it from +the `hbase shell` through an `ALTER 'TRAFODION.<schema-name>.<table-name>', MEMSTORE_FLUSHSIZE >= '1073741824'` command. + +[[trickle-load-odb]] +== odb + +odb is a Linux and Windows Trafodion client that is: + +* ODBC based +* Database agnostic query driver +* Query interpreter +* Loader and extractor + +odb may be installed on: + +* The Trafodion cluster. +* The machine that contains source data +* An intermediate machine that is being used for data loading. + +Source data can be in any database (local or remote) that supports ODBC or in flat files local to the machine hosting +the odb tool. + +odd uses threads to achieve parallelism, rowsets to improve throughput. You can specify INSERT, UPSERT or UPSERT USING LOAD +insert types. + +NOTE: odb does not use the bulk load command LOAD, and, therefore, throughput when using odb may be lower than what can be achieved +with the bulk loader. However, when using the odb tool, source data need not be moved to the Trafodion cluster in a separate step. + +odb allows you to access Hadoop data using one of the following methods: + +1. *Use Hive and its ODBC Driver*: odb can access HIVE like any other relational database. +For example, you can copy to from HIVE and other databases using odb's copy option. +2. *Add the hdfs.*` prefix to the input or output file during loads/extracts*: The file is read/written +from/to Hadoop. odb interacts directly with the HDFS file system using *libhdfs*. ++ +This option is currently available only under Linux. + +The following odb commands/features are discussed in this guide: + +* <<trickle-load-odb-load, odb Load>> +* <<trickle-load-odb-copy, odb Copy>> +* <<trickle-load-odb-extract, odb Extract>> +* <<trickle-load-odb-transform, odb Transform>> + +See the http://trafodion.incubator.apache.org/docs/odb_user/index.html[Trafodion odb User Guide] +for installation instructions and usage syntax for the odb tool. + +The following subsections assume that you've installed odb. + +[[trickle-load-odb-throughput]] +=== odb Throughput + +You achieve the best throughput with odb if using the UPSERT USING LOAD option. + +The default insert type used by odb is INSERT; to use UPSERT USING LOAD, please specify `:loadcmd=UL` in odb's `load` or +`copy` command. + +*Example* + +Copy the table `mytable` from `<source_catalog>.<source_schema>` on the source database to `trafodion.my +schema.mytable` +on Trafodion. + +``` +odb64luo -u <src_username>:<tgt_username> -p <src_pswd>:<tgt_pswd> +-d <src_dsn>:<tgt_dsn> +-cp src:<source_catalog>.<source_schema>.mytable tgt:trafodion.myschema.mytable +:splitby=<col-name>:parallel=4:loadcmd=UL +``` + +[cols="25%,75%", options="header"] +|=== +| Option | Defines +| `src_username` | User name for the source database. +| `src_pswd` | Password for the source database. +| `src_dsn` | ODBC DSN for the source database. +| `tgt_username` | User name for the Trafodion database. +| `tgt_pswd` | Password for the Trafodion database. +| `tgt_dsn` | ODBC DSN for the Trafodion database. +| `splitby` | Defines the column used to evenly distributed values for parallelism. Consider using a leading key column. +| `parallel=4` | Use four connections to extract data from the source database and another four connections to write data to the target Trafodion database. +| `loadcmd=UL` | Use UPSERT USING LOAD syntax to write data. +|=== + +[[trickle-load-odb-load]] +=== odb Load + +Refer to the http://trafodion.apache.org/docs/odb/index.html#_load_files[Load Files] section +in the http://trafodion.apache.org/docs/odb/index.html[Trafodion odb User Guide] for complete +documentation of this option. + +You use the `-l` option to load into a table from: + +* File or standard input (pipe) +* gzip compressed files with no external dependencies +* HDFS +* Load XML files +* Delimited and fixed format files +* "Binary" files (example images) +* Generic SQL scripts before/after loads + +The `-l` option provides: + +* Data generation (constant, sequences, random values, from external datasets) +* Configurable rowsets + +You can load single tables or list of tables in the same session using single/parallel threads. +Limited "ETL like" functionalities are provided; for example: +SUBSTR, TRANSLITERATION, TRUNCATE target, DATE/TIME format conversion, and TOUPPER. + +*Important Options* + +[cols="25%,75%", options="header"] +|=== +| Option | Defines +| `src` | Source file. If empty, then odb generates sample data. +| `fs` | Field separator. +| `tgt` | Target table, required. +| `map` | Map file. A text file describing which input column is mapped to which target table column. See +<<trickle-load-odb-transform, odb Transform>> below. +| `rows` | Rowset size to be used. +| `parallel` | Number of connections/threads to be used.` +| `loadcmd` | `IN`, `UP` or `UL`. INSERT, UPSERT or UPSERT USING LOAD. Use UL for best throughput. +|=== + +*Example* + +``` +$ odb64luo -u user -p xx -d dsn -l src=customer.tbl:tgt=TRAFODION.MAURIZIO.CUSTOMER \ +:fs=\|:rows=1000:loadcmd=UL:truncate:parallel=4 +``` + +This command: + +* Loads the file named `customer.tbl` (`src=customer.tbl`) +* in the table `TRAFODION.MAURIZIO.CUSTOMER` (`tgt=TRAFODION.MAURIZIO.CUSTOMER`) +* using `|` (vertical bar) as a field separator (`fs=\|`) +* using `1000 rows` as row-set buffer (`rows=1000`) +* using UPSERT USING LOAD syntax to achieve better throughput +* truncating the target table before loading (`truncate`) +* using `4 parallel threads` to load the target table (`parallel=4`) + +``` +./odb64luo -u xx -p yy -d traf_sqws125 -l src=myfile:fs=|:tgt=TRAFODION.SEABASE.REGION:map=region.map:max=10000:rows=500:parallel=2:loadcmd=UL +``` + +You can load multiple files using different `-l` options. By default odb creates as many threads (and ODBC connections) as the sum +of parallel load threads. + +*Example* + +Truncates and load the CUSTOMER, ORDERS and LINEITEM tables in parallel. + +``` +odb64luo -u user -p xx -d dsn -T 5 \ +-l src=./data/%t.tbl.gz:tgt=TRAFODION.MAURO.CUSTOMER:fs=\ +|:rows=m2:truncate:norb:parallel=4 \ +-l src=./data/%t.tbl.gz:tgt=TRAFODION.MAURO.ORDERS:fs=\ +|:rows=1000:truncate:norb:parallel=4 \ +-l src=./data/%t.tbl.gz:tgt=TRAFODION.MAURO.LINEITEM:fs=\ +|:rows=m10:truncate:norb:parallel=4 +``` + +[[trickle-load-odb-copy]] +=== odb Copy + +Refer to the +http://trafodion.apache.org/docs/odb/index.html#_copy_tables_from_one_database_to_another[Copy Tables From One Database to Another] +section in the http://trafodion.apache.org/docs/odb/index.html[Trafodion odb User Guide] for complete documentation of this option. + +Use the `-cp` option to copy tables *directly* from one data-source to another using ODBC (for example, from Trafodion to Teradata +or vice-versa): + +* Single/Multiple table(s) copy from one database to another +* Data never lands to disk (ODBC buffers moved from source to target) +* Multi-threaded copy: single/multiple tables in parallel using single/multiple "data streams"/table +* Each "data stream" consists of one "extractor" and one or more "loaders" +* Table subsets copy (columns and/or rows) +* No data conversion required +* Other functionalities: sequence creation, limit text col length, max rows to copy, . . . +* Each data stream is "multiple buffered" with loaders and extractors working in parallel (no need to extract before loading). + +The target table has to be be created in advance and should have a compatible structure. + +*Important Options* + +[cols="25%,75%", options="header"] +|=== +| Option | Defines +| `src` | Source file. If empty, then odb generates sample data. +| `fs` | Field separator. +| `tgt` | Target table, required. +| `parallel` | Number of connections/threads to be used.` +| `splitby` | Source column to parallelize copy operation on. +| `pwhere` | `where` condition on source +| `loadcmd` | `IN`, `UP` or `UL`. INSERT, UPSERT or UPSERT USING LOAD. Use UL for best throughput. +|=== + +When copying data from one data source to another, odb needs user/password/dsn for both source and target system. +User credentials and DSN for the target system are specified this way: + +``` +$ odb64luo -u src_user:tgt_user -p src_pwd:tgt:pwd -d src_dsn:tgt_dsn ... -cp src=...:tgt=... +``` + +You can use odb to copy a list of tables from one database to another. + +*Example* + +``` +$ cat tlist.txt +# List of tables to extract +src=TRAFODION.MAURIZIO.ORDERS +src=TRAFODION.MAURIZIO.CUSTOMER +src=TRAFODION.MAURIZIO.PART +src=TRAFODION.MAURIZIO.LINEITEM +``` + +You can extract all these tables by running: + +``` +$ odb64luo -u user1:user2 -p xx:yy -d dsn1:dsn2 \ +-cp src=-tlist.txt:tgt=tpch.stg_%t:rows=m2:truncate:parallel=4 +``` + +Please note the `src=-tlist.txt`. This command copies: + +[cols="50%,50%",options="header",] +|=== +| Source | Target +| `TRAFODION.MAURIZIO.ORDERS` | `tpch.stg_orders` +| `TRAFODION.MAURIZIO.CUSTOMER` | `tpch.stg_customer` +| `TRAFODION.MAURIZIO.PART` | `tpch.stg_part` +| `TRAFODION.MAURIZIO.LINEITEM` | `tpch.stg_lineitem` +|=== + +Optionally, you can define any other command-line options in the input file. + +*Example* + +Using different _splitby columns_. + +``` +$ cat tlist2.txt +# List of tables to extract and their "splitby columns" +src=TRAFODION.MAURIZIO.ORDERS:splitby=O_ORDERKEY +src=TRAFODION.MAURIZIO.CUSTOMER:splitby=C_CUSTOMERKEY +src=TRAFODION.MAURIZIO.PART:splitby=P_PARTKEY +src=TRAFODION.MAURIZIO.LINEITEM:splitby=L_PARTKEY +``` + +[[trickle-load-odb-extract]] +=== odb Extract + +Refer to the http://trafodion.apache.org/docs/odb/index.html#_extract_tables[Extract Tables] +section in the http://trafodion.apache.org/docs/odb/index.html[Trafodion odb User Guide] for complete documentation of this option. + +Use then -e option to extract from data a table and write it to standard files or named pipes. + +You can: + +* Export single tables, list of tables or generic SQL output. +* Export table subsets (columns and/or rows). +* Exports one or multiple tables in parallel using one or multiple data streams for each table +* Invoke other functionalities (trim, remote trim, cast, limit text col length, max rows to export,. . .) + +You can write the extracted data to: + +* Single/multiple files or standard output (pipe). +* gzip compressed files (no external libraries required). +* XML formatted files (no external libraries required). +* Hadoop File System (requires libhdfs). + +Other useful features: + +* Configurable NULL/EMPTY strings, field/record separators +* Configurable rowset +* Possibility to run generic SQL scripts before/after extracts +* Multi-threaded export + +*Important Options* + +[cols="25%,75%", options="header"] +|=== +| Option | Defines +| `src` | Source file. If empty, then odb generates sample data. +| `fs` | Field separator. +| `tgt` | Target table, required. +| `parallel` | Number of connections/threads to be used.` +| `splitby` | Source column to parallelize extract operation on. +| `pwhere` | `where` condition on source +|=== + +*Example* + +``` +$ odb64luo -u user -p xx -d dsn -T 3 \ +-e src=TRAFODION.MAURIZIO.LIN%:tgt=$\{DATA}/ext_%t.csv.gz:rows=m10:fs=\|:trim:gzip: \ +-e src=TRAFODION.MAURIZIO.REGION:tgt=$\{DATA}/ext_%t.csv.gz:rows=m10:fs=\|:trim:gzip \ +-e src=TRAFODION.MAURIZIO.NATION:tgt=$\{DATA}/ext_%t.csv.gz:rows=m10:fs=\|:trim:gzip +``` + +The example above: + +* Extracts tables `REGION`, `NATION`, and all tables starting with `LIN` from the `TRAFODION.MAURIZIO` schema. +* Saves data into files `ext_%t.csv.gz` (`%t` is expanded to the real table name). +* Compresses the output file (gzip) on the fly (uncompressed data never lands to disk). +* Trims text fields. +* Uses a 10 MB IO buffer. +* Uses three threads (ODBC connection) for the extraction process. + +*Example* + +Use odb to extract all tables listed in a file. + +``` +$ cat tlist.txt + +# List of tables to extract src=TRAFODION.MAURIZIO.ORDERS +src=TRAFODION.MAURIZIO.CUSTOMER src=TRAFODION.MAURIZIO.PART +src=TRAFODION.MAURIZIO.LINEITEM + +``` + +Extract all these tables by running: + +``` +$ odb64luo -u user -p xx -d dsn -e src=-tlist.txt:tgt=%t_%d%m:rows=m20:sq=\" +``` + +The example above: + +* Reads the list of source tables from `tlist.txt`. +* Extracts the data into file using the table name in lowercase (`%t`). +appending extraction data and time (`_%d%m`) for the target file name. +* Uses a 20MB I/O buffer for each extraction thread. +* Encloses strings with double-quote characters (`sq=\"`). + + +[[trickle-load-odb-transform]] +=== odb Transform + +Refer to the http://trafodion.apache.org/docs/odb/index.html#load_map_fields[Map Source File Fields to Target Table Columns] +section in the http://trafodion.apache.org/docs/odb/index.html[Trafodion odb User Guide] for complete documentation of +odb's mapping/transformation capabilities. + +odb provides mapping/transformation capabilities though mapfiles. By specifying `map=<mapfile>` load option you can: + +* Associate any input file field to any table column +* Skip input file fields +* Generate sequences +* Insert constants +* Transform dates/timestamps formats +* Extract substrings +* Replace input file strings. For example: insert Maurizio Felici when you read MF +* Generate random values +* And much more + +A generic mapfile contains: + +* *Comments* (line starting with #) +* *Mappings* to link input file fields to the corresponding target table columns. + +Mappings use the following syntax: + +``` +<colname>:<field>[:transformation operator] +``` + +*Example* + +Suppose you have a target table like this: + +``` ++------+---------------+----+-------+------------+ +|COLUMN|TYPE |NULL|DEFAULT|INDEX | ++------+---------------+----+-------+------------+ +|ID |INTEGER SIGNED |NO | |mf_pkey 1 U | +|NAME |CHAR(10) |YES | | | +|AGE |SMALLINT SIGNED|YES | | | +|BDATE |DATE |YES | | | ++------+---------------+----+-------+------------+ +``` + +And an input file like this: + +*** +uno,00,*51*,due,_Maurizio_,tre,[underline]#07 Mar 1959#, ignore,remaining, fields + +uno,00,*46*,due,_Lucia_,tre,[underline]#13 Oct 1964#, ignore, this + +uno,00,*34*,due,_Giovanni_,tre,[underline]#30 Mar 1976# + +uno,00,*48*,due,_Antonella_,tre,[underline]#24 Apr 1962# +*** + +* *Bold text* represents age. +* _Italics text_ represents name. +* [underline]#Underline text# represents birth date. + +You want to load the marked fields into the appropriate column, generate a unique key for ID and ignore the remaining fields, +In addition, you need to convert the date format and replace all occurrences of `Lucia` with `Lucy`. + +The following map file accomplishes these goals: + +*** +$ cat test/load_map/ml1.map + +# Map file to load TRAFODION.MFTEST.FRIENDS from friends.dat + +ID:seq:1 # Inserts into ID column a sequence starting from 1 + +NAME:4:REPLACE:Lucia:Lucy # Loads field #4 into NAME and replace all occurrences of Lucia with Lucy + +AGE:2 # Loads field #2 (they start from zero) into AGE + +BDATE:6:DCONV:d.b.y # Loads field #6 into BDATE converting date format from dd mmm yyyy +*** + +Load as follows: + +``` +$ odb64luo -u user -p xx -d dsn \ + -l src=friends.dat:tgt=TRAFODION.MFTEST.FRIENDS:map=ml1.map:fs=, +``` + +The above example: + +* Reads data from `friends.dat` (`src`). +* Writes data to the `TRAFODION.MFTEST.FRIENDS` Trafodion table (`tgt`). +* Uses `ml1.map` to define transformation specifications (`map`). +* Uses comma as a field separator (`fs`).
