Repository: drill Updated Branches: refs/heads/gh-pages 558caec08 -> 02af103d9
add Jason's examples Project: http://git-wip-us.apache.org/repos/asf/drill/repo Commit: http://git-wip-us.apache.org/repos/asf/drill/commit/a9b1831f Tree: http://git-wip-us.apache.org/repos/asf/drill/tree/a9b1831f Diff: http://git-wip-us.apache.org/repos/asf/drill/diff/a9b1831f Branch: refs/heads/gh-pages Commit: a9b1831f13b34476a820cd9f34ee663b7323cc7c Parents: c7e53cb Author: Kristine Hahn <[email protected]> Authored: Mon Jul 6 13:10:53 2015 -0700 Committer: Kristine Hahn <[email protected]> Committed: Mon Jul 6 13:13:10 2015 -0700 ---------------------------------------------------------------------- .../060-text-files-csv-tsv-psv.md | 89 ++++++++++++++++--- _docs/img/csv_no_header.png | Bin 0 -> 10877 bytes _docs/img/csv_with_comments.png | Bin 0 -> 14602 bytes _docs/img/csv_with_escape.png | Bin 0 -> 14171 bytes _docs/img/csv_with_header.png | Bin 0 -> 12451 bytes 5 files changed, 79 insertions(+), 10 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/drill/blob/a9b1831f/_docs/data-sources-and-file-formats/060-text-files-csv-tsv-psv.md ---------------------------------------------------------------------- diff --git a/_docs/data-sources-and-file-formats/060-text-files-csv-tsv-psv.md b/_docs/data-sources-and-file-formats/060-text-files-csv-tsv-psv.md index b3dce16..5b0d874 100644 --- a/_docs/data-sources-and-file-formats/060-text-files-csv-tsv-psv.md +++ b/_docs/data-sources-and-file-formats/060-text-files-csv-tsv-psv.md @@ -6,7 +6,7 @@ parent: "Data Sources and File Formats" Best practices for reading text files are: * Select data from particular columns -* CAST data +* Cast data * Use a distributed file system ### Select Data from Particular Columns @@ -29,16 +29,16 @@ You can also improve performance by casting the VARCHAR data to INT, FLOAT, DATE Using a distributed file system, such as HDFS, instead of a local file system to query the files also improves performance because currently Drill does not split files on block splits. ## Configuring Drill to Read Text Files -In the storage plugin configuration, you can set the following attributes that affect how Drill reads CSV, TSV, PSV (comma-, tab-, pipe-separated) files. ["List of Attributes and Definitions"]({{site.baseurl}}/docs/plugin-configuration-basics/#list-of-attributes-and-definitions): +In the storage plugin configuration, you can set the following attributes that affect how Drill reads CSV, TSV, PSV (comma-, tab-, pipe-separated) files. * String lineDelimiter = "\n"; One or more characters used to denote a new record. Allows reading files with windows line endings. * char fieldDelimiter = ','; A single character used to separate each value. * char quote = '"'; - A single character used to start/end a quoted value. + A single character used to start/end a value enclosed in quotation marks. * char escape = '"'; - A single character used to escape a quote inside of a value. + A single character used to escape a quototation mark inside of a value. * char comment = '#'; A single character used to denote a comment line. * boolean skipFirstLine = false; @@ -49,9 +49,7 @@ For more information about storage plugin configuration, see ["List of Attribute You can deal with a mix of text files with and without headers either by creating two separate format plugins or by creating two format plugins within the same storage plugin. The former approach is typically easier than the latter. ### Creating Two Separate Format Plugins -Format plugins are associated with a particular storage plugin. Storage plugins define a root directory that Drill targets when using the storage plugin. You can define separate storage plugins for different root directories, and define each of the format attributes to match the files stored below that directory. All files can use the .csv extension. - -For example: +Format plugins are associated with a particular storage plugin. Storage plugins define a root directory that Drill targets when using the storage plugin. You can define separate storage plugins for different root directories, and define each of the format attributes to match the files stored below that directory. All files can use the .csv extension, as shown in the following example: Storage Plugin A @@ -78,9 +76,7 @@ Storage Plugin B }, ### Creating Two Format Plugins within the Same Storage Plugin -Give a different extension to files with a header and to files without a header, and use a storage plugin that looks something like the following example. This method requires renaming some files to use the csv2 extension. - -For example: +Give a different extension to files with a header and to files without a header, and use a storage plugin that looks something like the following example. This method requires renaming some files to use the csv2 extension, as shown in the following example: "csv": { "type": "text", @@ -99,3 +95,76 @@ For example: "delimiter": "," }, +## Examples of Querying Text Files +The examples in this section show the results of querying CSV files that use and do not use a header, include comments, and use an escape character: + +### Using a Header in a File + + + + 0: jdbc:drill:zk=local> SELECT * FROM dfs.`/tmp/csv_with_header.csv2`; + +------------------------+ + | columns | + +------------------------+ + | ["hello","1","2","3"] | + | ["hello","1","2","3"] | + | ["hello","1","2","3"] | + | ["hello","1","2","3"] | + | ["hello","1","2","3"] | + | ["hello","1","2","3"] | + | ["hello","1","2","3"] | + +------------------------+ + +### Not Using a Header in a File + + + + 0: jdbc:drill:zk=local> SELECT * FROM dfs.`/tmp/csv_no_header.csv`; + +------------------------+ + | columns | + +------------------------+ + | ["hello","1","2","3"] | + | ["hello","1","2","3"] | + | ["hello","1","2","3"] | + | ["hello","1","2","3"] | + | ["hello","1","2","3"] | + | ["hello","1","2","3"] | + | ["hello","1","2","3"] | + +------------------------+ + 7 rows selected (0.112 seconds) + +### Escaping a Character in a File + + + + 0: jdbc:drill:zk=local> SELECT * FROM dfs.`/tmp/csv_with_escape.csv`; + +------------------------------------------------------------------------+ + | columns | + +------------------------------------------------------------------------+ + | ["hello","1","2","3 \" double quote is the default escape character"] | + | ["hello","1","2","3"] | + | ["hello","1","2","3"] | + | ["hello","1","2","3"] | + | ["hello","1","2","3"] | + | ["hello","1","2","3"] | + | ["hello","1","2","3"] | + +------------------------------------------------------------------------+ + 7 rows selected (0.104 seconds) + +### Adding Comments to a File + + + + 0: jdbc:drill:zk=local> SELECT * FROM dfs.`/tmp/csv_with_comments.csv2`; + +------------------------+ + | columns | + +------------------------+ + | ["hello","1","2","3"] | + | ["hello","1","2","3"] | + | ["hello","1","2","3"] | + | ["hello","1","2","3"] | + | ["hello","1","2","3"] | + | ["hello","1","2","3"] | + | ["hello","1","2","3"] | + +------------------------+ + 7 rows selected (0.111 seconds) http://git-wip-us.apache.org/repos/asf/drill/blob/a9b1831f/_docs/img/csv_no_header.png ---------------------------------------------------------------------- diff --git a/_docs/img/csv_no_header.png b/_docs/img/csv_no_header.png new file mode 100644 index 0000000..121e250 Binary files /dev/null and b/_docs/img/csv_no_header.png differ http://git-wip-us.apache.org/repos/asf/drill/blob/a9b1831f/_docs/img/csv_with_comments.png ---------------------------------------------------------------------- diff --git a/_docs/img/csv_with_comments.png b/_docs/img/csv_with_comments.png new file mode 100644 index 0000000..3f23eb7 Binary files /dev/null and b/_docs/img/csv_with_comments.png differ http://git-wip-us.apache.org/repos/asf/drill/blob/a9b1831f/_docs/img/csv_with_escape.png ---------------------------------------------------------------------- diff --git a/_docs/img/csv_with_escape.png b/_docs/img/csv_with_escape.png new file mode 100644 index 0000000..31b5d1f Binary files /dev/null and b/_docs/img/csv_with_escape.png differ http://git-wip-us.apache.org/repos/asf/drill/blob/a9b1831f/_docs/img/csv_with_header.png ---------------------------------------------------------------------- diff --git a/_docs/img/csv_with_header.png b/_docs/img/csv_with_header.png new file mode 100644 index 0000000..d15b3c6 Binary files /dev/null and b/_docs/img/csv_with_header.png differ
