[ https://issues.apache.org/jira/browse/DRILL-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arina Ielchiieva updated DRILL-7279:
------------------------------------
Labels: ready-to-commit (was: )
> Support provided schema for CSV without headers
> -----------------------------------------------
>
> Key: DRILL-7279
> URL: https://issues.apache.org/jira/browse/DRILL-7279
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.16.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Major
> Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> Extend the Drill 1.16 provided schema support for the text reader to allow a
> provided schema for files without headers. Behavior:
> * If the file is configured to not extract headers, and a schema is provided,
> and the schema has at least one column, then use the provided schema to
> create individual columns. Otherwise, continue to use {{columns}} as in
> previous versions.
> * The columns in the schema are assumed to match left-to-right with those in
> the file.
> * If the schema contains more columns than the file, the extra columns take
> their default values. (This occurs in schema evolution when a column is added
> to newer files.)
> * If the file contains more columns than the schema, the extra columns at
> the end of each line are ignored. This matches the behavior for files that
> contain headers.
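
The matching rules listed above for headerless files can be sketched roughly as follows. This is an illustrative Python model only, not Drill's reader code: the function name `map_headerless_row` and the `(name, default)` tuple representation of a schema column are assumptions made for the example, and type conversion of field values is left out.

```python
from typing import Any, Dict, Optional, Sequence, Tuple

# Hypothetical representation of one provided-schema column: (name, default value).
SchemaColumn = Tuple[str, Any]


def map_headerless_row(
    fields: Sequence[str],
    schema: Optional[Sequence[SchemaColumn]],
) -> Dict[str, Any]:
    """Map the fields of one line from a headerless file onto a provided schema."""
    if not schema:
        # No schema, or a schema with no columns: keep the single `columns` array.
        return {"columns": list(fields)}

    row: Dict[str, Any] = {}
    for position, (name, default) in enumerate(schema):
        if position < len(fields):
            # Schema columns match file fields left to right.
            row[name] = fields[position]
        else:
            # The schema has more columns than the line: use the default value.
            row[name] = default
    # Fields beyond the last schema column are never visited, so they are ignored.
    return row
```

With a three-column schema, a two-field line yields the third column's default, and a four-field line silently drops its last field.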
> h4. Table Properties
> This change also adds four table properties for text files. These properties,
> if present, override those defined in the format plugin configuration. They
> allow the user to keep a single "csv" config while having many tables with
> the "csv" suffix, each with different properties. That is, the user need not
> define a new plugin config and a new extension just to change a file format
> property. For example, the user can have a ".csv" file with headers without
> defining a different suffix (usually ".csvh" in Drill) for that case.
> || Table Property || Equivalent Plugin Config Property ||
> | {{drill.headers}} | {{extractHeader}} |
> | {{drill.skipFirstLine}} | {{skipFirstLine}} |
> | {{drill.delimiter}} | {{fieldDelimiter}} |
> | {{drill.commentChar}} | {{comment}} |
> For each, the rules are:
> * If the table property is not set, then the plugin property is used.
> * If the table property is set, then the property value replaces the plugin
> property value for that one specific table.
> * For the delimiter, if the property value is an empty string, then this is
> the same as an unset property.
> * For the comment, if the property value is an empty string, then the comment
> character is set to ASCII NUL, which will never match. This effectively turns
> off the comment feature for this one table.
> * If the delimiter or comment value is longer than a single character, only
> the first character is used.
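
The override rules above could be modeled along these lines. This is a sketch under stated assumptions rather than the actual Drill implementation: the function names are hypothetical, table properties are modeled as a plain string-to-string mapping, and treating the text "true" (case-insensitive) as the only true value for the boolean properties is an assumption, not something the issue specifies.

```python
from typing import Mapping

NUL = "\0"  # ASCII NUL, used to disable comment matching


def resolve_delimiter(table_props: Mapping[str, str], plugin_value: str) -> str:
    """Pick the field delimiter for one table."""
    value = table_props.get("drill.delimiter")
    if value is None or value == "":
        # Unset or empty: fall back to the plugin's fieldDelimiter.
        return plugin_value
    # Only the first character of a longer value is used.
    return value[0]


def resolve_comment(table_props: Mapping[str, str], plugin_value: str) -> str:
    """Pick the comment character for one table."""
    value = table_props.get("drill.commentChar")
    if value is None:
        # Unset: fall back to the plugin's comment character.
        return plugin_value
    if value == "":
        # Empty string: turn comments off for this table.
        return NUL
    return value[0]


def resolve_flag(table_props: Mapping[str, str], key: str, plugin_value: bool) -> bool:
    """Pick a boolean setting such as drill.headers or drill.skipFirstLine."""
    value = table_props.get(key)
    if value is None:
        return plugin_value
    # Assumption: only the text "true" (any case) counts as true.
    return value.strip().lower() == "true"
```

Keeping the delimiter and comment cases in separate functions makes the one asymmetry explicit: an empty delimiter means "unset", while an empty comment means "disabled".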
> The table properties can be used without specifying a "provided" schema.
> To do so, leave the schema's column list empty:
> {noformat}
> create schema () for table `dfs.data`.`example`
> PROPERTIES ('drill.headers'='false', 'drill.skipFirstLine'='false',
> 'drill.delimiter'='|')
> {noformat}