[jira] [Updated] (DRILL-7279) Support provided schema for CSV without headers

Paul Rogers (JIRA) Fri, 07 Jun 2019 16:34:34 -0700


     [ 
https://issues.apache.org/jira/browse/DRILL-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Paul Rogers updated DRILL-7279:
-------------------------------
    Description: 
Extend the Drill 1.16 provided schema support for the text reader to allow a 
provided schema for files without headers. Behavior:

* If the file is configured to not extract headers, and a schema is provided, 
and the schema has at least one column, then use the provided schema to create 
individual columns. Otherwise, continue to use {{columns}} as in previous 
versions.
* The columns in the schema are assumed to match left-to-right with those in 
the file.
* If the schema contains more columns than the file, the extra columns take 
their default values. (This occurs in schema evolution when a column is added 
to newer files.)
* If the file contains more columns than the schema, then the extra columns, at 
the end of the line, are ignored. This is the same behavior as occurs if the 
file contains headers.
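
The matching rules above can be sketched in a few lines of Python. This is only an illustrative model, not Drill's implementation: the {{map_row}} function, the (name, default) schema representation, and the use of {{None}} as a default are assumptions made for the example.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical model of a provided schema: (column name, default value) pairs.
Schema = Sequence[Tuple[str, Optional[str]]]


def map_row(fields: List[str], schema: Schema) -> Dict[str, object]:
    """Map the fields of one headerless line onto the provided schema."""
    if not schema:
        # No provided columns: keep the single `columns` array as before.
        return {"columns": list(fields)}
    row: Dict[str, object] = {}
    for i, (name, default) in enumerate(schema):
        # Left-to-right match; schema columns missing from the file
        # take their default value.
        row[name] = fields[i] if i < len(fields) else default
    # Fields beyond len(schema) are never read, so they are ignored.
    return row
```

With a three-column schema, a two-field line fills the third column from its default, and a four-field line drops the last field.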

h4. Table Properties

Also adds several table properties for text files. These properties, if 
present, override those defined in the format plugin configuration. The 
properties allow the user to have a single "csv" config, but to have many 
tables with the "csv" suffix, each with different properties. That is, the user 
need not define a new plugin config, and define a new extension, just to change 
a file format property. With this system, the user can have a ".csv" file with 
headers; the user need not define a different suffix (usually ".csvh" in Drill) 
for this case.

All properties start with {{drill}} (standard for Drill-defined properties), 
then {{text}} (because they are specific to the text reader). The rest of the 
property name is the same as the format config property name.

|| Table Property || Equivalent Plugin Config Property ||
| {{drill.text.extractHeader}} | {{extractHeader}} |
| {{drill.text.skipFirstLine}} |  {{skipFirstLine}} | 
| {{drill.text.fieldDelimiter}} |  {{fieldDelimiter}} | 
| {{drill.text.quote}} |  {{quote}}| 
| {{drill.text.escape}} |  {{escape}}| 
| {{drill.text.lineDelimiter}} |  {{lineDelimiter}}| 

For each, the rules are:

* If the table property is not set, then the plugin property is used.
* If the table property is set, then the property value replaces the plugin 
property value for that one specific table.
* For most properties, if the property value is an empty string, then this is 
the same as an unset property.
* For the comment, if the property value is an empty string, then the comment 
is set to the ASCII NULL, which will never match. This effectively turns off 
the comment feature for this one table.
* If the delimiter or comment value is longer than a single character, only the 
first character is used.
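
The rules above, for single-character properties, can be sketched as follows. This is an illustrative model, not Drill's code: the {{resolve_char}} function and its {{empty_disables}} flag are hypothetical, and the {{drill.text.comment}} key is inferred from the naming rule above rather than taken from the table.

```python
from typing import Mapping

PREFIX = "drill.text."  # prefix shared by the text reader's table properties
NUL = "\0"              # ASCII NULL, which never matches in the input


def resolve_char(name: str,
                 table_props: Mapping[str, str],
                 plugin_value: str,
                 empty_disables: bool = False) -> str:
    """Resolve a single-character property for one table."""
    value = table_props.get(PREFIX + name)
    if value is None:
        # Table property not set: use the plugin config value.
        return plugin_value
    if value == "":
        # Empty string: same as unset for most properties; for the
        # comment it turns the feature off by selecting ASCII NULL.
        return NUL if empty_disables else plugin_value
    # Only the first character of a longer value is used.
    return value[0]
```

For example, a table property of {{'||'}} for the field delimiter resolves to a single {{|}}, while an empty comment property resolves to ASCII NULL when {{empty_disables}} is true.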

It is possible to use the table properties without specifying a "provided" 
schema. Just leave the column list empty:

{noformat}
create schema () for table `dfs.data`.`example`
PROPERTIES ('drill.text.extractHeader'='false', 
'drill.text.skipFirstLine'='false', 'drill.text.fieldDelimiter'='|')
{noformat}

The field and line delimiters are sometimes non-printable characters. Drill 
(via Calcite) already supports the following syntax:

* Standard escapes: {{\n}}, {{\r}}, {{\t}}, perhaps others.
* Two-digit (ASCII) codes: {{\01}}
* Four-digit (Unicode) codes: {{\u0001}}

Note that, although Drill supports Unicode escapes, the text reader itself 
supports only single-byte characters for the delimiter and escape properties.

  was:
Extend the Drill 1.16 provided schema support for the text reader to allow a 
provided schema for files without headers. Behavior:

* If the file is configured to not extract headers, and a schema is provided, 
and the schema has at least one column, then use the provided schema to create 
individual columns. Otherwise, continue to use {{columns}} as in previous 
versions.
* The columns in the schema are assumed to match left-to-right with those in 
the file.
* If the schema contains more columns than the file, the extra columns take 
their default values. (This occurs in schema evolution when a column is added 
to newer files.)
* If the file contains more columns than the schema, then the extra columns, at 
the end of the line, are ignored. This is the same behavior as occurs if the 
file contains headers.

h4. Table Properties

Also adds four table properties for text files. These properties, if present, 
override those defined in the format plugin configuration. The properties allow 
the user to have a single "csv" config, but to have many tables with the "csv" 
suffix, each with different properties. That is, the user need not define a new 
plugin config, and define a new extension, just to change a file format 
property. With this system, the user can have a ".csv" file with headers; the 
user need not define a different suffix (usually ".csvh" in Drill) for this 
case.

|| Table Property || Equivalent Plugin Config Property ||
| {{drill.headers}} | {{extractHeader}} |
| {{drill.skipFirstLine}} |  {{skipFirstLine}} | 
| {{drill.delimiter}} |  {{fieldDelimiter}} | 
|  {{drill.commentChar}} |  {{comment}}| 

For each, the rules are:

* If the table property is not set, then the plugin property is used.
* If the table property is set, then the property value replaces the plugin 
property value for that one specific table.
* For the delimiter, if the property value is an empty string, then this is the 
same as an unset property.
* For the comment, if the property value is an empty string, then the comment 
is set to the ASCII NULL, which will never match. This effectively turns off 
the comment feature for this one table.
* If the delimiter or comment value is longer than a single character, only the 
first character is used.

It is possible to use the table properties without specifying a "provided" 
schema. Just omit any columns from the schema:

{noformat}
create schema () for table `dfs.data`.`example`
PROPERTIES ('drill.headers'='false', 'drill.skipFirstLine'='false', 
'drill.delimiter'='|')
{noformat}


> Support provided schema for CSV without headers
> -----------------------------------------------
>
>                 Key: DRILL-7279
>                 URL: https://issues.apache.org/jira/browse/DRILL-7279
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.16.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>              Labels: doc-impacting, ready-to-commit
>             Fix For: 1.17.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
