[ 
https://issues.apache.org/jira/browse/DRILL-7001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753273#comment-16753273
 ] 

Paul Rogers commented on DRILL-7001:
------------------------------------

Thanks [~benj641] for calling this out. This behavior was added about 18 months 
ago to work around various bugs found at that time. The idea was to try to 
infer schema as best we can even for badly formed CSV files. Originally we'd 
just fail or get into very odd situations.

The renamed columns work for {{SELECT *}} queries, but not so well for explicit 
projection, where one has to know the names. Here, the thought was that if a 
user knows the column names, they will have taken care to ensure the names are 
valid.

The team is working on adding a metadata layer. When that is available there 
may be an alternative way to handle messy columns, perhaps by referring to them 
by position and applying proper names in the schema.

All this said, if you have a use case in which the "oddball" column names 
appear, please do provide advice on a better set of rules for handling such 
names.

> Documentation - renaming columns name in csv header
> ---------------------------------------------------
>
>                 Key: DRILL-7001
>                 URL: https://issues.apache.org/jira/browse/DRILL-7001
>             Project: Apache Drill
>          Issue Type: Wish
>    Affects Versions: 1.15.0
>            Reporter: benj
>            Priority: Minor
>
> I don't know if this is the best place for this request, but
> some operations are performed that may change the names of the columns 
> when querying a csvh file (CSV with header).
>  These operations are not documented.
>  Although it's possible to read 
> [HeaderBuilder.java|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/compliant/HeaderBuilder.java],
>  it would be useful to add a section to the documentation that explains at 
> least the principle of these different cases, to avoid needless 
> problems/difficulties.
> List of operations (possibly not exhaustive):
>  * Trim() is applied to each CSV column name
> {noformat}
>  Name , Age,PoB  , Info
> =>
> `Name`, `Age`, `PoB` and `Info`{noformat}
>  * Characters other than [a-zA-Z0-9_] are replaced by '_' (underscore)
> {noformat}
> Name,Sum$,em@il
> =>
> `Name`,`Sum_`,`em_il`{noformat}
>  * Field names starting with '_' (underscore) are prefixed with 'col'
> {noformat}
> _name,_age_,pob_,_col_
> =>
> `col_name`, `col_age_`, `pob_`, `col_col_`{noformat}
>  * Field names starting with [^a-zA-Z] are prefixed with 'col_'
> {noformat}
> 0_name, 1_age,@pob,#other1,'other2'
> =>
> `col_0_name`, `col_1_age`, `col_pob`, `col_other1`, `col_other2_`{noformat}
>  * Quotation marks are removed
>  * If the name is a single character:
>  ** if it matches [a-zA-Z], it is left unchanged
>  ** else if it matches [0-9], it is prefixed with col_
>  ** otherwise it is renamed to column_[0-9]+, where [0-9]+ denotes the position of the 
> column
>  * Duplicate column names (case insensitive) are suffixed with _[0-9]+ 
> (starting from "_2")
> {noformat}
> 0_name,col_0_name,colx,COLX,colx,colx_2
> =>
> `col_0_name`, `col_0_name_2`, `colx`, `COLX_2`, `colx_3`, `colx_2_2`{noformat}
>  
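
For illustration only, the rules listed in the issue description above can be sketched in a few lines of Python. This is not Drill's HeaderBuilder code and may differ from it in detail. The function name rename_headers is hypothetical, the 1-based column position and the treatment of empty names are assumptions, and quotation-mark removal is assumed to have been done by the CSV parser before this step.

```python
import re


def rename_headers(raw_names):
    """Rename raw CSV header fields following the rules listed in the issue.

    Hypothetical helper for illustration; not taken from Drill.
    """
    seen = set()   # lower-cased names already emitted (case-insensitive check)
    result = []
    for position, raw in enumerate(raw_names, start=1):
        name = raw.strip()  # trim surrounding whitespace
        if len(name) <= 1 and not re.fullmatch(r"[A-Za-z0-9]", name):
            # Single non-alphanumeric character: name the column by position.
            # Assumption: position is 1-based and empty names are treated
            # the same way.
            name = f"column_{position}"
        else:
            # Replace every character outside [a-zA-Z0-9_] with '_'.
            name = re.sub(r"[^A-Za-z0-9_]", "_", name)
            if name[0] == "_":
                name = "col" + name    # leading underscore: prefix 'col'
            elif name[0].isdigit():
                name = "col_" + name   # leading digit: prefix 'col_'
        # Resolve case-insensitive duplicates with _2, _3, ...
        candidate, suffix = name, 2
        while candidate.lower() in seen:
            candidate = f"{name}_{suffix}"
            suffix += 1
        seen.add(candidate.lower())
        result.append(candidate)
    return result


print(rename_headers("0_name,col_0_name,colx,COLX,colx,colx_2".split(",")))
```

With the duplicate-name example from the issue as input, this sketch yields the same six names the reporter lists. Splitting on a bare comma is a simplification; a real header line would go through a CSV parser first.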



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
