[ 
https://issues.apache.org/jira/browse/DRILL-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099918#comment-16099918
 ] 

Volodymyr Vysotskyi commented on DRILL-4264:
--------------------------------------------

Thanks for such a detailed analysis. 

I agree that deserializing {{ColumnTypeMetadata_v3.Key}} objects this way will 
cause problems for fields whose names contain dots. To solve this issue, I 
propose changing the structure of the {{ColumnTypeMetadata_v3.Key}} class: 
instead of an array holding the components of the field name, it should use a 
{{SchemaPath}} and serialize it as the string returned by 
{{SchemaPath.toExpr()}}. With this change we should also update the parquet 
metadata version. 
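
To illustrate why the string form matters, here is a minimal, self-contained sketch. It does not use Drill classes: {{PathKeyDemo}} and its methods are hypothetical stand-ins, the backtick-quoted output is only modelled on what {{SchemaPath.toExpr()}} is expected to return, and names are assumed to contain no backticks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in, not Drill code: compares two ways of turning
// field-name segments into a single string key.
class PathKeyDemo {

  // Lossy form: segments joined with dots, so a dot inside a name is
  // indistinguishable from a segment separator.
  static String joinWithDots(List<String> segments) {
    return String.join(".", segments);
  }

  // Quoted form: every segment is wrapped in backticks.
  // Assumption: segment names contain no backticks themselves.
  static String toQuotedExpr(List<String> segments) {
    StringBuilder sb = new StringBuilder();
    for (String segment : segments) {
      if (sb.length() > 0) {
        sb.append('.');
      }
      sb.append('`').append(segment).append('`');
    }
    return sb.toString();
  }

  // Parses the quoted form back into segments.
  static List<String> parseQuotedExpr(String expr) {
    List<String> segments = new ArrayList<>();
    int i = 0;
    while (i < expr.length()) {
      if (expr.charAt(i) != '`') {
        throw new IllegalArgumentException("expected backtick at index " + i);
      }
      int end = expr.indexOf('`', i + 1);
      if (end < 0) {
        throw new IllegalArgumentException("unterminated segment");
      }
      segments.add(expr.substring(i + 1, end));
      i = end + 1;
      if (i < expr.length()) {
        if (expr.charAt(i) != '.') {
          throw new IllegalArgumentException("expected '.' at index " + i);
        }
        i++;
      }
    }
    return segments;
  }

  public static void main(String[] args) {
    List<String> oneSegment = Arrays.asList("0.0.1");
    List<String> threeSegments = Arrays.asList("0", "0", "1");
    // Both paths collapse to the same dot-joined key.
    System.out.println(joinWithDots(oneSegment).equals(joinWithDots(threeSegments)));
    // The quoted keys stay distinct and can be parsed back.
    System.out.println(toQuotedExpr(oneSegment));
    System.out.println(toQuotedExpr(threeSegments));
    System.out.println(parseQuotedExpr(toQuotedExpr(oneSegment)));
  }
}
```

With the dot-joined form, a single field named {{0.0.1}} and a three-level path produce the same key, which is the ambiguity described above; the quoted form keeps them apart.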

A more complex problem is connected with the {{MaterializedField}} class. 
{{SchemaPath}} was removed from {{MaterializedField}} in 
[PR-373|https://github.com/apache/drill/pull/373]. One of the reasons for this 
refactoring was the assumption that {{MaterializedField}} should have no 
knowledge of its parents. However, some code in Drill assumes that 
{{MaterializedField.getPath()}} returns the field path including its parents. 
For example, in [this 
line|https://github.com/apache/drill/blob/3e8b01d5b0d3013e3811913f0fd6028b22c1ac3f/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet2/DrillParquetReader.java#L225]
 a {{MaterializedField}} instance is created with the name 
{{col.getAsUnescapedPath()}}, and in [this 
line|https://github.com/apache/drill/blob/874bf6296dcd1a42c7cf7f097c1a6b5458010cbb/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java#L362]
 the name including the parent field names is used. Using only the field name 
in {{MaterializedField}} would cause problems there, since a field at the root 
level may have the same name as a field nested in a map. 
So the full field path should be used in the {{MaterializedField}} class in 
this case.
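
The name clash can be shown with a small sketch. Nothing here is Drill code: {{FieldNameCollisionDemo}} is a hypothetical helper that indexes the same two fields first by bare name and then by a backtick-quoted full path.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Drill code: a root-level field and a
// field nested in a map can share the same bare name.
class FieldNameCollisionDemo {

  // Keys each field by its last segment only.
  static Map<String, List<String>> indexByLastName(List<List<String>> paths) {
    Map<String, List<String>> index = new LinkedHashMap<>();
    for (List<String> path : paths) {
      // A later field with the same bare name silently replaces the earlier one.
      index.put(path.get(path.size() - 1), path);
    }
    return index;
  }

  // Keys each field by its full path, each segment wrapped in backticks.
  static Map<String, List<String>> indexByFullPath(List<List<String>> paths) {
    Map<String, List<String>> index = new LinkedHashMap<>();
    for (List<String> path : paths) {
      StringBuilder key = new StringBuilder();
      for (String segment : path) {
        if (key.length() > 0) {
          key.append('.');
        }
        key.append('`').append(segment).append('`');
      }
      index.put(key.toString(), path);
    }
    return index;
  }

  public static void main(String[] args) {
    List<List<String>> paths = Arrays.asList(
        Arrays.asList("id"),           // root-level field
        Arrays.asList("user", "id"));  // field nested in the map "user"
    System.out.println(indexByLastName(paths).size());
    System.out.println(indexByFullPath(paths).keySet());
  }
}
```

Keyed by bare name, the root-level {{id}} is lost as soon as the nested {{id}} is registered; keyed by full path, both fields survive.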

The {{SchemaPath.getSimplePath(field.getPath())}} code is used in many places, 
but it does not return the same {{SchemaPath}} that was used to create the 
{{MaterializedField}} instance. 
We should change the implementation of {{MaterializedField}} so that this code 
returns the same {{SchemaPath}} that was used to create the 
{{MaterializedField}} instance. 

I think we should store a separate {{String path}} field in the 
{{MaterializedField}} class, holding the value of {{SchemaPath.toExpr()}}, and 
replace all {{SchemaPath.getAsUnescapedPath()}} calls with 
{{SchemaPath.toExpr()}}: 
* when a {{MaterializedField}} instance is created from the path 
{{SchemaPath.toExpr()}}, the name is set to the last name segment of the 
{{SchemaPath}}. 
* when a {{MaterializedField}} instance is created from a name, the path is 
that name wrapped in backticks. 
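
A rough sketch of these two creation rules follows. {{FieldSketch}} is a hypothetical class, not the real {{MaterializedField}}; a list of segments stands in for {{SchemaPath}}, and the backtick-quoted string only approximates what {{SchemaPath.toExpr()}} is expected to return (names are assumed to contain no backticks).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the real MaterializedField: a field that keeps
// both its own name and a quoted full path.
final class FieldSketch {
  final String name; // last segment only
  final String path; // backtick-quoted full path

  private FieldSketch(String name, String path) {
    this.name = name;
    this.path = path;
  }

  // Rule 1: created from a full path, the name is the last segment.
  static FieldSketch fromPath(List<String> segments) {
    if (segments.isEmpty()) {
      throw new IllegalArgumentException("path must have at least one segment");
    }
    StringBuilder quoted = new StringBuilder();
    for (String segment : segments) {
      if (quoted.length() > 0) {
        quoted.append('.');
      }
      quoted.append('`').append(segment).append('`');
    }
    return new FieldSketch(segments.get(segments.size() - 1), quoted.toString());
  }

  // Rule 2: created from a bare name, the path is the name in backticks.
  static FieldSketch fromName(String name) {
    return new FieldSketch(name, "`" + name + "`");
  }

  public static void main(String[] args) {
    FieldSketch nested = fromPath(Arrays.asList("versions", "0.0.1"));
    FieldSketch root = fromName("0.0.1");
    System.out.println(nested.name + " / " + nested.path);
    System.out.println(root.name + " / " + root.path);
  }
}
```

Both instances report the same name, {{0.0.1}}, but their paths differ, so code that needs the original path can rebuild it from {{path}} instead of guessing from the name.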

The less preferred solution is to revert 
[PR-373|https://github.com/apache/drill/pull/373]. In that case dots in field 
names would be handled correctly, but such a solution would make the transition 
to Apache Arrow more complex (although {{MaterializedField}} was replaced by 
{{Flatbuffer Field}} there, so the transition is already complex). 


> Dots in identifier are not escaped correctly
> --------------------------------------------
>
>                 Key: DRILL-4264
>                 URL: https://issues.apache.org/jira/browse/DRILL-4264
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Codegen
>            Reporter: Alex
>            Assignee: Volodymyr Vysotskyi
>
> If you have some json data like this...
> {code:javascript}
>     {
>       "0.0.1":{
>         "version":"0.0.1",
>         "date_created":"2014-03-15"
>       },
>       "0.1.2":{
>         "version":"0.1.2",
>         "date_created":"2014-05-21"
>       }
>     }
> {code}
> ... there is no way to select any of the rows since their identifiers contain 
> dots and when trying to select them, Drill throws the following error:
> Error: SYSTEM ERROR: UnsupportedOperationException: Unhandled field reference 
> "0.0.1"; a field reference identifier must not have the form of a qualified 
> name
> This must be fixed since there are many json data files containing dots in 
> some of the keys (e.g. when specifying version numbers etc)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
