[
https://issues.apache.org/jira/browse/DRILL-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16102548#comment-16102548
]
Paul Rogers edited comment on DRILL-4264 at 7/27/17 1:09 AM:
-------------------------------------------------------------
Wonderful detailed analysis! You caught many detailed issues that my quick scan
missed.
The solution for Parquet metadata seems good. I'm not an expert in that area,
but a few unit tests will validate the change once you make it. Bumping the
version number will solve the forward/backward compatibility issues (using the
mechanism from DRILL-5660).
The {{MaterializedField}} issue is harder. Fortunately, some of the nested-name
issues might not be actual issues.
For example, your example of
[ScanBatch.Mutator:362|https://github.com/apache/drill/blob/3e8b01d5b0d3013e3811913f0fd6028b22c1ac3f/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet2/DrillParquetReader.java#L225]
should be OK as long as the caller knows to call this method only for top-level
columns. This line is used to build up a record batch during reading such as in
JSON or Parquet. The problem arises when the container is a map. In this case, the
caller should be calling {{AbstractMapVector.addOrGet()}} to add the field
rather than adding it at the top level using the {{Mutator}}.
Are there other cases where the code assembles a path then tears it down again?
Or, parses a path?
Otherwise, we can find all uses of {{MaterializedField.getPath()}}, verify that
they really only use the leaf name, and replace them with {{getName()}}. The
same is true of {{getLastName()}}.
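To illustrate why assembling a dotted path and then parsing it again is risky, here is a minimal standalone sketch. It does not use any Drill classes: {{DottedPathDemo}}, {{joinPath}}, {{splitPath}} and {{leafName}} are hypothetical names invented for this example, and the column names come from the JSON in the issue description.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; not part of Drill.
public class DottedPathDemo {

    // Naive approach: flatten a list of name segments into one dotted string.
    static String joinPath(List<String> segments) {
        return String.join(".", segments);
    }

    // Naive inverse: recover the segments by splitting on dots.
    static List<String> splitPath(String path) {
        return Arrays.asList(path.split("\\."));
    }

    // Leaf name taken from the segment list itself, with no string parsing.
    static String leafName(List<String> segments) {
        return segments.get(segments.size() - 1);
    }

    public static void main(String[] args) {
        // A top-level column named "0.0.1" with a child named "version".
        List<String> original = Arrays.asList("0.0.1", "version");

        String flattened = joinPath(original);
        List<String> recovered = splitPath(flattened);

        System.out.println(flattened);          // prints 0.0.1.version
        System.out.println(recovered.size());   // prints 4, not the original 2
        System.out.println(leafName(original)); // prints version
    }
}
```

The round trip turns two segments into four, so any code that joins and then splits names will misread a column such as "0.0.1". Code that only ever needs the leaf name avoids the problem by never parsing the flattened string.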
was (Author: paul-rogers):
Wonderful detailed analysis! You caught many detailed issues that my quick scan
missed.
The solution for Parquet metadata seems good. I'm not an expert in that area,
but a few unit tests will validate the change once you make it. Bumping the
version number will solve the forward/backward compatibility issues (using the
mechanism from DRILL-5660).
The {{MaterializedField}} issue is harder. Fortunately, some of the nested-name
issues might not be actual issues.
For example, your example of
[ScanBatch.Mutator:362|https://github.com/apache/drill/blob/3e8b01d5b0d3013e3811913f0fd6028b22c1ac3f/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet2/DrillParquetReader.java#L225]
should be OK as long as the caller knows to pass in only the leaf name. This
line is used to build up a record batch during reading such as in JSON or
Parquet. The problem arises when the container is a map. In this case, the caller
should be calling {{AbstractMapVector.addOrGet()}} to add the field rather than
adding it at the top level using the {{Mutator}}.
Are there other cases where the code assembles a path then tears it down again?
Or, parses a path?
Otherwise, we can find all uses of {{MaterializedField.getPath()}}, verify that
they really only use the leaf name, and replace them with {{getName()}}. The
same is true of {{getLastName()}}.
> Dots in identifier are not escaped correctly
> --------------------------------------------
>
> Key: DRILL-4264
> URL: https://issues.apache.org/jira/browse/DRILL-4264
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Codegen
> Reporter: Alex
> Assignee: Volodymyr Vysotskyi
> Labels: doc-impacting
>
> If you have some json data like this...
> {code:javascript}
> {
>   "0.0.1":{
>     "version":"0.0.1",
>     "date_created":"2014-03-15"
>   },
>   "0.1.2":{
>     "version":"0.1.2",
>     "date_created":"2014-05-21"
>   }
> }
> {code}
> ... there is no way to select any of the rows since their identifiers contain
> dots and when trying to select them, Drill throws the following error:
> Error: SYSTEM ERROR: UnsupportedOperationException: Unhandled field reference
> "0.0.1"; a field reference identifier must not have the form of a qualified
> name
> This must be fixed since there are many json data files containing dots in
> some of the keys (e.g. when specifying version numbers etc)