[ 
https://issues.apache.org/jira/browse/ORC-200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024887#comment-16024887
 ] 

Owen O'Malley commented on ORC-200:
-----------------------------------

Actually, how will it create trouble? The schema evolution part of the reader 
will map the columns by name, provided the reader passes down the schema 
it wants to read with. That said, I'm not against preserving the order of 
the fields instead of sorting them. You'll just have different issues in the 
common case where the writer of the JSON documents doesn't pick a particular 
order for the attributes. Manually comparing schemas becomes much more annoying 
then.
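
As a rough illustration of what mapping by name means here, the following is a 
minimal sketch using only java.util. It is not ORC's implementation; the class 
name NameMappingSketch, the method mapByName, and the field lists are 
hypothetical and chosen only for the example. It shows that when columns are 
looked up by name, the position of a field in the file schema does not matter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not taken from the ORC code base.
class NameMappingSketch {
    // For each column the reader asks for, find its position in the file
    // schema by name. A result of -1 means the file has no such column.
    static List<Integer> mapByName(List<String> fileFields, List<String> readerFields) {
        List<Integer> positions = new ArrayList<>();
        for (String name : readerFields) {
            positions.add(fileFields.indexOf(name));
        }
        return positions;
    }

    public static void main(String[] args) {
        // Assumed field lists: the file was written without newField.
        List<String> fileFields = Arrays.asList("about", "address", "age");
        List<String> readerFields = Arrays.asList("age", "newField", "about");
        System.out.println(mapByName(fileFields, readerFields)); // [2, -1, 0]
    }
}
```

A column missing from the file (newField above) simply has no source position; 
how a real reader fills such a column is outside this sketch.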

Take a look at what I've been doing on the converter in [Owen's 
orc-199|https://github.com/omalley/orc/tree/orc-199], which adds a CSV reader 
to the converter. In particular, I extended the schema discoverer with the 
ability to merge in the schema directly. It will still lose on some things like 
maps.

> json-schema and convert commands should support schema evolution of json 
> documents
> ----------------------------------------------------------------------------------
>
>                 Key: ORC-200
>                 URL: https://issues.apache.org/jira/browse/ORC-200
>             Project: ORC
>          Issue Type: Bug
>          Components: Java
>    Affects Versions: 1.5.0
>            Reporter: Shawn Hooton
>            Assignee: Shawn Hooton
>         Attachments: example-v1.json, example-v2.json
>
>
> Using the command (sample payloads attached):
> java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar json-schema  -t ~/example-v1.json
> Produces the following output:
> create table tbl (
>   about string,
>   address string,
>   age tinyint,
>   balance string,
>   company string,
>   email string,
>   eyeColor string,
>   favoriteFruit string,
>   friends array <struct <
>       id: tinyint,
>       name: string>>,
>   gender string,
>   greeting string,
>   guid string,
>   id binary,
>   index tinyint,
>   isActive boolean,
>   latitude decimal(8,6),
>   longitude decimal(8,6),
>   name string,
>   phone string,
>   picture string,
>   registered timestamp,
>   tags array <string>
> )
> Notice that because org/apache/orc/tools/json/StructType.java uses a 
> java.util.TreeMap for the fields instance variable, the generated DDL is 
> sorted alphabetically rather than in the order the fields appear in the 
> document. This causes problems for the convert command as well.
> java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar meta -j -p output-v1.orc
> <output omitted for brevity>
>   "schemaString": 
> "struct<about:string,address:string,age:tinyint,balance:string,company:string,email:string,eyeColor:string,favoriteFruit:string,friends:array<struct<id:tinyint,name:string>>,gender:string,greeting:string,guid:string,id:binary,index:tinyint,isActive:boolean,latitude:decimal(8,6),longitude:decimal(8,6),name:string,phone:string,picture:string,registered:timestamp,tags:array<string>>",
>   "schema": [
>     {
>       "columnId": 0,
>       "columnType": "STRUCT",
>       "childColumnNames": [
>         "about",
>         "address",
>         "age",
>         "balance",
>         "company",
>         "email",
>         "eyeColor",
>         "favoriteFruit",
>         "friends",
>         "gender",
>         "greeting",
>         "guid",
>         "id",
>         "index",
>         "isActive",
>         "latitude",
>         "longitude",
>         "name",
>         "phone",
>         "picture",
>         "registered",
>         "tags"
>       ],
> <output omitted for brevity>
> This causes *major* problems when a field is added to the JSON document 
> later, e.g.:
> java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar json-schema  -t ~/example-v2.json
> Examine where the newField field is added in the example-v2.json document and 
> then examine the output below.  This also affects the convert command.
> create table tbl (
>   about string,
>   address string,
>   age tinyint,
>   balance string,
>   company string,
>   email string,
>   eyeColor string,
>   favoriteFruit string,
>   friends array <struct <
>       id: tinyint,
>       name: string>>,
>   gender string,
>   greeting string,
>   guid string,
>   id binary,
>   index tinyint,
>   isActive boolean,
>   latitude decimal(8,6),
>   longitude decimal(8,6),
>   name string,
>   newField string,
>   phone string,
>   picture string,
>   registered timestamp,
>   tags array <string>
> )
> The org/apache/orc/tools/json/StructType.java class should use 
> java.util.LinkedHashMap for the fields instance variable so order is 
> maintained across changes to the JSON schema.
> Pull request *with* test cases incoming :)
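
The difference between the two map types named in the description can be seen 
with java.util alone. This is a minimal sketch, not the StructType code: the 
class FieldOrderSketch, the helper orderAfterInsert, and the document order of 
the fields are assumptions made for the example, since the attached JSON files 
are not reproduced here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not taken from the ORC code base.
class FieldOrderSketch {
    // Put the field names into the map in the given order and return the
    // order in which the map iterates over them afterwards.
    static List<String> orderAfterInsert(Map<String, String> fields, String... names) {
        for (String name : names) {
            fields.put(name, "string");
        }
        return new ArrayList<>(fields.keySet());
    }

    public static void main(String[] args) {
        // Assumed order of attributes as they might appear in a JSON document.
        String[] documentOrder = {"index", "guid", "newField", "about"};

        // TreeMap iterates in sorted key order: [about, guid, index, newField]
        System.out.println(orderAfterInsert(new TreeMap<>(), documentOrder));

        // LinkedHashMap iterates in insertion order: [index, guid, newField, about]
        System.out.println(orderAfterInsert(new LinkedHashMap<>(), documentOrder));
    }
}
```

With the sorted map, a newly added field lands wherever its name sorts; with 
the insertion-ordered map, it stays where it first appeared in the input.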



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
