Zoltan Ivanfi created AVRO-2128:
-----------------------------------

             Summary: Schema parsing in the Java library is more permissive 
than the C implementation or the JSON specification
                 Key: AVRO-2128
                 URL: https://issues.apache.org/jira/browse/AVRO-2128
             Project: Avro
          Issue Type: Bug
            Reporter: Zoltan Ivanfi


When parsing schemas, the Java library accepts C-style comments (which are 
forbidden in JSON) and is unaffected by trailing garbage (parsing stops as soon 
as it reaches the end of the JSON structure).

In the C library, however, comments and trailing whitespace cause an error.
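
The two problem cases can be checked against any strict JSON parser. The 
following is a minimal sketch that uses only Python's standard json module; it 
is not the Avro Java or C parser and does not reproduce either library's exact 
behavior (for instance, Python's parser tolerates trailing whitespace). The 
helper name is_strict_json and the sample schemas are made up for illustration.

```python
import json


def is_strict_json(text: str) -> bool:
    """Return True only if the whole input is a single valid JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical sample schemas, not taken from the issue.
clean = '{"type": "record", "name": "r", "fields": []}'
with_comment = '{"type": "record", /* note */ "name": "r", "fields": []}'
with_trailing = clean + " garbage"

for label, schema in [
    ("clean", clean),
    ("comment", with_comment),
    ("trailing", with_trailing),
]:
    print(label, is_strict_json(schema))
```

A check along these lines could be run over user-supplied schemas to find the 
ones that a strict parser would reject.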

If a schema is accepted by one language binding, it should be accepted by the 
other as well. The schema should also be valid JSON. The Java library does not 
enforce this because it is more permissive than it should be, so it seems that 
the Java implementation is the one that should change. However, we must also 
consider whether making the Java library stricter at this point would make any 
existing data unreadable.

Fortunately, the schema that is written in the data files themselves is always 
valid JSON, even if it is based on a non-JSON-conformant schema. The reason for 
this is that the Java library parses the schema, builds an in-memory 
representation, and then reserializes it, thereby removing comments and 
trailing garbage. So existing data files are not affected, only user-supplied 
schemas. These can be manually updated (unlike existing data files).
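
The parse-then-reserialize effect can be illustrated with a small sketch. It 
assumes Python's standard json module as a stand-in for the Java library and 
covers only the trailing-garbage case, since Python's parser does not accept 
comments. The function name normalize_schema and the sample input are 
hypothetical.

```python
import json


def normalize_schema(text: str) -> str:
    """Parse the first JSON value in text and reserialize it, ignoring the rest."""
    stripped = text.lstrip()
    # raw_decode stops at the end of the first complete JSON value and
    # returns the index where it stopped instead of rejecting what follows.
    value, _end = json.JSONDecoder().raw_decode(stripped)
    # Only the in-memory representation is written back out.
    return json.dumps(value)


# Hypothetical user-supplied schema with trailing garbage.
user_schema = '{"type": "record", "name": "r", "fields": []} trailing garbage'
stored = normalize_schema(user_schema)
print(stored)
```

Because only the parsed value is serialized, whatever followed the JSON 
structure in the input never reaches the stored form.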

The real-world use-case where this discrepancy causes problems is Hive-Impala 
interaction. Users can create tables in Hive by supplying an Avro schema. That 
schema will be associated with the whole table by getting saved in the Hive 
metastore. Impala also consults this metadata when accessing the table, and a 
non-conformant schema then causes an error in the Avro C library that Impala 
uses. This is detailed in 
IMPALA-1024. In particular, [this 
comment|https://issues.apache.org/jira/browse/IMPALA-1024?focusedCommentId=16261702&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16261702]
 contains a lot of relevant information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
