[jira] Commented: (AVRO-160) file format should be friendly to streaming

Doug Cutting (JIRA) Thu, 22 Oct 2009 10:55:25 -0700

    [ 
https://issues.apache.org/jira/browse/AVRO-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768781#action_12768781
 ]


Doug Cutting commented on AVRO-160:
-----------------------------------

> Is that not the case?

No, that would not work.  You can only add clauses to unions.  So you can in 
theory go from:

 { "type": "record", "name": "Foo", "fields": [{ "name": "x", "type": ["int", "string"]}]}

to

 { "type": "record", "name": "Foo", "fields": [{ "name": "x", "type": ["int", "string", "float"]}]}

However, the current Java API doesn't permit this: it only permits adding 
clauses to a top-level union.  The Java implementation could be improved to do 
a smarter compatibility check when you attempt to augment a file's schema.
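
As a rough illustration of what such a smarter check might look like, here is 
a minimal Python sketch that works on the parsed JSON schemas.  The function 
name is_append_only_extension is hypothetical and not part of any Avro API; 
the sketch assumes the only permitted change is appending branches to an 
existing union, at any nesting depth inside records.

```python
import json


def is_append_only_extension(old, new):
    """Return True if `new` equals `old` except for branches appended to unions.

    Hypothetical helper, not part of Avro.
    """
    # Unions are JSON arrays: every old branch must keep its index.
    if isinstance(old, list):
        if not isinstance(new, list) or len(new) < len(old):
            return False
        return all(is_append_only_extension(o, n) for o, n in zip(old, new))
    # Records: same name, same fields in the same order; only field types may grow.
    if isinstance(old, dict) and old.get("type") == "record":
        if not isinstance(new, dict) or new.get("type") != "record":
            return False
        if old.get("name") != new.get("name"):
            return False
        old_fields, new_fields = old["fields"], new["fields"]
        if len(old_fields) != len(new_fields):
            return False
        return all(
            of["name"] == nf["name"]
            and is_append_only_extension(of["type"], nf["type"])
            for of, nf in zip(old_fields, new_fields)
        )
    # Anything else (primitives, other complex types) must match exactly.
    return old == new


old = json.loads('{"type": "record", "name": "Foo", "fields": '
                 '[{"name": "x", "type": ["int", "string"]}]}')
new = json.loads('{"type": "record", "name": "Foo", "fields": '
                 '[{"name": "x", "type": ["int", "string", "float"]}]}')
print(is_append_only_extension(old, new))  # True
print(is_append_only_extension(new, old))  # False
```

Note that a non-union type can never become a union under this check, which 
matches the rule that a schema that is not a union may not change.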

This restriction is created by the binary format: a record is simply serialized 
as its fields, with no added per-field tags or other per-record data.
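
To make the restriction concrete, here is a small Python sketch of the binary 
encoding for the Foo record above.  It assumes the encoding rules from the 
Avro spec (ints as zigzag variable-length integers, strings as a length 
followed by UTF-8 bytes, unions as a branch index followed by the value); the 
helper names are made up for illustration and only int and string are handled.

```python
def encode_long(n):
    """Zigzag, then variable-length encode, as Avro does for int and long."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """A string is its byte length (as a long) followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


ENCODERS = {"int": encode_long, "string": encode_string}


def encode_union(branches, branch, value):
    """A union is the zero-based index of the chosen branch, then the value."""
    return encode_long(branches.index(branch)) + ENCODERS[branch](value)


def encode_foo(x_branches, branch, value):
    """Record Foo is just its single field x: no field names, tags, or length."""
    return encode_union(x_branches, branch, value)


before = encode_foo(["int", "string"], "int", 7)
after = encode_foo(["int", "string", "float"], "int", 7)
print(before, before == after)  # b'\x00\x0e' True
```

Appending "float" leaves the indexes of "int" and "string" unchanged, so data 
written with the shorter union still decodes with the longer one.  Adding a 
field to Foo would instead append bytes that a reader holding the old schema 
has no way to detect or skip.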

In the strict streaming case you could reset the schema entirely each time 
metadata is dumped.  However, that would prohibit random-access operations.

Perhaps the schema should instead be dumped once per block?  Random access 
already requires that you find a block start.  Changing the schema would then 
force a block flush.  If we go this way we might also switch to using a binary 
format for the schema, and/or increase the block size.  Note that the 
DatumReader has a setSchema() method, so each time one seeks to a new block, 
the container could inform the DatumReader of the new schema, so that it 
could appropriately handle, e.g., new or missing fields.
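
A rough in-memory sketch of the schema-per-block idea, in Python.  Everything 
here is hypothetical: BlockWriter, RecordingDatumReader and read_block are 
invented for illustration, blocks hold plain Python values rather than encoded 
bytes, and only the notion of a setSchema()-style call comes from the existing 
DatumReader.

```python
import json


class BlockWriter:
    """Hypothetical writer: each block carries the schema its entries used."""

    def __init__(self, schema, block_size=2):
        self.schema = schema
        self.block_size = block_size
        self.pending = []
        self.blocks = []  # each block is (schema_json, [entries])

    def set_schema(self, schema):
        # Changing the schema forces a block flush, so no block mixes schemas.
        self.flush()
        self.schema = schema

    def append(self, entry):
        self.pending.append(entry)
        if len(self.pending) >= self.block_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.blocks.append((json.dumps(self.schema), self.pending))
            self.pending = []


class RecordingDatumReader:
    """Stand-in for a DatumReader; it only remembers the schema it was given."""

    def __init__(self):
        self.schema = None

    def set_schema(self, schema):
        self.schema = schema

    def read(self, entry):
        return (list(self.schema), entry)


def read_block(blocks, index, datum_reader):
    """Seek straight to one block and hand its schema to the reader first."""
    schema_json, entries = blocks[index]
    datum_reader.set_schema(json.loads(schema_json))
    return [datum_reader.read(e) for e in entries]


writer = BlockWriter(["int", "string"])
for entry in (1, "a", 2):
    writer.append(entry)
writer.set_schema(["int", "string", "float"])  # flushes the partial block
writer.append(1.5)
writer.flush()

reader = RecordingDatumReader()
print(len(writer.blocks))                  # 3
print(read_block(writer.blocks, 2, reader))
```

Because each block is self-describing, a streaming reader never needs the end 
of the file, and a random-access reader needs nothing beyond the block it 
lands on.  The cost is the repeated schema, which is why a binary schema 
format or a larger block size might be worth considering.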

> file format should be friendly to streaming
> -------------------------------------------
>
>                 Key: AVRO-160
>                 URL: https://issues.apache.org/jira/browse/AVRO-160
>             Project: Avro
>          Issue Type: Improvement
>          Components: spec
>            Reporter: Doug Cutting
>
> It should be possible to stream through an Avro data file without seeking to 
> the end.
> Currently the interpretation is that schemas written to the file apply to all 
> entries before them.  If this were changed so that they instead apply to all 
> entries that follow, and the initial schema is written at the start of the 
> file, then streaming could be supported.
> Note that the only change permitted to a schema as a file is written is, 
> if it is a union, to add new branches at the end of that union.  If it is not 
> a union, no changes may be made.  So it is still the case that the final 
> schema in a file can read every entry in the file and thus may be used to 
> randomly access the file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
