[jira] Updated: (AVRO-160) file format should be friendly to streaming

Doug Cutting (JIRA) Fri, 18 Dec 2009 16:48:43 -0800

     [ https://issues.apache.org/jira/browse/AVRO-160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Doug Cutting updated AVRO-160:
------------------------------

    Attachment: AVRO-160.patch

> Would it be fair to add to spec.xml, that the file metadata
> { type: map, key-type: string, value-type: bytes }

Yes, I've added a schema to the spec.  BTW, maps don't have key-types,
so it's just {type: map, values: bytes}.

> in.read(magic);
> Should this throw an exception if in.read(magic) didn't return magic.length?

In this case it doesn't matter.  Java initializes byte arrays with
zeros, and the expected value has no zero bytes, so if it doesn't read
the entire thing the comparison will always fail.  But I've changed it
to vin.readFixed(magic), since that has readFully() semantics and is
what we use to read sync markers.
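
For illustration, here is a minimal sketch of the difference between a plain read() and readFully() semantics, using only java.io.  The class name, the method name and the magic value are assumptions made for this example, and DataInputStream stands in for vin.readFixed(); this is not the code in the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class MagicCheck {
    // Illustrative magic value, assumed for this example only.
    static final byte[] MAGIC = {'O', 'b', 'j', 1};

    /**
     * Returns true if the stream starts with MAGIC.
     * Throws EOFException if the stream ends before MAGIC.length bytes are read.
     */
    public static boolean hasMagic(InputStream in) throws IOException {
        byte[] buf = new byte[MAGIC.length];
        // readFully() keeps reading until the buffer is full or the stream ends,
        // whereas InputStream.read(byte[]) may return after a partial read.
        new DataInputStream(in).readFully(buf);
        return Arrays.equals(buf, MAGIC);
    }

    public static void main(String[] args) throws IOException {
        byte[] good = {'O', 'b', 'j', 1, 42};
        System.out.println(hasMagic(new ByteArrayInputStream(good)));
        try {
            hasMagic(new ByteArrayInputStream(new byte[] {'O', 'b'}));
        } catch (EOFException e) {
            System.out.println("truncated header");
        }
    }
}
```

With readFully() semantics a short header is reported as an explicit end-of-file error instead of a failed comparison against leftover zero bytes.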

> this.vin = new BinaryDecoder(in);
> I think we should use the specific API here if we can.

I'm fine with adding the header schema to the spec, but I'm not eager
to use specific to implement the header in this patch.  For one thing,
it makes bootstrapping harder.  The build already has some wacky stuff
so that the specific compiler is compiled before the IPC code which
depends on specific compiler output.  Perhaps we should really
re-organize the code tree into stuff that depends on specific output
and stuff that does not, but that would separate files that are
otherwise closely related.  Or we could add a special comment to files
required to compile the specific compiler and use an ant <contains>
filter to compile those first.  In any case, can we address this
separately?

> public synchronized byte[] getMeta(String key)
> Why does this need to be synchronized?

It doesn't any longer.  This is a relic from when metadata was
read/write.  Good catch.  Fixed.

> public synchronized D next(D reuse) throws IOException {
> As you suggested in person, this API is a bit broken for iteration

I've now provided an iterator API.

> long blockCount; // # entries in block 
> I was surprised that blockCount was decremented

I changed the name of the variable.

> /** Move to the specified synchronization point, as returned by
> {@link DataFileWriter#sync()}. */
> I'm a bit lost as to what that comment means.

I updated that comment.

> if (j == sync.length) { /* position before sync */
>   sin.seek(sin.tell() - DataFileConstants.SYNC_SIZE);
> Does this work?

Probably not.  I forgot to update it, and it's never had tests.  I've
updated it now and added a test.
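
As a rough sketch of the idea (not the code in the patch), scanning forward for a sync marker and reporting the position just before it can look like the following.  It works on an in-memory byte array rather than a seekable stream, and the class and method names are hypothetical.

```java
public class SyncScan {
    /**
     * Returns the offset of the first occurrence of sync in data at or
     * after start (that is, the position just before the marker), or -1
     * if the marker does not occur.
     */
    public static int findSync(byte[] data, int start, byte[] sync) {
        for (int i = Math.max(start, 0); i + sync.length <= data.length; i++) {
            int j = 0;
            // Count how many marker bytes match at offset i.
            while (j < sync.length && data[i + j] == sync[j]) {
                j++;
            }
            if (j == sync.length) {
                return i; // position before sync
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 9, 9, 3};
        byte[] sync = {9, 9};
        System.out.println(findSync(data, 0, sync));
        System.out.println(findSync(data, 3, sync));
    }
}
```

A streaming reader would do the same matching against bytes as they arrive and then seek back by the marker length, which is what the quoted snippet attempts.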

> ((SeekableBufferedInput)in).seek(position);
> These casts feel icky.

I replaced this with a field.

> DataFileWriter: appendTo, create
> Why are these synchronized?

Things used in Hadoop InputFormats should be thread safe to make them
easy to use from multi-threaded mappers.  SequenceFile is thread-safe
for this reason, and we want this to be a drop-in replacement for
SequenceFile.
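
To make the concern concrete, here is a small sketch of several threads appending through one shared writer.  The class below is a hypothetical stand-in, not DataFileWriter; it only assumes that the append path is guarded by synchronized methods.

```java
import java.util.ArrayList;
import java.util.List;

public class SyncAppend {
    private final List<String> records = new ArrayList<>();

    // synchronized makes each append atomic with respect to other appends.
    public synchronized void append(String record) {
        records.add(record);
    }

    public synchronized int size() {
        return records.size();
    }

    /** Appends perThread records from each of the given number of threads. */
    public static int run(int threads, int perThread) throws InterruptedException {
        SyncAppend writer = new SyncAppend();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    writer.append("r");
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        return writer.size();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(4, 1000));
    }
}
```

Without the synchronized keyword the unsynchronized ArrayList could lose records or throw under concurrent appends, which is the kind of surprise a multi-threaded mapper should not have to guard against.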

> Would be great to have tests for trying to setMeta() when appending, or after 
> file has had records in it.

Yes, lots more tests would be good, including that.

> I didn't see any tests for the random access stuff.

I've added one now.


> file format should be friendly to streaming
> -------------------------------------------
>
>                 Key: AVRO-160
>                 URL: https://issues.apache.org/jira/browse/AVRO-160
>             Project: Avro
>          Issue Type: Improvement
>          Components: spec
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>         Attachments: AVRO-160.patch, AVRO-160.patch
>
>
> It should be possible to stream through an Avro data file without seeking to 
> the end.
> Currently the interpretation is that schemas written to the file apply to all 
> entries before them.  If this were changed so that they instead apply to all 
> entries that follow, and the initial schema is written at the start of the 
> file, then streaming could be supported.
> Note that the only change permitted to a schema as a file is written is,
> if it is a union, to add new branches at the end of that union.  If it is not 
> a union, no changes may be made.  So it is still the case that the final 
> schema in a file can read every entry in the file and thus may be used to 
> randomly access the file.
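
The rule in the description above can be sketched as a simple prefix check.  This is an illustration under assumptions: a union is modeled as a list of branch type names, and the class and method names are hypothetical rather than part of Avro.

```java
import java.util.Arrays;
import java.util.List;

public class UnionGrowth {
    /**
     * Returns true if newBranches keeps every old branch in its original
     * position and only adds branches at the end.
     */
    public static boolean isAllowedChange(List<String> oldBranches, List<String> newBranches) {
        return newBranches.size() >= oldBranches.size()
            && newBranches.subList(0, oldBranches.size()).equals(oldBranches);
    }

    public static void main(String[] args) {
        List<String> before = Arrays.asList("null", "string");
        // Appending a branch at the end is permitted.
        System.out.println(isAllowedChange(before, Arrays.asList("null", "string", "long")));
        // Reordering existing branches is not.
        System.out.println(isAllowedChange(before, Arrays.asList("string", "null")));
    }
}
```

Because old branches keep their positions, data written under an earlier schema in the file still resolves to the same branch under the final one.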

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
