Daniel,

I think your approach will work. Currently, for the DataFileReader to work, 
the input just needs to stream. sizeBytes() will add an additional constraint: 
we must be able to compute the size a priori. I think that is okay.

Please go ahead.

Thanks

Thiru

PS. Instead of discussing this over e-mail, it is better to do it in a JIRA 
ticket. People will have ready access to the discussion in the future. Please 
open a ticket as soon as you can. Thank you.


________________________________
 From: Daniel Russel <[email protected]>
To: [email protected]; Thiruvalluvan MG <[email protected]> 
Sent: Tuesday, 29 January 2013 1:01 AM
Subject: Re: Seeks with DataFileReader in C++
 

On Jan 24, 2013, at 8:46 AM, Thiruvalluvan MG <[email protected]> wrote:

> Daniel,
> 
> I think it is a good use case. One way to achieve what you want is to:
> 
> 1. Expose the existing members objectCount_ and byteCount_ of 
> DataFileReaderBase as size_t objectsRemainingInBlock() and size_t 
> bytesRemainingInBlock() in DataFileReader class.
> 2. Add a new method in DataFileReader class void skip(size_t n), which skips 
> n objects.
> 3. If you prefer you can add skipBlock() which is a shorthand for 
> skip(objectsRemainingInBlock()).
> 
> Does it work for you?
Quite possibly. My main concern with the above API is that it (from what I 
understand) still forces the Reader to go through and inspect each block 
sequentially. I had been thinking of an API more like:
- void seekBytes(size_t offset); // seek to the start of the first block that 
does not start before offset by seeking there and then scanning for a sync mark
- size_t offsetBytes() const; // get the current offset in the file
- size_t sizeBytes() const; // get the size of the file
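
As a rough illustration of the seekBytes() idea, here is a minimal sketch of the "seek, then scan for a sync mark" step. It is not the Avro C++ API: the names SyncMarker and seekToNextBlock are hypothetical, an in-memory byte vector stands in for the file, and the only Avro fact relied on is that each block is followed by the file's 16-byte sync marker. It assumes C++17 for std::optional.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical alias: an Avro container file uses a 16-byte sync marker.
using SyncMarker = std::array<std::uint8_t, 16>;

// Hypothetical helper, not part of Avro. Returns the offset of the first
// block that does not start before `offset`, where a block is taken to
// start immediately after a sync marker. Returns std::nullopt if no sync
// marker ends at or after `offset`.
std::optional<std::size_t> seekToNextBlock(const std::vector<std::uint8_t>& file,
                                           const SyncMarker& sync,
                                           std::size_t offset) {
    if (offset > file.size()) {
        return std::nullopt;
    }
    // Back up by one marker length so that a block starting exactly at
    // `offset` (its marker ends at `offset`) is still found.
    const std::size_t start = offset >= sync.size() ? offset - sync.size() : 0;
    const auto first = file.begin() + static_cast<std::ptrdiff_t>(start);
    const auto it = std::search(first, file.end(), sync.begin(), sync.end());
    if (it == file.end()) {
        return std::nullopt;
    }
    return static_cast<std::size_t>(it - file.begin()) + sync.size();
}
```

A real implementation would read from the underlying stream in chunks rather than hold the whole file in memory, which is where the requirement that the input be seekable (and its size known) comes in.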

That would (I think) 
- provide constant-time access to objects deep in the file
- allow the construction of indexes for the data file by, for example, seeking 
to each i/1000 of the file and saving the resulting offset (and an identifier 
extracted from the object)
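
The index construction described above could be sketched roughly as follows. This is only an illustration under stated assumptions: buildSparseIndex and SeekFn are hypothetical names, and the seekBytes callback stands in for the proposed seekBytes()/offsetBytes() pair by returning the start of the first block at or after a given offset (or nothing if there is none). A fuller version would store an identifier extracted from the first object of each block next to its offset. It assumes C++17.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical callback: given a byte offset, return the start of the first
// block that does not start before it, or std::nullopt if none exists.
using SeekFn = std::function<std::optional<std::size_t>(std::size_t)>;

// Hypothetical helper: probe the file at `samples` evenly spaced offsets
// (i/samples of the file for i = 0 .. samples-1) and record the distinct
// block starts that the probes land on, in increasing order.
std::vector<std::size_t> buildSparseIndex(std::size_t sizeBytes,
                                          std::size_t samples,
                                          const SeekFn& seekBytes) {
    std::vector<std::size_t> index;
    if (samples == 0) {
        return index;
    }
    for (std::size_t i = 0; i < samples; ++i) {
        // Equivalent to sizeBytes * i / samples, written to avoid overflow
        // of the intermediate product on very large files.
        const std::size_t probe =
            (sizeBytes / samples) * i + (sizeBytes % samples) * i / samples;
        const std::optional<std::size_t> block = seekBytes(probe);
        if (!block) {
            // Assumes seekBytes is monotone: if no block follows this
            // probe, none follows any later probe either.
            break;
        }
        // Several probes may land in the same block; keep one entry each.
        if (index.empty() || index.back() != *block) {
            index.push_back(*block);
        }
    }
    return index;
}
```

With 1000 samples this costs at most 1000 seeks regardless of how many blocks the file holds, at the price of the index being only as fine as the probe spacing.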

The cost would be lower precision: finding the nth record requires that you 
be able to identify it and, possibly, do a search, and you must be able to 
identify objects based solely on the context (as determining an object's index 
in the file would still require a linear scan). It also requires that Avro be 
able to find a sync mark starting from an arbitrary point in the file but, 
based on my understanding, that is a valid assumption (please correct me if 
I'm wrong).

         --Daniel
