On Mar 12, 2010, at 9:35 AM, Tim Sell wrote:

> excellent! thanks for the response :)
> 
I have committed a large dataset to using the current format.  The current 
format will not be abandoned.

The current format has its limitations.  It is optimized for large numbers of 
small records (roughly under 2 KB each), and probably should not be used for records 
significantly larger than 1 MB.  Essentially, it is built for the typical 
Hadoop processing use case as well as for structured data storage.

The main drawbacks are:
* Synchronous logging -- the file is written in block-sized chunks, so if one wants 
to commit a record to disk as soon as possible, each record has to be its own 
block, which is inefficient (see the sketch after this list).
* Large records -- blocks are read as a whole, and currently need to fit in 
memory in some implementations (including Java).  We could relax this 
requirement for some compression codecs.
* Large records -- the final block size has to be known before writing; 
currently this is done by buffering in memory while writing.
* One schema -- each file has one schema for all records within.  This is a 
very good simplification for most needs, but one cannot merge or concatenate 
two files with different schemas, even for the most minor schema difference. 
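
To make the synchronous-logging point concrete, here is a minimal sketch against 
the Java implementation (assuming the DataFileWriter API as it stands today; the 
LogEntry schema and file name are made up for illustration).  Forcing each record 
to be durable right away means ending the current block after every append, so 
every record carries its own block header and sync marker:

  import java.io.File;
  import java.io.IOException;

  import org.apache.avro.Schema;
  import org.apache.avro.file.DataFileWriter;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.generic.GenericRecord;

  public class PerRecordSync {
    public static void main(String[] args) throws IOException {
      // Hypothetical one-field schema, just for illustration.
      Schema schema = Schema.parse(
          "{\"type\":\"record\",\"name\":\"LogEntry\",\"fields\":"
          + "[{\"name\":\"message\",\"type\":\"string\"}]}");

      DataFileWriter<GenericRecord> writer =
          new DataFileWriter<GenericRecord>(
              new GenericDatumWriter<GenericRecord>(schema));
      writer.create(schema, new File("log.avro"));

      for (int i = 0; i < 10; i++) {
        GenericRecord r = new GenericData.Record(schema);
        r.put("message", "event " + i);
        writer.append(r);
        // End the current block and emit a sync marker, then flush the
        // buffered bytes -- each record becomes its own (tiny) block.
        writer.sync();
        writer.flush();
      }
      writer.close();
    }
  }

With the normal usage (no per-record sync), many records are batched into each 
block, which is where the format's efficiency comes from.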

Use cases that push the boundaries above may require a new and different file 
format, or perhaps some sort of extension to the current format.

-Scott

> On 12 March 2010 17:30, Doug Cutting <[email protected]> wrote:
>> Tim Sell wrote:
>>> 
>>> But we're wondering if the file format is set in stone now
>> 
>> It should not change again.  It did not seem that any were yet using the
>> prior format, and it had some bad limitations, so we revised it.  If it ever
>> does change again, we would require implementations to be back-compatible,
>> still able to read the old format.
>> 
>> Doug
>> 
