Yikes, but this assumes that your external code can grok whatever strange thing is stored in the writable. This is a non-trivial assumption when you go multi-lingual.

We're going to take a crack at a block-compressed format soon. That should vastly reduce the storage impact of this kind of issue anyway.


On Jun 26, 2006, at 4:34 PM, Paul Sutter wrote:

I agree, there's no easy way around this one without separate interfaces (one where the caller keeps the counts, and one where the writable keeps the counts), and that would be silly.

However, it still seems to me that the key length in the sequence file is redundant. Since each key must write its own length, know its own length, or be able to figure it out (even via the high-speed interface), there's no reason to have that key length in the file.
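
To make the redundancy concrete, here is a minimal sketch in Python, chosen only because the thread is about reading the file from another language. It assumes a simplified record framing of record length, key length, key bytes, value bytes, and a hypothetical key type that prefixes its own 4-byte length. The function names and the key encoding are illustrative assumptions, not the actual SequenceFile code.

```python
import io
import struct


def write_key(out, key: bytes) -> None:
    # Hypothetical self-delimiting key: 4-byte big-endian length, then the bytes.
    out.write(struct.pack(">i", len(key)))
    out.write(key)


def read_key(inp) -> bytes:
    # The key finds its own end without any help from the record header.
    (n,) = struct.unpack(">i", inp.read(4))
    return inp.read(n)


def write_record(out, key: bytes, value: bytes) -> None:
    kbuf = io.BytesIO()
    write_key(kbuf, key)
    serialized_key = kbuf.getvalue()
    # Assumed framing: record length, key length, key bytes, value bytes.
    out.write(struct.pack(">ii", len(serialized_key) + len(value), len(serialized_key)))
    out.write(serialized_key)
    out.write(value)


def read_record(inp):
    record_len, key_len = struct.unpack(">ii", inp.read(8))
    start = inp.tell()
    key = read_key(inp)
    consumed = inp.tell() - start
    # The stored key length only repeats what the key already told us.
    assert consumed == key_len
    value = inp.read(record_len - consumed)
    return key, value
```

In this sketch a reader such as an external sort sees two lengths for every key: the one in the record header and the one inside the key itself. Dropping the header field would save 4 bytes per record, but only for key types that can always locate their own end.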

Why do I care about 4 bytes per record? Because we're integrating an external sort, and right now it has to look at a record with two key lengths. And I assume that others (such as Yahoo) will want to incorporate an external sort. And if we're going to be reading the sequence file in another language, we might as well be sure about the format to use.

Thanks!

Paul

On 6/26/06, Doug Cutting <[EMAIL PROTECTED]> wrote:

Eric Baldeschwieler wrote:
> Can we turn this around and assume that writables will be given a stream
> and a length when they read? That would also let us remove redundant
> info...

Unless I misunderstand, that would make it harder to nest writables,
since all containers would need to store the length.  Currently only
top-level containers (SequenceFile and the RPC protocol) need to write
lengths.  Even these are optional, used only to optimize things.
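
The nesting point can be sketched roughly as follows, in Python for brevity. This is an illustration under assumptions, not the Writable API: write_text, write_pair, and write_framed are hypothetical names, and the 2-byte and 4-byte length prefixes are arbitrary choices. The nested pair writes no length of its own; only the top-level framing does, and there it serves only to let a reader skip a record without decoding it.

```python
import io
import struct


def write_text(out, s: str) -> None:
    # Leaf value: knows its own length (2-byte prefix, an assumed encoding).
    data = s.encode("utf-8")
    out.write(struct.pack(">H", len(data)))
    out.write(data)


def read_text(inp) -> str:
    (n,) = struct.unpack(">H", inp.read(2))
    return inp.read(n).decode("utf-8")


def write_pair(out, pair) -> None:
    # Nested container: writes its fields back to back, with no length of its own.
    write_text(out, pair[0])
    write_text(out, pair[1])


def read_pair(inp):
    first = read_text(inp)
    second = read_text(inp)
    return (first, second)


def write_framed(out, pair) -> None:
    # Top-level container: the only place a total length is written.
    body = io.BytesIO()
    write_pair(body, pair)
    payload = body.getvalue()
    out.write(struct.pack(">i", len(payload)))
    out.write(payload)


def read_framed(inp):
    inp.read(4)  # the length is not needed to decode, only to skip
    return read_pair(inp)


def skip_framed(inp) -> None:
    # The optimization the length buys: jump over a record without parsing it.
    (n,) = struct.unpack(">i", inp.read(4))
    inp.seek(n, io.SEEK_CUR)
```

If every reader were instead handed a stream and a length, write_pair would have to store a length for each field it contains so it could pass one down when reading, which is the extra bookkeeping described above.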

Doug

