Re: Questions

Owen O'Malley Tue, 30 Apr 2013 21:02:56 -0700

> > 2, The in-memory format that supports either ValueVector, RLE or Dict, I
> > assume RLE or Dict will be leveraging either Orc or Parquet right?
> >
> >
> Kind of.  RLE and Dict are abstractions where a particular operator can take
> advantage of the nature of that encoding.  Parquet and ORC are really
> container formats as opposed to field level formats.



Not really. Unless you mean something very specific that I'm missing, they
are field level formats. ORC relies on the fact that the types are known to
pick the right encoder for each column. For example, ORC uses RLE for
integer data. (In fact, because the dictionary encoding itself stores integer
data, string columns use RLE as well.) In some cases, the ORC writer has a choice of
encodings, but it is focused on picking the right encoding for a particular
set of data. For example, if a string column has enough duplicated values
it will choose a dictionary encoder instead of a direct encoder. But it is
certainly not the case that ORC is a container format where the choice of
serialization is an additional choice.
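
To make the per-column choice concrete, here is a rough sketch of the idea, not ORC's actual writer code. The names rle_encode, encode_string_column, and DICTIONARY_THRESHOLD are made up for illustration, the threshold value is an assumption, and the run-length scheme is far simpler than what a real writer would use. It only shows that the encoder is picked from the column's type and data, and that a dictionary-encoded string column ends up holding integer ids that can themselves be run-length encoded.

```python
from typing import List, Tuple

# Hypothetical cutoff: use a dictionary when distinct/total is at or below this.
DICTIONARY_THRESHOLD = 0.8


def rle_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Collapse runs of equal integers into (value, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            # Same value as the previous run: extend that run by one.
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


def encode_string_column(values: List[str]) -> dict:
    """Pick dictionary or direct encoding based on how repetitive the data is."""
    distinct = sorted(set(values))
    if values and len(distinct) / len(values) <= DICTIONARY_THRESHOLD:
        # Enough duplicates: store each distinct string once and replace the
        # column with integer ids, which are then run-length encoded.
        index = {s: i for i, s in enumerate(distinct)}
        ids = [index[s] for s in values]
        return {
            "encoding": "dictionary",
            "dictionary": distinct,
            "ids": rle_encode(ids),
        }
    # Mostly unique values: a dictionary would not pay off, store them directly.
    return {"encoding": "direct", "data": list(values)}


if __name__ == "__main__":
    print(encode_string_column(["us", "us", "us", "uk", "uk", "us"]))
    print(encode_string_column(["a", "b", "c"]))
```

Note that the caller never chooses a serialization here; the encoding falls out of the column type and the values seen, which is the point being made above.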

Unlike RCFile, SequenceFile, TFile, or HFile, it doesn't make sense to
store ProtoBuf or Writables in an ORC file. One of the amusing
characteristics of these new file formats is EXACTLY that. In 2 years, I
would be surprised if anyone is writing new data to files in ProtoBuf,
Thrift, or Avro. It will be one of these new formats. That is a big change.

-- Owen
