> > 2, The in-memory format that supports either ValueVector, RLE or Dict, I
> > assume RLE or Dict will be leveraging either Orc or Parquet right?
>
> Kind of. RLE and Dict are abstraction where a particular operator can take
> advantage of the nature of that encoding. Parquet and ORC are really
> container formats as opposed to field level formats.

Not really. Unless you mean something very specific that I'm missing, they are field-level formats. ORC relies on the fact that the types are known to pick the right encoder for each column. For example, ORC uses RLE for integer data. (In fact, so do string columns, because the dictionary encoding includes integer data.) In some cases, the ORC writer has a choice of encodings, but it is focused on picking the right encoding for a particular set of data. For example, if a string column has enough duplicated values, it will choose a dictionary encoder instead of a direct encoder.

But it is certainly not the case that ORC is a container format where the serialization is a separate choice. Unlike RCFile, SequenceFile, TFile, or HFile, it doesn't make sense to store ProtoBuf or Writables in an ORC file.

One of the amusing characteristics of these new file formats is EXACTLY that. In 2 years, I would be surprised if anyone is writing new data to files in ProtoBuf, Thrift, or Avro. It will be one of these new formats. That is a big change.

-- Owen
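
To make the dictionary-versus-direct point concrete, here is a rough Python sketch of how a writer might pick an encoding for a string column. It is not ORC's actual code: the function names, the 0.8 threshold, and the simplified run-length scheme are all assumptions made for illustration.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs.

    This is a simplified RLE for illustration, not ORC's wire format.
    """
    return [(v, len(list(group))) for v, group in groupby(values)]


def encode_string_column(values, threshold=0.8):
    """Choose dictionary or direct encoding from the distinct-value ratio.

    `threshold` is a hypothetical cutoff: when the share of distinct values
    is at or below it, the column has enough duplicates for a dictionary.
    """
    distinct = sorted(set(values))
    if values and len(distinct) / len(values) <= threshold:
        index = {s: i for i, s in enumerate(distinct)}
        ids = [index[s] for s in values]
        # The dictionary ids are integers, so they go through RLE too.
        return {
            "encoding": "dictionary",
            "dictionary": distinct,
            "ids": run_length_encode(ids),
        }
    return {"encoding": "direct", "data": list(values)}


repeated = encode_string_column(["a", "a", "a", "b"])
unique = encode_string_column(["a", "b", "c"])
```

With these inputs, `repeated` comes back dictionary-encoded with its integer ids run-length encoded, while `unique` stays direct because every value is distinct.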
