+1 to all of the ideas
If we are cool with incompatible changes…

* Allow dictionary for VARBINARY
* Disallow old encodings in new files (e.g., no v1)
* Fix DATE encoding epoch
* Rearrange stripe so index is next to footer so a single IOP can get all data
* Change metastore properties so there is a logical mapping from column names to physical column identifiers so columns can be renamed
* New timestamp encoding with fixed size per file, similar to decimal
* For compression like zstd, we may want to ship a compression dictionary for a stream

Stuff we could do today:

* A flag that says if CHAR or VARCHAR contain any multi-byte characters (isAsciiOnly)
* Max character count for CHAR or VARCHAR (so we don't need to check length for schema changes)
* Max length for VARBINARY (easier to estimate memory usage)
* Truncated MIN/MAX for VARBINARY/CHAR/VARCHAR

For the new encodings, we should pick encodings that play well with vectorization, which is coming in Java 10 (Java 9 also has vastly improved auto vectorization).

-dain

> On Aug 4, 2017, at 9:29 AM, Owen O'Malley <[email protected]> wrote:
>
> All,
> We've started the process of updating the encodings for ORC. These
> changes are going to extend the format in ways that aren't forward
> compatible. (e.g., the ORC 1.4 readers won't be able to read the new format.)
>
> The changes that I've heard about are:
> * Decimal encoding - this will likely be separated into two categories
>    + precision <= 18
>    + precision > 18
>   In both cases the precision and scale will be fixed for the entire file
>   rather than per value.
> * a new Float/Double encoding
> * a new RLE encoding
>
> Are there other encodings that we should consider adding?
>
> We haven't made forward incompatible changes in a while. Currently the ORC
> Writer can write either:
> * Hive 0.11 ORC files
> * Hive 0.12 ORC files
>
> So I'd like to propose that we add a new ORC 2.0 file version and all of
> these changes need to be so tagged.
>
> The new ORC writers will maintain the ability to write the old versions of
> the files (Hive 0.11 ORC and Hive 0.12 ORC) as well as the ORC 2.0 files.
> The new reader will automatically read all three versions.
>
> Thoughts?
>
> Owen
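
To make the "Truncated MIN/MAX for VARBINARY/CHAR/VARCHAR" item concrete, here is a rough sketch of how truncated bounds could stay valid for predicate pushdown. The class and method names (TruncatedStats, truncateMin, truncateMax) are hypothetical and not part of the ORC API, and the sketch assumes values are compared as unsigned bytes. It works on raw bytes, so for CHAR/VARCHAR a real implementation would probably also want to cut on a UTF-8 character boundary, which this does not attempt.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the ORC API.
final class TruncatedStats {

    // A prefix never sorts after the full value in unsigned byte order,
    // so a truncated minimum is still a valid lower bound.
    static byte[] truncateMin(byte[] min, int maxLen) {
        return min.length <= maxLen ? min : Arrays.copyOf(min, maxLen);
    }

    // A bare prefix of the maximum would sort before it, so increment the
    // last kept byte that is not 0xFF and drop everything after it.
    // Returns null when every kept byte is 0xFF (no short upper bound exists).
    static byte[] truncateMax(byte[] max, int maxLen) {
        if (max.length <= maxLen) {
            return max;
        }
        for (int i = maxLen - 1; i >= 0; i--) {
            if ((max[i] & 0xFF) != 0xFF) {
                byte[] out = Arrays.copyOf(max, i + 1);
                out[i]++;
                return out;
            }
        }
        return null;
    }

    // Lexicographic comparison that treats bytes as unsigned.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] max = "zebra crossing".getBytes(StandardCharsets.UTF_8);
        byte[] bound = truncateMax(max, 5);
        // The 5-byte bound must still sort after the original maximum.
        System.out.println(new String(bound, StandardCharsets.UTF_8));
        System.out.println(compareUnsigned(bound, max) > 0);
    }
}
```

The asymmetry is the point: the minimum can simply be cut, while the maximum has to be rounded up, and a reader would need to treat a missing (null) maximum as "unbounded" rather than as an empty value.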
