Re: [DISCUSS] ORC 2.0

Dain Sundstrom Fri, 04 Aug 2017 10:41:14 -0700

+1 to all of the ideas

If we are cool with incompatible changes…
 * Allow dictionary for VARBINARY
 * Disallow old encodings in new files (e.g., no v1)
 * Fix DATE encoding epoch
 * Rearrange stripe so index is next to footer so a single IOP can get all data
 * Change metastore properties so there is a logical mapping from column names 
to physical column identifiers so columns can be renamed
 * New timestamp encoding with fixed size per file, similar to decimal
 * For compression like zstd, we may want to ship a compression dictionary for 
a stream

Stuff we could do today
 * A flag that says if CHAR or VARCHAR contain any multibyte characters 
(isAsciiOnly)
 * Max character count for CHAR or VARCHAR (so we don’t need to check length 
for schema changes)
 * Max length for VARBINARY (easier to estimate memory usage)
 * Truncated MIN/MAX for VARBINARY/CHAR/VARCHAR
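
To make the statistics in this list concrete, here is a rough sketch of how a writer might compute them over UTF-8 or raw bytes. The class and method names are hypothetical and not part of the ORC API, and the byte-wise unsigned ordering is an assumption. Note that the truncated max has to be rounded up to stay a valid upper bound, and that a truncated bound may no longer be valid UTF-8.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not an ORC class.
final class StringStatsSketch {

    // True when no byte has the high bit set, i.e. the UTF-8 data is pure ASCII.
    static boolean isAsciiOnly(byte[] utf8) {
        for (byte b : utf8) {
            if (b < 0) {
                return false;
            }
        }
        return true;
    }

    // Counts code points by skipping UTF-8 continuation bytes (10xxxxxx).
    static int characterCount(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if ((b & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    // A prefix never sorts after the full value, so it is a valid lower bound.
    static byte[] truncateMin(byte[] min, int limit) {
        return min.length <= limit ? min : Arrays.copyOf(min, limit);
    }

    // The upper bound must be rounded up: drop trailing 0xFF bytes from the
    // prefix and increment the last remaining byte.
    // Returns null when no bound of at most `limit` bytes exists.
    static byte[] truncateMax(byte[] max, int limit) {
        if (max.length <= limit) {
            return max;
        }
        int end = limit;
        while (end > 0 && max[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            return null;
        }
        byte[] bound = Arrays.copyOf(max, end);
        bound[end - 1]++;
        return bound;
    }
}
```

With a limit of 3 bytes, a max of "abcdef" would be stored as "abd", which still sorts at or above every value in the column.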

For the new encodings, we should pick encodings that play well with 
vectorization which is coming in Java 10 (Java 9 also has vastly improved auto 
vectorization).

-dain

> On Aug 4, 2017, at 9:29 AM, Owen O'Malley <[email protected]> wrote:
> 
> All,
>  We've started the process of updating the encodings for ORC. These
> changes are going to extend the format in ways that aren't forward
> compatible. (E.g., the ORC 1.4 readers won't be able to read the new format.)
> 
> The changes that I've heard about are:
> * Decimal encoding - this will likely be separated into two categories
>   + precision <= 18
>   + precision > 18
>  In both cases the precision and scale will be fixed for the entire file
> rather than per value.
> * a new Float/Double encoding
> * a new RLE encoding
> 
> Are there other encodings that we should consider adding?
> 
> We haven't made forward incompatible changes in a while. Currently the ORC
> Writer can write either:
> * Hive 0.11 ORC files
> * Hive 0.12 ORC files
> 
> So I'd like to propose that we add a new ORC 2.0 file version and all of
> these changes need to be so tagged.
> 
> The new ORC writers will maintain the ability to write the old versions of
> the files (Hive 0.11 ORC and Hive 0.12 ORC) as well as the ORC 2.0 files.
> The new reader will automatically read all three versions.
> 
> Thoughts?
> 
>  Owen
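
On the decimal split in the quoted proposal: the thread does not say why the boundary is 18, but a plausible reason is that every unscaled value with at most 18 decimal digits fits in a signed 64-bit integer (10^18 - 1 is below 2^63 - 1, while 10^19 - 1 is not). A minimal sketch under that assumption, with a file-wide scale; the class and method names are hypothetical and not part of ORC:

```java
import java.math.BigDecimal;

// Hypothetical helper for illustration only; not an ORC class.
final class DecimalSketch {

    // Largest precision whose unscaled values always fit in a signed long.
    static final int MAX_LONG_PRECISION = 18;

    static boolean fitsInLong(int precision) {
        return precision <= MAX_LONG_PRECISION;
    }

    // Rescales the value to the file-wide scale and returns the unscaled long.
    // Throws ArithmeticException if rescaling would need rounding or if the
    // unscaled value does not fit in a long.
    static long toUnscaledLong(BigDecimal value, int fileScale) {
        return value.setScale(fileScale).unscaledValue().longValueExact();
    }

    // Rebuilds the decimal from the stored long and the file-wide scale.
    static BigDecimal fromUnscaledLong(long unscaled, int fileScale) {
        return BigDecimal.valueOf(unscaled, fileScale);
    }
}
```

Because the scale is fixed per file, only the long needs to be stored per value; precision above 18 would need a wider representation.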
