Marking categorical data in Parquet schemas

Uwe L. Korn Thu, 06 Apr 2017 02:35:38 -0700

Hello,

we often want to treat some columns as categorical data [1] (also called
factors [2] in R) in memory. Such a column can only take a limited number
of distinct values. These types can also have an ordering. In Apache
Arrow, we have defined the DictionaryType [3] for this. It takes an index
(also called categories in some contexts) and the actual data as a
separate integral array. The most common use case is that the
indices/categories are strings, so engines that don't explicitly support
categorical data should treat these columns as UTF8 data.
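
To make the layout concrete, here is a minimal sketch in plain Python. It is
not Arrow's DictionaryType API; the names Categorical, encode and to_utf8 are
hypothetical and only illustrate the split into a list of categories and an
integer array that points into it, plus the fallback to plain strings.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Categorical:
    """Hypothetical stand-in for a dictionary-typed column."""
    categories: List[str]  # the distinct values (the "index")
    codes: List[int]       # one integer per row, pointing into categories
    ordered: bool = False  # whether the categories carry an ordering

    def to_utf8(self) -> List[str]:
        # What an engine without categorical support would see.
        return [self.categories[c] for c in self.codes]


def encode(values: Sequence[str], ordered: bool = False) -> Categorical:
    # Assign codes in order of first appearance.
    categories: List[str] = []
    position = {}
    codes = []
    for v in values:
        if v not in position:
            position[v] = len(categories)
            categories.append(v)
        codes.append(position[v])
    return Categorical(categories, codes, ordered)


col = encode(["low", "high", "low", "medium", "high"])
print(col.categories, col.codes)
```

Calling col.to_utf8() gives back the original strings, which is the view a
reader without categorical support would work with.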


While this is similar to dictionary encoding, and dictionary encoding is
probably the most efficient form to store categorical data, the two are
semantically not the same (e.g. dictionaries are per RowGroup whereas
categories are defined on a per-column basis).
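
The difference can be sketched with a toy encoder in plain Python. This does
not use any Parquet library; dictionary_encode is a hypothetical helper and
the split into two row groups is an assumption made for illustration.

```python
def dictionary_encode(values):
    # Build a dictionary in order of first appearance and map each
    # value to its position in that dictionary.
    dictionary = []
    position = {}
    indices = []
    for v in values:
        if v not in position:
            position[v] = len(dictionary)
            dictionary.append(v)
        indices.append(position[v])
    return dictionary, indices


column = ["red", "green", "red", "blue", "green", "blue"]

# Dictionary encoding: each row group gets its own dictionary.
row_groups = [column[:3], column[3:]]
per_group = [dictionary_encode(g) for g in row_groups]

# Categorical: a single set of categories for the whole column.
categories, codes = dictionary_encode(column)

for dictionary, indices in per_group:
    print(dictionary, indices)
print(categories, codes)
```

In this sketch index 0 means "red" in the first row group but "blue" in the
second, so the per-RowGroup indices cannot be used as category codes without
remapping them against one column-wide set of categories.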

To implement support for categorical data, several options come to my
mind:

1. Add an additional flag / metadata to the schema in
https://github.com/apache/parquet-format/blob/4bddbadf79e20a32152076fbedae0c3ce77fb531/src/main/thrift/parquet.thrift#L220
2. Add a new ConvertedType UTF8_CATEGORICAL
3. Add a new physical type for categoricals (this would equal the
implementation in Arrow)

Number 1 is the only option that would work well with old readers; 2 and 3
would produce files that cannot be read correctly by older
implementations.
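
The compatibility argument can be illustrated with a toy model of an old
reader in plain Python. Everything here is an assumption for illustration: the
schema elements are plain dicts, the "categorical" key is a hypothetical name
for the flag from option 1, and raising an error is only one plausible way an
old reader might react to a converted type it does not know.

```python
# Converted types the hypothetical old reader was built to understand.
KNOWN_CONVERTED_TYPES = {"UTF8"}


def old_reader_decode(schema_element, raw_values):
    converted = schema_element.get("converted_type")
    if converted is None:
        return list(raw_values)
    if converted not in KNOWN_CONVERTED_TYPES:
        # Assumed failure mode; real readers may behave differently.
        raise ValueError(f"unknown converted type: {converted}")
    # Any extra keys (such as "categorical") are simply never looked at.
    return [v.decode("utf-8") for v in raw_values]


raw = [b"low", b"high", b"low"]

# Option 1: existing type plus an additional (hypothetical) flag.
option_1 = {"type": "BYTE_ARRAY", "converted_type": "UTF8", "categorical": True}
# Option 2: a new converted type.
option_2 = {"type": "BYTE_ARRAY", "converted_type": "UTF8_CATEGORICAL"}

decoded = old_reader_decode(option_1, raw)

try:
    old_reader_decode(option_2, raw)
    option_2_readable = True
except ValueError:
    option_2_readable = False

print(decoded, option_2_readable)
```

Under these assumptions the old reader still returns plain strings for option
1 because it ignores the extra flag, while option 2 fails as soon as it meets
the unknown converted type.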

[1] http://pandas.pydata.org/pandas-docs/stable/categorical.html
[2] https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html
[3] https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L576

-- 
  Uwe L. Korn
  [email protected]
