Hello, we often have the case that we want to treat some columns as categorical data [1] (also called factors [2] in R) in memory. Such a column can only take a limited number of distinct values, and these types can also have an ordering. In Apache Arrow, we have defined the DictionaryType [3] for this. It consists of an index (also called the categories in some contexts) and the actual data, stored as a separate integral array that refers to entries in that index. The most common use case is that the index/categories are strings, so for engines that don't explicitly support categorical data, the columns should be treated as UTF8 data.
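To make the layout concrete, here is a minimal sketch in plain Python of a column stored as a list of categories plus an integer array. The names `Categorical`, `encode` and `decode` are hypothetical and only for illustration; this is not the Arrow API, and it assumes string categories assigned in order of first appearance.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Categorical:
    """Hypothetical stand-in for a dictionary-typed column (not an Arrow class)."""
    categories: List[str]  # distinct values, defined once for the whole column
    codes: List[int]       # one integer per row, pointing into `categories`
    ordered: bool = False  # whether the order of the categories is meaningful


def encode(values: Sequence[str], ordered: bool = False) -> Categorical:
    """Build the categories in order of first appearance and map rows to codes."""
    categories: List[str] = []
    position: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in position:
            position[value] = len(categories)
            categories.append(value)
        codes.append(position[value])
    return Categorical(categories, codes, ordered)


def decode(column: Categorical) -> List[str]:
    """What a reader without categorical support would see: plain strings."""
    return [column.categories[code] for code in column.codes]


column = encode(["low", "high", "low", "medium", "high"])
print(column.categories)  # ['low', 'high', 'medium']
print(column.codes)       # [0, 1, 0, 2, 1]
print(decode(column))     # ['low', 'high', 'low', 'medium', 'high']
```

The `decode` step corresponds to the fallback described above: an engine that does not know about categoricals can still materialise the column as ordinary UTF8 values.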
While this is similar to dictionary encoding, and dictionary encoding is probably the most efficient form in which to store categorical data, the two are semantically not the same (e.g. dictionaries are per RowGroup whereas categories are defined on a per-column basis).

To implement support for categorical data, several options come to my mind:

1. Add an additional flag / metadata to the schema in https://github.com/apache/parquet-format/blob/4bddbadf79e20a32152076fbedae0c3ce77fb531/src/main/thrift/parquet.thrift#L220
2. Add a new ConvertedType UTF8_CATEGORICAL
3. Add a new physical type for categoricals (this would equal the implementation in Arrow)

Number 1 is the only option that would work well with old readers; 2 and 3 would produce files that cannot be read correctly by older implementations.

[1] http://pandas.pydata.org/pandas-docs/stable/categorical.html
[2] https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html
[3] https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L576

--
Uwe L. Korn
[email protected]
