hi Uwe,

Thanks for bringing this up.

I have a somewhat different opinion, which is that I don't think
categorical metadata belongs _formally_ in the Parquet format. The
reason is that database systems generally address storage of
categorical data using fact and dimension tables -- if you store data
in Parquet, and your set of categories needs to expand, it's generally
not feasible to modify old data files to account for the expanded
category set.

Parquet uses dictionary encoding as a compression technique (combined
with RLE encoding, see e.g. the low entropy examples in
http://wesmckinney.com/blog/python-parquet-multithreading/). Parquet as
long term storage is distinct from Arrow's use case as an in-memory
data structure and transient IPC format, where handling in-memory
dictionary-encoded / categorical data makes more sense.

I do think it's reasonable to want to faithfully round trip categorical
data from R and Python to the Parquet format -- but rather than changing
the format itself, I would like us to specify a KeyValue metadata
convention that we can all use to maximize interoperability between
implementations.

Because dictionaries vary in size, even when we store them in Parquet
format we'll have a couple of cases to handle:

* Small dictionaries -- all of the Parquet data pages contain
  dictionary indices
* Large dictionaries -- the encoder fell back to PLAIN encoding because
  the dictionary page exceeded a size threshold

In the first case, we can avoid extra hashing/encoding when reading the
file by using the dictionary page directly. In the latter case, if we
want to construct the original in-memory data faithfully, we'll have to
hash.

- Wes

On Thu, Apr 6, 2017 at 5:35 AM, Uwe L. Korn <[email protected]> wrote:
> Hello,
>
> we often have the case that we want to treat some columns as categorical
> data [1] (also called factors [2] in R) in memory. This is a column that
> can only take a limited number of values. These types also can have an
> ordering. In Apache Arrow, we have defined the DictionaryType [3] for
> this.
> It takes an Index (also called categories in some contexts) and the
> actual data as a separate integral array. The most common use case is
> that the indices/categories are strings, thus for engines that don't
> explicitly support categorical data, the columns should be treated as
> UTF8 data.
>
> While this is similar to dictionary encoding, and dictionary encoding
> is probably the most efficient form to store categorical data, they
> are semantically not the same (e.g. dictionaries are per RowGroup
> whereas categories are defined on a per column basis).
>
> To implement support for categorical data, several options come to my
> mind:
>
> 1. Add an additional flag / metadata to the schema in
> https://github.com/apache/parquet-format/blob/4bddbadf79e20a32152076fbedae0c3ce77fb531/src/main/thrift/parquet.thrift#L220
> 2. Add a new ConvertedType UTF8_CATEGORICAL
> 3. Add a new physical type for categoricals (this would equal the
> implementation in Arrow)
>
> Number 1 is the only option that would work well with old readers, 2+3
> would produce files that cannot be read correctly by older
> implementations.
>
> [1] http://pandas.pydata.org/pandas-docs/stable/categorical.html
> [2] https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html
> [3]
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L576
>
> --
> Uwe L. Korn
> [email protected]
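
For illustration, here is a rough pure-Python sketch of the two read-side
cases from the reply at the top (small dictionary vs. fallback to PLAIN).
It is a toy model, not how any Parquet implementation is written: the
names `write_column`, `read_as_categorical` and `MAX_DICT_ENTRIES` are
made up for this example, the threshold counts distinct values instead of
dictionary page bytes, and the fallback is applied to the whole column
rather than page by page.

```python
from typing import List, Optional, Tuple

# Hypothetical limit on distinct values; stands in for the dictionary
# page size threshold a real writer would apply (assumption).
MAX_DICT_ENTRIES = 4


def write_column(values: List[str]) -> Tuple[Optional[List[str]], list]:
    """Toy writer: dictionary-encode unless the dictionary grows too large."""
    positions = {}  # value -> index, in first-seen order
    indices = []
    for v in values:
        indices.append(positions.setdefault(v, len(positions)))
        if len(positions) > MAX_DICT_ENTRIES:
            # Fallback: store the plain values and no dictionary.
            return None, list(values)
    return list(positions), indices


def read_as_categorical(dictionary: Optional[List[str]],
                        data: list) -> Tuple[List[str], List[int]]:
    """Toy reader: return (categories, codes) for either layout."""
    if dictionary is not None:
        # Small-dictionary case: the stored indices are already the
        # codes, so nothing has to be hashed on read.
        return dictionary, data
    # Fallback case: hash every value to rebuild the categories.
    positions = {}
    codes = [positions.setdefault(v, len(positions)) for v in data]
    return list(positions), codes


small = write_column(["a", "b", "a", "c"])
large = write_column(["a", "b", "c", "d", "e", "a"])
print(read_as_categorical(*small))
print(read_as_categorical(*large))
```

Both calls yield categories plus integer codes; the only difference is
that the second one has to pay for a hash lookup per value when reading.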
