Re: Marking categorical data in Parquet schemas

Wes McKinney Thu, 06 Apr 2017 06:21:22 -0700

hi Uwe,

Thanks for bringing this up.

I have a somewhat different opinion, which is that I don't think
categorical metadata belongs _formally_ in the Parquet format. The
reason is that database systems generally address storage of
categorical data using fact and dimension tables -- if you store data
in Parquet, and your set of categories needs to expand, it's generally
not feasible to modify old data files to account for the expanded
category set.

Parquet uses dictionary encoding as a compression technique (combined
with RLE encoding, see e.g. the low entropy examples in
http://wesmckinney.com/blog/python-parquet-multithreading/). Parquet
as long-term storage is distinct from Arrow's use case as an in-memory
data structure and transient IPC format, where handling in-memory
dictionary-encoded / categorical data makes more sense.
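
As a rough illustration of why dictionary encoding plus RLE works well on
low-entropy columns, here is a minimal pure-Python sketch. It is a
simplification: Parquet's actual index encoding is an RLE / bit-packing
hybrid, and the function names below are made up for this example.

```python
def dictionary_encode(values):
    """Return (dictionary, indices), with the dictionary in first-seen order."""
    dictionary = []
    positions = {}  # value -> position in the dictionary
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def run_length_encode(indices):
    """Collapse consecutive repeats into [index, run_length] pairs."""
    runs = []
    for index in indices:
        if runs and runs[-1][0] == index:
            runs[-1][1] += 1
        else:
            runs.append([index, 1])
    return runs


# A low-entropy column: few distinct values, long runs.
column = ["a", "a", "a", "b", "b", "a", "c", "c", "c", "c"]
dictionary, indices = dictionary_encode(column)
runs = run_length_encode(indices)
```

Ten string values collapse to a three-entry dictionary and four runs, which
is the effect the low entropy examples rely on.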

I do think it's reasonable to want to faithfully round trip
categorical data from R and Python to the Parquet format -- I would
instead like us to specify a KeyValue metadata convention that we can
all use to maximize interoperability between implementations.
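
To make the idea concrete, one possible shape for such a convention is a
JSON string stored under an agreed-upon key in the file's KeyValue metadata.
The sketch below is only an assumption for discussion: the key name
"categorical_columns" and the payload layout (categories plus an "ordered"
flag) are hypothetical and not defined anywhere in the Parquet format, and
the metadata is modeled as a plain dict of strings rather than through any
real Parquet library.

```python
import json

# Hypothetical key name; nothing in the Parquet format defines it.
CATEGORICAL_KEY = "categorical_columns"


def make_categorical_metadata(columns):
    """Build a KeyValue entry from {column_name: (categories, ordered)}."""
    payload = {
        name: {"categories": list(categories), "ordered": bool(ordered)}
        for name, (categories, ordered) in columns.items()
    }
    # KeyValue metadata holds strings, so the payload is serialized to JSON.
    return {CATEGORICAL_KEY: json.dumps(payload, sort_keys=True)}


def read_categorical_metadata(key_value_metadata):
    """Return the per-column description, or {} if the key is absent."""
    raw = key_value_metadata.get(CATEGORICAL_KEY)
    return json.loads(raw) if raw is not None else {}


metadata = make_categorical_metadata({"size": (["S", "M", "L"], True)})
restored = read_categorical_metadata(metadata)
```

A reader that does not know the key simply ignores it and sees an ordinary
UTF8 column, so old readers keep working.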

Because dictionaries vary in size, even when we store them in Parquet
format we'll have a couple of cases to handle:

* Small dictionaries -- all of the Parquet data pages contain dictionary indices
* Large dictionaries -- the encoder fell back to PLAIN encoding
because the dictionary page exceeded a size threshold

In the first case, we can avoid extra hashing/encoding when reading
the file by using the dictionary page directly. In the latter case, if
we want to construct the original in-memory data faithfully, we'll
have to hash.
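
The two read paths can be sketched roughly as follows. This is pure Python
with made-up function names, not any reader's actual API; it only shows
where the hashing cost comes from in the fallback case.

```python
def from_dictionary_pages(dictionary, indices):
    """Small-dictionary case: reuse the stored dictionary and indices as-is."""
    return list(dictionary), list(indices)


def from_plain_values(values):
    """Fallback case: rebuild a dictionary by hashing every decoded value."""
    positions = {}  # value -> position in the rebuilt dictionary
    dictionary = []
    indices = []
    for value in values:
        index = positions.get(value)  # one hash lookup per value
        if index is None:
            index = len(dictionary)
            positions[value] = index
            dictionary.append(value)
        indices.append(index)
    return dictionary, indices


# Dictionary-encoded pages: nothing to recompute.
fast = from_dictionary_pages(["x", "y"], [1, 0, 0])

# PLAIN-encoded pages: every value is hashed to recover the indices.
slow = from_plain_values(["x", "y", "x", "z"])
```

Note that the rebuilt dictionary comes out in first-seen order, which need
not match the order of the original in-memory categories.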

- Wes

On Thu, Apr 6, 2017 at 5:35 AM, Uwe L. Korn <[email protected]> wrote:
> Hello,
>
> we often have the case that we want to treat some columns as categorical
> data [1] (also called factors [2] in R) in memory. This is a column that
> can only take a limited number of values. These types can also have an
> ordering. In Apache Arrow, we have defined the DictionaryType [3] for this.
> It takes an Index (also called categories in some contexts) and the
> actual data as a separate integral array. The most common use case is
> that the indices/categories are strings, so engines that don't
> explicitly support categorical data should treat these columns as
> UTF8 data.
>
> While this is similar to dictionary encoding, and dictionary encoding
> is probably the most efficient way to store categorical data, the two
> are not semantically the same (e.g. dictionaries are per RowGroup
> whereas categories are defined on a per-column basis).
>
> To implement support for categorical data, several options come to my
> mind:
>
> 1. Add an additional flag / metadata to the schema in
> https://github.com/apache/parquet-format/blob/4bddbadf79e20a32152076fbedae0c3ce77fb531/src/main/thrift/parquet.thrift#L220
> 2. Add a new ConvertedType UTF8_CATEGORICAL
> 3. Add a new physical type for categoricals (this would equal the
> implementation in Arrow)
>
> Number 1 is the only option that would work well with old readers; 2+3
> would produce files that cannot be read correctly by older
> implementations.
>
> [1] http://pandas.pydata.org/pandas-docs/stable/categorical.html
> [2] https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html
> [3]
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L576
>
> --
>   Uwe L. Korn
>   [email protected]
