[ 
https://issues.apache.org/jira/browse/ARROW-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15424330#comment-15424330
 ] 

Julian Hyde commented on ARROW-81:
----------------------------------

Since Arrow is a general-purpose data format, this requirement seems to me to 
be too closely targeted at a particular problem domain.

To illustrate, consider another domain, OLAP, where a dimension has a key, a 
name, a caption (localized name), a localized description, an order key and 
perhaps user-defined properties. I'm not claiming that OLAP dimensions are the 
"right" model either.

I suspect that the "right" model is to let the dictionary carry additional 
attributes (beyond the single "value" attribute it has at present). By 
convention, one or more attribute names would then define the 
category/factor when Python or R reads the dictionary.
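
To make the idea concrete, here is a rough sketch in plain Python. It is not 
Arrow API: the AttributedDictionary class, its decode method, and the 
attribute names "value", "caption" and "order_key" are all hypothetical, 
chosen only to mirror the OLAP example above. It assumes every attribute is 
stored as a parallel list with one entry per dictionary slot.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class AttributedDictionary:
    """Hypothetical dictionary whose entries carry several named attributes."""

    # attribute name -> one value per dictionary entry (parallel lists)
    attributes: Dict[str, List[object]]

    def __post_init__(self) -> None:
        # Every attribute must describe the same set of dictionary entries.
        lengths = {len(values) for values in self.attributes.values()}
        if len(lengths) > 1:
            raise ValueError("all attributes must have the same length")

    def decode(self, codes: Sequence[int], attribute: str = "value") -> List[object]:
        # Look up each integer code in the chosen attribute column.
        column = self.attributes[attribute]
        return [column[code] for code in codes]


# An OLAP-flavoured dimension: the same codes can be read through
# different attributes of the dictionary.
dictionary = AttributedDictionary({
    "value": ["low", "medium", "high"],
    "caption": ["Low", "Medium", "High"],
    "order_key": [0, 1, 2],
})
codes = [2, 0, 1, 2]

# Assumed convention: "value" is the attribute that defines the
# category/factor levels for a Python or R consumer.
levels = dictionary.decode(codes)
captions = dictionary.decode(codes, "caption")
```

Under this sketch a Category is just the special case where a reader agrees 
to treat one attribute as the levels, while other consumers remain free to 
use the remaining attributes.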

> C++: Add a Category nested type
> -------------------------------
>
>                 Key: ARROW-81
>                 URL: https://issues.apache.org/jira/browse/ARROW-81
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>
> A Category (or "factor") is a dictionary-encoded array whose dictionary has 
> semantic meaning. The data consists of
> - An array of integer "codes"
> - A child array of some other type, known as the "categories" or "levels" of 
> the array. Typically there is an "ordered" boolean flag indicating whether 
> the order of the categories is meaningful.
> Category/factor types are used in a number of common statistical analyses. 
> See, for example, 
> http://www.voteview.com/R_Ordered_Logistic_or_Probit_Regression.htm. It is a 
> basic requirement for Python and R, at least, as Arrow C++ consumers, to have 
> this type. Separately, we should consider what is necessary to be able to 
> transmit category data in IPCs -- possibly an expansion of the Arrow format. 



