[ 
https://issues.apache.org/jira/browse/ARROW-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429156#comment-15429156
 ] 

Wes McKinney edited comment on ARROW-81 at 8/20/16 2:02 AM:
------------------------------------------------------------

[~emkornfi...@gmail.com] what you've proposed has the nice property that a 
system without any Category-specific code could treat it as simple 
dictionary-encoded data. This seems OK to me, if adding this field does not 
offend others' sensibilities. We could make it more general as dictionary 
metadata (to avoid having to add more attributes to the Field table should we 
want to add more interpretations of, or metadata about, the dictionary).

[~julienledem] curious what you think on these proposals?  



> [Format] Add a Category logical type (distinct from dictionary-encoding)
> ------------------------------------------------------------------------
>
>                 Key: ARROW-81
>                 URL: https://issues.apache.org/jira/browse/ARROW-81
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>
> A Category (or "factor") is a dictionary-encoded array whose dictionary has 
> semantic meaning. The data consists of:
> - An array of integer "codes"
> - A child array of some other type, known as the "categories" or "levels" of 
> the array
> Typically there is an "ordered" boolean flag indicating whether the order of 
> the categories is meaningful.
> Category/factor types are used in a number of common statistical analyses. 
> See, for example, 
> http://www.voteview.com/R_Ordered_Logistic_or_Probit_Regression.htm. It is a 
> basic requirement for Python and R, at least, as Arrow C++ consumers, to have 
> this type. Separately, we should consider what is necessary to be able to 
> transmit category data in IPCs -- possibly an expansion of the Arrow format. 
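
For illustration only, here is a minimal Python sketch of the structure described above: integer codes, a child array of categories, and an "ordered" flag. The class CategoryArray and its methods decode and less_than are hypothetical names made up for this sketch. They are not part of the Arrow format or of any Arrow library, and nulls are modelled as None purely for convenience.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class CategoryArray:
    """Hypothetical in-memory model of a Category; not an Arrow API."""
    codes: List[Optional[int]]   # integer "codes"; None marks a null slot (assumption)
    categories: Sequence[str]    # the "categories" / "levels" child array
    ordered: bool = False        # True if the order of the categories is meaningful

    def decode(self) -> List[Optional[str]]:
        # Plain dictionary decoding: look each code up in the categories.
        # This path needs no Category-specific knowledge and ignores `ordered`.
        return [None if c is None else self.categories[c] for c in self.codes]

    def less_than(self, i: int, j: int) -> bool:
        # Comparing two slots by code only makes sense for ordered categories.
        if not self.ordered:
            raise TypeError("categories are unordered")
        a, b = self.codes[i], self.codes[j]
        if a is None or b is None:
            raise ValueError("cannot compare null slots")
        return a < b


ratings = CategoryArray(
    codes=[2, 0, None, 1, 2],
    categories=["low", "medium", "high"],
    ordered=True,
)
print(ratings.decode())
print(ratings.less_than(1, 0))
```

The decode method is the part a system without any Category-specific code could still perform, treating the array as simple dictionary-encoded data. Only less_than depends on the extra "ordered" metadata.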



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
