[ https://issues.apache.org/jira/browse/ARROW-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15424893#comment-15424893 ]
Mohit Jaggi commented on ARROW-81:
----------------------------------

When I was working on feature engineering earlier, I struggled with this question too. My conclusion was to leave the semantics of "category" to a higher layer (typically feature engineering or machine learning). In "raw" data a category might be represented in several ways anyway (string, booleans via one-hot encoding, number, etc.), so supporting this at a lower layer would impose constraints on the "raw" data. And then whose responsibility would it be to "prepare" the data to satisfy those constraints?

Moreover, concepts like "ordered" are also fuzzy. A set of categories may be unordered for machine learning code but ordered for display in a UI. If Arrow sits below both layers, this would be confusing.

> C++: Add a Category nested type
> -------------------------------
>
>                 Key: ARROW-81
>                 URL: https://issues.apache.org/jira/browse/ARROW-81
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>
> A Category (or "factor") is a dictionary-encoded array whose dictionary has
> semantic meaning. The data consists of
> - An array of integer "codes"
> - A child array of some other type, known as the "categories" or "levels" of
>   the array.
> Typically there is an "ordered" boolean flag indicating whether
> the order of the categories is meaningful.
> Category/factor types are used in a number of common statistical analyses.
> See, for example,
> http://www.voteview.com/R_Ordered_Logistic_or_Probit_Regression.htm. It is a
> basic requirement for Python and R, at least, as Arrow C++ consumers, to have
> this type. Separately, we should consider what is necessary to be able to
> transmit category data in IPCs -- possibly an expansion of the Arrow format.
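
For readers unfamiliar with the layout the issue describes (an array of integer "codes", a child array of "categories", and an "ordered" flag), here is a minimal sketch in plain Python. It is not Arrow's API or the proposed C++ design: the `Category` class and the `encode` helper are hypothetical names used only for illustration, and nulls are not handled.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Category:
    """Hypothetical dictionary-encoded array (illustration only, not an Arrow type)."""
    codes: List[int]        # one index into `categories` per element
    categories: List[str]   # the distinct "categories" or "levels"
    ordered: bool = False   # whether the order of `categories` is meaningful

    def decode(self) -> List[str]:
        """Expand the codes back into the values they stand for."""
        return [self.categories[code] for code in self.codes]


def encode(values: Sequence[str], ordered: bool = False) -> Category:
    """Dictionary-encode `values`, assigning codes in order of first appearance."""
    index = {}   # value -> code; dicts preserve insertion order
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(index)
        codes.append(index[value])
    return Category(codes, list(index), ordered)


cat = encode(["low", "high", "low", "medium"])
print(cat.codes)
print(cat.categories)
print(cat.decode())
```

Note that `ordered` here is only a flag carried alongside the data. The sketch attaches no behaviour to it, which is the point raised in the comment: whether the categories are ordered depends on the layer that consumes them.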