Leo Meyerovich created ARROW-4131:
-------------------------------------

             Summary: [Python] Coerce mixed columns to String
                 Key: ARROW-4131
                 URL: https://issues.apache.org/jira/browse/ARROW-4131
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Leo Meyerovich


Continuing [https://github.com/apache/arrow/issues/3280] 

 

===

 

I'm seeing variants of this elsewhere (e.g., 
[wesm/feather#349|https://github.com/wesm/feather/issues/349]).

Not every Pandas DataFrame converts to an Arrow table, and when conversion 
fails, the error is not reported in a way that is conducive to automation:

Sample:

{{mixed_df = pd.DataFrame({'mixed': [1, 'b']})}}
{{pa.Table.from_pandas(mixed_df)}}
{{=> ArrowInvalid: ('Could not convert b with type str: tried to convert to double', 'Conversion failed for column mixed with type object')}}

I would have expected behavior more like the following:
 * Coerce to string ({{toString}}) by default, with a default-off option to 
disallow string coercion

 * Provide a default-off option on {{from_pandas}} to auto-coerce

 * Name the exception so it is clear that this is a column coercion failure, 
and include the column name(s), so the failure is predictable and easy to 
handle for both library writers and users

I lean towards:
 * Auto-coerce by default, which makes life easier for early users: 
`coerce_mixed_columns_to_strings=True`
 * For the less frequent but more advanced library implementors, allow 
overriding it to `False`
 * In that case, raise a predictable, machine-readable exception: 
`MixedColumnException(mixed_columns=['a', 'b', ...], msg="....")`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
