[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

ASF GitHub Bot (JIRA) Sun, 15 Apr 2018 07:09:41 -0700

    [ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438715#comment-16438715
 ]


ASF GitHub Bot commented on ARROW-2101:
---------------------------------------

joshuastorck commented on issue #1886: ARROW-2101: [Python/C++] Correctly 
convert numpy arrays of bytes to arrow arrays of strings when user specifies 
arrow type of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381263228
 
 
   I built for Python 2 and confirmed the behavior is the same. 
   
   @pitrou, regarding the inefficiency of the UTF-8 encoding, it could be moved 
below, into the check of global_have_bytes. Would you prefer this?
   
   ```cpp
     if (global_have_bytes) {
       if (force_string) {
         // String type requested: run FromString over every unmasked,
         // non-null object.
         PyObject* obj;

         Ndarray1DIndexer<PyObject*> objects(arr_);
         Ndarray1DIndexer<uint8_t> mask_values;

         bool have_mask = false;
         if (mask_ != nullptr) {
           mask_values.Init(mask_);
           have_mask = true;
         }

         PyBytesView view;
         for (int64_t offset = 0; offset < objects.size(); ++offset) {
           OwnedRef tmp_obj;
           obj = objects[offset];
           if ((have_mask && mask_values[offset]) ||
               internal::PandasObjectIsNull(obj)) {
             continue;
           }
           RETURN_NOT_OK(view.FromString(obj, true));
         }
       } else {
         // No string type requested: relabel the output arrays as binary.
         for (size_t i = 0; i < out_arrays_.size(); ++i) {
           auto binary_data = out_arrays_[i]->data()->Copy();
           binary_data->type = ::arrow::binary();
           out_arrays_[i] = std::make_shared<BinaryArray>(binary_data);
         }
       }
     }
   ```
   
   I'm not fond of how much code I had to copy from AppendObjectStrings to 
write that loop. I think it would be helpful to have iterators that look like 
this:
   
   ```cpp
   Ndarray1DIndexer<PyObject*> array(arr_);
   auto mask = Ndarray1DIndexer<uint8_t>::from_mask(mask_);
   NdArray1DMaskedIterator iterator(array.begin() + offset, array.end(), mask,
                                    true /* include masked value */);
   for (OwnedRef& obj : iterator) {
     // Maybe we use None to indicate masked values?
   }
   ```
   Or, even better, could we use pybind11 and make these light wrappers over 
its types?
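
To make the iterator idea above concrete, here is a minimal, self-contained sketch. Everything in it is an assumption for illustration: `MaskedRange` and `UnmaskedValues` are hypothetical names, not existing Arrow APIs, and the sketch walks a plain `std::vector` rather than a NumPy array of `PyObject*`. It only shows the skip-masked-entries pattern. The null check (`internal::PandasObjectIsNull`) and the "include masked value" option from the proposal are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the proposed masked iterator. It views a
// std::vector and an optional byte mask; it does not own either of them.
template <typename T>
class MaskedRange {
 public:
  class Iterator {
   public:
    Iterator(const MaskedRange* range, std::size_t pos)
        : range_(range), pos_(pos) {
      SkipMasked();
    }
    const T& operator*() const { return (*range_->values_)[pos_]; }
    Iterator& operator++() {
      ++pos_;
      SkipMasked();
      return *this;
    }
    bool operator!=(const Iterator& other) const { return pos_ != other.pos_; }

   private:
    // Advance past masked entries so dereferencing always yields a live value.
    void SkipMasked() {
      while (pos_ < range_->values_->size() && range_->IsMasked(pos_)) ++pos_;
    }
    const MaskedRange* range_;
    std::size_t pos_;
  };

  // An empty mask means "nothing is masked", mirroring mask_ == nullptr.
  MaskedRange(const std::vector<T>& values,
              const std::vector<std::uint8_t>& mask)
      : values_(&values), mask_(&mask) {
    assert(mask.empty() || mask.size() == values.size());
  }

  Iterator begin() const { return Iterator(this, 0); }
  Iterator end() const { return Iterator(this, values_->size()); }

 private:
  bool IsMasked(std::size_t i) const {
    return !mask_->empty() && (*mask_)[i] != 0;
  }
  const std::vector<T>* values_;
  const std::vector<std::uint8_t>* mask_;
};

// Collects the unmasked values, as a caller of the iterator would see them.
std::vector<std::string> UnmaskedValues(const std::vector<std::string>& values,
                                        const std::vector<std::uint8_t>& mask) {
  std::vector<std::string> out;
  for (const std::string& value : MaskedRange<std::string>(values, mask)) {
    out.push_back(value);
  }
  return out;
}
```

With something of this shape, the validation loop would reduce to a range-based `for` whose body is just the `FromString` call, and the mask bookkeeping would live in one place instead of being copied from AppendObjectStrings.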

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
> ------------------------------------------------------------------------
>
>                 Key: ARROW-2101
>                 URL: https://issues.apache.org/jira/browse/ARROW-2101
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>            Reporter: Bryan Cutler
>            Assignee: Bryan Cutler
>            Priority: Major
>              Labels: pull-request-available
>
> Using Python 2, converting Pandas data of 'str' type to Arrow results in Arrow 
> data of binary type, even if the user supplies type information.  Conversion 
> of 'unicode' type correctly creates Arrow data of string type.  For example:
> {code}
> In [25]: pa.Array.from_pandas(pd.Series(['a'])).type
> Out[25]: DataType(binary)
> In [26]: pa.Array.from_pandas(pd.Series(['a']), type=pa.string()).type
> Out[26]: DataType(binary)
> In [27]: pa.Array.from_pandas(pd.Series([u'a'])).type
> Out[27]: DataType(string)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
