[ 
https://issues.apache.org/jira/browse/ARROW-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267912#comment-16267912
 ] 

Xianjin YE commented on ARROW-1863:
-----------------------------------

Hi [~cpcloud], thanks for your comment.

This definitely shouldn't segfault; an error should be raised instead. It was 
never my intention to convert an arbitrary type {{x}} to {{string}}.

The segfault comes down to {{UTF8Converter}} in builtin_converter.cc:


{code:java}
class UTF8Converter : public TypedConverterVisitor<StringBuilder, UTF8Converter> {
 public:
  inline Status AppendItem(const OwnedRef& item) {
    PyObject* bytes_obj;
    OwnedRef tmp;
    const char* bytes;
    Py_ssize_t length;

    PyObject* obj = item.obj();
    if (PyBytes_Check(obj)) {
      tmp.reset(
          PyUnicode_FromStringAndSize(PyBytes_AS_STRING(obj), PyBytes_GET_SIZE(obj)));
      RETURN_IF_PYERROR();
      bytes_obj = obj;
    } else if (!PyUnicode_Check(obj)) {
      PyObjectStringify stringified(obj);
      std::stringstream ss;
      ss << "Non bytes/unicode value encountered: " << stringified.bytes;
      return Status::Invalid(ss.str());
    } else {
      tmp.reset(PyUnicode_AsUTF8String(obj));
      RETURN_IF_PYERROR();
      bytes_obj = tmp.obj();
    }

    // No error checking
    length = PyBytes_GET_SIZE(bytes_obj);
    bytes = PyBytes_AS_STRING(bytes_obj);
    return typed_builder_->Append(bytes, static_cast<int32_t>(length));
  }
};
{code}
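{{PyObjectStringify}} is used to construct the error message, but for an object 
that is neither unicode nor bytes its {{bytes}} member is NULLPTR rather than a 
string representation of the Python object, and that null pointer is then 
written to the stream.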
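
To illustrate the failure mode without the Python C API, here is a minimal sketch. Writing a null {{const char*}} to a stream is undefined behavior in C++, which can show up as a crash. {{MakeErrorMessage}} and the placeholder text are hypothetical names for illustration only, not part of Arrow; the {{bytes}} parameter stands in for {{stringified.bytes}}.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper, not part of Arrow. `bytes` models the `bytes` member of
// PyObjectStringify, which is null when the object is neither unicode nor bytes.
std::string MakeErrorMessage(const char* bytes) {
  std::stringstream ss;
  ss << "Non bytes/unicode value encountered: ";
  // Streaming a null const char* is undefined behavior, so substitute a
  // placeholder instead of writing the pointer directly.
  ss << (bytes != nullptr ? bytes : "<unprintable object>");
  return ss.str();
}
```

With this guard, {{MakeErrorMessage(nullptr)}} produces a message ending in the placeholder instead of touching the null pointer. It only avoids the crash; getting a useful message for arbitrary objects still needs a real string representation.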

{quote}
but {{PyObjectStringify}} is meant to convert a Python str, bytes, or unicode 
type to const char*, it's not meant to take an arbitrary Python object and 
convert it to a string.
{quote}
I don't think this is true. Judging from the name and how it is used here, 
{{PyObjectStringify}} should produce the string representation of an arbitrary 
Python object.

> Should use PyObject_Str or PyObject_Repr in PyObjectStringify
> -------------------------------------------------------------
>
>                 Key: ARROW-1863
>                 URL: https://issues.apache.org/jira/browse/ARROW-1863
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Xianjin YE
>            Assignee: Phillip Cloud
>             Fix For: 0.8.0
>
>
> PyObjectStringify doesn't handle non-string types (neither bytes nor unicode) 
> correctly. It should use PyObject_Repr (or PyObject_Str) to get the string 
> representation of the PyObject.
> {code:java}
> struct ARROW_EXPORT PyObjectStringify {
>   OwnedRef tmp_obj;
>   const char* bytes;
>   Py_ssize_t size;
>   explicit PyObjectStringify(PyObject* obj) {
>     PyObject* bytes_obj;
>     if (PyUnicode_Check(obj)) {
>       bytes_obj = PyUnicode_AsUTF8String(obj);
>       tmp_obj.reset(bytes_obj);
>       bytes = PyBytes_AsString(bytes_obj);
>       size = PyBytes_GET_SIZE(bytes_obj);
>     } else if (PyBytes_Check(obj)) {
>       bytes = PyBytes_AsString(obj);
>       size = PyBytes_GET_SIZE(obj);
>     } else {
>       bytes = NULLPTR;
>       size = -1;
>     }
>   }
> };
> {code}
> should be changed to:
> {code:java}
> struct ARROW_EXPORT PyObjectStringify {
>   OwnedRef tmp_obj;
>   const char* bytes;
>   Py_ssize_t size;
>   explicit PyObjectStringify(PyObject* obj) {
>     PyObject* bytes_obj;
>     if (PyUnicode_Check(obj)) {
>       bytes_obj = PyUnicode_AsUTF8String(obj);
>       tmp_obj.reset(bytes_obj);
>       bytes = PyBytes_AsString(bytes_obj);
>       size = PyBytes_GET_SIZE(bytes_obj);
>     } else if (PyBytes_Check(obj)) {
>       bytes = PyBytes_AsString(obj);
>       size = PyBytes_GET_SIZE(obj);
>     } else {
>       bytes_obj = PyObject_Repr(obj);
>       tmp_obj.reset(bytes_obj);
>       bytes = PyBytes_AsString(bytes_obj);
>       size = PyBytes_GET_SIZE(bytes_obj);
>     }
>   }
> };
> {code}
> How does this affect pyarrow? Minimal reproduction case:
> {code:java}
> import pyarrow
> data = ['-10', '-5', {'a': 1}, '0', '5', '10']
> arr = pyarrow.array(data, type=pyarrow.string())
> [1]    64491 segmentation fault  ipython
> {code}
> This case was found by my colleague. I will ask him to send a PR here.
> cc [~wesmckinn]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
