[jira] [Commented] (ARROW-1863) Should use PyObject_Str or PyObject_Repr in PyObjectStringify

Phillip Cloud (JIRA) Mon, 27 Nov 2017 10:50:27 -0800

    [ https://issues.apache.org/jira/browse/ARROW-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267238#comment-16267238 ]


Phillip Cloud commented on ARROW-1863:
--------------------------------------

[~advancedxy] Thanks for the report!

This definitely shouldn't segfault, but {{PyObjectStringify}} is meant to 
convert a Python {{str}}, {{bytes}}, or {{unicode}} object to {{const char*}}. 
It's not meant to take an arbitrary Python object and convert it to a string.

I think this should raise an error, since you're telling arrow to construct an 
array of type string and you're passing a non-string object to it.
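
A minimal sketch of that behavior, written in plain Python rather than in the C++ conversion path. The helper name {{require_strings}} is hypothetical and is not part of pyarrow; letting {{None}} through as a null is an assumption of the sketch.

```python
def require_strings(values):
    """Hypothetical helper: reject anything that is not str, bytes, or None."""
    for index, value in enumerate(values):
        # None is allowed through on the assumption that it becomes a null.
        if value is not None and not isinstance(value, (str, bytes)):
            raise TypeError(
                "expected str or bytes at index {}, got {}".format(
                    index, type(value).__name__
                )
            )
    return values


# The data from the reproduction case: the dict at index 2 is rejected
# with an ordinary Python exception instead of crashing the process.
data = ['-10', '-5', {'a': 1}, '0', '5', '10']
try:
    require_strings(data)
    message = None
except TypeError as exc:
    message = str(exc)
```

The point of the sketch is only that a non-string element produces a catchable error that names the offending index and type.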

It seems arbitrary to enable this behavior for type {{X}} to {{string}}, but 
not for, say, {{string}} to {{int64}}. Why should implicit conversion from type 
{{X}} to {{string}} be special?

For example, should this try to convert the string to an integer?

{code}
data = [1, 2, '3']
pyarrow.array(data, type=pyarrow.int64())
{code}

I don't think so.
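
A caller who does want that conversion can state it explicitly before building the array. A small sketch in plain Python (the variable name {{as_ints}} is only illustrative):

```python
data = [1, 2, '3']

# The conversion is spelled out by the caller, so a value that cannot be
# parsed as an integer raises ValueError here, before any array is built.
as_ints = [int(value) for value in data]
```

The resulting list contains only integers and can then be passed to {{pyarrow.array}} with {{type=pyarrow.int64()}} without relying on any implicit cast.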

Implicit casting from one type to another is a slippery slope. It makes the 
output of a function hard to predict, especially since any object can 
override its own string representation.
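
To illustrate the last point, here is a short sketch with a hypothetical user-defined class ({{Point}} is made up for this example). What an implicit conversion would store depends on which hook is called and on whatever the class author chose to return from it.

```python
class Point(object):
    """Hypothetical user-defined class with custom string forms."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        return "Point({}, {})".format(self.x, self.y)

    def __str__(self):
        return "({}, {})".format(self.x, self.y)


p = Point(1, 2)
via_repr = repr(p)  # what a PyObject_Repr-based conversion would see
via_str = str(p)    # what a PyObject_Str-based conversion would see
```

The two results differ for the same object, so the contents of the resulting string array would be determined by user code rather than by the data.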

> Should use PyObject_Str or PyObject_Repr in PyObjectStringify
> -------------------------------------------------------------
>
>                 Key: ARROW-1863
>                 URL: https://issues.apache.org/jira/browse/ARROW-1863
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Xianjin YE
>             Fix For: 0.8.0
>
>
> PyObjectStringify doesn't correctly handle objects that are not strings 
> (bytes or UTF-8). It should use PyObject_Repr (or PyObject_Str) to get the 
> string representation of the PyObject.
> {code:java}
> struct ARROW_EXPORT PyObjectStringify {
>   OwnedRef tmp_obj;
>   const char* bytes;
>   Py_ssize_t size;
>   explicit PyObjectStringify(PyObject* obj) {
>     PyObject* bytes_obj;
>     if (PyUnicode_Check(obj)) {
>       bytes_obj = PyUnicode_AsUTF8String(obj);
>       tmp_obj.reset(bytes_obj);
>       bytes = PyBytes_AsString(bytes_obj);
>       size = PyBytes_GET_SIZE(bytes_obj);
>     } else if (PyBytes_Check(obj)) {
>       bytes = PyBytes_AsString(obj);
>       size = PyBytes_GET_SIZE(obj);
>     } else {
>       bytes = NULLPTR;
>       size = -1;
>     }
>   }
> };
> {code}
> should change to 
> {code:java}
> struct ARROW_EXPORT PyObjectStringify {
>   OwnedRef tmp_obj;
>   const char* bytes;
>   Py_ssize_t size;
>   explicit PyObjectStringify(PyObject* obj) {
>     PyObject* bytes_obj;
>     if (PyUnicode_Check(obj)) {
>       bytes_obj = PyUnicode_AsUTF8String(obj);
>       tmp_obj.reset(bytes_obj);
>       bytes = PyBytes_AsString(bytes_obj);
>       size = PyBytes_GET_SIZE(bytes_obj);
>     } else if (PyBytes_Check(obj)) {
>       bytes = PyBytes_AsString(obj);
>       size = PyBytes_GET_SIZE(obj);
>     } else {
>       bytes_obj = PyObject_Repr(obj);
>       tmp_obj.reset(bytes_obj);
>       bytes = PyBytes_AsString(bytes_obj);
>       size = PyBytes_GET_SIZE(bytes_obj);
>     }
>   }
> };
> {code}
> How does this affect pyarrow? Minimal reproduction case:
> {code:java}
> import pyarrow
> data = ['-10', '-5', {'a': 1}, '0', '5', '10']
> arr = pyarrow.array(data, type=pyarrow.string())
> [1]    64491 segmentation fault  ipython
> {code}
> This case was found by my colleague. I will ask him to send a PR here.  
> cc [~wesmckinn]



