madhavajay edited a comment on issue #11239: URL: https://github.com/apache/arrow/issues/11239#issuecomment-1049537852
Does anyone have a suggestion for a valid replacement for `pa.serialize` going forward that keeps many of its advantages without having to use `pickle`?

As far as I know, pickle is not secure: https://docs.python.org/3/library/pickle.html

`Warning The pickle module is not secure. Only unpickle data you trust.`

This is not just theoretical; there are tools that allow exploits to be created: https://securityboulevard.com/2021/03/never-a-dill-moment-exploiting-machine-learning-pickle-files/

Firstly, Arrow seems to perform at around the same speed as pickle protocol 5 for my use case, but I am guessing it does not carry the insecurity of pickle, which allows arbitrary Python operations to be executed. Additionally, in my example the pickle output was orders of magnitude bigger than the Arrow output, which will make a huge difference on network transfer.

I have dug through the Arrow code and there are some occasional references to `pickle`, but if I understand correctly they are here: https://github.com/apache/arrow/blob/master/python/pyarrow/serialization.py#L462

```python
serialization_context.register_type(
    type(lambda: 0), "function",
    pickle=True)

serialization_context.register_type(type, "type", pickle=True)
```

This implies that pickle gets called for Python classes and functions. However, it seems that nearly any other object with a backing `.__dict__` simply goes through `_serialize_default_dict`, which decomposes the dict into keys and values. It appears that the C++ code contains a recursive serializer which supports most of the Python primitives.
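To make the keys-and-values decomposition concrete, here is a minimal pure-Python sketch of the idea as I understand it. It is not pyarrow's implementation: the names `decompose`, `Point` and `Segment` are hypothetical, and the output layout (`"type"`, `"state"`, `"keys"`, `"values"`) is an assumption chosen for illustration only. Functions and classes are rejected here because, per the excerpt above, those are the cases that fall back to pickle.

```python
import types
from typing import Any

# Types treated as leaves. This set is an assumption for the sketch,
# not the list of types pyarrow actually supports.
PRIMITIVES = (type(None), bool, int, float, str, bytes)


def decompose(obj: Any) -> Any:
    """Recursively reduce obj to primitives, lists, and key/value lists."""
    if isinstance(obj, PRIMITIVES):
        return obj
    if isinstance(obj, (list, tuple)):
        return [decompose(item) for item in obj]
    if isinstance(obj, dict):
        # Split the dict into parallel lists of keys and values.
        return {
            "keys": [decompose(key) for key in obj],
            "values": [decompose(value) for value in obj.values()],
        }
    if isinstance(obj, (type, types.FunctionType, types.BuiltinFunctionType)):
        # Classes and functions carry code, not just data, so refuse them.
        raise TypeError(f"refusing to decompose code object: {obj!r}")
    if hasattr(obj, "__dict__"):
        # Plain instance: record the class name and decompose its state.
        return {"type": type(obj).__name__, "state": decompose(vars(obj))}
    raise TypeError(f"unsupported type: {type(obj).__name__}")


class Point:
    def __init__(self, x: int, y: int) -> None:
        self.x = x
        self.y = y


class Segment:
    def __init__(self, start: Point, end: Point, label: str) -> None:
        self.start = start
        self.end = end
        self.label = label


result = decompose(Segment(Point(0, 1), Point(2, 3), "edge"))
print(result)
```

Nothing in this walk executes code taken from the data: it only reads attributes and copies values, which is the property I am hoping to keep in whatever replaces `pa.serialize`.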
https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/serialize.cc

```C++
Status SerializeObject(PyObject* context, PyObject* sequence, SerializedPyObject* out) {
  PyAcquireGIL lock;
  SequenceBuilder builder;
  RETURN_NOT_OK(internal::VisitIterable(
      sequence, [&](PyObject* obj, bool* keep_going /* unused */) {
        return Append(context, obj, &builder, 0, out);
      }));
  std::shared_ptr<Array> array;
  RETURN_NOT_OK(builder.Finish(&array));
  out->batch = MakeBatch(array);
  return Status::OK();
}
```

I haven't read the pickle source code, but the fact that it produces a stream of opcodes that can call arbitrary Python callables when loaded is very different, and surely can never be considered safe without significant work.

While sending functions is cool, and there are situations where pickle is easy and probably fine, it feels like removing this highly efficient recursive serde of primitive Python types from the Arrow library would remove a significant benefit for which there is no equally safe alternative.

Does anyone have a suggestion with the same convenience and performance?

Also, is there any indication of when this will be removed? Arrow 7.0.0 has been released and `pa.serialize` is still there.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
