Hi Adam,

Le 18/12/2019 à 20:00, Adam Hooper a écrit :
> Thank you for the responses, Wes and Antoine. I've sent to private@ and
> then filed https://issues.apache.org/jira/browse/ARROW-7435
> 
> @Antoine I'm glad to hear the IPC layer is a security boundary. Could you
> please clarify what that means exactly? Do you mean (please check all that
> apply):

What I meant is that ill-formed IPC streams should never cause Arrow to
crash or behave erratically when reading, but instead return an error.
However, a well-formed IPC stream can still contain ill-formed data, as
you experienced.

> 1. Arrow IPC is designed to produce defined behavior given any input?
> 2. Arrow IPC is designed such that a nefarious Arrow file can't cause reads
> or writes outside of the file's memory boundaries?
> 3. If I notice a specially-crafted file inspires Arrow IPC to violate these
> first two design goals, that's a bug I should report ASAP?

Ideally, yes to all three, but with the limitation that those principles
only apply to the IPC operation per se (the operation of writing or
reading the data from/to memory).  The IPC layer won't detect that a
serialized list array has indices out of bounds in its offsets buffer,
for example.

We also have a basic fuzzing setup for the C++ IPC implementation,
though it currently doesn't get the attention it deserves.  We will try
to improve that in the future (see
https://issues.apache.org/jira/browse/ARROW-7433 : "Set up OSS-Fuzz").

> In brief: I'm asking whether security is a project goal or more of a
> pleasant-but-unsupported consequence of Arrow's design. *Should* I use
> Arrow IPC (with Validate()) to read files produced by untrusted
> (out-of-process) code?

Security is a concern for us and we welcome any reports.  That said,
Arrow is primarily concerned about efficiency (memory and processing
efficiency), which implies that implicit checks cannot often be
comprehensive.

In git master, Arrow C++ has two validation routines.  None of them are
called by default, you have to call them yourself:

- Validate() does cheap checks (O(1) in array length), which are mostly
metadata checks
- ValidateFull() does more comprehensive data checks (potentially O(N)
in array length)

If some ill-formed data doesn't get caught by ValidateFull(), you should
definitely open a bug report on JIRA.  We also welcome suggestions for
additional checks in Validate(), provided they are not dependent on
array length.

Regards

Antoine.

Reply via email to