hi Yevgeni,

The approach you describe is not unreasonable if the objective is to
embed ndarrays in a Parquet file.

On Tue, Oct 23, 2018 at 1:45 AM Yevgeni Litvin <[email protected]> wrote:
>
> In Petastorm we operate with tables of tensors. We are trying to map this
> data structure into Arrow's primitives. One way is to use pa.array of
> BinaryValue type while using FixedSizeBufferWriter to serialize a
> pa.Tensor type into it and deserialize it on read. This feels somewhat
> awkward and I guess does not achieve the zero-copy behavior(?)
>
> This is what we do to deserialize the tensor from a single binary value:
>
>     buffer = value.as_py()
>     reader = pa.BufferReader(memoryview(buffer))
>     tensor = pa.read_tensor(reader)
>     n = tensor.to_numpy()
The call `value.as_py()` causes a copy -- this should be avoidable. I opened
https://issues.apache.org/jira/browse/ARROW-3592 about adding an option to
get a zero-copy buffer from such a value.

> And this is how a numpy array is serialized into a BinaryValue written
> to a parquet store:
>
>     tensor = pa.Tensor.from_numpy(array)
>     buffer = pa.allocate_buffer(pa.get_tensor_size(tensor))
>     stream = pa.FixedSizeBufferWriter(buffer)
>     pa.write_tensor(tensor, stream)
>     bytes = bytearray(buffer.to_pybytes())

The `to_pybytes()` call should not be necessary here:

In [2]: buf = pa.allocate_buffer(10)

In [3]: buf
Out[3]: <pyarrow.lib.Buffer at 0x7f4209af64c8>

In [4]: bytearray(buf)
Out[4]: bytearray(b'x\x88G6B\x7f\x00\x00x\x88')

In [5]: bytearray(buf.to_pybytes())
Out[5]: bytearray(b'x\x88G6B\x7f\x00\x00x\x88')

It would be useful to provide some conveniences for building a BinaryArray
containing serialized arrow::Tensor values (or other serializable objects).

> Is there a better, more Arrow-native approach to model our data?
>
> Thanks!
>
> - Yevgeni
