Re: bug? pyarrow deserialize_components doesn't work in multiple processes

2018-07-06 Thread Josh Quigley
That works. I've tried a bunch of debugging and workarounds; as far as I
can tell this is just a problem with deserialize_components and
multiprocessing.
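For what it's worth, the empty-frame sentinel protocol used by send_payload /
recv_payload can be exercised on its own with plain bytes and no pyarrow at
all; if that round-trips cleanly, the Pipe transport is fine and the problem
is narrowed to deserialize_components itself. A minimal sketch (same sentinel
convention as the repro; send_frames / recv_frames are illustrative names, not
from the original code):

```python
import multiprocessing as mp

def send_frames(connection, frames):
    # Same convention as send_payload in the repro: raw frames,
    # then an empty frame as an end-of-stream sentinel.
    for frame in frames:
        connection.send_bytes(frame)
    connection.send_bytes(b'')

def recv_frames(connection):
    # Mirror of recv_payload's loop: collect frames until the
    # empty sentinel arrives.
    frames = []
    while True:
        r = connection.recv_bytes()
        if len(r) == 0:
            break
        frames.append(r)
    return frames

if __name__ == '__main__':
    recv_conn, send_conn = mp.Pipe(duplex=False)
    send_frames(send_conn, [b'header', b'\x00\x01payload'])
    print(recv_frames(recv_conn))  # frames arrive intact, in order
```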

On Fri., 6 Jul. 2018, 5:12 pm Robert Nishihara wrote:

> Can you reproduce it without all of the multiprocessing code? E.g., just
> call *pyarrow.serialize* in one interpreter. Then copy and paste the bytes
> into another interpreter and call *pyarrow.deserialize *or
> *pyarrow.deserialize_components*?
> On Thu, Jul 5, 2018 at 9:48 PM Josh Quigley <
> josh.quig...@lifetrading.com.au>
> wrote:
>
> > Attachment inline:
> >
> > import pyarrow as pa
> > import multiprocessing as mp
> > import numpy as np
> >
> > def make_payload():
> >     """Common function - make data to send"""
> >     return ['message', 123, np.random.uniform(-100, 100, (4, 4))]
> >
> > def send_payload(payload, connection):
> >     """Common function - serialize & send data through a socket"""
> >     s = pa.serialize(payload)
> >     c = s.to_components()
> >
> >     # Send
> >     data = c.pop('data')
> >     connection.send(c)
> >     for d in data:
> >         connection.send_bytes(d)
> >     connection.send_bytes(b'')
> >
> >
> > def recv_payload(connection):
> >     """Common function - recv data through a socket & deserialize"""
> >     c = connection.recv()
> >     c['data'] = []
> >     while True:
> >         r = connection.recv_bytes()
> >         if len(r) == 0:
> >             break
> >         c['data'].append(pa.py_buffer(r))
> >
> >     print('...deserialize')
> >     return pa.deserialize_components(c)
> >
> >
> > def run_same_process():
> >     """Same process: Send data down a socket, then read data from the
> >     matching socket"""
> >     print('run_same_process')
> >     recv_conn, send_conn = mp.Pipe(duplex=False)
> >     payload = make_payload()
> >     print(payload)
> >     send_payload(payload, send_conn)
> >     payload2 = recv_payload(recv_conn)
> >     print(payload2)
> >
> >
> > def receiver(recv_conn):
> >     """Separate process: runs in a different process, recv data &
> >     deserialize"""
> >     print('Receiver started')
> >     payload = recv_payload(recv_conn)
> >     print(payload)
> >
> >
> > def run_separate_process():
> >     """Separate process: launch the child process, then send data"""
> >     print('run_separate_process')
> >     recv_conn, send_conn = mp.Pipe(duplex=False)
> >     process = mp.Process(target=receiver, args=(recv_conn,))
> >     process.start()
> >
> >     payload = make_payload()
> >     print(payload)
> >     send_payload(payload, send_conn)
> >
> >     process.join()
> >
> >
> > if __name__ == '__main__':
> >     run_same_process()
> >     run_separate_process()
> >
> > On Fri, Jul 6, 2018 at 2:42 PM Josh Quigley <
> > josh.quig...@lifetrading.com.au>
> > wrote:
> >
> > > A reproducible program attached - it first runs serialize/deserialize from
> > > the same process, then it does the same work using a separate process for
> > > the deserialize.
> > >
> > > The behaviour seen (after the same-process code executes happily) is
> > > hanging / child-process crashing during the call to deserialize.
> > >
> > > Is this expected, and if not, is there a known workaround?
> > >
> > > Running Windows 10, conda distribution, with package versions listed
> > > below. I'll also see what happens if I run on *nix.
> > >
> > >   - arrow-cpp=0.9.0=py36_vc14_7
> > >   - boost-cpp=1.66.0=vc14_1
> > >   - bzip2=1.0.6=vc14_1
> > >   - hdf5=1.10.2=vc14_0
> > >   - lzo=2.10=vc14_0
> > >   - parquet-cpp=1.4.0=vc14_0
> > >   - snappy=1.1.7=vc14_1
> > >   - zlib=1.2.11=vc14_0
> > >   - blas=1.0=mkl
> > >   - blosc=1.14.3=he51fdeb_0
> > >   - cython=0.28.3=py36hfa6e2cd_0
> > >   - icc_rt=2017.0.4=h97af966_0
> > >   - intel-openmp=2018.0.3=0
> > >   - numexpr=2.6.5=py36hcd2f87e_0
> > >   - numpy=1.14.5=py36h9fa60d3_2
> > >   - numpy-base=1.14.5=py36h5c71026_2
> > >   - pandas=0.23.1=py36h830ac7b_0
> > >   - pyarrow=0.9.0=py36hfe5e424_2
> > >   - pytables=3.4.4=py36he6f6034_0
> > >   - python=3.6.6=hea74fb7_0
> > >   - vc=14=h0510ff6_3
> > >   - vs2015_runtime=14.0.25123=3
> > >
> > >
> >
>
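Robert's copy-and-paste experiment is easiest if the serialized bytes are
base64-encoded so they survive being pasted between two interpreter sessions.
A sketch of that round trip, with pickle standing in for the serializer so it
runs anywhere; for the pyarrow case one would substitute a dumps/loads pair
built on pa.serialize and pa.deserialize (the exact pyarrow calls are an
assumption, shown only in the comment):

```python
import base64
import pickle

def to_pasteable(payload, dumps=pickle.dumps):
    # Serialize, then base64-encode so the bytes can be pasted as
    # plain ASCII into a second interpreter. pickle is only a
    # stand-in; for the pyarrow experiment, pass a dumps callable
    # that returns the pyarrow-serialized bytes instead.
    return base64.b64encode(dumps(payload)).decode('ascii')

def from_pasteable(text, loads=pickle.loads):
    # Run in the second interpreter on the pasted string.
    return loads(base64.b64decode(text.encode('ascii')))

if __name__ == '__main__':
    blob = to_pasteable(['message', 123])
    print(from_pasteable(blob))  # → ['message', 123]
```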


