I'd assume you're compiling the code with Cython as well? (If you're
using the default containers, that should be fine.)
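
A quick way to check from inside the worker environment (a rough sketch; the
module name below reflects my understanding of the SDK internals, so verify it
against your Beam version):

    # If the Cython-built extension imports, the fast coder/stream paths are in
    # use; otherwise the pure-Python fallbacks (much slower) are what you get.
    try:
        from apache_beam.coders import stream  # built only when Cython is available
        print("Cython-compiled SDK internals:", stream.__file__)
    except ImportError:
        print("Pure-Python SDK internals (no Cython build detected)")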
On Fri, Nov 9, 2018 at 12:09 AM Robert Bradshaw <rober...@google.com> wrote:
>
> Very cool to hear of this progress on Samza!
>
> Python protocol buffers are extraordinarily slow (lots of reflection,
> attribute lookups, and bit fiddling for serialization/deserialization
> that is certainly not Python's strong point). Each bundle processed
> involves multiple protos being constructed and sent/received (notably
> the particularly nested and branchy monitoring info one). While there
> are still some improvements that could be made for making bundles
> lighter-weight, amortizing this cost over many elements is essential
> for performance. (Note that elements within a bundle are packed into a
> single byte buffer, so they avoid this overhead.)
>
> Also, it may be good to ensure you're at least using the C++
> bindings:
> https://developers.google.com/protocol-buffers/docs/reference/python-generated
> (still slow, but not as slow).
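>
> A minimal way to confirm which protobuf backend is active (a sketch; the
> api_implementation module is internal to the protobuf package, so treat it
> as a debugging aid rather than a stable API):
>
>   import os
>   # Ask for the C++ implementation before protobuf is first imported; if the
>   # native extension isn't installed this silently falls back to pure Python.
>   os.environ.setdefault("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION", "cpp")
>
>   from google.protobuf.internal import api_implementation
>   print(api_implementation.Type())  # 'cpp' is the fast path, 'python' the slow one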
>
> And, of course, due to the GIL one may want many Python workers for
> multi-core machines.
>
> On Thu, Nov 8, 2018 at 9:18 PM Thomas Weise <t...@apache.org> wrote:
> >
> > We have been doing some end to end testing with Python and Flink
> > (streaming). You could take a look at the following and possibly replicate
> > it for your work:
> >
> > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/flink/flink_streaming_impulse.py
> >
> > We found that in order to get acceptable performance, we need larger
> > bundles (we started with single-element bundles). The default in the Flink
> > runner now is to cap bundles at 1000 elements or 1 second, whichever comes
> > first (see the sketch below for how to set these from Python). With that,
> > I have seen decent throughput for the pipeline above (~ 5000k elements per
> > second with a single SDK worker).
> >
> > The Flink runner also supports running multiple SDK workers per Flink
> > task manager.
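> >
> > A rough sketch of how both knobs can be passed when submitting from Python
> > (the option names are from memory and may differ across Beam/Flink runner
> > versions, so verify them against the portable and Flink runner options):
> >
> >   from apache_beam.options.pipeline_options import PipelineOptions
> >
> >   options = PipelineOptions([
> >       '--runner=PortableRunner',
> >       '--job_endpoint=localhost:8099',   # Flink job server (example address)
> >       '--max_bundle_size=1000',          # cap bundles at 1000 elements...
> >       '--max_bundle_time_millis=1000',   # ...or 1 second, whichever comes first
> >       '--sdk_worker_parallelism=2',      # SDK worker processes per task manager
> >   ])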
> >
> > Thomas
> >
> >
> > On Thu, Nov 8, 2018 at 11:13 AM Xinyu Liu <xinyuliu...@gmail.com> wrote:
> >>
> >> 19MB/s throughput is enough for us. It seems the bottleneck is the rate of
> >> RPC calls. Our message size is usually 1KB ~ 10KB. So if we can reach
> >> 19MB/s, we will be able to process ~4k qps, which meets our requirements. I
> >> guess increasing the size of the bundles will help. Do you guys have any
> >> results from running Python with Flink? We are curious about the
> >> performance there.
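> >>
> >> (Back-of-envelope, assuming an average message size of roughly 5KB:
> >> 19 MB/s / 5 KB per message ~= 3,800 messages/s, i.e. about 4k qps.)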
> >>
> >> Thanks,
> >> Xinyu
> >>
> >> On Thu, Nov 8, 2018 at 10:13 AM Lukasz Cwik <lc...@google.com> wrote:
> >>>
> >>> This benchmark[1] shows that Python is getting about 19MB/s.
> >>>
> >>> Yes, running more Python sdk_worker processes will improve performance,
> >>> since a Python process is effectively limited to a single CPU core by the GIL.
> >>>
> >>> [1]
> >>> https://performance-dot-grpc-testing.appspot.com/explore?dashboard=5652536396611584&widget=490377658&container=1286539696
> >>>
> >>>
> >>>
> >>> On Wed, Nov 7, 2018 at 5:24 PM Xinyu Liu <xinyuliu...@gmail.com> wrote:
> >>>>
> >>>> Looking at the gRPC dashboard published by the benchmark[1], it seems
> >>>> the streaming ping-pong rate for gRPC in Python is around 2k ~ 3k qps.
> >>>> This seems quite low compared to gRPC performance in other languages,
> >>>> e.g. 600k qps for Java and Go. Are we expected to run multiple
> >>>> sdk_worker processes to improve performance?
> >>>>
> >>>> [1]
> >>>> https://performance-dot-grpc-testing.appspot.com/explore?dashboard=5652536396611584&widget=713624174&container=1012810333&maximized
> >>>>
> >>>> On Wed, Nov 7, 2018 at 11:14 AM Lukasz Cwik <lc...@google.com> wrote:
> >>>>>
> >>>>> gRPC folks provide a bunch of benchmarks for different scenarios:
> >>>>> https://grpc.io/docs/guides/benchmarking.html
> >>>>> You would be most interested in the streaming throughput benchmarks
> >>>>> since the Data API is written on top of the gRPC streaming APIs.
> >>>>>
> >>>>> 200KB/s does seem pretty small. Have you captured any Python
> >>>>> profiles[1] and looked at them?
> >>>>>
> >>>>> 1:
> >>>>> https://lists.apache.org/thread.html/f8488faede96c65906216c6b4bc521385abeddc1578c99b85937d2f2@%3Cdev.beam.apache.org%3E
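> >>>>>
> >>>>> A rough sketch of capturing them via the SDK's profiling options (flag
> >>>>> names are from my memory of ProfilingOptions, so double-check them in
> >>>>> your Beam version), then reading the output with pstats:
> >>>>>
> >>>>>   from apache_beam.options.pipeline_options import PipelineOptions
> >>>>>
> >>>>>   options = PipelineOptions([
> >>>>>       '--profile_cpu',                     # cProfile each bundle in the SDK worker
> >>>>>       '--profile_location=/tmp/profiles',  # where profile files get written
> >>>>>   ])
> >>>>>
> >>>>>   # Later, inspect one of the written files (path is just an example):
> >>>>>   import pstats
> >>>>>   pstats.Stats('/tmp/profiles/some-profile-file').sort_stats(
> >>>>>       'cumulative').print_stats(20)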
> >>>>>
> >>>>>
> >>>>> On Wed, Nov 7, 2018 at 10:18 AM Hai Lu <lhai...@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> This is Hai from LinkedIn. I'm currently working on the Portable API for
> >>>>>> the Samza Runner. I was able to make Python work with a Samza container
> >>>>>> reading from Kafka. However, I'm seeing a severe performance issue with
> >>>>>> my setup, achieving only ~200KB/s throughput between the Samza runner
> >>>>>> on the Java side and the sdk_worker on the Python side.
> >>>>>>
> >>>>>> While I'm digging into this, I wonder whether someone has benchmarked
> >>>>>> the data channel between Java and Python and has results on how much
> >>>>>> throughput can be reached, assuming a single worker thread and a single
> >>>>>> JobBundleFactory?
> >>>>>>
> >>>>>> I might be missing some very basic gRPC setting which leads
> >>>>>> to these unsatisfactory results. So another question is whether there are
> >>>>>> any good articles or documentation about gRPC tuning dedicated to IPC?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Hai
> >>>>>>
> >>>>>>