lidavidm commented on code in PR #13873:
URL: https://github.com/apache/arrow/pull/13873#discussion_r948277546
##########
docs/source/cpp/flight.rst:
##########

@@ -172,6 +172,110 @@ request/response. On the server, they can inspect incoming headers and fail the
 request; hence, they can be used to implement custom authentication methods.
 
+.. _flight-best-practices:
+
+Best practices
+==============
+
+gRPC
+----
+
+When using the default gRPC transport, options can be passed to it via
+:member:`arrow::flight::FlightClientOptions::generic_options`. For example:
+
+.. code-block:: cpp
+
+   auto options = FlightClientOptions::Defaults();
+   // Set a very low limit at the gRPC layer to fail all calls
+   options.generic_options.emplace_back(GRPC_ARG_MAX_RECEIVE_MESSAGE_LENGTH, 4);
+
+Also see `best gRPC practices`_ and available `gRPC keys`_.

Review Comment:
   This is less a best practice and more of a "how-to", so I think this should
   go in the cookbook.



##########
docs/source/cpp/flight.rst:
##########

+Re-use clients whenever possible
+--------------------------------
+
+Closing clients causes gRPC to close and clean up connections, which can take
+several seconds per connection. This will stall server and client threads if
+done too frequently. Client reuse will avoid this issue.
+
+Don’t round-robin load balance
+------------------------------
+
+Round robin balancing can cause every client to have an open connection to
+every server causing an unexpected number of open connections and a depletion
+of resources.

Review Comment:
   ```suggestion
   Round robin balancing means every client can have an open connection to
   every server, causing an unexpected number of open connections and depleting
   server resources.
   ```



##########
docs/source/cpp/flight.rst:
##########

+Round robin balancing can cause every client to have an open connection to

Review Comment:
   Link to the gRPC page describing load balancing?
   https://github.com/grpc/grpc/blob/master/doc/load-balancing.md#round_robin
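   For illustration, a minimal sketch of pinning the policy (assuming the
   default gRPC transport; `GRPC_ARG_LB_POLICY_NAME` is a standard gRPC
   channel argument, not a Flight-specific API):

   ```cpp
   #include <arrow/flight/client.h>
   #include <grpc/impl/codegen/grpc_types.h>  // GRPC_ARG_LB_POLICY_NAME

   using arrow::flight::FlightClientOptions;

   auto options = FlightClientOptions::Defaults();
   // Keep gRPC's default "pick_first" policy instead of "round_robin",
   // so each client holds one connection rather than one per server.
   options.generic_options.emplace_back(GRPC_ARG_LB_POLICY_NAME, "pick_first");
   ```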
##########
docs/source/cpp/flight.rst:
##########

+Debugging
+---------
+
+Use netstat to see the number of open connections.
+For debugging, set the environment variables ``GRPC_VERBOSITY=info
+GRPC_TRACE=http`` to print the initial headers (on both sides) so you can see
+whether gRPC established the connection or not. It will also print when a
+message is sent, so you can tell whether the connection is open or not.
+
+Use ListFlights sparingly
+-------------------------
+
+The ListFlights endpoint is largely implemented as a normal gRPC stream
+endpoint and can hit transfer bottlenecks if used too much. To estimate the
+data transfer bottleneck: 5k schemas will serialize to about 1-5 MB/call.
+Assuming a gRPC localhost bottleneck of 3 GB/s, you can at best serve
+600-3000 clients/s.
+
+ARROW-15764_ proposes a caching optimisation for the server side, but it has
+not yet been implemented.

Review Comment:
   This is too specific to have, IMO.



##########
docs/source/cpp/flight.rst:
##########

+Note: "connect" isn't really a connect, and we’ve observed that gRPC does not
+give you the actual error until you first try to make a call. This can cause
+errors to be reported at unexpected times.

Review Comment:
   ```suggestion
   gRPC may not report connection errors until a call is actually made.
   Hence, to detect connection errors when creating a client, some sort of
   dummy RPC should be made.
   ```
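   A possible probe, as a sketch (assuming the Result-returning `ListFlights()`
   overload; any cheap RPC would do, and `ProbeConnection` is a made-up helper
   name):

   ```cpp
   #include <arrow/flight/client.h>
   #include <arrow/status.h>

   // Issue a cheap dummy RPC right after Connect() so that deferred gRPC
   // connection errors surface here instead of at an unexpected later point.
   arrow::Status ProbeConnection(arrow::flight::FlightClient& client) {
     return client.ListFlights().status();
   }
   ```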
##########
docs/source/cpp/flight.rst:
##########

+.. _flight-best-practices:
+
+Best practices

Review Comment:
   Link to the gRPC best practices doc as well?
##########
docs/source/cpp/flight.rst:
##########

+Closing unresponsive connections
+--------------------------------
+
+1. A stale connection can be closed using
+   :member:`arrow::flight::FlightCallOptions::stop_token`. This requires
+   recording the stop token at call establishment time.

Review Comment:
   This needs code samples (it would be great if all code samples linked to
   the cookbook).
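   Something along these lines, as a sketch (assuming the stop token is passed
   per call via `FlightCallOptions`, with `arrow::StopSource` providing it):

   ```cpp
   #include <arrow/flight/client.h>
   #include <arrow/util/cancel.h>

   // Record a stop source when starting the call so another thread can
   // cancel the call later if the connection goes stale.
   arrow::StopSource stop_source;

   arrow::flight::FlightCallOptions call_options;
   call_options.stop_token = stop_source.token();

   // ... start a long-running call (e.g. DoGet) with call_options ...

   // Later, from a watchdog thread: forcibly end the unresponsive call.
   stop_source.RequestStop();
   ```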
##########
docs/source/cpp/flight.rst:
##########

+gRPC will spawn an unbounded number of threads for concurrent clients. Those
+threads are not necessarily cleaned up (a "cached thread pool" in Java
+parlance).

Review Comment:
   Move this to 'excessive traffic' below?



##########
docs/source/cpp/flight.rst:
##########

+glibc malloc clears some per-thread state and the default tuning never clears
+caches in some workloads. But you can explicitly tell malloc to dump caches.
+See ARROW-16697_ as an example.
+
+A quick way of testing: attach to the process with a debugger and call
+malloc_trim.

Review Comment:
   Or call ReleaseUnused on the system pool.
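   E.g. (a sketch; `arrow::MemoryPool::ReleaseUnused` and
   `arrow::system_memory_pool` are assumed to be available in the build):

   ```cpp
   #include <arrow/memory_pool.h>

   // Ask the allocator behind the system pool to return cached free memory
   // to the OS (a no-op if the underlying allocator does not support it).
   arrow::system_memory_pool()->ReleaseUnused();
   ```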
##########
docs/source/cpp/flight.rst:
##########

+Excessive traffic
+-----------------
+
+There are basically two ways to handle excessive traffic:
+
+* unbounded thread pool -> everyone gets serviced, but it might take forever.
+* bounded thread pool -> reject connections/requests when under load, and have
+  clients retry with backoff. This also gives an opportunity to retry with a
+  different node. Not everyone gets serviced, but quality of service stays
+  consistent.

Review Comment:
   We should add code to configure these modes.
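   For the bounded mode, a sketch of capping gRPC's thread count via a
   resource quota (assuming the gRPC transport; the hook body and quota
   values are illustrative only):

   ```cpp
   #include <arrow/flight/server.h>
   #include <grpcpp/resource_quota.h>
   #include <grpcpp/server_builder.h>

   arrow::flight::Location location;  // assume initialized elsewhere
   arrow::flight::FlightServerOptions options(location);
   // builder_hook exposes the underlying grpc::ServerBuilder before start.
   options.builder_hook = [](void* raw_builder) {
     auto* builder = reinterpret_cast<grpc::ServerBuilder*>(raw_builder);
     grpc::ResourceQuota quota("flight_server_quota");
     quota.SetMaxThreads(64);  // bound the gRPC thread pool
     builder->SetResourceQuota(quota);
   };
   ```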
##########
docs/source/cpp/flight.rst:
##########

+Memory cache client-side
+------------------------

Review Comment:
   This isn't specific to the client.



##########
docs/source/cpp/flight.rst:
##########

+Flight uses gRPC allocator wherever possible.

Review Comment:
   ```suggestion
   Flight tries to reuse allocations made by gRPC to avoid redundant data copies.
   However, this means that those allocations may not be tracked by the Arrow
   memory pool, and that memory usage behavior, such as whether free memory is
   returned to the system, is dependent on the allocator that gRPC uses
   (usually the system allocator).
   ```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
