On 2018-04-19 20:43, Ariel Weisberg wrote:
Hi,

So at a technical level I don't understand this yet.

So you have a database consisting of single-threaded shards and an accept socket that is generating TCP connections, and in advance you don't know which connection is going to send messages to which shard.

What is the mechanism by which you get the packets for a given TCP connection 
delivered to a specific core? I know that a given TCP connection will normally 
have all of its packets delivered to the same queue from the NIC because the 
tuple of source address + port and destination address + port is typically 
hashed to pick one of the queues the NIC presents. I might have the contents of 
the tuple slightly wrong, but it always includes a component you don't get to 
control.

Right, that's how it's done. The component you typically don't get to control is the client-side local port, but you can bind to a local port if you want.
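As a minimal sketch of the client side (the addresses and port numbers here are made up), fixing that one tuple component just means binding before connecting:

    import socket

    SERVER = ("203.0.113.10", 9042)  # hypothetical server address
    LOCAL_PORT = 54321               # client-chosen local (source) port

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Fix the source port instead of letting the OS pick an ephemeral one;
    # the 4-tuple, and hence the NIC's hash, is now fully determined.
    sock.bind(("0.0.0.0", LOCAL_PORT))
    sock.connect(SERVER)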

Since it's hashing, how do you manipulate which queue the packets for a TCP connection go to, and how is it made worse by having an accept socket per shard?

It's not made worse, it's just not made better.

There are at least three ways to get multiqueue to work with thread-per-core without software movement of packets, none of them pretty:

1. The client tells the server which shard to connect to. The server uses "Flow Director" [1] or an equivalent to bypass the hash and bind the connection to a particular queue. This is problematic since you need to bypass the TCP stack, and since there are a limited number of entries in the Flow Director table.

2. The client asks the server which shard it happened to connect to. This requires the client to open many connections in order to reach all shards, and then close any excess connections (did I mention it wasn't pretty?).

3. The server communicates the hash function to the client, or perhaps suggests local ports for the client to use in order to reach a shard (a rough sketch follows the references below). This can be problematic if the server doesn't know the hash function (which can happen in some virtualized environments, with new NICs, or with limited knowledge of the hardware topology). See a similar approach in [2].

[1] https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/intel-ethernet-flow-director.pdf
[2] https://github.com/scylladb/seastar/blob/0b8b851b432a1d04522a80d9830e07449d71caa2/net/tcp.hh#L790
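To make option 3 concrete, here is a rough Python sketch, not anything either project ships: it assumes the server has told the client its 40-byte RSS (Toeplitz) key and that queue i is served by shard i, and the key, addresses, and modulo mapping below are all placeholders. The client scans its ephemeral port range for a source port whose hash lands on the desired shard:

    import socket
    import struct

    RSS_KEY = bytes(range(40))                     # placeholder key, not a real NIC key
    SERVER_IP, SERVER_PORT = "203.0.113.10", 9042  # hypothetical server
    NR_QUEUES = 8                                  # assumed: one queue per shard

    def toeplitz_hash(key: bytes, data: bytes) -> int:
        # Standard Toeplitz hash: for each set bit of the input, XOR in the
        # 32-bit window of the key starting at that bit position.
        key_int = int.from_bytes(key, "big")
        key_bits = len(key) * 8
        result = 0
        for i, byte in enumerate(data):
            for b in range(8):
                if byte & (0x80 >> b):
                    result ^= (key_int >> (key_bits - 32 - (i * 8 + b))) & 0xFFFFFFFF
        return result

    def queue_for(src_ip, src_port, dst_ip, dst_port):
        # IPv4/TCP hash input: source addr, dest addr, source port, dest port.
        data = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
                + struct.pack("!HH", src_port, dst_port))
        return toeplitz_hash(RSS_KEY, data) % NR_QUEUES

    def connect_to_shard(local_ip, shard):
        # Scan the ephemeral range for a source port that hashes to `shard`.
        for port in range(49152, 65536):
            if queue_for(local_ip, port, SERVER_IP, SERVER_PORT) == shard:
                s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                s.bind((local_ip, port))
                s.connect((SERVER_IP, SERVER_PORT))
                return s
        raise RuntimeError("no local port maps to the requested shard")

A real NIC maps the hash through an indirection table rather than a plain modulo, which is part of why the server may not be able to describe the mapping to the client at all.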


You also mention 160 ports as bad, but it doesn't sound like a big number 
resource wise. Is it an operational headache?

Port 9042 + N can easily conflict with another statically allocated port on the server. I guess you can listen on ephemeral ports, but then if you firewall them, you need to adjust the firewall rules.

In any case, it doesn't solve the problem of directing a connection's packets to a specific queue.


RE tokens distributed amongst shards. The way that would work right now is that each port number appears to be a discrete instance of the server. So you could have shards be actual shards that are simply colocated on the same box, run in the same process, and share resources. I know this pushes more of the complexity into the server vs. the driver, as the server expects all shards to share some client-visible state, like system tables and certain identifiers.

This has its own problems; I'll address them in the other sub-thread (or, using our term, the other continuation).


Ariel
On Thu, Apr 19, 2018, at 12:59 PM, Avi Kivity wrote:
Port-per-shard is likely the easiest option, but it's too ugly to
contemplate. We run on machines with 160 shards (IBM POWER 2s20c160t
IIRC); it would be just horrible to have 160 open ports.


It also doesn't fit well with the NIC's ability to automatically
distribute packets among cores using multiple queues, so the kernel
would have to shuffle those packets around. Much better to have those
packets delivered directly to the core that will service them.


(also, some protocol changes are needed so the driver knows how tokens
are distributed among shards)

On 2018-04-19 19:46, Ben Bromhead wrote:
WRT #3:
To fit in the existing protocol, could you have each shard listen on a
different port? Drivers are likely going to support this due to
https://issues.apache.org/jira/browse/CASSANDRA-7544 (
https://issues.apache.org/jira/browse/CASSANDRA-11596). I'm not super
familiar with the ticket so there might be something I'm missing, but it
sounds like a potential approach.

This would give you a path forward at least for the short term.


On Thu, Apr 19, 2018 at 12:10 PM Ariel Weisberg <ar...@weisberg.ws> wrote:

Hi,

I think that updating Cassandra's protocol spec puts the onus on the
party changing the protocol specification to have an implementation of the
spec in Cassandra as well as in the Java and Python drivers (those are both
used in the Cassandra repo). Until it's implemented in Cassandra, we haven't
fully evaluated the specification change. There is no substitute for trying
to make it work.

There are also realities to consider as to what the maintainers of the
drivers are willing to commit.

RE #1,

I am +1 on the fact that we shouldn't require an extra hop for range scans.

In JIRA, Jeremiah made the point that you can still do this from the client
by breaking up the token ranges, but it's a leaky abstraction to have a
paging interface that isn't a vanilla ResultSet interface. Serial vs.
parallel is kind of orthogonal, as the driver can do either.
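(As a rough illustration of that client-side approach, using the DataStax Python driver: the table t, columns pk and v, and split count below are hypothetical, and the Murmur3 partitioner's tokens span the signed 64-bit range. The subranges are issued serially here, but could just as well be issued in parallel.)

    from cassandra.cluster import Cluster  # DataStax Python driver

    MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1   # Murmur3 token range
    N_SPLITS = 256                             # arbitrary granularity

    session = Cluster(["127.0.0.1"]).connect("ks")
    step = (MAX_TOKEN - MIN_TOKEN) // N_SPLITS
    rows = []
    for i in range(N_SPLITS):
        lo = MIN_TOKEN + i * step
        hi = MAX_TOKEN if i == N_SPLITS - 1 else lo + step
        # Each subrange is an independent, bounded range scan.
        rows.extend(session.execute(
            "SELECT pk, v FROM t WHERE token(pk) > %s AND token(pk) <= %s",
            (lo, hi)))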

I agree it looks like the current specification doesn't make what should
be simple as simple as it could be for driver implementers.

RE #2,

+1 on this change assuming an implementation in Cassandra and the Java and
Python drivers.

RE #3,

It's hard to be +1 on this because we don't benefit by boxing ourselves in
by defining a spec we haven't implemented, tested, and decided we are
satisfied with. Having it in ScyllaDB de-risks it to a certain extent, but
what if Cassandra decides to go a different direction in some way?

I don't think there is much discussion to be had without an example of the
changes to the CQL specification to look at, but even then, if it looks
risky, I am not likely to be in favor of it.

Regards,
Ariel

On Thu, Apr 19, 2018, at 9:33 AM, glom...@scylladb.com wrote:
On 2018/04/19 07:19:27, kurt greaves <k...@instaclustr.com> wrote:
1. The protocol change is developed using the Cassandra process in
     a JIRA ticket, culminating in a patch to
     doc/native_protocol*.spec when consensus is achieved.
I don't think forking would be desirable (for anyone), so this seems
the most reasonable to me. For 1 and 2 it certainly makes sense, but I
can't say I know enough about sharding to comment on 3 - it seems to me
like it could be locking in a design before anyone truly knows what
sharding in C* looks like. But hopefully I'm wrong and there are
devs out there that have already thought that through.
Thanks. That is our view as well, and it's great to hear.

About our proposal number 3: in my view, good protocol designs are
future-proof and flexible. We certainly don't want to propose a design
that works just for Scylla, but one that would support reasonable
implementations regardless of how they may look.

Do we have driver authors who wish to support both projects?

Surely, but I imagine it would be a minority.


--
Ben Bromhead
CTO | Instaclustr <https://www.instaclustr.com/>
+1 650 284 9692
Reliability at Scale
Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer





