On Mon, Sep 22, 2014 at 8:13 PM, Clint Byrum <[email protected]> wrote:
> Excerpts from Joe Gordon's message of 2014-09-22 19:04:03 -0700:
> > On Mon, Sep 22, 2014 at 5:47 PM, Zane Bitter <[email protected]> wrote:
> > > On 22/09/14 17:06, Joe Gordon wrote:
> > >> On Mon, Sep 22, 2014 at 9:58 AM, Zane Bitter <[email protected]> wrote:
> > >>
> > >> On 22/09/14 10:11, Gordon Sim wrote:
> > >>>
> > >>> On 09/19/2014 09:13 PM, Zane Bitter wrote:
> > >>>>
> > >>>>> SQS offers very, very limited guarantees, and it's clear that
> > >>>>> the reason for that is to make it massively, massively scalable
> > >>>>> in the way that e.g. S3 is scalable while also remaining
> > >>>>> comparably durable (S3 is supposedly designed for 11 nines,
> > >>>>> BTW).
> > >>>>>
> > >>>>> Zaqar, meanwhile, seems to be promising the world in terms of
> > >>>>> guarantees. (And then taking it away in the fine print, where
> > >>>>> it says that the operator can disregard many of them,
> > >>>>> potentially without the user's knowledge.)
> > >>>>>
> > >>>>> On the other hand, IIUC Zaqar does in fact have a sharding
> > >>>>> feature ("Pools") which is its answer to the massive scaling
> > >>>>> question.
> > >>>>
> > >>>> There are different dimensions to the scaling problem.
> > >>>
> > >>> Many thanks for this analysis, Gordon. This is really helpful
> > >>> stuff.
> > >>>
> > >>>> As I understand it, pools don't help scaling a given queue since
> > >>>> all the messages for that queue must be in the same pool. At
> > >>>> present, traffic through different Zaqar queues consists of
> > >>>> essentially entirely orthogonal streams. Pooling can help scale
> > >>>> the number of such orthogonal streams, but to be honest, that's
> > >>>> the easier part of the problem.
> > >>>
> > >>> But I think it's also the important part of the problem. When I
> > >>> talk about scaling, I mean 1 million clients sending 10 messages
> > >>> per second each, not 10 clients sending 1 million messages per
> > >>> second each.
> > >>>
> > >>> When a user gets to the point that individual queues have massive
> > >>> throughput, it's unlikely that a one-size-fits-all cloud offering
> > >>> like Zaqar or SQS is _ever_ going to meet their needs. Those users
> > >>> will want to spin up and configure their own messaging systems on
> > >>> Nova servers, and at that kind of size they'll be able to afford
> > >>> to. (In fact, they may not be able to afford _not_ to, assuming
> > >>> per-message-based pricing.)
> > >>
> > >> Running a message queue that has a high guarantee of not losing a
> > >> message is hard, and SQS promises exactly that: it *will* deliver
> > >> your message. If a use case can handle occasionally dropping
> > >> messages then running your own MQ makes more sense.
> > >>
> > >> SQS is designed to handle massive queues as well; while I haven't
> > >> found any examples of queues that have 1 million messages/second
> > >> being sent or received, 30k to 100k messages/second is not unheard
> > >> of [0][1][2].
> > >>
> > >> [0] https://www.youtube.com/watch?v=zwLC5xmCZUs#t=22m53s
> > >> [1] http://java.dzone.com/articles/benchmarking-sqs
> > >> [2] http://www.slideshare.net/AmazonWebServices/massive-message-processing-with-amazon-sqs-and-amazon-dynamodb-arc301-aws-reinvent-2013-28431182
> > >
> > > Thanks for digging those up, that's really helpful input. I think
> > > number [1] kind of summed up part of what I'm arguing here though:
> > >
> > > "But once your requirements get above 35k messages per second,
> > > chances are you need custom solutions anyway; not to mention that
> > > while SQS is cheap, it may become expensive with such loads."
> >
> > If you don't require the reliability guarantees that SQS provides
> > then perhaps.
> > But I would be surprised to hear that a user can set up something
> > with this level of uptime for less:
> >
> > "Amazon SQS runs within Amazon’s high-availability data centers, so
> > queues will be available whenever applications need them. To prevent
> > messages from being lost or becoming unavailable, all messages are
> > stored redundantly across multiple servers and data centers." [1]
>
> This is pretty easily doable with gearman or even just using Redis
> directly. But it is still ops for end users. The AWS users I've talked
> to who use SQS do so because they like that they can use RDS, SQS, and
> ELB, and have only one type of thing to operate: their app.
>
> > >>>> There is also the possibility of using the sharding capabilities
> > >>>> of the underlying storage. But the pattern of use will determine
> > >>>> how effective that can be.
> > >>>>
> > >>>> So for example, on the ordering question, if order is defined by
> > >>>> a single sequence number held in the database and atomically
> > >>>> incremented for every message published, that is not likely to
> > >>>> be something where the database's sharding is going to help in
> > >>>> scaling the number of concurrent publications.
> > >>>>
> > >>>> Though sharding would allow scaling the total number of messages
> > >>>> on the queue (by distributing them over multiple shards), the
> > >>>> total ordering of those messages reduces its effectiveness in
> > >>>> scaling the number of concurrent getters (e.g. the concurrent
> > >>>> subscribers in pub-sub) since they will all be getting the
> > >>>> messages in exactly the same order.
> > >>>>
> > >>>> Strict ordering impacts the competing consumers case also (and
> > >>>> is in my opinion of limited value as a guarantee anyway). At any
> > >>>> given time, the head of the queue is in one shard, and all
> > >>>> concurrent claim requests will contend for messages in that same
> > >>>> shard.
> > >>>> Though the unsuccessful claimants may then move to another shard
> > >>>> as the head moves, they will all again try to access the
> > >>>> messages in the same order.
> > >>>>
> > >>>> So if Zaqar's goal is to scale the number of orthogonal queues,
> > >>>> and the number of messages held at any time within these, the
> > >>>> pooling facility and any sharding capability in the underlying
> > >>>> store for a pool would likely be effective even with the strict
> > >>>> ordering guarantee.
> > >>>
> > >>> IMHO this is (or should be) the goal - support enormous numbers
> > >>> of small-to-moderate sized queues.
> > >>
> > >> If 50,000 messages per second doesn't count as small-to-moderate
> > >> then Zaqar does not fulfill a major SQS use case.
> > >
> > > It's not a drop-in replacement, but as I mentioned you can recreate
> > > the SQS semantics exactly *and* get the scalability benefits of
> > > that approach by sharding at the application level and then
> > > round-robin polling.
> > >
> > > As I also mentioned, this is pretty easy to implement, and is only
> > > required for really big applications that are more likely to be
> > > written by developers who already Know What They're Doing(TM),
> > > while the reverse (emulating Zaqar semantics, i.e. FIFO, in SQS) is
> > > tricky, error-prone, and conceivably required by or at least
> > > desirable for all kinds of beginner-level applications. (It's also
> > > pretty useful for a lot of use cases in OpenStack itself, where
> > > OpenStack services are sending messages to the user.)
> >
> > I'm not convinced that is as simple to implement well as you make it
> > out to be; now every receiver has to poll N endpoints instead of 1.
> > How would this work with long polling? What is the impact on
> > expanding the number of connections? How do you make this auto-scale?
> > etc.
>
> It's pretty easy to set up a select loop on multiple HTTP requests.
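For what it's worth, the select loop Clint mentions really is short to write with Python's standard selectors module. A minimal sketch (the connections here stand in for sockets on which long-poll requests have already been sent; none of this is Zaqar's actual client API):

```python
import selectors
import socket

def wait_any(connections, timeout=None):
    """Multiplex reads over several open connections in one select loop,
    returning (connection, bytes_read) for whichever is readable first.

    Each connection is assumed to be a socket with a long-poll request
    already outstanding, so the queue endpoint that responds first is
    handled first.
    """
    sel = selectors.DefaultSelector()
    for conn in connections:
        sel.register(conn, selectors.EVENT_READ)
    try:
        events = sel.select(timeout)
        if not events:
            return None, None  # timed out: caller re-issues the long polls
        key, _ = events[0]
        return key.fileobj, key.fileobj.recv(4096)
    finally:
        sel.close()  # unregisters everything; caller keeps the sockets
```

A real consumer would run this in a loop, re-issuing a long poll on whichever connection just yielded data.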
> I don't think this complicates things. Also once you assume
> responsibility for the service that does the queueing, you're also more
> likely to accept a binary protocol rather than rely on long poll / REST
> semantics.

Agreed, once the user takes on the responsibility of running the service
the equation changes. I was referring to the case where someone consumes
Zaqar as is.

> Number of connections I don't understand given the situation above. One
> per client per availability-zone covered?
>
> Auto-scaling is a bit beyond the scope of the discussion. Nobody is
> promising that Zaqar will auto-scale itself.

I was referring to how to know when to create a new queue, how to
distribute that information, and when to remove that queue because your
load has dropped. I think there may be a bunch of edge cases that are
hard to deal with.

> > I don't know enough about the target audience to know if adding the
> > FIFO guarantee is a good trade-off or not. But I don't follow your
> > use case for OpenStack services sending messages to the user; when do
> > these need to be in order?
>
> I don't think the FIFO is needed in most cases. It's the kind of thing
> that really helps you when you need it, but otherwise it's just sitting
> there doing nothing for you, costing your operator more to guarantee.
>
> > >>>> If scaling the number of communicants on a given communication
> > >>>> channel is a goal however, then strict ordering may hamper that.
> > >>>> If it does, it seems to me that this is not just a policy tweak
> > >>>> on the underlying datastore to choose the desired balance
> > >>>> between ordering and scale, but a more fundamental question on
> > >>>> the internal structure of the queue implementation built on top
> > >>>> of the datastore.
> > >>>
> > >>> I agree with your analysis, but I don't think this should be a
> > >>> goal.
> > >>> Note that the user can still implement this themselves using
> > >>> application-level sharding - if you know that in-order delivery
> > >>> is not important to you, then randomly assign clients to a queue
> > >>> and then poll all of the queues in round-robin. This yields
> > >>> _exactly_ the same semantics as SQS.
> > >>>
> > >>> The reverse is true of SQS - if you want FIFO then you have to
> > >>> implement re-ordering by sequence number in your application.
> > >>> (I'm not certain, but it also sounds very much like this
> > >>> situation is ripe for losing messages when your client dies.)
> > >>>
> > >>> So the question is: in which use case do we want to push
> > >>> additional complexity into the application? The case where there
> > >>> are truly massive volumes of messages flowing to a single point?
> > >>> Or the case where the application wants the messages in order?
> > >>>
> > >>> I'd suggest both that the former applications are better able to
> > >>> handle that extra complexity and that the latter applications are
> > >>> probably more common. So it seems that the Zaqar team made a good
> > >>> decision.
> > >>
> > >> If Zaqar is supposed to be comparable to Amazon SQS, then it has
> > >> picked the wrong choice.
> > >
> > > It has certainly picked a different choice. It seems like a choice
> > > that is friendlier to beginners and simple applications, while
> > > shifting some complexity to larger applications without excluding
> > > them as a use case. That's certainly not an invalid choice.
> > >
> > > It isn't OpenStack's job to be cloning AWS services after all... we
> > > can definitely address the same problems better when we see the
> > > opportunity.
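The application-level sharding Zane describes (random assignment on publish, round-robin polling on receive) can be sketched in a few lines. This is a hedged illustration: the `queue_client` object and its post()/claim() methods are a hypothetical stand-in for a Zaqar-style API, not the real interface.

```python
import itertools
import random

class ShardedQueue:
    """Emulate SQS-style (unordered, horizontally scalable) semantics
    on top of N ordered queues by sharding at the application level.

    `queue_client` is a hypothetical client exposing
    post(queue_name, message) and claim(queue_name) -> message or None.
    """

    def __init__(self, queue_client, base_name, num_shards):
        self.client = queue_client
        self.shards = ["%s-%d" % (base_name, i) for i in range(num_shards)]
        self._rr = itertools.cycle(self.shards)  # round-robin state

    def post(self, message):
        # Producers pick a shard at random, spreading write load.
        self.client.post(random.choice(self.shards), message)

    def claim(self):
        # Consumers poll the shards round-robin; cross-shard ordering is
        # deliberately given up, which is exactly the SQS trade-off.
        for _ in range(len(self.shards)):
            msg = self.client.claim(next(self._rr))
            if msg is not None:
                return msg
        return None  # every shard was empty
```

Because no consumer depends on cross-queue order, each underlying queue can live in a different pool, which is where the scalability would come from.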
> > > We should, of course, think very carefully about all the
> > > consequences, intended and unintended, of changing a model that is
> > > already proven in the field and the market, so I'm very glad this
> > > discussion is happening. But after digging into it, the choice
> > > doesn't seem "wrong" to me.
> >
> > To me this is less about valid or invalid choices. The Zaqar team is
> > comparing Zaqar to SQS, but after digging into the two of them, Zaqar
> > barely looks like SQS. Zaqar doesn't guarantee what is IMHO the most
> > important part of SQS: the message will be delivered and will never
> > be lost by SQS. Zaqar doesn't have the same scaling properties as
> > SQS. Zaqar is aiming for low latency per message; SQS doesn't appear
> > to be. So if Zaqar isn't SQS, what is Zaqar and why should I use it?
>
> I have to agree. I'd like to see a simple, non-ordered, high latency,
> high scale messaging service that can be used cheaply by cloud
> operators and users. What I see instead is a very powerful, ordered,
> low latency, medium scale messaging service that will likely cost a
> lot to scale out to the thousands of users level.
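For contrast, the re-ordering by sequence number that an application would need in order to get FIFO out of SQS might look roughly like this. It is a sketch under a strong assumption: publishers stamp every message with a gap-free sequence number, which is itself nontrivial to arrange.

```python
import heapq

class Reorderer:
    """Re-establish FIFO order from an out-of-order, at-least-once
    message stream, given publisher-assigned sequence numbers.

    Messages are buffered until every earlier sequence number has
    arrived; duplicates (redeliveries) are dropped.
    """

    def __init__(self, first_seq=0):
        self.next_seq = first_seq
        self.pending = []  # min-heap of (seq, message)

    def push(self, seq, message):
        """Accept one message; return the (possibly empty) list of
        messages that are now deliverable in order."""
        if seq < self.next_seq:
            return []  # duplicate of an already-delivered message
        heapq.heappush(self.pending, (seq, message))
        ready = []
        while self.pending and self.pending[0][0] <= self.next_seq:
            s, m = heapq.heappop(self.pending)
            if s == self.next_seq:  # skip duplicates still in the buffer
                ready.append(m)
                self.next_seq += 1
        return ready
```

Note the failure mode Zane hints at: if the message carrying `next_seq` is lost, or a client dies while holding it, everything behind it stays buffered indefinitely.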
_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
