Listen to Colin's advice and avoid the temptation of anti-patterns.

On Sat, Jan 3, 2015 at 6:10 PM, Colin <colpcl...@gmail.com> wrote:
> Use a message bus with a transactional get: get the message, send it to
> Cassandra, and upon write success, submit it to the ESP and commit the get
> on the bus. Messaging systems like RabbitMQ support this semantic.
>
> Using Cassandra as a queuing mechanism is an anti-pattern.
>
> --
> *Colin Clark*
> +1-320-221-9531
>
> On Jan 3, 2015, at 6:07 PM, Hugo José Pinto <hugo.pi...@inovaworks.com> wrote:
>
> Thank you all for your answers.
>
> It seems I'll have to go with some event-driven processing before/during
> the Cassandra write path.
>
> My concern is that I'd love to first guarantee the disk write of the
> Cassandra persistence and only then do the event processing (which is
> mostly CRUD intercepts at this point), even if slightly delayed; doing so
> via triggers would probably bog down the whole processing pipeline.
>
> What I'd probably do is write, from a trigger, to a separate key table
> with all the CRUDed elements and have the ESP process that table.
>
> Thank you for your contribution. Should anyone else have any experience
> in these scenarios, I'm obviously all ears as well.
>
> Best,
>
> Hugo
>
> Sent from my iPhone
>
> On 03/01/2015, at 11:09, DuyHai Doan <doanduy...@gmail.com> wrote:
>
> Hello Hugo
>
> I was facing the same kind of requirement from some users. Long story
> short, below are the possible strategies, with the advantages and
> drawbacks of each:
>
> 1) Put Spark in front of the back end: every incoming
> modification/update/insert goes to Spark first, then Spark forwards it to
> Cassandra for persistence. With Spark, you can perform pre- or
> post-processing and notify external clients of mutations.
>
> The drawback of this solution is that all incoming mutations must go
> through Spark.
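The transactional get → write → ack flow Colin describes upthread can be sketched as follows. This is only a toy simulation: a plain in-memory `deque` stands in for RabbitMQ, and stub functions stand in for the Cassandra write and the ESP submission. With a real broker (e.g. pika against RabbitMQ), "commit the get" would be acknowledging the delivery only after the downstream write succeeded.

```python
from collections import deque

# Stand-ins for the real systems (illustrative only, not real client code).
cassandra, esp = [], []

def write_to_cassandra(msg):
    cassandra.append(msg)      # pretend durable write; would raise on failure

def submit_to_esp(msg):
    esp.append(msg)            # hand the event to the stream processor

queue = deque(["pos-update-1", "pos-update-2"])

def consume(queue):
    """Transactional get: the message is only removed (acked) after the
    Cassandra write succeeds and the event is handed to the ESP."""
    while queue:
        msg = queue[0]         # uncommitted get, like an unacked delivery
        try:
            write_to_cassandra(msg)
            submit_to_esp(msg)
            queue.popleft()    # commit the get -- message is now gone
        except Exception:
            break              # leave the message on the bus for redelivery

consume(queue)
```

The point of this ordering is that a crash between the Cassandra write and the commit only causes a redelivery (at-least-once), never a lost mutation.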
> You may set up a Kafka queue as temporary storage to distribute the load
> and consume the mutations with Spark, but that adds to the architecture's
> complexity with additional components & technologies.
>
> 2) For high availability and resilience, you probably want all mutations
> saved into Cassandra first, and then to process notifications with Spark.
> In this case the only way to get notifications out of Cassandra, as of
> version 2.1, is to rely on manually coded triggers (still an experimental
> feature).
>
> With triggers you can notify whatever clients you want, not only Spark.
>
> The big drawback of this solution is that playing with triggers is
> dangerous if you are not familiar with Cassandra internals. Indeed, the
> trigger sits on the write path and may hurt performance if you are doing
> complex, blocking tasks.
>
> Those are the two solutions I can see; maybe the ML members will propose
> other innovative choices.
>
> Regards
>
> On Sat, Jan 3, 2015 at 11:46 AM, Hugo José Pinto <hugo.pi...@inovaworks.com> wrote:
>
>> Hello.
>>
>> We're currently using Hazelcast (http://hazelcast.org/) as a distributed
>> in-memory data grid. That's been working sort-of-well for us, but going
>> solely in-memory has exhausted its path in our use case, and we're
>> considering porting our application to a NoSQL persistent store. After
>> the usual comparisons and evaluations, we're borderline close to picking
>> Cassandra, plus eventually Spark for analytics.
>>
>> Nonetheless, there is a gap in our architectural needs that we're still
>> not grasping how to solve in Cassandra (with or without Spark): Hazelcast
>> allows us to create a Continuous Query such that, whenever a row is
>> added/removed/modified in the clause's result set, Hazelcast calls us
>> back with the corresponding notification. We use this to continuously
>> update the clients via AJAX streaming with the new/changed rows.
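The "separate key table" idea Hugo floats upthread (a trigger appending a changelog row per mutation, consumed by the ESP) might look something like this in CQL. All keyspace, table, and column names here are invented for illustration; only the shape matters: a time-bucketed partition key so the log stays scannable, a `timeuuid` clustering column for ordering, and a TTL so entries expire on their own.

```sql
-- Hypothetical changelog table, populated by the trigger on the write path.
-- The ESP scans recent partitions instead of the base tables.
CREATE TABLE ship_tracking.mutation_log (
    day        text,        -- coarse time bucket, e.g. '2015-01-03'
    mutated_at timeuuid,    -- orders mutations within the bucket
    table_name text,
    row_key    text,
    op         text,        -- 'INSERT' / 'UPDATE' / 'DELETE'
    PRIMARY KEY (day, mutated_at)
) WITH CLUSTERING ORDER BY (mutated_at DESC)
  AND default_time_to_live = 86400;  -- entries expire after a day

-- ESP consumer: fetch the latest mutations for the current bucket.
SELECT mutated_at, table_name, row_key, op
FROM ship_tracking.mutation_log
WHERE day = '2015-01-03'
LIMIT 500;
```

Note the caveat DuyHai raises still applies: the trigger doing this extra write sits on the hot write path, so it must stay cheap and non-blocking.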
>> This is probably a conceptual mismatch we're making, so: how do we best
>> address this use case in Cassandra (with or without Spark's help)? Is
>> there something in the API that allows for Continuous Queries on
>> key/clause changes (we haven't found it)? Is there some other way to get
>> a stream of key/clause updates? Events of some sort?
>>
>> I'm aware that we could periodically poll Cassandra, but in our use case
>> the client is potentially interested in a large number of table-clause
>> notifications (think "all changes to ship positions on California's
>> coastline"), and iterating out of the store would kill the streamer's
>> scalability.
>>
>> Hence, the magic question: what are we missing? Is Cassandra the wrong
>> tool for the job? Are we not aware of a particular part of the API or an
>> external library in/outside the Apache realm that would allow for this?
>>
>> Many thanks for any assistance!
>>
>> Hugo
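For reference, the continuous-query semantics Hugo is asking about amount to the following: register a predicate once, then get a push callback for every matching change. This is a toy in-memory illustration of that contract, not Hazelcast's or Cassandra's actual API; all names are invented.

```python
# Toy illustration of continuous-query semantics: subscribe a predicate,
# receive a callback on every matching change. Not a real Hazelcast/Cassandra API.

class ContinuousQueryStore:
    def __init__(self):
        self.rows = {}
        self.subscriptions = []   # (predicate, callback) pairs

    def subscribe(self, predicate, callback):
        self.subscriptions.append((predicate, callback))

    def put(self, key, row):
        self.rows[key] = row
        for predicate, callback in self.subscriptions:
            if predicate(row):
                callback(key, row)   # push the change, e.g. AJAX-stream it

store = ContinuousQueryStore()
california = []

# "All changes to ship positions on California's coastline"
store.subscribe(lambda row: row["region"] == "CA",
                lambda key, row: california.append((key, row["lat"])))

store.put("ship-1", {"region": "CA", "lat": 36.6})  # matches -> callback fires
store.put("ship-2", {"region": "NY", "lat": 40.7})  # no match -> ignored
```

The thread's suggestions (message bus, Spark front end, trigger-fed changelog table) are different ways of bolting this push model onto Cassandra, which does not provide it natively as of 2.1.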