On 26.08.14 20:24, Andrzej Dębski wrote:
You're right. If you want to keep all data in Kafka without ever
deleting them, you'd need to add partitions dynamically (which is
currently possible with APIs that back the CLI). On the other
hand, using Kafka this way is the wrong approach IMO. If you
really need to keep the full event history, keep old events on
HDFS or wherever and only the more recent ones in Kafka (where a
full replay must first read from HDFS and then from Kafka) or use
a journal plugin that is explicitly designed for long-term event
storage.
That is what was worrying me about using Kafka in a situation
where I would want to keep the events indefinitely (or at least
for an unknown amount of time). What seemed nice is that I would
have a journal/event store and a pub-sub solution implemented in one
technology - basically I want to work around the current limitation of
PersistentView. I wanted to use a Kafka topic and replay all events from
the topic to dynamically added read models in my cluster. Maybe in
this situation I should stick to distributed publish-subscribe in the
cluster for current event delivery and Cassandra as the journal/snapshot
store. I have not read that much about Cassandra and the way it stores
data, so I do not know whether reading all events would be easy.
That's a single table in Cassandra (some details about ordering here
<https://github.com/krasserm/akka-analytics#event-batch-processing>).
One could derive further tables with a user-defined
ordering/filtering/... from which multiple readers/subscribers could
consume and derive read models. These derived tables are comparable to
user-defined topics in the Kafka journal. Whether they are populated by
the plugin during write transactions or later, by running separate
transformation processes, is an implementation detail. The Kafka journal
does the former; the latter gives more flexibility regarding new read
model requirements (since no upfront knowledge is required about what to
write to user-defined tables/topics).
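As a rough sketch of the second approach (a separate transformation process; all names below are hypothetical and not part of any plugin API), a derived read model can be built at any later time by filtering and folding the replayed journal stream:

```scala
// Hypothetical sketch: derive a read model from the raw journal stream
// in a separate transformation process. Because the journal itself is
// written without upfront knowledge of read models, new derivations
// like this one can be added at any later time.
final case class JournalEvent(persistenceId: String, sequenceNr: Long, payload: Any)

object ReadModelDerivation {
  // Filter the replayed stream and fold it into a read-model state
  // (here: an event count per persistenceId, as a trivial example).
  def derive(journal: Iterator[JournalEvent])(keep: JournalEvent => Boolean): Map[String, Long] =
    journal.filter(keep).foldLeft(Map.empty[String, Long]) { (acc, e) =>
      acc.updated(e.persistenceId, acc.getOrElse(e.persistenceId, 0L) + 1L)
    }
}
```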
The main reason why I developed the Kafka plugin was to integrate
my Akka applications in unified log processing architectures as
described in Jay Kreps' excellent article
<http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying>.
Also mentioned in this article is a snapshotting strategy that
fits typical retention times in Kafka.
Thanks for the link.
The most interesting next Kafka plugin feature for me to develop
is an HDFS integration for long-term event storage (and full event
history replay). WDYT?
That would be an interesting feature - it would certainly make the
Akka + Kafka combination viable for more use cases.
On Tuesday, 26 August 2014 at 19:44:05 UTC+2, Martin Krasser wrote:
On 26.08.14 16:44, Andrzej Dębski wrote:
My mind must have filtered out the possibility of making
snapshots using Views - thanks.
About partitions: I suspected as much. The only thing I am
wondering about now is whether it is possible to dynamically create
partitions in Kafka. AFAIK the number of partitions is set during
topic creation (be it programmatically using the API or CLI tools),
and there is a CLI tool you can use to modify an existing topic:
https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-5.AddPartitionTool.
To keep the invariant "a PersistentActor is the only writer to a
partitioned journal topic" you would have to create those
partitions dynamically (usually you don't know up front how many
PersistentActors your system will have) on a per-PersistentActor basis.
You're right. If you want to keep all data in Kafka without ever
deleting them, you'd need to add partitions dynamically (which is
currently possible with APIs that back the CLI). On the other
hand, using Kafka this way is the wrong approach IMO. If you
really need to keep the full event history, keep old events on
HDFS or wherever and only the more recent ones in Kafka (where a
full replay must first read from HDFS and then from Kafka) or use
a journal plugin that is explicitly designed for long-term event
storage.
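A full replay over such a tiered store could be sketched as follows (hypothetical interfaces, not an existing plugin API): archived events are streamed first, then the Kafka tail, skipping any overlap so the combined stream stays in sequence-number order:

```scala
// Hypothetical sketch of a full replay over tiered storage: old events
// from an archive (e.g. HDFS), recent events from Kafka. Both sources
// are assumed to yield events ordered by sequence number.
final case class Event(sequenceNr: Long, payload: Any)

trait EventSource {
  def read(fromSequenceNr: Long): Iterator[Event]
}

final class TieredReplay(archive: EventSource, kafka: EventSource) {
  def replay(fromSequenceNr: Long): Iterator[Event] = {
    // For illustration the archived part is materialized here; a real
    // implementation would stream it.
    val archived = archive.read(fromSequenceNr).toVector
    val lastArchived = archived.lastOption.map(_.sequenceNr).getOrElse(fromSequenceNr - 1L)
    // Kafka may still retain events that were already archived, so the
    // Kafka read continues right after the last archived event.
    archived.iterator ++ kafka.read(math.max(fromSequenceNr, lastArchived + 1L))
  }
}
```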
The main reason why I developed the Kafka plugin was to integrate
my Akka applications in unified log processing architectures as
described in Jay Kreps' excellent article
<http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying>.
Also mentioned in this article is a snapshotting strategy that
fits typical retention times in Kafka.
On the other hand, maybe you are assuming that each actor is
writing to a different topic
yes, and the Kafka plugin is currently implemented that way.
- but I think this solution is not viable because the number of
topics is limited by ZK and other factors:
http://grokbase.com/t/kafka/users/133v60ng6v/limit-on-number-of-kafka-topic.
A more in-depth discussion of these limitations is given at
http://www.quora.com/How-many-topics-can-be-created-in-Apache-Kafka
with a detailed comment from Jay. I'd say that if you designed
your application to run more than a few hundred persistent actors,
then the Kafka plugin is probably the wrong choice. I tend to
design my applications to have only a small number of persistent
actors (which is in contrast to many other discussions on
akka-user), which makes the Kafka plugin a good candidate.
To recap, the Kafka plugin is a reasonable choice if
- frequent snapshotting is done by persistent actors (every day or so)
- you don't have more than a few hundred persistent actors and
- your application is a component of a unified log processing
architecture (backed by Kafka)
The most interesting next Kafka plugin feature for me to develop
is an HDFS integration for long-term event storage (and full event
history replay). WDYT?
On Tuesday, 26 August 2014 at 15:28:47 UTC+2, Martin Krasser wrote:
Hi Andrzej,
On 26.08.14 09:15, Andrzej Dębski wrote:
Hello
Lately I have been reading about the possibility of using
Apache Kafka as a journal/snapshot store for akka-persistence.
I am aware of the plugin created by Martin Krasser:
https://github.com/krasserm/akka-persistence-kafka/, and I
also read another topic about Kafka as a journal:
https://groups.google.com/forum/#!searchin/akka-user/kakfka/akka-user/iIHmvC6bVrI/zeZJtW0_6FwJ.
In both sources I linked two ideas were presented:
1. Set log retention to 7 days, take snapshots every 3 days
(example values)
2. Set log retention to unlimited.
Here is the first question: in the first case, wouldn't it mean
that persistent views would receive a skewed view of the
PersistentActor's state (only events from 7 days)? Is that
really a viable solution? As far as I know, a PersistentView can
only receive events - it can't receive snapshots from the
corresponding PersistentActor (which is good in the general case).
PersistentViews can create their own snapshots which are
isolated from the corresponding PersistentActor's snapshots.
Second question (more directed to Martin): in the thread I
linked you wrote:
I don't go into Kafka partitioning details here but it
is possible to implement the journal driver in a way
that both a single persistent actor's data are
partitioned *and* kept in order
I am very interested in this idea. AFAIK it is not yet
implemented in the current plugin, but I was wondering if you
could share a high-level idea of how you would achieve that (one
persistent actor, multiple partitions, ordering ensured)?
The idea is to
- first write events 1 to n to partition 1
- then write events n+1 to 2n to partition 2
- then write events 2n+1 to 3n to partition 3
- ... and so on
This works because a PersistentActor is the only writer to a
partitioned journal topic. During replay, you first replay
partition 1, then partition 2, and so on. This should be
rather easy to implement in the Kafka journal; I just didn't
have the time so far - pull requests are welcome :) Btw, the
Cassandra journal
<https://github.com/krasserm/akka-persistence-cassandra>
follows the very same strategy for scaling with data volume
(by using different partition keys).
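The scheme above boils down to a simple mapping from sequence number to partition (a hypothetical helper, not the plugin's actual API; partitions are 0-based here, as in Kafka):

```scala
// Sketch of the partitioning scheme described above: with n events per
// partition, events 1..n go to partition 0, events n+1..2n to partition 1,
// and so on. Hypothetical helper, not part of the Kafka journal's API.
object PartitionedJournal {
  // Partition index for a 1-based sequence number.
  def partitionFor(sequenceNr: Long, eventsPerPartition: Long): Long =
    (sequenceNr - 1L) / eventsPerPartition

  // Replay visits partitions in ascending order, which preserves the
  // per-actor event order because a single PersistentActor is the only
  // writer to its partitioned topic.
  def replayOrder(highestSequenceNr: Long, eventsPerPartition: Long): Seq[Long] =
    0L to partitionFor(highestSequenceNr, eventsPerPartition)
}
```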
Cheers,
Martin
--
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the
Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails
from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.
--
Martin Krasser
blog: http://krasserm.blogspot.com
code: http://github.com/krasserm
twitter: http://twitter.com/mrt1nz