[
https://issues.apache.org/jira/browse/SAMZA-256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048696#comment-14048696
]
Martin Kleppmann commented on SAMZA-256:
----------------------------------------
I would prefer approach (1), a separate factory for each type of storage
engine. I fear that a generic key-value interface that abstracts across
multiple storage engines would be a leaky abstraction; users would still have
to think about which storage engine is being used under the hood. Some of the
subtle differences that may arise:
- LevelDB and RocksDB use a sorted log-structured representation which allows
efficient range queries, whereas a HashMap has no key ordering and so cannot
serve a range query without scanning every entry.
- Perhaps the in-memory store should use a TreeMap instead, but then it's
limited to keys that implement Comparable.
- For the in-memory storage engine, serdes may be optional. For on-disk
storage, serdes are required.
- LevelDB has no mechanism for expiry; RocksDB supports pluggable compaction
filters which allow expiry to be implemented; Guava collections have lots of
cache-replacement and expiry options. We should be able to give the user access
to whatever options the underlying storage engine provides.
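To illustrate the first two points, here is a minimal sketch using only the JDK collections. The class and method names (RangeQuerySketch, range, rangeByFullScan, treeMapAccepts) are made up for this example and are not part of Samza; the sketch only shows how a sorted map and a hash map differ in what they can offer.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical illustration; none of these names exist in Samza.
class RangeQuerySketch {
    // Range scan on a sorted store: entries with from <= key < to, in key order.
    static NavigableMap<String, Integer> range(NavigableMap<String, Integer> store,
                                               String from, String to) {
        return store.subMap(from, true, to, false);
    }

    // A hash-based store has no key order, so a "range" means checking every entry.
    static Map<String, Integer> rangeByFullScan(Map<String, Integer> store,
                                                String from, String to) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, Integer> e : store.entrySet()) {
            String k = e.getKey();
            if (k.compareTo(from) >= 0 && k.compareTo(to) < 0) {
                out.put(k, e.getValue());
            }
        }
        return out;
    }

    // Returns true if a TreeMap with natural ordering accepts this key.
    // Keys that do not implement Comparable are rejected with ClassCastException.
    static boolean treeMapAccepts(Object key) {
        try {
            new TreeMap<Object, String>().put(key, "v");
            return true;
        } catch (ClassCastException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        NavigableMap<String, Integer> sorted = new TreeMap<>();
        Map<String, Integer> hashed = new HashMap<>();
        for (String k : new String[] {"a", "b", "c", "d"}) {
            sorted.put(k, k.charAt(0) - 'a' + 1);
            hashed.put(k, k.charAt(0) - 'a' + 1);
        }
        System.out.println(range(sorted, "b", "d"));
        System.out.println(rangeByFullScan(hashed, "b", "d"));
        System.out.println(treeMapAccepts("a string key"));
        System.out.println(treeMapAccepts(new Object()));
    }
}
```

Both range calls return the same entries, but the hash-based one has to visit every key to do so, and the TreeMap variant only works for keys with an ordering. A single generic interface would have to either hide that cost or drop range queries altogether.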
I also think we should name the factories after the particular storage engine
being used (LevelDBStorageEngineFactory, RocksDBStorageEngineFactory,
HashMapStorageEngineFactory, etc.), not after their persistence characteristics
(PersistentKeyValueStorageEngineFactory, InMemoryKeyValueStorageEngineFactory),
because:
# Persistence-based names are misleading: an in-memory storage engine can still be durable if
changelog replication is enabled, and an on-disk storage engine can still lose
data if you don't have changelog replication enabled. The difference between
on-disk and in-memory storage determines whether you can store state larger
than memory, not whether the state is durable.
# Leaky abstraction: RocksDB has different features and different performance
characteristics from LevelDB, so I don't think it makes sense to abstract over
them.
# Explicit is better than implicit: users will need to know what storage engine
is being used, so the factory name shouldn't hide it from them.
For compatibility, making KeyValueStorageEngineFactory an alias for
LevelDBStorageEngineFactory sounds good to me.
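One possible way to do the alias is an empty subclass, so that existing job configs naming the old class keep resolving. The sketch below is only an illustration: StorageEngine and StorageEngineFactory here are simplified, hypothetical stand-ins, not Samza's real interfaces, and the single-argument getStorageEngine signature is an assumption made to keep the example short.

```java
// Hypothetical, simplified stand-ins; not Samza's actual interfaces.
interface StorageEngine {
    String describe();
}

interface StorageEngineFactory {
    StorageEngine getStorageEngine(String storeName);
}

// Factory named after the engine it creates.
class LevelDBStorageEngineFactory implements StorageEngineFactory {
    @Override
    public StorageEngine getStorageEngine(String storeName) {
        // A real factory would open a LevelDB instance; this only labels the store.
        return () -> "leveldb:" + storeName;
    }
}

// Compatibility alias: adds no behavior, only keeps the old class name loadable.
@Deprecated
class KeyValueStorageEngineFactory extends LevelDBStorageEngineFactory {
}
```

A config that still refers to KeyValueStorageEngineFactory would then get exactly the LevelDB behavior, while new configs can name the engine explicitly.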
> Provide in-memory data store implementation
> -------------------------------------------
>
> Key: SAMZA-256
> URL: https://issues.apache.org/jira/browse/SAMZA-256
> Project: Samza
> Issue Type: Improvement
> Components: kv
> Affects Versions: 0.6.0
> Reporter: Jakob Homan
> Assignee: Chinmay Soman
> Fix For: 0.8.0
>
>
> The sole current kv store, LevelDbKeyValueStore, works well when the amount
> of data to be stored is too large to keep entirely in memory.
> However, in cases where the state is small enough to comfortably fit in
> whatever memory is available, it would be better to provide an in-memory
> implementation. This can be backed by either a native Java class, or perhaps
> a Guava class, if that is found to scale better (or, of course, the backing
> implementation could be configurable).