Agreed, it does seem that explicit factory names are better. I'll use that 
approach.

Thanks for all the comments!

C
________________________________________
From: Martin Kleppmann (JIRA) [[email protected]]
Sent: Tuesday, July 01, 2014 2:42 AM
To: [email protected]
Subject: [jira] [Commented] (SAMZA-256) Provide in-memory data store 
implementation

    [ 
https://issues.apache.org/jira/browse/SAMZA-256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048696#comment-14048696
 ]

Martin Kleppmann commented on SAMZA-256:
----------------------------------------

I would prefer approach (1), a separate factory for each type of storage 
engine. I fear that a generic key-value interface that abstracts across 
multiple storage engines would be a leaky abstraction; users would still have 
to think about which storage engine is being used under the hood. Some of the 
subtle differences that may arise:

- LevelDB and RocksDB use a sorted log-structured representation which allows 
efficient range queries, but a HashMap would not allow range queries.
- Perhaps the in-memory store should use a TreeMap instead, but then it's 
limited to keys that implement Comparable.
- For the in-memory storage engine, serdes may be optional. For on-disk 
storage, serdes are required.
- LevelDB has no mechanism for expiry; RocksDB supports pluggable compaction 
filters which allow expiry to be implemented; Guava collections have lots of 
cache-replacement and expiry options. We should be able to give the user access 
to whatever options the underlying storage engine provides.
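The range-query and Comparable points above can be illustrated with plain Java collections. This is only a minimal sketch: the class and method names (RangeQuerySketch, rangeSorted, rangeHashed, treeMapRejectsPlainObjectKey) are made up for illustration and are not part of Samza, and the maps stand in for real storage engines.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class RangeQuerySketch {

    // Sorted store: keys in [from, to) are located via the tree order,
    // without visiting entries outside the range.
    static List<String> rangeSorted(TreeMap<String, String> store, String from, String to) {
        SortedMap<String, String> view = store.subMap(from, to); // from inclusive, to exclusive
        return new ArrayList<>(view.keySet());
    }

    // Hash store: there is no key order, so a range query has to check every key.
    static List<String> rangeHashed(HashMap<String, String> store, String from, String to) {
        List<String> keys = new ArrayList<>();
        for (String key : store.keySet()) {
            if (key.compareTo(from) >= 0 && key.compareTo(to) < 0) {
                keys.add(key);
            }
        }
        Collections.sort(keys); // ordering has to be imposed after the fact
        return keys;
    }

    // A TreeMap built without a Comparator rejects keys that do not implement Comparable.
    static boolean treeMapRejectsPlainObjectKey() {
        try {
            new TreeMap<Object, String>().put(new Object(), "value");
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        TreeMap<String, String> sorted = new TreeMap<>();
        HashMap<String, String> hashed = new HashMap<>();
        for (String k : new String[] {"a", "b", "c", "d"}) {
            sorted.put(k, "v-" + k);
            hashed.put(k, "v-" + k);
        }
        System.out.println(rangeSorted(sorted, "b", "d"));
        System.out.println(rangeHashed(hashed, "b", "d"));
        System.out.println(treeMapRejectsPlainObjectKey());
    }
}
```

Both stores can answer the same query, but only the sorted one does so without a full scan, which is the kind of difference a generic key-value interface would hide from the user.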

I also think we should name the factories after the particular storage engine 
being used (LevelDBStorageEngineFactory, RocksDBStorageEngineFactory, 
HashMapStorageEngineFactory, etc.), not after their persistence characteristics 
(PersistentKeyValueStorageEngineFactory, InMemoryKeyValueStorageEngineFactory), 
because:

# It's misleading: an in-memory storage engine can still be durable if 
changelog replication is enabled, and an on-disk storage engine can still lose 
data if you don't have changelog replication enabled. The difference between 
on-disk and in-memory storage determines whether you can store state larger 
than memory, not whether the state is durable.
# Leaky abstraction: RocksDB has different features and different performance 
characteristics from LevelDB, so I don't think it makes sense to abstract over 
them.
# Explicit is better than implicit: users will need to know what storage engine 
is being used, so the factory name shouldn't hide it from them.

For compatibility, making KeyValueStorageEngineFactory an alias for 
LevelDBStorageEngineFactory sounds good to me.
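
One way such an alias could look is a subclass that adds nothing, so that existing configs naming the old factory keep working. This is a rough sketch under assumptions: the StorageEngineFactory interface here is a simplified stand-in with a single hypothetical engineName() method, not Samza's actual interface, and the wrapper class FactoryAliasSketch exists only to keep the example in one file.

```java
public class FactoryAliasSketch {

    // Simplified stand-in for a storage engine factory interface (hypothetical).
    interface StorageEngineFactory {
        String engineName();
    }

    // Factory named after the engine it creates, as proposed above.
    static class LevelDBStorageEngineFactory implements StorageEngineFactory {
        @Override
        public String engineName() {
            return "leveldb";
        }
    }

    // Compatibility alias: the old generic name inherits everything unchanged.
    static class KeyValueStorageEngineFactory extends LevelDBStorageEngineFactory {
    }

    public static void main(String[] args) {
        StorageEngineFactory legacy = new KeyValueStorageEngineFactory();
        System.out.println(legacy.engineName());
    }
}
```

The old name keeps resolving to the LevelDB behaviour, while new configs can name the engine explicitly.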

> Provide in-memory data store implementation
> -------------------------------------------
>
>                 Key: SAMZA-256
>                 URL: https://issues.apache.org/jira/browse/SAMZA-256
>             Project: Samza
>          Issue Type: Improvement
>          Components: kv
>    Affects Versions: 0.6.0
>            Reporter: Jakob Homan
>            Assignee: Chinmay Soman
>             Fix For: 0.8.0
>
>
> The sole current kv store, LevelDbKeyValueStore, works well when the amount 
> of data to be stored is too large to keep all of it in memory.  
> However, in cases where the state is small enough to comfortably fit in 
> whatever memory is available, it would be better to provide an in-memory 
> implementation.  This can be backed by either a native Java class, or perhaps 
> a Guava class, if that is found to scale better (or, of course, the backing 
> implementation could be configurable).



--
This message was sent by Atlassian JIRA
(v6.2#6252)
