Thanks for writing up this proposal, Wei. This will go a long way in satisfying a number of Samza use-cases. I'm +1 to this idea.

>> Section on proposed changes: Provide hooks to transform an incoming message to desired types (this is useful to store a subset of the incoming message).

1. I believe you mean storing a "projection" of the incoming message? It might be clearer to re-word it, IMHO.

2. Does the Adjunct Data store support change-logging? If it does not, it might be worth calling that out.

>> Section on Consistency

3. IMO, it would be useful to add a discussion of what can cause an inconsistency and how we determine what a consistent snapshot is (for both bounded and unbounded datasets).

>> Section on Bootstrapping: A bootstrap is forced if the store is not available, or not valid (too old)

4. How do we determine whether a store is invalid or too old? One way would be to store the most recent offsets somewhere and compare them on startup.

>> We may provide a default serde for POJO.

5. +1 for adding default serdes. Using Java serialization is probably the simplest (mandating that keys and values contain only serializable fields).

6. This will perhaps become clearer as we get to implementing it. Currently, there are 3 storage managers in the proposal: "TaskStorageManager", "ContainerStorageManager" and "AdjunctDataStorageManager" (different from AdjunctDataStoreManager). I'm not entirely sure we need all 3. Maybe we do.

>> After bootstrap, for change capture data, we keep updating its AD store when new updates arrives

7. If used with a streaming source like Kafka, wouldn't the storage size grow without bound? Do we need to handle garbage collection of really stale data? What do you think about adding a section on how GC works (for both bounded and unbounded sources)?

>> Configuration: stores.adstore.manager.factory

8. If the user implements their own AdjunctDataStoreManagerFactory, what is the lifecycle of the returned `AdjunctDataStoreManager`? AFAICS, there is no easy way for an implementor to obtain an instance of a K-V store inside the AdjunctDataStoreManagerFactory interface.
Should the API take in a Map<String, KVStore> stores instead of a Map<String, StorageEngine>?

Best,
Jagadish

On Tue, May 16, 2017 at 8:56 AM, Wei Song <ws...@linkedin.com.invalid> wrote:

> Hey everyone,
>
> I created a proposal for SAMZA-1278
> <https://issues.apache.org/jira/browse/SAMZA-1278>, Adjunct Data Store for
> Unbounded DataSets, which introduces an automatic mechanism to store
> adjunct data for stream tasks.
>
> https://cwiki.apache.org/confluence/display/SAMZA/Adjunct+Data+Store+for+Unbounded+DataSets
>
> Please review and comments are welcome!
>
> For those who are not actively following the master branch, you may have
> more questions than others. Feel free to ask them here.
>
> P.S. this is the 3rd try, sent this last week, but apparently no one at
> Linkedin has received, including samza-dev here just to be sure.
>
> --
> Thanks,
> -Wei
>

--
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University