Hi David,

On 19 Nov 2014, at 04:52, David Pick <[email protected]> wrote:
> First off, if this is the wrong place to ask these kinds of questions
> please let me know. I tried in IRC but didn't get an answer within a few
> hours so I'm trying here.

It's the right place, and they're good questions :)

> I had a couple of questions around implementing a table to table join with
> data coming from a database changelog through Kafka.

Cool! Out of curiosity, can I ask what database you're using and how you're
getting the changelog?

> Let's say I've got two tables users and posts in my primary db where the
> posts table has a user_id column. I've written a Samza job that joins those
> two tables together by storing every user record and the merged document in
> Leveldb and then outputting the resulting document to the changelog Kafka
> topic.

Not sure exactly what you mean by a merged document here. Is it a list of
all the posts by a particular user? Or post records augmented with user
information? Or something else?

> Is this the right way to implement that kind of job?

On a high level, this sounds reasonable. One detail question is whether
you're using the changelog feature for the LevelDB store. If the contents
of the store can be rebuilt by reprocessing the input stream, you could get
away with turning off the changelog on the store and making the input
stream a bootstrap stream instead (rough config sketch at the bottom of
this mail). That will save some overhead on writes, but still give you
fault tolerance.

Another question is whether the values you're storing in LevelDB are
several merged records together, or whether you store each record
individually and use a range query to retrieve them if necessary. You could
benchmark which works best for you.
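To make that a bit more concrete, here's a rough sketch of how I'd imagine
such a task looking, storing each record individually and using a range
query to find a user's posts. It's untested, all the stream, store and
field names are made up, it ignores deletes, and I'm assuming a string
serde for keys (user IDs not containing ':') and something like a JSON map
serde for values, so adjust it to whatever your changelog format actually
looks like:

import java.util.HashMap;
import java.util.Map;

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.Entry;
import org.apache.samza.storage.kv.KeyValueIterator;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class UserPostJoinTask implements StreamTask, InitableTask {
  private static final SystemStream OUTPUT = new SystemStream("kafka", "enriched-posts");

  // One store for both record types, distinguished by a key prefix.
  private KeyValueStore<String, Map<String, Object>> store;

  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    store = (KeyValueStore<String, Map<String, Object>>) context.getStore("join-store");
  }

  @SuppressWarnings("unchecked")
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                      TaskCoordinator coordinator) {
    String stream = envelope.getSystemStreamPartition().getStream();
    Map<String, Object> record = (Map<String, Object>) envelope.getMessage();

    if ("users".equals(stream)) {
      String userId = record.get("id").toString();
      store.put("user:" + userId, record);

      // A changed user record affects all of that user's posts, so re-emit them.
      // ';' is the next character after ':', so this range covers the prefix.
      KeyValueIterator<String, Map<String, Object>> posts =
          store.range("post:" + userId + ":", "post:" + userId + ";");
      try {
        while (posts.hasNext()) {
          Entry<String, Map<String, Object>> entry = posts.next();
          emitJoined(collector, entry.getValue(), record);
        }
      } finally {
        posts.close();
      }
    } else if ("posts".equals(stream)) {
      String userId = record.get("user_id").toString();
      String postId = record.get("id").toString();
      store.put("post:" + userId + ":" + postId, record);
      emitJoined(collector, record, store.get("user:" + userId));
    }
  }

  private void emitJoined(MessageCollector collector, Map<String, Object> post,
                          Map<String, Object> user) {
    if (user == null) return; // haven't seen this user yet
    Map<String, Object> joined = new HashMap<String, Object>(post);
    joined.put("user", user);
    collector.send(new OutgoingMessageEnvelope(OUTPUT, joined.get("id"), joined));
  }
}

In this layout a changed user record re-emits all of that user's posts,
which is where the range query earns its keep. If you only ever need to
enrich posts as they arrive, storing just the user records would be enough.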
> It seems that even
> with a decent partitioning scheme the leveldb instance in each task will
> get quite large, especially if we're joining several tables together that
> have millions of rows (our real world use case would be 7 or 8 tables each
> with many millions of records).

The goal of Samza's local key-value stores is to support a few gigabytes
per partition. So millions of rows distributed across (say) tens of
partitions should be fine, assuming the rows aren't huge. You might want to
try the new RocksDB support, which may perform better than LevelDB.

> Also, given a task that's processing that much data, where do you
> recommend running Samza? Should we spin up another set of boxes or is it
> ok to run it on the Kafka brokers (I heard it mentioned that this is how
> LinkedIn is running Samza)?

IIRC LinkedIn saw some issues with the page cache when Kafka and LevelDB
were on the same machine, which was resolved by putting them on different
machines. But I'll let one of the LinkedIn folks comment on that.

Best,
Martin
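PS. If you do go down the bootstrap route, the config I had in mind is
roughly the following (again, the stream names are made up):

# read the inputs from the beginning, and catch up fully on startup
systems.kafka.streams.users.samza.bootstrap=true
systems.kafka.streams.users.samza.offset.default=oldest
systems.kafka.streams.posts.samza.bootstrap=true
systems.kafka.streams.posts.samza.offset.default=oldest

Turning off the changelog then just means leaving out the
stores.*.changelog setting for your store.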
