Hi David,

I'm also wondering why you are storing the merged document in local LevelDB.
If you need to check for duplicates, how about using a Bloom filter for
that?
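
Something along these lines could work (just a rough sketch using Guava's
BloomFilter; the key format and the sizing numbers are placeholders you'd
tune for your own data):

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;

    public class DuplicateCheck {
      // Expected insertions and false-positive rate are made-up numbers.
      private final BloomFilter<CharSequence> seen =
          BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8),
                             10_000_000, 0.001);

      // Returns true the first time a key is seen (modulo false positives).
      public boolean firstTime(String key) {
        if (seen.mightContain(key)) {
          return false;  // probably a duplicate
        }
        seen.put(key);
        return true;
      }
    }

Of course, a Bloom filter only tells you whether you have probably seen a
key before; it won't give you back the merged document, so it only helps if
duplicate detection is all you need the store for.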

Thanks
Milinda

On Wed, Nov 19, 2014 at 8:50 AM, Martin Kleppmann <[email protected]>
wrote:

> Hi David,
>
> On 19 Nov 2014, at 04:52, David Pick <[email protected]> wrote:
> > First off, if this is the wrong place to ask these kinds of questions
> > please let me know. I tried in IRC but didn't get an answer within a few
> > hours so I'm trying here.
>
> It's the right place, and they're good questions :)
>
> > I had a couple of questions around implementing a table-to-table join
> > with data coming from a database changelog through Kafka.
>
> Cool! Out of curiosity, can I ask what database you're using and how
> you're getting the changelog?
>
> > Let's say I've got two tables, users and posts, in my primary db where the
> > posts table has a user_id column. I've written a Samza job that joins those
> > two tables together by storing every user record and the merged document in
> > LevelDB and then outputting the resulting document to the changelog Kafka
> > topic.
>
> Not sure exactly what you mean with a merged document here. Is it a list
> of all the posts by a particular user? Or post records augmented with user
> information? Or something else?
>
> > Is this the right way to implement that kind of job?
>
> On a high level, this sounds reasonable. One detail question is whether
> you're using the changelog feature for the LevelDB store. If the contents
> of the store can be rebuilt by reprocessing the input stream, you could get
> away with turning off the changelog on the store, and making the input
> stream a bootstrap stream instead. That will save some overhead on writes,
> but still give you fault tolerance.
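>
> For example, the relevant config might look roughly like this (assuming
> your Kafka system is called "kafka" and the users stream is the one you
> bootstrap; I'm writing the key names from memory, so please check them
> against the configuration docs):
>
>     # replay the users stream from the beginning before processing new input
>     systems.kafka.streams.users.samza.bootstrap=true
>     systems.kafka.streams.users.samza.offset.default=oldest
>     # and simply leave out stores.<store-name>.changelog to disable the
>     # store's changelog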
>
> Another question is whether the values you're storing in LevelDB are
> several merged records together, or whether you store each record
> individually and use a range query to retrieve them if necessary. You could
> benchmark which works best for you.
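>
> To illustrate the second option, a per-record layout might look something
> like this (only a sketch: it assumes string keys with an order-preserving
> serde, and the class and store names are made up):
>
>     import org.apache.samza.storage.kv.Entry;
>     import org.apache.samza.storage.kv.KeyValueIterator;
>     import org.apache.samza.storage.kv.KeyValueStore;
>
>     public class PostsByUser {
>       // Store each post under "userId:postId" so that all posts for a
>       // given user are adjacent in the key space.
>       public void putPost(KeyValueStore<String, String> store, String userId,
>                           String postId, String postJson) {
>         store.put(userId + ":" + postId, postJson);
>       }
>
>       // ';' sorts immediately after ':', so this range covers exactly the
>       // keys that start with "userId:".
>       public void forEachPost(KeyValueStore<String, String> store, String userId) {
>         KeyValueIterator<String, String> it =
>             store.range(userId + ":", userId + ";");
>         try {
>           while (it.hasNext()) {
>             Entry<String, String> post = it.next();
>             // merge post.getValue() with the user record here
>           }
>         } finally {
>           it.close();  // always release the iterator
>         }
>       }
>     }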
>
> > It seems that even
> > with a decent partitioning scheme the LevelDB instance in each task will
> > get quite large, especially if we're joining several tables together that
> > have millions of rows (our real-world use case would be 7 or 8 tables each
> > with many millions of records).
>
> The goal of Samza's local key-value stores is to support a few gigabytes
> per partition. So millions of rows distributed across (say) tens of
> partitions should be fine, assuming the rows aren't huge.  You might want
> to try the new RocksDB support, which may perform better than LevelDB.
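>
> If I remember the config correctly, switching a store over to RocksDB is
> just a matter of changing its factory class, something like:
>
>     stores.my-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory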
>
> > Also, given a task that's processing that much data, where do you
> > recommend running Samza? Should we spin up another set of boxes, or is it
> > OK to run it on the Kafka brokers (I heard it mentioned that this is how
> > LinkedIn is running Samza)?
>
> IIRC LinkedIn saw some issues with the page cache when Kafka and LevelDB
> were on the same machine, which was resolved by putting them on different
> machines. But I'll let one of the LinkedIn folks comment on that.
>
> Best,
> Martin
>
>


-- 
Milinda Pathirage

PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org
