On 19 Nov 2014, at 20:21, David Pick <[email protected]> wrote:
> Right now it's hundreds of gig. Like I mentioned before we'd love to
> partition our data better, but because we don't have a common join key on
> every table we haven't really come up with a good scheme for doing this.

Your many-to-many join example is interesting. I think you could probably do it 
as a multi-stage pipeline. Say you want to generate a list of all the 
transactions for a particular customer. Your input streams are:

* Customers, partitioned by customer_id
* Customer_Transactions, partitioned by customer_id
* Transactions, partitioned by transaction_id

The data flow is:

1. Join Customers and Customer_Transactions on customer_id, and emit messages 
of the form (Customer, transaction_id), partitioned by transaction_id.
2. Join output of step 1 with Transactions on transaction_id, and emit messages 
of the form (Customer, Transaction), partitioned by customer_id.
3. Consume output of step 2, and group them by customer_id to get all the 
transactions for one customer in one place.

That may seem a bit convoluted, but it might actually perform quite well. At 
least it will allow you to break the data into many partitions. (I'm hoping we 
can build higher-level tools in future which will abstract away this 
complexity.)
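To make the data flow concrete, here's a rough in-memory sketch of the three stages in Python. It's only an illustration of the join/repartition logic, not real Samza code; the table contents and field names are made up, and each dict/list stands in for a partitioned Kafka topic:

```python
from collections import defaultdict

# Illustrative input "topics" (names and values are hypothetical)
customers = {"c1": "Alice", "c2": "Bob"}                    # keyed by customer_id
customer_transactions = [("c1", "t1"), ("c1", "t2"),       # (customer_id, transaction_id)
                         ("c2", "t3")]
transactions = {"t1": 10.0, "t2": 20.0, "t3": 30.0}         # keyed by transaction_id

# Stage 1: join Customers and Customer_Transactions on customer_id,
# emit (customer, transaction_id) messages re-keyed by transaction_id.
stage1 = {t_id: (c_id, customers[c_id])
          for c_id, t_id in customer_transactions}

# Stage 2: join stage-1 output with Transactions on transaction_id,
# emit (customer, transaction) messages re-keyed by customer_id.
stage2 = [(c_id, name, t_id, transactions[t_id])
          for t_id, (c_id, name) in stage1.items()]

# Stage 3: group by customer_id, so all of one customer's
# transactions land in one place.
by_customer = defaultdict(list)
for c_id, name, t_id, amount in stage2:
    by_customer[c_id].append((t_id, amount))
```

In an actual deployment each stage would be a Samza job, and the "re-keying" would happen by emitting to an intermediate Kafka topic partitioned on the new key.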

> This is great, though I think ideally what we would want is to have a kafka
> topic per database table, and a Samza task that could consume all of one
> topic for a table that's fairly small (e.g. customers) and a single
> partition of a big table (e.g. posts).

This may be even closer to what you need: 
https://issues.apache.org/jira/browse/SAMZA-402

Best,
Martin
