On 19 Nov 2014, at 20:21, David Pick <[email protected]> wrote:

> Right now it's hundreds of gig. Like I mentioned before we'd love to
> partition our data better, but because we don't have a common join key on
> every table we haven't really come up with a good scheme for doing this.
Your many-to-many join example is interesting. I think you could probably do it as a multi-stage pipeline. Say you want to generate a list of all the transactions for a particular customer. Your input streams are:

* Customers, partitioned by customer_id
* Customer_Transactions, partitioned by customer_id
* Transactions, partitioned by transaction_id

The data flow is:

1. Join Customers and Customer_Transactions on customer_id, and emit
   messages of the form (Customer, transaction_id), partitioned by
   transaction_id.
2. Join output of step 1 with Transactions on transaction_id, and emit
   messages of the form (Customer, Transaction), partitioned by
   customer_id.
3. Consume output of step 2, and group them by customer_id to get all
   the transactions for one customer in one place.

That may seem a bit convoluted, but it might actually perform quite well. At least it will allow you to break the data into many partitions. (I'm hoping we can build higher-level tools in future which will abstract away this complexity.)

> This is great, though I think ideally what we would want is to have a kafka
> topic per database table, and a Samza task that could consume all of one
> topic for a table that's fairly small (e.g. customers) and a single
> partition of a big table (e.g. posts).

This may be even closer to what you need:
https://issues.apache.org/jira/browse/SAMZA-402

Best,
Martin
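To make the three-step data flow concrete, here is a minimal single-process sketch in Python. It is an assumption-laden illustration, not a Samza job: the sample customers and transactions are made up, and each "repartitioned topic" is simulated with an in-memory dict keyed by the partition key, where a real deployment would buffer one side of each join in local state and write intermediate Kafka topics.

```python
from collections import defaultdict

# Hypothetical sample data standing in for the three input streams.
customers = {1: "Alice", 2: "Bob"}                         # customer_id -> Customer
customer_transactions = [(1, "t1"), (1, "t2"), (2, "t3")]  # (customer_id, transaction_id)
transactions = {"t1": 10.0, "t2": 25.0, "t3": 7.5}         # transaction_id -> Transaction

# Step 1: join Customers with Customer_Transactions on customer_id, and
# emit (Customer, transaction_id) messages partitioned by transaction_id.
step1 = defaultdict(list)
for customer_id, transaction_id in customer_transactions:
    step1[transaction_id].append((customer_id, customers[customer_id]))

# Step 2: join the output of step 1 with Transactions on transaction_id,
# and emit (Customer, Transaction) messages partitioned by customer_id.
step2 = defaultdict(list)
for transaction_id, joined in step1.items():
    for customer_id, customer in joined:
        step2[customer_id].append((customer, transactions[transaction_id]))

# Step 3: group by customer_id; all of one customer's transactions
# now live in a single partition.
for customer_id, txns in sorted(step2.items()):
    print(customer_id, txns)
```

The key point the sketch shows is that each stage only ever needs the data co-located under its own partition key, which is why the pipeline lets you split the tables across many partitions even without one common join key.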
