[ https://issues.apache.org/jira/browse/KAFKA-3705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827015#comment-15827015 ]
Paul Vaughan commented on KAFKA-3705:
-------------------------------------

Healthcare data is intrinsically relational, but the bulk of the data is related to patients in the sense that if the patient did not exist, the data would not exist. For example, these FHIR Resources all depend on a Patient: Encounter, MedicationOrder, Observation, Condition, etc., while Practitioner, Organization, etc. exist independently of patients.

This suggests that a good partitioning strategy for processing this data is to partition by patient, which gives good concurrency with minimal repartitioning/distribution. Data that is independent of the patient would likely need to be distributed to all nodes, but that data is relatively small.

Processing this data includes checking referential integrity, dealing with out-of-order data (Encounters that are received before the Patient is received), and re-keying. For example, when an Encounter arrives, a downstream version of that Encounter needs to be created with the patient's downstream key. Similarly, when a new Patient arrives it should be given a downstream key, and any Encounters that reference this patient need to be updated and sent downstream.

But the data is also intrinsically keyed by something other than the patient. For example, an Encounter has its own key, and it is possible to receive duplicate copies of an Encounter, either as corrections or as simple duplicates. It is therefore desirable to use the intrinsic key with compacted topics, rather than using the patient as the Kafka topic key. While it is possible to key by one thing and partition by another using explicit partitioners, that seems both error prone and insufficient to keep the data only where it needs to be.

Specifically, the high-level Streams DSL does not seem to support the latter point. Without the foreign key support discussed here, it is necessary to do aggregation and remapping that cause implicit repartitioning; the DSL seems determined to move the data around. It is not clear to me whether this KIP would eliminate that problem. Note that I found the documentation frustrating in that this repartitioning is not apparent from it; it was most apparent from the set of topics that get implicitly created.

I would like to see the ability to transform a set of related incoming topics into a set of downstream topics, including re-keying and sometimes renormalization, using the high-level Streams DSL. This seems like a start towards that, but is it sufficient?


> Support non-key joining in KTable
> ---------------------------------
>
> Key: KAFKA-3705
> URL: https://issues.apache.org/jira/browse/KAFKA-3705
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Reporter: Guozhang Wang
> Labels: api
>
> Today in Kafka Streams DSL, KTable joins are only based on keys. If users
> want to join a KTable A by key {{a}} with another KTable B by key {{b}} but
> with a "foreign key" {{a}}, and assuming they are read from two topics which
> are partitioned on {{a}} and {{b}} respectively, they need to do the
> following pattern:
> {code}
> tableB' = tableB.groupBy(/* select on field "a" */).agg(...); // now tableB' is partitioned on "a"
> tableA.join(tableB', joiner);
> {code}
> Even if these two tables are read from two topics which are already
> partitioned on {{a}}, users still need to do the pre-aggregation in order to
> make the two joining streams be on the same key. This is a drawback for
> programmability and we should fix it.
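For reference, here is a minimal sketch of the pre-aggregation workaround named in the issue description, written against a recent Kafka Streams API (StreamsBuilder). The topic names ("table-a", "table-b", "joined-output"), the String value layout "<foreignKeyA>:<payload>", the class name NonKeyJoinWorkaround, and the keep-latest reduce are illustrative assumptions, not anything specified in this ticket.

{code:java}
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;

public class NonKeyJoinWorkaround {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Hypothetical topics: "table-a" is keyed by "a"; "table-b" is keyed by "b"
        // and its String value is assumed to start with the foreign key "a",
        // e.g. "<a>:<payload>". Default String serdes are assumed to be configured.
        KTable<String, String> tableA = builder.table("table-a");
        KTable<String, String> tableB = builder.table("table-b");

        // Pre-aggregation: re-key tableB on the foreign key "a". This step is what
        // forces the implicit repartition topic mentioned in the comment above.
        KTable<String, String> tableBByA = tableB
                .groupBy((bKey, bValue) ->
                        KeyValue.pair(bValue.split(":", 2)[0], bValue)) // select field "a"
                .reduce(
                        (aggValue, newValue) -> newValue,   // adder: keep the latest B record per "a"
                        (aggValue, oldValue) -> aggValue);  // subtractor: ignore retractions (simplification)

        // Both tables are now keyed (and partitioned) on "a", so a key-based join works.
        KTable<String, String> joined =
                tableA.join(tableBByA, (aValue, bValue) -> aValue + "|" + bValue);

        joined.toStream().to("joined-output");
        // builder.build() would then be passed to a KafkaStreams instance and started.
    }
}
{code}

Because several B records can map to the same "a", a real pipeline would aggregate them into a collection rather than keeping only the latest; first-class foreign-key join support, as requested in this issue, would make the pre-aggregation and its extra repartition topic unnecessary.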
-- This message was sent by Atlassian JIRA (v6.3.4#6332)