[ 
https://issues.apache.org/jira/browse/KAFKA-3705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827015#comment-15827015
 ] 

Paul Vaughan commented on KAFKA-3705:
-------------------------------------

Healthcare data is intrinsically relational, but the bulk of the data is 
related to patients in the sense that if the patient did not exist, the data 
would not exist. For example, these FHIR Resources all depend on a Patient: 
Encounter, MedicationOrder, Observation, Condition, etc., while Practitioner, 
Organization, etc. exist independently of patients. This suggests that a good 
partitioning strategy for processing this data is to partition by patient, 
which gives good concurrency with minimal repartitioning/distribution. Data 
that is independent of the patient would likely need to be distributed to all 
nodes, but that data is relatively small. 

Processing this data includes checking for referential integrity, dealing with 
out-of-order data (Encounters that arrive before their Patient), and 
re-keying. For example, when an Encounter arrives, a downstream version of 
that Encounter needs to be created with the patient's downstream key. 
Similarly, when a new Patient arrives, it should be given a downstream key, 
and any Encounters that reference that patient need to be updated and sent 
downstream.
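
To make the out-of-order handling concrete, here is a minimal plain-Java sketch 
of the buffering/re-keying logic described above (deliberately not tied to any 
particular Streams API version). The record shapes, field names (encounterId, 
patientId) and the notion of a "downstream key" are placeholders for whatever 
the real pipeline uses.

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PatientRekeyer {

    // Placeholder record types; a real pipeline would use FHIR resource models.
    record Encounter(String encounterId, String patientId) { }
    record Rekeyed(String downstreamKey, Encounter encounter) { }

    // patientId -> downstream key assigned when the Patient arrived
    private final Map<String, String> downstreamKeys = new HashMap<>();
    // Encounters that arrived before their Patient, buffered per patientId
    private final Map<String, List<Encounter>> pending = new HashMap<>();

    // An Encounter arrives: emit it under the patient's downstream key if the
    // Patient is already known, otherwise buffer it.
    public List<Rekeyed> onEncounter(Encounter e) {
        String key = downstreamKeys.get(e.patientId());
        if (key != null) {
            return List.of(new Rekeyed(key, e));
        }
        pending.computeIfAbsent(e.patientId(), p -> new ArrayList<>()).add(e);
        return List.of();
    }

    // A Patient arrives: record its downstream key and flush any buffered Encounters.
    public List<Rekeyed> onPatient(String patientId, String downstreamKey) {
        downstreamKeys.put(patientId, downstreamKey);
        List<Rekeyed> flushed = new ArrayList<>();
        for (Encounter e : pending.getOrDefault(patientId, List.of())) {
            flushed.add(new Rekeyed(downstreamKey, e));
        }
        pending.remove(patientId);
        return flushed;
    }
}
{code}

In an actual topology the two maps would have to be fault-tolerant state stores, 
partitioned by patient, so that the buffering survives restarts and rebalances, 
which is exactly why the patient-based partitioning above matters.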

But the data is also intrinsically keyed by something other than the patient. 
For example, an Encounter has a key and it is possible to get duplicate copies 
of an Encounter, either as corrections or simple duplicates. Thus it is 
desirable to use the intrinsic key with compacted topics, rather than using the 
patient as the Kafka topic key. While it is possible to key by one thing and 
partition by another using explicit partitioners, that seems both error-prone 
and insufficient to keep the data only where it needs to be. 
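
For reference, the "key by one thing, partition by another" approach would look 
roughly like the sketch below, using the producer's pluggable Partitioner 
interface. The HasPatient accessor is an assumption about the value types; 
nothing like it exists in Kafka itself.

{code}
import java.nio.charset.StandardCharsets;
import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Records stay keyed by their intrinsic key (e.g. the Encounter id, so compaction
// still collapses duplicates and corrections), but are placed on partitions by the
// patient id carried inside the value.
public class PatientPartitioner implements Partitioner {

    // Hypothetical accessor implemented by every value type that references a patient.
    public interface HasPatient { String patientId(); }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        byte[] patientId = ((HasPatient) value).patientId().getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(patientId)) % numPartitions;
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}
{code}

The partitioner is wired in via the producer's partitioner.class setting, which 
is part of the problem: it is an out-of-band convention that downstream 
consumers and the Streams DSL know nothing about.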

Specifically, the high-level Streams DSL does not seem to support the latter 
point. Without the foreign-key support discussed here, it is necessary to do 
aggregation and remapping that cause implicit repartitioning; the DSL seems 
determined to move the data around. It is not clear to me whether this KIP 
would eliminate that problem. I also found the documentation frustrating on 
this point: the repartitioning is not called out there and only becomes 
apparent when you look at the set of topics that get implicitly created. 
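
For illustration, the DSL workaround I am referring to looks roughly like the 
toy example below (String values only, with the patient id faked by splitting 
the value; names and current-API classes are illustrative). The groupBy changes 
the key, so Streams has to create an internal repartition topic plus a changelog 
topic for the aggregation store, and printing the topology is the easiest way to 
see them.

{code}
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

public class RepartitionExample {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Toy model: an encounter value is "patientId|details", keyed by encounter id.
        KTable<String, String> encounters = builder.table("encounters");
        KTable<String, String> patients = builder.table("patients");

        // Re-keying the encounters by patient requires groupBy + aggregate, which is
        // what silently creates the internal repartition and changelog topics.
        KTable<String, String> encountersByPatient = encounters
                .groupBy((encounterId, value) -> KeyValue.pair(value.split("\\|")[0], encounterId))
                .aggregate(
                        () -> "",
                        (patientId, encounterId, ids) -> ids + encounterId + ",",            // adder
                        (patientId, encounterId, ids) -> ids.replace(encounterId + ",", ""), // subtractor
                        Materialized.as("encounters-by-patient"));

        // Only now are both tables on the same key, so the KTable-KTable join is possible.
        KTable<String, String> joined =
                encountersByPatient.join(patients, (encounterIds, patient) -> patient + " -> " + encounterIds);
        joined.toStream().to("patients-with-encounters");

        // The describe() output lists the implicitly created repartition/changelog topics.
        Topology topology = builder.build();
        System.out.println(topology.describe());
    }
}
{code}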

I would like to see the ability to transform a set of related incoming topics 
into a set of downstream topics, including re-keying and sometimes 
renormalization, using the high-level Streams DSL. This KIP seems like a start 
towards that, but is it sufficient?


> Support non-key joining in KTable
> ---------------------------------
>
>                 Key: KAFKA-3705
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3705
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: Guozhang Wang
>              Labels: api
>
> Today in Kafka Streams DSL, KTable joins are only based on keys. If users 
> want to join a KTable A by key {{a}} with another KTable B by key {{b}} but 
> with a "foreign key" {{a}}, and assuming they are read from two topics which 
> are partitioned on {{a}} and {{b}} respectively, they need to do the 
> following pattern:
> {code}
> tableB' = tableB.groupBy(/* select on field "a" */).agg(...); // now tableB' is partitioned on "a"
> tableA.join(tableB', joiner);
> {code}
> Even if these two tables are read from two topics which are already 
> partitioned on {{a}}, users still need to do the pre-aggregation in order to 
> make the two joining streams be on the same key. This is a drawback for 
> programmability and we should fix it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
