[
https://issues.apache.org/jira/browse/KAFKA-3705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827015#comment-15827015
]
Paul Vaughan commented on KAFKA-3705:
-------------------------------------
Healthcare data is intrinsically relational, but the bulk of the data is
related to patients in the sense that if the patient did not exist, the data
would not exist. For example, these FHIR Resources all depend on a Patient:
Encounter, MedicationOrder, Observation, Condition, etc., while Practitioner,
Organization, etc. exist independently of patients. This suggests that a good
partitioning strategy for processing this data would be to partition by patient
in order to get good concurrency with minimal repartitioning/distribution. Data
that is independent of the patient would likely need to be distributed to all
nodes, but that data is relatively small.
Processing this data includes checking for referential integrity, dealing with
out-of-order data (Encounters that are received before the Patient is
received), and re-keying. For example, when an Encounter arrives, a downstream
version of that Encounter needs to be created with the patient’s downstream
key. Similarly, when a new Patient arrives it should be given a downstream key
and any Encounters that reference this patient need to be updated and sent
downstream.
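The buffering and re-keying described above can be sketched in plain Java. Everything here is hypothetical illustration (the class and method names, the string-valued keys, the in-memory maps); a real implementation would live in a Kafka Streams processor backed by a state store, not in heap maps.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: Encounters that arrive before their Patient are parked
// until the Patient shows up, then re-keyed with the patient's downstream key.
public class OutOfOrderBuffer {
    final Map<String, String> downstreamKeyByPatient = new HashMap<>();
    final Map<String, List<String>> pendingEncounters = new HashMap<>();
    final List<String> downstream = new ArrayList<>();

    void onEncounter(String encounterId, String patientRef) {
        String dsKey = downstreamKeyByPatient.get(patientRef);
        if (dsKey == null) {
            // Patient not seen yet: park the Encounter until it arrives.
            pendingEncounters
                .computeIfAbsent(patientRef, k -> new ArrayList<>())
                .add(encounterId);
        } else {
            // Patient already known: emit the re-keyed Encounter immediately.
            downstream.add(dsKey + ":" + encounterId);
        }
    }

    void onPatient(String patientRef, String downstreamKey) {
        downstreamKeyByPatient.put(patientRef, downstreamKey);
        // Release any Encounters that referenced this patient before it arrived.
        for (String enc : pendingEncounters.getOrDefault(patientRef, List.of())) {
            downstream.add(downstreamKey + ":" + enc);
        }
        pendingEncounters.remove(patientRef);
    }

    public static void main(String[] args) {
        OutOfOrderBuffer buf = new OutOfOrderBuffer();
        buf.onEncounter("enc-1", "Patient/42"); // arrives before the Patient
        buf.onPatient("Patient/42", "ds-7");    // releases the parked Encounter
        System.out.println(buf.downstream);     // prints [ds-7:enc-1]
    }
}
```

Note that this only works cleanly if all records for a given patient are on the same partition, which is exactly why the partitioning strategy above matters.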
But the data is also intrinsically keyed by something other than the patient.
For example, an Encounter has a key and it is possible to get duplicate copies
of an Encounter, either as corrections or simple duplicates. Thus it is
desirable to use the intrinsic key with compacted topics, rather than using the
patient as the Kafka topic key. While it is possible to key by one thing and
partition by another using explicit partitioners, that seems both error-prone
and insufficient to keep the data only where it needs to be.
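For illustration, "key by the intrinsic key, partition by patient" could look like the following pure-Java sketch. The method name and hashing scheme are assumptions of mine; in a real producer this logic would sit inside an implementation of the org.apache.kafka.clients.producer.Partitioner interface that extracts the patient reference from the record value.

```java
import java.nio.charset.StandardCharsets;

public class PatientPartitioner {
    // Hypothetical sketch: derive the partition from the patient reference
    // embedded in the value, while the record key stays the Encounter's
    // intrinsic id so that log compaction still deduplicates corrections.
    static int partitionForPatient(String patientRef, int numPartitions) {
        byte[] bytes = patientRef.getBytes(StandardCharsets.UTF_8);
        int hash = 0;
        for (byte b : bytes) {
            hash = 31 * hash + b;
        }
        // Mask off the sign bit so the modulus is always non-negative.
        return (hash & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // All records for the same patient land on the same partition,
        // regardless of their intrinsic Encounter keys.
        int p1 = partitionForPatient("Patient/123", 12);
        int p2 = partitionForPatient("Patient/123", 12);
        System.out.println(p1 == p2); // prints true
    }
}
```

The drawback, as noted above, is that the Streams DSL has no first-class notion of this key/partition split, so every repartitioning operation silently undoes it.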
Specifically, the high-level Streams DSL does not seem to support the latter
point. Without the foreign-key support discussed here, it is necessary to do
aggregation and remapping that cause implicit repartitioning; the DSL seems
determined to move the data around. It is not clear to me whether this KIP
would eliminate that problem. I also found the documentation frustrating on
this point: the implicit repartitioning is never called out explicitly – it
becomes most apparent by looking at the set of topics that get implicitly
created.
I would like to see the ability to transform a set of related incoming topics
into a set of downstream topics including re-keying and sometimes
renormalization using the high-level Streams DSL. This seems like a start
towards that, but is it sufficient?
> Support non-key joining in KTable
> ---------------------------------
>
> Key: KAFKA-3705
> URL: https://issues.apache.org/jira/browse/KAFKA-3705
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Reporter: Guozhang Wang
> Labels: api
>
> Today in Kafka Streams DSL, KTable joins are only based on keys. If users
> want to join a KTable A by key {{a}} with another KTable B by key {{b}} but
> with a "foreign key" {{a}}, and assuming they are read from two topics which
> are partitioned on {{a}} and {{b}} respectively, they need to use the
> following pattern:
> {code}
> tableB' = tableB.groupBy(/* select on field "a" */).agg(...); // now tableB' is partitioned on "a"
> tableA.join(tableB', joiner);
> {code}
> Even if these two tables are read from two topics which are already
> partitioned on {{a}}, users still need to do the pre-aggregation in order to
> get the two joining streams onto the same key. This is a drawback in
> programmability and we should fix it.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)