I made a modification to Confluent's latest example,
UserRegionLambdaExample; see the relevant code at the end of this email.

Am I correct in understanding that KTable semantics should be similar to a
store-backed cache of a view (per the Wikipedia article on materialized
views), or to Oracle's materialized views and indexed views? More
specifically, I am looking at when a (key, null) pair can make it into a
KTable generated from a valid KStream with an always-false filter.
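For what it's worth, my mental model of a materialized KTable is a
key-value store consuming a changelog, where a null value acts as a delete
(a tombstone). A minimal plain-Java sketch of that assumption (the class
and method names here are mine, not Kafka's):

```java
import java.util.HashMap;
import java.util.Map;

public class ChangelogView {
    // Materialize one changelog entry into the store:
    // a null value is a tombstone and deletes the key.
    static void apply(Map<String, Long> store, String key, Long value) {
        if (value == null) {
            store.remove(key);
        } else {
            store.put(key, value);
        }
    }

    public static void main(String[] args) {
        Map<String, Long> store = new HashMap<>();
        apply(store, "europe", 3L);   // upsert
        apply(store, "europe", null); // tombstone: the row disappears
        apply(store, "asia", null);   // tombstone for a key never stored
        System.out.println(store);    // {}
    }
}
```

Under this model a (key, null) record is meaningful output, because a
downstream consumer may hold earlier state for that key that needs deleting.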

Here is the relevant code, modified from the example, for which I observed
that all keys within userRegions are sent to the topic LargeRegions with a
null value. I would have thought that both the regionCounts KTable and the
LargeRegions topic should be empty, so that the cached view agrees with the
intended query (a query with an intentionally empty result set, since the
filter predicate 1 >= 2 is always false).

I am not sure I understand the implications properly, as I am new to this,
but it seems possible that a highly selective filter on a large incoming
stream would result in high memory usage for regionCounts and hence for the
streams application.
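If the filter indeed forwards a tombstone for every non-matching key (my
assumption, based on what I observed), one workaround I am considering is
dropping null values on the resulting stream before writing to the output
topic, analogous to adding .filter((k, v) -> v != null) before .to(...).
Sketched Kafka-free with plain collections, so all names below are
hypothetical:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TombstoneDrop {
    // Simulate the regionCounts.toStream() output when the filter
    // predicate (1 >= 2) is always false: every key yields a tombstone.
    static List<Map.Entry<String, String>> changelog() {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        records.add(new AbstractMap.SimpleEntry<>("europe", null));
        records.add(new AbstractMap.SimpleEntry<>("asia", null));
        return records;
    }

    // Drop tombstones before "publishing" to the output topic.
    static List<Map.Entry<String, String>> withoutTombstones(
            List<Map.Entry<String, String>> records) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (Map.Entry<String, String> r : records) {
            if (r.getValue() != null) {
                out.add(r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // With tombstones dropped, nothing reaches the topic.
        System.out.println(withoutTombstones(changelog()).size()); // 0
    }
}
```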

KTable<String, String> regionCounts = userRegions
    // Count by region
    // We do not need to specify any explicit serdes because the key
    // and value types do not change
    .groupBy((userId, region) -> KeyValue.pair(region, region))
    .count("CountsByRegion")
    // discard any regions FOR SAKE OF EXAMPLE
    .filter((regionName, count) -> 1 >= 2)
    .mapValues(count -> count.toString());


KStream<String, String> regionCountsForConsole = regionCounts.toStream();

regionCountsForConsole.to(stringSerde, stringSerde, "LargeRegions");
