Today, we have an issue in Pinot where data is in an inconsistent state
during segment push, so query results may be incorrect. Fixing this is
becoming more critical for enterprise applications that need to maintain
customer trust, especially in REFRESH use cases with large data sizes,
where the period of inconsistency can be quite long. There are various
flavors of this problem:

1. In APPEND use cases, the time boundary is updated as soon as the first
segment from the periodic push arrives. This causes queries to hit the
offline table for a period for which the offline table does not yet have
complete data.

2. For REFRESH use cases, there is no requirement for segments to be
partitioned, so data can be in an entirely inconsistent state during the
push time.

3. We are seeing enterprise applications that create different
denormalizations of their source data, resulting in multiple tables in
Pinot. In these cases, the same application queries multiple tables for
its product, and there are increasing asks to ensure some form of
inter-table consistency (provided the client side takes care of
synchronizing data pushes to these tables).
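
To make flavor 1 concrete, here is a rough sketch of the failure mode. It
is a deliberately simplified model, not Pinot's actual code: the Segment
class, the helper functions, and the rule "boundary = latest day present
in the offline table" are all assumptions made for illustration, and the
real time-boundary computation has details that are omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for an offline segment covering a single day."""
    name: str
    day: int   # day covered by the segment, as days since epoch
    rows: int


def time_boundary(offline_segments):
    # Simplified assumption: the boundary is the latest day for which the
    # offline table has at least one segment.
    return max(s.day for s in offline_segments)


def offline_rows_for_day(offline_segments, day):
    # Rows the offline table can currently return for the given day.
    return sum(s.rows for s in offline_segments if s.day == day)


# The periodic push for day 101 consists of three segments.
push = [Segment(f"day101_{i}", 101, 100) for i in range(3)]

# Before the push, the offline table holds complete data up to day 100.
offline = [Segment("day100_0", 100, 300)]

# Only the first segment of the push has arrived so far.
offline.append(push[0])

boundary = time_boundary(offline)        # already moved to day 101
served_from_offline = 101 <= boundary    # day-101 queries now go offline
partial_rows = offline_rows_for_day(offline, 101)
expected_rows = sum(s.rows for s in push)

print(boundary, served_from_offline, partial_rows, expected_rows)
```

In this model, queries for day 101 are routed to the offline table while
it holds only part of that day's rows, and that stays true until the last
segment of the push lands.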

For 1 and 2, there are several potential bottlenecks that may increase the
push time, including the Pinot controller, the deep store, and network
bandwidth. In our case, the biggest bottleneck appears to be the network
bandwidth between the controller and the compute farm that creates the
segments.

Next steps: Exchange ideas, and create proposals for the problems above.

Cheers,
Mayank
