pri1712 opened a new issue, #18712:
URL: https://github.com/apache/pinot/issues/18712
**Introduction:**
Currently, there is no straightforward way to get in-depth, near real-time
metadata for segments as they are committed to the deep store _(I say committed
and not ingested, since we do not want to bottleneck ingestion)_ in Pinot.
While the segment metadata kept in ZooKeeper provides basic segment-level
information (timestamps, total docs, CRCs), more granular physical metadata,
such as Bloom filter states, dictionary sizes, and specific index
configurations, exists only on servers.
To access this today, we have to rely on
the `/segments/{tableName}/metadata?columns=<list of columns>` API.
This presents a few roadblocks:
- It is not performant for tables with even a moderate number of columns.
- It requires heavy, synchronous polling, which puts unnecessary load on the
servers.
- It prevents near real-time availability of segment metadata for downstream
systems, because this approach introduces multiple bottlenecks (disk,
network).
I would like to propose an optional, configurable, event-driven mechanism
that pushes complete segment metadata to a sink (for example, a Kafka topic)
once a segment is committed to the deep store.
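
To make the idea concrete, here is a rough sketch of what such a hook could look like. Every name in it (`SegmentMetadataEvent`, `SegmentMetadataSink`, `SegmentCommitNotifier`, and the fields) is hypothetical and not an existing Pinot API. An in-memory sink stands in for a Kafka-backed one so the example runs without dependencies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical event payload. Field names are illustrative, not an existing Pinot schema.
final class SegmentMetadataEvent {
    final String tableName;
    final String segmentName;
    final long totalDocs;
    final Map<String, Long> dictionarySizeBytes; // per-column dictionary size

    SegmentMetadataEvent(String tableName, String segmentName, long totalDocs,
                         Map<String, Long> dictionarySizeBytes) {
        this.tableName = tableName;
        this.segmentName = segmentName;
        this.totalDocs = totalDocs;
        this.dictionarySizeBytes = Map.copyOf(dictionarySizeBytes);
    }
}

// Hypothetical sink abstraction. A Kafka-backed implementation would publish to a topic.
interface SegmentMetadataSink {
    void publish(SegmentMetadataEvent event);
}

// In-memory sink so the sketch has no external dependencies.
final class InMemorySink implements SegmentMetadataSink {
    final List<SegmentMetadataEvent> events = new ArrayList<>();

    @Override
    public void publish(SegmentMetadataEvent event) {
        events.add(event);
    }
}

// Hypothetical hook, assumed to be called only after the deep store commit has succeeded.
final class SegmentCommitNotifier {
    private final SegmentMetadataSink sink; // null means the feature is disabled

    SegmentCommitNotifier(SegmentMetadataSink sink) {
        this.sink = sink;
    }

    void onSegmentCommitted(SegmentMetadataEvent event) {
        if (sink == null) {
            return; // optional feature: do nothing when no sink is configured
        }
        try {
            sink.publish(event);
        } catch (RuntimeException e) {
            // A sink failure is logged and swallowed so it cannot fail the commit path.
            System.err.println("Failed to publish metadata for " + event.segmentName + ": " + e);
        }
    }
}

final class SegmentMetadataSketch {
    public static void main(String[] args) {
        InMemorySink sink = new InMemorySink();
        SegmentCommitNotifier notifier = new SegmentCommitNotifier(sink);
        notifier.onSegmentCommitted(new SegmentMetadataEvent(
                "myTable_REALTIME", "segment_0", 1000L, Map.of("userId", 4096L)));
        System.out.println("events published: " + sink.events.size());
    }
}
```

The two design points the sketch tries to capture are that the hook runs after the commit rather than on the ingestion path, and that a failing or absent sink never affects the commit itself.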
Capturing metadata at this granular level via an event stream would enable:
- **Better observability into Pinot operations:** Better visibility into
index storage footprints, anomaly detection, and pipeline health at a very
granular level without hammering the Controller/Server/ZooKeeper APIs.
- **Managing TTL'd / Cold-Tier Segments:** Pushing this data to a separate
meta table would allow a user to maintain a permanent catalog of segments even
after their TTL has expired and they are dropped from the active cluster.
_This issue is only meant to gauge community interest. If there is sufficient
interest, I will follow up with a detailed plan and a PEP._