gianm commented on a change in pull request #7425: Add is_overshadowed column
to sys.segments table
URL: https://github.com/apache/incubator-druid/pull/7425#discussion_r276882146
##########
File path:
sql/src/main/java/org/apache/druid/sql/calcite/schema/MetadataSegmentView.java
##########
@@ -143,10 +146,16 @@ private void poll()
final DateTime timestamp = DateTimes.nowUtc();
while (metadataSegments.hasNext()) {
- final DataSegment interned =
DataSegmentInterner.intern(metadataSegments.next());
+ final SegmentWithOvershadowedStatus segment = metadataSegments.next();
+ final DataSegment interned =
DataSegmentInterner.intern(segment.getDataSegment());
+ final SegmentWithOvershadowedStatus segmentWithOvershadowedStatus = new
SegmentWithOvershadowedStatus(
+ interned,
+ segment.isOvershadowed()
+ );
// timestamp is used to filter deleted segments
- publishedSegments.put(interned, timestamp);
Review comment:
This introduces a bug: since there are two possible
SegmentWithOvershadowedStatus for each underlying DataSegment, now the same
segment can be in `publishedSegments` twice for a period of time. There's a few
ways to deal with this:
1. Make `publishedSegments` a `TreeSet<SegmentWithOvershadowedStatus>` and
update the entire map atomically. This is a super clean solution but would
burst to higher memory usage (it would need to keep two entire copies of the
map in memory when replacing them).
2. Make `publishedSegments` a `ConcurrentSkipListMap<DataSegment,
CachedSegmentInfo>` where CachedSegmentInfo is some static class, defined in
this file, containing the updated timestamp and the overshadowed boolean. If
you do this, the SegmentWithOvershadowedStatus won't be stored long term
anymore. You could minimize memory footprint of CachedSegmentInfo, if you want,
by making the timestamp a `long` rather than `DateTime`.
3. Make `publishedSegments` a
`ConcurrentSkipListSet<SegmentWithOvershadowedStatus>`, make
SegmentWithOvershadowedStatus mutable (in a thread-safe way), make its
`equals`, `hashCode`, and `compareTo` methods only based on the `dataSegment`
field, let its `overshadowed` field be modified, and add a `timestamp` field to
it. When syncing the cache, get the current object and mutate the overshadowed
field if necessary. Btw, a ConcurrentSkipListSet uses a ConcurrentSkipListMap
under the hood, so the memory footprint savings of this aren't as much as you
might expect relative to (2).
(2) is the variant that's closest to what the code was doing before this
patch. One thing I don't love about it is that it is racey: it means that if a
set of segments is overshadowed all at once, callers will not necessarily see a
consistent view, because the map is being concurrently updated. They'll see the
overshadowed flag get set for the underlying segments one at a time. But the
same problem existed in the old code, so fixing it could be considered out of
scope for this patch.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]