gianm commented on a change in pull request #7425: Add is_overshadowed column 
to sys.segments table
URL: https://github.com/apache/incubator-druid/pull/7425#discussion_r276882146
 
 

 ##########
 File path: 
sql/src/main/java/org/apache/druid/sql/calcite/schema/MetadataSegmentView.java
 ##########
 @@ -143,10 +146,16 @@ private void poll()
 
     final DateTime timestamp = DateTimes.nowUtc();
     while (metadataSegments.hasNext()) {
-      final DataSegment interned = 
DataSegmentInterner.intern(metadataSegments.next());
+      final SegmentWithOvershadowedStatus segment = metadataSegments.next();
+      final DataSegment interned = 
DataSegmentInterner.intern(segment.getDataSegment());
+      final SegmentWithOvershadowedStatus segmentWithOvershadowedStatus = new 
SegmentWithOvershadowedStatus(
+          interned,
+          segment.isOvershadowed()
+      );
       // timestamp is used to filter deleted segments
-      publishedSegments.put(interned, timestamp);
 
 Review comment:
   This introduces a bug: since there are two possible 
SegmentWithOvershadowedStatus for each underlying DataSegment, now the same 
segment can be in `publishedSegments` twice for a period of time. There's a few 
ways to deal with this:
   
   1. Make `publishedSegments` a `TreeSet<SegmentWithOvershadowedStatus>` and 
update the entire map atomically. This is a super clean solution but would 
burst to higher memory usage (it would need to keep two entire copies of the 
map in memory when replacing them).
   2. Make `publishedSegments` a `ConcurrentSkipListMap<DataSegment, 
CachedSegmentInfo>` where CachedSegmentInfo is some static class, defined in 
this file, containing the updated timestamp and the overshadowed boolean. If 
you do this, the SegmentWithOvershadowedStatus won't be stored long term 
anymore. You could minimize memory footprint of CachedSegmentInfo, if you want, 
by making the timestamp a `long` rather than `DateTime`.
   3. Make `publishedSegments` a 
`ConcurrentSkipListSet<SegmentWithOvershadowedStatus>`, make 
SegmentWithOvershadowedStatus mutable (in a thread-safe way), make its 
`equals`, `hashCode`, and `compareTo` methods only based on the `dataSegment` 
field, let its `overshadowed` field be modified, and add a `timestamp` field to 
it. When syncing the cache, get the current object and mutate the overshadowed 
field if necessary. Btw, a ConcurrentSkipListSet uses a ConcurrentSkipListMap 
under the hood, so the memory footprint savings of this aren't as much as you 
might expect relative to (2).
   
   (2) is the variant that's closest to what the code was doing before this 
patch. One thing I don't love about it is that it is racey: it means that if a 
set of segments is overshadowed all at once, callers will not necessarily see a 
consistent view, because the map is being concurrently updated. They'll see the 
overshadowed flag get set for the underlying segments one at a time. But the 
same problem existed in the old code, so fixing it could be considered out of 
scope for this patch.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to