gianm commented on code in PR #19535:
URL: https://github.com/apache/druid/pull/19535#discussion_r3438128234
##########
server/src/main/java/org/apache/druid/segment/loading/SegmentLoaderConfig.java:
##########
@@ -105,6 +105,15 @@ public class SegmentLoaderConfig
@JsonProperty("virtualStorageMetadataReservationEstimate")
private long virtualStorageMetadataReservationEstimate = 16L * 1024L * 1024L;
+ /**
+ * When true, partial-eligible V10 segments are mounted via the partial
machinery and
+ * {@link SegmentCacheManager#acquireSegment} with {@link
AcquireMode#PARTIAL} returns a metadata-anchored segment
+ * whose columns are downloaded on demand. When false (the default), {@link
AcquireMode#PARTIAL} falls back to
+ * {@link AcquireMode#FULL} so the entire segment is downloaded up front
(matching pre-partial-download behavior).
+ */
+ @JsonProperty("virtualStoragePartialDownloadsEnabled")
+ private boolean virtualStoragePartialDownloadsEnabled = false;
Review Comment:
IMO it would be better to have this always enabled at the VSF layer, and to
have the toggle be in the query context. There can be something like
`QueryContexts#getAcquireMode`. (Something like `usePartialDownloads:
true/false`). I like this approach because it makes testing much easier, and
also it makes the code in the VSF layer easier to follow because it gets rid of
conditionals. There would also be no reason to get rid of partially-downloaded
directories during bootstrapping, because they're always considered OK.
Default should be `usePartialDownloads: false` for now.
##########
multi-stage-query/src/main/java/org/apache/druid/msq/querykit/ReadableInputQueue.java:
##########
@@ -288,7 +298,7 @@ private ListenableFuture<ReadableInput> loadNextSegment()
return null;
}
- final AcquireSegmentAction acquireSegmentAction =
nextLoadableSegment.acquire();
+ final AcquireSegmentAction acquireSegmentAction =
nextLoadableSegment.acquire(acquireMode);
Review Comment:
It's been clear in testing that good overall performance on cold data is
highly dependent on loadahead. With this change, we aren't fetching the entire
segment as part of loadahead, which is going to lead to poorer download rates.
It's ok to deal with this in a follow-up, but it does need to be dealt with
somehow.
Either the query logic should declare what parts of the segment it wants to
start with, or we should do something statistical: for each query, observe
which parts of the segment are actually used, and fetch those during loadahead
for future segments. The statistical approach only works for queries that fetch
a large number of segments (so the earlier ones can be used to inform the later
ones). But this is fine, since it's those queries where the performance concern
lies.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]