[GitHub] [druid] abhishekagarwal87 commented on a change in pull request #11398: Make SegmentLoader extensible and customizable

GitBox Wed, 14 Jul 2021 04:07:48 -0700


abhishekagarwal87 commented on a change in pull request #11398:
URL: https://github.com/apache/druid/pull/11398#discussion_r669326519




##########
File path: 
server/src/main/java/org/apache/druid/segment/loading/StorageLocation.java
##########
@@ -154,7 +166,8 @@ public synchronized File reserve(String 
segmentFilePathToAdd, String segmentId,
   {
     final File segmentFileToAdd = new File(path, segmentFilePathToAdd);
     if (files.contains(segmentFileToAdd)) {
-      return null;
+      //TODO: is this change ok?
+      return segmentFileToAdd;

Review comment:
       > What's the concern that the TODO is referencing?
   
   While I think the change is in the right direction, wanted to draw 
reviewer's attention to this change to confirm I am not breaking something 
else. 
   > Your reasoning makes sense to me, but the callers currently depend on this 
behavior to avoid duplicate work. For example, this line checks whether 
storageDir is null before downloading the segment. I think this method should 
return something that is indicative of whether the call reserves some space 
successfully or the call returns immediately at this line.
   
   In your example, there are checks before `reserve` to find out if location 
is already reserved. But I see the utility in signalling to the caller that 
file was already reserved. Maybe I can throw a checked exception 
`FileAlreadyReservedException`  that callers can catch and do the needful. 
   
   

##########
File path: 
server/src/main/java/org/apache/druid/segment/loading/StorageLocation.java
##########
@@ -154,7 +166,8 @@ public synchronized File reserve(String 
segmentFilePathToAdd, String segmentId,
   {
     final File segmentFileToAdd = new File(path, segmentFilePathToAdd);
     if (files.contains(segmentFileToAdd)) {
-      return null;
+      //TODO: is this change ok?
+      return segmentFileToAdd;

Review comment:
       thinking more about this,  I realize that this change is not actually 
required. Since I have already added `isReserved` in `StorageLocation` to check 
if the location is pre-reserved. For my use case, the extension is going to 
reserve the space via SegmentLoader#reserve so this change is redundant and can 
be removed.

##########
File path: 
server/src/main/java/org/apache/druid/segment/loading/SegmentLoaderLocalCacheManager.java
##########
@@ -203,8 +219,8 @@ public Segment getSegment(DataSegment segment, boolean 
lazy, SegmentLazyLoadFail
     } else {
       factory = new MMappedQueryableSegmentizerFactory(indexIO);
     }
-
-    return factory.factorize(segment, segmentFiles, lazy, loadFailed);
+    Segment baseSegment = factory.factorize(segment, segmentFiles, lazy, 
loadFailed);
+    return ReferenceCountingSegment.wrapSegment(baseSegment, 
segment.getShardSpec());

Review comment:
       It's not the `DataSegment` that we are re-creating. We are calling 
`factory#factorize` each time and that is going to create a new `Segment` 
object too. Though, technically, that is an immutable object too. there is 
another proposal here 
(https://github.com/apache/druid/pull/11398/files#r668987030) that allays your 
concern. 

##########
File path: 
server/src/main/java/org/apache/druid/segment/loading/SegmentLoader.java
##########
@@ -29,10 +30,45 @@
  * Loading segments from deep storage to local storage.
  * Implementations must be thread-safe.
  */
+@UnstableApi
 public interface SegmentLoader
 {
   boolean isSegmentLoaded(DataSegment segment);
-  Segment getSegment(DataSegment segment, boolean lazy, 
SegmentLazyLoadFailCallback loadFailed) throws SegmentLoadingException;
+
+  /**
+   * Returns a {@link ReferenceCountingSegment} that will be added by the 
{@link org.apache.druid.server.SegmentManager}
+   * to the {@link org.apache.druid.timeline.VersionedIntervalTimeline}. This 
method can be called multiple times
+   * by the {@link org.apache.druid.server.SegmentManager}
+   *
+   * Returning a {@code ReferenceCountingSegment} will let custom 
implementations keep track of reference count for
+   * segments that the custom implementations are creating. That way, custom 
implementations can know when the segment
+   * is in use or not.
+   * @param segment - {@link DataSegment} segment to download
+   * @param lazy - whether the loading is lazy
+   * @param loadFailed - Callback to invoke if the loading fails
+   * @return Segment object wrapped inside {@link ReferenceCountingSegment}.
+   * @throws SegmentLoadingException
+   */
+  ReferenceCountingSegment getSegment(DataSegment segment, boolean lazy, 
SegmentLazyLoadFailCallback loadFailed) throws SegmentLoadingException;
+
   File getSegmentFiles(DataSegment segment) throws SegmentLoadingException;
+
+  /**
+   * Tries to reserve the space for a segment on any location. We only return 
a boolean result instead of a pointer to
+   * {@link StorageLocation} since we don't want callers to operate on {@code 
StorageLocation} directly outside {@code SegmentLoader}.
+   * {@link SegmentLoader} operates on the {@code StorageLocation} objects in 
a thread-safe manner.
+   *
+   * @param segment - Segment to reserve
+   * @return True if enough space found to store the segment, false otherwise
+   */
+  boolean reserve(DataSegment segment);

Review comment:
       yeah.  I agree that those comments shouldn't be part of the javadoc 
itself. they add some context behind keeping the signature as it is. 
   >BTW, how do you expect these methods to be used? When should the caller use 
this method instead of getSegment()?
   
   when there is a need to de-couple the action of actually downloading the 
files vs reserving the space. With this function, extensions can try to reserve 
the space first and if not successful, make some space, clean-up segments, etc. 
There is also improved concurrency for extensions with the breakdown of 
`getSegment` into `reserve` and download operations since the former function 
is a cheap op and can be put inside a lock if required by the extensions. 
`getSegment` can't be put inside a lock since it is a time-consuming operation. 

##########
File path: 
server/src/main/java/org/apache/druid/segment/loading/SegmentLoader.java
##########
@@ -29,10 +30,45 @@
  * Loading segments from deep storage to local storage.
  * Implementations must be thread-safe.
  */
+@UnstableApi
 public interface SegmentLoader
 {
   boolean isSegmentLoaded(DataSegment segment);
-  Segment getSegment(DataSegment segment, boolean lazy, 
SegmentLazyLoadFailCallback loadFailed) throws SegmentLoadingException;
+
+  /**
+   * Returns a {@link ReferenceCountingSegment} that will be added by the 
{@link org.apache.druid.server.SegmentManager}
+   * to the {@link org.apache.druid.timeline.VersionedIntervalTimeline}. This 
method can be called multiple times
+   * by the {@link org.apache.druid.server.SegmentManager}
+   *
+   * Returning a {@code ReferenceCountingSegment} will let custom 
implementations keep track of reference count for
+   * segments that the custom implementations are creating. That way, custom 
implementations can know when the segment
+   * is in use or not.
+   * @param segment - {@link DataSegment} segment to download
+   * @param lazy - whether the loading is lazy
+   * @param loadFailed - Callback to invoke if the loading fails
+   * @return Segment object wrapped inside {@link ReferenceCountingSegment}.
+   * @throws SegmentLoadingException
+   */
+  ReferenceCountingSegment getSegment(DataSegment segment, boolean lazy, 
SegmentLazyLoadFailCallback loadFailed) throws SegmentLoadingException;
+
   File getSegmentFiles(DataSegment segment) throws SegmentLoadingException;

Review comment:
       Hmm. How about having 
   `SegmentLoader` and `SegmentFetcher` 
   
   - SegmentLoader also loads the segment from segment files and return 
`ReferenceCountingSegment`. It will be used in `SegmentManager`
   - SegmentFetcher just downloads the files. It would also have the other 
methods such as `isSegmentLoaded` (should be renamed to `isSegmentFetched`), 
`reserve` and `release`. `SegmentFetcher` would abstract away the details about 
`StorageLocation`(s). The implementations must be thread-safe. Ingestion will 
use `SegmentFetcher` directly and `SegmentFetcher` will also be used by 
`SegmentLoader`. 
   - `SegmentLoader` is given the `File` from `segmentFetcher` that 
`SegmentLoader` will load as `ReferencingCountingSegment`. It can be called 
once if `SegmentManager` will load it only if there is no existing entry in the 
timeline. But that may impact the concurrency of `SegmentManager` if we move 
`load` inside the `datasources.compute` inside `SegmentManager`.  We either let 
`load` be called multiple times or we add another method in `SegmentLoader` 
that wraps the base segment in `ReferenceCountingSegment`. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] abhishekagarwal87 commented on a change in pull request #11398: Make SegmentLoader extensible and customizable

Reply via email to