LakshSingla commented on code in PR #13981:
URL: https://github.com/apache/druid/pull/13981#discussion_r1151335632


##########
processing/src/main/java/org/apache/druid/frame/channel/ReadableInputStreamFrameChannel.java:
##########
@@ -40,6 +40,7 @@
  */
 public class ReadableInputStreamFrameChannel implements ReadableFrameChannel
 {
+  private boolean toStart = true;

Review Comment:
   ```suggestion
     private boolean started = false;
   ```
   `toStart` seems unclear for a boolean; something like `started` or `hasStarted` would be clearer. Note that the rename inverts the meaning (`toStart == true` means reading has not started yet), so the initial value and the `if (!toStart)` checks would need to be flipped accordingly.



##########
extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/S3StorageConnector.java:
##########
@@ -142,6 +144,12 @@ public boolean hasMoreElements()
       @Override
       public InputStream nextElement()
       {
+        // since SequenceInputStream calls nextElement in the constructor, we start chunking as soon as we call read.
+        // to avoid that we pass a nullInputStream for the first iteration.
+        if (!initStream) {

Review Comment:
   I was wondering if there is a slightly better way to achieve this, instead of this one-off handling. The issue arises because `nextElement()` prefetches the InputStream that should return the data, while the S3StorageConnector is simultaneously chunking the file.
   
   If we create a new InputStream that acts as a lazy wrapper over this initializing code (the part in `nextElement` that copies the chunk to a file), we won't need the `initStream` flag and the special handling.
   
   ```
             InputStream x = new InputStream()
             {
               private InputStream delegate = null;

               @Override
               public int read() throws IOException
               {
                 if (delegate == null) {
                   // Only chunk on the first actual read, not when nextElement() is called.
                   File f = chunkFromS3(range); // Implement this
                   delegate = new FileInputStream(f);
                 }
                 return delegate.read();
               }
             };
   ```
   
   One advantage of the above approach is that we wouldn't be prefetching a chunk. (When read() is called on the SequenceInputStream and the current stream is exhausted, it calls `nextStream()` to prepare the next element in the iteration. Since the current code change handles only the initialization, we would still be prefetching the next chunk before read() is called on the next element. There might not be a huge benefit to this, though, if we are reading the S3 object continuously.)
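   For illustration, here is a self-contained sketch of the lazy-wrapper idea (not the actual Druid code; `chunkFromS3` is simulated with in-memory byte arrays and a fetch counter, and all names are hypothetical). It shows that even though `SequenceInputStream`'s constructor calls `nextElement()` eagerly, no chunk is fetched until `read()` is actually invoked:
   
   ```java
   import java.io.ByteArrayInputStream;
   import java.io.IOException;
   import java.io.InputStream;
   import java.io.SequenceInputStream;
   import java.util.Enumeration;
   import java.util.List;
   import java.util.concurrent.atomic.AtomicInteger;
   
   public class LazyChunkDemo
   {
     public static void main(String[] args) throws IOException
     {
       // Stands in for chunkFromS3(): counts how many chunks were actually fetched.
       AtomicInteger chunksFetched = new AtomicInteger();
       List<byte[]> chunks = List.of("abc".getBytes(), "def".getBytes());
   
       Enumeration<InputStream> elements = new Enumeration<InputStream>()
       {
         private int next = 0;
   
         @Override
         public boolean hasMoreElements()
         {
           return next < chunks.size();
         }
   
         @Override
         public InputStream nextElement()
         {
           final int index = next++;
           // Lazy wrapper: the chunk is fetched on the first read(),
           // not when SequenceInputStream eagerly calls nextElement().
           return new InputStream()
           {
             private InputStream delegate = null;
   
             @Override
             public int read() throws IOException
             {
               if (delegate == null) {
                 chunksFetched.incrementAndGet();
                 delegate = new ByteArrayInputStream(chunks.get(index));
               }
               return delegate.read();
             }
           };
         }
       };
   
       SequenceInputStream sis = new SequenceInputStream(elements);
       // The constructor has already called nextElement(), but no chunk was fetched yet.
       int fetchedBeforeRead = chunksFetched.get();
       byte[] buf = sis.readAllBytes();
       sis.close();
   
       System.out.println("fetched before read: " + fetchedBeforeRead);  // 0
       System.out.println("data: " + new String(buf));                   // abcdef
       System.out.println("fetched after read: " + chunksFetched.get()); // 2
     }
   }
   ```
   
   The same pattern would let the S3 chunking happen only when data is actually consumed, rather than one element ahead.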



##########
processing/src/main/java/org/apache/druid/frame/channel/ReadableInputStreamFrameChannel.java:
##########
@@ -150,6 +152,10 @@ public void close()
 
   private void startReading()
   {
+    if (!toStart) {

Review Comment:
   Is this change required once the cause is fixed in the `S3StorageConnector`? (If it is, can you document the behavior that we buffer the input stream lazily?)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

