[GitHub] [hadoop] mukund-thakur commented on a change in pull request #2368: Hadoop-17296. ABFS: Force reads to be always of buffer size

2020-11-12 Thread GitBox


mukund-thakur commented on a change in pull request #2368:
URL: https://github.com/apache/hadoop/pull/2368#discussion_r522187925



##
File path: 
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsInputStream.java
##
@@ -223,16 +244,19 @@ private int readInternal(final long position, final byte[] b, final int offset,
 
       // queue read-aheads
       int numReadAheads = this.readAheadQueueDepth;
-      long nextSize;
      long nextOffset = position;
+      // First read to queue needs to be of readBufferSize and later

Review comment:
   I don't think there is any bug in the current production code as such.
As far as I understand the code, the change is introduced because a new
config is introduced.
   Now my question is: why not use readAheadBlockSize for the first call as
well? The calls would then be:
   offset=0 Length=4MB
   offset=4MB Length=4MB

   Sorry to say this, but honestly speaking, introducing so many configs is
making the code complex and confusing.
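
   For illustration, a rough sketch of the queuing loop with that change
applied, so every queued range, including the first, uses
readAheadBlockSize (names follow the surrounding diff; the end-of-file
clipping is assumed, not taken from the PR):

      // Sketch: queue read-aheads uniformly at readAheadBlockSize.
      int numReadAheads = this.readAheadQueueDepth;
      long nextOffset = position;
      while (numReadAheads > 0 && nextOffset < contentLength) {
        // Clip the final block so no range is queued past end-of-file.
        long nextSize = Math.min(
            (long) readAheadBlockSize, contentLength - nextOffset);
        ReadBufferManager.getBufferManager()
            .queueReadAhead(this, nextOffset, (int) nextSize);
        nextOffset += nextSize;
        numReadAheads--;
      }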








[GitHub] [hadoop] mukund-thakur commented on a change in pull request #2368: Hadoop-17296. ABFS: Force reads to be always of buffer size

2020-10-19 Thread GitBox


mukund-thakur commented on a change in pull request #2368:
URL: https://github.com/apache/hadoop/pull/2368#discussion_r507647774



##
File path: 
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsInputStream.java
##
@@ -223,16 +244,19 @@ private int readInternal(final long position, final byte[] b, final int offset,
 
       // queue read-aheads
      int numReadAheads = this.readAheadQueueDepth;
-      long nextSize;
      long nextOffset = position;
+      // First read to queue needs to be of readBufferSize and later

Review comment:
   What do you mean by gaps/holes in the readAhead range done here?
   Have you done any experiments with this readAheadBlockSize config? If
so, please share.








[GitHub] [hadoop] mukund-thakur commented on a change in pull request #2368: Hadoop-17296. ABFS: Force reads to be always of buffer size

2020-10-13 Thread GitBox


mukund-thakur commented on a change in pull request #2368:
URL: https://github.com/apache/hadoop/pull/2368#discussion_r504018690



##
File path: 
hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/ITestAzureBlobFileSystemRandomRead.java
##
@@ -448,15 +477,119 @@ public void testRandomReadPerformance() throws Exception {
         ratio < maxAcceptableRatio);
   }
 
+  /**
+   * With this test we should see a full buffer read being triggered when
+   * alwaysReadBufferSize is on, else only the requested buffer size.
+   * Hence a seek done a few bytes away from the last read position will
+   * trigger a network read when alwaysReadBufferSize is off, whereas it
+   * will be served from the internal buffer when it is on.
+   * Reading a full buffer size is the Gen1 behaviour.
+   * @throws Throwable
+   */
+  @Test
+  public void testAlwaysReadBufferSizeConfig() throws Throwable {
+    testAlwaysReadBufferSizeConfig(false);
+    testAlwaysReadBufferSizeConfig(true);
+  }
+
+  private void assertStatistics(AzureBlobFileSystem fs,

Review comment:
   Why create a new method here if it is just a passthrough?

##
File path: 
hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/services/TestAbfsInputStream.java
##
@@ -447,4 +490,168 @@ public void testReadAheadManagerForSuccessfulReadAhead() throws Exception {
     checkEvictedStatus(inputStream, 0, true);
   }
 
+  /**
+   * Test readahead with different config settings for read request size
+   * and readAhead block size.
+   * @throws Exception
+   */
+  @Test
+  public void testDiffReadRequestSizeAndRAHBlockSize() throws Exception {
+    // Set readRequestSize = 4MB and readAheadBlockSize = 8MB
+    ReadBufferManager.getBufferManager()
+        .testResetReadBufferManager(FOUR_MB, INCREASED_READ_BUFFER_AGE_THRESHOLD);
+    testReadAheadConfigs(FOUR_MB, TEST_READAHEAD_DEPTH_4, false, EIGHT_MB);
+
+    // Test for readRequestSize = 16KB and readAheadBlockSize = 16KB
+    ReadBufferManager.getBufferManager()
+        .testResetReadBufferManager(SIXTEEN_KB, INCREASED_READ_BUFFER_AGE_THRESHOLD);
+    AbfsInputStream inputStream = testReadAheadConfigs(SIXTEEN_KB,
+        TEST_READAHEAD_DEPTH_2, true, SIXTEEN_KB);
+    testReadAheads(inputStream, SIXTEEN_KB, SIXTEEN_KB);
+
+    // Test for readRequestSize = 16KB and readAheadBlockSize = 48KB
+    ReadBufferManager.getBufferManager()
+        .testResetReadBufferManager(FORTY_EIGHT_KB, INCREASED_READ_BUFFER_AGE_THRESHOLD);
+    inputStream = testReadAheadConfigs(SIXTEEN_KB, TEST_READAHEAD_DEPTH_2,
+        true, FORTY_EIGHT_KB);
+    testReadAheads(inputStream, SIXTEEN_KB, FORTY_EIGHT_KB);
+
+    // Test for readRequestSize = 48KB and readAheadBlockSize = 16KB
+    ReadBufferManager.getBufferManager()
+        .testResetReadBufferManager(FORTY_EIGHT_KB, INCREASED_READ_BUFFER_AGE_THRESHOLD);
+    inputStream = testReadAheadConfigs(FORTY_EIGHT_KB, TEST_READAHEAD_DEPTH_2,
+        true, SIXTEEN_KB);
+    testReadAheads(inputStream, FORTY_EIGHT_KB, SIXTEEN_KB);
+  }
+
+
+  private void testReadAheads(AbfsInputStream inputStream,
+      int readRequestSize,
+      int readAheadRequestSize)
+      throws Exception {
+    if (readRequestSize > readAheadRequestSize) {
+      readAheadRequestSize = readRequestSize;
+    }
+
+    byte[] firstReadBuffer = new byte[readRequestSize];
+    byte[] secondReadBuffer = new byte[readAheadRequestSize];
+
+    // get the expected bytes to compare
+    byte[] expectedFirstReadAheadBufferContents = new byte[readRequestSize];
+    byte[] expectedSecondReadAheadBufferContents = new byte[readAheadRequestSize];
+    getExpectedBufferData(0, readRequestSize,
+        expectedFirstReadAheadBufferContents);
+    getExpectedBufferData(readRequestSize, readAheadRequestSize,
+        expectedSecondReadAheadBufferContents);
+
+    assertTrue("Read should be of exact requested size",
+        inputStream.read(firstReadBuffer, 0, readRequestSize)
+            == readRequestSize);
+    assertTrue("Data mismatch found in RAH1",
+        Arrays.equals(firstReadBuffer,
+            expectedFirstReadAheadBufferContents));
+
+    assertTrue("Read should be of exact requested size",
+        inputStream.read(secondReadBuffer, 0, readAheadRequestSize)
+            == readAheadRequestSize);
+    assertTrue("Data mismatch found in RAH2",

Review comment:
   Better to use assertEquals here instead of assertTrue, no?
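
   A minimal sketch of that suggestion, using assertEquals for the length
check and assertArrayEquals for the byte[] comparison (the latter is a
small extension of the suggestion, since assertEquals compares arrays by
reference):

      // On failure these report expected vs. actual, unlike a bare assertTrue.
      assertEquals("Read should be of exact requested size",
          readRequestSize,
          inputStream.read(firstReadBuffer, 0, readRequestSize));
      assertArrayEquals("Data mismatch found in RAH1",
          expectedFirstReadAheadBufferContents, firstReadBuffer);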

[GitHub] [hadoop] mukund-thakur commented on a change in pull request #2368: Hadoop-17296. ABFS: Force reads to be always of buffer size

2020-10-09 Thread GitBox


mukund-thakur commented on a change in pull request #2368:
URL: https://github.com/apache/hadoop/pull/2368#discussion_r501750162



##
File path: 
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsInputStream.java
##
@@ -89,9 +91,24 @@ public AbfsInputStream(
     this.tolerateOobAppends = abfsInputStreamContext.isTolerateOobAppends();
     this.eTag = eTag;
     this.readAheadEnabled = true;
+    this.alwaysReadBufferSize
+        = abfsInputStreamContext.shouldReadBufferSizeAlways();
     this.cachedSasToken = new CachedSASToken(
         abfsInputStreamContext.getSasTokenRenewPeriodForStreamsInSeconds());
     this.streamStatistics = abfsInputStreamContext.getStreamStatistics();
+    readAheadBlockSize = abfsInputStreamContext.getReadAheadBlockSize();
+    if (this.bufferSize > readAheadBlockSize) {

Review comment:
   Can this LOG/validation be moved to AbfsInputStreamContext.build()?
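
   A minimal sketch of that refactor, assuming the builder keeps
readBufferSize and readAheadBlockSize fields and that build() returns the
context (the body is illustrative, not the PR's code):

      public AbfsInputStreamContext build() {
        if (readBufferSize > readAheadBlockSize) {
          LOG.debug(
              "fs.azure.read.request.size[={}] is configured for higher size "
                  + "than fs.azure.read.readahead.blocksize[={}]. Auto-align "
                  + "readAhead block size to be same as readRequestSize.",
              readBufferSize, readAheadBlockSize);
          readAheadBlockSize = readBufferSize;
        }
        return this;
      }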

##
File path: 
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/ReadBufferManager.java
##
@@ -37,10 +39,10 @@
   private static final Logger LOGGER =
       LoggerFactory.getLogger(ReadBufferManager.class);
 
   private static final int NUM_BUFFERS = 16;
-  private static final int BLOCK_SIZE = 4 * 1024 * 1024;
   private static final int NUM_THREADS = 8;
   private static final int DEFAULT_THRESHOLD_AGE_MILLISECONDS = 3000; // have to see if 3 seconds is a good threshold
 
+  private static int blockSize = 4 * 1024 * 1024;

Review comment:
   nit: use 4 * ONE_MB, to be consistent with everywhere else.
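
   That is, assuming ONE_MB is the same constant the rest of the module
already uses:

      // Same value, expressed with the shared size constant.
      private static int blockSize = 4 * ONE_MB;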

##
File path: 
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsInputStream.java
##
@@ -89,9 +91,24 @@ public AbfsInputStream(
 this.tolerateOobAppends = abfsInputStreamContext.isTolerateOobAppends();
 this.eTag = eTag;
 this.readAheadEnabled = true;
+this.alwaysReadBufferSize
+= abfsInputStreamContext.shouldReadBufferSizeAlways();
 this.cachedSasToken = new CachedSASToken(
 abfsInputStreamContext.getSasTokenRenewPeriodForStreamsInSeconds());
 this.streamStatistics = abfsInputStreamContext.getStreamStatistics();
+readAheadBlockSize = abfsInputStreamContext.getReadAheadBlockSize();
+if (this.bufferSize > readAheadBlockSize) {
+  LOG.debug(
+  "fs.azure.read.request.size[={}] is configured for higher size than "
+  + "fs.azure.read.readahead.blocksize[={}]. Auto-align "
+  + "readAhead block size to be same as readRequestSize.",
+  bufferSize, readAheadBlockSize);
+  readAheadBlockSize = this.bufferSize;
+}
+
+// Propagate the config values to ReadBufferManager so that the first 
instance
+// to initialize it get can set the readAheadBlockSize

Review comment:
   nit: typo? "initialize it get can set"

##
File path: 
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsInputStream.java
##
@@ -178,11 +195,15 @@ private int readOneBlock(final byte[] b, final int off, final int len) throws IO
       buffer = new byte[bufferSize];
     }
 
-      // Enable readAhead when reading sequentially
-      if (-1 == fCursorAfterLastRead || fCursorAfterLastRead == fCursor || b.length >= bufferSize) {
+      if (alwaysReadBufferSize) {
         bytesRead = readInternal(fCursor, buffer, 0, bufferSize, false);

Review comment:
   The JIRA and PR description say we are trying to always read up to
bufferSize rather than just the requested length, but as per this line we
are also enabling the buffer-manager readahead, which is bypassed for
random reads in Gen2 as per line 205 below. PS: I have never seen the Gen1
code though.
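
   For context, a sketch of the contrast being drawn; the final boolean
argument of readInternal is taken to be a bypass-read-ahead flag, as in the
surrounding diff:

      // With alwaysReadBufferSize on: a full bufferSize read that goes
      // through the read-ahead machinery (bypassReadAhead = false).
      bytesRead = readInternal(fCursor, buffer, 0, bufferSize, false);

      // Existing random-read path: only the requested length is fetched,
      // and read-ahead is bypassed entirely (bypassReadAhead = true).
      bytesRead = readInternal(fCursor, buffer, 0, b.length, true);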

##
File path: 
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/ReadBufferManager.java
##
@@ -49,21 +51,37 @@
   private Queue<ReadBuffer> readAheadQueue = new LinkedList<>(); // queue of requests that are not picked up by any worker thread yet
   private LinkedList<ReadBuffer> inProgressList = new LinkedList<>(); // requests being processed by worker threads
   private LinkedList<ReadBuffer> completedReadList = new LinkedList<>(); // buffers available for reading
-  private static final ReadBufferManager BUFFER_MANAGER; // singleton, initialized in static initialization block
+  private static ReadBufferManager bufferManager; // singleton, initialized in static initialization block
+  private static final ReentrantLock LOCK = new ReentrantLock();
 
-  static {
-    BUFFER_MANAGER = new ReadBufferManager();
-    BUFFER_MANAGER.init();
+  static ReadBufferManager getBufferManager() {

Review comment:
   Why all these changes? Why not just initialize the blockSize in init()?
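
   For comparison, a rough sketch of the simpler alternative suggested
here: keep the eagerly-created singleton and let the first caller supply
the block size (method and field names below are hypothetical):

      private static final ReadBufferManager BUFFER_MANAGER
          = new ReadBufferManager();
      private static boolean initialized = false;

      static ReadBufferManager getBufferManager() {
        return BUFFER_MANAGER;
      }

      // Hypothetical: first caller wins; later calls are no-ops, so the
      // buffers are sized once, by whoever initializes the manager first.
      static synchronized void init(final int readAheadBlockSize) {
        if (initialized) {
          return;
        }
        blockSize = readAheadBlockSize;
        BUFFER_MANAGER.allocateBuffersAndStartThreads(); // hypothetical helper
        initialized = true;
      }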

##
File path: 
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/constants/FileSystemConfigurations.java
##
@@ -74,6 +74,9 @@
   public static final String DEFAULT_FS_AZURE_APPEND_BLOB_DIRECTORIES 
