[ https://issues.apache.org/jira/browse/HADOOP-19098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829993#comment-17829993 ]
ASF GitHub Bot commented on HADOOP-19098:
-----------------------------------------
mukund-thakur commented on code in PR #6604:
URL: https://github.com/apache/hadoop/pull/6604#discussion_r1536087673
##########
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/VectoredReadUtils.java:
##########
@@ -133,26 +172,42 @@ private static void readNonByteBufferPositionedReadable(PositionedReadable strea
/**
* Read bytes from stream into a byte buffer using an
* intermediate byte array.
- * @param length number of bytes to read.
+ * <pre>
+ * (position, buffer, buffer-offset, length): Void
+ * position:= the position within the file to read data.
+ * buffer := a buffer to read fully `length` bytes into.
+ * buffer-offset := the offset within the buffer to write data
+ * length := the number of bytes to read.
+ * </pre>
+ * The passed-in function MUST block until the required length of
+ * data is read, or an exception is thrown.
+ * @param range range to read
* @param buffer buffer to fill.
* @param operation operation to use for reading data.
* @throws IOException any IOE.
*/
- public static void readInDirectBuffer(int length,
- ByteBuffer buffer,
- Function4RaisingIOE<Integer, byte[], Integer,
- Integer, Void> operation)
- throws IOException {
+ public static void readInDirectBuffer(FileRange range,
+ ByteBuffer buffer,
+ Function4RaisingIOE<Long, byte[], Integer, Integer, Void> operation)
+ throws IOException {
+
+ LOG.debug("Reading {} into a direct buffer", range);
+ validateRangeRequest(range);
Review Comment:
We would have already validated this. Why validate again?
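For context, the `(position, buffer, buffer-offset, length)` contract documented above can be met by delegating to `PositionedReadable.readFully()`. A minimal sketch, not from the PR; the class name and import locations are assumptions:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.hadoop.fs.FileRange;
import org.apache.hadoop.fs.PositionedReadable;
import org.apache.hadoop.fs.VectoredReadUtils;

class DirectBufferReadSketch {
  // Satisfy the four-argument read contract by delegating to
  // PositionedReadable.readFully(), which blocks until `length`
  // bytes have been read or an exception is thrown.
  static void readRange(PositionedReadable stream) throws IOException {
    FileRange range = FileRange.createFileRange(1024, 4096); // offset, length
    ByteBuffer direct = ByteBuffer.allocateDirect(range.getLength());
    VectoredReadUtils.readInDirectBuffer(range, direct,
        (position, array, offset, length) -> {
          stream.readFully(position, array, offset, length);
          return null;
        });
  }
}
```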
##########
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java:
##########
@@ -885,19 +904,26 @@ public int maxReadSizeForVectorReads() {
* @throws IOException IOE if any.
*/
@Override
- public void readVectored(List<? extends FileRange> ranges,
+ public synchronized void readVectored(List<? extends FileRange> ranges,
Review Comment:
Moving to `synchronized`? We support multiple `readVectored()` calls on the
same stream, and that should be fine.
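For reference, the usage pattern in question, two vectored reads in flight on one stream, might look like this sketch (illustrative only; `synchronized` would serialize only the submission calls, not the asynchronous reads):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileRange;

class ConcurrentVectoredReadSketch {
  // Issue two vectored reads on the same stream before either completes,
  // then await each buffer from its range's future.
  static void twoReads(FSDataInputStream in) throws Exception {
    List<FileRange> first = Arrays.asList(FileRange.createFileRange(0, 1024));
    List<FileRange> second = Arrays.asList(FileRange.createFileRange(8192, 1024));
    in.readVectored(first, ByteBuffer::allocate);
    in.readVectored(second, ByteBuffer::allocate);
    ByteBuffer a = first.get(0).getData().get();  // blocks for the first range
    ByteBuffer b = second.get(0).getData().get(); // blocks for the second range
  }
}
```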
##########
hadoop-common-project/hadoop-common/src/site/markdown/filesystem/fsdatainputstream.md:
##########
@@ -459,51 +459,119 @@ The position returned by `getPos()` after `readVectored()` is undefined.
If a file is changed while the `readVectored()` operation is in progress, the output is
undefined. Some ranges may have old data, some may have new, and some may have both.
-While a `readVectored()` operation is in progress, normal read api calls may block.
-
-Note: Don't use direct buffers for reading from ChecksumFileSystem as that may
-lead to memory fragmentation explained in HADOOP-18296.
+While a `readVectored()` operation is in progress, normal read API calls MAY block;
+the value of `getPos()` is also undefined. Applications SHOULD NOT make such requests
+while waiting for the results of a vectored read.
+Note: Don't use direct buffers for reading from `ChecksumFileSystem` as that may
+lead to memory fragmentation, as explained in
+[HADOOP-18296](https://issues.apache.org/jira/browse/HADOOP-18296)
+_Memory fragmentation in ChecksumFileSystem Vectored IO implementation_.
#### Preconditions
-For each requested range:
+No empty lists.
+
+```python
+if ranges = null raise NullPointerException
+if ranges.len() = 0 raise IllegalArgumentException
+if allocate = null raise NullPointerException
+```
+
+For each requested range `range[i]` in the list of ranges `range[0..n]` sorted
+on `getOffset()` ascending such that
+
+for all `i where i > 0`:
- range.getOffset >= 0 else raise IllegalArgumentException
- range.getLength >= 0 else raise EOFException
+ range[i].getOffset() > range[i-1].getOffset()
+
+For all ranges `0..n` the preconditions are:
+
+```python
+ranges[i] != null else raise IllegalArgumentException
+ranges[i].getOffset() >= 0 else raise EOFException
+ranges[i].getLength() >= 0 else raise IllegalArgumentException
+if i > 0 and ranges[i].getOffset() < (ranges[i-1].getOffset() + ranges[i-1].getLength()) :
+ raise IllegalArgumentException
+```
+If the length of the file is known during the validation phase:
+
+```python
+if ranges[i].getOffset() + ranges[i].getLength() > data.length() raise EOFException
+```
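Taken together, these preconditions amount to a sort-then-scan check. A hedged Java sketch of an equivalent validation (illustrative, not the actual helper in `VectoredReadUtils`):

```java
import java.io.EOFException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.hadoop.fs.FileRange;

class RangeValidationSketch {
  // Equivalent of the pseudocode above: sort by offset, then verify each
  // range and reject any overlap with its predecessor.
  static void validate(List<? extends FileRange> ranges) throws EOFException {
    if (ranges == null) throw new NullPointerException("null range list");
    if (ranges.isEmpty()) throw new IllegalArgumentException("empty range list");
    List<FileRange> sorted = new ArrayList<>(ranges);
    sorted.sort(Comparator.comparingLong(FileRange::getOffset));
    FileRange prev = null;
    for (FileRange r : sorted) {
      if (r == null) throw new IllegalArgumentException("null range");
      if (r.getOffset() < 0) throw new EOFException("negative offset");
      if (r.getLength() < 0) throw new IllegalArgumentException("negative length");
      if (prev != null && r.getOffset() < prev.getOffset() + prev.getLength()) {
        throw new IllegalArgumentException("overlapping ranges");
      }
      prev = r;
    }
  }
}
```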
#### Postconditions
-For each requested range:
+For each requested range `range[i]` in the list of ranges `range[0..n]`
+
+```
+ranges[i]'.getData() = CompletableFuture<buffer: ByteBuffer>
+```
- range.getData() returns CompletableFuture<ByteBuffer> which will have data
- from range.getOffset to range.getLength.
+ and when `getData().get()` completes:
+```
+let buffer = getData().get()
+let len = ranges[i].getLength()
+let data = new byte[len]
+(buffer.limit() - buffer.position()) = len
+buffer.get(data, 0, len) = readFully(ranges[i].getOffset(), data, 0, len)
+```
-### `minSeekForVectorReads()`
+That is: the result of every ranged read is the result of the (possibly asynchronous)
+call to `PositionedReadable.readFully()` for the same offset and length.
+
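To make the postcondition concrete, a minimal consumer sketch (illustrative; exception handling elided):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileRange;

class VectoredReadConsumerSketch {
  // Each completed buffer holds exactly the bytes readFully() would
  // return for the same (offset, length) pair.
  static void readAndAwait(FSDataInputStream in) throws Exception {
    List<FileRange> ranges = Arrays.asList(
        FileRange.createFileRange(0, 100),
        FileRange.createFileRange(4096, 512));
    in.readVectored(ranges, ByteBuffer::allocate);
    for (FileRange r : ranges) {
      ByteBuffer buf = r.getData().get(); // blocks until this range is read
      assert buf.remaining() == r.getLength();
    }
  }
}
```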
+#### `minSeekForVectorReads()`
The smallest reasonable seek. Two ranges won't be merged together if the difference between
end of first and start of next range is more than this value.
-### `maxReadSizeForVectorReads()`
+#### `maxReadSizeForVectorReads()`
Maximum number of bytes which can be read in one go after merging the ranges.
-Two ranges won't be merged if the combined data to be read is more than this value.
+Two ranges won't be merged if the combined data to be read It's okay we have a look at what we do right now for readOkayis more than this value.
Review Comment:
Typo: some random lines, I guess.
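For illustration, the two thresholds combine like this (a sketch; `minSeek` and `maxReadSize` stand for the values returned by the two methods above, and this is not the library's merge code):

```java
class MergeDecisionSketch {
  // Two adjacent sorted ranges merge only if the seek gap between them is
  // small enough and the combined read stays within the size limit.
  static boolean wouldMerge(long prevOffset, int prevLength,
      long nextOffset, int nextLength,
      int minSeek, int maxReadSize) {
    long gap = nextOffset - (prevOffset + prevLength);
    long mergedSize = (nextOffset + nextLength) - prevOffset;
    return gap <= minSeek && mergedSize <= maxReadSize;
  }
}
```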
##########
hadoop-common-project/hadoop-common/src/site/markdown/filesystem/fsdatainputstream.md:
##########
@@ -441,9 +441,9 @@ The semantics of this are exactly equivalent to
readFully(position, buffer, 0, len(buffer))
That is, the buffer is filled entirely with the contents of the input source
-from position `position`
+from position `position`.
-### `default void readVectored(List<? extends FileRange> ranges, IntFunction<ByteBuffer> allocate)`
+### `void readVectored(List<? extends FileRange> ranges, IntFunction<ByteBuffer> allocate)`
Review Comment:
Are we removing the `default`?
> Vector IO: consistent specified rejection of overlapping ranges
> ---------------------------------------------------------------
>
> Key: HADOOP-19098
> URL: https://issues.apache.org/jira/browse/HADOOP-19098
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs, fs/s3
> Affects Versions: 3.3.6
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Major
> Labels: pull-request-available
>
> Related to the PARQUET-2171 question: "how do you deal with overlapping ranges?"
> I believe s3a rejects this, but the other impls may not.
> Proposed: FS spec to say
> * "overlap triggers IllegalArgumentException".
> * special case: 0-byte ranges may be short-circuited to return an empty buffer even without checking file length etc.
> Contract tests to validate this (+ common helper code to do this).
> I'll copy the validation stuff into the parquet PR for consistency with older releases.
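A contract-test sketch of the proposed rejection (hypothetical; the class and method names are illustrative, not from the PR):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileRange;

class OverlapRejectionTestSketch {
  // Hypothetical contract check for the proposed behaviour: a range that
  // starts inside its predecessor must be rejected at submission time.
  static void expectOverlapRejected(FSDataInputStream in) throws Exception {
    List<FileRange> overlapping = Arrays.asList(
        FileRange.createFileRange(0, 200),
        FileRange.createFileRange(100, 200)); // starts inside the first range
    try {
      in.readVectored(overlapping, ByteBuffer::allocate);
      throw new AssertionError("expected IllegalArgumentException");
    } catch (IllegalArgumentException expected) {
      // proposed behaviour: overlap rejected before any I/O starts
    }
  }
}
```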