[
https://issues.apache.org/jira/browse/HADOOP-19098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830528#comment-17830528
]
ASF GitHub Bot commented on HADOOP-19098:
-----------------------------------------
steveloughran commented on code in PR #6604:
URL: https://github.com/apache/hadoop/pull/6604#discussion_r1537742811
##########
hadoop-common-project/hadoop-common/src/site/markdown/filesystem/fsdatainputstream.md:
##########
@@ -459,51 +459,119 @@ The position returned by `getPos()` after
`readVectored()` is undefined.
If a file is changed while the `readVectored()` operation is in progress, the output is
undefined. Some ranges may have old data, some may have new, and some may have
both.
-While a `readVectored()` operation is in progress, normal read api calls may block.
-
-Note: Don't use direct buffers for reading from ChecksumFileSystem as that may
-lead to memory fragmentation explained in HADOOP-18296.
+While a `readVectored()` operation is in progress, normal read API calls MAY block;
+the value of `getPos()` is also undefined. Applications SHOULD NOT make such requests
+while waiting for the results of a vectored read.
+Note: Don't use direct buffers for reading from `ChecksumFileSystem` as that may
+lead to memory fragmentation explained in
+[HADOOP-18296](https://issues.apache.org/jira/browse/HADOOP-18296)
+_Memory fragmentation in ChecksumFileSystem Vectored IO implementation_
#### Preconditions
-For each requested range:
+No empty lists.
+
+```python
+if ranges = null raise NullPointerException
+if ranges.len() = 0 raise IllegalArgumentException
+if allocate = null raise NullPointerException
+```
+
+For each requested range `range[i]` in the list of ranges `range[0..n]` sorted
+on `getOffset()` ascending such that
+
+for all `i where i > 0`:
- range.getOffset >= 0 else raise IllegalArgumentException
- range.getLength >= 0 else raise EOFException
+ range[i].getOffset() > range[i-1].getOffset()
+
+For all ranges `0..i` the preconditions are:
+
+```python
+ranges[i] != null else raise IllegalArgumentException
+ranges[i].getOffset() >= 0 else raise EOFException
+ranges[i].getLength() >= 0 else raise IllegalArgumentException
+if i > 0 and ranges[i].getOffset() < (ranges[i-1].getOffset() + ranges[i-1].getLength()):
+    raise IllegalArgumentException
+```
+If the length of the file is known during the validation phase:
+
+```python
+if ranges[i].getOffset() + ranges[i].getLength() >= data.length() raise EOFException
+```
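The preconditions above can be sketched as a standalone validation function. This is illustrative Python over `(offset, length)` tuples, not the Hadoop implementation; the name `validate_ranges` and the tuple representation are assumptions for the sketch, with Java exception names noted in comments:

```python
def validate_ranges(ranges, file_length=None):
    """Illustrative check of the vectored-read preconditions.

    ranges is a list of (offset, length) tuples, sorted by offset ascending.
    file_length is optional: the spec only requires the EOF check when the
    length of the file is known during validation.
    """
    if ranges is None:
        raise TypeError("ranges is null")           # NullPointerException in Java
    if len(ranges) == 0:
        raise ValueError("empty range list")        # IllegalArgumentException
    prev_end = None
    for offset, length in ranges:
        if offset < 0:
            raise EOFError("negative offset")       # EOFException
        if length < 0:
            raise ValueError("negative length")     # IllegalArgumentException
        if prev_end is not None and offset < prev_end:
            raise ValueError("overlapping ranges")  # IllegalArgumentException
        prev_end = offset + length
        # mirrors the spec text literally: offset + length >= data.length()
        if file_length is not None and offset + length >= file_length:
            raise EOFError("range extends past end of file")
    return True
```

For example, `validate_ranges([(0, 10), (5, 5)])` is rejected because the second range starts inside the first, while `validate_ranges([(0, 10), (10, 5)])` passes: a range beginning exactly where the previous one ends is adjacent, not overlapping.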
#### Postconditions
-For each requested range:
+For each requested range `range[i]` in the list of ranges `range[0..n]`
+
+```
+ranges[i]'.getData() = CompletableFuture<buffer: ByteBuffer>
+```
- range.getData() returns CompletableFuture<ByteBuffer> which will have data
- from range.getOffset to range.getLength.
+ and when `getData().get()` completes:
+```
+let buffer = ranges[i].getData().get()
+let len = ranges[i].getLength()
+let data = new byte[len]
+(buffer.limit() - buffer.position()) = len
+buffer.get(data, 0, len) = readFully(ranges[i].getOffset(), data, 0, len)
+```
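The postcondition — every completed range holds exactly what a `readFully()` at the same offset and length would return — can be shown with a toy in-memory model. This is illustrative Python; `read_fully` and `read_vectored` here are stand-ins for the sketch, not Hadoop APIs:

```python
data = bytes(range(64))  # toy file contents

def read_fully(offset, length):
    # contiguous positioned read of exactly `length` bytes
    if offset + length > len(data):
        raise EOFError("read past end of file")
    return data[offset:offset + length]

def read_vectored(ranges):
    # a trivially "vectored" implementation: one positioned read per range.
    # A real file system may merge, reorder, or parallelise the reads, but
    # each returned buffer must still equal read_fully for that offset/length.
    return [read_fully(offset, length) for offset, length in ranges]

ranges = [(0, 8), (16, 4), (40, 10)]
for (offset, length), buf in zip(ranges, read_vectored(ranges)):
    assert buf == read_fully(offset, length)
    assert len(buf) == length
```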
-### `minSeekForVectorReads()`
+That is: the result of every ranged read is the result of the (possibly asynchronous)
+call to `PositionedReadable.readFully()` for the same offset and length.
+
+#### `minSeekForVectorReads()`
The smallest reasonable seek. Two ranges won't be merged together if the difference
between the end of the first and the start of the next range is more than this value.
-### `maxReadSizeForVectorReads()`
+#### `maxReadSizeForVectorReads()`
Maximum number of bytes which can be read in one go after merging the ranges.
-Two ranges won't be merged if the combined data to be read is more than this value.
+Two ranges won't be merged if the combined data to be read is more than this value.
Review Comment:
oh, speech input, must have been from a conf call.
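The two merge parameters work together: a gap between adjacent ranges is tolerated up to `minSeekForVectorReads()`, and the coalesced span must stay within `maxReadSizeForVectorReads()`. A sketch of that decision (illustrative Python; `merge_ranges` is a hypothetical helper, not the Hadoop implementation):

```python
def merge_ranges(ranges, min_seek, max_read_size):
    """Coalesce sorted (offset, length) ranges for a vectored read.

    Two adjacent ranges merge only when the gap between them is at most
    min_seek AND the merged span does not exceed max_read_size.
    """
    merged = []
    for offset, length in sorted(ranges):
        if merged:
            prev_offset, prev_length = merged[-1]
            gap = offset - (prev_offset + prev_length)
            combined = offset + length - prev_offset
            if gap <= min_seek and combined <= max_read_size:
                merged[-1] = (prev_offset, combined)  # extend previous range
                continue
        merged.append((offset, length))
    return merged
```

For example, with `min_seek=100` and `max_read_size=1000`, ranges `(0, 100)` and `(150, 100)` (a 50-byte gap) coalesce into one `(0, 250)` read; shrink either parameter below the gap or the combined size and they stay separate.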
> Vector IO: consistent specified rejection of overlapping ranges
> ---------------------------------------------------------------
>
> Key: HADOOP-19098
> URL: https://issues.apache.org/jira/browse/HADOOP-19098
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs, fs/s3
> Affects Versions: 3.3.6
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Major
> Labels: pull-request-available
>
> Related to PARQUET-2171 q: "how do you deal with overlapping ranges?"
> I believe s3a rejects this, but the other impls may not.
> Proposed
> FS spec to say
> * "overlap triggers IllegalArgumentException".
> * special case: 0 byte ranges may be short circuited to return empty buffer
> even without checking file length etc.
> Contract tests to validate this
> (+ common helper code to do this).
> I'll copy the validation stuff into the parquet PR for consistency with older
> releases
--
This message was sent by Atlassian Jira
(v8.20.10#820010)