wgtmac commented on code in PR #2048:
URL: https://github.com/apache/orc/pull/2048#discussion_r1841479544


##########
c++/include/orc/Reader.hh:
##########
@@ -39,6 +39,8 @@ namespace orc {
   // classes that hold data members so we can maintain binary compatibility
   struct ReaderOptionsPrivate;
   struct RowReaderOptionsPrivate;
+  struct CacheOptions;

Review Comment:
   Please note that only header files under `c++/include/orc` are exported, so 
users cannot access `ReadRangeCache` or `CacheOptions`. I think it is fine to 
keep `ReadRangeCache` as an internal implementation detail, but users should be 
able to tune `CacheOptions`. You may extend `ReaderOptions` to accept a 
`CacheOptions`, or add two `setXXX` methods directly, one per parameter, if you 
want to hide `CacheOptions` as well.
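
A minimal sketch of the first option, assuming a small public `CacheOptions` struct that `ReaderOptions` stores behind its private-implementation pointer. The field names `holeSizeLimit` and `rangeSizeLimit`, their defaults, and the `sketch` namespace are illustrative assumptions, not the PR's actual definitions.

```cpp
#include <cstdint>
#include <memory>

namespace sketch {

  // Hypothetical tuning knobs; names and defaults are assumptions.
  struct CacheOptions {
    uint64_t holeSizeLimit = 8192;               // merge reads separated by small gaps
    uint64_t rangeSizeLimit = 32 * 1024 * 1024;  // cap on one coalesced read
  };

  // In a real library this struct would be defined only in the .cc file,
  // so the exported header never exposes its layout.
  struct ReaderOptionsPrivate;

  class ReaderOptions {
   public:
    ReaderOptions();
    ~ReaderOptions();

    // Public setter/getter: users tune the cache without seeing the cache itself.
    ReaderOptions& setCacheOptions(const CacheOptions& options);
    const CacheOptions& getCacheOptions() const;

   private:
    std::unique_ptr<ReaderOptionsPrivate> privateBits_;
  };

  // "Implementation file" part of the sketch.
  struct ReaderOptionsPrivate {
    CacheOptions cacheOptions;
  };

  ReaderOptions::ReaderOptions() : privateBits_(new ReaderOptionsPrivate()) {}
  ReaderOptions::~ReaderOptions() = default;

  ReaderOptions& ReaderOptions::setCacheOptions(const CacheOptions& options) {
    privateBits_->cacheOptions = options;
    return *this;
  }

  const CacheOptions& ReaderOptions::getCacheOptions() const {
    return privateBits_->cacheOptions;
  }

}  // namespace sketch
```

The second option would replace `setCacheOptions` with one setter per field, which keeps `CacheOptions` out of the exported headers entirely.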



##########
c++/include/orc/OrcFile.hh:
##########
@@ -36,6 +37,20 @@ namespace orc {
    */
   class InputStream {
    public:
+    using Buffer = DataBuffer<char>;
+    using BufferPtr = std::shared_ptr<Buffer>;
+
+    struct BufferSlice {

Review Comment:
   `ReadRangeCache` is not (and should not be) exported. Please do not declare 
the buffer here. We should be strict about the public API.



##########
c++/include/orc/OrcFile.hh:
##########
@@ -58,6 +73,17 @@ namespace orc {
      */
     virtual void read(void* buf, uint64_t length, uint64_t offset) = 0;
 
+    /**
+     * Read data asynchronously.
+     * @param offset the position in the stream to read from.
+     * @param length the number of bytes to read.
+     * @return a future that will be set to the buffer when the read is complete.
+     */
+    virtual std::future<BufferPtr> readAsync(uint64_t /*offset*/, uint64_t /*length*/,

Review Comment:
   I don't think this is good practice. How the buffer is allocated is an 
implementation detail. In the current implementation, the `DataBuffer` instance 
can be managed by `ReadRangeCache`, so I would prefer the following signature:
   ```
   virtual std::future<void> read(void* buf, uint64_t length, uint64_t offset) = 0;
   ```
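
A rough sketch of how a caller-owned-buffer asynchronous read could look. The method is named `readAsync` here because C++ cannot overload on return type alone, so it cannot share the name and parameter list of the existing blocking `read`. The deferred default implementation, the `sketch` namespace, and `MemoryInputStream` are assumptions for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <future>
#include <stdexcept>
#include <string>
#include <utility>

namespace sketch {

  class InputStream {
   public:
    virtual ~InputStream() = default;

    // Blocking read into a caller-owned buffer.
    virtual void read(void* buf, uint64_t length, uint64_t offset) = 0;

    // Asynchronous variant: the caller still owns buf and must keep it
    // (and this stream) alive until the returned future is ready.
    // Assumed default: run the blocking read lazily when the future is waited on.
    virtual std::future<void> readAsync(void* buf, uint64_t length, uint64_t offset) {
      return std::async(std::launch::deferred,
                        [this, buf, length, offset]() { this->read(buf, length, offset); });
    }
  };

  // Hypothetical in-memory stream used only to exercise the interface.
  class MemoryInputStream : public InputStream {
   public:
    explicit MemoryInputStream(std::string data) : data_(std::move(data)) {}

    void read(void* buf, uint64_t length, uint64_t offset) override {
      if (offset > data_.size() || length > data_.size() - offset) {
        throw std::out_of_range("read past end of stream");
      }
      std::memcpy(buf, data_.data() + offset, static_cast<std::size_t>(length));
    }

   private:
    std::string data_;
  };

}  // namespace sketch
```

With this shape the cache decides where each buffer lives and passes a pointer in, and an error in the underlying read surfaces when the caller calls `get()` on the future.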



##########
c++/include/orc/Reader.hh:
##########
@@ -624,6 +626,21 @@ namespace orc {
      */
     virtual std::map<uint32_t, RowGroupIndex> getRowGroupIndex(
         uint32_t stripeIndex, const std::set<uint32_t>& included = {}) const = 0;
+
+    /**
+     * Trigger IO prefetch and cache the prefetched contents asynchronously.
+     * @param stripes the stripes to prefetch
+     * @param includeTypes the types to prefetch
+     * @param options the cache options for prefetched contents
+     */
+    virtual void preBuffer(const std::vector<int>& stripes, const std::list<uint64_t>& includeTypes,

Review Comment:
   Usually `Reader` is for reading metadata and `RowReader` is for reading the 
data of a specific split range. Users may have created multiple `RowReader`s 
from the same `Reader`. Please make sure `preBuffer` and `releaseBuffer` are 
thread-safe, especially the latter.


