ffacs commented on code in PR #2048:
URL: https://github.com/apache/orc/pull/2048#discussion_r1815100557


##########
c++/include/orc/Reader.hh:
##########
@@ -605,6 +612,26 @@ namespace orc {
      */
     virtual std::map<uint32_t, BloomFilterIndex> getBloomFilters(
         uint32_t stripeIndex, const std::set<uint32_t>& included) const = 0;
+
+    /**
+     * Get the input stream for the ORC file.
+     */
+    virtual InputStream* getStream() const = 0;
+
+    /**
+     * Get the footer of the ORC file.
+     */
+    virtual const proto::Footer* getFooter() const = 0;
+
+    /**
+     * Get the schema of the ORC file.
+     */
+    virtual const proto::Metadata* getMetadata() const = 0;
+
+    virtual void preBuffer(const std::vector<int>& stripes, const std::list<uint64_t>& includeTypes,

Review Comment:
   > Consider this case: there are 6 stripes in an ORC file, and the user
   > wants to read the first, third, and fifth stripes in it. They may create
   > a RowReader for each stripe to get a `ColumnVectorBatch` from it and
   > convert it to another columnar data representation. To hide more I/O
   > latency, we should prefetch the next stripe while reading/processing the
   > current stripe. For example:
   > 
   > * prefetch the third stripe before reading the first stripe
   > * prefetch the fifth stripe before reading the third stripe
   > 
   > This case actually happens in ClickHouse. What I want to illustrate is
   > that only the user knows best which stripe to prefetch next and when to
   > prefetch it. That's why the user has to invoke `preBuffer` explicitly.
   > Besides, Parquet has a similar design.
   
   Do you mean that some users want to read only certain stripes of a file, 
and that the set of stripes they need can change while the data is being 
processed, so they want to trigger the prefetch themselves?
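
   To check my understanding, the access pattern described above might look
   roughly like the sketch below. `MockStripeReader` and `readWithLookahead`
   are hypothetical names used only for illustration; they are not the
   `orc::Reader` interface from this PR, and the single-argument `preBuffer`
   here is a simplification, not the signature under review. The mock is
   synchronous and only records call order, whereas a real prefetch would
   presumably overlap I/O with decoding.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a reader. It only records the order of calls
// and is NOT the orc::Reader interface proposed in this PR.
struct MockStripeReader {
  std::vector<std::string> log;

  // Stands in for an explicit prefetch request for a single stripe.
  void preBuffer(int stripe) {
    log.push_back("prefetch " + std::to_string(stripe));
  }

  // Stands in for creating a RowReader for one stripe and decoding it.
  void readStripe(int stripe) {
    log.push_back("read " + std::to_string(stripe));
  }
};

// Reads the caller-chosen stripes in order, requesting the next wanted
// stripe just before the current one is read. The caller, not the reader,
// decides which stripe comes next.
inline std::vector<std::string> readWithLookahead(const std::vector<int>& wanted) {
  MockStripeReader reader;
  for (std::size_t i = 0; i < wanted.size(); ++i) {
    if (i + 1 < wanted.size()) {
      reader.preBuffer(wanted[i + 1]);
    }
    reader.readStripe(wanted[i]);
  }
  return reader.log;
}
```

   Calling `readWithLookahead({0, 2, 4})` (zero-based indices for the first,
   third, and fifth stripes) requests stripe 2 before reading stripe 0, and
   stripe 4 before reading stripe 2.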


