[I] C++ API. When the row I seek to is in the same row-group as the current row, why not use the skip function directly, instead of seeking to the row-group again and then skipping? [orc]

via GitHub Wed, 11 Dec 2024 23:55:57 -0800


hrbeuyz24 opened a new issue, #2084:
URL: https://github.com/apache/orc/issues/2084


   We use ORC as the storage format for our real-time data warehouse. Our 
online queries perform many random reads and frequent seeks, and we found that 
a lot of time is spent in SeekToRowGroup and Skip. 
   In many cases the target rows of consecutive seeks fall in the same row 
group, which leads to the question in my title.
   For example, suppose an online query needs to read row 100 and row 130 of 
the same row group.
   The current behavior is:
   1. SeekToRowGroup
   2. Skip(100)
   3. Next(1)
   4. SeekToRowGroup
   5. Skip(130)
   6. Next(1)
   Why not do this instead:
   1. SeekToRowGroup
   2. Skip(100)
   3. Next(1)
   4. Skip(29)
   5. Next(1)
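   After reading row 100 the reader is already positioned at row 101, so 
skipping 29 more rows lands on row 130 without another SeekToRowGroup.
   We made this simple change to the code and found that, in our scenario, it 
improves read performance by at least 50%.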
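
   To make the comparison concrete, here is a minimal, self-contained sketch 
of the idea. It is not the ORC implementation: MockReader, its members, and 
the row-group size of 10,000 rows are all hypothetical stand-ins that only 
count how often the reader repositions to a row-group start and how many rows 
it skips. The shortcut is taken only when the target row is at or ahead of the 
current position and in the same row group; a backward seek still falls back to 
SeekToRowGroup, because a skip can only move forward.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a row reader; NOT the actual ORC code.
struct MockReader {
  static constexpr uint64_t kRowsPerGroup = 10000;  // assumed row-group size

  uint64_t nextRow = 0;      // row that the next read would return
  bool positioned = false;   // false until the first seek has happened
  int groupSeeks = 0;        // number of repositions to a row-group start
  uint64_t rowsSkipped = 0;  // total number of rows passed to skip()

  // Jump to the first row of the given row group (the expensive step).
  void seekToRowGroup(uint64_t group) {
    nextRow = group * kRowsPerGroup;
    positioned = true;
    ++groupSeeks;
  }

  // Move forward by n rows without returning them.
  void skip(uint64_t n) {
    nextRow += n;
    rowsSkipped += n;
  }

  // Read a single row and return its row number.
  uint64_t readOne() { return nextRow++; }

  // Seek to an absolute row. With reuse == true, a forward seek inside the
  // current row group only skips the remaining distance.
  void seekToRow(uint64_t target, bool reuse) {
    const uint64_t targetGroup = target / kRowsPerGroup;
    const bool sameGroupAhead = positioned && target >= nextRow &&
                                nextRow / kRowsPerGroup == targetGroup;
    if (!(reuse && sameGroupAhead)) {
      seekToRowGroup(targetGroup);  // backward or cross-group: start over
    }
    assert(target >= nextRow);      // skip() can only move forward
    skip(target - nextRow);
  }
};
```

   Reading rows 100 and 130 with reuse == false performs two row-group seeks 
and skips 230 rows in total; with reuse == true it performs one row-group seek 
and skips 129 rows (100, then 29).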


