Re: [I] [C++][Parquet] Fast Random Rowgroup Reads [arrow]

via GitHub Wed, 24 Jan 2024 14:18:50 -0800


corwinjoy commented on issue #39676:
URL: https://github.com/apache/arrow/issues/39676#issuecomment-1909006800


   @mapleFU 
   In terms of reading only the first row group I can think of two ways to do 
this cleanly:
   1. Get the Thrift compiler to generate a specialized FileMetaData class (say 
FileMetaDataFast) that only reads the first row group. I haven't really 
explored this since I am unfamiliar with the Thrift compiler and how it is 
invoked in this project.
   2. Create a derived class of FileMetaData in a new file with a specialized 
read method, where we copy and specialize the existing read code to only read 
the first row group. Then, the read_only_rowgroup_0 flag could invoke a read of 
this derived class.
   
   In both cases, I think the function would need to return after reading the 
first row group, since we can't safely skip bytes.
   (This is a bit of a problem if we want to support fields that come after 
row_groups, such as encryption_algorithm. However, for files created by Arrow 
there is a workaround: since each field has a field_id, we could change the 
order in which fields are written so that critical fields come before the row 
groups.)
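
As a rough illustration of the early-return idea and the field-ordering workaround, here is a toy sketch. It does not use Arrow's generated Thrift code or decode the real compact protocol: `ToyField`, `ToyMetadata`, and `ReadFirstRowGroupOnly` are hypothetical names, and the field ids 4 and 8 are assumed values chosen to mirror row_groups and encryption_algorithm.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Toy stand-in for one serialized field: a field id plus decoded values.
// Real Thrift decoding is deliberately not modeled here.
struct ToyField {
  int16_t id;
  std::vector<std::string> values;  // list elements; scalars use one element
};

// Assumed field ids for this sketch only.
constexpr int16_t kRowGroupsId = 4;
constexpr int16_t kEncryptionAlgorithmId = 8;

struct ToyMetadata {
  std::optional<std::string> first_row_group;
  std::optional<std::string> encryption_algorithm;
  std::size_t fields_visited = 0;
};

// Walks the fields in the order they were written and returns as soon as
// the first element of the row_groups list has been seen.
ToyMetadata ReadFirstRowGroupOnly(const std::vector<ToyField>& serialized) {
  ToyMetadata out;
  for (const ToyField& field : serialized) {
    ++out.fields_visited;
    if (field.id == kEncryptionAlgorithmId && !field.values.empty()) {
      out.encryption_algorithm = field.values.front();
    } else if (field.id == kRowGroupsId) {
      if (!field.values.empty()) {
        out.first_row_group = field.values.front();
      }
      return out;  // early exit: anything written after row_groups is not read
    }
  }
  return out;
}
```

In this toy model, a writer that emits field 8 after field 4 leaves `encryption_algorithm` empty in the result, while a writer that emits field 8 first gets both values back from the same reader.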
   
   In terms of providing a test file, the new unit tests in 
`src/parquet/page_index_test.cc` create their own test data. In the PR this is 
set to a somewhat smaller size of `(nColumn=6000, nRow=1000)` for ease of 
development, versus the larger file used as illustration in the perf report 
above. Anyway, this is easily configurable as shown below:
   ```
   TEST_F(PageIndexBuilderTest, BenchmarkReader) {
     std::string dir_string(parquet::test::get_data_dir());
     std::string path = dir_string + "/index_reader_bm_lg.parquet";
   
     // Adjust as needed. These are the sizes used in the above perf report.
     int nColumn = 6000;
     int nRow = 10000;  // Large file size. 10x what is in the PR
     int chunk_size = 10;
     // Creates the file only if it does not already exist
     WriteTableToParquet(nColumn, nRow, path.c_str(), chunk_size, false);
     ...
   }
   ```
   
   To be consistent with the other tests, I am using the test data directory, 
so you will need to set the test data environment variable, e.g.
   `PARQUET_TEST_DATA=/src/arrow/cpp/submodules/parquet-testing/data`
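
For anyone who wants a clearer failure when the variable is missing, a small sketch of resolving the file path from the environment might look like the following. `TestDataPath` is a hypothetical helper written for illustration; it is not part of the PR or of the Parquet test utilities.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper: builds a path inside the directory named by
// PARQUET_TEST_DATA and fails loudly when the variable is unset or empty.
std::string TestDataPath(const std::string& file_name) {
  const char* dir = std::getenv("PARQUET_TEST_DATA");
  if (dir == nullptr || *dir == '\0') {
    throw std::runtime_error("PARQUET_TEST_DATA is not set");
  }
  return std::string(dir) + "/" + file_name;
}
```

Calling `TestDataPath("index_reader_bm_lg.parquet")` then returns the file's location under that directory, or throws if the variable has not been exported.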
   


