corwinjoy commented on issue #39676:
URL: https://github.com/apache/arrow/issues/39676#issuecomment-1909006800
@mapleFU
In terms of reading only the first row group I can think of two ways to do
this cleanly:
1. Use the Thrift compiler to generate a specialized FileMetaData class (say,
FileMetaDataFast) that only reads the first row group. I haven't really
explored this since I am unfamiliar with the Thrift compiler and how it is
invoked in this project.
2. Create a derived class of FileMetaData in a new file with a specialized
read method, where we copy and specialize the existing read code to only read
the first row group. Then the read_only_rowgroup_0 flag could invoke a read of
this derived class.
In both cases, I think the function would need to return after reading the
first row group since we can't safely skip bytes.
(This is a bit of a problem if we want to support fields that come after
row_groups, such as encryption_algorithm. But for files created by Arrow there
is a workaround: since each field has a field_id, we could change the order in
which fields are written so that critical fields come before row_groups.)
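To make the early-return trade-off concrete, here is a minimal sketch (C++17).
It uses a toy, made-up record layout rather than the real Thrift wire format,
and every name in it (`ToyField`, `ToyMetadata`, `ReadUntilFirstRowGroup`,
`DemoReorderingEffect`) is hypothetical and not part of Arrow or Parquet. The
field ids 4 and 8 are only illustrative stand-ins for row_groups and
encryption_algorithm.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Toy stand-in for one serialized field: an id plus its payload items.
// This is NOT the Thrift compact protocol, only an illustration.
struct ToyField {
  int id;
  std::vector<std::string> items;  // a list field carries several items
};

// The subset of metadata the hypothetical fast reader keeps.
struct ToyMetadata {
  std::optional<std::string> first_row_group;
  std::optional<std::string> encryption_algorithm;
};

constexpr int kRowGroupsId = 4;            // illustrative id
constexpr int kEncryptionAlgorithmId = 8;  // illustrative id

// Reads fields in stream order and returns as soon as the first row group
// has been read. Anything serialized after that point is never visited.
ToyMetadata ReadUntilFirstRowGroup(const std::vector<ToyField>& stream) {
  ToyMetadata out;
  for (const ToyField& field : stream) {
    if (field.id == kEncryptionAlgorithmId && !field.items.empty()) {
      out.encryption_algorithm = field.items.front();
    } else if (field.id == kRowGroupsId) {
      if (!field.items.empty()) out.first_row_group = field.items.front();
      return out;  // early return: later items and fields are lost
    }
  }
  return out;
}

// Usage: the same two fields, written in two different orders.
void DemoReorderingEffect() {
  // row_groups first: the later field is never reached.
  std::vector<ToyField> default_order = {{kRowGroupsId, {"rg0", "rg1"}},
                                         {kEncryptionAlgorithmId, {"AES_GCM_V1"}}};
  ToyMetadata a = ReadUntilFirstRowGroup(default_order);
  assert(a.first_row_group == "rg0");
  assert(!a.encryption_algorithm.has_value());

  // Critical field written first: it survives the early return.
  std::vector<ToyField> reordered = {{kEncryptionAlgorithmId, {"AES_GCM_V1"}},
                                     {kRowGroupsId, {"rg0", "rg1"}}};
  ToyMetadata b = ReadUntilFirstRowGroup(reordered);
  assert(b.first_row_group == "rg0");
  assert(b.encryption_algorithm == "AES_GCM_V1");
}
```

The sketch only shows the ordering effect. It says nothing about how the
generated Thrift reader is actually structured.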
In terms of providing a test file, the new unit tests in
`src/parquet/page_index_test.cc` create their own test data. In the PR this is
set to a somewhat smaller size of `(nColumn=6000, nRow=1000)` for ease of
development vs the larger file used as illustration in the perf above. Anyway,
this is easily configurable as shown below:
```
TEST_F(PageIndexBuilderTest, BenchmarkReader) {
  std::string dir_string(parquet::test::get_data_dir());
  std::string path = dir_string + "/index_reader_bm_lg.parquet";
  // Adjust as needed. These are the sizes used in the above perf report.
  int nColumn = 6000;
  int nRow = 10000;  // Large file size: 10x what is in the PR
  int chunk_size = 10;
  // Creates the file only if it does not already exist.
  WriteTableToParquet(nColumn, nRow, path.c_str(), chunk_size, false);
  ...
}
```
To be consistent with the other tests I am using the test data directory so
you will need to set the test data environment variable, e.g.
`PARQUET_TEST_DATA=/src/arrow/cpp/submodules/parquet-testing/data`
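
As a rough sketch of what this dependency looks like, the helper below builds
the benchmark file path from the environment variable and fails loudly when it
is unset. `BenchmarkFilePath` is a hypothetical name used only for
illustration; it is not the implementation of `parquet::test::get_data_dir()`.

```cpp
#include <cassert>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper: resolves a file name against PARQUET_TEST_DATA.
// Throws if the variable is missing or empty so the failure is obvious.
std::string BenchmarkFilePath(const char* file_name) {
  assert(file_name != nullptr);
  const char* dir = std::getenv("PARQUET_TEST_DATA");
  if (dir == nullptr || *dir == '\0') {
    throw std::runtime_error("PARQUET_TEST_DATA is not set");
  }
  return std::string(dir) + "/" + file_name;
}
```

Calling `BenchmarkFilePath("index_reader_bm_lg.parquet")` with the variable set
as above would yield a path inside the parquet-testing data directory.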