zeroshade commented on issue #35688: URL: https://github.com/apache/arrow/issues/35688#issuecomment-1557637503
@hkpeaks To answer your questions:

> 1. `go get -u github.com/apache/arrow/go/v12/parquet`

This just fetches the whole package and adds it to your module. The others are commands within that package, so of course there will be some duplication: both commands rely on this package.

> 2. `go install github.com/apache/arrow/go/v12/parquet/cmd/parquet_reader@latest`

This is just a simple main program which will dump the metadata, info, and data from a parquet file.

> 3. `go install github.com/apache/arrow/go/v12/parquet/cmd/parquet_schema@latest`

This main program is a further stripped-down command which *only* dumps the schema from a parquet file; it will not read data.

> Firstly, I need to know where to find the start and end address for each row block. Secondly, need to know where to find the start and end address of each column and column page contained within a row block. For a particular query which requires 5 columns, I can use Goroutine and mmap to read a given set of row blocks for selected columns in parallel for each batch. e.g.
>
> Row block batch 1: use 20+ threads to read 5 columns of blocks 1~20
>
> Row block batch 2: use 20+ threads to read 5 columns of blocks 21~40
>
> Row block batch 3: use 20+ threads to read 5 columns of blocks 41~60
>
> To determine how many row blocks I shall read for each stream, I need to know how many columns and row blocks a file has.

Once you open the file (either by [`OpenParquetFile`](https://pkg.go.dev/github.com/apache/arrow/go/v12@v12.0.0/parquet/file#OpenParquetFile) or [`NewParquetReader`](https://pkg.go.dev/github.com/apache/arrow/go/v12@v12.0.0/parquet/file#NewParquetReader)), you can retrieve the information you're looking for (a minimal sketch tying these pieces together follows this list):

* `Reader.NumRowGroups()` gives you the total number of row groups in the file.
* `Reader.Metadata().RowGroup(i int)` will get you the metadata for a specific row group ([`RowGroupMetaData`](https://pkg.go.dev/github.com/apache/arrow/go/v12@v12.0.0/parquet/metadata#RowGroupMetaData)).
* You can also use [`Reader.RowGroup`](https://pkg.go.dev/github.com/apache/arrow/go/v12@v12.0.0/parquet/file#Reader.RowGroup) to get a sub-reader specifically for that row group. This reader can report the total byte size of the row group and provide its metadata directly. It can also get you page readers for any given column via [`GetColumnPageReader`](https://pkg.go.dev/github.com/apache/arrow/go/v12@v12.0.0/parquet/file#RowGroupReader.GetColumnPageReader).
* It should be pretty easy to leverage the [`Column`](https://pkg.go.dev/github.com/apache/arrow/go/v12@v12.0.0/parquet/file#RowGroupReader.Column) method to get column readers for reading columns in parallel. In fact, we already do this in the pqarrow package when reading a whole row group into a single record, or a whole file into a table; see here: https://github.com/apache/arrow/blob/go/v12.0.0/go/parquet/pqarrow/file_reader.go#L300
* [`RowGroupMetaData.FileOffset()`](https://pkg.go.dev/github.com/apache/arrow/go/v12@v12.0.0/parquet/metadata#RowGroupMetaData.FileOffset) is the location in the file where the data for this row group begins.
* You can use the `ColumnChunk` method to get the [metadata](https://pkg.go.dev/github.com/apache/arrow/go/v12@v12.0.0/parquet/metadata#ColumnChunkMetaData) for a specific column, which contains the offsets for that particular column along with the rest of the column metadata.
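To make the list above concrete, here is a minimal sketch (not a definitive implementation) that walks the row group and column chunk metadata of a file. The file name `example.parquet` is a placeholder, and the exact method names and signatures should be double-checked against the v12 docs linked above:

```go
package main

import (
	"fmt"
	"log"

	"github.com/apache/arrow/go/v12/parquet/file"
)

func main() {
	// Second argument is memoryMap; false means plain os.Open + ReadAt underneath.
	rdr, err := file.OpenParquetFile("example.parquet", false) // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	defer rdr.Close()

	fmt.Println("row groups:", rdr.NumRowGroups())

	for rg := 0; rg < rdr.NumRowGroups(); rg++ {
		// Row-group-level metadata: where this row group's data begins in the file.
		rgMeta := rdr.MetaData().RowGroup(rg)
		fmt.Printf("row group %d starts at file offset %d\n", rg, rgMeta.FileOffset())

		// Sub-reader scoped to this row group; column/page readers come from it.
		rgr := rdr.RowGroup(rg)
		for col := 0; col < rgr.NumColumns(); col++ {
			// Column-chunk metadata carries the per-column offsets and sizes
			// within the row group.
			cc, err := rgMeta.ColumnChunk(col)
			if err != nil {
				log.Fatal(err)
			}
			fmt.Printf("  column %d: data page offset %d, compressed size %d\n",
				col, cc.DataPageOffset(), cc.TotalCompressedSize())

			// Page-level access for this column, if you want to walk pages yourself.
			if _, err := rgr.GetColumnPageReader(col); err != nil {
				log.Fatal(err)
			}
		}
	}
}
```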
I hope the above answers your questions concerning the API (along with pointing you at the documentation for any other methods you might need). But let me know if there are any other methods/functions you need that you can't find (or that don't exist yet).

> See my source code for how to use mmap to read csv file partitions to support streaming: https://github.com/hkpeaks/peaks-consolidation/blob/main/PeaksFramework/read_file.go

I looked at that file and I don't see any usage of `mmap` at all. You're just using `os.Open` and the `ReadAt` method on the file. (If I'm wrong, can you please provide the line number / direct link to the `mmap`?) If you follow @mapleFU's [suggestion](https://github.com/apache/arrow/issues/35688#issuecomment-1556222309) of passing `memoryMap = false` to `OpenParquetFile`, then that's exactly what will be used: the reader will just use `os.Open` and `ReadAt` to read from the file.
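For the parallel-read pattern discussed in this thread, here is a hedged sketch: one goroutine per selected column within each row group, with `memoryMap = false` so the reader falls back to `os.Open` and `ReadAt`. The file name and the column indices are made up for illustration, and only `parquet/file` calls named above are used; confirm the exact signatures against the linked docs.

```go
package main

import (
	"log"
	"sync"

	"github.com/apache/arrow/go/v12/parquet/file"
)

func main() {
	// memoryMap = false: plain os.Open + ReadAt, per the suggestion above.
	rdr, err := file.OpenParquetFile("example.parquet", false) // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	defer rdr.Close()

	cols := []int{0, 1, 2, 3, 4} // hypothetical: the 5 columns the query needs

	for rg := 0; rg < rdr.NumRowGroups(); rg++ {
		rgr := rdr.RowGroup(rg)

		// Create the column chunk readers serially, then consume them in
		// parallel, one goroutine per column.
		var wg sync.WaitGroup
		for _, c := range cols {
			colRdr, err := rgr.Column(c)
			if err != nil {
				log.Fatal(err)
			}
			wg.Add(1)
			go func(c int, cr file.ColumnChunkReader) {
				defer wg.Done()
				// Type-assert cr to the concrete reader for the column's
				// physical type (e.g. *file.Int64ColumnChunkReader) and call
				// ReadBatch in a loop to pull the values.
				_ = cr
			}(c, colRdr)
		}
		wg.Wait()
	}
}
```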
