zeroshade commented on issue #35688: URL: https://github.com/apache/arrow/issues/35688#issuecomment-1557637503
@hkpeaks To answer your questions:

> 1. `go get -u github.com/apache/arrow/go/v12/parquet`

This just fetches the whole package and adds it to your module. The others are commands within that package, so of course there will be some duplication: both commands rely on this package.

> 2. `go install github.com/apache/arrow/go/v12/parquet/cmd/parquet_reader@latest`

This is just a simple main program which will dump the metadata, info, and data from a parquet file.

> 3. `go install github.com/apache/arrow/go/v12/parquet/cmd/parquet_schema@latest`

This main program is a further stripped-down command which *only* dumps the schema from a parquet file; it will not read data.

> Firstly, I need to know where to find the start and end address for each row block. Secondly, need to know where to find the start and end address of each column and column page contained within a row block. For a particular query which requires 5 columns, I can use Goroutine and mmap to read a given set of row blocks for selected columns in parallel for each batch. e.g.
>
> Row block batch 1: use 20+ threads to read 5 columns of blocks 1~20
>
> Row block batch 2: use 20+ threads to read 5 columns of blocks 21~40
>
> Row block batch 3: use 20+ threads to read 5 columns of blocks 41~60
>
> To determine how many row blocks I shall read for each stream, I need to know how many columns and row blocks a file has.

Once you open the file (either by [`OpenParquetFile`](https://pkg.go.dev/github.com/apache/arrow/go/v12@v12.0.0/parquet/file#OpenParquetFile) or [`NewParquetReader`](https://pkg.go.dev/github.com/apache/arrow/go/v12@v12.0.0/parquet/file#NewParquetReader)), you can retrieve the information you're looking for (a minimal sketch tying these pieces together follows this list):

* `Reader.NumRowGroups()` gives you the total number of row groups in the file.
* `Reader.Metadata().RowGroup(i int)` will get you the metadata for a specific row group ([`RowGroupMetaData`](https://pkg.go.dev/github.com/apache/arrow/go/v12@v12.0.0/parquet/metadata#RowGroupMetaData)).
* You can also use [`Reader.RowGroup`](https://pkg.go.dev/github.com/apache/arrow/go/v12@v12.0.0/parquet/file#Reader.RowGroup) to get a sub-reader specifically for that row group. This reader can report the total byte size of the row group and provide its metadata directly. It can also get you page readers for any given column via [`GetColumnPageReader`](https://pkg.go.dev/github.com/apache/arrow/go/v12@v12.0.0/parquet/file#RowGroupReader.GetColumnPageReader).
* It should be pretty easy to leverage the [`Column`](https://pkg.go.dev/github.com/apache/arrow/go/v12@v12.0.0/parquet/file#RowGroupReader.Column) method to get column readers for reading columns in parallel. In fact, we already do this in the pqarrow package when reading a whole row group into a single record, or a whole file into a table; see here: https://github.com/apache/arrow/blob/go/v12.0.0/go/parquet/pqarrow/file_reader.go#L300
* [`RowGroupMetaData.FileOffset()`](https://pkg.go.dev/github.com/apache/arrow/go/v12@v12.0.0/parquet/metadata#RowGroupMetaData.FileOffset) is the location in the file where the data for this row group begins.
* You can use the `ColumnChunk` method to get the [metadata](https://pkg.go.dev/github.com/apache/arrow/go/v12@v12.0.0/parquet/metadata#ColumnChunkMetaData) for a specific column, which contains the offsets for that particular column along with the rest of the column metadata.
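To make the list above concrete, here is a minimal sketch (not a definitive implementation) that walks the row group and column chunk metadata of a file. The file name `example.parquet` is a placeholder, and the exact method names and signatures should be double-checked against the v12 docs linked above:

```go
package main

import (
	"fmt"
	"log"

	"github.com/apache/arrow/go/v12/parquet/file"
)

func main() {
	// Second argument is memoryMap; false means plain os.Open + ReadAt underneath.
	rdr, err := file.OpenParquetFile("example.parquet", false) // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	defer rdr.Close()

	fmt.Println("row groups:", rdr.NumRowGroups())

	for rg := 0; rg < rdr.NumRowGroups(); rg++ {
		// Row-group-level metadata: where this row group's data begins in the file.
		rgMeta := rdr.MetaData().RowGroup(rg)
		fmt.Printf("row group %d starts at file offset %d\n", rg, rgMeta.FileOffset())

		// Sub-reader scoped to this row group; column/page readers come from it.
		rgr := rdr.RowGroup(rg)
		for col := 0; col < rgr.NumColumns(); col++ {
			// Column-chunk metadata carries the per-column offsets and sizes
			// within the row group.
			cc, err := rgMeta.ColumnChunk(col)
			if err != nil {
				log.Fatal(err)
			}
			fmt.Printf("  column %d: data page offset %d, compressed size %d\n",
				col, cc.DataPageOffset(), cc.TotalCompressedSize())

			// Page-level access for this column, if you want to walk pages yourself.
			if _, err := rgr.GetColumnPageReader(col); err != nil {
				log.Fatal(err)
			}
		}
	}
}
```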
I hope the above answers your questions concerning the API (along with pointing you at the documentation for any other methods you might need). But let me know if there are any other methods/functions you need that you can't find (or that don't exist yet).

> See my source code for how to use mmap to read csv file partitions to support streaming: https://github.com/hkpeaks/peaks-consolidation/blob/main/PeaksFramework/read_file.go

I looked at that file and I don't see any usage of `mmap` at all. You're just using `os.Open` and the `ReadAt` method on the file. (If I'm wrong, can you please provide the line number / direct link to the `mmap`?) If you follow @mapleFU's [suggestion](https://github.com/apache/arrow/issues/35688#issuecomment-1556222309) of passing `memoryMap = false` to `OpenParquetFile`, then that's exactly what will be used: the reader will just use `os.Open` and `ReadAt` to read from the file.
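For the parallel-read pattern discussed in this thread, here is a hedged sketch: one goroutine per selected column within each row group, with `memoryMap = false` so the reader falls back to `os.Open` and `ReadAt`. The file name and the column indices are made up for illustration, and only `parquet/file` calls named above are used; confirm the exact signatures against the linked docs.

```go
package main

import (
	"log"
	"sync"

	"github.com/apache/arrow/go/v12/parquet/file"
)

func main() {
	// memoryMap = false: plain os.Open + ReadAt, per the suggestion above.
	rdr, err := file.OpenParquetFile("example.parquet", false) // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	defer rdr.Close()

	cols := []int{0, 1, 2, 3, 4} // hypothetical: the 5 columns the query needs

	for rg := 0; rg < rdr.NumRowGroups(); rg++ {
		rgr := rdr.RowGroup(rg)

		// Create the column chunk readers serially, then consume them in
		// parallel, one goroutine per column.
		var wg sync.WaitGroup
		for _, c := range cols {
			colRdr, err := rgr.Column(c)
			if err != nil {
				log.Fatal(err)
			}
			wg.Add(1)
			go func(c int, cr file.ColumnChunkReader) {
				defer wg.Done()
				// Type-assert cr to the concrete reader for the column's
				// physical type (e.g. *file.Int64ColumnChunkReader) and call
				// ReadBatch in a loop to pull the values.
				_ = cr
			}(c, colRdr)
		}
		wg.Wait()
	}
}
```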
