hkpeaks commented on issue #35688: URL: https://github.com/apache/arrow/issues/35688#issuecomment-1556062355
Firstly, I need to know where to find the start and end address of each row block (Parquet's row groups). Secondly, I need to know where to find the start and end address of each column contained within a row block. For a query that requires 5 columns, I can then use goroutines and mmap to read a given set of row blocks for the selected columns in parallel, one batch at a time, e.g.:

- Row block batch 1: use 20+ threads to read 5 columns of blocks 1~20
- Row block batch 2: use 20+ threads to read 5 columns of blocks 21~40
- Row block batch 3: use 20+ threads to read 5 columns of blocks 41~60

To determine how many row blocks I should read per stream, I need to know how many columns and row blocks a file contains. I have implemented streaming for CSV in a similar way and want to extend it to cover the Parquet and JSON file formats. The Parquet file format offers two advantages: 1) it allows reading selected columns directly, and 2) it supports compression. If the Apache library offers an API that answers the above questions, it would greatly help my project support high-performance handling of the Parquet format.

Data structure of my current project: https://github.com/hkpeaks/peaks-consolidation/blob/main/PeaksFramework/data_structure.go

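The batching plan described above (20+ threads per batch, one per selected column chunk) can be sketched as a goroutine fan-out; the `chunkRef` type and its offsets are fabricated for illustration, standing in for values recovered from the footer metadata:

```go
package main

import (
	"fmt"
	"sync"
)

// chunkRef is a hypothetical descriptor for one column chunk inside one
// row group, as would be recovered from the Parquet footer metadata.
type chunkRef struct {
	rowGroup int
	column   int
	offset   int64 // start address in the file
	size     int64 // compressed size
}

// readChunk stands in for an mmap or ReadAt of the byte range; here it
// just reports what it would read.
func readChunk(c chunkRef) string {
	return fmt.Sprintf("rg=%d col=%d bytes[%d:%d]",
		c.rowGroup, c.column, c.offset, c.offset+c.size)
}

// readBatch fans out one goroutine per (row group, column) pair in the
// batch, mirroring the "20+ threads per batch" plan.
func readBatch(chunks []chunkRef) []string {
	out := make(chan string, len(chunks))
	var wg sync.WaitGroup
	for _, c := range chunks {
		wg.Add(1)
		go func(c chunkRef) {
			defer wg.Done()
			out <- readChunk(c)
		}(c)
	}
	wg.Wait()
	close(out)
	var results []string
	for s := range out {
		results = append(results, s)
	}
	return results
}

func main() {
	// Fabricated metadata: 4 row groups x 2 columns with made-up offsets.
	var chunks []chunkRef
	off := int64(4)
	for rg := 0; rg < 4; rg++ {
		for col := 0; col < 2; col++ {
			chunks = append(chunks, chunkRef{rg, col, off, 128})
			off += 128
		}
	}
	// Process row groups in batches of 2 (the comment uses batches of 20).
	const batch = 2 * 2 // 2 row groups x 2 columns per batch
	for i := 0; i < len(chunks); i += batch {
		results := readBatch(chunks[i : i+batch])
		fmt.Printf("batch %d read %d chunks\n", i/batch+1, len(results))
	}
}
```

Concurrent `ReadAt` calls on one `*os.File` (or reads from one mmap region) are safe, so the chunks of a batch can be read without per-thread file handles.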