hkpeaks commented on issue #35688: URL: https://github.com/apache/arrow/issues/35688#issuecomment-1556062355
Firstly, I need to know where to find the start and end address of each row block (Parquet's row groups). Secondly, I need to know where to find the start and end address of each column contained within a row block. For a query that requires 5 columns, I can then use goroutines and mmap to read a given set of row blocks for the selected columns in parallel, one batch at a time, e.g.:

- Row block batch 1: use 20+ threads to read 5 columns of blocks 1~20
- Row block batch 2: use 20+ threads to read 5 columns of blocks 21~40
- Row block batch 3: use 20+ threads to read 5 columns of blocks 41~60

To determine how many row blocks I should read per stream, I need to know how many columns and row blocks a file contains. I have implemented streaming for CSV in a similar way and want to extend it to cover the Parquet and JSON file formats. The Parquet file format offers two advantages: 1) it allows reading selected columns directly, and 2) it supports compression. If the Apache library offers an API that answers the above questions, it would greatly help my project support high-performance handling of the Parquet format.

Data structure of my current project: https://github.com/hkpeaks/peaks-consolidation/blob/main/PeaksFramework/data_structure.go

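The batching plan described above (20+ threads per batch, one per selected column chunk) can be sketched as a goroutine fan-out; the `chunkRef` type and its offsets are fabricated for illustration, standing in for values recovered from the footer metadata:

```go
package main

import (
	"fmt"
	"sync"
)

// chunkRef is a hypothetical descriptor for one column chunk inside one
// row group, as would be recovered from the Parquet footer metadata.
type chunkRef struct {
	rowGroup int
	column   int
	offset   int64 // start address in the file
	size     int64 // compressed size
}

// readChunk stands in for an mmap or ReadAt of the byte range; here it
// just reports what it would read.
func readChunk(c chunkRef) string {
	return fmt.Sprintf("rg=%d col=%d bytes[%d:%d]",
		c.rowGroup, c.column, c.offset, c.offset+c.size)
}

// readBatch fans out one goroutine per (row group, column) pair in the
// batch, mirroring the "20+ threads per batch" plan.
func readBatch(chunks []chunkRef) []string {
	out := make(chan string, len(chunks))
	var wg sync.WaitGroup
	for _, c := range chunks {
		wg.Add(1)
		go func(c chunkRef) {
			defer wg.Done()
			out <- readChunk(c)
		}(c)
	}
	wg.Wait()
	close(out)
	var results []string
	for s := range out {
		results = append(results, s)
	}
	return results
}

func main() {
	// Fabricated metadata: 4 row groups x 2 columns with made-up offsets.
	var chunks []chunkRef
	off := int64(4)
	for rg := 0; rg < 4; rg++ {
		for col := 0; col < 2; col++ {
			chunks = append(chunks, chunkRef{rg, col, off, 128})
			off += 128
		}
	}
	// Process row groups in batches of 2 (the comment uses batches of 20).
	const batch = 2 * 2 // 2 row groups x 2 columns per batch
	for i := 0; i < len(chunks); i += batch {
		results := readBatch(chunks[i : i+batch])
		fmt.Printf("batch %d read %d chunks\n", i/batch+1, len(results))
	}
}
```

Concurrent `ReadAt` calls on one `*os.File` (or reads from one mmap region) are safe, so the chunks of a batch can be read without per-thread file handles.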