Re: [PR] DRAFT: Parquet 3 metadata with decoupled column metadata [parquet-format]

via GitHub Wed, 22 May 2024 14:48:10 -0700


emkornfield commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1610705909



##########
src/main/thrift/parquet.thrift:
##########
@@ -835,6 +864,65 @@ struct ColumnMetaData {
   16: optional SizeStatistics size_statistics;
 }
 
+struct ColumnChunkMetaDataV3 {

Review Comment:
   > Well, ignoring the fact that Parquet is currently not a sparse format, 
your proposal implies that readers have to do a O(n) search to find a given 
column?
   
   IIUC, finding a column via schema elements today is also O(N), assuming no 
nesting. I think the difference is that today the first thing implementations do 
is create an efficient dictionary structure to amortize the lookup of further columns. 
 
   
   I think if we want fast lookups without building any additional dictionaries 
in memory, we should consider a new stored index structure (or reconsider 
how we organize schema elements, instead of a straight depth-first flattening).
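
   To make the amortization point concrete, here is a minimal sketch of the 
kind of in-memory dictionary a reader might build. It is not taken from any 
existing implementation: `SchemaElement` is a simplified stand-in that keeps only 
`name` and `num_children`, `build_leaf_index` is a hypothetical helper, and it 
assumes the depth-first flattening that parquet.thrift documents for the schema list.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SchemaElement:
    """Simplified stand-in for the Thrift SchemaElement (illustrative only)."""
    name: str
    num_children: int = 0  # 0 is treated as a leaf column in this sketch


def build_leaf_index(schema: List[SchemaElement]) -> Dict[str, int]:
    """Single O(N) pass: map each dotted leaf path to its leaf column ordinal.

    Assumes schema[0] is the root and the list is flattened depth-first.
    """
    index: Dict[str, int] = {}
    path: List[str] = []                      # names of open groups below the root
    remaining = [schema[0].num_children]      # unvisited children per open group
    leaf_ordinal = 0
    for element in schema[1:]:
        remaining[-1] -= 1                    # this element consumes one child slot
        if element.num_children > 0:
            # Group node: descend into it.
            path.append(element.name)
            remaining.append(element.num_children)
        else:
            # Leaf node: record its ordinal under the full dotted path.
            index[".".join(path + [element.name])] = leaf_ordinal
            leaf_ordinal += 1
        # Close every group whose children have all been visited.
        while len(remaining) > 1 and remaining[-1] == 0:
            remaining.pop()
            path.pop()
    return index


schema = [
    SchemaElement("root", 3),
    SchemaElement("a"),
    SchemaElement("b", 2),
    SchemaElement("c"),
    SchemaElement("d"),
    SchemaElement("e"),
]
leaf_index = build_leaf_index(schema)
```

   After that one pass, each further lookup such as `leaf_index["b.d"]` is an 
average O(1) hash probe, which is the cost a stored index structure would let 
readers skip.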



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

