[ 
https://issues.apache.org/jira/browse/KUDU-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329541#comment-16329541
 ] 

Adar Dembo commented on KUDU-2263:
----------------------------------

I think being able to omit the descriptor set is a very valuable property, 
especially when troubleshooting. I don't want to have to go digging up a 
particular descriptor set and worry about how the schema has evolved since this 
PB file was written, or whether the descriptor set even still exists.

That said, I'd certainly be OK with the "pruning" you described. How much 
space would it save for "small but numerous" PB files (like cmeta and maybe 
tablet superblocks)? Would it be enough to get us under 4k per file?
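
To make the "pruning" idea concrete, here is a minimal sketch of keeping only the types transitively referenced by the file's root type. It is not Kudu code: the function name `prune_descriptors` and the reference graph are made up for illustration, and a real implementation would walk protobuf descriptors rather than a plain dict.

```python
from collections import deque


def prune_descriptors(references, root):
    """Return the set of type names transitively referenced by `root`.

    `references` maps a type name to the type names its fields refer to.
    This is a plain breadth-first reachability walk.
    """
    keep = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for dep in references.get(current, ()):
            if dep not in keep:
                keep.add(dep)
                queue.append(dep)
    return keep


# Hypothetical reference graph, NOT taken from Kudu's .proto files.
references = {
    "ConsensusMetadataPB": ["RaftConfigPB"],
    "RaftConfigPB": ["RaftPeerPB"],
    "RaftPeerPB": ["HostPortPB"],
    "HostPortPB": [],
    "SchemaPB": ["ColumnSchemaPB"],  # unrelated to the root type
    "ColumnSchemaPB": [],
}

kept = prune_descriptors(references, "ConsensusMetadataPB")
```

In this toy graph, SchemaPB and ColumnSchemaPB fall out of the kept set because nothing reachable from the root refers to them. How many bytes that saves in a real cmeta file would still have to be measured.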

 

> Consider removing PB descriptors from PBC header
> ------------------------------------------------
>
>                 Key: KUDU-2263
>                 URL: https://issues.apache.org/jira/browse/KUDU-2263
>             Project: Kudu
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 1.7.0
>            Reporter: Todd Lipcon
>            Priority: Major
>
> Looking at a cmeta file on disk, it seems the vast majority of the bytes are 
> in the supplemental header. We currently serialize the entire descriptor set 
> of the referenced file and its dependencies. This means that in each cmeta 
> file, we end up serializing even things like the definition of SchemaPB – 
> unnecessary to serialize the type at hand and quite large.
>  
> At a minimum we can prune the descriptors serialized to only include those 
> that are transitively referenced by the PB type in the file. I think we 
> should also consider doing away with this information entirely and instead 
> allow 'kudu pbc dump' to take a descriptor set as external input – it's easy 
> enough to generate a descriptor set from any kudu version source tree using 
> the protoc command line.
>
> One potential major improvement if we can get these files down to <4kb is 
> that we could atomically rewrite them in a single disk IO using O_DIRECT 
> rather than doing a rewrite-rename-fsync dance.
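
For reference, the "rewrite-rename-fsync dance" mentioned above generally looks like the sketch below. This is the generic POSIX pattern, not Kudu's actual implementation (which is C++); the function name `atomic_rewrite` is hypothetical, and the O_DIRECT alternative is not shown.

```python
import os
import tempfile


def atomic_rewrite(path, data):
    """Replace the contents of `path` with `data` via a temp file and rename.

    Steps: write a temp file in the same directory, fsync it, rename it
    over the target, then fsync the directory so the rename is durable.
    Assumes a POSIX filesystem.
    """
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make the new contents durable
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    dir_fd = os.open(dir_name, os.O_RDONLY)
    try:
        os.fsync(dir_fd)  # make the directory entry change durable
    finally:
        os.close(dir_fd)
```

Each rewrite here costs a file creation, two fsync calls, and a rename, which is the overhead a single-IO rewrite of a sub-4kb file would aim to avoid.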



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
