[ 
https://issues.apache.org/jira/browse/KUDU-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329700#comment-16329700
 ] 

Adar Dembo commented on KUDU-2263:
----------------------------------

bq. But isn't that the point of protobuf evolution rules? eg don't remove 
fields, just mark them "obsolete", etc?

Yes, our adherence to protobuf's schema evolution guidelines should eliminate 
that particular class of worries.

bq. In the rare case that we did remove a long-dead field from the protobuf (eg 
to save space in the struct in RAM) it would still show up in the dump as an 
unknown field and emit the field ID, which could then easily be referenced 
against the file (or you could go back to an earlier release's pbc dump tool if 
necessary)

Do you consider "removing a long-dead field" to be equivalent to "no longer 
producing a particular PB file and eliminating its message type from the source 
tree"? The latter seems more plausible than the former, e.g. if we start storing 
cmeta in a dedicated metadata store like RocksDB. I don't like the idea of 
dredging up an old release's CLI for this case; we often use newer CLIs against 
older releases, so it'd be unfortunate to introduce one case where that's not a 
good idea.

That said, I'll concede that when I originally added the supplemental header, it 
was intended for PB files where the header's cost is amortized over a large 
number of PB records (i.e. for LBM metadata files). Perhaps the middle ground 
here is to parameterize the inclusion of the header and exclude it from cmeta 
and any other "small size, high number" PB files?
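
A rough sketch of the amortization argument and of what "parameterizing the 
inclusion of the header" could look like. Everything here is an assumption for 
illustration: the length-prefixed layout is NOT Kudu's actual PBC format, and 
encode_container, header_fraction, and the byte sizes are made-up names and 
numbers.

```python
import struct


def encode_container(records, descriptor_blob, include_header=True):
    """Encode records into a toy container (hypothetical layout, not real PBC).

    Layout: 4-byte little-endian header length, the header bytes (empty when
    the header is excluded), then each record as a 4-byte length plus payload.
    """
    header = descriptor_blob if include_header else b""
    out = bytearray()
    out += struct.pack("<I", len(header))
    out += header
    for rec in records:
        out += struct.pack("<I", len(rec))
        out += rec
    return bytes(out)


def header_fraction(records, descriptor_blob):
    """Fraction of the encoded file taken up by the supplemental header."""
    total = len(encode_container(records, descriptor_blob, include_header=True))
    return len(descriptor_blob) / total


# Illustrative sizes only: a 4000-byte descriptor set and 100-byte records.
DESCRIPTORS = b"d" * 4000
SMALL_FILE = [b"r" * 100]            # one record, like a cmeta-style file
LARGE_FILE = [b"r" * 100] * 10000    # many records, like an LBM-style file
```

With these made-up sizes the header accounts for most of the single-record 
file and well under 1% of the many-record file, which is the case for 
excluding it only from the "small size, high number" files.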


> Consider removing PB descriptors from PBC header
> ------------------------------------------------
>
>                 Key: KUDU-2263
>                 URL: https://issues.apache.org/jira/browse/KUDU-2263
>             Project: Kudu
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 1.7.0
>            Reporter: Todd Lipcon
>            Priority: Major
>
> Looking at a cmeta file on disk, it seems the vast majority of the bytes are 
> in the supplemental header. We currently serialize the entire descriptor set 
> of the referenced file and its dependencies. This means that in each cmeta 
> file, we end up serializing even things like the definition of SchemaPB – 
> unnecessary to serialize the type at hand and quite large.
>  
> At a minimum we can prune the descriptors serialized to only include those 
> that are transitively referenced by the PB type in the file. I think we 
> should also consider doing away with this information entirely and instead 
> allow 'kudu pbc dump' to take a descriptor set as external input – it's easy 
> enough to generate a descriptor set from any kudu version source tree using 
> the protoc command line.
> One potential major improvement if we can get these files down to <4kb is 
> that we could atomically rewrite them in a single disk IO using O_DIRECT 
> rather than doing a rewrite-rename-fsync dance.
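
The "transitively referenced" pruning suggested in the description above 
amounts to a graph traversal from the file's PB type. A minimal sketch, 
assuming the type references have already been extracted into a plain dict; 
apart from SchemaPB, the type names and the reference graph below are 
hypothetical and do not reflect Kudu's actual .proto definitions.

```python
from collections import deque


def transitive_closure(root, references):
    """Return the set of type names reachable from root, including root.

    references maps a type name to the type names its fields refer to.
    """
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for dep in references.get(current, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


# Hypothetical reference graph: the cmeta-like type never reaches SchemaPB.
EXAMPLE_REFERENCES = {
    "CMetaPB": ["ConfigPB"],
    "ConfigPB": ["PeerPB"],
    "PeerPB": [],
    "SchemaPB": ["ColumnPB"],
    "ColumnPB": [],
}

# Only these types would need descriptors in the pruned header.
NEEDED = transitive_closure("CMetaPB", EXAMPLE_REFERENCES)
```

In this toy graph SchemaPB and ColumnPB fall outside the closure, so their 
descriptors would be dropped from the header.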



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
