Todd Lipcon created KUDU-2263:
---------------------------------

             Summary: Consider removing PB descriptors from PBC header
                 Key: KUDU-2263
                 URL: https://issues.apache.org/jira/browse/KUDU-2263
             Project: Kudu
          Issue Type: Improvement
          Components: util
    Affects Versions: 1.7.0
            Reporter: Todd Lipcon


Looking at a cmeta file on disk, it seems the vast majority of the bytes are in 
the supplemental header. We currently serialize the entire descriptor set of 
the referenced file and its dependencies. This means that in each cmeta file, 
we end up serializing even things like the definition of SchemaPB, which is not 
needed to describe the type at hand and is quite large.


At a minimum we can prune the descriptors serialized to only include those that 
are transitively referenced by the PB type in the file. I think we should also 
consider doing away with this information entirely and instead allow 'kudu pbc 
dump' to take a descriptor set as external input – it's easy enough to generate 
a descriptor set from any kudu version source tree using the protoc command 
line.
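
The pruning described above amounts to a reachability walk over type 
references. Below is a minimal sketch in Python, not Kudu code: the function 
name transitive_closure, the REFERENCES map, and the edges between the type 
names are all assumptions made for illustration. A real implementation would 
walk the protobuf descriptors rather than a hand-written dict.

```python
from collections import deque


def transitive_closure(root, references):
    """Return the set of type names reachable from `root`.

    `references` maps a message type name to the type names its fields use.
    """
    needed = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for dep in references.get(current, ()):
            if dep not in needed:
                needed.add(dep)
                queue.append(dep)
    return needed


# Hypothetical reference graph, for illustration only. The edges are
# assumptions and do not claim to match Kudu's actual .proto definitions.
REFERENCES = {
    "ConsensusMetadataPB": ["RaftConfigPB"],
    "RaftConfigPB": ["RaftPeerPB"],
    "RaftPeerPB": ["HostPortPB"],
    "HostPortPB": [],
    "SchemaPB": ["ColumnSchemaPB"],
    "ColumnSchemaPB": [],
}

# Descriptors to keep for a file holding the root type, and those to drop.
kept = transitive_closure("ConsensusMetadataPB", REFERENCES)
dropped = sorted(set(REFERENCES) - kept)
print(sorted(kept))
print(dropped)
```

In this toy graph, SchemaPB and ColumnSchemaPB are unreachable from the root 
type, so their descriptors would be left out of the header.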

One potentially major improvement: if we can get these files down to under 
4 KB, we could atomically rewrite them in a single disk IO using O_DIRECT 
rather than doing a rewrite-rename-fsync dance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
