Re: [protobuf] Suggesting a new Wire Type which encodes length by nested field count rather than bytes

Marc Gravell Sun, 18 Apr 2010 01:02:23 -0700

And actually, an equivalent *already* exists: simply define your message
using groups rather than nested sub-messages. A group is delimited by
start-group and end-group tags instead of a byte-length prefix, so it is
very efficient to *write*. The disadvantage is the same as with your
proposal: if the data is an unexpected member, the reader needs to inspect
each field (although maybe not parse it entirely) to skip it. And it doesn't
involve adding a wire-type with such a specific intent (there isn't room for
many: the wire type is a 3-bit value, and 6 of the 8 values are taken).
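
To make the comparison concrete, here is a minimal Python sketch of both encodings. It is hand-rolled rather than taken from any protobuf library, and the helper names (write_as_submessage, write_as_group, skip_field) and the choice of inner field number 1 are my own assumptions for illustration. The sub-message writer has to buffer the body to learn its byte length; the group writer can stream from a generator, but a reader can only skip a group by walking its fields.

```python
import io

VARINT, FIXED64, LEN, START_GROUP, END_GROUP, FIXED32 = 0, 1, 2, 3, 4, 5

def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def read_varint(buf: bytes, pos: int):
    """Decode a varint at pos; return (value, position after it)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

def tag(field_number: int, wire_type: int) -> bytes:
    return encode_varint((field_number << 3) | wire_type)

def write_as_submessage(out, field_number, values):
    # Length-delimited: the whole body must exist before the prefix is written.
    body = b"".join(tag(1, VARINT) + encode_varint(v) for v in values)
    out.write(tag(field_number, LEN) + encode_varint(len(body)) + body)

def write_as_group(out, field_number, values):
    # Group: no length prefix, so values may be a lazy generator.
    out.write(tag(field_number, START_GROUP))
    for v in values:
        out.write(tag(1, VARINT) + encode_varint(v))
    out.write(tag(field_number, END_GROUP))

def skip_field(buf: bytes, pos: int, wire_type: int, field_number: int) -> int:
    """Skip one field value whose tag was already read; return the new position."""
    if wire_type == VARINT:
        return read_varint(buf, pos)[1]
    if wire_type == FIXED64:
        return pos + 8
    if wire_type == FIXED32:
        return pos + 4
    if wire_type == LEN:
        length, pos = read_varint(buf, pos)
        return pos + length  # one jump, no inspection of the contents
    if wire_type == START_GROUP:
        while True:  # must walk every field until the matching end tag
            key, pos = read_varint(buf, pos)
            inner_type, inner_number = key & 7, key >> 3
            if inner_type == END_GROUP:
                if inner_number != field_number:
                    raise ValueError("mismatched end-group tag")
                return pos
            pos = skip_field(buf, pos, inner_type, inner_number)
    raise ValueError(f"unexpected wire type {wire_type}")
```

For example, write_as_group(out, 2, (v for v in [1, 300, 5])) never holds the encoded body in memory, whereas write_as_submessage builds all of it before writing the first byte.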


Marc


On 18 April 2010 00:59, Sebastian Markbåge <sebast...@calyptus.eu> wrote:

> SUMMARY
>
> I’d like to add a new wire type similar to Length-delimited for
> embedded messages and packed repeated fields. However, instead of
> specifying the byte-length, the value will be preceded by the number
> of fields within it.
>
> PROBLEM
>
> Protocol Buffers currently require that it’s possible to calculate the
> full byte length of a message before the serialization process.
>
> This becomes a problem for large data sets where it’s costly to keep
> the full prepared message structure in memory at once. This data may
> be computed at serialization time, it may be derived from other
> in-memory data, or it may be read and derived from another source such
> as a database.
>
> Essentially, other than storing the full message structure in memory
> or on disk, the only solution is to calculate the message structure
> twice. Neither is a great option for performance.
>
> ALTERNATIVES
>
> Let’s say we have a message consisting of 1000 embedded large
> messages.
>
> I would assume the suggested alternative is to write a custom
> serialization format that packs the embedded Protobuf messages within
> it. This is fairly simple.
>
> Now let’s say we have 100 embedded messages that each contains 1000
> nested messages. Now things get more complicated. We could keep our
> large set of messages in a separate indexed format and perhaps
> reference each sub-set from a standard set of Protobuf messages.
>
> As the messages get more complex, it becomes more difficult to
> maintain the custom format around them.
>
> This essentially means that Protocol Buffers isn’t suitable for large
> sets of nested data. This may be where we start looking elsewhere.
>
> SOLUTION
>
> My solution is based on the assumption that it’s fairly cheap to
> derive the total number of items that’s going to be within a result
> set without actually loading all data within it. E.g. it’s easy to
> derive the number of items returned by a result set in a relational
> database without loading the actual data rows. We certainly don’t have
> to load any relationship tables that may contain nested data.
>
> Another case is if you have a large in-memory application structure
> that needs to be serialized before being sent over the wire. Imagine a
> complex graph structure or 3D drawing. The in-memory representation
> may be very different from the serialized form. Computing that
> serialization format twice is expensive. Duplicating it in memory is
> also expensive. But you probably know the number of nodes or groups
> you’ll have.
>
> Even if we can’t derive the total number of items for every level in a
> message tree, it’s enough to know the total number of messages at the
> first levels. That will at least give us the ability to break the
> result up into manageable chunks.
>
> Now we can use this fact to add another wire type, similar to Length-
> delimited. Except instead of specifying the number of bytes in a
> nested message or packed repeated fields, we specify the number of
> fields at the first level. Each single field within it still specifies
> its own byte length by its wire type.
>
> Note: For nested messages or packed repeated fields we only need to
> specify the number of fields directly within it. We don’t have to
> count the number of fields within further nested messages.
>
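
A minimal Python sketch of the count-prefixed encoding described above, to show what writing and skipping would involve. Nothing here exists in protobuf: the wire type number 6 (COUNTED) is an arbitrary unused value picked for illustration, and write_counted and skip_counted are hypothetical helpers. The inner fields are limited to varints to keep the writer short.

```python
import io

COUNTED = 6  # hypothetical wire type; NOT part of the protobuf wire format

def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def read_varint(buf: bytes, pos: int):
    """Decode a varint at pos; return (value, position after it)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

def write_counted(out, field_number: int, count: int, fields) -> None:
    """Write tag, field count, then stream (number, int value) varint fields."""
    out.write(encode_varint((field_number << 3) | COUNTED))
    out.write(encode_varint(count))  # number of direct child fields, not bytes
    written = 0
    for number, value in fields:  # may be a lazy generator
        out.write(encode_varint((number << 3) | 0))
        out.write(encode_varint(value))
        written += 1
    if written != count:
        raise ValueError("announced field count does not match fields written")

def skip_counted(buf: bytes, pos: int) -> int:
    """Skip a counted value whose tag was already read; return the new position."""
    count, pos = read_varint(buf, pos)
    for _ in range(count):  # every direct child must be stepped over
        key, pos = read_varint(buf, pos)
        wire_type = key & 7
        if wire_type == 0:
            _, pos = read_varint(buf, pos)
        elif wire_type == 1:
            pos += 8
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            pos += length
        elif wire_type == 5:
            pos += 4
        elif wire_type == COUNTED:
            pos = skip_counted(buf, pos)
        else:
            raise ValueError(f"unexpected wire type {wire_type}")
    return pos
```

The writer needs only the count up front, while the reader's skip cost grows with the number of direct child fields instead of being a single jump.
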
> OUT OF SCOPE?
>
> Now I realize that Protobuf isn’t really designed to work with large
> datasets like this. So this may be out of the scope of the project. I
> thought I’d mention it since this is something I run into fairly
> often. I would think that the case of large record sets in a
> relational database is fairly common.
>
> The solution is fairly simple. It makes Protocol Buffers more
> versatile and even more useful as a de facto standard interchange
> format within more organizations.
>
> The problem with this approach is that it’s not as easy to skip ahead
> over an entire nested message without parsing it, for example if you
> wanted to load the nth message within a set of repeated fields and the
> messages themselves use this new wire type. Personally, I don’t run
> into this very often because you usually need some data within the
> message to know whether you can skip it. You can’t always assume that
> information will be at the top. So you end up parsing the message.
> Even if you do, you can just use this option at the first level.
>
> There’s always a trade-off between serialization and deserialization
> costs. This addition would give us one more optimization route.
>
> --
> You received this message because you are subscribed to the Google Groups
> "Protocol Buffers" group.
> To post to this group, send email to proto...@googlegroups.com.
> To unsubscribe from this group, send email to
> protobuf+unsubscr...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/protobuf?hl=en.
>
>


-- 
Regards,

Marc

