Thank you, Marc. I see; I wasn't clear about the group wire types. Is
there any further documentation about their deprecated status? I can't
seem to find anything except this old thread:

http://groups.google.com/group/protobuf/browse_thread/thread/50aa6cb61a809a3c/bb0c4f2a80e72411

The benefit of having the number of fields is that you can allocate a
reasonable amount of buffer space up front, e.g. an array of pointers
in memory. In my case it also results in one less byte per set than
start/end groups, although I don't think that alone is a big enough
benefit to introduce a new wire type, considering the limited wire-type
space.
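As a minimal sketch of what the count buys a reader (the reader callbacks here are hypothetical, not any real protobuf API): with the field count known up front, the parser can size its buffer exactly once instead of growing it as fields arrive.

```python
# Hypothetical sketch: "read_varint" and "read_field" stand in for
# whatever low-level readers the parser uses; they are not a real API.
def parse_counted(read_varint, read_field):
    count = read_varint()        # the proposed field-count prefix
    values = [None] * count      # one right-sized allocation up front
    for i in range(count):
        values[i] = read_field()
    return values
```

With a length prefix you know how many bytes to skip; with a count prefix you know how many slots to allocate.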

Jason, the point is to be able to include large nested messages. In
that case it's more difficult to split them into smaller fragments,
because you also need to split the parent messages. Let's say I have 1
root message containing 10 messages that each contain 100 nested
messages, and each of those has many fields.

The client doesn't need to fit it all in memory. It can put the result
in a database buffer, split it up, or send it along to new targets as
it parses the file.
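As a rough sketch of that kind of streaming consumption (assuming plain varint length prefixes between top-level records, which is an assumption for illustration, not any particular library's API), the client can hand each record off as soon as it is parsed:

```python
# Sketch: yield length-prefixed records one at a time from a stream,
# so the whole file never has to sit in memory.
def iter_delimited(stream):
    while True:
        shift = length = 0
        while True:              # read one varint length prefix
            b = stream.read(1)
            if not b:            # clean EOF between records
                return
            length |= (b[0] & 0x7F) << shift
            if not b[0] & 0x80:
                break
            shift += 7
        yield stream.read(length)

# Each record can be stored or forwarded immediately:
# for payload in iter_delimited(open("big.bin", "rb")):
#     store_or_forward(payload)   # hypothetical handler
```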

To evaluate the full benefit we should look at the full application
usage, not just the serialization process in isolation. Even if the
source/target content is a memory representation, that memory
representation is often different from the message representation, as
in the .NET implementations. By targeting that representation
directly, there's no need to double the memory usage.

About unknown fields: the count would indicate the number of fields
directly within the message. If one of those happens to be an unknown
field, that doesn't matter, as long as it uses a known wire type.

If it's an unknown wire type, parsing would have to fail. But so would
groups: a client can't just scan for the corresponding end tag, since
the payload of an unknown wire type may contain the same bytes as the
end tag. It's the same issue.
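For reference, here's roughly how skipping works per wire type (a simplified sketch of the standard skip logic, not code from any implementation). The group branch has to recurse on the nested wire types, so an unknown wire type anywhere inside a group makes the whole group unskippable as well, which is exactly the issue above:

```python
def read_varint(stream):
    shift = result = 0
    while True:
        b = stream.read(1)[0]
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result
        shift += 7

def skip_field(stream, wire_type):
    if wire_type == 0:                       # varint
        read_varint(stream)
    elif wire_type == 1:                     # fixed 64-bit
        stream.read(8)
    elif wire_type == 2:                     # length-delimited
        stream.read(read_varint(stream))
    elif wire_type == 3:                     # start group: parse nested tags
        while True:
            tag = read_varint(stream)
            if tag & 7 == 4:                 # matching end-group tag
                break
            skip_field(stream, tag & 7)      # fails on unknown types too
    elif wire_type == 5:                     # fixed 32-bit
        stream.read(4)
    else:                                    # 6, 7 (or a stray end group)
        raise ValueError("unknown wire type %d" % wire_type)
```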

Now, the start/end group wire types do solve my problem sufficiently
well. The only problem is their deprecated status.

From my understanding, groups were removed because of the problematic
syntax and perceived redundancy. I think the syntax issue was resolved
in the thread mentioned above, and I think we also have solid use cases
where groups aren't redundant. Would you consider bringing tag-delimited
encoding back?

On Apr 19, 5:09 pm, Jason Hsueh <jas...@google.com> wrote:
> It's not clear to me what the number of fields buys you. How do you intend
> to serialize this large data if it doesn't fit into memory? The usual
> solution to this is to split the large message into smaller fragments - in
> your example, say, 10 messages each containing 100 nested messages. You can
> serialize them separately - the concatenation of serialized data will also
> be valid. When the client parses it, the repeated fields will be
> concatenated. Of course, the client probably can't fit this data into memory
> either, so you really need to write them as separate records so that the
> client can parse them piece by piece. I think that the situation is the
> same, or even worse, if you were to use the number of fields, but again, I'm
> not sure I exactly see where you're going with this.
>
> Furthermore, the length-delimited format is important when dealing with
> unknown fields. Suppose the client reading the message doesn't know about a
> field that uses this new wire type. When the client encounters the tag
> number for this field, it doesn't have the type definition for the
> submessage and won't be able to match the number of fields to the serialized
> data. With the length-delimited format, it knows exactly how many bytes to
> skip, and with groups, it just needs to find the corresponding end tag.
>
> On Sat, Apr 17, 2010 at 4:59 PM, Sebastian Markbåge
> <sebast...@calyptus.eu>wrote:
>
> > SUMMARY
>
> > I’d like to add a new wire type similar to Length-delimited for
> > embedded messages and packed repeated fields. However, instead of
> > specifying the byte-length, the value will be preceded by the number
> > of fields within it.
>
> > PROBLEM
>
> > Protocol Buffers currently require that it’s possible to calculate the
> > full byte length of a message before the serialization process.
>
> > This becomes a problem for large data sets where it’s costly to keep
> > the full prepared message structure in memory at a time. This data may
> > be computed at the serialization time, it may be derived from other in-
> > memory data or it may be read and derived from another source such as
> > a database.
>
> > Essentially, other than storing the full message structure in memory
> > or disk, the only solution is to calculate the message structure
> > twice. Neither is a great option for performance.
>
> > ALTERNATIVES
>
> > Let’s say we have a message consisting of 1000 embedded large
> > messages.
>
> > I would assume the suggested alternative is to write a custom
> > serialization format that packs the embedded Protobuf messages within
> > it. This is fairly simple.
>
> > Now let’s say we have 100 embedded messages that each contains 1000
> > nested messages. Now things get more complicated. We could keep our
> > large set of messages in a separate indexed format and perhaps
> > reference each sub-set from a standard set of Protobuf messages.
>
> > As the messages get more complex, it becomes more difficult to
> > maintain the custom format around them.
>
> > This essentially means that Protocol Buffers isn’t suitable for large
> > sets of nested data. This may be where we start looking elsewhere.
>
> > SOLUTION
>
> > My solution is based on the assumption that it’s fairly cheap to
> > derive the total number of items that are going to be within a result
> > set without actually loading all data within it. E.g. it’s easy to
> > derive the number of items returned by a result set in a relational
> > database without loading the actual data rows. We certainly don’t have
> > to load any relationship tables that may contain nested data.
>
> > Another case is if you have a large in-memory application structure
> > that needs to be serialized before being sent over the wire. Imagine a
> > complex graph structure or 3D drawing. The in-memory representation
> > may be very different from the serialized form. Computing that
> > serialization format twice is expensive. Duplicating it in memory is
> > also expensive. But you probably know the number of nodes or groups
> > you’ll have.
>
> > Even if we can’t derive the total number of items for every level in a
> > message tree, it’s enough to know the total number of messages at the
> > first levels. That will at least give us the ability to break the
> > result up into manageable chunks.
>
> > Now we can use this fact to add another wire type, similar to Length-
> > delimited. Except instead of specifying the number of bytes in a
> > nested message or packed repeated fields, we specify the number of
> > fields at the first level. Each single field within it still specifies
> > its own byte length by its wire type.
>
> > Note: For nested messages or packed repeated fields we only need to
> > specify the number of fields directly within it. We don’t have to
> > count the number of fields within further nested messages.
>
> > OUT OF SCOPE?
>
> > Now I realize that Protobuf isn’t really designed to work with large
> > datasets like this. So this may be out of the scope of the project. I
> > thought I’d mention it since this is something I run into fairly
> > often. I would think that the case of large record sets in a
> > relational database is fairly common.
>
> > The solution is fairly simple and versatile. It makes Protocol Buffers
> > more versatile and even more useful as a de facto standard interchange
> > format within more organizations.
>
> > The problem with this approach is that it’s not as easy to skip ahead
> > over an entire nested message without parsing it. For example if you
> > wanted to load the nth message within a set of repeated fields and the
> > messages themselves use this new wire type. Personally, I don’t find
> > this very often because you usually need some data within the message
> > to know whether you can skip it. You can’t always assume that
> > information will be at the top. So you end up parsing the message.
> > Even if you do, you can just use this option at the first level.
>
> > There’s always a trade-off between serialization and deserialization
> > costs. This addition would give us one more optimization route.
>
> > --
> > You received this message because you are subscribed to the Google Groups
> > "Protocol Buffers" group.
> > To post to this group, send email to proto...@googlegroups.com.
> > To unsubscribe from this group, send email to
> > protobuf+unsubscr...@googlegroups.com.
> > For more options, visit this group at
> > http://groups.google.com/group/protobuf?hl=en.

