Actually, an equivalent *already* exists: simply define your message using groups rather than nested sub-messages. A group is delimited by start-group and end-group tags instead of a length prefix, so the writer never needs to know the byte length up front. The disadvantage is the same as with your proposal: if this data is an unexpected member, the reader has to inspect each field (though not necessarily parse it fully) in order to skip it. But it is very efficient to *write*, and it doesn't involve adding a wire type with such a specific intent (the wire type is a 3-bit value, so there isn't room for many).
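To make the trade-off concrete, here is a rough sketch in Python. It hand-rolls the wire encoding rather than using the protobuf library, and the function names and the choice of a repeated varint field 1 as the group body are made up purely for illustration. It shows that a group can be streamed without knowing its size, that the length-delimited form needs the whole body first, and that skipping an unknown group means walking its fields.

```python
from typing import Iterable, Iterator, Tuple

# Wire types as defined by the protobuf encoding.
VARINT, FIXED64, LEN, START_GROUP, END_GROUP, FIXED32 = 0, 1, 2, 3, 4, 5


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, next_pos) for the varint starting at pos."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def tag(field_number: int, wire_type: int) -> bytes:
    """Field key: (field_number << 3) | wire_type, as a varint."""
    return encode_varint((field_number << 3) | wire_type)


def write_group(field_number: int, items: Iterable[int]) -> Iterator[bytes]:
    """Stream a group; the body here is a repeated varint field 1 (illustrative)."""
    yield tag(field_number, START_GROUP)
    for item in items:  # items can come from a cursor or generator
        yield tag(1, VARINT) + encode_varint(item)
    yield tag(field_number, END_GROUP)


def write_length_delimited(field_number: int, items: Iterable[int]) -> bytes:
    """Same body as a nested message: it must be fully built to get its length."""
    body = b"".join(tag(1, VARINT) + encode_varint(i) for i in items)
    return tag(field_number, LEN) + encode_varint(len(body)) + body


def skip_field(buf: bytes, pos: int, wire_type: int, field_number: int) -> int:
    """Return the position just past the field whose key was already read."""
    if wire_type == VARINT:
        return decode_varint(buf, pos)[1]
    if wire_type == FIXED64:
        return pos + 8
    if wire_type == LEN:
        length, pos = decode_varint(buf, pos)
        return pos + length  # one jump, no inspection of the body
    if wire_type == FIXED32:
        return pos + 4
    if wire_type == START_GROUP:
        while True:  # no length available: walk every field in the group
            key, pos = decode_varint(buf, pos)
            inner_field, inner_type = key >> 3, key & 7
            if inner_type == END_GROUP:
                if inner_field != field_number:
                    raise ValueError("mismatched end-group tag")
                return pos
            pos = skip_field(buf, pos, inner_type, inner_field)
    raise ValueError(f"unexpected wire type {wire_type}")


# An unknown group (field 1) followed by a known varint (field 2).
stream = b"".join(write_group(1, range(1000))) + tag(2, VARINT) + encode_varint(150)
key, pos = decode_varint(stream, 0)
pos = skip_field(stream, pos, key & 7, key >> 3)  # walks all 1000 inner fields
key, pos = decode_varint(stream, pos)
value, pos = decode_varint(stream, pos)
```

The writer side of `write_group` never buffers more than one item, which is the property you are after. The cost shows up only in `skip_field`, where the group branch loops over the contents while the length-delimited branch is a single jump.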
Marc

On 18 April 2010 00:59, Sebastian Markbåge <sebast...@calyptus.eu> wrote:
> SUMMARY
>
> I’d like to add a new wire type similar to Length-delimited for
> embedded messages and packed repeated fields. However, instead of
> specifying the byte-length, the value will be preceded by the number
> of fields within it.
>
> PROBLEM
>
> Protocol Buffers currently require that it’s possible to calculate the
> full byte length of a message before the serialization process.
>
> This becomes a problem for large data sets where it’s costly to keep
> the full prepared message structure in memory at a time. This data may
> be computed at the serialization time, it may be derived from other
> in-memory data or it may be read and derived from another source such
> as a database.
>
> Essentially, other than storing the full message structure in memory
> or on disk, the only solution is to calculate the message structure
> twice. Neither is a great option for performance.
>
> ALTERNATIVES
>
> Let’s say we have a message consisting of 1000 embedded large
> messages.
>
> I would assume the suggested alternative is to write a custom
> serialization format that packs the embedded Protobuf messages within
> it. This is fairly simple.
>
> Now let’s say we have 100 embedded messages that each contains 1000
> nested messages. Now things get more complicated. We could keep our
> large set of messages in a separate indexed format and perhaps
> reference each sub-set from a standard set of Protobuf messages.
>
> As the messages get more complex, it becomes more difficult to
> maintain the custom format around them.
>
> This essentially means that Protocol Buffers isn’t suitable for large
> sets of nested data. This may be where we start looking elsewhere.
>
> SOLUTION
>
> My solution is based on the assumption that it’s fairly cheap to
> derive the total number of items that’s going to be within a result
> set without actually loading all data within it. E.g. it’s easy to
> derive the number of items returned by a result set in a relational
> database without loading the actual data rows. We certainly don’t have
> to load any relationship tables that may contain nested data.
>
> Another case is if you have a large in-memory application structure
> that needs to be serialized before being sent over the wire. Imagine a
> complex graph structure or 3D drawing. The in-memory representation
> may be very different from the serialized form. Computing that
> serialization format twice is expensive. Duplicating it in memory is
> also expensive. But you probably know the number of nodes or groups
> you’ll have.
>
> Even if we can’t derive the total number of items for every level in a
> message tree, it’s enough to know the total number of messages at the
> first levels. That will at least give us the ability to break the
> result up into manageable chunks.
>
> Now we can use this fact to add another wire type, similar to
> Length-delimited. Except instead of specifying the number of bytes in a
> nested message or packed repeated fields, we specify the number of
> fields at the first level. Each single field within it still specifies
> its own byte length by its wire type.
>
> Note: For nested messages or packed repeated fields we only need to
> specify the number of fields directly within it. We don’t have to
> count the number of fields within further nested messages.
>
> OUT OF SCOPE?
>
> Now I realize that Protobuf isn’t really designed to work with large
> datasets like this. So this may be out of the scope of the project. I
> thought I’d mention it since this is something I run into fairly
> often. I would think that the case of large record sets in a
> relational database is fairly common.
>
> The solution is fairly simple and versatile. It makes Protocol Buffers
> more versatile and even more useful as a de facto standard interchange
> format within more organizations.
>
> The problem with this approach is that it’s not as easy to skip ahead
> over an entire nested message without parsing it. For example if you
> wanted to load the nth message within a set of repeated fields and the
> messages themselves use this new wire type. Personally, I don’t find
> this very often because you usually need some data within the message
> to know whether you can skip it. You can’t always assume that
> information will be at the top. So you end up parsing the message.
> Even if you do, you can just use this option at the first level.
>
> There’s always a trade-off between serialization and deserialization
> costs. This addition would give us one more optimization route.
>
> --
> You received this message because you are subscribed to the Google Groups
> "Protocol Buffers" group.
> To post to this group, send email to proto...@googlegroups.com.
> To unsubscribe from this group, send email to
> protobuf+unsubscr...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/protobuf?hl=en.

--
Regards,
Marc