I am new to protobuf, and reading
https://developers.google.com/protocol-buffers/docs/techniques#large-data made
me wonder what the options are and whether there are any pointers to how this
has been handled in real-world projects. I have a similar use case: breaking a
large blob into sub-blobs when serializing them into separate Bigtable columns,
with some going into a different table. The challenge is building the objects
back up during the deserialization stage and inferring their types dynamically.
For example,
Let's say we have email messages, and there are billions of these coming in
with a max size of 100MB each. We persist them into a storage system like
Bigtable or MySQL:
message Email {
  Headers headers = 1;
  Meta meta = 2;
  EmailBody body = 3;
  string id = 4;
}

message Meta {
  bool has_attachments = 1;
  Flags flags = 2;
  Subject subject = 3;
  From from = 4;
  To to = 5;
}

message Headers {
  From from = 1;
  To to = 2;
  // ... some 100 fields in email headers
}

message From {
  string email = 1;
  string name = 2;
}

message EmailBody {
  google.protobuf.Any body = 1; // multipart MIME stored as (type, bytes)
}
When data comes in we receive the entire email message, but it needs to be
broken down before persistence happens.
Let's say we have three tables:

Email Table       -> 2 columns (header (bytes), body (bytes))
EmailMeta Table   -> 1 column  (meta (bytes))
EmailHeader Table -> 3 columns (emailId, headerId, headerData), with rows like
                     emailid-uuid, from, bytes(from-data)
                     emailid-uuid, to,   bytes(to-data)
For the sake of discussion, let's say we want to keep each message type inside
the Headers message as a separate row. Some fields, like Meta, are frequently
accessed, so they need to go into a separate storage table from the Email
table. I am using Java.
*Now some questions*
1. Before serializing into storage, I need to chop the incoming email payload
into multiple byte[]. Is there a recommended way to split the message and
manage the schemas separately? (A sketch of what I have in mind is after the
questions.)
2. When reading, I need to dynamically infer the Java type to deserialize the
payload for each header. I read all the headers from the header table; each
"*header value*" maps to a specific Java type, but I don't know which one. I
need to build these objects dynamically based on the "*headerId*", and the
complete Headers object needs to be assembled from some 20 header rows.
I tried mergeFrom, partially building, and parsing the bytes then merging, but
none of these achieve the intended result. (My current attempt is sketched
after the questions.)
3. Curious what patterns/techniques exist for such cases, and whether any open
source projects already do this kind of dynamic inference and mapping.
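
To make question 1 concrete, here is a minimal sketch of the split I have in
mind, assuming the generated Java classes from the .proto above. EmailSplitter
and PersistablePieces are just names I made up for illustration:

public final class EmailSplitter {

  // Each table/column gets its own byte[]: the Email is parsed once, then each
  // sub-message is serialized on its own so it can be written to its own column.
  public static final class PersistablePieces {
    public final byte[] headerBytes; // Email Table, "header" column
    public final byte[] bodyBytes;   // Email Table, "body" column
    public final byte[] metaBytes;   // EmailMeta Table, "meta" column

    PersistablePieces(byte[] headerBytes, byte[] bodyBytes, byte[] metaBytes) {
      this.headerBytes = headerBytes;
      this.bodyBytes = bodyBytes;
      this.metaBytes = metaBytes;
    }
  }

  public static PersistablePieces split(Email email) {
    return new PersistablePieces(
        email.getHeaders().toByteArray(),
        email.getBody().toByteArray(),
        email.getMeta().toByteArray());
  }
}

Each piece is still a protobuf of a known message type, so in principle no
extra schema has to be stored next to the bytes, but I'm not sure this is the
recommended way.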
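For question 2, this is roughly what I am trying to get to: look each headerId
up in the Headers descriptor and let protobuf pick the Java type, rather than
hard-coding one type per header. It assumes my own convention that headerId
equals the field name in Headers ("from", "to", ...); HeadersAssembler is a
name I made up:

import com.google.protobuf.Descriptors.Descriptor;
import com.google.protobuf.Descriptors.FieldDescriptor;
import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.Message;
import java.util.Map;

public final class HeadersAssembler {

  // rows: headerId -> serialized header value, as read from the EmailHeader table.
  public static Headers assemble(Map<String, byte[]> rows)
      throws InvalidProtocolBufferException {
    Headers.Builder builder = Headers.newBuilder();
    Descriptor descriptor = Headers.getDescriptor();

    for (Map.Entry<String, byte[]> row : rows.entrySet()) {
      FieldDescriptor field = descriptor.findFieldByName(row.getKey());
      if (field == null || field.getJavaType() != FieldDescriptor.JavaType.MESSAGE) {
        continue; // unknown headerId, or not a message-typed field; skip or log
      }
      // newBuilderForField returns a builder of the right generated type for
      // this field, so the concrete Java type comes from the schema at runtime.
      Message value =
          builder.newBuilderForField(field).mergeFrom(row.getValue()).build();
      builder.setField(field, value);
    }
    return builder.build();
  }
}

Is something along these lines reasonable, or is there a better-established
pattern for this?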