I am new to protobuf, and reading 
https://developers.google.com/protocol-buffers/docs/techniques#large-data 
made me wonder what the options are, and whether there are any pointers to how 
this has been handled in real-world projects. I have a similar use case: breaking 
a large blob into sub-blobs when serializing, storing them in separate Bigtable 
columns, with some going into a different table. The challenge is at the 
deserialization stage, when building the objects back up and inferring their 
types dynamically.

For example, 

Let's say we have an email message, and billions of these come in with a max 
size of 100MB each. We persist them into a storage system like Bigtable or 
MySQL.

syntax = "proto3";

import "google/protobuf/any.proto";

message Email {
   Headers headers = 1;
   Meta meta = 2;
   EmailBody body = 3;
   string id = 4;
}

message Meta {
    bool has_attachments = 1;
    Flags flags = 2;
    Subject subject = 3;
    From from = 4;
    To to = 5;
}

message Headers {
   From from = 1;
   To to = 2;
   // ... some 100 fields in email headers
}

message From {
   string email = 1;
   string name = 2;
}

message EmailBody {
   google.protobuf.Any body = 1; // multipart MIME stored as (type, bytes)
}
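
For the body, what I have in mind is roughly the following (a rough sketch only; 
MimePart is just a stand-in name for whatever concrete message a MIME part maps 
to, it is not a real type in my schema):

   import com.google.protobuf.Any;

   // Writing: wrap the concrete part message in an Any before storing it.
   Any packed = Any.pack(mimePart);            // mimePart is some generated Message
   byte[] bodyBytes = packed.toByteArray();    // goes into the body (bytes) column

   // Reading: parse the Any back and unpack it to the concrete class.
   Any any = Any.parseFrom(bodyBytes);
   if (any.is(MimePart.class)) {
       MimePart part = any.unpack(MimePart.class);
   }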


When data comes in we get the entire email message, but it needs to be broken 
down before persistence happens.

Let's say we have these tables:

   Email Table        -> 2 columns (header (bytes), body (bytes))
   EmailMeta Table    -> 1 column  (meta (bytes))
   Email Header Table -> 3 columns (emailId, headerId, headerData), e.g.
                            emailid-uuid, from, bytes(from-data)
                            emailid-uuid, to,   bytes(to-data)

For the sake of discussion, let's say we want to keep each message type in the 
Headers message as a separate row.

Some fields, like Meta, are frequently accessed, so they need to go into a 
separate storage table from the Email table. I am using Java.
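
The write-time split I have so far looks roughly like this (a simplified sketch; 
the column keys such as "email:header" are only illustrative):

   import com.google.protobuf.Descriptors.FieldDescriptor;
   import com.google.protobuf.InvalidProtocolBufferException;
   import com.google.protobuf.Message;
   import java.util.HashMap;
   import java.util.Map;

   static Map<String, byte[]> splitForPersistence(byte[] incomingBytes)
           throws InvalidProtocolBufferException {
       Email email = Email.parseFrom(incomingBytes);
       Map<String, byte[]> cells = new HashMap<>();

       // Email table: header and body columns
       cells.put("email:header", email.getHeaders().toByteArray());
       cells.put("email:body", email.getBody().toByteArray());

       // EmailMeta table: meta column
       cells.put("meta:meta", email.getMeta().toByteArray());

       // Email Header table: one row per set header field, keyed by field name as headerId
       for (Map.Entry<FieldDescriptor, Object> entry :
               email.getHeaders().getAllFields().entrySet()) {
           FieldDescriptor fd = entry.getKey();
           if (!fd.isRepeated() && fd.getJavaType() == FieldDescriptor.JavaType.MESSAGE) {
               cells.put("header:" + fd.getName(), ((Message) entry.getValue()).toByteArray());
           }
       }
       return cells;
   }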

*Now some questions*

1. Before serializing into storage, I need to chop the incoming email 
payload up and divide it into multiple byte[]. Is there a recommended way to 
separate the pieces and manage their schemas separately?
2. When reading, I need to dynamically infer the Java type to deserialize the 
payload for each header. I read all the headers from the header table; each 
"*header value*" maps to a specific Java type, but I don't know which one. I 
need to dynamically build these objects based on "*headerId*", and the complete 
Headers object needs to be built from ~20 header rows (see the sketch after this 
list). I have tried mergeFrom, partially building, parseBytes and merging, but 
none of it achieves the intended result.
3. I am curious what patterns/techniques exist for such cases, and whether any 
open-source projects already do this kind of dynamic inference and mapping.
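
Here is roughly what I am attempting for (2): look the field up on the Headers 
descriptor by headerId, get a builder of the right concrete type, and merge each 
row into one Headers builder (headerRows is just a placeholder for whatever comes 
back from the header table):

   import com.google.protobuf.Descriptors.FieldDescriptor;
   import com.google.protobuf.InvalidProtocolBufferException;
   import com.google.protobuf.Message;
   import java.util.Map;

   static Headers buildHeaders(Map<String, byte[]> headerRows)
           throws InvalidProtocolBufferException {
       Headers.Builder builder = Headers.newBuilder();
       for (Map.Entry<String, byte[]> row : headerRows.entrySet()) {
           // headerId is assumed to match the proto field name, e.g. "from", "to"
           FieldDescriptor fd = Headers.getDescriptor().findFieldByName(row.getKey());
           if (fd == null || fd.getJavaType() != FieldDescriptor.JavaType.MESSAGE) {
               continue; // unknown or non-message header, skip it
           }
           // The builder knows the concrete type of the field, so no explicit Java class is needed
           Message.Builder fieldBuilder = builder.newBuilderForField(fd);
           fieldBuilder.mergeFrom(row.getValue());
           builder.setField(fd, fieldBuilder.build());
       }
       return builder.build();
   }

The idea behind newBuilderForField is to avoid hard-coding a headerId-to-Java-class 
mapping, but I am not sure this is the intended pattern.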

