xhl0726 opened a new issue #9984:
URL: https://github.com/apache/druid/issues/9984


   ### Affected Version
   0.12+ (all versions that support protobuf-extension)
   
   ### Description
   Protobuf (protocol buffers) is known as a faster mechanism for serializing 
structured data. To make ingestion more efficient, we tried the 
`protobuf-extension` and wrote a simple benchmark to compare it with JSON. 
However, protobuf ingestion turned out to be much slower.
   <img width="620" alt="pb-json-original" 
src="https://user-images.githubusercontent.com/24449727/83716399-603f7a00-a662-11ea-8f0e-bcea7cf1efd2.png">
   
   After investigating the function `parseBatch` in class 
`ProtobufInputRowParser`, we found that the parser first converts the 
protobuf message to JSON (specifically, a String) and then parses that string 
with the JSON parser. Despite protobuf's large transmission advantage, this 
extra conversion step makes ingestion slower.
   
   In order to achieve faster ingestion, we optimized the function `parseBatch` 
by transforming the protobuf to a map directly:
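   ```
   // Decode the raw bytes into a DynamicMessage using the message descriptor.
   DynamicMessage message = DynamicMessage.parseFrom(descriptor, ByteString.copyFrom(input));

   // Re-key the decoded fields by their JSON names, skipping the JSON string round trip.
   Map<String, Object> record = CollectionUtils.mapKeys(message.getAllFields(), k -> k.getJsonName());
   ```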
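
The re-keying step in the snippet above can be pictured with plain Java collections. This is only an illustrative sketch: `Field` is a hypothetical stand-in for a protobuf field descriptor, and `mapKeys` here is a local helper written for this example, not Druid's `CollectionUtils.mapKeys`. Rejecting duplicate keys is this sketch's own choice.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class MapKeysSketch {

  // Hypothetical stand-in for a field descriptor; only the JSON name matters here.
  static final class Field {
    private final String jsonName;

    Field(String jsonName) {
      this.jsonName = jsonName;
    }

    String getJsonName() {
      return jsonName;
    }
  }

  // Builds a new map whose keys are keyMapper(oldKey); values are passed through unchanged.
  static <K, K2, V> Map<K2, V> mapKeys(Map<K, V> input, Function<K, K2> keyMapper) {
    Map<K2, V> result = new LinkedHashMap<>();
    for (Map.Entry<K, V> entry : input.entrySet()) {
      K2 newKey = keyMapper.apply(entry.getKey());
      if (result.containsKey(newKey)) {
        throw new IllegalStateException("Conflicting key: " + newKey);
      }
      result.put(newKey, entry.getValue());
    }
    return result;
  }

  public static void main(String[] args) {
    // Decoded fields keyed by descriptor, as a message would expose them.
    Map<Field, Object> fields = new LinkedHashMap<>();
    fields.put(new Field("eventType"), "click");
    fields.put(new Field("id"), 4711);

    // The record is built directly, with no intermediate JSON string.
    Map<String, Object> record = mapKeys(fields, Field::getJsonName);
    System.out.println(record); // prints {eventType=click, id=4711}
  }
}
```

Values are copied as they are, so anything nested stays in whatever form the decoder produced.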
   
   We then benchmarked the two implementations against each other. The 
optimized version reduces ingestion time by about 80%. The result is shown below:
   <img width="620" alt="protobuf_optimized" 
src="https://user-images.githubusercontent.com/24449727/83716796-8ca7c600-a663-11ea-9514-930d3a06e02c.png">
   
   We also ran `ProtobufInputRowParserTest` to check that the parsing result 
is correct. It shows that the result is correct as long as no `JsonPathSpec` 
is needed (to rename a key or extract a subset of a value). We think users 
can decide whether they need this and then choose the appropriate parsing 
method for higher efficiency.
   
   Besides, we found the protobuf [field 
masks](https://developers.google.com/protocol-buffers/docs/reference/csharp/class/google/protobuf/well-known-types/field-mask), 
which may solve the JsonPath problem. For a message, users can define masks 
that have a format similar to JsonPath, and then merge the mask with the 
original message to get the final result.
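
To make the field-mask idea concrete, here is a minimal sketch of mask-style selection over an already-decoded record. It works on plain nested `Map`s rather than protobuf messages, and `applyMask`, `lookup`, and `insert` are hypothetical helpers written for this example, not protobuf's own field-mask utilities. It assumes dotted paths such as `foo.bar` and that the paths in one mask do not overlap.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FieldMaskSketch {

  // Returns a new nested map containing only the values selected by the dotted paths.
  static Map<String, Object> applyMask(Map<String, Object> record, List<String> paths) {
    Map<String, Object> result = new LinkedHashMap<>();
    for (String path : paths) {
      String[] parts = path.split("\\.");
      Object value = lookup(record, parts);
      if (value != null) {
        insert(result, parts, value); // paths missing from the record are skipped
      }
    }
    return result;
  }

  // Walks the nested maps along the path; returns null if any step is missing.
  static Object lookup(Map<String, Object> record, String[] parts) {
    Object current = record;
    for (String part : parts) {
      if (!(current instanceof Map)) {
        return null;
      }
      current = ((Map<?, ?>) current).get(part);
    }
    return current;
  }

  // Recreates the intermediate maps in the target and stores the value at the leaf.
  @SuppressWarnings("unchecked")
  static void insert(Map<String, Object> target, String[] parts, Object value) {
    Map<String, Object> node = target;
    for (int i = 0; i < parts.length - 1; i++) {
      node = (Map<String, Object>) node.computeIfAbsent(
          parts[i], k -> new LinkedHashMap<String, Object>());
    }
    node.put(parts[parts.length - 1], value);
  }

  public static void main(String[] args) {
    Map<String, Object> foo = new LinkedHashMap<>();
    foo.put("bar", "baz");
    foo.put("unused", 1);

    Map<String, Object> record = new LinkedHashMap<>();
    record.put("id", 4711);
    record.put("description", "text");
    record.put("foo", foo);

    System.out.println(applyMask(record, List.of("id", "foo.bar")));
    // prints {id=4711, foo={bar=baz}}
  }
}
```

Unlike a JsonPath flatten, this keeps the nesting of the selected values, so renaming a key would still need a separate step.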
   
   - Machine info:
   1.7GHz Intel Core i7
   16 GB 2133 MHz LPDDR3
   
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


