bhalchandrap commented on a change in pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#discussion_r703058445
##########
File path: lib/java/src/org/apache/thrift/partial/README.md
##########
@@ -0,0 +1,112 @@
+# Partial Thrift Deserialization
+
+## Overview
+This document describes how partial deserialization of Thrift works. There are two main goals of this documentation:
+1. Make it easier to understand the current Java implementation in this folder.
+1. Be useful in implementing partial deserialization support in additional languages.
+
+This document is divided into two high-level areas. The first part explains important concepts relevant to partial deserialization. The second part describes components involved in the Java implementation in this folder.
+
+Moreover, this blog post provides some performance numbers and additional information: https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4
+
+## Basic Concepts
+
+### Motivation
+
+The main motivation behind implementing this feature is to improve performance when we need to access only a subset of fields in any Thrift object. This situation arises often when big data is stored in Thrift encoded format (for example, SequenceFile with serialized Thrift values). Many data processing jobs may access this data. However, not every job needs to access every field of each object. In such cases, if we have prior knowledge of the fields needed for a given job, we can deserialize only that subset of fields and avoid the cost of deserializing the rest of the fields. There are two benefits of this approach: we save CPU cycles by not deserializing unnecessary fields and we end up reducing GC pressure. Both of the savings quickly add up when processing billions of instances in a data processing job.
+
+### Partial deserialization
+
+Partial deserialization involves deserializing only a subset of the fields of a serialized Thrift object while efficiently skipping over the rest.
One very important benefit of partial deserialization is that the output of the deserialization process is not limited to a `TBase` derived object. It can deserialize a serialized blob into any type by using an appropriate `ThriftFieldValueProcessor`.

Review comment:
Yes, it is very useful in certain cases (see earlier comment). `ThriftStructProcessor` enables direct deserialization into `TBase`. What are your thoughts on providing a simpler wrapper over `ThriftStructProcessor` and `PartialThriftDeserializer` that directly supports deserializing into `TBase`? That way, not every language would need to expose the ability to deserialize into non-`TBase` types.
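
For readers following the thread, the core idea in the quoted README (deserialize only the requested fields and skip over the rest) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `[short id][int length][payload]` layout is a made-up toy format, not the Thrift wire format, and `PartialDecodeSketch`, `encode`, and `decodePartial` are hypothetical names that are not part of this pull request or of the Thrift API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Toy illustration only: this is NOT the Thrift wire format or the Thrift API.
final class PartialDecodeSketch {

    // Encodes each field as [short id][int length][UTF-8 payload], back to back.
    static byte[] encode(Map<Short, String> fields) {
        int size = 0;
        for (String value : fields.values()) {
            size += Short.BYTES + Integer.BYTES
                + value.getBytes(StandardCharsets.UTF_8).length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (Map.Entry<Short, String> e : fields.entrySet()) {
            byte[] payload = e.getValue().getBytes(StandardCharsets.UTF_8);
            buf.putShort(e.getKey()).putInt(payload.length).put(payload);
        }
        return buf.array();
    }

    // Materializes only the fields whose ids are in `wanted`; all other
    // payloads are skipped by advancing the read position, so no String
    // (and no intermediate byte[]) is allocated for them.
    static Map<Short, String> decodePartial(byte[] blob, Set<Short> wanted) {
        ByteBuffer buf = ByteBuffer.wrap(blob);
        Map<Short, String> out = new LinkedHashMap<>();
        while (buf.hasRemaining()) {
            short id = buf.getShort();
            int length = buf.getInt();
            if (wanted.contains(id)) {
                byte[] payload = new byte[length];
                buf.get(payload);
                out.put(id, new String(payload, StandardCharsets.UTF_8));
            } else {
                buf.position(buf.position() + length); // skip without decoding
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Short, String> fields = new LinkedHashMap<>();
        fields.put((short) 1, "alice");
        fields.put((short) 2, "a long biography that this job never reads");
        fields.put((short) 3, "US");

        byte[] blob = encode(fields);
        // Only fields 1 and 3 are decoded; field 2 is skipped.
        System.out.println(decodePartial(blob, Set.of((short) 1, (short) 3)));
    }
}
```

The skip branch is where the CPU and GC savings described in the README's Motivation section would come from: unwanted payloads are never copied or converted into objects. A real implementation would have to skip according to each field's Thrift type rather than a single length prefix.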
