wgtmac commented on pull request #915:
URL: https://github.com/apache/orc/pull/915#issuecomment-927612662


   > Hi @dongjoon-hyun @wgtmac. After some testing and thought, I have 
decided to modify this PR as follows. We can discuss any 
disagreements.
   > 
   > ## Enhancing ColumnStatistics with a plugin approach
   > The data structure, whether TDigest or datasketches, can be a specific 
implementation provided by a plugin. orc-core does not add any dependencies; 
only the test and benchmark modules add dependencies and specific implementations.
   > 
   > ```proto
   > message Digest {
   >   optional string digestName = 1;
   >   optional bytes digestContent = 2;
   > }
   > 
   > message DoubleStatistics {
   >   optional double minimum = 1;
   >   optional double maximum = 2;
   >   optional double sum = 3;
   >   optional Digest digest = 4;
   > }
   > ```
   > 
   > Both Java and C++ will use digestName to find the specific plugin 
implementation. If no plugin is found, it degrades to a default empty implementation.
   > 
   > 1. Does the digest break serialization compatibility across different 
versions?
   >    Since Digest is defined as optional, older versions automatically 
ignore the Digest field when reading files written by newer versions. I ran some 
tests and the results looked good. This will also be added to the unit tests.
   > 2. How do I deal with serialization of the digest between Java and C++?
   >    As the enhancement is provided in the form of a plugin, a user who 
needs Java to write and C++ to read, or vice versa, must supply an 
implementation that keeps the serialization compatible across languages. I think we could 
add an example based on datasketches (which has implementations in multiple languages).
   > 
   > Also, I think I'll add a command to the tool to see the field's digestName.
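
For reference, the lookup-with-fallback behaviour described in the quoted proposal might look roughly like the sketch below. Every name in it (`DigestPluginRegistry`, `DigestPlugin`, `register`, `find`) is hypothetical and used only for illustration; none of it is an existing orc-core API, and the real plugin mechanism in the PR may differ.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical registry that maps a digestName to a plugin factory. */
class DigestPluginRegistry {

  /** Hypothetical plugin contract; not part of orc-core. */
  interface DigestPlugin {
    void update(double value);   // feed one column value into the digest
    byte[] serialize();          // bytes that would go into digestContent
  }

  /** Default used when no plugin matches: records nothing, writes nothing. */
  private static final class EmptyDigest implements DigestPlugin {
    @Override public void update(double value) { /* intentionally a no-op */ }
    @Override public byte[] serialize() { return new byte[0]; }
  }

  private final Map<String, Supplier<DigestPlugin>> factories =
      new ConcurrentHashMap<>();

  /** Registers a factory so each column gets its own digest instance. */
  void register(String digestName, Supplier<DigestPlugin> factory) {
    factories.put(digestName, factory);
  }

  /** Returns a new plugin instance, or the empty default if none is known. */
  DigestPlugin find(String digestName) {
    if (digestName == null) {
      return new EmptyDigest();
    }
    Supplier<DigestPlugin> factory = factories.get(digestName);
    return factory == null ? new EmptyDigest() : factory.get();
  }
}
```

With this shape, a reader that meets an unknown digestName still gets a usable object and simply reports no digest, which matches the "degrades to a default empty implementation" behaviour.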
   
   So I suppose you want to use digestName as a hint to let readers know how to 
decode it? Should we write some useful information, including the type (TDigest, 
datasketches, or something else), version, and binding (Java, C++, etc.), so that 
readers are able to decide whether they can understand it? For the plugin approach 
on the Java side, we may provide a separate module to help read and write the digest.
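
To make the question concrete, here is a minimal sketch of how type, version, and binding could be packed into digestName and checked by a reader. The `type:version:binding` layout, the `DigestDescriptor` class, and the `readableBy` check are all assumptions made up for this illustration; nothing here is defined by the ORC format or by the PR.

```java
import java.util.Optional;
import java.util.Set;

/** Hypothetical structured form of digestName: "type:version:binding". */
final class DigestDescriptor {
  final String type;     // e.g. "tdigest" or "datasketches" (assumed values)
  final int version;     // serialization version of the digest content
  final String binding;  // language that wrote it, e.g. "java" or "cpp"

  DigestDescriptor(String type, int version, String binding) {
    this.type = type;
    this.version = version;
    this.binding = binding;
  }

  /** Encodes the descriptor into the single string stored as digestName. */
  String encode() {
    return type + ":" + version + ":" + binding;
  }

  /** Parses a digestName; returns empty if it does not follow the layout. */
  static Optional<DigestDescriptor> parse(String digestName) {
    if (digestName == null) {
      return Optional.empty();
    }
    String[] parts = digestName.split(":", -1);
    if (parts.length != 3 || parts[0].isEmpty() || parts[2].isEmpty()) {
      return Optional.empty();
    }
    try {
      int version = Integer.parseInt(parts[1]);
      if (version < 0) {
        return Optional.empty();
      }
      return Optional.of(new DigestDescriptor(parts[0], version, parts[2]));
    } catch (NumberFormatException e) {
      return Optional.empty();
    }
  }

  /** True if a reader supporting these types, versions, and bindings can decode it. */
  boolean readableBy(Set<String> types, int maxVersion, Set<String> bindings) {
    return types.contains(type) && version <= maxVersion && bindings.contains(binding);
  }
}
```

A reader would call `parse` on the stored name and skip the digest when the result is empty or `readableBy` returns false, instead of attempting to decode bytes it cannot understand.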



