guiyanakuang commented on pull request #915:
URL: https://github.com/apache/orc/pull/915#issuecomment-925480666
After some testing and some thought, I have decided to modify this PR as
follows. We can discuss any disagreements.
## Enhancing ColumnStatistics with a plugin approach
The concrete data structure, whether TDigest or datasketches, becomes a specific
implementation inside a plugin. orc-core adds no new dependencies; only the test
and benchmark modules add the dependencies and the specific implementations.
```proto
message Digest {
  optional string digestName = 1;
  optional bytes digestContent = 2;
}

message DoubleStatistics {
  optional double minimum = 1;
  optional double maximum = 2;
  optional double sum = 3;
  optional Digest digest = 4;
}
```
Both Java and C++ will use digestName to find the matching plugin
implementation. If none is found, the reader falls back to a default empty
implementation.
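The lookup described above could be sketched roughly as follows. This is only an illustration of the idea, not code from the PR: `DigestRegistry`, `DigestPlugin`, and `EmptyDigest` are hypothetical names, and the real interface may look quite different.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical registry keyed by digestName; none of these types exist in ORC.
class DigestRegistry {
  /** Assumed plugin contract: update with values, then serialize to digestContent. */
  interface DigestPlugin {
    String name();
    void update(double value);
    byte[] serialize();
    void deserialize(byte[] content);
  }

  /** Default used when no plugin matches: records nothing and writes no bytes. */
  static final class EmptyDigest implements DigestPlugin {
    public String name() { return ""; }
    public void update(double value) { }
    public byte[] serialize() { return new byte[0]; }
    public void deserialize(byte[] content) { }
  }

  private final Map<String, Supplier<DigestPlugin>> factories = new HashMap<>();

  /** Associates a digestName with a factory for its implementation. */
  void register(String digestName, Supplier<DigestPlugin> factory) {
    factories.put(digestName, factory);
  }

  /** Returns the registered plugin, or the empty default if the name is unknown. */
  DigestPlugin lookup(String digestName) {
    Supplier<DigestPlugin> factory = factories.get(digestName);
    return factory == null ? new EmptyDigest() : factory.get();
  }
}
```

With this shape, a reader that meets an unregistered digestName still gets a usable object and simply carries no digest data.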
1. Does the digest break serialization compatibility between different
versions?
Since Digest is defined as optional, older versions will automatically
ignore the Digest field when reading files written by newer versions. I ran some
tests and the results looked good. This case will also be added to the unit tests.
2. How do I deal with the serialisation of the digest between Java and C++?
As the enhancement is provided in the form of a plugin, a user who needs to
write with Java and read with C++ (or the reverse) must supply an implementation
that keeps the serialised form compatible between languages. I think we could
add an example based on datasketches (which has implementations in multiple
languages).
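To make the cross-language concern concrete, here is a minimal sketch of one detail a hand-written plugin would have to pin down: the byte order of digestContent. `PortableDigestCodec` and its layout (a 4-byte count followed by 8-byte doubles, all little-endian) are assumptions for illustration only; datasketches defines its own serialized format, which an example based on it would use instead.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical codec: fixes an explicit byte order so that a C++ reader
// can parse the same bytes regardless of either platform's defaults.
class PortableDigestCodec {
  /** Layout (assumed): int32 count, then count float64 values, little-endian. */
  static byte[] encode(double[] values) {
    // ByteBuffer defaults to big-endian, so the order is set explicitly.
    ByteBuffer buf = ByteBuffer.allocate(4 + 8 * values.length)
        .order(ByteOrder.LITTLE_ENDIAN);
    buf.putInt(values.length);
    for (double v : values) {
      buf.putDouble(v);
    }
    return buf.array();
  }

  /** Reverses encode(); the reader must apply the same byte order. */
  static double[] decode(byte[] content) {
    ByteBuffer buf = ByteBuffer.wrap(content).order(ByteOrder.LITTLE_ENDIAN);
    int count = buf.getInt();
    double[] values = new double[count];
    for (int i = 0; i < count; i++) {
      values[i] = buf.getDouble();
    }
    return values;
  }
}
```

A C++ plugin reading this content would need to agree on the same count prefix, value width, and byte order.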
Also, I think I'll add a command to the tool to see the field's digestName.