Thanks for sharing your findings.
I think we should change the InternalStatisticsImpl get methods to return const
reference instead. Such as -
const T & getMaximum() const { return _maximum; }
const T & getMinimum() const { return _minimum; }
This way there won’t be any new string objects constructed while calling the
get methods. If you want to fix this, you can open a JIRA and create a PR. I
can help review it.
Thanks!
> On Dec 3, 2019, at 7:36 AM, David Zanter <[email protected]> wrote:
>
> Hi All,
>
> I have been doing some callgrind performance profiling of the ORC C++ API.
>
> I am running a fairly normal scenario of Copying an ORC Table of 1.9
> million rows. (Reading all of the rows, and then Writing all the rows to a
> new table.)
> Callgrind is showing the #4 method for Instructions Executed is the
> orc::StringColumnStatisticsImpl::update method:
>
>
>
>
> As it turns out why that CPU is so high:
>
> Each of the calls to the _stats.getMinimum() and _stats.getMaximum() methods,
> results in a std::string alloc/strcpy/delete.
>
> There are probably multiple ways that this could be addressed. However,
> possibly one solution that is easy and safe, would be to call the
> getMinimum() getMaximum() once in the method and
> then reusing it like the following:
>
> void update(const char* value, size_t length) {
> if (value != nullptr) {
> if (!_stats.hasMinimum()) {
> setMinimum(std::string(value, value + length));
> setMaximum(std::string(value, value + length));
> } else {
> // update min
> std::string statMin = _stats.getMinimum(); //cache off the min/max
> for performance as each
> std::string statMax = _stats.getMaximum(); // call to getMinimum()
> performs a std::string alloc/copy/delete
> int minCmp = strncmp(statMin.c_str(),
> value,
> std::min(statMin.length(), length));
> if (minCmp > 0 ||
> (minCmp == 0 && length < statMin.length())) {
> setMinimum(std::string(value, value + length));
> }
>
> // update max
> int maxCmp = strncmp(statMax.c_str(),
> value,
> std::min(statMax.length(), length));
> if (maxCmp < 0 ||
> (maxCmp == 0 && length > statMax.length())) {
> setMaximum(std::string(value, value + length));
> }
> }
> }
>
>
> Doing that fix, looks to improve the Writes by about 20% (14.6 Billion
> Instructions -> 12.1 Billion)
>
> FYI: The Table in question (although I wouldn't say there is any "weird"
> about this table, and I believe this performance improvement would help most
> tables with Strings.)
>
> 8 Columns, 4 of which are fairly small Char Columns, and there are 1.9
> million rows
>
> Alphabetic List of Variables and Attributes
> # Variable Type Len Format Informat
>
> 1 _col0 Num 8 11. 11.
> 2 _col1 Char 12
> 3 _col2 Char 12
> 4 _col3 Char 12
> 5 _col4 Num 8 11. 11.
> 6 _col5 Char 12
> 7 _col6 Num 8 11. 11.
> 8 _col7 Num 8 11. 11.
>
> Example Data:
>
>
>
>
> OBS _col0 _col1 _col2 _col3 _col4 _col5
> _col6 _col7
>
> 4 5 M D Primary 500 Good
> 0 0
> 5 6 F D Primary 500 Good
> 0 0
> 6 7 M W Primary 500 Good
> 0 0
>
>
> Thanks,
> David Zanter