I agree, I think that returning const ref to the strings is fine. .. Owen
On Tue, Dec 3, 2019 at 1:44 PM Xiening Dai <[email protected]> wrote: > Thanks for sharing your findings. > > I think we should change the InternalStatisticsImpl get methods to return > const reference instead. Such as - > > const T & getMaximum() const { return _maximum; } > const T & getMinimum() const { return _minimum; } > > This way there won’t be any new string objects constructed while calling > the get methods. If you want to fix this, you can open a JIRA and create a > PR. I can help review it. > > Thanks! > > > > On Dec 3, 2019, at 7:36 AM, David Zanter <[email protected]> > wrote: > > > > Hi All, > > > > I have been doing some callgrind performance profiling of the ORC C++ > API. > > > > I am running a fairly normal scenario of Copying an ORC Table of 1.9 > million rows. (Reading all of the rows, and then Writing all the rows to a > new table.) > > Callgrind is showing the #4 method for Instructions Executed is the > orc::StringColumnStatisticsImpl::update method: > > > > > > > > > > As it turns out why that CPU is so high: > > > > Each of the calls to the _stats.getMinimum() and _stats.getMaximum() > methods, results in a std::string alloc/strcpy/delete. > > > > There are probably multiple ways that this could be addressed. However, > possibly one solution that is easy and safe, would be to call the > getMinimum() getMaximum() once in the method and > > then reusing it like the following: > > > > void update(const char* value, size_t length) { > > if (value != nullptr) { > > if (!_stats.hasMinimum()) { > > setMinimum(std::string(value, value + length)); > > setMaximum(std::string(value, value + length)); > > } else { > > // update min > > std::string statMin = _stats.getMinimum(); //cache off the > min/max for performance as each > > std::string statMax = _stats.getMaximum(); // call to > getMinimum() performs a std::string alloc/copy/delete > > int minCmp = strncmp(statMin.c_str(), > > value, > > std::min(statMin.length(), length)); > > if (minCmp > 0 || > > (minCmp == 0 && length < statMin.length())) { > > setMinimum(std::string(value, value + length)); > > } > > > > // update max > > int maxCmp = strncmp(statMax.c_str(), > > value, > > std::min(statMax.length(), length)); > > if (maxCmp < 0 || > > (maxCmp == 0 && length > statMax.length())) { > > setMaximum(std::string(value, value + length)); > > } > > } > > } > > > > > > Doing that fix, looks to improve the Writes by about 20% (14.6 Billion > Instructions -> 12.1 Billion) > > > > FYI: The Table in question (although I wouldn't say there is any "weird" > about this table, and I believe this performance improvement would help > most tables with Strings.) > > > > 8 Columns, 4 of which are fairly small Char Columns, and there are 1.9 > million rows > > > > Alphabetic List of Variables and Attributes > > # Variable Type Len Format Informat > > > > 1 _col0 Num 8 11. 11. > > 2 _col1 Char 12 > > 3 _col2 Char 12 > > 4 _col3 Char 12 > > 5 _col4 Num 8 11. 11. > > 6 _col5 Char 12 > > 7 _col6 Num 8 11. 11. > > 8 _col7 Num 8 11. 11. > > > > Example Data: > > > > > > > OBS _col0 _col1 _col2 _col3 _col4 > _col5 _col6 _col7 > > > > 4 5 M D Primary 500 > Good 0 0 > > 5 6 F D Primary 500 > Good 0 0 > > 6 7 M W Primary 500 > Good 0 0 > > > > > > Thanks, > > David Zanter > >
