Hi All,

   I have been doing some callgrind performance profiling of the ORC C++ API.

   I am running a fairly normal scenario of Copying an ORC Table of 1.9 million 
rows. (Reading all of the rows, and then Writing all the rows to a new table.)
   Callgrind is showing the #4 method for Instructions Executed is the 
orc::StringColumnStatisticsImpl::update method:


[cid:[email protected]]

As it turns out why that CPU is so high:

Each of the calls to the _stats.getMinimum() and _stats.getMaximum() methods, 
results in a std::string alloc/strcpy/delete.

There are probably multiple ways that this could be addressed.  However, 
possibly one solution that is easy and safe, would be to call the getMinimum() 
getMaximum() once in the method and
then reusing it like the following:

  void update(const char* value, size_t length) {
      if (value != nullptr) {
        if (!_stats.hasMinimum()) {
          setMinimum(std::string(value, value + length));
          setMaximum(std::string(value, value + length));
        } else {
          // update min
          std::string statMin = _stats.getMinimum(); //cache off the min/max 
for performance as each
          std::string statMax = _stats.getMaximum(); // call to getMinimum() 
performs a std::string alloc/copy/delete
          int minCmp = strncmp(statMin.c_str(),
                               value,
                               std::min(statMin.length(), length));
          if (minCmp > 0 ||
                (minCmp == 0 && length < statMin.length())) {
            setMinimum(std::string(value, value + length));
          }

          // update max
          int maxCmp = strncmp(statMax.c_str(),
                               value,
                               std::min(statMax.length(), length));
          if (maxCmp < 0 ||
                (maxCmp == 0 && length > statMax.length())) {
            setMaximum(std::string(value, value + length));
          }
        }
      }


Doing that fix, looks to improve the Writes by about 20% (14.6 Billion 
Instructions -> 12.1 Billion)

FYI: The Table in question (although I wouldn't say there is any "weird" about 
this table, and I believe this performance improvement would help most tables 
with Strings.)

8 Columns, 4 of which are fairly small Char Columns,  and there are 1.9 million 
rows

    Alphabetic List of Variables and Attributes
#    Variable    Type    Len    Format    Informat

1    _col0       Num       8    11.       11.
2    _col1       Char     12
3    _col2       Char     12
4    _col3       Char     12
5    _col4       Num       8    11.       11.
6    _col5       Char     12
7    _col6       Num       8    11.       11.
8    _col7       Num       8    11.       11.

Example Data:

    OBS          _col0    _col1    _col2     _col3           _col4    _col5     
     _col6          _col7

      4              5      M        D      Primary            500    Good      
         0              0
      5              6      F        D      Primary            500    Good      
         0              0
      6              7      M        W      Primary            500    Good      
         0              0


Thanks,
David Zanter

Reply via email to