Thanks for sharing your findings.

I think we should change the InternalStatisticsImpl get methods to return const 
reference instead. Such as -

    const T & getMaximum() const { return _maximum; }
    const T & getMinimum() const { return _minimum; }

This way there won’t be any new string objects constructed while calling the 
get methods. If you want to fix this, you can open a JIRA and create a PR. I 
can help review it.

Thanks!


> On Dec 3, 2019, at 7:36 AM, David Zanter <[email protected]> wrote:
> 
> Hi All,
>  
>    I have been doing some callgrind performance profiling of the ORC C++ API.
>  
>    I am running a fairly normal scenario of Copying an ORC Table of 1.9 
> million rows. (Reading all of the rows, and then Writing all the rows to a 
> new table.)
>    Callgrind is showing the #4 method for Instructions Executed is the 
> orc::StringColumnStatisticsImpl::update method:
>      
>  
> 
>  
> As it turns out why that CPU is so high:
>  
> Each of the calls to the _stats.getMinimum() and _stats.getMaximum() methods, 
> results in a std::string alloc/strcpy/delete.
>  
> There are probably multiple ways that this could be addressed.  However, 
> possibly one solution that is easy and safe, would be to call the 
> getMinimum() getMaximum() once in the method and
> then reusing it like the following:
>  
>   void update(const char* value, size_t length) {
>       if (value != nullptr) {
>         if (!_stats.hasMinimum()) {
>           setMinimum(std::string(value, value + length));
>           setMaximum(std::string(value, value + length));
>         } else {
>           // update min
>           std::string statMin = _stats.getMinimum(); //cache off the min/max 
> for performance as each
>           std::string statMax = _stats.getMaximum(); // call to getMinimum() 
> performs a std::string alloc/copy/delete
>           int minCmp = strncmp(statMin.c_str(),
>                                value,
>                                std::min(statMin.length(), length));
>           if (minCmp > 0 ||
>                 (minCmp == 0 && length < statMin.length())) {
>             setMinimum(std::string(value, value + length));
>           }
>  
>           // update max
>           int maxCmp = strncmp(statMax.c_str(),
>                                value,
>                                std::min(statMax.length(), length));
>           if (maxCmp < 0 ||
>                 (maxCmp == 0 && length > statMax.length())) {
>             setMaximum(std::string(value, value + length));
>           }
>         }
>       }
>  
>  
> Doing that fix, looks to improve the Writes by about 20% (14.6 Billion 
> Instructions -> 12.1 Billion)
>  
> FYI: The Table in question (although I wouldn't say there is any "weird" 
> about this table, and I believe this performance improvement would help most 
> tables with Strings.)
>  
> 8 Columns, 4 of which are fairly small Char Columns,  and there are 1.9 
> million rows
>  
>     Alphabetic List of Variables and Attributes
> #    Variable    Type    Len    Format    Informat
>  
> 1    _col0       Num       8    11.       11.    
> 2    _col1       Char     12                      
> 3    _col2       Char     12                     
> 4    _col3       Char     12                     
> 5    _col4       Num       8    11.       11.    
> 6    _col5       Char     12                     
> 7    _col6       Num       8    11.       11.     
> 8    _col7       Num       8    11.       11.    
>  
> Example Data:                                                                 
>                                                                               
>                                                                               
>                       
>  
>     OBS          _col0    _col1    _col2     _col3           _col4    _col5   
>        _col6          _col7
>  
>       4              5      M        D      Primary            500    Good    
>            0              0
>       5              6      F        D      Primary            500    Good    
>            0              0
>       6              7      M        W      Primary            500    Good    
>            0              0
>  
>  
> Thanks,
> David Zanter

Reply via email to