Hi David,

We use GitHub for pull request and code review. Please refer to -

https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/proposing-changes-to-your-work-with-pull-requests
 
<https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/proposing-changes-to-your-work-with-pull-requests>

You create a bug fix branch in your own repo then you can open a PR to request 
to merge to master, which would start the code review process.


> On Dec 4, 2019, at 7:26 AM, David Zanter <[email protected]> wrote:
> 
> Thank you Xiening,
> 
> I created this JIRA:
> https://issues.apache.org/jira/browse/ORC-574
> 
> And ran the test-out successfully; And ran callgrind profiling on it. [It 
> shows a pretty nice 38% CPU improvement and 10% Clock Time improvement.]
> 
> I do use Git;  But I've never pushed to apache before.
> Am I understanding this correctly, that I would make a new branch: 
> "apache/pr/<newest number>", and then pushing that new branch?  Or is there a 
> document I should follow to push?
> 
> 
> diff --git a/c++/include/orc/Statistics.hh b/c++/include/orc/Statistics.hh
> index 654956d..1d4b0b6 100644
> --- a/c++/include/orc/Statistics.hh
> +++ b/c++/include/orc/Statistics.hh
> @@ -282,13 +282,13 @@ namespace orc {
>      * Get the minimum value for the column.
>      * @return minimum value
>      */
> -    virtual std::string getMinimum() const = 0;
> +    virtual const std::string & getMinimum() const = 0;
> 
>     /**
>      * Get the maximum value for the column.
>      * @return maximum value
>      */
> -    virtual std::string getMaximum() const = 0;
> +    virtual const std::string & getMaximum() const = 0;
> 
>     /**
>      * Get the total length of all values.
> diff --git a/c++/src/Statistics.hh b/c++/src/Statistics.hh
> index 0daddfe..633450f 100644
> --- a/c++/src/Statistics.hh
> +++ b/c++/src/Statistics.hh
> @@ -94,7 +94,7 @@ namespace orc {
>     // GET / SET _maximum
>     bool hasMaximum() const { return _hasMaximum; }
> 
> -    T getMaximum() const { return _maximum; }
> +    const T & getMaximum() const { return _maximum; }
> 
>     void setHasMaximum(bool hasMax) { _hasMaximum = hasMax; }
> 
> @@ -105,7 +105,7 @@ namespace orc {
> 
>     void setHasMinimum(bool hasMin) { _hasMinimum = hasMin; }
> 
> -    T getMinimum() const { return _minimum; }
> +    const T & getMinimum() const { return _minimum; }
> 
>     void setMinimum(T min) { _minimum = min; }
> 
> @@ -1077,7 +1077,7 @@ namespace orc {
>       _stats.setHasNull(hasNull);
>     }
> 
> -    std::string getMinimum() const override {
> +    const std::string & getMinimum() const override {
>       if(hasMinimum()){
>         return _stats.getMinimum();
>       }else{
> @@ -1085,7 +1085,7 @@ namespace orc {
>       }
>     }
> 
> -    std::string getMaximum() const override {
> +    const std::string & getMaximum() const override {
>       if(hasMaximum()){
> 
> -----Original Message-----
> From: Xiening Dai <[email protected]> 
> Sent: Tuesday, December 03, 2019 4:44 PM
> To: [email protected]
> Subject: Re: C++ StringColumnStatisticsImpl::update Performance
> 
> EXTERNAL
> 
> Thanks for sharing your findings.
> 
> I think we should change the InternalStatisticsImpl get methods to return 
> const reference instead. Such as -
> 
>    const T & getMaximum() const { return _maximum; }
>    const T & getMinimum() const { return _minimum; }
> 
> This way there won’t be any new string objects constructed while calling the 
> get methods. If you want to fix this, you can open a JIRA and create a PR. I 
> can help review it.
> 
> Thanks!
> 
> 
>> On Dec 3, 2019, at 7:36 AM, David Zanter <[email protected]> 
>> wrote:
>> 
>> Hi All,
>> 
>>   I have been doing some callgrind performance profiling of the ORC C++ API.
>> 
>>   I am running a fairly normal scenario of Copying an ORC Table of 1.9 
>> million rows. (Reading all of the rows, and then Writing all the rows to a 
>> new table.)
>>   Callgrind is showing the #4 method for Instructions Executed is the 
>> orc::StringColumnStatisticsImpl::update method:
>> 
>> 
>> 
>> 
>> As it turns out why that CPU is so high:
>> 
>> Each of the calls to the _stats.getMinimum() and _stats.getMaximum() 
>> methods, results in a std::string alloc/strcpy/delete.
>> 
>> There are probably multiple ways that this could be addressed.  
>> However, possibly one solution that is easy and safe, would be to call the 
>> getMinimum() getMaximum() once in the method and then reusing it like the 
>> following:
>> 
>>  void update(const char* value, size_t length) {
>>      if (value != nullptr) {
>>        if (!_stats.hasMinimum()) {
>>          setMinimum(std::string(value, value + length));
>>          setMaximum(std::string(value, value + length));
>>        } else {
>>          // update min
>>          std::string statMin = _stats.getMinimum(); //cache off the min/max 
>> for performance as each
>>          std::string statMax = _stats.getMaximum(); // call to getMinimum() 
>> performs a std::string alloc/copy/delete
>>          int minCmp = strncmp(statMin.c_str(),
>>                               value,
>>                               std::min(statMin.length(), length));
>>          if (minCmp > 0 ||
>>                (minCmp == 0 && length < statMin.length())) {
>>            setMinimum(std::string(value, value + length));
>>          }
>> 
>>          // update max
>>          int maxCmp = strncmp(statMax.c_str(),
>>                               value,
>>                               std::min(statMax.length(), length));
>>          if (maxCmp < 0 ||
>>                (maxCmp == 0 && length > statMax.length())) {
>>            setMaximum(std::string(value, value + length));
>>          }
>>        }
>>      }
>> 
>> 
>> Doing that fix, looks to improve the Writes by about 20% (14.6 Billion 
>> Instructions -> 12.1 Billion)
>> 
>> FYI: The Table in question (although I wouldn't say there is any 
>> "weird" about this table, and I believe this performance improvement 
>> would help most tables with Strings.)
>> 
>> 8 Columns, 4 of which are fairly small Char Columns,  and there are 
>> 1.9 million rows
>> 
>>    Alphabetic List of Variables and Attributes
>> #    Variable    Type    Len    Format    Informat
>> 
>> 1    _col0       Num       8    11.       11.
>> 2    _col1       Char     12
>> 3    _col2       Char     12
>> 4    _col3       Char     12
>> 5    _col4       Num       8    11.       11.
>> 6    _col5       Char     12
>> 7    _col6       Num       8    11.       11.
>> 8    _col7       Num       8    11.       11.
>> 
>> Example Data:
>> 
>>    OBS          _col0    _col1    _col2     _col3           _col4    _col5   
>>        _col6          _col7
>> 
>>      4              5      M        D      Primary            500    Good    
>>            0              0
>>      5              6      F        D      Primary            500    Good    
>>            0              0
>>      6              7      M        W      Primary            500    Good    
>>            0              0
>> 
>> 
>> Thanks,
>> David Zanter
> 

Reply via email to