I agree, I think that returning const ref to the strings is fine.

.. Owen

On Tue, Dec 3, 2019 at 1:44 PM Xiening Dai <[email protected]> wrote:

> Thanks for sharing your findings.
>
> I think we should change the InternalStatisticsImpl get methods to return
> const reference instead. Such as -
>
>     const T & getMaximum() const { return _maximum; }
>     const T & getMinimum() const { return _minimum; }
>
> This way there won’t be any new string objects constructed while calling
> the get methods. If you want to fix this, you can open a JIRA and create a
> PR. I can help review it.
>
> Thanks!
>
>
> > On Dec 3, 2019, at 7:36 AM, David Zanter <[email protected]>
> wrote:
> >
> > Hi All,
> >
> >    I have been doing some callgrind performance profiling of the ORC C++
> API.
> >
> >    I am running a fairly normal scenario of Copying an ORC Table of 1.9
> million rows. (Reading all of the rows, and then Writing all the rows to a
> new table.)
> >    Callgrind is showing the #4 method for Instructions Executed is the
> orc::StringColumnStatisticsImpl::update method:
> >
> >
> >
> >
> > As it turns out why that CPU is so high:
> >
> > Each of the calls to the _stats.getMinimum() and _stats.getMaximum()
> methods, results in a std::string alloc/strcpy/delete.
> >
> > There are probably multiple ways that this could be addressed.  However,
> possibly one solution that is easy and safe, would be to call the
> getMinimum() getMaximum() once in the method and
> > then reusing it like the following:
> >
> >   void update(const char* value, size_t length) {
> >       if (value != nullptr) {
> >         if (!_stats.hasMinimum()) {
> >           setMinimum(std::string(value, value + length));
> >           setMaximum(std::string(value, value + length));
> >         } else {
> >           // update min
> >           std::string statMin = _stats.getMinimum(); //cache off the
> min/max for performance as each
> >           std::string statMax = _stats.getMaximum(); // call to
> getMinimum() performs a std::string alloc/copy/delete
> >           int minCmp = strncmp(statMin.c_str(),
> >                                value,
> >                                std::min(statMin.length(), length));
> >           if (minCmp > 0 ||
> >                 (minCmp == 0 && length < statMin.length())) {
> >             setMinimum(std::string(value, value + length));
> >           }
> >
> >           // update max
> >           int maxCmp = strncmp(statMax.c_str(),
> >                                value,
> >                                std::min(statMax.length(), length));
> >           if (maxCmp < 0 ||
> >                 (maxCmp == 0 && length > statMax.length())) {
> >             setMaximum(std::string(value, value + length));
> >           }
> >         }
> >       }
> >
> >
> > Doing that fix, looks to improve the Writes by about 20% (14.6 Billion
> Instructions -> 12.1 Billion)
> >
> > FYI: The Table in question (although I wouldn't say there is any "weird"
> about this table, and I believe this performance improvement would help
> most tables with Strings.)
> >
> > 8 Columns, 4 of which are fairly small Char Columns,  and there are 1.9
> million rows
> >
> >     Alphabetic List of Variables and Attributes
> > #    Variable    Type    Len    Format    Informat
> >
> > 1    _col0       Num       8    11.       11.
> > 2    _col1       Char     12
> > 3    _col2       Char     12
> > 4    _col3       Char     12
> > 5    _col4       Num       8    11.       11.
> > 6    _col5       Char     12
> > 7    _col6       Num       8    11.       11.
> > 8    _col7       Num       8    11.       11.
> >
> > Example Data:
>
>
>
> >
> >     OBS          _col0    _col1    _col2     _col3           _col4
> _col5          _col6          _col7
> >
> >       4              5      M        D      Primary            500
> Good               0              0
> >       5              6      F        D      Primary            500
> Good               0              0
> >       6              7      M        W      Primary            500
> Good               0              0
> >
> >
> > Thanks,
> > David Zanter
>
>

Reply via email to