Hi David, We use GitHub for pull request and code review. Please refer to -
https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/proposing-changes-to-your-work-with-pull-requests <https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/proposing-changes-to-your-work-with-pull-requests> You create a bug fix branch in your own repo then you can open a PR to request to merge to master, which would start the code review process. > On Dec 4, 2019, at 7:26 AM, David Zanter <[email protected]> wrote: > > Thank you Xiening, > > I created this JIRA: > https://issues.apache.org/jira/browse/ORC-574 > > And ran the test-out successfully; And ran callgrind profiling on it. [It > shows a pretty nice 38% CPU improvement and 10% Clock Time improvement.] > > I do use Git; But I've never pushed to apache before. > Am I understanding this correctly, that I would make a new branch: > "apache/pr/<newest number>", and then pushing that new branch? Or is there a > document I should follow to push? > > > diff --git a/c++/include/orc/Statistics.hh b/c++/include/orc/Statistics.hh > index 654956d..1d4b0b6 100644 > --- a/c++/include/orc/Statistics.hh > +++ b/c++/include/orc/Statistics.hh > @@ -282,13 +282,13 @@ namespace orc { > * Get the minimum value for the column. > * @return minimum value > */ > - virtual std::string getMinimum() const = 0; > + virtual const std::string & getMinimum() const = 0; > > /** > * Get the maximum value for the column. > * @return maximum value > */ > - virtual std::string getMaximum() const = 0; > + virtual const std::string & getMaximum() const = 0; > > /** > * Get the total length of all values. > diff --git a/c++/src/Statistics.hh b/c++/src/Statistics.hh > index 0daddfe..633450f 100644 > --- a/c++/src/Statistics.hh > +++ b/c++/src/Statistics.hh > @@ -94,7 +94,7 @@ namespace orc { > // GET / SET _maximum > bool hasMaximum() const { return _hasMaximum; } > > - T getMaximum() const { return _maximum; } > + const T & getMaximum() const { return _maximum; } > > void setHasMaximum(bool hasMax) { _hasMaximum = hasMax; } > > @@ -105,7 +105,7 @@ namespace orc { > > void setHasMinimum(bool hasMin) { _hasMinimum = hasMin; } > > - T getMinimum() const { return _minimum; } > + const T & getMinimum() const { return _minimum; } > > void setMinimum(T min) { _minimum = min; } > > @@ -1077,7 +1077,7 @@ namespace orc { > _stats.setHasNull(hasNull); > } > > - std::string getMinimum() const override { > + const std::string & getMinimum() const override { > if(hasMinimum()){ > return _stats.getMinimum(); > }else{ > @@ -1085,7 +1085,7 @@ namespace orc { > } > } > > - std::string getMaximum() const override { > + const std::string & getMaximum() const override { > if(hasMaximum()){ > > -----Original Message----- > From: Xiening Dai <[email protected]> > Sent: Tuesday, December 03, 2019 4:44 PM > To: [email protected] > Subject: Re: C++ StringColumnStatisticsImpl::update Performance > > EXTERNAL > > Thanks for sharing your findings. > > I think we should change the InternalStatisticsImpl get methods to return > const reference instead. Such as - > > const T & getMaximum() const { return _maximum; } > const T & getMinimum() const { return _minimum; } > > This way there won’t be any new string objects constructed while calling the > get methods. If you want to fix this, you can open a JIRA and create a PR. I > can help review it. > > Thanks! > > >> On Dec 3, 2019, at 7:36 AM, David Zanter <[email protected]> >> wrote: >> >> Hi All, >> >> I have been doing some callgrind performance profiling of the ORC C++ API. >> >> I am running a fairly normal scenario of Copying an ORC Table of 1.9 >> million rows. (Reading all of the rows, and then Writing all the rows to a >> new table.) >> Callgrind is showing the #4 method for Instructions Executed is the >> orc::StringColumnStatisticsImpl::update method: >> >> >> >> >> As it turns out why that CPU is so high: >> >> Each of the calls to the _stats.getMinimum() and _stats.getMaximum() >> methods, results in a std::string alloc/strcpy/delete. >> >> There are probably multiple ways that this could be addressed. >> However, possibly one solution that is easy and safe, would be to call the >> getMinimum() getMaximum() once in the method and then reusing it like the >> following: >> >> void update(const char* value, size_t length) { >> if (value != nullptr) { >> if (!_stats.hasMinimum()) { >> setMinimum(std::string(value, value + length)); >> setMaximum(std::string(value, value + length)); >> } else { >> // update min >> std::string statMin = _stats.getMinimum(); //cache off the min/max >> for performance as each >> std::string statMax = _stats.getMaximum(); // call to getMinimum() >> performs a std::string alloc/copy/delete >> int minCmp = strncmp(statMin.c_str(), >> value, >> std::min(statMin.length(), length)); >> if (minCmp > 0 || >> (minCmp == 0 && length < statMin.length())) { >> setMinimum(std::string(value, value + length)); >> } >> >> // update max >> int maxCmp = strncmp(statMax.c_str(), >> value, >> std::min(statMax.length(), length)); >> if (maxCmp < 0 || >> (maxCmp == 0 && length > statMax.length())) { >> setMaximum(std::string(value, value + length)); >> } >> } >> } >> >> >> Doing that fix, looks to improve the Writes by about 20% (14.6 Billion >> Instructions -> 12.1 Billion) >> >> FYI: The Table in question (although I wouldn't say there is any >> "weird" about this table, and I believe this performance improvement >> would help most tables with Strings.) >> >> 8 Columns, 4 of which are fairly small Char Columns, and there are >> 1.9 million rows >> >> Alphabetic List of Variables and Attributes >> # Variable Type Len Format Informat >> >> 1 _col0 Num 8 11. 11. >> 2 _col1 Char 12 >> 3 _col2 Char 12 >> 4 _col3 Char 12 >> 5 _col4 Num 8 11. 11. >> 6 _col5 Char 12 >> 7 _col6 Num 8 11. 11. >> 8 _col7 Num 8 11. 11. >> >> Example Data: >> >> OBS _col0 _col1 _col2 _col3 _col4 _col5 >> _col6 _col7 >> >> 4 5 M D Primary 500 Good >> 0 0 >> 5 6 F D Primary 500 Good >> 0 0 >> 6 7 M W Primary 500 Good >> 0 0 >> >> >> Thanks, >> David Zanter >
