Hi,

If we are taking this up, could we also support multi-column stats, for example:

ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1, c2), c3

This should help produce better estimates for conditions involving both c1 and c2.

Thanks.
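For reference, a minimal sketch of what is available today versus the proposed extension, assuming a Hive-backed table `mytable` with columns `c1`, `c2`, `c3` (the grouped `(c1, c2)` form is hypothetical and not currently parsed by Spark):

```scala
import org.apache.spark.sql.SparkSession

object ColumnStatsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("column-stats-example")
      .enableHiveSupport()
      .getOrCreate()

    // Supported today: per-column statistics (min, max, distinct count, null count,
    // and optionally histograms) collected independently for each listed column.
    spark.sql("ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS c1, c2, c3")

    // Proposed extension (hypothetical syntax, not currently parsed by Spark):
    // correlated statistics over the column group (c1, c2), so the CBO can
    // estimate the combined selectivity of predicates on c1 AND c2.
    // spark.sql("ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1, c2), c3")

    // Inspect the column-level statistics the optimizer will see.
    spark.sql("DESCRIBE EXTENDED mytable c1").show(truncate = false)

    spark.stop()
  }
}
```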
On Tue, 29 Aug 2023 at 09:05, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Short answer off the top of my head:
>
> My point was with regard to the Cost Based Optimizer (CBO) in traditional
> databases. The concept of a rowkey in HBase is somewhat similar to that of
> a primary key in an RDBMS. In databases with automatic deduplication
> (i.e. duplicate rowkeys are ignored), inserting 100 rows with the same
> rowkey results in only one physical entry in the database. Therefore, the
> new statistical value added should be 1, reflecting the single distinct
> physical entry. If the rowkey is already present in HBase, the value would
> be 0, indicating that no new physical entry was created. We need to take
> the underlying deduplication mechanism of the database into account to
> ensure that statistical values accurately represent the unique physical
> data entries.
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
> On Tue, 29 Aug 2023 at 02:07, Jia Fan <fanjia1...@gmail.com> wrote:
>
>> For databases with automatic deduplication, such as HBase, suppose we
>> insert 100 rows with the same rowkey; only one of them actually ends up
>> in HBase. Is the new statistical value we add 100 or 1? And if HBase
>> already contains that rowkey, would the value be 0? How should we handle
>> this situation?
>>
>> On Tue, 29 Aug 2023 at 07:22, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> I have never been fond of the notion that measuring inserts, updates,
>>> and deletes (DML) is the sole criterion for signaling that Spark's CBO
>>> statistics need updating. Nevertheless, in the absence of an alternative
>>> mechanism, it seems this is the only approach at our disposal (can we
>>> use AI for it 😁). Personally, I would prefer some form of indication of
>>> shifts in the distribution of values in the histogram, overall density,
>>> and similar indicators. The decision to execute "ANALYZE TABLE xyz
>>> COMPUTE STATISTICS FOR COLUMNS" revolves around column-level statistics,
>>> which is why I would tend to focus on monitoring individual column-level
>>> statistics to detect any signals warranting a statistics update.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Distinguished Technologist, Solutions Architect & Engineer
>>> London
>>> United Kingdom
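Purely as an illustration of the deduplication question above (the helper and table names are hypothetical, not part of any Spark or HBase API), the statistics delta for a store that deduplicates on a rowkey would be derived from the number of genuinely new distinct keys rather than from the number of rows written:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DedupStatsDelta {
  // Hypothetical helper: for a store that deduplicates on a key (e.g. an HBase
  // rowkey), the row-count delta to add to the statistics is the number of
  // new distinct keys, not the number of rows written.
  def rowCountDelta(incoming: DataFrame, existingKeys: DataFrame, keyCol: String): Long =
    incoming
      .select(keyCol).distinct()                                    // collapse duplicates within the batch
      .join(existingKeys.select(keyCol), Seq(keyCol), "left_anti")  // drop keys already stored
      .count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dedup-stats-delta")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val existing = Seq("k1").toDF("rowkey")           // rowkey already present in the store
    val batch    = Seq.fill(100)("k1").toDF("rowkey") // 100 inserted rows, all with the same rowkey

    // Prints 0: the rowkey already exists, so no new physical entry and no stats change.
    println(rowCountDelta(batch, existing, "rowkey"))
    spark.stop()
  }
}
```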
>>>
>>> On Sat, 26 Aug 2023 at 21:30, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Impressive, yet in the realm of classic DBMSs it could be seen as a
>>>> case of old wine in a new bottle. The objective, I assume, is to employ
>>>> dynamic sampling to enhance the optimizer's capacity to create effective
>>>> execution plans without the burden of complete I/O and in less time.
>>>>
>>>> For instance:
>>>>
>>>> ANALYZE TABLE xyz COMPUTE STATISTICS WITH SAMPLING = 5 percent
>>>>
>>>> This approach could potentially aid in estimating deltas by utilizing
>>>> sampling.
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>> Distinguished Technologist, Solutions Architect & Engineer
>>>> London
>>>> United Kingdom
>>>>
>>>>
>>>> On Sat, 26 Aug 2023 at 20:58, RAKSON RAKESH <raksonrak...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I would like to propose the incremental collection of statistics in
>>>>> Spark. SPARK-44817 <https://issues.apache.org/jira/browse/SPARK-44817>
>>>>> has been raised for this.
>>>>>
>>>>> Currently, Spark invalidates the stats after data-changing commands,
>>>>> which makes the CBO non-functional. To update these stats, the user
>>>>> either needs to run the `ANALYZE TABLE` command or turn on
>>>>> `spark.sql.statistics.size.autoUpdate.enabled`. Both approaches have
>>>>> their drawbacks: executing `ANALYZE TABLE` triggers a full table scan,
>>>>> while the latter only updates table and partition stats and can be
>>>>> costly in certain cases.
>>>>>
>>>>> The goal of this proposal is to collect stats incrementally while
>>>>> executing data-changing commands by utilizing the framework introduced
>>>>> in SPARK-21669 <https://issues.apache.org/jira/browse/SPARK-21669>.
>>>>>
>>>>> The SPIP document is attached to the JIRA:
>>>>> https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing
>>>>>
>>>>> Hive also supports automatic collection of statistics to keep the
>>>>> stats consistent. I can find multiple Spark JIRAs asking for the same:
>>>>> https://issues.apache.org/jira/browse/SPARK-28872
>>>>> https://issues.apache.org/jira/browse/SPARK-33825
>>>>>
>>>>> Regards,
>>>>> Rakesh
>>>>
--
Regards,
Chetan
+353899475147
+919665562626
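To make the trade-off described in the proposal concrete, here is a minimal sketch of the two options that exist today, assuming a Hive-backed table `mytable` with columns `c1` and `c2` and a `staging_table` (the table and column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object StatsRefreshToday {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stats-refresh-example")
      .enableHiveSupport()
      // CBO and histograms: column-level stats are what good estimates depend on.
      .config("spark.sql.cbo.enabled", "true")
      .config("spark.sql.statistics.histogram.enabled", "true")
      // Option 1: automatically refresh the table size after data-changing commands
      // (size only; column-level stats are still invalidated).
      .config("spark.sql.statistics.size.autoUpdate.enabled", "true")
      .getOrCreate()

    // A data-changing command; without option 1 it leaves previously collected
    // statistics stale or invalidated.
    spark.sql("INSERT INTO mytable SELECT * FROM staging_table")

    // Option 2: recompute table and column stats explicitly. Accurate,
    // but it triggers a full table scan.
    spark.sql("ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS c1, c2")

    // Inspect the statistics the optimizer will use.
    spark.sql("DESCRIBE EXTENDED mytable").show(truncate = false)

    spark.stop()
  }
}
```

The SPIP, as described above, aims to avoid the full scan of option 2 by collecting these stats incrementally as part of the data-changing commands themselves.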