Hi all, I would like to hear more from the community on this topic. I believe it would significantly improve statistics collection in Spark.
Thanks,
Rakesh

On Sat, 2 Sep 2023 at 10:36 AM, Rakesh Raushan <raksonrak...@gmail.com> wrote:

> Thanks all for your insights.
>
> @Mich
> I am not trying to introduce any sampling model here. This idea is about
> collecting the task write metrics while writing the data and aggregating
> them with the existing values present in the catalog (creating a new entry
> if it's a CTAS command). This approach is much simpler to implement,
> although it does introduce a limitation for external tables, where users
> can update data without Spark. Similarly, `ALTER TABLE ADD PARTITION`
> would also require scanning the added partitions, as that currently seems
> to be the only way of getting the metrics.
>
> @Chetan
> Sure, we can analyze multi-column stats as well, but to get them to update
> automatically we need to get this one done first. That can be a future
> scope for this feature.
>
> I feel "auto gathering of stats" would have been a better name for this
> one. We already have similar features in Hive and major DBMSs (SQL Server,
> MySQL).
>
> I would like to hear more from the dev community on this. Is the dev
> community in favour of having this feature in Spark?
> I have made the SPIP doc editable for further comments or questions on
> what I am trying to achieve or how I am going to implement it.
>
> Thanks,
> Rakesh
>
> On Wed, Aug 30, 2023 at 9:42 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> Sorry, I missed this one.
>>
>> In the context of what has been changed, we ought to have an additional
>> timestamp column. In short, we can have
>>
>> datachange(object_name, partition_name, colname, timestamp)
>>
>> where timestamp is the point in time you want to compare against for
>> changes.
>>
>> Example:
>>
>> SELECT * FROM <TABLE> WHERE datachange('<TABLE>', NULL, NULL, '2023-08-01 00:00:00') = 1
>>
>> This query should return all rows from the <TABLE> table that have been
>> changed since August 1, 2023, 00:00:00.
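Rakesh's description above, aggregating task write metrics with the values already in the catalog, can be sketched roughly as follows. This is a toy Python illustration, not Spark code: the field names (`rowCount`, `sizeInBytes`, `numRows`, `bytesWritten`) are chosen to echo Spark's catalog statistics and write-task metrics, but the function and data shapes are hypothetical.

```python
def merge_stats(catalog_stats, write_metrics):
    """Aggregate task write metrics into existing table-level stats.

    catalog_stats: {"rowCount": int, "sizeInBytes": int}, or None for the
    CTAS case where no catalog entry exists yet.
    write_metrics: {"numRows": int, "bytesWritten": int} from the write tasks.
    """
    if catalog_stats is None:
        # CTAS: create a fresh catalog entry from the write metrics alone.
        return {"rowCount": write_metrics["numRows"],
                "sizeInBytes": write_metrics["bytesWritten"]}
    # INSERT-style append: add the deltas to the existing values, with no
    # table rescan required.
    return {"rowCount": catalog_stats["rowCount"] + write_metrics["numRows"],
            "sizeInBytes": catalog_stats["sizeInBytes"] + write_metrics["bytesWritten"]}

# An INSERT appends 1_000 rows / 64_000 bytes to a 10_000-row table.
updated = merge_stats({"rowCount": 10_000, "sizeInBytes": 500_000},
                      {"numRows": 1_000, "bytesWritten": 64_000})
print(updated)  # {'rowCount': 11000, 'sizeInBytes': 564000}
```

Note the limitation Rakesh mentions falls out of this shape: the merge only sees writes that go through Spark, so out-of-band changes to external tables are invisible to it.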
>>
>> Let me know your thoughts.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Distinguished Technologist, Solutions Architect & Engineer
>> London, United Kingdom
>>
>> view my LinkedIn profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed.
>> The author will in no case be liable for any monetary damages arising from
>> such loss, damage or destruction.
>>
>> On Wed, 30 Aug 2023 at 10:19, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Another idea that came to my mind from the old days is the concept of
>>> having a function called *datachange*.
>>>
>>> This datachange function should measure the amount of change in the data
>>> distribution since ANALYZE STATISTICS last ran. Specifically, it should
>>> measure the number of inserts, updates and deletes that have occurred on
>>> the given object, and help us determine whether running ANALYZE
>>> STATISTICS would benefit the query plan.
>>>
>>> Something like:
>>>
>>> select datachange(object_name, partition_name, colname)
>>>
>>> Where:
>>>
>>> object_name – the fully qualified object name. It cannot be null.
>>> partition_name – the data partition name. This can be a null value.
>>> colname – the column name for which the datachange is requested. This
>>> can be a null value (meaning all columns).
>>>
>>> The result should be expressed as a percentage of the total number of
>>> rows in the table or partition (if a partition is specified). The
>>> percentage can be greater than 100% because the number of changes to an
>>> object can be much greater than the number of rows in the table,
>>> particularly when the number of deletes and updates is very high.
>>>
>>> So we can run this function to see whether ANALYZE STATISTICS is
>>> required on a certain column.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh
>>>
>>> On Wed, 30 Aug 2023 at 00:49, Chetan <chetansuttra...@gmail.com> wrote:
>>>
>>>> Thanks for the detailed explanation.
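The percentage semantics described above can be illustrated with a small Python sketch. All names and the 20% threshold are hypothetical, not part of any existing datachange implementation:

```python
def datachange_pct(inserts, updates, deletes, total_rows):
    """Change since the last ANALYZE STATISTICS, as a % of current rows."""
    changes = inserts + updates + deletes
    if total_rows == 0:
        # An empty table with pending changes is trivially "fully changed".
        return float("inf") if changes else 0.0
    return 100.0 * changes / total_rows

# 500 updates plus 700 deletes against a 400-row table: well over 100%.
print(datachange_pct(0, 500, 700, 400))  # 300.0

# A possible rule of thumb: re-analyze once change exceeds some threshold.
needs_analyze = datachange_pct(0, 500, 700, 400) > 20.0
print(needs_analyze)  # True
```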
>>>>
>>>> Regards,
>>>> Chetan
>>>>
>>>> On Tue, Aug 29, 2023, 4:50 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> OK, let us take a deeper look here:
>>>>>
>>>>> ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1, c2), c3
>>>>>
>>>>> In the above, we are explicitly grouping columns c1 and c2 together,
>>>>> and computing statistics for them jointly. Additionally, we are
>>>>> computing statistics for column c3 independently. This approach allows
>>>>> the CBO to treat columns c1 and c2 as a group and compute joint
>>>>> statistics for them, while computing separate statistics for column c3.
>>>>>
>>>>> If columns c1 and c2 are frequently used together in conditions, I
>>>>> concur it makes sense to compute joint statistics for them using the
>>>>> above syntax. On the other hand, if each column has its own
>>>>> significance and the relationship between them is not crucial, we can
>>>>> use
>>>>>
>>>>> ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS c1, c2, c3
>>>>>
>>>>> This syntax computes separate statistics for each column.
>>>>>
>>>>> So your mileage may vary.
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh
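A toy Python calculation of why grouped (c1, c2) statistics can matter: with only single-column stats, an optimizer typically multiplies selectivities under an independence assumption, which can be far off for correlated columns. The numbers below are invented for illustration:

```python
# Hypothetical table and predicate selectivities for c1 = 'x' AND c2 = 'y'.
rows = 1_000_000
sel_c1 = 0.10      # fraction of rows matching the c1 predicate alone
sel_c2 = 0.10      # fraction of rows matching the c2 predicate alone
joint_sel = 0.09   # true fraction matching both (c1 and c2 are correlated)

# With only single-column stats, the planner assumes independence:
est_independent = rows * sel_c1 * sel_c2

# With multi-column stats on (c1, c2), it can use the joint selectivity:
est_joint = rows * joint_sel

print(int(est_independent))  # 10000
print(int(est_joint))        # 90000, 9x the independence-based guess
```

An estimate that is off by an order of magnitude like this can flip join-strategy decisions, which is the practical payoff of grouped statistics.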
>>>>>
>>>>> On Tue, 29 Aug 2023 at 12:14, Chetan <chetansuttra...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> If we are taking this up, then I would ask: can we support
>>>>>> multi-column stats, such as
>>>>>>
>>>>>> ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1, c2), c3
>>>>>>
>>>>>> This should help in estimating better for conditions involving c1
>>>>>> and c2.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> On Tue, 29 Aug 2023 at 09:05, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> A short answer off the top of my head:
>>>>>>>
>>>>>>> My point was with regard to the Cost Based Optimizer (CBO) in
>>>>>>> traditional databases. The concept of a rowkey in HBase is somewhat
>>>>>>> similar to that of a primary key in an RDBMS. In databases with
>>>>>>> automatic deduplication features (i.e. those that ignore duplicate
>>>>>>> rowkeys), inserting 100 rows with the same rowkey actually results
>>>>>>> in only one physical entry due to deduplication. Therefore, the new
>>>>>>> statistical value added should be 1, reflecting the distinct
>>>>>>> physical entry. If the rowkey is already present in HBase, the value
>>>>>>> would indeed be 0, indicating that no new physical entry was
>>>>>>> created. We need to take into account the underlying deduplication
>>>>>>> mechanism of the database in use to ensure that statistical values
>>>>>>> accurately represent the unique physical data entries.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Mich Talebzadeh
>>>>>>>
>>>>>>> On Tue, 29 Aug 2023 at 02:07, Jia Fan <fanjia1...@gmail.com> wrote:
>>>>>>>
>>>>>>>> For those databases with automatic deduplication capabilities, such
>>>>>>>> as HBase: suppose we have inserted 100 rows with the same rowkey,
>>>>>>>> but in fact there is only one row in HBase. Is the new statistical
>>>>>>>> value we add 100 or 1? And if HBase already contains this rowkey,
>>>>>>>> would the value be 0? How should we handle this situation?
>>>>>>>>
>>>>>>>> Mich Talebzadeh <mich.talebza...@gmail.com> wrote on Tue, 29 Aug 2023 at 07:22:
>>>>>>>>
>>>>>>>>> I have never been fond of the notion that measuring inserts,
>>>>>>>>> updates and deletes (DML) is the sole criterion for signalling a
>>>>>>>>> need to update statistics for Spark's CBO. Nevertheless, in the
>>>>>>>>> absence of an alternative mechanism, it seems this is the only
>>>>>>>>> approach at our disposal (can we use AI for it 😁). Personally, I
>>>>>>>>> would prefer some form of indication of shifts in the distribution
>>>>>>>>> of values in the histogram, overall density, and similar
>>>>>>>>> indicators. The decision to execute "ANALYZE TABLE xyz COMPUTE
>>>>>>>>> STATISTICS FOR COLUMNS" revolves around column-level statistics,
>>>>>>>>> which is why I would tend to focus on monitoring individual
>>>>>>>>> column-level statistics to detect any signals warranting a
>>>>>>>>> statistics update.
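One way to picture the column-level signal Mich describes: compare the histogram captured at the last ANALYZE with the current one, and flag a re-analyze when the distribution has drifted. The sketch below uses total variation distance as one possible measure; it is illustrative Python, not Spark internals, and the 0.2 threshold is arbitrary:

```python
def tv_distance(hist_a, hist_b):
    """Total variation distance between two bucket-count histograms
    (same bucket boundaries assumed). Ranges from 0.0 to 1.0."""
    total_a, total_b = sum(hist_a), sum(hist_b)
    return 0.5 * sum(abs(a / total_a - b / total_b)
                     for a, b in zip(hist_a, hist_b))

def stats_stale(old_hist, new_hist, threshold=0.2):
    """Signal that column stats are worth refreshing."""
    return tv_distance(old_hist, new_hist) > threshold

old = [100, 100, 100, 100]   # uniform distribution at last ANALYZE
new = [400, 50, 50, 100]     # values now pile into the first bucket

print(round(tv_distance(old, new), 3))  # 0.417
print(stats_stale(old, new))            # True
```

Unlike a raw DML counter, a measure like this stays quiet when many writes do not actually change the value distribution, which is closer to what the CBO cares about.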
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> Mich Talebzadeh
>>>>>>>>>
>>>>>>>>> On Sat, 26 Aug 2023 at 21:30, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Impressive, yet in the realm of classic DBMSs it could be seen as
>>>>>>>>>> a case of old wine in a new bottle. The objective, I assume, is
>>>>>>>>>> to employ dynamic sampling to enhance the optimizer's capacity to
>>>>>>>>>> create effective execution plans without the burden of complete
>>>>>>>>>> I/O and in less time.
>>>>>>>>>>
>>>>>>>>>> For instance:
>>>>>>>>>>
>>>>>>>>>> ANALYZE TABLE xyz COMPUTE STATISTICS WITH SAMPLING = 5 percent
>>>>>>>>>>
>>>>>>>>>> This approach could potentially aid in estimating deltas by
>>>>>>>>>> utilizing sampling.
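A rough Python sketch of what a sampling clause could do under the hood: estimate the row count from a Bernoulli sample and scale up, trading accuracy for I/O. The `WITH SAMPLING` syntax above is a proposal, not existing Spark SQL, and this helper is hypothetical:

```python
import random

def estimate_row_count(scan, fraction, seed=42):
    """Scan once, keep each row with probability `fraction`, and scale
    the sampled count back up to estimate the true row count."""
    rng = random.Random(seed)  # seeded for reproducibility
    sampled = sum(1 for _ in scan if rng.random() < fraction)
    return int(sampled / fraction)

table = range(100_000)  # stand-in for a table scan of 100,000 rows
approx = estimate_row_count(table, 0.05)
print(approx)  # an estimate near the true 100_000, not an exact count
```

The same pass over the sample could also feed cheaper estimates for distinct counts and histograms, which is where the real ANALYZE cost lies.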
>>>>>>>>>>
>>>>>>>>>> HTH
>>>>>>>>>>
>>>>>>>>>> Mich Talebzadeh
>>>>>>>>>>
>>>>>>>>>> On Sat, 26 Aug 2023 at 20:58, RAKSON RAKESH <raksonrak...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I would like to propose the incremental collection of statistics
>>>>>>>>>>> in Spark. SPARK-44817
>>>>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-44817> has been
>>>>>>>>>>> raised for this.
>>>>>>>>>>>
>>>>>>>>>>> Currently, Spark invalidates the stats after data-changing
>>>>>>>>>>> commands, which makes the CBO non-functional. To update these
>>>>>>>>>>> stats, the user either needs to run the `ANALYZE TABLE` command
>>>>>>>>>>> or turn on `spark.sql.statistics.size.autoUpdate.enabled`. Both
>>>>>>>>>>> of these have drawbacks: executing `ANALYZE TABLE` triggers a
>>>>>>>>>>> full table scan, while the other only updates table and
>>>>>>>>>>> partition size stats and can be costly in certain cases.
>>>>>>>>>>>
>>>>>>>>>>> The goal of this proposal is to collect stats incrementally
>>>>>>>>>>> while executing data-changing commands, by utilizing the
>>>>>>>>>>> framework introduced in SPARK-21669
>>>>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-21669>.
>>>>>>>>>>>
>>>>>>>>>>> The SPIP document has been attached to the JIRA:
>>>>>>>>>>> https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing
>>>>>>>>>>>
>>>>>>>>>>> Hive also supports automatic collection of statistics to keep
>>>>>>>>>>> the stats consistent. I can find multiple Spark JIRAs asking for
>>>>>>>>>>> the same:
>>>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-28872
>>>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-33825
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Rakesh
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Chetan
>>>>>>
>>>>>> +353899475147
>>>>>> +919665562626