Hi all, I would like to hear more from the community on this topic. I believe it would significantly improve statistics collection in Spark.
Thanks,
Rakesh

On Sat, 2 Sep 2023 at 10:36 AM, Rakesh Raushan <raksonrak...@gmail.com> wrote:

> Thanks all for your insights.
>
> @Mich
> I am not trying to introduce any sampling model here. This idea is about
> collecting the task write metrics while writing the data and aggregating
> them with the existing values present in the catalog (creating a new entry
> if it's a CTAS command). This approach is much simpler to implement,
> although it does introduce a limitation for external tables, where users
> can update data without Spark. Similarly, `ALTER TABLE ADD PARTITION`
> would also require scanning the added partitions, as that currently seems
> to be the only way of getting the metrics.
>
> @Chetan
> Sure, we can analyze multi-column stats as well, but to get them to update
> automatically we need to get this one done first. That can be a future
> scope for this feature.
>
> I feel "auto gathering of stats" would have been a better name for this
> one. We already have similar features in Hive and major DBMSs (SQL Server,
> MySQL).
>
> I would like to hear more from the dev community on this. Is the dev
> community in favour of having this feature in Spark?
> I have made the SPIP doc editable for further comments or questions on
> what I am trying to achieve or how I am going to implement it.
>
> Thanks,
> Rakesh
>
> On Wed, Aug 30, 2023 at 9:42 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> Sorry, I missed this one.
>>
>> In the context of what has been changed, we ought to have an additional
>> timestamp column. In short, we can have
>>
>> datachange(object_name, partition_name, colname, timestamp)
>>
>> where timestamp is the point in time you want to compare against for
>> changes.
>>
>> Example:
>>
>> SELECT * FROM <TABLE> WHERE datachange('<TABLE>', NULL, NULL, '2023-08-01 00:00:00') = 1
>>
>> This query should return all rows from the <TABLE> table that have been
>> changed since August 1, 2023, 00:00:00.
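Rakesh's description above, aggregating task write metrics with the values already in the catalog, can be sketched roughly as follows. This is a toy Python illustration, not Spark code: the field names (`rowCount`, `sizeInBytes`, `numRows`, `bytesWritten`) are chosen to echo Spark's catalog statistics and write-task metrics, but the function and data shapes are hypothetical.

```python
def merge_stats(catalog_stats, write_metrics):
    """Aggregate task write metrics into existing table-level stats.

    catalog_stats: {"rowCount": int, "sizeInBytes": int}, or None for the
    CTAS case where no catalog entry exists yet.
    write_metrics: {"numRows": int, "bytesWritten": int} from the write tasks.
    """
    if catalog_stats is None:
        # CTAS: create a fresh catalog entry from the write metrics alone.
        return {"rowCount": write_metrics["numRows"],
                "sizeInBytes": write_metrics["bytesWritten"]}
    # INSERT-style append: add the deltas to the existing values, with no
    # table rescan required.
    return {"rowCount": catalog_stats["rowCount"] + write_metrics["numRows"],
            "sizeInBytes": catalog_stats["sizeInBytes"] + write_metrics["bytesWritten"]}

# An INSERT appends 1_000 rows / 64_000 bytes to a 10_000-row table.
updated = merge_stats({"rowCount": 10_000, "sizeInBytes": 500_000},
                      {"numRows": 1_000, "bytesWritten": 64_000})
print(updated)  # {'rowCount': 11000, 'sizeInBytes': 564000}
```

Note the limitation Rakesh mentions falls out of this shape: the merge only sees writes that go through Spark, so out-of-band changes to external tables are invisible to it.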
>>
>> Let me know your thoughts.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Distinguished Technologist, Solutions Architect & Engineer
>> London, United Kingdom
>>
>> view my LinkedIn profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed.
>> The author will in no case be liable for any monetary damages arising from
>> such loss, damage or destruction.
>>
>> On Wed, 30 Aug 2023 at 10:19, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Another idea that came to my mind from the old days is the concept of
>>> having a function called *datachange*.
>>>
>>> This datachange function should measure the amount of change in the data
>>> distribution since ANALYZE STATISTICS last ran. Specifically, it should
>>> measure the number of inserts, updates and deletes that have occurred on
>>> the given object, and help us determine whether running ANALYZE
>>> STATISTICS would benefit the query plan.
>>>
>>> Something like:
>>>
>>> select datachange(object_name, partition_name, colname)
>>>
>>> Where:
>>>
>>> object_name – the fully qualified object name. It cannot be null.
>>> partition_name – the data partition name. This can be a null value.
>>> colname – the column name for which the datachange is requested. This
>>> can be a null value (meaning all columns).
>>>
>>> The result should be expressed as a percentage of the total number of
>>> rows in the table or partition (if a partition is specified). The
>>> percentage can be greater than 100% because the number of changes to an
>>> object can be much greater than the number of rows in the table,
>>> particularly when the number of deletes and updates is very high.
>>>
>>> So we can run this function to see whether ANALYZE STATISTICS is
>>> required on a certain column.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh
>>>
>>> On Wed, 30 Aug 2023 at 00:49, Chetan <chetansuttra...@gmail.com> wrote:
>>>
>>>> Thanks for the detailed explanation.
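The percentage semantics described above can be illustrated with a small Python sketch. All names and the 20% threshold are hypothetical, not part of any existing datachange implementation:

```python
def datachange_pct(inserts, updates, deletes, total_rows):
    """Change since the last ANALYZE STATISTICS, as a % of current rows."""
    changes = inserts + updates + deletes
    if total_rows == 0:
        # An empty table with pending changes is trivially "fully changed".
        return float("inf") if changes else 0.0
    return 100.0 * changes / total_rows

# 500 updates plus 700 deletes against a 400-row table: well over 100%.
print(datachange_pct(0, 500, 700, 400))  # 300.0

# A possible rule of thumb: re-analyze once change exceeds some threshold.
needs_analyze = datachange_pct(0, 500, 700, 400) > 20.0
print(needs_analyze)  # True
```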
>>>>
>>>> Regards,
>>>> Chetan
>>>>
>>>> On Tue, Aug 29, 2023, 4:50 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> OK, let us take a deeper look here:
>>>>>
>>>>> ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1, c2), c3
>>>>>
>>>>> In the above, we are explicitly grouping columns c1 and c2 together,
>>>>> and computing statistics for them jointly. Additionally, we are
>>>>> computing statistics for column c3 independently. This approach allows
>>>>> the CBO to treat columns c1 and c2 as a group and compute joint
>>>>> statistics for them, while computing separate statistics for column c3.
>>>>>
>>>>> If columns c1 and c2 are frequently used together in conditions, I
>>>>> concur it makes sense to compute joint statistics for them using the
>>>>> above syntax. On the other hand, if each column has its own
>>>>> significance and the relationship between them is not crucial, we can
>>>>> use
>>>>>
>>>>> ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS c1, c2, c3
>>>>>
>>>>> This syntax computes separate statistics for each column.
>>>>>
>>>>> So your mileage may vary.
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh
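A toy Python calculation of why grouped (c1, c2) statistics can matter: with only single-column stats, an optimizer typically multiplies selectivities under an independence assumption, which can be far off for correlated columns. The numbers below are invented for illustration:

```python
# Hypothetical table and predicate selectivities for c1 = 'x' AND c2 = 'y'.
rows = 1_000_000
sel_c1 = 0.10      # fraction of rows matching the c1 predicate alone
sel_c2 = 0.10      # fraction of rows matching the c2 predicate alone
joint_sel = 0.09   # true fraction matching both (c1 and c2 are correlated)

# With only single-column stats, the planner assumes independence:
est_independent = rows * sel_c1 * sel_c2

# With multi-column stats on (c1, c2), it can use the joint selectivity:
est_joint = rows * joint_sel

print(int(est_independent))  # 10000
print(int(est_joint))        # 90000, 9x the independence-based guess
```

An estimate that is off by an order of magnitude like this can flip join-strategy decisions, which is the practical payoff of grouped statistics.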
>>>>>
>>>>> On Tue, 29 Aug 2023 at 12:14, Chetan <chetansuttra...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> If we are taking this up, then I would ask: can we support
>>>>>> multi-column stats, such as
>>>>>>
>>>>>> ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1, c2), c3
>>>>>>
>>>>>> This should help in estimating better for conditions involving c1
>>>>>> and c2.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> On Tue, 29 Aug 2023 at 09:05, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> A short answer off the top of my head:
>>>>>>>
>>>>>>> My point was with regard to the Cost Based Optimizer (CBO) in
>>>>>>> traditional databases. The concept of a rowkey in HBase is somewhat
>>>>>>> similar to that of a primary key in an RDBMS. In databases with
>>>>>>> automatic deduplication features (i.e. those that ignore duplicate
>>>>>>> rowkeys), inserting 100 rows with the same rowkey actually results
>>>>>>> in only one physical entry due to deduplication. Therefore, the new
>>>>>>> statistical value added should be 1, reflecting the distinct
>>>>>>> physical entry. If the rowkey is already present in HBase, the value
>>>>>>> would indeed be 0, indicating that no new physical entry was
>>>>>>> created. We need to take into account the underlying deduplication
>>>>>>> mechanism of the database in use to ensure that statistical values
>>>>>>> accurately represent the unique physical data entries.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Mich Talebzadeh
>>>>>>>
>>>>>>> On Tue, 29 Aug 2023 at 02:07, Jia Fan <fanjia1...@gmail.com> wrote:
>>>>>>>
>>>>>>>> For those databases with automatic deduplication capabilities, such
>>>>>>>> as HBase: suppose we have inserted 100 rows with the same rowkey,
>>>>>>>> but in fact there is only one row in HBase. Is the new statistical
>>>>>>>> value we add 100 or 1? And if HBase already contains this rowkey,
>>>>>>>> would the value be 0? How should we handle this situation?
>>>>>>>>
>>>>>>>> Mich Talebzadeh <mich.talebza...@gmail.com> wrote on Tue, 29 Aug 2023 at 07:22:
>>>>>>>>
>>>>>>>>> I have never been fond of the notion that measuring inserts,
>>>>>>>>> updates and deletes (DML) is the sole criterion for signalling a
>>>>>>>>> need to update statistics for Spark's CBO. Nevertheless, in the
>>>>>>>>> absence of an alternative mechanism, it seems this is the only
>>>>>>>>> approach at our disposal (can we use AI for it 😁). Personally, I
>>>>>>>>> would prefer some form of indication of shifts in the distribution
>>>>>>>>> of values in the histogram, overall density, and similar
>>>>>>>>> indicators. The decision to execute "ANALYZE TABLE xyz COMPUTE
>>>>>>>>> STATISTICS FOR COLUMNS" revolves around column-level statistics,
>>>>>>>>> which is why I would tend to focus on monitoring individual
>>>>>>>>> column-level statistics to detect any signals warranting a
>>>>>>>>> statistics update.
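One way to picture the column-level signal Mich describes: compare the histogram captured at the last ANALYZE with the current one, and flag a re-analyze when the distribution has drifted. The sketch below uses total variation distance as one possible measure; it is illustrative Python, not Spark internals, and the 0.2 threshold is arbitrary:

```python
def tv_distance(hist_a, hist_b):
    """Total variation distance between two bucket-count histograms
    (same bucket boundaries assumed). Ranges from 0.0 to 1.0."""
    total_a, total_b = sum(hist_a), sum(hist_b)
    return 0.5 * sum(abs(a / total_a - b / total_b)
                     for a, b in zip(hist_a, hist_b))

def stats_stale(old_hist, new_hist, threshold=0.2):
    """Signal that column stats are worth refreshing."""
    return tv_distance(old_hist, new_hist) > threshold

old = [100, 100, 100, 100]   # uniform distribution at last ANALYZE
new = [400, 50, 50, 100]     # values now pile into the first bucket

print(round(tv_distance(old, new), 3))  # 0.417
print(stats_stale(old, new))            # True
```

Unlike a raw DML counter, a measure like this stays quiet when many writes do not actually change the value distribution, which is closer to what the CBO cares about.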
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> Mich Talebzadeh
>>>>>>>>>
>>>>>>>>> On Sat, 26 Aug 2023 at 21:30, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Impressive, yet in the realm of classic DBMSs it could be seen as
>>>>>>>>>> a case of old wine in a new bottle. The objective, I assume, is
>>>>>>>>>> to employ dynamic sampling to enhance the optimizer's capacity to
>>>>>>>>>> create effective execution plans without the burden of complete
>>>>>>>>>> I/O and in less time.
>>>>>>>>>>
>>>>>>>>>> For instance:
>>>>>>>>>>
>>>>>>>>>> ANALYZE TABLE xyz COMPUTE STATISTICS WITH SAMPLING = 5 percent
>>>>>>>>>>
>>>>>>>>>> This approach could potentially aid in estimating deltas by
>>>>>>>>>> utilizing sampling.
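A rough Python sketch of what a sampling clause could do under the hood: estimate the row count from a Bernoulli sample and scale up, trading accuracy for I/O. The `WITH SAMPLING` syntax above is a proposal, not existing Spark SQL, and this helper is hypothetical:

```python
import random

def estimate_row_count(scan, fraction, seed=42):
    """Scan once, keep each row with probability `fraction`, and scale
    the sampled count back up to estimate the true row count."""
    rng = random.Random(seed)  # seeded for reproducibility
    sampled = sum(1 for _ in scan if rng.random() < fraction)
    return int(sampled / fraction)

table = range(100_000)  # stand-in for a table scan of 100,000 rows
approx = estimate_row_count(table, 0.05)
print(approx)  # an estimate near the true 100_000, not an exact count
```

The same pass over the sample could also feed cheaper estimates for distinct counts and histograms, which is where the real ANALYZE cost lies.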
>>>>>>>>>>
>>>>>>>>>> HTH
>>>>>>>>>>
>>>>>>>>>> Mich Talebzadeh
>>>>>>>>>>
>>>>>>>>>> On Sat, 26 Aug 2023 at 20:58, RAKSON RAKESH <raksonrak...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I would like to propose the incremental collection of statistics
>>>>>>>>>>> in Spark. SPARK-44817
>>>>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-44817> has been
>>>>>>>>>>> raised for this.
>>>>>>>>>>>
>>>>>>>>>>> Currently, Spark invalidates the stats after data-changing
>>>>>>>>>>> commands, which makes the CBO non-functional. To update these
>>>>>>>>>>> stats, the user either needs to run the `ANALYZE TABLE` command
>>>>>>>>>>> or turn on `spark.sql.statistics.size.autoUpdate.enabled`. Both
>>>>>>>>>>> of these have drawbacks: executing `ANALYZE TABLE` triggers a
>>>>>>>>>>> full table scan, while the other only updates table and
>>>>>>>>>>> partition size stats and can be costly in certain cases.
>>>>>>>>>>>
>>>>>>>>>>> The goal of this proposal is to collect stats incrementally
>>>>>>>>>>> while executing data-changing commands, by utilizing the
>>>>>>>>>>> framework introduced in SPARK-21669
>>>>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-21669>.
>>>>>>>>>>>
>>>>>>>>>>> The SPIP document has been attached to the JIRA:
>>>>>>>>>>> https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing
>>>>>>>>>>>
>>>>>>>>>>> Hive also supports automatic collection of statistics to keep
>>>>>>>>>>> the stats consistent. I can find multiple Spark JIRAs asking for
>>>>>>>>>>> the same:
>>>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-28872
>>>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-33825
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Rakesh
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Chetan
>>>>>>
>>>>>> +353899475147
>>>>>> +919665562626