Hi,

If we are taking this up, could we also support multi-column stats, for example:

ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1, c2), c3

This should help produce better estimates for conditions involving both c1 and c2.

Thanks.
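For reference, a minimal sketch of what is available today versus the proposed extension, assuming a Hive-backed table `mytable` with columns `c1`, `c2`, `c3` (the grouped `(c1, c2)` form is hypothetical and not currently parsed by Spark):

```scala
import org.apache.spark.sql.SparkSession

object ColumnStatsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("column-stats-example")
      .enableHiveSupport()
      .getOrCreate()

    // Supported today: per-column statistics (min, max, distinct count, null count,
    // and optionally histograms) collected independently for each listed column.
    spark.sql("ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS c1, c2, c3")

    // Proposed extension (hypothetical syntax, not currently parsed by Spark):
    // correlated statistics over the column group (c1, c2), so the CBO can
    // estimate the combined selectivity of predicates on c1 AND c2.
    // spark.sql("ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1, c2), c3")

    // Inspect the column-level statistics the optimizer will see.
    spark.sql("DESCRIBE EXTENDED mytable c1").show(truncate = false)

    spark.stop()
  }
}
```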
On Tue, 29 Aug 2023 at 09:05, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Short answer off the top of my head:
>
> My point was with regard to the Cost Based Optimizer (CBO) in traditional
> databases. The concept of a rowkey in HBase is somewhat similar to that of
> a primary key in an RDBMS. In databases with automatic deduplication
> (i.e. duplicate rowkeys are ignored), inserting 100 rows with the same
> rowkey results in only one physical entry in the database. Therefore, the
> new statistical value added should be 1, reflecting the single distinct
> physical entry. If the rowkey is already present in HBase, the value would
> be 0, indicating that no new physical entry was created. We need to take
> the underlying deduplication mechanism of the database into account to
> ensure that statistical values accurately represent the unique physical
> data entries.
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
> On Tue, 29 Aug 2023 at 02:07, Jia Fan <fanjia1...@gmail.com> wrote:
>
>> For databases with automatic deduplication, such as HBase, suppose we
>> insert 100 rows with the same rowkey; only one of them actually ends up
>> in HBase. Is the new statistical value we add 100 or 1? And if HBase
>> already contains that rowkey, would the value be 0? How should we handle
>> this situation?
>>
>> On Tue, 29 Aug 2023 at 07:22, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> I have never been fond of the notion that measuring inserts, updates,
>>> and deletes (DML) is the sole criterion for signaling that Spark's CBO
>>> statistics need updating. Nevertheless, in the absence of an alternative
>>> mechanism, it seems this is the only approach at our disposal (can we
>>> use AI for it 😁). Personally, I would prefer some form of indication of
>>> shifts in the distribution of values in the histogram, overall density,
>>> and similar indicators. The decision to execute "ANALYZE TABLE xyz
>>> COMPUTE STATISTICS FOR COLUMNS" revolves around column-level statistics,
>>> which is why I would tend to focus on monitoring individual column-level
>>> statistics to detect any signals warranting a statistics update.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Distinguished Technologist, Solutions Architect & Engineer
>>> London
>>> United Kingdom
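Purely as an illustration of the deduplication question above (the helper and table names are hypothetical, not part of any Spark or HBase API), the statistics delta for a store that deduplicates on a rowkey would be derived from the number of genuinely new distinct keys rather than from the number of rows written:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DedupStatsDelta {
  // Hypothetical helper: for a store that deduplicates on a key (e.g. an HBase
  // rowkey), the row-count delta to add to the statistics is the number of
  // new distinct keys, not the number of rows written.
  def rowCountDelta(incoming: DataFrame, existingKeys: DataFrame, keyCol: String): Long =
    incoming
      .select(keyCol).distinct()                                    // collapse duplicates within the batch
      .join(existingKeys.select(keyCol), Seq(keyCol), "left_anti")  // drop keys already stored
      .count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dedup-stats-delta")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val existing = Seq("k1").toDF("rowkey")           // rowkey already present in the store
    val batch    = Seq.fill(100)("k1").toDF("rowkey") // 100 inserted rows, all with the same rowkey

    // Prints 0: the rowkey already exists, so no new physical entry and no stats change.
    println(rowCountDelta(batch, existing, "rowkey"))
    spark.stop()
  }
}
```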
>>>
>>> On Sat, 26 Aug 2023 at 21:30, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Impressive, yet in the realm of classic DBMSs it could be seen as a
>>>> case of old wine in a new bottle. The objective, I assume, is to employ
>>>> dynamic sampling to enhance the optimizer's capacity to create effective
>>>> execution plans without the burden of complete I/O and in less time.
>>>>
>>>> For instance:
>>>>
>>>> ANALYZE TABLE xyz COMPUTE STATISTICS WITH SAMPLING = 5 percent
>>>>
>>>> This approach could potentially aid in estimating deltas by utilizing
>>>> sampling.
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>> Distinguished Technologist, Solutions Architect & Engineer
>>>> London
>>>> United Kingdom
>>>>
>>>>
>>>> On Sat, 26 Aug 2023 at 20:58, RAKSON RAKESH <raksonrak...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I would like to propose the incremental collection of statistics in
>>>>> Spark. SPARK-44817 <https://issues.apache.org/jira/browse/SPARK-44817>
>>>>> has been raised for this.
>>>>>
>>>>> Currently, Spark invalidates the stats after data-changing commands,
>>>>> which makes the CBO non-functional. To update these stats, the user
>>>>> either needs to run the `ANALYZE TABLE` command or turn on
>>>>> `spark.sql.statistics.size.autoUpdate.enabled`. Both approaches have
>>>>> their drawbacks: executing `ANALYZE TABLE` triggers a full table scan,
>>>>> while the latter only updates table and partition stats and can be
>>>>> costly in certain cases.
>>>>>
>>>>> The goal of this proposal is to collect stats incrementally while
>>>>> executing data-changing commands by utilizing the framework introduced
>>>>> in SPARK-21669 <https://issues.apache.org/jira/browse/SPARK-21669>.
>>>>>
>>>>> The SPIP document is attached to the JIRA:
>>>>> https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing
>>>>>
>>>>> Hive also supports automatic collection of statistics to keep the
>>>>> stats consistent. I can find multiple Spark JIRAs asking for the same:
>>>>> https://issues.apache.org/jira/browse/SPARK-28872
>>>>> https://issues.apache.org/jira/browse/SPARK-33825
>>>>>
>>>>> Regards,
>>>>> Rakesh
>>>>
--
Regards,
Chetan
+353899475147
+919665562626
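To make the trade-off described in the proposal concrete, here is a minimal sketch of the two options that exist today, assuming a Hive-backed table `mytable` with columns `c1` and `c2` and a `staging_table` (the table and column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object StatsRefreshToday {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stats-refresh-example")
      .enableHiveSupport()
      // CBO and histograms: column-level stats are what good estimates depend on.
      .config("spark.sql.cbo.enabled", "true")
      .config("spark.sql.statistics.histogram.enabled", "true")
      // Option 1: automatically refresh the table size after data-changing commands
      // (size only; column-level stats are still invalidated).
      .config("spark.sql.statistics.size.autoUpdate.enabled", "true")
      .getOrCreate()

    // A data-changing command; without option 1 it leaves previously collected
    // statistics stale or invalidated.
    spark.sql("INSERT INTO mytable SELECT * FROM staging_table")

    // Option 2: recompute table and column stats explicitly. Accurate,
    // but it triggers a full table scan.
    spark.sql("ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS c1, c2")

    // Inspect the statistics the optimizer will use.
    spark.sql("DESCRIBE EXTENDED mytable").show(truncate = false)

    spark.stop()
  }
}
```

The SPIP, as described above, aims to avoid the full scan of option 2 by collecting these stats incrementally as part of the data-changing commands themselves.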