Hey Benedict,

Have you tried creating indices on your segments table? I’ve managed Druid 
clusters with orders of magnitude more segments without this issue by indexing 
key filter columns. (The coordinator is still a painful bottleneck, just not 
due to query times to the metadata server 😛)
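Julian's suggestion can be sketched as follows, using sqlite3 as a stand-in for the MySQL/PostgreSQL metadata store. The column names mirror Druid's druid_segments table but are assumptions for illustration, not the exact schema; the index name is hypothetical.

```python
# Sketch of indexing the key filter columns of the segments table,
# using sqlite3 in place of MySQL/PostgreSQL. Column names are
# assumed, not the exact Druid schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE druid_segments (
        id TEXT PRIMARY KEY,
        dataSource TEXT NOT NULL,
        used INTEGER NOT NULL,
        payload BLOB
    )
""")
# Index the columns the coordinator filters on (e.g. used = 1), so the
# metadata poll becomes an index search instead of a full table scan.
conn.execute(
    "CREATE INDEX idx_segments_used ON druid_segments (used, dataSource)"
)

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT payload FROM druid_segments WHERE used = 1"
).fetchall()
print(plan)  # the plan reports a SEARCH using idx_segments_used
```

The same idea applies directly in MySQL/PostgreSQL with CREATE INDEX on the real druid_segments table.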

Best,
Julian

> On Apr 6, 2021, at 8:53 PM, Benedict Jin <asdf2...@apache.org> wrote:
> 
> Hi Jihoon Son,
> 
> Yes, it does bring some compatibility issues. I just checked the latest 
> metadata: at present, the total number of records in the metadata table is 
> five million, of which nearly half are marked as used, and the physical 
> resources of the machine storing the metadata are relatively idle.
> 
> Regards,
> Benedict Jin
> 
>> On 2021/04/07 02:35:32, Jihoon Son <jihoon...@apache.org> wrote: 
>> For this sort of issue, we should first consider whether there is any
>> other way to address the same problem without modifying the metadata
>> table schema, because schema changes introduce compatibility issues,
>> such as the upgrade path for existing users.
>> 
>> Benedict, as Samarth and Lucas pointed out, it would be nice if you
>> shared more details of exactly where the bottleneck is. That would make
>> the problem clearer and get everyone on the same page.
>> 
>>> On Tue, Apr 6, 2021 at 6:54 PM Benedict Jin <asdf2...@apache.org> wrote:
>>> 
>>> Hi Ben Krug,
>>> 
>>> +1 for adding the is_deleted column; we can then create a scheduled 
>>> trigger to purge these old records.
>>> 
>>> Regards,
>>> Benedict Jin
>>> 
>>> On 2021/04/06 18:28:45, Ben Krug <ben.k...@imply.io> wrote:
>>>> Oh, that's easier than tombstones: flag is_deleted and update the
>>>> timestamp (so it gets pulled again).
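The soft-delete idea above can be sketched like this, again with sqlite3 standing in for the metadata store. The is_deleted and last_update columns are the assumed additions under discussion, not existing Druid schema.

```python
# Soft-delete sketch: instead of DELETE, flip an is_deleted flag and
# bump the row's timestamp, so an incremental poll
# (WHERE last_update > :since) still picks the change up.
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE druid_segments (
        id TEXT PRIMARY KEY,
        is_deleted INTEGER NOT NULL DEFAULT 0,
        last_update REAL NOT NULL
    )
""")
conn.execute(
    "INSERT INTO druid_segments VALUES ('seg-1', 0, ?)", (time.time(),)
)

last_poll = time.time()  # the coordinator has seen everything up to here

# Soft-delete: mark the row and refresh its timestamp in one statement.
conn.execute(
    "UPDATE druid_segments SET is_deleted = 1, last_update = ? "
    "WHERE id = 'seg-1'",
    (time.time() + 1,),  # +1s for illustration; real clocks advance anyway
)

changed = conn.execute(
    "SELECT id, is_deleted FROM druid_segments WHERE last_update > ?",
    (last_poll,),
).fetchall()
print(changed)  # [('seg-1', 1)] -- the deletion is visible incrementally
```

This is what makes the flag easier than tombstones: the deletion rides the same incremental query as every other change.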
>>>> 
>>>> On Tue, Apr 6, 2021 at 10:48 AM Tijo Thomas <tijothoma...@gmail.com> wrote:
>>>> 
>>>>> Abhishek,
>>>>> Good point.  Do we need one more column to store whether it's deleted or not?
>>>>> 
>>>>> On Tue, Apr 6, 2021 at 4:32 PM Abhishek Agarwal <abhishek.agar...@imply.io> wrote:
>>>>> 
>>>>>> If an entry is deleted from the metadata, how is the coordinator going to
>>>>>> update its own state?
>>>>>> 
>>>>>> On Tue, Apr 6, 2021 at 3:38 PM Itai Yaffe <itai.ya...@gmail.com> wrote:
>>>>>> 
>>>>>>> Hey,
>>>>>>> I'm not a Druid developer, so it's quite possible I'm missing many
>>>>>>> considerations here, but at first glance I like your proposal, as it
>>>>>>> resembles the *tsColumn* in JDBC lookups (
>>>>>>> https://druid.apache.org/docs/latest/development/extensions-core/lookups-cached-global.html#jdbc-lookup
>>>>>>> ).
>>>>>>> 
>>>>>>> Anyway, just my 2 cents.
>>>>>>> 
>>>>>>> Thanks!
>>>>>>>          Itai
>>>>>>> 
>>>>>>> On Tue, Apr 6, 2021 at 6:07 AM Benedict Jin <asdf2...@apache.org> wrote:
>>>>>>> 
>>>>>>>> Hi all,
>>>>>>>> 
>>>>>>>> Recently, when the Coordinator in our company's Druid cluster pulls
>>>>>>>> metadata, there is a performance bottleneck. The main reason is the huge
>>>>>>>> amount of metadata, which makes scanning the full metadata table and
>>>>>>>> deserializing the results very slow. We have already reduced the size of
>>>>>>>> the full metadata through TTL, compaction, rollup, etc., but the effect
>>>>>>>> is not very significant. Therefore, I want to design a scheme for the
>>>>>>>> Coordinator to pull metadata incrementally, so that each time it pulls
>>>>>>>> only newly added metadata, reducing both the query pressure on the
>>>>>>>> metadata store and the cost of deserializing metadata. The general idea
>>>>>>>> is to add a last_update column to the druid_segments table to record the
>>>>>>>> update time of each record. When querying the metadata table, we can
>>>>>>>> then filter on the last_update column to avoid full table scans.
>>>>>>>> Moreover, both MySQL and PostgreSQL, as metadata storage media, support
>>>>>>>> automatic updating of a timestamp column, which is somewhat similar to a
>>>>>>>> trigger. So, have you encountered this problem before? If so, how did
>>>>>>>> you solve it? In addition, do you have any suggestions or comments on
>>>>>>>> the above incremental acquisition of metadata? Please let me know,
>>>>>>>> thanks a lot.
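The incremental pull described above can be sketched as a watermark loop: remember the newest last_update seen, and on each poll fetch only rows past it. sqlite3 stands in for the metadata store here, and the timestamps are set manually because sqlite3 has no equivalent of MySQL's auto-updating TIMESTAMP column; the poll helper is hypothetical, not Druid code.

```python
# Watermark-based incremental pull: each poll fetches only rows whose
# last_update is newer than the last watermark, then advances it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE druid_segments "
    "(id TEXT PRIMARY KEY, last_update INTEGER NOT NULL)"
)

def poll(watermark):
    """Return rows changed since `watermark`, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, last_update FROM druid_segments "
        "WHERE last_update > ? ORDER BY last_update",
        (watermark,),
    ).fetchall()
    return rows, max([ts for _, ts in rows], default=watermark)

conn.execute("INSERT INTO druid_segments VALUES ('seg-1', 100)")
rows, wm = poll(0)      # first poll: full pull, watermark advances to 100
conn.execute("INSERT INTO druid_segments VALUES ('seg-2', 200)")
rows2, wm2 = poll(wm)   # second poll: only the new row comes back
print(rows2, wm2)  # [('seg-2', 200)] 200
```

In MySQL the last_update column could be declared as a TIMESTAMP with ON UPDATE CURRENT_TIMESTAMP so updates bump it automatically; PostgreSQL would need a trigger for the same effect.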
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Benedict Jin
>>>>>>>> 
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
>>>>>>>> For additional commands, e-mail: dev-h...@druid.apache.org
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Thanks & Regards
>>>>> Tijo Thomas
>>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
> 
> 

