Re: Searching for duplicates during feed ingestion.

Jianfeng Jia Mon, 08 May 2017 12:11:32 -0700

Aha, never knew that before. We will definitely try upsert feed next time! 
Thanks for pointing it out!


> On May 8, 2017, at 12:07 PM, Ildar Absalyamov <[email protected]> wrote:
> 
> I believe we already support upsert feeds ;)
> https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/test/resources/runtimets/queries/feeds/upsert-feed/upsert-feed.1.ddl.aql
>  
>> On May 8, 2017, at 12:04, Jianfeng Jia <[email protected]> wrote:
>> 
>> I also observe this slowdown every time we re-ingest the Twitter data. One 
>> difference in our case is that duplicate keys can occur, and we know each 
>> one is indeed a duplicate record. To skip the search, we would want 
>> “upsert” logic (just replace the old one :-) ) instead of an insert. 
>> 
>> Then maybe we can add an option to the feed configuration, like
>> 
>> create feed MessageFeed using localfs(
>> ("format"="adm"),
>> ("type-name"="typeX"),
>> ("upsert"="true")
>> );
>> 
>> to indicate that this feed uses the upsert logic instead of insert. 
>> 
>> One thing we need to confirm is whether “upsert” is actually implemented in 
>> a no-search fashion. 
>> Based on the way we search the components, only the most recent version of 
>> a record will be returned, so a blind insert should be OK logically. Correct 
>> me if I missed some other cases (highly likely :-)).
>> 
>> 
>>> On May 8, 2017, at 11:05 AM, Mike Carey <[email protected]> wrote:
>>> 
>>> +0.99 from me.
>>> 
>>> 
>>> On 5/8/17 9:50 AM, Taewoo Kim wrote:
>>>> +1 for auto-generated ID case
>>>> 
>>>> Best,
>>>> Taewoo
>>>> 
>>>> On Mon, May 8, 2017 at 8:57 AM, Yingyi Bu <[email protected]> wrote:
>>>> 
>>>>> Abdullah has a pending change that disables searches if there are no
>>>>> secondary indexes [1].
>>>>> An auto-generated ID could be another case for which we can disable
>>>>> searches as well.
>>>>> 
>>>>> Best,
>>>>> Yingyi
>>>>> 
>>>>> [1] https://asterix-gerrit.ics.uci.edu/#/c/1711/
>>>>> 
>>>>> 
>>>>> On Mon, May 8, 2017 at 4:30 AM, Wail Alkowaileet <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> Hi Devs,
>>>>>> 
>>>>>> I'm noticing that ingestion gets slower over time. I know that is
>>>>>> expected behavior for LSM indexes. But what I'm seeing is a noticeable
>>>>>> drop in ingestion rate after roughly 10 components (around 13 GB). I'm
>>>>>> not sure whether that is expected.
>>>>>> 
>>>>>> I tried multiple setups (increasing the memory component size and
>>>>>> max-mergable-component-size). All of them delayed the problem but did
>>>>>> not solve it. The only setting I have never changed is the bloom-filter
>>>>>> false-positive rate (1%), which I want to investigate next.
>>>>>> 
>>>>>> So..
>>>>>> What I want to suggest is this: when the primary key is auto-generated,
>>>>>> why does AsterixDB look for duplicates? It seems a wasteful operation to
>>>>>> me. Also, can we give the user the ability to tell the index that all
>>>>>> keys are unique? I know I should not trust the user, but in certain
>>>>>> cases the user is probably certain that the key is unique. Or a more
>>>>>> elegant solution can shine in the end :-)
>>>>>> 
>>>>>> --
>>>>>> 
>>>>>> *Regards,*
>>>>>> Wail Alkowaileet
>>>>>> 
>>> 
>> 
> 
> Best regards,
> Ildar
>
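
The blind-insert reasoning quoted above can be sketched with a toy model. This is only an illustration, not AsterixDB code: the class ToyLSM, its methods, and the memory_budget parameter are all hypothetical names. It assumes components are searched newest first, and it ignores bloom filters, merges, and secondary indexes (which, per the pending change mentioned in the thread, are a reason a search may still be needed).

```python
class ToyLSM:
    """Hypothetical toy LSM index: one mutable memory component plus
    immutable flushed components, searched newest first."""

    def __init__(self, memory_budget=2):
        self.memory = {}                  # current in-memory component
        self.disk = []                    # flushed components, newest first
        self.memory_budget = memory_budget
        self.probes = 0                   # component lookups made by writes

    def _flush_if_full(self):
        # Flush the memory component once it reaches the budget.
        if len(self.memory) >= self.memory_budget:
            self.disk.insert(0, self.memory)
            self.memory = {}

    def get(self, key):
        # Newest component first, so the latest version of a key wins.
        for component in [self.memory] + self.disk:
            if key in component:
                return component[key]
        return None

    def insert(self, key, value):
        # Uniqueness check: a new key forces a probe of every component.
        for component in [self.memory] + self.disk:
            self.probes += 1
            if key in component:
                raise KeyError(f"duplicate key: {key!r}")
        self.memory[key] = value
        self._flush_if_full()

    def upsert_blind(self, key, value):
        # No search at all: older versions are simply shadowed.
        self.memory[key] = value
        self._flush_if_full()


index = ToyLSM(memory_budget=2)
for i in range(6):
    index.insert(i, f"v{i}")              # probe count grows with components
index.upsert_blind(0, "new")              # adds no probes
print(index.probes, index.get(0))
```

In this sketch the cost of each checked insert grows with the number of flushed components, while the blind upsert does no lookups and a later read still sees the newest value.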
