Re: Searching for duplicates during feed ingestion.

Ildar Absalyamov Mon, 08 May 2017 12:08:04 -0700

I believe we already support upsert feeds ;)
https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/test/resources/runtimets/queries/feeds/upsert-feed/upsert-feed.1.ddl.aql
 
<https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/test/resources/runtimets/queries/feeds/upsert-feed/upsert-feed.1.ddl.aql>
> On May 8, 2017, at 12:04, Jianfeng Jia <[email protected]> wrote:
> 
> I also see this slowdown every time we re-ingest the Twitter data. One 
> difference is that duplicate keys can occur in our case, and we know they 
> really are duplicate records. To skip the search, we would want “upsert” 
> logic (just replace the old one :-) ) instead of an insert. 
> 
> Then maybe we can add an option to the feed configuration, like
> 
> create feed MessageFeed using localfs(
> ("format"="adm"),
> ("type-name"="typeX"),
> ("upsert"="true")
> );
> 
> to indicate that this feed uses upsert logic instead of insert. 
> 
> One thing we need to confirm is whether “upsert” is actually implemented in 
> a no-search fashion. 
> Given the way we search the components, only the most recent version of a 
> key is returned, so a blind insert should be logically OK. Correct me if I 
> missed some other cases (highly likely :-)).
> 
> 
>> On May 8, 2017, at 11:05 AM, Mike Carey <[email protected]> wrote:
>> 
>> +0.99 from me.
>> 
>> 
>> On 5/8/17 9:50 AM, Taewoo Kim wrote:
>>> +1 for the auto-generated ID case
>>> 
>>> Best,
>>> Taewoo
>>> 
>>> On Mon, May 8, 2017 at 8:57 AM, Yingyi Bu <[email protected]> wrote:
>>> 
>>>> Abdullah has a pending change that disables searches if there are no
>>>> secondary indexes [1].
>>>> Auto-generated IDs could be another case for which we can disable
>>>> searches as well.
>>>> 
>>>> Best,
>>>> Yingyi
>>>> 
>>>> [1] https://asterix-gerrit.ics.uci.edu/#/c/1711/
>>>> 
>>>> 
>>>> On Mon, May 8, 2017 at 4:30 AM, Wail Alkowaileet <[email protected]>
>>>> wrote:
>>>> 
>>>>> Hi Devs,
>>>>> 
>>>>> I'm noticing that ingestion gets slower over time. I know that is
>>>>> expected behavior for LSM indexes, but the ingestion rate drops
>>>>> noticeably after roughly 10 components (around 13 GB). I'm not sure
>>>>> whether that is expected.
>>>>> 
>>>>> I tried multiple setups (increasing the memory component size and
>>>>> max-mergable-component-size). All of them delayed the problem but did
>>>>> not solve it. The only setting I have not changed is the bloom-filter
>>>>> false-positive rate (1%), which I want to investigate next.
>>>>> 
>>>>> So..
>>>>> What I want to ask is: when the primary key is auto-generated, why does
>>>>> AsterixDB look for duplicates? It seems a wasteful operation to me.
>>>>> Also, can we give the user the ability to tell the index that all keys
>>>>> are unique? I know I should not trust the user .. but in certain cases
>>>>> the user is probably certain that the key is unique. Or a more elegant
>>>>> solution can shine in the end :-)
>>>>> 
>>>>> --
>>>>> 
>>>>> *Regards,*
>>>>> Wail Alkowaileet
>>>>> 
>> 
>
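
To make the cost difference discussed above concrete, here is a toy Python model of an LSM index. It is only a sketch and not AsterixDB code: the class ToyLSM, its methods, and the flush_threshold parameter are all made up for illustration. Bloom filters, merges, and secondary indexes are left out, so each "probe" just stands for one component that a duplicate check has to consider. Whether the real upsert path avoids the search like upsert_blind does here is exactly the open question in the thread.

```python
class ToyLSM:
    """Toy LSM index: one in-memory component plus flushed components."""

    def __init__(self, flush_threshold=2):
        self.memory = {}                  # newest, mutable component
        self.disk = []                    # flushed components, newest first
        self.flush_threshold = flush_threshold
        self.probes = 0                   # components examined by duplicate checks

    def _maybe_flush(self):
        # Hypothetical flush policy: flush once the memory component is full.
        if len(self.memory) >= self.flush_threshold:
            self.disk.insert(0, self.memory)
            self.memory = {}

    def insert(self, key, value):
        """Insert with a uniqueness check: every component must be consulted."""
        for component in [self.memory, *self.disk]:
            self.probes += 1
            if key in component:
                raise KeyError(f"duplicate key: {key}")
        self.memory[key] = value
        self._maybe_flush()

    def upsert_blind(self, key, value):
        """Blind write: no search; the newest entry shadows older ones."""
        self.memory[key] = value
        self._maybe_flush()

    def get(self, key):
        """Search newest-first, so only the most recent version is returned."""
        for component in [self.memory, *self.disk]:
            if key in component:
                return component[key]
        return None


checked = ToyLSM()
for k in range(6):
    checked.insert(k, "v")

blind = ToyLSM()
for k in range(6):
    blind.upsert_blind(k, "old")
blind.upsert_blind(0, "new")

print(checked.probes, blind.probes, blind.get(0))
```

In this toy the work done by insert grows with the number of components, while upsert_blind does no lookups and a later get still returns the newest value. A real bloom filter makes most of those probes cheap, but at a 1% false-positive rate roughly 10 x 0.01 = 0.1 unnecessary component reads per new key would still be expected with 10 components.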


Best regards,
Ildar
