Re: Searching for duplicates during feed ingestion.

Jianfeng Jia Mon, 08 May 2017 13:37:59 -0700

Got the point now…
I would imagine that if the record had a version number, that could potentially 
solve some problems here. However, it would be a totally different story then.


> On May 8, 2017, at 12:39 PM, Mike Carey <[email protected]> wrote:
> 
> Note that upserts don't avoid searches.... (Still need to get the old record 
> to update secondary indexes from.)
> 
> 
> On 5/8/17 12:10 PM, Jianfeng Jia wrote:
>> Aha, never knew that before. We will definitely try upsert feed next time! 
>> Thanks for pointing it out!
>> 
>>> On May 8, 2017, at 12:07 PM, Ildar Absalyamov <[email protected]> 
>>> wrote:
>>> 
>>> I believe we already support upsert feeds ;)
>>> https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/test/resources/runtimets/queries/feeds/upsert-feed/upsert-feed.1.ddl.aql
>>>  
>>>> On May 8, 2017, at 12:04, Jianfeng Jia <[email protected]> wrote:
>>>> 
>>>> I also observe this slowdown every time we re-ingest the Twitter data. 
>>>> One difference is that duplicate keys can occur in our case, and we know 
>>>> each one is indeed a duplicate record. To skip the search, we would expect 
>>>> “upsert” logic ( just replace the old one :-) ) instead of an insert.
>>>> 
>>>> Then maybe we could add an option to the feed configuration, like
>>>> 
>>>> create feed MessageFeed using localfs(
>>>> ("format"="adm"),
>>>> ("type-name"="typeX"),
>>>> ("upsert"="true")
>>>> );
>>>> 
>>>> to indicate that this feed uses upsert logic instead of insert.
>>>> 
>>>> One thing we need to confirm is whether “upsert” is actually implemented 
>>>> in a no-search fashion.
>>>> Based on the way we search the components, only the most recent one 
>>>> will be returned, so a blind insert should be OK logically. Correct me 
>>>> if I missed some other cases (highly likely :-)).
>>>> 
>>>> 
>>>>> On May 8, 2017, at 11:05 AM, Mike Carey <[email protected]> wrote:
>>>>> 
>>>>> +0.99 from me.
>>>>> 
>>>>> 
>>>>> On 5/8/17 9:50 AM, Taewoo Kim wrote:
>>>>>> +1 for auto-generated ID case
>>>>>> 
>>>>>> Best,
>>>>>> Taewoo
>>>>>> 
>>>>>> On Mon, May 8, 2017 at 8:57 AM, Yingyi Bu <[email protected]> wrote:
>>>>>> 
>>>>>>> Abdullah has a pending change that disables searches if there are no
>>>>>>> secondary indexes [1].
>>>>>>> Auto-generated IDs could be another case for which we can disable
>>>>>>> searches as well.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Yingyi
>>>>>>> 
>>>>>>> [1] https://asterix-gerrit.ics.uci.edu/#/c/1711/
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, May 8, 2017 at 4:30 AM, Wail Alkowaileet <[email protected]>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi Devs,
>>>>>>>> 
>>>>>>>> I'm noticing that ingestion gets slower over time. I know that is
>>>>>>>> expected behavior for LSM indexes, but the ingestion rate drops
>>>>>>>> noticeably once there are roughly 10 components (about 13 GB). I'm
>>>>>>>> not sure whether that is expected.
>>>>>>>> 
>>>>>>>> I tried multiple setups (increasing the memory component size and
>>>>>>>> max-mergable-component-size). All of them delayed the problem but did
>>>>>>>> not solve it. The only setting I have never changed is the bloom-filter
>>>>>>>> false-positive rate (1%), which I want to investigate next.
>>>>>>>> 
>>>>>>>> So..
>>>>>>>> What I want to suggest is this: when the primary key is auto-generated,
>>>>>>>> why does AsterixDB look for duplicates? It seems a wasteful operation
>>>>>>>> to me. Also, can we give the user the ability to tell the index that
>>>>>>>> all keys are unique? I know I should not trust the user .. but in
>>>>>>>> certain cases the user is probably certain that the key is unique. Or
>>>>>>>> a more elegant solution can shine in the end :-)
>>>>>>>> 
>>>>>>>> --
>>>>>>>> 
>>>>>>>> *Regards,*
>>>>>>>> Wail Alkowaileet
>>>>>>>> 
>>> Best regards,
>>> Ildar
>>> 
>
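
The trade-off discussed in this thread can be illustrated with a toy model: an
insert that checks uniqueness has to search the existing LSM components, a
blind write does not, and an upsert still has to search when secondary indexes
need the old record. The Python sketch below is not AsterixDB code. The class
ToyLSMIndex, its methods, and the "user" field are hypothetical names made up
for illustration; components are plain dicts, and bloom filters and merges are
left out.

```python
class ToyLSMIndex:
    """Toy LSM primary index (hypothetical, for illustration only).

    One mutable in-memory component plus a list of immutable flushed
    components, newest first. No bloom filters, no merge policy.
    """

    def __init__(self, memory_budget=2):
        self.memory = {}              # current memory component
        self.disk = []                # flushed components, newest first
        self.memory_budget = memory_budget
        self.components_probed = 0    # how many components searches touched

    def search(self, key):
        """Return the newest record for key, or None if it is absent."""
        for component in [self.memory] + self.disk:
            self.components_probed += 1
            if key in component:
                return component[key]
        return None

    def blind_write(self, key, record):
        """Write without looking at older components."""
        self.memory[key] = record
        if len(self.memory) >= self.memory_budget:
            # Flush: the full memory component becomes the newest disk one.
            self.disk.insert(0, self.memory)
            self.memory = {}

    def insert(self, key, record):
        """Insert with a uniqueness check, so a search comes first."""
        if self.search(key) is not None:
            raise KeyError("duplicate primary key: %r" % (key,))
        self.blind_write(key, record)

    def upsert(self, key, record, secondary=None):
        """Replace any old version of the record.

        secondary is an optional set of (user, key) pairs standing in for
        a secondary index. Maintaining it requires the old record, so the
        search is only skipped when there is no secondary index.
        """
        if secondary is not None:
            old = self.search(key)
            if old is not None:
                secondary.discard((old["user"], key))
            secondary.add((record["user"], key))
        self.blind_write(key, record)
```

In this model the number of components probed by each checked insert grows
with the number of flushed components, which matches the slowdown pattern
described at the start of the thread. A real system would reduce that cost
with bloom filters and merges, which this sketch deliberately omits.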
