Re: Searching for duplicates during feed ingestion.

Mike Carey Mon, 08 May 2017 12:40:02 -0700

Note that upserts don't avoid searches... (We still need to fetch the old record in order to update the secondary indexes.)
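
To illustrate the point about secondary indexes, here is a minimal sketch using a toy in-memory model. It is not AsterixDB code: the `upsert` function, the dict-based indexes, and the `city` field are all hypothetical names chosen for the example.

```python
# Toy model: a primary index (key -> record) plus one secondary index
# (field value -> set of primary keys). Hypothetical, not AsterixDB code.

def upsert(primary, secondary, key, record, field="city"):
    """Replace the record under `key` and keep the secondary index consistent."""
    old = primary.get(key)  # the search that an upsert cannot skip
    if old is not None:
        # Without the old record we would not know which secondary
        # entry has become stale and must be removed.
        secondary[old[field]].discard(key)
        if not secondary[old[field]]:
            del secondary[old[field]]
    primary[key] = record
    secondary.setdefault(record[field], set()).add(key)


primary, secondary = {}, {}
upsert(primary, secondary, 1, {"city": "Irvine"})
upsert(primary, secondary, 1, {"city": "Riverside"})
print(secondary)  # {'Riverside': {1}}
```

If the lookup of `old` were skipped, the entry under "Irvine" would stay in the secondary index and point at a record that no longer has that value.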


On 5/8/17 12:10 PM, Jianfeng Jia wrote:

Aha, never knew that before. We will definitely try upsert feed next time! 
Thanks for pointing it out!

On May 8, 2017, at 12:07 PM, Ildar Absalyamov <[email protected]> 
wrote:

I believe we already support upsert feeds ;)
https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/test/resources/runtimets/queries/feeds/upsert-feed/upsert-feed.1.ddl.aql
 

On May 8, 2017, at 12:04, Jianfeng Jia <[email protected]> wrote:

I also observe this slowdown every time we re-ingest the Twitter data. One 
difference in our case is that duplicate keys can occur, and we know they are 
indeed duplicate records. To skip the search, we would want “upsert” logic 
( just replace the old one :-) ) instead of an insert.

Then maybe we can add an option to the feed configuration, like

create feed MessageFeed using localfs(
("format"="adm"),
("type-name"="typeX"),
("upsert"="true")
);

to indicate that this feed uses the upsert logic instead of insert.

One thing we need to confirm is whether “upsert” is actually implemented in a 
no-search fashion.
Based on the way we search the components, only the most recent version will be 
returned, so a blind insert should be logically OK. Correct me if I missed 
some other cases (highly likely :-)).
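
The reasoning above can be sketched with a toy LSM model, assuming components are searched from newest to oldest and the first match wins. The names `blind_insert` and `lookup` are hypothetical and only model a primary index.

```python
# Toy LSM index: a list of components, newest first. Hypothetical model only.

def blind_insert(components, key, value):
    """Write into the newest (in-memory) component without searching older ones."""
    components[0][key] = value


def lookup(components, key):
    """Return the value from the most recent component that holds the key."""
    for component in components:  # newest to oldest
        if key in component:
            return component[key]
    return None


components = [{}, {"k1": "old"}]  # [memory component, older disk component]
blind_insert(components, "k1", "new")
print(lookup(components, "k1"))  # new
```

The older version is still physically present in the disk component but is shadowed by the newer one. This covers only the primary index; secondary indexes, raised elsewhere in this thread, are the case where a blind write is not enough.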

On May 8, 2017, at 11:05 AM, Mike Carey <[email protected]> wrote:

+0.99 from me.


On 5/8/17 9:50 AM, Taewoo Kim wrote:

+1 for auto-generated ID case

Best,
Taewoo

On Mon, May 8, 2017 at 8:57 AM, Yingyi Bu <[email protected]> wrote:

Abdullah has a pending change that disables searches if there are no
secondary indexes [1].
Auto-generated IDs could be another case for which we can disable searches
as well.

Best,
Yingyi

[1] https://asterix-gerrit.ics.uci.edu/#/c/1711/


On Mon, May 8, 2017 at 4:30 AM, Wail Alkowaileet <[email protected]>
wrote:

Hi Devs,

I'm noticing that ingestion gets slower over time. I know that is expected
behavior for LSM indexes, but the ingestion rate drops noticeably after
roughly 10 components (~13 GB). I'm not sure whether that is expected.

I tried multiple setups (increasing the memory component size and
max-mergable-component-size). All of them delayed the problem but did not
solve it. The only setting I have never changed is the bloom-filter
false-positive rate (1%), which I want to investigate next.
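
As a rough back-of-envelope for the Bloom filter question, here is a small sketch. It assumes each disk component has its own Bloom filter, that false positives are independent across components, and that the inserted key is truly new. The function names are hypothetical.

```python
def expected_false_searches(num_components, fp_rate):
    """Expected number of components searched needlessly for a new key."""
    return num_components * fp_rate


def prob_any_false_search(num_components, fp_rate):
    """Probability that at least one Bloom filter gives a false positive."""
    return 1.0 - (1.0 - fp_rate) ** num_components


n, p = 10, 0.01  # 10 components, 1% false-positive rate
print(expected_false_searches(n, p))
print(round(prob_any_false_search(n, p), 4))
```

Under these assumptions, 10 components at 1% give about 0.1 wasted component searches per insert, and roughly a 9.6% chance that an insert triggers at least one. Each insert would still probe all 10 filters, so the cost grows with the component count even when no false positive occurs.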

So..
What I want to suggest: when the primary key is auto-generated, why does
AsterixDB look for duplicates? It seems like a wasteful operation to me.
Also, can we give the user the ability to tell the index that all keys are
unique? I know I should not trust the user... but in certain cases the user
is probably certain that the key is unique. Or a more elegant solution can
shine in the end :-)

--

*Regards,*
Wail Alkowaileet

Best regards,
Ildar
