Hi Adam

About the issue where indexes go out of sync with the data: yes, we see this 
often enough in customer environments and even in our own environments. The 
root cause is that JanusGraph does not roll back a transaction even if the 
index commit fails.

When this happens, users see that their data is retrieved using Advanced Search 
but does not show up when searched via Basic Search.

A few months back we added the ability to repair the index as part of our Java 
Patch framework. This approach has a much higher throughput than the 
index-repair utility.

JIRA: https://issues.apache.org/jira/browse/ATLAS-4015
Commit ids:

  *   c810e47a4
  *   1e06e372e

Set these in atlas-application.properties:
atlas.patch.numWorkers=<number of cores - 1> * 2
atlas.patch.batchSize=1000
atlas.rebuild.index=true
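
As a rough illustration of filling in the numWorkers placeholder above, here is a minimal Python sketch. It assumes the placeholder means (cores - 1) * 2 and adds a floor of one worker; both are assumptions, not something stated above. The helper name `patch_settings` is hypothetical; only the three property names come from this email.

```python
import os


def patch_settings(cores=None, batch_size=1000):
    """Return the three property lines for a host with the given core count."""
    if cores is None:
        # os.cpu_count() may return None, so fall back to a single core.
        cores = os.cpu_count() or 1
    # Assumption: "<number of cores - 1> * 2" is read as (cores - 1) * 2.
    # Assumption: keep at least one worker on single-core hosts.
    workers = max(1, (cores - 1) * 2)
    return [
        f"atlas.patch.numWorkers={workers}",
        f"atlas.patch.batchSize={batch_size}",
        "atlas.rebuild.index=true",
    ]


if __name__ == "__main__":
    # Print lines that can be pasted into atlas-application.properties.
    print("\n".join(patch_settings()))
```

Under that reading, an 8-core host would get atlas.patch.numWorkers=14.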

We are brainstorming on a more proactive solution to this problem. So far, we 
don’t have a design.

Best regards,

~ ashutosh
Ashutosh Mestry<mailto:[email protected]> . Cloudera, Inc.

From: Adam Bellemare <[email protected]>
Date: Tuesday, June 22, 2021 at 7:43 AM
To: [email protected] <[email protected]>
Cc: Olessia D'Souza <[email protected]>, Karl Taylor 
<[email protected]>, Nargiza Sarkulova <[email protected]>
Subject: Entities missing from ES/Solr index - why?

Hi Folks

After bulk loading large amounts of entities (Atlas 2.1.0, Solr), we often
have a number of them missing from the Solr Index. We are still able to
find these entities when we do Advanced Search on HBase directly, but they
are not in the Solr index.

From https://atlas.apache.org/2.0.0/AtlasRepairIndex.html:

"In rare, cases it is possible that during entity creation, the entity is
stored in the data store, but the *corresponding indexes are not created in
Solr*. Since Atlas relies heavily on Solr in the operation of its Basic
Search, this will result in entity not being returned by a search."

Why does this occur? Is this a race condition somewhere?
Is there an open issue for this? I cannot find one in Atlas JIRA.

How "rare" is this? We are seeing this frequently in our local dev
environment (as in, missing 100s of records when bulk uploading 100k+
entities), but we have also identified that resource constraints appear to
be a factor (e.g. CPU starvation), as well as a series of poor default
configurations. Is this the same root cause as the missing Solr indices?

Thanks
Adam
