Hi, 

This may be a newbie question, but I'm struggling to solve a problem for my 
company, where a MySQL database is used to feed and maintain a cluster of 
Elasticsearch servers that serve our front-end searches.

My problem is: Given a set of document IDs from MySQL, I need to remove 
the corresponding documents from all of our Elasticsearch indices. We 
typically have one index per year, and while each document uses the same ID 
in Elasticsearch as in MySQL, I don't know which index it is stored in.

I started out with the Perl module Search::Elasticsearch::Bulk, whose 
delete_ids() and add_action() methods both seemed to let me delete a large 
number of documents, but both require an index name. I thought I could use 
the same trick as with the _search endpoint and pass a wildcard meaning 
"all indices", e.g. index => 'published-*', but that failed spectacularly:

InvalidIndexNameException[[pulse-*] Invalid index name [published-*], must 
not contain the following characters [\, /, *, ?, ", <, >, |,  , ,]]; at 
/usr/local/lib/site_perl/Cpan/share/perl/5.14.2/Search/Elasticsearch/Role/Bulk.pm line 188.
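
For reference, this is roughly what I was running (simplified; the node 
address and type name below are just placeholders, not our real values):

use strict;
use warnings;
use Search::Elasticsearch;

my $es = Search::Elasticsearch->new( nodes => ['localhost:9200'] );

# This is the call that blew up: the bulk helper does not accept a
# wildcard index the way _search does.
my $bulk = $es->bulk_helper(
    index => 'published-*',
    type  => 'document',          # placeholder type name
);
$bulk->delete_ids(@doc_ids);      # @doc_ids comes from MySQL
$bulk->flush;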

So I couldn't use a wildcard, and I still didn't know the exact index for a 
given document ID. Back to the drawing board.

My next attempt was to determine the index for each document dynamically, 
by doing a lookup before building up the bulk delete. For this I thought I 
could use the _mget endpoint, which seemed similar to _search and thus 
should let me query all the document IDs at once to learn their indices. 
But that didn't work either. Here are the requests I tried in Sense:

GET /published-*/_mget
{
  "docs" : [
    { "_id" : "019001201409294aa579ddb20348cbbf402116c91f6d15_811110" },
    { "_id" : "0190012014092947e9fe351dc78763dabc45231301e9f9_811110" }
  ]
}

GET /_mget
{
  "docs" : [
    { "_id" : "019001201409294aa579ddb20348cbbf402116c91f6d15_811110" },
    { "_id" : "0190012014092947e9fe351dc78763dabc45231301e9f9_811110" }
  ]
}

The first returned an error for each document, while the second didn't 
return anything at all, just an "index is missing" error.

However, when calling the _search endpoint I can use either '/published-*' 
or no index at all and still get a sensible result back, e.g.:

GET /_search
{
  "query": {
    "filtered": {
      "query": {
        "term" : { "_id" : 
"0190012014092947e9fe351dc78763dabc45231301e9f9_811110" }
      }
    }
  }
}


This has left me perplexed: Why can I query one document ID from the 
_search endpoint and get back the index information but not from _mget?

This situation seems to force me to loop over each document ID, possibly 
hundreds of thousands per night, calling the _search endpoint for each ID 
to get its index and then building up the bulk delete.
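
In code, I imagine that would look something like this (a rough sketch 
only, again with a placeholder node address; the query mirrors the _search 
request above):

use strict;
use warnings;
use Search::Elasticsearch;

my $es = Search::Elasticsearch->new( nodes => ['localhost:9200'] );

# No default index on the bulk helper, since it differs per document.
my $bulk = $es->bulk_helper();

for my $id (@doc_ids) {
    # One _search per ID, just to learn which yearly index it lives in.
    my $result = $es->search(
        index => 'published-*',
        body  => {
            query => { term => { _id => $id } },
            size  => 1,
        },
    );
    my $hit = $result->{hits}{hits}[0] or next;   # skip IDs not found

    $bulk->add_action(
        delete => {
            index => $hit->{_index},
            type  => $hit->{_type},
            id    => $id,
        },
    );
}
$bulk->flush;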

Can life really be this difficult?

Are there other mechanisms I can look at that will allow me, for a given 
list of document IDs, to delete the associated documents from unspecified 
Elasticsearch indices?


I'm sorry if this is a trivial question, but I've spent several days poring 
over the Search::Elasticsearch documentation and googling Elasticsearch 
examples without finding another way to get the job done.

Best wishes,
Bernt Rostad
Retriever Norge
