Found the answer after another day of trial and error. For those interested, 
the key is the "ids" query in a regular _search request:

GET /published-*/_search
{
  "query" : {
    "ids" : {
      "values" : ["019001201409294aa579ddb20348cbbf402116c91f6d15_811110", 
"0190012014092947e9fe351dc78763dabc45231301e9f9_811110"]
    }
  }
}

This allows me to fetch the index name for each of the document IDs :)
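For anyone wanting to script this, here is a minimal Python sketch of the round trip: build the ids query, then turn the search hits into bulk-API delete actions. The function names and the sample response are mine; the response shape is what Elasticsearch 1.x returns for the query above.

```python
import json

def ids_query(doc_ids):
    # Build the _search body shown above (ES 1.x "ids" query).
    return {"query": {"ids": {"values": list(doc_ids)}}}

def bulk_delete_actions(search_response):
    # Each hit carries _index, _type and _id, which is exactly the
    # metadata a bulk delete action line needs (ES 1.x response shape
    # assumed; sample data below is hypothetical).
    actions = []
    for hit in search_response["hits"]["hits"]:
        actions.append({"delete": {
            "_index": hit["_index"],
            "_type": hit["_type"],
            "_id": hit["_id"],
        }})
    return actions

# A trimmed, made-up response in the shape ES 1.x returns:
response = {"hits": {"hits": [
    {"_index": "published-2014", "_type": "article",
     "_id": "019001201409294aa579ddb20348cbbf402116c91f6d15_811110"},
]}}

# Newline-delimited body ready to POST to /_bulk:
payload = "\n".join(json.dumps(a) for a in bulk_delete_actions(response)) + "\n"
```

With the IDs batched into one _search per chunk, this avoids a per-document round trip before the bulk delete.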
bernt


On Wednesday, October 15, 2014 3:09:45 PM UTC+2, Bernt Rostad wrote:
>
> Hi, 
>
> This may be a newbie question, but I'm struggling to solve a problem for my 
> company, where a MySQL database feeds and maintains a cluster of 
> Elasticsearch servers used for front-end searches.
>
> My problem is: given a set of document IDs from MySQL, I need to remove 
> those documents from all of our Elasticsearch indices. We typically have 
> one index per year, and although each document keeps the same ID in 
> Elasticsearch as in MySQL, I don't know which index it is stored in.
>
> I started out trying the Perl module Search::Elasticsearch::Bulk, whose 
> delete_ids() and add_action() methods both seemed to let me delete a large 
> number of documents, but both require an index name. I thought I could use 
> the same trick as with the _search endpoint and pass a wildcard to mean 
> "all indices", e.g. index => 'published-*', but that failed spectacularly:
>
> InvalidIndexNameException[[pulse-*] Invalid index name [published-*], must 
> not contain the following characters [\, /, *, ?, ", <, >, |,  , ,]];  at 
> /usr/local/lib/site_perl/Cpan/share/perl/5.14.2/Search/Elasticsearch/Role/Bulk.pm
>  
> line 188.
>
> So, I couldn't use a wildcard, and I still didn't know the exact index for 
> a given document ID. I was back at the drawing board.
>
> My next attempt was to determine the index for each document dynamically, 
> by doing a search before building the bulk delete. For this I thought I 
> could use the _mget endpoint, which seemed similar to _search and thus 
> would let me query all the document IDs to learn their indices. But that 
> didn't work either. Here are the commands I tried running in Sense:
>
> GET /published-*/_mget
> {
>   "docs" : [
>     { "_id" : "019001201409294aa579ddb20348cbbf402116c91f6d15_811110" },
>     { "_id" : "0190012014092947e9fe351dc78763dabc45231301e9f9_811110" }
>   ]
> }
>
> GET /_mget
> {
>   "docs" : [
>     { "_id" : "019001201409294aa579ddb20348cbbf402116c91f6d15_811110" },
>     { "_id" : "0190012014092947e9fe351dc78763dabc45231301e9f9_811110" }
>   ]
> }
>
> The first returned an error for each document, while the second returned 
> nothing at all, just an "index is missing" error.
>
> However, when calling the _search endpoint I can either use '/published-*' 
> or no index info at all and still get a sensible result back, e.g.:
>
> GET /_search
> {
>   "query": {
>     "filtered": {
>       "query": {
>         "term" : { "_id" : 
> "0190012014092947e9fe351dc78763dabc45231301e9f9_811110" }
>       }
>     }
>   }
> }
>
>
> This has left me perplexed: Why can I query one document ID from the 
> _search endpoint and get back the index information but not from _mget?
>
> This situation seems to force me to loop over each document ID, possibly 
> hundreds of thousands per night, calling the _search endpoint for each ID 
> to get the index information and then building up the bulk delete.
>
> Can life really be this difficult?
>
> Are there other mechanisms I can look at that will allow me, for a given 
> list of document IDs, to delete the associated documents from unspecified 
> Elasticsearch indices?
>
>
> I'm sorry if this was a trivial question, but I've spent several days 
> poring over the Search::Elasticsearch documentation and googling 
> Elasticsearch examples without finding other ways to get the job done.
>
> Best wishes,
> Bernt Rostad
> Retriever Norge
>
