Found the answer after another day of trial & errors. For those interested,
the clue is the "ids" parameter in a regular _search query:
GET /published-*/_search
{
"query" : {
"ids" : {
"values" : ["019001201409294aa579ddb20348cbbf402116c91f6d15_811110",
"0190012014092947e9fe351dc78763dabc45231301e9f9_811110"]
}
}
}
This allows me to fetch index data for for each of the document IDs :)
bernt
On Wednesday, October 15, 2014 3:09:45 PM UTC+2, Bernt Rostad wrote:
>
> Hi,
>
> This may be a newbie question but I'm struggling to solve a problem for my
> company in which a MySql database is used to feed and maintain a cluster of
> Elasticsearch servers used for front-end searches.
>
> My problem is: Given a set of document IDs from MySql, I need to remove
> those documents from all of our Elasticsearch indices. We typically have
> one index per year and though I know the document ID, which has the same ID
> in Elasticsearch, I don't know the index it is stored in.
>
> I started out trying the Perl module Search::Elasticsearch::Bulk which
> offered the delete_ids() and add_action() methods, which both seemed to
> allow me to delete a large number of documents, but requiring an index
> name. I thought I could do the same trick as with the _search endpoint and
> use a wildcard to indicate "all indices", e.g. index => 'published-*', but
> that failed spectacularly:
>
> InvalidIndexNameException[[pulse-*] Invalid index name [published-*], must
> not contain the following characters [\, /, *, ?, ", <, >, |, , ,]]; at
> /usr/local/lib/site_perl/Cpan/share/perl/5.14.2/Search/Elasticsearch/Role/Bulk.pm
>
> line 188.
>
> So, I couldn't use a wildcard and I still didn't know the exact index for
> a given document ID. I was back to the drawing board again.
>
> My next attempt was to dynamically decide the index for each document, by
> doing a search before building up the bulk delete. For this I thought I
> could use the _mget endpoint, which seemed similar to _search and thus
> would allow me to query all the document IDs to learn their indices. But
> that didn't work either. Here I've copied the commands I tried running in
> Sense:
>
> GET /published-*/_mget
> {
> "docs" : [
> { "_id" : "019001201409294aa579ddb20348cbbf402116c91f6d15_811110" },
> { "_id" : "0190012014092947e9fe351dc78763dabc45231301e9f9_811110" }
> ]
> }
>
> GET /_mget
> {
> "docs" : [
> { "_id" : "019001201409294aa579ddb20348cbbf402116c91f6d15_811110" },
> { "_id" : "0190012014092947e9fe351dc78763dabc45231301e9f9_811110" }
> ]
> }
>
> The first only returned errors with each document while the second didn't
> return anything, just issued an "index is missing" error.
>
> However, when calling the _search endpoint I can either use '/published-*'
> or no index info at all and still get a sensible result back, e.g.:
>
> GET /_search
> {
> "query": {
> "filtered": {
> "query": {
> "term" : { "_id" :
> "0190012014092947e9fe351dc78763dabc45231301e9f9_811110" }
> }
> }
> }
> }
>
>
> This has left me perplexed: Why can I query one document ID from the
> _search endpoint and get back the index information but not from _mget?
>
> This situation seems to force me to loop over each document ID, of
> possibly hundreds of thousand per night, calling the _search endpoint for
> each ID to get the index information and then build up the bulk delete.
>
> Can life really be this difficult?
>
> Are there other mechanisms I can look at that will allow me, for a given
> list of document IDs, to delete the associated documents from unspecified
> Elasticsearch indices?
>
>
> I'm sorry if this was a trivial question but I've spent several days
> pouring over the Search::Elasticsearch documentation and googling
> Elasticsearch examples without finding other ways to get the job done.
>
> Best wishes,
> Bernt Rostad
> Retriever Norge
>
--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/bce09ee4-9742-46ac-9c98-7ce274b95818%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.