We are running an ES 1.2.2 server with a Rails application as the client 
(ActiveRecord document model). Some of the documents in the index appear to 
be corrupted: the *id* field of the document is garbled text like 
"JorMcjefSe2_VQkP_ntd8Q" when it is supposed to be an integer value 
according to the mappings.

As an example, here is a document in the index with a corrupted id. Notice 
the garbled document id, and that the _id inside the source is null:

curl -XGET http://localhost:9200/production_restaurants/restaurant/Gu-NGnHtR3ef4V2z4NfNsQ?pretty
{
  "_index" : "production_restaurants_20140714222814907",
  "_type" : "restaurant",
  "_id" : "Gu-NGnHtR3ef4V2z4NfNsQ",
  "_version" : 1,
  "found" : true,
  "_source":{"_id":null,"_type":"restaurant","title":"Wreck Bar and Grill","address":"Rum Point","phone":null,"location_hint":null,"popularity":0,"votes_percent":null,"price":null,"city":null,"state":"KY","zip":null,"city_id":375,"neighborhood_id":54892,"activity":null,"location":{"lat":19.371508,"lon":-81.271523},"closed":false,"neighborhood":{"title":"Grand Cayman","id":54892},"cuisines":[],"tags":[],"dishes":[],"restaurant_path":"http://www.urbanspoon.com"}
}

The corruption seems to be related to document deletion, because these 
documents no longer exist in our MySQL data, which is the source for 
indexing documents in ES. Aside from finding the cause of the corruption, 
I am currently trying to find such bad documents in the index. I am having 
no luck with either a regexp query 
<http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html#query-dsl-regexp-query>
or the missing filter 
<http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_dealing_with_null_values.html#_missing_filter>
applied to the id field. It's a strange situation: because *id* is of type 
*integer* in my index mapping, I cannot apply a regexp query to it, and I 
get a NumberFormatException from Lucene.
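
To make the condition I am after concrete, here is a minimal client-side 
sketch in Python. It assumes the hits have already been fetched somehow 
(for example by scanning the index) and that every healthy document has an 
all-digit _id. The function name suspect_ids and the pattern are 
hypothetical and only for illustration; this is not a server-side query.

```python
import re

# Assumption: a healthy document's _id is the integer primary key from MySQL,
# so it consists of digits only. Anything else is treated as suspect.
INTEGER_ID = re.compile(r"[0-9]+")


def suspect_ids(hits):
    """Return the _id of every hit whose _id is not a plain integer string.

    `hits` is an iterable of dicts shaped like the GET response above,
    i.e. each one carries at least an "_id" key.
    """
    return [
        hit["_id"]
        for hit in hits
        if not INTEGER_ID.fullmatch(str(hit["_id"]))
    ]


if __name__ == "__main__":
    sample = [
        {"_id": "Gu-NGnHtR3ef4V2z4NfNsQ"},  # garbled id from the example above
        {"_id": "12345"},                   # hypothetical healthy id
    ]
    print(suspect_ids(sample))
```

This only filters after the fact, which is why I would still prefer a query 
that lets ES find these documents for me.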

Any suggestions for a query that I could use to find such corrupted 
documents and remove them ahead of time? Right now I have had to be very 
reactive and remove these as I discover them in my Rails logs / error 
reports. Before I consider a full reindex (which is heavy in and of itself), 
I would like to explore what other options I have, including what might be 
the cause of this corruption.

thanks
- anurag  
