I have a large index (say 500GB) with a large percentage of near-duplicate documents.
I have to keep the documents (I can't delete them) since their metadata is important. Is it possible to get the near-duplicate documents contiguous on disk somehow? Once they are contiguous they should compress very well, which I've already confirmed by writing the exact same document N times.

Ideally I would use two fields: a unique document ID plus a group_id, so that the documents could be co-located on disk by group_id... but I don't think that is possible. Could I instead create a synthetic "id" field for this and assume that "id" order matches the on-disk order in the Lucene index?
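To make the idea concrete, here is roughly what I have in mind. This is just a sketch that assumes an index-time sort hook like IndexWriterConfig.setIndexSort (available in newer Lucene releases) plus a doc-values copy of the group field; the field names, values, and path are placeholders:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

import java.nio.file.Paths;

public class GroupSortedIndexSketch {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());

        // Ask Lucene to keep segments sorted by group_id so that near-duplicate
        // documents sharing a group end up adjacent on disk.
        config.setIndexSort(new Sort(new SortField("group_id", SortField.Type.STRING)));

        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/grouped-index")), config)) {

            Document doc = new Document();
            // Unique per-document identifier, kept so the metadata stays addressable.
            doc.add(new StringField("id", "doc-0001", Field.Store.YES));
            // The sort field has to be indexed as doc values for index sorting to work.
            doc.add(new SortedDocValuesField("group_id", new BytesRef("group-42")));
            // Also index group_id as a normal field so it can be queried directly.
            doc.add(new StringField("group_id", "group-42", Field.Store.YES));
            doc.add(new TextField("body", "near-duplicate content ...", Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}

The point being: if segments are kept sorted by group_id, the near duplicates within a group would land next to each other, and the stored-fields compression should be able to exploit that redundancy the same way it did when I wrote the identical document N times.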