I have a large index (say 500GB) with a large percentage of near-duplicate
documents.

I have to keep the documents there (can't delete them) as the metadata is
important.

Is it possible to get the near-duplicate documents stored contiguously on disk somehow?

Once they are contiguous they will compress very well, which I've already
confirmed by writing the exact same document N times.

Ideally I could use two fields: a unique document ID plus a group_id, so that
the documents are co-located on disk by group_id... but I don't think this is
possible.
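
Roughly what I have in mind, as a sketch: use Lucene's index sorting
(IndexWriterConfig.setIndexSort, available since 6.2) on a group_id
doc-values field. The class name and field names ("id", "group_id", "body")
are just made up for illustration:

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class GroupSortedIndexSketch {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());

    // Ask Lucene to keep each segment sorted by group_id, so documents in
    // the same group end up adjacent in the stored-fields files and can
    // compress against each other.
    config.setIndexSort(new Sort(new SortField("group_id", SortField.Type.STRING)));

    try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("/path/to/index")), config)) {
      Document doc = new Document();
      doc.add(new StringField("id", "doc-123", Field.Store.YES));        // unique document ID
      doc.add(new StringField("group_id", "group-42", Field.Store.YES)); // group of near-duplicates
      doc.add(new SortedDocValuesField("group_id", new BytesRef("group-42"))); // doc values backing the sort
      doc.add(new StoredField("body", "... near-duplicate content ..."));
      writer.addDocument(doc);
    }
  }
}

As I understand it, the unique "id" stays a normal field for lookups, and the
index sort only controls the on-disk document order within each segment
(merges preserve the sort).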

Failing that, can I just create a synthetic "id" field for this and assume
that "id" is ordered on disk in the Lucene index?


