1- Has anyone already written an extension to Lucene so that the 'stored' string for each document/field is in fact kept in a centralized repository? Meaning, only an 'index' (a reference) is actually stored in the document and the real data is put somewhere else.
2- If not, how hard would it be to write such an extension? Which classes would need to be modified? FSDirectory? Document?
3- Any ideas on how else I could do this? I'm fully open to discussion!

It's easy. If I understand what you need, you need some sort of simple dictionary compression. Write your own Analyzer that is constructed with a HashMap<YOUR_STRING, Integer> and have it replace tokens with Integers (you could use VInts later to save some more space). Fill this HashMap with the unique terms from your field; if there are too many of them, encode only the most frequent.

You have the following transformations in this case:
Document.Field == "Array of Strings"
-> put it through your analyzer
-> you get Document.Field == "Array of Integers" (presumably more space-efficient for your case?)
-> store the Ints as VInts to save a few more bits

Later you will need an array or HashMap to map the ints back to tokens and reconstruct your docs. Of course, you will need to map your queries the same way.

Is that what you wanted?
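
To make the dictionary idea concrete, here is a minimal sketch in plain Java, deliberately kept independent of the Lucene API. The class name TermDictionary and its methods are hypothetical, invented for illustration only. It assigns an id to every term it sees (it does not implement the "only the most frequent" cutoff) and writes each id as a variable-length int, 7 data bits per byte with the high bit as a continuation flag, which is the same general scheme as Lucene's VInt.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: maps terms to small integer ids and back again. */
final class TermDictionary {
    private final Map<String, Integer> idByTerm = new HashMap<>();
    private final List<String> termById = new ArrayList<>();

    /** Returns the id of a term, assigning the next free id on first sight. */
    int idOf(String term) {
        Integer id = idByTerm.get(term);
        if (id == null) {
            id = termById.size();
            idByTerm.put(term, id);
            termById.add(term);
        }
        return id;
    }

    /** Reverse lookup: id back to the original term. */
    String termOf(int id) {
        return termById.get(id);
    }

    /** Replaces each token by its id and writes the ids as variable-length ints. */
    byte[] encode(List<String> tokens) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String token : tokens) {
            int id = idOf(token);
            while ((id & ~0x7F) != 0) {          // more than 7 bits left
                out.write((id & 0x7F) | 0x80);   // low 7 bits + continuation flag
                id >>>= 7;
            }
            out.write(id);                       // last byte, flag cleared
        }
        return out.toByteArray();
    }

    /** Reads the variable-length ids back and maps them to the original tokens. */
    List<String> decode(byte[] bytes) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < bytes.length) {
            int id = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                id |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            tokens.add(termOf(id));
        }
        return tokens;
    }

    public static void main(String[] args) {
        TermDictionary dict = new TermDictionary();
        List<String> field = Arrays.asList("red", "green", "red", "blue");
        byte[] stored = dict.encode(field);
        System.out.println("bytes stored: " + stored.length);
        System.out.println("decoded: " + dict.decode(stored));
    }
}
```

In a real setup the dictionary itself would have to be persisted next to the index, since without it the stored ids cannot be turned back into text, and query terms would need to go through the same idOf mapping before searching.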