1- Has anyone out there already written an extension to
Lucene so that the 'stored' string for each document/field is in fact stored in
a centralized repository? Meaning, only an 'index' is actually stored in the
document and the real data is put somewhere else.

2- If not, how hard would it be to write such an extension? Which
classes would need to be modified? FSDirectory? Document?

3- Any ideas on how else I could do this? I'm fully open to
discussion!

If I understand what you need, it's easy: you need some sort of simple
dictionary compression. Write your own Analyzer that is constructed with a
HashMap<YOUR_STRING, Integer> and have it replace tokens with Integers (you
could use VInts later to save some more space).
Fill this HashMap with the unique terms from your field; if there are too
many of them, encode only the most frequent ones.
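
A rough sketch of how such a dictionary could be filled, using plain Java collections only. The class name TermDictionaryBuilder and the maxEntries parameter are made up for illustration and are not part of Lucene. Giving the smallest ids to the most frequent terms is my assumption; it pays off if the ids are later written as VInts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Lucene class.
class TermDictionaryBuilder {

    /** Assigns ids 0..n-1 to the (at most) maxEntries most frequent tokens. */
    static Map<String, Integer> build(List<String> tokens, int maxEntries) {
        // Count how often each token occurs.
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }

        // Most frequent first; ties are broken alphabetically so ids are deterministic.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        // Keep only the top maxEntries tokens; the rest stay out of the dictionary.
        Map<String, Integer> dictionary = new HashMap<>();
        for (int i = 0; i < entries.size() && i < maxEntries; i++) {
            dictionary.put(entries.get(i).getKey(), i);
        }
        return dictionary;
    }
}
```

Tokens that do not make it into the dictionary would have to be kept as plain strings or handled through some escape mechanism; the sketch leaves that open.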

You have these transformations in this case:
it starts with Document.Field == "Array of Strings"
-> put it through your analyzer
-> you get Document.Field == "Array of Integers" (presumably more space
efficient for your case?)
-> store the Ints as VInts to save a few more bits
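
The chain above could look roughly like this outside of Lucene. This is only a sketch under assumptions: VIntTokenEncoder is a made-up name, the dictionary is passed in as a plain Map, and the byte layout (7 payload bits per byte, low-order bits first, high bit set when more bytes follow) is the usual variable-length int scheme, written by hand here rather than through Lucene's own output classes.

```java
import java.io.ByteArrayOutputStream;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Lucene class.
class VIntTokenEncoder {

    /** Writes a non-negative int using 7 bits per byte, low-order bits first. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.write(value); // last byte, high bit clear
    }

    /** "Array of Strings" -> "Array of Integers" -> VInt bytes. */
    static byte[] encode(List<String> tokens, Map<String, Integer> dictionary) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String token : tokens) {
            Integer id = dictionary.get(token);
            if (id == null) {
                // Assumption: every token is in the dictionary. A real version
                // would need a fallback for rare, unencoded tokens.
                throw new IllegalArgumentException("token not in dictionary: " + token);
            }
            writeVInt(out, id);
        }
        return out.toByteArray();
    }
}
```

Ids below 128 take a single byte this way, which is why it helps to give the small ids to the most frequent tokens.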

Later you will need an array or HashMap to map the ints back to tokens in
order to reconstruct your docs.
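
For the way back, something along these lines might do. Again just a sketch: VIntTokenDecoder is a made-up name, the lookup table is a plain String[] indexed by id, and the input is assumed to be a well-formed sequence of variable-length ints (7 bits per byte, low-order bits first, high bit marking a continuation).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Lucene class.
class VIntTokenDecoder {

    /** Reads variable-length ints from data and looks each one up in idToToken. */
    static List<String> decode(byte[] data, String[] idToToken) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < data.length) {
            int value = 0;
            int shift = 0;
            byte b;
            do {
                b = data[pos++];
                value |= (b & 0x7F) << shift; // add the next 7 payload bits
                shift += 7;
            } while ((b & 0x80) != 0);        // high bit set: more bytes follow
            tokens.add(idToToken[value]);
        }
        return tokens;
    }
}
```

An array works when the ids are dense (0..n-1); with sparse ids a HashMap<Integer, String> would be the more natural choice.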

Of course, you will need to map your Queries the same way.
Is that what you wanted?




