Hi,

I would like to index documents which contain term frequencies instead of
the actual text. For example, instead of getting "The big wolf ate the big
sheep" I would get "the|2 big|2 wolf|1 ate|1 sheep|1". An easy way to index
this would be to convert the frequencies back into text, so into something
like "the the big big wolf ate sheep", but it does not look that elegant
since I would be expanding the text, just to have Lucene "compress" it
again.

Any ideas? Or directions I should look into?

I am considering:
- Custom Analyzer (so that I expand the terms on the fly while generating the
TokenStream from the compressed text)
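
To make the custom Analyzer idea concrete, here is a rough sketch of just the
expansion step, written in plain Java with no Lucene classes. The class name
TermFrequencyExpander and the "term|freq" parsing rules are assumptions made
for illustration. In a real Analyzer the same loop would presumably sit inside
a Tokenizer's incrementToken(), emitting one token per repetition instead of
building the expanded string up front.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/**
 * Hypothetical helper (not part of Lucene): lazily expands whitespace-separated
 * "term|freq" entries into a stream of repeated terms.
 */
class TermFrequencyExpander implements Iterator<String> {
    private final String[] entries;   // raw "term|freq" entries
    private int index = 0;            // next entry to parse
    private String currentTerm = null;
    private int remaining = 0;        // repetitions left for currentTerm

    TermFrequencyExpander(String compressed) {
        String trimmed = compressed.trim();
        this.entries = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    @Override
    public boolean hasNext() {
        // Advance to the next entry that still has repetitions to emit.
        while (remaining == 0 && index < entries.length) {
            String entry = entries[index++];
            int sep = entry.lastIndexOf('|');
            if (sep < 0) {
                // Assumption: an entry without a '|' counts as frequency 1.
                currentTerm = entry;
                remaining = 1;
            } else {
                currentTerm = entry.substring(0, sep);
                // Assumption: non-positive frequencies are skipped.
                remaining = Math.max(0, Integer.parseInt(entry.substring(sep + 1)));
            }
        }
        return remaining > 0;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        remaining--;
        return currentTerm;
    }
}
```

Iterating over new TermFrequencyExpander("the|2 big|2 wolf|1 ate|1 sheep|1")
yields the terms one at a time, so the full expanded text never has to be held
in memory.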

Thanks in advance,

Stephen
