An example of why you might do this is if your input is a term vector (i.e.
a list of unique terms with weights) rather than a text in the usual
sense. It does seem that the best way forward in this case is to
generate a text with repeated terms. I looked at the alternative, and it
gets quite involved in low-level Lucene code.
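
For what it's worth, the expansion step itself is small. Here is a minimal sketch in plain Java (no Lucene dependency), assuming the "term|freq" format from Stephen's example with whitespace between entries. The class and method names (TermVectorExpander, expand) are made up for illustration and are not part of any Lucene API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns "the|2 big|2 wolf|1" into "the the big big wolf".
final class TermVectorExpander {

    // Expands whitespace-separated "term|freq" entries into repeated terms.
    // Assumption: freq is a non-negative integer after the last '|'.
    static String expand(String termVector) {
        List<String> out = new ArrayList<>();
        for (String entry : termVector.trim().split("\\s+")) {
            if (entry.isEmpty()) {
                continue; // empty input yields no terms
            }
            int bar = entry.lastIndexOf('|');
            if (bar < 0) {
                out.add(entry); // no frequency given: treat it as 1
                continue;
            }
            String term = entry.substring(0, bar);
            int freq = Integer.parseInt(entry.substring(bar + 1));
            for (int i = 0; i < freq; i++) {
                out.add(term);
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        // Prints: the the big big wolf ate sheep
        System.out.println(expand("the|2 big|2 wolf|1 ate|1 sheep|1"));
    }
}
```

The expanded string can then be indexed as an ordinary text field. Note that the original word order is gone, so term positions in the index are artificial and phrase or proximity queries against that field will not be meaningful.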
-Mike
On 02/11/2015 08:01 AM, Erick Erickson wrote:
You could consider payloads but why do you want to do this?
What's the use case here? Sounds a little like an XY problem, you're
asking us how to do something without explaining the why; there
may be other ways to accomplish your task.
For instance, there's the "termfreq" function, which can be returned
as a field in the doc, see:
https://cwiki.apache.org/confluence/display/solr/Function+Queries
Best,
Erick
On Wed, Feb 11, 2015 at 4:54 AM, Stephen Fenech <luvsc...@gmail.com> wrote:
Hi,
I would like to index documents which contain term frequencies instead of
the actual text. For example, instead of getting "The big wolf ate the big
sheep" I would get "the|2 big|2 wolf|1 ate|1 sheep|1". An easy way to index
this would be to convert the frequencies back into text, so into something
like "the the big big wolf ate sheep", but it does not look that elegant
since I would be expanding the text, just to have Lucene "compress" it
again.
Any ideas? Or directions I should look into?
I am considering:
- Custom Analyzer (so I expand the terms while generating the TokenStream
from the compressed text)
Thanks in Advance,
Stephen
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org