Re: Indexing documents with pre-calculated term frequencies

Michael Sokolov Wed, 11 Feb 2015 08:19:10 -0800

An example of why you might do this is if your input is a term vector (i.e. a list of unique terms with weights) rather than a text in the usual sense. It does seem as if the best way forward in this case is to generate a text with repeated terms. I looked at the alternative, and it is quite involved in low-level Lucene code.


-Mike


On 02/11/2015 08:01 AM, Erick Erickson wrote:

You could consider payloads but why do you want to do this?
What's the use case here? Sounds a little like an XY problem, you're
asking us how to do something without explaining the why; there
may be other ways to accomplish your task.

For instance, there's the "termfreq" function, which can be returned
as a field in the doc, see:
https://cwiki.apache.org/confluence/display/solr/Function+Queries

Best,
Erick

On Wed, Feb 11, 2015 at 4:54 AM, Stephen Fenech <luvsc...@gmail.com> wrote:

Hi,

I would like to index documents which contain term frequencies instead of
the actual text. For example, instead of getting "The big wolf ate the big
sheep" I would get "the|2 big|2 wolf|1 ate|1 sheep|1". An easy way to index
this would be to convert the frequencies back into text, so into something
like "the the big big wolf ate sheep", but it does not look that elegant
since I would be expanding the text, just to have Lucene "compress" it
again.
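
The expansion described above could be sketched roughly as follows. This is only an illustration in plain Java with no Lucene dependency; the class name TermFreqExpander and the method expand are hypothetical, and the "term|freq" format separated by whitespace is assumed from the example in this message.

```java
import java.util.StringJoiner;

// Hypothetical helper, not part of Lucene: turns "term|freq" pairs
// back into a text in which each term is repeated freq times.
class TermFreqExpander {

    static String expand(String encoded) {
        StringJoiner out = new StringJoiner(" ");
        for (String pair : encoded.trim().split("\\s+")) {
            if (pair.isEmpty()) {
                continue; // empty input yields a single empty element
            }
            // Split on the last '|' so that terms containing '|' still parse.
            int bar = pair.lastIndexOf('|');
            if (bar <= 0) {
                throw new IllegalArgumentException("Expected term|freq, got: " + pair);
            }
            String term = pair.substring(0, bar);
            int freq = Integer.parseInt(pair.substring(bar + 1));
            if (freq < 0) {
                throw new IllegalArgumentException("Negative frequency in: " + pair);
            }
            for (int i = 0; i < freq; i++) {
                out.add(term);
            }
        }
        return out.toString();
    }
}
```

Calling TermFreqExpander.expand("the|2 big|2 wolf|1 ate|1 sheep|1") returns "the the big big wolf ate sheep", which could then be indexed as an ordinary text field. Note that the original word order and positions are not recoverable from the frequencies.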

Any ideas? Or directions I should look into?

I am considering:
- Custom Analyzer (so I expand on the fly while generating the TokenStream from
the compressed text)
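
A minimal sketch of the on-the-fly idea, deliberately written without any Lucene classes: an iterator that emits each term as many times as its frequency, without ever building the expanded string. The class name RepeatingTermIterator is hypothetical. In an actual custom analyzer, the same counter logic would presumably live inside a TokenStream's incrementToken() method, which is not shown here.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical, library-independent sketch: lazily yields each term
// freq times from whitespace-separated "term|freq" input.
class RepeatingTermIterator implements Iterator<String> {
    private final String[] pairs;
    private int index = 0;      // next pair to parse
    private String term = null; // term currently being repeated
    private int remaining = 0;  // repetitions left for the current term

    RepeatingTermIterator(String encoded) {
        String trimmed = encoded.trim();
        this.pairs = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    @Override
    public boolean hasNext() {
        // Advance to the next pair with a non-zero frequency, if needed.
        while (remaining == 0 && index < pairs.length) {
            String pair = pairs[index++];
            int bar = pair.lastIndexOf('|');
            if (bar <= 0) {
                throw new IllegalArgumentException("Expected term|freq, got: " + pair);
            }
            int freq = Integer.parseInt(pair.substring(bar + 1));
            if (freq < 0) {
                throw new IllegalArgumentException("Negative frequency in: " + pair);
            }
            term = pair.substring(0, bar);
            remaining = freq;
        }
        return remaining > 0;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        remaining--;
        return term;
    }
}
```

Iterating over new RepeatingTermIterator("the|2 big|2 wolf|1 ate|1 sheep|1") yields the, the, big, big, wolf, ate, sheep in that order, keeping only one term and one counter in memory at a time.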

Thanks in Advance,

Stephen

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


