Thank you.
I suppose the solution for this is to not create an index but to store
co-occurrence frequencies at the Analyzer level.
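(Not part of the original mail.) A minimal sketch of what counting co-occurrence at analysis time could look like, stripped of Lucene entirely — the class and method names here are made up for illustration, and the caller is assumed to supply each document's token list:

```java
import java.util.*;

public class CooccurrenceCounter {
    // Maps an unordered word pair "a\tb" (with a <= b) to the number
    // of documents in which both words occur.
    private final Map<String, Integer> pairDocFreq = new HashMap<>();

    // Call once per document with its token stream; duplicates are
    // collapsed so each pair is counted at most once per document.
    public void addDocument(List<String> tokens) {
        List<String> unique = new ArrayList<>(new TreeSet<>(tokens));
        for (int i = 0; i < unique.size(); i++) {
            for (int j = i + 1; j < unique.size(); j++) {
                String key = unique.get(i) + "\t" + unique.get(j);
                pairDocFreq.merge(key, 1, Integer::sum);
            }
        }
    }

    // Number of documents containing both a and b.
    public int count(String a, String b) {
        String key = a.compareTo(b) <= 0 ? a + "\t" + b : b + "\t" + a;
        return pairDocFreq.getOrDefault(key, 0);
    }
}
```

Note the quadratic loop per document: fine for short texts, but for full Wikipedia articles the pair space explodes, which is presumably why restricting the vocabulary (e.g. dropping rare terms) would matter here.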
Adrian.
On Mon, Mar 16, 2009 at 11:37 AM, Michael McCandless
luc...@mikemccandless.com wrote:
Be careful: docFreq does not take deletions into account.
Adrian Dimulescu wrote:
Thank you.
I suppose the solution for this is to not create an index but to store
co-occurrence frequencies at the Analyzer level.
I don't understand how this would address the fact that docFreq does
not reflect deletions.
You can use the shingles analyzer (under
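(Editorial aside, not from the thread.) What a shingle filter produces is just token n-grams; a Lucene-free illustration of 2-shingles, with made-up names:

```java
import java.util.*;

public class Shingles {
    // Produces word bigrams ("2-shingles") the way a shingle filter
    // would, joining each pair of adjacent tokens with a space.
    public static List<String> bigrams(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return out;
    }
}
```

Indexing such shingles as single terms would let docFreq answer the pair-count question directly, at the cost of a much larger term dictionary.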
This is all getting very complicated!
Adrian - have you looked any further into why your original two term
query was too slow? My experience is that simple queries are usually
extremely fast. Standard questions: have you warmed up the searcher?
How large is the index? How many occurrences of
Michael McCandless wrote:
I don't understand how this would address the fact that docFreq does
not reflect deletions.
Bad mail-quoting, sorry. I am not interested in document deletions; I
just index Wikipedia once, and want to get a co-occurrence-based
similarity distance between words called NGD
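(Editorial note.) For reference, the Normalized Google Distance as defined by Cilibrasi and Vitányi is usually written:

```latex
\[
\mathrm{NGD}(x, y) =
  \frac{\max\{\log f(x),\, \log f(y)\} - \log f(x, y)}
       {\log N - \min\{\log f(x),\, \log f(y)\}}
\]
```

where f(x) is the number of pages containing x, f(x, y) the number containing both x and y, and N the total number of pages indexed — which is why only the hit counts matter here, not the hits themselves.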
Ian Lea wrote:
Adrian - have you looked any further into why your original two term
query was too slow? My experience is that simple queries are usually
extremely fast.
Let me first point out that it is not too slow in absolute terms; it
is only slow for my particular need of attempting the
OK - thanks for the explanation. So this is not just a simple search ...
I'll go away and leave you and Michael and the other experts to talk
about clever solutions.
--
Ian.
On Tue, Mar 17, 2009 at 11:35 AM, Adrian Dimulescu
adrian.dimule...@gmail.com wrote:
Ian Lea wrote:
Adrian - have
Is this a one-time computation? If so, couldn't you wait a long time
for the machine to simply finish it?
With the simple approach (doing 100 million 2-term AND queries), how
long do you estimate it'd take?
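(Editorial aside.) Back of the envelope, assuming each 2-term AND query takes somewhere between 1 ms and 10 ms:

```
10^8 queries x  1 ms = 10^5 s ≈ 28 hours
10^8 queries x 10 ms = 10^6 s ≈ 11.6 days
```

So the answer to "couldn't you just wait?" hinges almost entirely on the per-query constant.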
I think you could do this with your own analyzer (as you
suggested)... it would run
Michael McCandless wrote:
Is this a one-time computation? If so, couldn't you wait a long time
for the machine to simply finish it?
The final production computation is one-time; still, I have to come
back recurrently, correct some errors, then retry...
With the simple approach (doing 100
You may want to try Filters (starting from TermFilter) for this, especially
those based on the default OpenBitSet (see the intersection count method)
because of your interest in stop words.
10k OpenBitSets for 39 M docs will probably not fit in memory in one go,
but that can be worked around by
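(Editorial aside.) The intersection-count idea sketched with java.util.BitSet standing in for Lucene's OpenBitSet — in Lucene itself, OpenBitSet.intersectionCount(a, b) computes this directly without the defensive clone:

```java
import java.util.*;

public class PairCounts {
    // One bit set per term over the document space: bit d is set iff
    // the term occurs in document d. In Lucene these would come from
    // a TermFilter; here java.util.BitSet stands in for OpenBitSet.
    public static long intersectionCount(BitSet a, BitSet b) {
        BitSet tmp = (BitSet) a.clone();   // keep 'a' intact
        tmp.and(b);
        return tmp.cardinality();          // docs containing both terms
    }
}
```

The memory-pressure point above then becomes concrete: each bit set costs about 39 M bits ≈ 4.9 MB, so 10k of them is on the order of 49 GB — hence the need to process the terms in batches.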
: The final production computation is one-time, still, I have to recurrently
: come back and correct some errors, then retry...
this doesn't really seem like a problem ideally suited for Lucene ... this
seems like the type of problem sequential batch crunching could solve
better...
Hello,
I need the number of pages that contain two terms. Only the number of
hits; I don't care about retrieving the pages. Right now I am using the
following code in order to get it:
Term first, second;
TermQuery q1 = new TermQuery(first);
TermQuery q2 = new TermQuery(second);
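(Editorial aside; the original snippet is truncated in the archive.) For concreteness, the count the two-term AND query produces is just the size of a postings-list intersection. A Lucene-free sketch under that assumption — the Map stands in for the index and all names are made up:

```java
import java.util.*;

public class AndCount {
    // docsByTerm maps each term to the set of IDs of documents that
    // contain it (an in-memory stand-in for Lucene's postings lists).
    public static int hitCount(Map<String, Set<Integer>> docsByTerm,
                               String first, String second) {
        Set<Integer> a = docsByTerm.getOrDefault(first, Collections.emptySet());
        Set<Integer> b = docsByTerm.getOrDefault(second, Collections.emptySet());
        // Iterate over the smaller set, as a merge over sorted postings would.
        if (a.size() > b.size()) { Set<Integer> t = a; a = b; b = t; }
        int n = 0;
        for (Integer doc : a) {
            if (b.contains(doc)) n++;
        }
        return n;
    }
}
```

In Lucene terms this is what a BooleanQuery with two MUST TermQuery clauses reports as its total hit count; the sketch just makes explicit how cheap the per-pair work is once the postings are in memory.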
Adrian Dimulescu wrote:
Hello,
I need the number of pages that contain two terms. Only the number
of hits; I don't care about retrieving the pages. Right now I am
using the following code in order to get it:
Term first, second;
TermQuery q1 = new TermQuery(first);
TermQuery q2 = new