[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226494#comment-13226494 ]

Doron Cohen commented on LUCENE-3821:
-

{quote}
Not understanding really how SloppyPhraseScorer works now, but not trying to add confusion to the issue, I can't help but think this problem is a variant on LevenshteinAutomata... in fact that was the motivation for the new test, I just stole the testing methodology from there and applied it to this!
{quote}

Interesting! I was not aware of this. I went and read about this automaton; it is relevant.

{quote}
It seems many things are the same but with a few twists:
* fundamentally we are interleaving the streams from the subscorers into the 'index automaton'; the 'query automaton' is produced from the user-supplied terms
{quote}

True. In fact, the current code works hard to decide on the correct interleaving order - while if we had a perfect Levenshtein automaton that took care of the computation, we could just interleave in term-position order (forget about phrase position and all that) and let the automaton compute the distance. This might capture the difficulty in making the sloppy phrase scorer correct: it started with an algorithm that was optimized for exact matching, and adapted (hacked?) it for approximate matching. Instead, starting with a model that fits approximate matching might be easier and cleaner. I like that.

{quote}
* our 'alphabet' is the terms, and holes from position increment are just an additional symbol.
* just like the LevenshteinAutomata case, repeats are problematic because they have different characteristic vectors
* stacked terms at the same position (index or query) just make the automata more complex (so they aren't just strings)

I'm not suggesting we try to re-use any of that code at all, I don't think it will work.
But I wonder if we can re-use even some of the math to redefine the problem more formally, to figure out what minimal state/lookahead we need, for example...
{quote}

I agree, I'll think about this. In the meantime I'll commit this partial fix.

SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
---

Key: LUCENE-3821
URL: https://issues.apache.org/jira/browse/LUCENE-3821
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 3.5, 4.0
Reporter: Naomi Dushay
Assignee: Doron Cohen
Attachments: LUCENE-3821-SloppyDecays.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml

The general bug is a case where a phrase with no slop is found, but if you add slop it's not. I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case; jenkins just hasn't had enough time to chew on it. ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226623#comment-13226623 ]

Doron Cohen commented on LUCENE-3821:
-

Committed:
- r1299077 3x
- r1299112 trunk

bq. I would be glad to try out a nightly build with the patch as is against our tests - I would be glad to get the 80% solution if it fixes my bug.

It's in now...

bq. But I wonder if we can re-use even some of the math to redefine the problem more formally to figure out what minimal state/lookahead we need for example...

Robert, this gave me an idea... currently, in case of collision between repeaters, we compare them and advance the lesser of the two (SloppyPhraseScorer.lesser(PhrasePositions, PhrasePositions)). It should be fairly easy to add lookahead to this logic: if one of the two is multi-term, lesser can also do a lookahead. The amount of lookahead can depend on the slop. I'll give it a try before closing this issue.
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223077#comment-13223077 ]

Doron Cohen commented on LUCENE-3821:
-

Thanks Robert, okay, I'll continue with option 2 then. In addition, we should perhaps try harder for a sloppy version of the current ExactPhraseScorer, for both performance and correctness reasons.

In ExactPhraseScorer, count[posIndex] is incremented by 1 each time the conditions for a match (still) hold. A sloppy version of this, with N terms and slop=S, could increment differently:

{noformat}
1 + N*S      at posIndex
1 + N*S - 1  at posIndex-1 and posIndex+1
1 + N*S - 2  at posIndex-2 and posIndex+2
...
1 + N*S - S  at posIndex-S and posIndex+S
{noformat}

For S=0 this falls back to incrementing by 1 and only at posIndex, same as ExactPhraseScorer, which makes sense.

Also, the success criterion in ExactPhraseScorer, when checking term k, is, before adding up 1 for term k:
* count[posIndex] == k-1

Or, after adding up 1 for term k:
* count[posIndex] == k

In the sloppy version the criterion (after adding up term k) would be:
* count[posIndex] == k*(1+N*S) - S

Again, for S=0 this falls back to the ExactPhraseScorer logic:
* count[posIndex] == k

Mike (and all), correctness-wise, what do you think? If you are wondering why the increment at the actual position is (1 + N*S): it allows matching only posIndexes where all terms contributed something. I drew an example with 5 terms and slop=2 and so far it seems correct. I also tried 2 terms and slop=5; this seems correct as well, except that when there is a match, several posIndexes will contribute to the score of the same match. I think this is not too bad, as for these parameters the behavior would be the same in all documents. I would be especially forgiving of this if it gets us some of the performance benefits of ExactPhraseScorer.
If we agree on correctness, we need to figure out how to implement it, and consider the performance effect. The tricky part is the increment at posIndex-n. Say there are 3 terms in the query and one of the terms is found at indexes 10, 15, 18, and assume the slop is 2. Since N=3, the max increment is 1 + N*S = 1 + 3*2 = 7. So the increments for this term would be (pos, incr):

{noformat}
Pos   Increment
---   ---------
 8    5
 9    6
10    7
11    6
12    5
13    5
14    6
15    7 = max(7,5)
16    6 = max(6,5)
17    6 = max(5,6)
18    7
19    6
20    5
{noformat}

So when we get to posIndex 17, we know that posIndex 15 contributes 5, but we do not yet know about the contribution of posIndex 18, which is 6 and should be used instead of 5. So some look-ahead (or some fix-back) is required, which will complicate the code. If this seems promising, we should probably open a new issue for it; I just wanted to get some feedback first.
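The decayed-increment scheme above can be sketched as follows (an illustrative snippet, not actual scorer code; the method name and shape are made up for this sketch, and contributions beyond distance S are taken as zero):

```java
// Illustrative sketch of the decayed increments described above: with N query
// terms and slop S, an occurrence at position o contributes 1 + N*S - d at
// every position at distance d <= S from o; overlapping contributions take
// the maximum. With N=3, S=2 and occurrences at 10, 15, 18 this reproduces
// the numbers in the table above.
public class DecayedIncrement {
    static int increment(int pos, int[] occurrences, int n, int slop) {
        int best = 0;
        for (int o : occurrences) {
            int d = Math.abs(pos - o);
            if (d <= slop) {
                best = Math.max(best, 1 + n * slop - d);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] occ = {10, 15, 18};
        for (int pos = 8; pos <= 20; pos++) {
            System.out.println(pos + " , " + increment(pos, occ, 3, 2));
        }
    }
}
```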
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223554#comment-13223554 ]

Doron Cohen commented on LUCENE-3821:
-

OK, great! If you did not point out a problem with this up front, there's a good chance it will work, and I'd like to give it a try. I have a first patch - not working or anything - that opens ExactPhraseScorer a bit for extension and adds a class (temporary name) NonExactPhraseScorer. The idea is to hide in the ChunkState the details of decaying frequencies due to slop. I will try it over the weekend. If we can make it work this way, I'd rather do it in this issue than commit the other new code for the fix and then replace it. If that doesn't go quickly, I'll commit the (other) changes to SloppyPhraseScorer and start a new issue.
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223700#comment-13223700 ]

Doron Cohen commented on LUCENE-3821:
-

I'm afraid it won't solve the problem. The complexity of SloppyPhraseScorer stems firstly from the slop; that part has been handled in the scorer for a long time. Two additional complications are repeating terms, and multi-term phrases. Each of these, separately, is handled as well. Their combination, however, is the cause of this discussion.

To prevent two repeating terms from landing on the same document position, we propagate the smaller of them (smaller in its phrase position, which takes into account both the doc position and the offset of that term in the query). Without this special treatment, a phrase query "a b a"~2 might match a document "a b", because both a's (query terms) would land on the same document's a. This is illegal and is prevented by such propagation. But when one of the repeating terms is a multi-term, it is not possible to know which of the repeating terms to propagate. This is the unsolved bug.

Now, back to the current ExactPhraseScorer. It does not have this problem with repeating terms - but not because of the different algorithm; rather because of the different scenario: exact phrase scoring simply does not have this problem. In exact phrase scoring, a match is declared only when all PPs are in the same phrase position. Recalling that phrase position = doc-position - query-offset, it is evident that when two PPs with different query offsets are in the same phrase position, their doc-positions cannot be the same, and therefore no special handling is needed for repeating terms in exact phrase scorers. However, once we add that sloppy-decaying frequency, we will match, at a certain posIndex, different phrase positions - this is because of the slop. So they might land on the same doc-position, and then we start again...
This is really too bad. Sorry for the lengthy post; hopefully this will help when someone wants to get into this. Back to option 2.
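The phrase-position arithmetic behind the repeating-terms problem can be shown with a tiny hand computation (an illustrative sketch only; the real scorer's matching logic is far more involved):

```java
// Why "a b a"~2 must not match the document "a b": the query offsets of the
// two 'a' terms are 0 and 2, and the document's only 'a' is at position 0.
// Without repeat handling, both query 'a's can land on that same doc position.
public class RepeatCollisionDemo {
    public static void main(String[] args) {
        int docPosOfA = 0;                 // the single 'a' in document "a b"
        int pp1 = docPosOfA - 0;           // phrase position of first query 'a'
        int pp2 = docPosOfA - 2;           // phrase position of second query 'a'
        int spread = Math.abs(pp1 - pp2);  // 2, i.e. within slop 2
        // A sloppy match would be declared even though the document contains
        // only one 'a' - hence repeating PPs must be propagated apart.
        System.out.println("spread=" + spread + " within slop: " + (spread <= 2));
    }
}
```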
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221840#comment-13221840 ]

Doron Cohen commented on LUCENE-3821:
-

The remaining failure still happens with the updated patch (same seed), and still seems to me like an ExactPhraseScorer bug. However, it is probably not a simple one, because when added to TestMultiPhraseQuery it passes - that is, no documents are matched, as expected - although this supposedly creates the exact scenario that failed above. Perhaps ExactPhraseScorer behavior too is influenced by other docs processed so far.

{code:title=Add this to TestMultiPhraseQuery}
public void test_LUCENE_XYZ() throws Exception {
  Directory indexStore = newDirectory();
  RandomIndexWriter writer = new RandomIndexWriter(random, indexStore);
  add("s o b h j t j z o", "LUCENE-XYZ", writer);
  IndexReader reader = writer.getReader();
  IndexSearcher searcher = newSearcher(reader);
  MultiPhraseQuery q = new MultiPhraseQuery();
  q.add(new Term[] {new Term("body", "j"), new Term("body", "o"), new Term("body", "s")});
  q.add(new Term[] {new Term("body", "i"), new Term("body", "b"), new Term("body", "j")});
  q.add(new Term[] {new Term("body", "t"), new Term("body", "d")});
  assertEquals("Wrong number of hits", 0, searcher.search(q, null, 1).totalHits);
  // just make sure no exc:
  searcher.explain(q, 0);
  writer.close();
  searcher.close();
  reader.close();
  indexStore.close();
}
{code}
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221867#comment-13221867 ]

Doron Cohen commented on LUCENE-3821:
-

Update: apparently MultiPhraseQuery.toString() does not print its holes. So the query that failed was not:

{noformat}field:(j o s) (i b j) (t d){noformat}

But rather:

{noformat}(j o s) ? (i b j) ? ? (t d){noformat}

Which is a different story: this query should match the document

{noformat}s o b h j t j z o{noformat}

There is a match for ExactPhraseScorer, but not for Sloppy with slop 1. So there is still work to do on SloppyPhraseScorer... (I'll fix MultiPhraseQuery.toString() as well.)
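A toString that does print holes could look roughly like this (an illustrative sketch; the method name and the parallel term-lists/positions representation are assumptions for this example, not MultiPhraseQuery's actual internals):

```java
// Sketch: render a multi-phrase query, emitting "?" for each position gap
// between consecutive term lists, as in "(j o s) ? (i b j) ? ? (t d)".
public class PhraseToStringDemo {
    static String phraseToString(String[][] termLists, int[] positions) {
        StringBuilder sb = new StringBuilder();
        int expected = 0;
        for (int i = 0; i < termLists.length; i++) {
            // one "?" per hole before this term list's query position
            for (; expected < positions[i]; expected++) {
                sb.append("? ");
            }
            sb.append('(').append(String.join(" ", termLists[i])).append(") ");
            expected = positions[i] + 1;
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String[][] lists = {{"j", "o", "s"}, {"i", "b", "j"}, {"t", "d"}};
        int[] positions = {0, 2, 5};
        System.out.println(phraseToString(lists, positions));
        // prints: (j o s) ? (i b j) ? ? (t d)
    }
}
```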
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221879#comment-13221879 ]

Doron Cohen commented on LUCENE-3821:
-

I think I understand the cause. In the current implementation there is an assumption that once we land on the first candidate document, it is possible to check whether there are repeating PPs by just comparing the in-doc positions of the terms. Tricky as it is, while this is true for plain PhrasePositions, it is not true for multi-term PhrasePositions (MPPs). Assume two MPPs, (a m n) and (b x y), and a first candidate document that starts with "a b". The in-doc positions of the two PPs would be 0 and 1 respectively (for 'a' and 'b'), and we would not even detect the fact that there are repetitions, let alone put them in the same group.

MPPs conflict with the current patch in an additional manner. It is now assumed that each repetition can be assigned a repetition group. So assume these PPs (and query positions): 0:a 1:b 3:a 4:b 7:c. There are clearly two repetition groups, {0:a, 3:a} and {1:b, 4:b}, while 7:c is not a repetition. But assume these PPs (and query positions): 0:(a b) 1:(b x) 3:a 4:b 7:(c x). We end up with a single large repetition group: {0:(a b), 1:(b x), 3:a, 4:b, 7:(c x)}.

I think if the groups are created correctly at the first candidate document, the scorer logic will still work, as a collision is declared only when two PPs are in the same in-doc position. The only impact of MPPs would be a performance cost: since repetition groups are larger, it would take longer to check if there are repetitions. We just need to figure out how to detect repetition groups without relying on in-(first-)doc positions.
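The grouping just described - positions whose term sets transitively intersect fall into one repetition group - can be computed without any reference to in-doc positions. A sketch under assumed data shapes (the class, method, and representation are made up for illustration, not the actual SloppyPhraseScorer structures):

```java
import java.util.*;

public class RepetitionGroups {
    // Returns groups of indexes whose term sets transitively intersect;
    // singletons (non-repeating terms) are dropped, since they need no
    // special repeat handling.
    static List<List<Integer>> groups(List<Set<String>> termSets) {
        int n = termSets.size();
        int[] g = new int[n];
        for (int i = 0; i < n; i++) g[i] = i;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (!Collections.disjoint(termSets.get(i), termSets.get(j))) {
                    // merge j's group into i's group
                    int from = g[j], to = g[i];
                    for (int k = 0; k < n; k++) if (g[k] == from) g[k] = to;
                }
            }
        }
        Map<Integer, List<Integer>> byGroup = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            byGroup.computeIfAbsent(g[i], x -> new ArrayList<>()).add(i);
        }
        List<List<Integer>> result = new ArrayList<>();
        for (List<Integer> grp : byGroup.values()) {
            if (grp.size() > 1) result.add(grp);
        }
        return result;
    }

    public static void main(String[] args) {
        // a b a b c  ->  two groups; 'c' is not a repetition
        List<Set<String>> plain = List.of(Set.of("a"), Set.of("b"),
            Set.of("a"), Set.of("b"), Set.of("c"));
        System.out.println(groups(plain));   // [[0, 2], [1, 3]]
        // (a b) (b x) a b (c x)  ->  one large group, as in the comment above
        List<Set<String>> multi = List.of(Set.of("a", "b"), Set.of("b", "x"),
            Set.of("a"), Set.of("b"), Set.of("c", "x"));
        System.out.println(groups(multi));   // [[0, 1, 2, 3, 4]]
    }
}
```

As the comment notes, with multi-term positions the groups get larger, which costs time but keeps the collision logic itself unchanged.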
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221737#comment-13221737 ]

Doron Cohen commented on LUCENE-3821:
-

I understand the problem. It all has to do - as Robert mentioned - with the repeating terms in the phrase query. I am working on a patch - it will change the way that repeats are handled. Repeating PPs require additional computation, and the current SloppyPhraseScorer attempts to do that additional work efficiently, but over-simplifies and fails to cover all cases. At the core of things, each time a repeating PP is selected (from the queue) and propagated, *all* its sibling repeaters are propagated as well, to prevent a case where two repeating PPs point to the same document position (which was the bug that originally triggered handling repeats in this code). But this is wrong: propagating all sibling repeaters misses some cases. Also, the code is hard to read, as Mike noted in LUCENE-2410 ([this comment|https://issues.apache.org/jira/browse/LUCENE-2410?focusedCommentId=12879443&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12879443]). So this is a chance to also make the code more maintainable. I have a working version which is not ready to commit yet; all the tests pass, except for one which I think is a bug in ExactPhraseScorer, but maybe I am missing something.
The case that fails is this:

{noformat}
AssertionError: Missing in super-set:
doc 706
q1: field:"(j o s) (i b j) (t d)"
q2: field:"(j o s) (i b j) (t d)"~1
td1: [doc=706 score=7.7783184 shardIndex=-1, doc=175 score=6.222655 shardIndex=-1]
td2: [doc=523 score=5.5001016 shardIndex=-1, doc=957 score=5.5001016 shardIndex=-1, doc=228 score=4.400081 shardIndex=-1, doc=357 score=4.400081 shardIndex=-1, doc=390 score=4.400081 shardIndex=-1, doc=503 score=4.400081 shardIndex=-1, doc=602 score=4.400081 shardIndex=-1, doc=757 score=4.400081 shardIndex=-1, doc=758 score=4.400081 shardIndex=-1]
doc 706: Document<stored,indexed,tokenized<field:s o b h j t j z o>>
{noformat}

It seems that q1 too should not match this document?
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219156#comment-13219156 ]

Doron Cohen commented on LUCENE-3821:
-

Fails here too, like this:

ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtestmethod=testRandomIncreasingSloppiness -Dtests.seed=-171bbb992c697625:203709d611c854a5:1ca48cb6d33b3f74 -Dargs="-Dfile.encoding=UTF-8"

I'll look into it.
[jira] [Commented] (LUCENE-3746) suggest.fst.Sort.BufferSize should not automatically fail just because of freeMemory()
[ https://issues.apache.org/jira/browse/LUCENE-3746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201073#comment-13201073 ]

Doron Cohen commented on LUCENE-3746:
-

Thanks Dawid!

{quote}
it's probably a system daemon thread for sending memory threshold notifications
{quote}

Yes, this makes sense. Still, the difference between the two JDKs was bothersome. Some more digging, and now I think it is clear. Here are the stack traces reported (at the end of the test) with Oracle:

{noformat}
1. Thread[ReaderThread,5,main]
2. Thread[main,5,main]
3. Thread[Reference Handler,10,system]
4. Thread[Signal Dispatcher,9,system]
5. Thread[Finalizer,8,system]
6. Thread[Attach Listener,5,system]
{noformat}

And with the IBM JDK:

{noformat}
1. Thread[Attach API wait loop,10,main]
2. Thread[Finalizer thread,5,system]
3. Thread[JIT Compilation Thread,10,system]
4. Thread[main,5,main]
5. Thread[Gc Slave Thread,5,system]
6. Thread[ReaderThread,5,main]
7. Thread[Signal Dispatcher,5,main]
8. Thread[MemoryPoolMXBean notification dispatcher,6,main]
{noformat}

The 8th thread is the one that started only after accessing the memory management layer. The thing is, in the IBM JDK that thread is created in the thread group "main", while in the Oracle JDK it is created under "system". To me the latter makes more sense.
To be more certain, I added a fake memory notification listener and checked the thread in which the notification happens:

{code}
MemoryMXBean mmxb = ManagementFactory.getMemoryMXBean();
NotificationListener listener = new NotificationListener() {
  @Override
  public void handleNotification(Notification notification, Object handback) {
    System.out.println(Thread.currentThread());
  }
};
((NotificationEmitter) mmxb).addNotificationListener(listener, null, null);
{code}

Evidently, in the IBM JDK the notification is in a "main"-group thread (also in line with the thread group in the original warning message which triggered this threads discussion):

{noformat}
Thread[MemoryPoolMXBean notification dispatcher,6,main]
{noformat}

While in the Oracle JDK the notification is in a "system"-group thread:

{noformat}
Thread[Low Memory Detector,9,system]
{noformat}

This also explains why the warning is reported only for the IBM JDK: the threads check in LTC only accounts for threads in the same thread group as the one running the specific test case. So when dispatching happens in a "system"-group thread, it is not sensed by that check at all. OK, now with the mystery solved I can commit the simpler code...

suggest.fst.Sort.BufferSize should not automatically fail just because of freeMemory()
---

Key: LUCENE-3746
URL: https://issues.apache.org/jira/browse/LUCENE-3746
Project: Lucene - Java
Issue Type: Bug
Components: modules/spellchecker
Reporter: Doron Cohen
Fix For: 3.6, 4.0
Attachments: LUCENE-3746.patch, LUCENE-3746.patch, LUCENE-3746.patch

Follow up on dev thread: [FSTCompletionTest failure "At least 0.5MB RAM buffer is needed" | http://markmail.org/message/d7ugfo5xof4h5jeh]
[jira] [Commented] (LUCENE-3746) suggest.fst.Sort.BufferSize should not automatically fail just because of freeMemory()
[ https://issues.apache.org/jira/browse/LUCENE-3746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199038#comment-13199038 ]

Doron Cohen commented on LUCENE-3746:
-

{quote}
[Dawid:|http://markmail.org/message/jobtemqm4u4vrxze] (maxMemory - totalMemory) because that's how much the heap can grow? The problem is none of this is atomic, so the result can be unpredictable. There are other methods in the management interface that permit somewhat more detailed checks. Don't know if they guarantee atomicity of the returned snapshot, but I doubt it.
- [MemoryMXBean.getHeapMemoryUsage()|http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/management/MemoryMXBean.html#getHeapMemoryUsage()]
- [MemoryPoolMXBean.getPeakUsage()|http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/management/MemoryPoolMXBean.html#getPeakUsage()]
{quote}

The current patch does not (yet) handle the atomicity issue Dawid described.
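The headroom estimate Dawid alludes to can be sketched as follows (an illustrative snippet; as the quote notes, the three reads are not atomic, so the value is only an estimate):

```java
// Estimate of memory still available to the JVM: the free part of the
// currently committed heap plus the amount the heap may still grow.
// Each Runtime call is a separate snapshot, so GC or allocation between
// the calls can skew the result - hence "estimate", not a guarantee.
public class HeapHeadroom {
    static long estimateHeadroomBytes() {
        Runtime rt = Runtime.getRuntime();
        long free = rt.freeMemory();     // unused part of committed heap
        long total = rt.totalMemory();   // currently committed heap
        long max = rt.maxMemory();       // heap ceiling (e.g. -Xmx)
        return free + (max - total);
    }

    public static void main(String[] args) {
        System.out.println(estimateHeadroomBytes() / (1024 * 1024) + " MB");
    }
}
```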
[jira] [Commented] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)
[ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196845#comment-13196845 ] Doron Cohen commented on LUCENE-1812: - While merging to trunk I noticed that Idea's settings for modules/queries and modules/queryparser refer to lucene/contrib instead of modules. Seems trivial to fix, but I have no Idea installed at the moment so no way to verify. Created LUCENE-3737 to handle that later. Static index pruning by in-document term frequency (Carmel pruning) --- Key: LUCENE-1812 URL: https://issues.apache.org/jira/browse/LUCENE-1812 Project: Lucene - Java Issue Type: New Feature Components: modules/other Reporter: Andrzej Bialecki Assignee: Doron Cohen Fix For: 3.6, 4.0 Attachments: pruning.patch, pruning.patch, pruning.patch, pruning.patch, pruning.patch, pruning.patch This module provides tools to produce a subset of input indexes by removing postings data for those terms whose in-document frequency is below a specified threshold. The net effect of this processing is a much smaller index that for common types of queries returns nearly identical top-N results as compared with the original index, but with increased performance. Optionally, stored values and term vectors can also be removed. This functionality is largely independent, so it can be used without term pruning (when the term freq. threshold is set to 1). As the threshold value increases, the total size of the index decreases, search performance increases, and recall decreases (i.e. search quality deteriorates). NOTE: phrase recall in particular deteriorates significantly at higher threshold values. The primary purpose of this class is to produce small first-tier indexes that fit completely in RAM, and to store these indexes using IndexWriter.addIndexes(IndexReader[]). Usually the performance of this class will not be sufficient to use the resulting index view for on-the-fly pruning and searching.
NOTE: If the input index is optimized (i.e. doesn't contain deletions) then the index produced via IndexWriter.addIndexes(IndexReader[]) will preserve internal document ids so that they stay in sync with the original index. This means that all other auxiliary information not necessary for first-tier processing, such as some stored fields, can also be removed, to be quickly retrieved on demand from the original index using the same internal document id. Threshold values can be specified globally (for terms in all fields) using the defaultThreshold parameter, and can be overridden using per-field or per-term values supplied in a thresholds map. Keys in this map are either field names, or terms in field:text format. The precedence of these values is the following: first a per-term threshold is used if present, then a per-field threshold if present, and finally the default threshold. A command-line tool (PruningTool) is provided for convenience. At the moment it doesn't support all functionality available through the API.
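The threshold precedence just described (per-term first, then per-field, then the default) can be sketched as follows. These are hypothetical class and method names for illustration only, not the actual pruning module's API:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the threshold lookup order: a per-term value ("field:text")
 *  wins over a per-field value ("field"), which wins over the default.
 *  Hypothetical names; the real pruning classes may differ. */
public class ThresholdResolver {
    private final int defaultThreshold;
    private final Map<String, Integer> thresholds; // keys: "field" or "field:text"

    public ThresholdResolver(int defaultThreshold, Map<String, Integer> thresholds) {
        this.defaultThreshold = defaultThreshold;
        this.thresholds = thresholds;
    }

    public int thresholdFor(String field, String text) {
        Integer t = thresholds.get(field + ":" + text); // per-term, if present
        if (t == null) t = thresholds.get(field);       // then per-field
        return t != null ? t : defaultThreshold;        // finally the default
    }

    public static void main(String[] args) {
        Map<String, Integer> m = new HashMap<>();
        m.put("body", 3);
        m.put("body:lucene", 5);
        ThresholdResolver r = new ThresholdResolver(1, m);
        System.out.println(r.thresholdFor("body", "lucene")); // per-term entry applies
    }
}
```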
[jira] [Commented] (LUCENE-3737) Idea modules settings - verify and fix
[ https://issues.apache.org/jira/browse/LUCENE-3737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196858#comment-13196858 ] Doron Cohen commented on LUCENE-3737: - In dev-tools/idea/.idea/ant.xml there are these two:
{code}
<buildFile url="file://$PROJECT_DIR$/lucene/contrib/queries/build.xml" />
<buildFile url="file://$PROJECT_DIR$/lucene/contrib/queryparser/build.xml" />
{code}
I assume this has the potential to break an Idea setup, but I haven't tried it yet - I just wanted not to forget about it, hence this issue. Is this a non-issue? Idea modules settings - verify and fix -- Key: LUCENE-3737 URL: https://issues.apache.org/jira/browse/LUCENE-3737 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Doron Cohen Assignee: Doron Cohen Priority: Trivial Idea's settings for modules/queries and modules/queryparser refer to lucene/contrib instead of modules.
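For reference, the corrected entries would presumably point at the modules tree instead, something like the following (the exact paths are an assumption on my part, not verified against the actual trunk layout):

```xml
<buildFile url="file://$PROJECT_DIR$/modules/queries/build.xml" />
<buildFile url="file://$PROJECT_DIR$/modules/queryparser/build.xml" />
```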
[jira] [Commented] (LUCENE-3737) Idea modules settings - verify and fix
[ https://issues.apache.org/jira/browse/LUCENE-3737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197014#comment-13197014 ] Doron Cohen commented on LUCENE-3737: - Yes, I only saw this on trunk. Thanks for taking care of it! Idea modules settings - verify and fix -- Key: LUCENE-3737 URL: https://issues.apache.org/jira/browse/LUCENE-3737 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Doron Cohen Assignee: Steven Rowe Priority: Trivial Fix For: 4.0 Idea's settings for modules/queries and modules/queryparser refer to lucene/contrib instead of modules.
[jira] [Commented] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)
[ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196339#comment-13196339 ] Doron Cohen commented on LUCENE-1812: - That dead code was removed and some javadocs were added. There is still room for more javadocs - e.g. for the static tool - and for better test coverage. Committed to 3x: r1237937. Static index pruning by in-document term frequency (Carmel pruning) --- Key: LUCENE-1812 URL: https://issues.apache.org/jira/browse/LUCENE-1812 Project: Lucene - Java Issue Type: New Feature Components: modules/other Reporter: Andrzej Bialecki Assignee: Doron Cohen Fix For: 3.6, 4.0 Attachments: pruning.patch, pruning.patch, pruning.patch, pruning.patch, pruning.patch, pruning.patch
[jira] [Commented] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)
[ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196429#comment-13196429 ] Doron Cohen commented on LUCENE-1812: - bq. Excellent, thanks for seeing this through! Yeah, with only a bit more than a year's delay ;) BTW in trunk it will be under modules. Static index pruning by in-document term frequency (Carmel pruning) --- Key: LUCENE-1812 URL: https://issues.apache.org/jira/browse/LUCENE-1812 Project: Lucene - Java Issue Type: New Feature Components: modules/other Reporter: Andrzej Bialecki Assignee: Doron Cohen Fix For: 3.6, 4.0 Attachments: pruning.patch, pruning.patch, pruning.patch, pruning.patch, pruning.patch, pruning.patch
[jira] [Commented] (LUCENE-3718) SamplingWrapperTest failure with certain test seed
[ https://issues.apache.org/jira/browse/LUCENE-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192031#comment-13192031 ] Doron Cohen commented on LUCENE-3718: - Well, this is not a test bug after all - it actually exposes a bug in Lucene40PostingsReader. SamplingWrapperTest failure with certain test seed -- Key: LUCENE-3718 URL: https://issues.apache.org/jira/browse/LUCENE-3718 Project: Lucene - Java Issue Type: Bug Components: modules/facet Reporter: Doron Cohen Assignee: Doron Cohen Fix For: 3.6, 4.0 Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/12231/ 1 tests failed. REGRESSION: org.apache.lucene.facet.search.SamplingWrapperTest.testCountUsingSamping Error Message: Results are not the same! Stack Trace: org.apache.lucene.facet.FacetTestBase$NotSameResultError: Results are not the same! at org.apache.lucene.facet.FacetTestBase.assertSameResults(FacetTestBase.java:333) at org.apache.lucene.facet.search.sampling.BaseSampleTestTopK.assertSampling(BaseSampleTestTopK.java:104) at org.apache.lucene.facet.search.sampling.BaseSampleTestTopK.testCountUsingSamping(BaseSampleTestTopK.java:82) at org.apache.lucene.util.LuceneTestCase$3$1.evaluate(LuceneTestCase.java:529) at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165) at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57) NOTE: reproduce with: ant test -Dtestcase=SamplingWrapperTest -Dtestmethod=testCountUsingSamping -Dtests.seed=4a5994491f79fc80:-18509d134c89c159:-34f6ecbb32e930f7 -Dtests.multiplier=3 -Dargs=-Dfile.encoding=UTF-8 NOTE: test params are: codec=Lucene40: {$facets=PostingsFormat(name=MockRandom), $full_path$=PostingsFormat(name=MockSep), content=Pulsing40(freqCutoff=19 minBlockSize=65 maxBlockSize=209), $payloads$=PostingsFormat(name=Lucene40WithOrds)}, sim=RandomSimilarityProvider(queryNorm=true,coord=true): {$facets=LM Jelinek-Mercer(0.70), content=DFR I(n)B3(800.0)}, locale=bg, 
timezone=Asia/Manila
[jira] [Commented] (LUCENE-3718) SamplingWrapperTest failure with certain test seed
[ https://issues.apache.org/jira/browse/LUCENE-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192046#comment-13192046 ] Doron Cohen commented on LUCENE-3718: - Fix committed in r1235190 (trunk). I added no CHANGES entry - that seems like overkill here... other opinions? SamplingWrapperTest failure with certain test seed -- Key: LUCENE-3718 URL: https://issues.apache.org/jira/browse/LUCENE-3718 Project: Lucene - Java Issue Type: Bug Components: modules/facet Reporter: Doron Cohen Assignee: Doron Cohen Fix For: 3.6, 4.0 Attachments: LUCENE-3718.patch, LUCENE-3718.patch
[jira] [Commented] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)
[ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192101#comment-13192101 ] Doron Cohen commented on LUCENE-1812: - I ran 'javadocs' under 3x/lucene/contrib/pruning and 'javadocs-all' under 3x/lucene. The latter failed due to multiple package.html files under o.a.l.index - one in core and one under contrib/pruning. Entirely renaming the package to o.a.l.pruning.index won't work because PruningReader accesses the package-protected SegmentTermVector. I can move the other classes to that new package and keep only PruningReader in the index friend package (unless there are javadoc/ant tricks that would avoid this error and still generate valid javadocs in both cases). Static index pruning by in-document term frequency (Carmel pruning) --- Key: LUCENE-1812 URL: https://issues.apache.org/jira/browse/LUCENE-1812 Project: Lucene - Java Issue Type: New Feature Components: modules/other Reporter: Andrzej Bialecki Assignee: Doron Cohen Fix For: 3.6, 4.0 Attachments: pruning.patch, pruning.patch, pruning.patch, pruning.patch, pruning.patch
[jira] [Commented] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)
[ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191206#comment-13191206 ] Doron Cohen commented on LUCENE-1812: - Getting to this at last. I did not handle the above TODOs - I'd rather commit so they can be handled later separately (progress, not perfection, as Mike says). Changes in this patch: - PruningReader also overrides getSequentialSubReaders(), otherwise no pruning takes place on sub-readers (and tests fail). - StorePruningPolicy fixed to use the FieldInfos API. I modified for Idea and maven by following templates of other contrib components, but I have no way to test this and would appreciate a review. Static index pruning by in-document term frequency (Carmel pruning) --- Key: LUCENE-1812 URL: https://issues.apache.org/jira/browse/LUCENE-1812 Project: Lucene - Java Issue Type: New Feature Components: modules/other Reporter: Andrzej Bialecki Assignee: Doron Cohen Fix For: 3.6, 4.0 Attachments: pruning.patch, pruning.patch, pruning.patch, pruning.patch, pruning.patch
[jira] [Commented] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)
[ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13191209#comment-13191209 ] Doron Cohen commented on LUCENE-1812: - I now see that all other contrib components have svn:ignore for *.iml and pom.xml - I'll add that for pruning as well (though it is not in the attached patch). Static index pruning by in-document term frequency (Carmel pruning) --- Key: LUCENE-1812 URL: https://issues.apache.org/jira/browse/LUCENE-1812 Project: Lucene - Java Issue Type: New Feature Components: modules/other Reporter: Andrzej Bialecki Assignee: Doron Cohen Fix For: 3.6, 4.0 Attachments: pruning.patch, pruning.patch, pruning.patch, pruning.patch, pruning.patch
[jira] [Commented] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)
[ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13191222#comment-13191222 ] Doron Cohen commented on LUCENE-1812: - bq. I didn't test them, but I will once they have been committed. Great, thanks! Static index pruning by in-document term frequency (Carmel pruning) --- Key: LUCENE-1812 URL: https://issues.apache.org/jira/browse/LUCENE-1812 Project: Lucene - Java Issue Type: New Feature Components: modules/other Reporter: Andrzej Bialecki Assignee: Doron Cohen Fix For: 3.6, 4.0 Attachments: pruning.patch, pruning.patch, pruning.patch, pruning.patch, pruning.patch
[jira] [Commented] (LUCENE-3718) SamplingWrapperTest failure with certain test seed
[ https://issues.apache.org/jira/browse/LUCENE-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191963#comment-13191963 ] Doron Cohen commented on LUCENE-3718: - Failure consistently recreated with these parameters. It is most likely a test bug, but still annoying. Should also rename the misspelled method: testCountUsingSamping() should be testCountUsingSampling(). SamplingWrapperTest failure with certain test seed -- Key: LUCENE-3718 URL: https://issues.apache.org/jira/browse/LUCENE-3718 Project: Lucene - Java Issue Type: Bug Components: modules/facet Reporter: Doron Cohen Assignee: Doron Cohen Fix For: 3.6, 4.0
[jira] [Commented] (LUCENE-3703) DirectoryTaxonomyReader.refresh misbehaves with ref counts
[ https://issues.apache.org/jira/browse/LUCENE-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13189036#comment-13189036 ] Doron Cohen commented on LUCENE-3703: - Missed that test comment about no need for random directory. About the decRef dup code, yeah, that's what I meant, but okay. I think this is ready to commit. DirectoryTaxonomyReader.refresh misbehaves with ref counts -- Key: LUCENE-3703 URL: https://issues.apache.org/jira/browse/LUCENE-3703 Project: Lucene - Java Issue Type: Bug Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 3.6, 4.0 Attachments: LUCENE-3703.patch, LUCENE-3703.patch DirectoryTaxonomyReader uses the internal IndexReader in order to track its own reference counting. However, when you call refresh(), it reopens the internal IndexReader, and from that point, all previous reference counting gets lost (since the new IndexReader's refCount is 1). The solution is to track reference counting in DTR itself. I wrote a simple unit test which exposes the bug (will be attached with the patch shortly).
[jira] [Commented] (LUCENE-3703) DirectoryTaxonomyReader.refresh misbehaves with ref counts
[ https://issues.apache.org/jira/browse/LUCENE-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13188975#comment-13188975 ] Doron Cohen commented on LUCENE-3703: - Patch looks good, builds and passes for me, thanks for fixing this Shai. A few comments: * CHANGES: rephrase the e.g. part like this: (e.g. if application called incRef/decRef). * New test: ** LTC.newDirectory() instead of new RAMDirectory(). ** text messages in the asserts. * DTR: ** Would it be simpler to make close() synchronized (just like IR.close())? ** Would it - again - be simpler to keep maintaining the ref-counts in the internal IR and just, in refresh, decRef as needed in the old one and incRef accordingly in the new one? This way we continue to delegate that logic to IR, and do not duplicate it. ** Current patch removes the ensureOpen() check from getRefCount(). I think this is correct - in fact I needed that when debugging this. Perhaps we should document it in the CHANGES entry. DirectoryTaxonomyReader.refresh misbehaves with ref counts -- Key: LUCENE-3703 URL: https://issues.apache.org/jira/browse/LUCENE-3703 Project: Lucene - Java Issue Type: Bug Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 3.6, 4.0 Attachments: LUCENE-3703.patch DirectoryTaxonomyReader uses the internal IndexReader in order to track its own reference counting. However, when you call refresh(), it reopens the internal IndexReader, and from that point, all previous reference counting gets lost (since the new IndexReader's refCount is 1). The solution is to track reference counting in DTR itself. I wrote a simple unit test which exposes the bug (will be attached with the patch shortly).
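The delegation suggested above (keep maintaining the ref-counts in the internal reader, and on refresh transfer the outstanding count from the old reader to the new one) can be illustrated in isolation. RefCounted below is a toy stand-in for IndexReader's refCount machinery, not Lucene API; the class and method names are illustrative only.

```java
import java.util.concurrent.atomic.AtomicInteger;

class RefCounted {
    private final AtomicInteger refCount = new AtomicInteger(1);
    void incRef() { refCount.incrementAndGet(); }
    void decRef() { refCount.decrementAndGet(); }
    int getRefCount() { return refCount.get(); }
}

public class RefreshHandOff {
    /** Returns {newReader count, oldReader count} after the hand-off. */
    static int[] demo() {
        RefCounted oldReader = new RefCounted();
        oldReader.incRef();                      // application holds one extra ref
        int outstanding = oldReader.getRefCount();

        RefCounted newReader = new RefCounted(); // reopened reader starts at 1
        for (int i = 1; i < outstanding; i++) {
            newReader.incRef();                  // carry the extra refs over
        }
        for (int i = 0; i < outstanding; i++) {
            oldReader.decRef();                  // fully release the old reader
        }
        return new int[] { newReader.getRefCount(), oldReader.getRefCount() };
    }

    public static void main(String[] args) {
        int[] counts = demo();
        System.out.println(counts[0] + " " + counts[1]); // prints "2 0"
    }
}
```

This keeps all counting logic in one place, so refresh() never leaves the application's extra references behind.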
[jira] [Commented] (LUCENE-3635) Allow setting arbitrary objects on PerfRunData
[ https://issues.apache.org/jira/browse/LUCENE-3635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172215#comment-13172215 ] Doron Cohen commented on LUCENE-3635: - Patch looks good. bq. I do not propose to move IR/IW/TR/TW etc. into that map. If however people think that we should, I can do that as well. I'd rather keep these explicit as they are now. bq. I wonder if we should have this Map require Closeable so that we can close the objects on PerfRunData.close() Closing would be convenient, but I think requiring a Closeable is too restrictive. Instead, you could add something like this to close(): {code} for (Object o : perfObjects.values()) { if (o instanceof Closeable) { IOUtils.close((Closeable) o); } } {code} This is done only once at the end, so instanceof is not a perf issue here. If we close like this, we also need to document it at setPerfObject(). I think, BTW, that PFD.close() is not called by the Benchmark, it has to be explicitly invoked by the user. Allow setting arbitrary objects on PerfRunData -- Key: LUCENE-3635 URL: https://issues.apache.org/jira/browse/LUCENE-3635 Project: Lucene - Java Issue Type: Improvement Components: modules/benchmark Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.6, 4.0 Attachments: LUCENE-3635.patch PerfRunData is used as the intermediary object between PerfRunTasks. Just like we can set IndexReader/Writer on it, it will be good if it allows setting other arbitrary objects that are e.g. created by one task and used by another. A recent example is the enhancement to the benchmark package following the addition of the facet module. We had to add TaxoReader/Writer. The proposal is to add a HashMap<String, Object> that custom PerfTasks can set()/get(). I do not propose to move IR/IW/TR/TW etc. into that map. If however people think that we should, I can do that as well.
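The {code} snippet above can be fleshed out into a self-contained sketch of the proposed perf-objects map. The names here (PerfObjects, setPerfObject, getPerfObject) are illustrative stand-ins, not the committed benchmark API.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class PerfObjects {
    private final Map<String, Object> perfObjects = new HashMap<>();

    public void setPerfObject(String key, Object obj) {
        perfObjects.put(key, obj);
    }

    public Object getPerfObject(String key) {
        return perfObjects.get(key);
    }

    /**
     * Close any stored object that happens to implement Closeable;
     * everything else is simply dropped. Documented behavior, as the
     * comment above suggests, so task authors know what to expect.
     */
    public void close() throws IOException {
        for (Object o : perfObjects.values()) {
            if (o instanceof Closeable) {
                ((Closeable) o).close();
            }
        }
        perfObjects.clear();
    }
}
```

Since close() runs once at the end of a benchmark, the instanceof check adds no measurable overhead.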
[jira] [Commented] (LUCENE-3604) 3x/lucene/contrib/CHANGES.txt has two API Changes subsections for 3.5.0
[ https://issues.apache.org/jira/browse/LUCENE-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13158306#comment-13158306 ] Doron Cohen commented on LUCENE-3604: - Fixed the 3x file in r1207018 - ordering the API Changes entries by their date (by svn log). Keeping open for fixing the Changes.html that already appears on the Web site. 3x/lucene/contrib/CHANGES.txt has two API Changes subsections for 3.5.0 - Key: LUCENE-3604 URL: https://issues.apache.org/jira/browse/LUCENE-3604 Project: Lucene - Java Issue Type: Bug Reporter: Doron Cohen Assignee: Doron Cohen Priority: Minor There are two API Changes sections, which is confusing when looking at the txt version of the file. The HTML expands only the first of the two, unless expand-all is clicked.
[jira] [Commented] (LUCENE-3604) 3x/lucene/contrib/CHANGES.txt has two API Changes subsections for 3.5.0
[ https://issues.apache.org/jira/browse/LUCENE-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13159086#comment-13159086 ] Doron Cohen commented on LUCENE-3604: - bq. The new version will show up on the website once the periodic resync happens. [3.5-contrib-changes|http://lucene.apache.org/java/3_5_0/changes/Contrib-Changes.html#3.5.0.api_changes] now shows the correct API changes. Thanks Steven! 3x/lucene/contrib/CHANGES.txt has two API Changes subsections for 3.5.0 - Key: LUCENE-3604 URL: https://issues.apache.org/jira/browse/LUCENE-3604 Project: Lucene - Java Issue Type: Bug Reporter: Doron Cohen Assignee: Steven Rowe Priority: Minor Fix For: 3.5 There are two API Changes sections, which is confusing when looking at the txt version of the file. The HTML expands only the first of the two, unless expand-all is clicked.
[jira] [Commented] (LUCENE-3596) DirectoryTaxonomyWriter extensions should be able to set internal index writer config attributes such as info stream
[ https://issues.apache.org/jira/browse/LUCENE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13157717#comment-13157717 ] Doron Cohen commented on LUCENE-3596: - Also, there seems to be a bug in the current taxonomy writer test - TestIndexClose - where the IndexWriterConfig's merge policy might allow merging segments out-of-order. That test calls LTC.newIndexWriterConfig() and it is just by luck that this test has not failed so far. This is a bad type of failure for an application (is there ever a good type? ;)), because by the time the bug is exposed it would show up as a wrong facet returned in faceted search, and good luck figuring out, that late, that this is because an index writer created at an earlier time allowed out-of-order merging... Therefore, it would be useful if, in addition to the javadocs about the required type of merge policy, we would also throw an exception (IllegalArgument or IO) if the IWC's merge policy allows merging out-of-order. This should be checked in two locations: - when createIWC() returns - when openIndex() returns, by examining the IWC of the index. The second check is more involved, as it is done after the index was already opened, so the index must be closed prior to throwing that exception. However, MergePolicy does not have in its contract anything like Collector.acceptsDocsOutOfOrder(), so it is not possible to verify this at all. Adding such a method to MergePolicy seems to me overkill for this particular case, unless there is additional interest in such a declaration? Otherwise, it is possible to require that the merge policy be a descendant of LogMergePolicy. This, on the other hand, would not allow testing this class with other order-preserving policies, such as NoMerge. So I am not sure what is the best way to proceed in this regard. I think there are two options actually: # just javadoc that fact, and fix the test to always create an order-preserving MP. # add that declaration to MP.
Unless there are opinions favoring the second option, I'll go with the first one. In addition (this is true for both options), I will move the call to createIWC into the constructor and modify the openIndex signature to accept an IWC instead of the open mode, as it seems wrong - API wise - that one extension point (createIWC) is invoked by another extension point (openIndex) - better have them both be invoked from the constructor, making it harder for someone to, by mistake, totally ignore in createIndex() the value returned by createIWC(). DirectoryTaxonomyWriter extensions should be able to set internal index writer config attributes such as info stream Key: LUCENE-3596 URL: https://issues.apache.org/jira/browse/LUCENE-3596 Project: Lucene - Java Issue Type: Improvement Components: modules/facet Reporter: Doron Cohen Assignee: Doron Cohen Priority: Minor Attachments: LUCENE-3596.patch Current protected openIndexWriter(Directory directory, OpenMode openMode) does not provide access to the IWC it creates. So extensions must reimplement this method completely in order to set e.g. the info stream for the internal index writer. This came up in [user question: Taxonomy indexer debug |http://lucene.472066.n3.nabble.com/Taxonomy-indexer-debug-td3533341.html]
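The "require a LogMergePolicy descendant" variant weighed above can be sketched with stand-in classes - these are minimal models, not Lucene's real MergePolicy hierarchy, and as the comment notes this instanceof check would also reject other order-preserving policies such as NoMerge.

```java
// Stand-ins for Lucene's merge-policy classes (illustrative only).
class MergePolicy {}
class LogMergePolicy extends MergePolicy {}

public class MergePolicyCheck {
    /**
     * Reject a merge policy that is not known to preserve segment order.
     * The taxonomy writer relies on in-order merging, so failing fast here
     * is much cheaper than debugging wrong facets later.
     */
    static void validateOrderPreserving(MergePolicy mp) {
        if (!(mp instanceof LogMergePolicy)) {
            throw new IllegalArgumentException(
                "taxonomy writer requires an order-preserving merge policy, got: "
                + mp.getClass().getSimpleName());
        }
    }

    public static void main(String[] args) {
        validateOrderPreserving(new LogMergePolicy()); // accepted
        try {
            validateOrderPreserving(new MergePolicy());
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The trade-off is exactly the one discussed: the check is simple and early, but it tests the policy's type rather than its actual merge-ordering behavior.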
[jira] [Commented] (LUCENE-3596) DirectoryTaxonomyWriter extensions should be able to set internal index writer config attributes such as info stream
[ https://issues.apache.org/jira/browse/LUCENE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13157605#comment-13157605 ] Doron Cohen commented on LUCENE-3596: - bq. and getIWC (if you intend to add it). Yes that's what I would like to add. These docs are missing then anyhow, with or without getIWC(). This added extensibility is useful, although behavior regarding the info stream differs between trunk and 3x - i.e. in 3x one can set that stream with the current extension point as well. DirectoryTaxonomyWriter extensions should be able to set internal index writer config attributes such as info stream Key: LUCENE-3596 URL: https://issues.apache.org/jira/browse/LUCENE-3596 Project: Lucene - Java Issue Type: Improvement Components: modules/facet Reporter: Doron Cohen Priority: Minor Current protected openIndexWriter(Directory directory, OpenMode openMode) does not provide access to the IWC it creates. So extensions must reimplement this method completely in order to set e.g. the info stream for the internal index writer. This came up in [user question: Taxonomy indexer debug |http://lucene.472066.n3.nabble.com/Taxonomy-indexer-debug-td3533341.html]
[jira] [Commented] (LUCENE-3588) Try harder to prevent SIGSEGV on cloned MMapIndexInputs
[ https://issues.apache.org/jira/browse/LUCENE-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13155897#comment-13155897 ] Doron Cohen commented on LUCENE-3588: - Patch (last one) works well for me - the new test fails without the fix and passes with the fix. It relies on shallow cloning of 'clones' - and so would break if WHM starts to implement Cloneable for some reason, but then the 'assert clone.clones == this.clones' in clone() guarantees early detection of this in the tests, cool. Try harder to prevent SIGSEGV on cloned MMapIndexInputs --- Key: LUCENE-3588 URL: https://issues.apache.org/jira/browse/LUCENE-3588 Project: Lucene - Java Issue Type: Improvement Components: core/store Affects Versions: 3.4, 3.5 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 3.6, 4.0 Attachments: LUCENE-3588-simpler.patch, LUCENE-3588-simpler.patch, LUCENE-3588-simpler.patch, LUCENE-3588.patch, LUCENE-3588.patch, LUCENE-3588.patch We are unmapping mmapped byte buffers which is disallowed by the JDK, because it has the risk of SIGSEGV when you access the mapped byte buffer after unmapping. We currently prevent this for the main IndexInput by setting its buffer to null, so we NPE if somebody tries to access the underlying buffer. I recently fixed also the stupid curBuf (LUCENE-3200) by setting to null. The big problem are cloned IndexInputs which are generally not closed. Those still contain references to the unmapped ByteBuffer, which lead to SIGSEGV easily. The patch from Mike in LUCENE-3439 prevents most of this in Lucene 3.5, but it's still not 100% safe (as it uses non-volatiles). This patch will fix the remaining issues by also setting the buffers of clones to null when the original is closed. The trick is to record weak references of all clones created and close them together with the original. This uses a ConcurrentHashMap<WeakReference<MMapIndexInput>,?>
as the store, with the logic borrowed from WeakHashMap to clean up the GCed references (using ReferenceQueue). If we respin 3.5, we should maybe get this in as well.
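A toy version of the clone-tracking trick described above - the byte[] field stands in for the mapped ByteBuffer, and the ReferenceQueue cleanup and concurrency details of the real patch are omitted. The point is that close() on the original nulls every live clone's buffer, so stale access fails fast with an NPE instead of risking a SIGSEGV on unmapped memory.

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

public class TrackedInput implements Cloneable {
    byte[] buffer = new byte[16];   // stand-in for the mapped buffer

    // Weak keys let GCed clones vanish from the map automatically.
    final Map<TrackedInput, Boolean> clones =
        Collections.synchronizedMap(new WeakHashMap<>());

    @Override
    public TrackedInput clone() {
        try {
            TrackedInput c = (TrackedInput) super.clone();
            // Shallow clone shares the 'clones' map (as the comment above
            // notes); register the clone with the shared map.
            clones.put(c, Boolean.TRUE);
            return c;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);
        }
    }

    public void close() {
        buffer = null;               // main input fails fast from now on
        for (TrackedInput c : clones.keySet()) {
            c.buffer = null;         // ...and so does every live clone
        }
        clones.clear();
    }
}
```

After close(), any surviving clone that still tries to read sees a null buffer immediately, which is exactly the detectable failure mode the patch aims for.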
[jira] [Commented] (LUCENE-3573) TaxonomyReader.refresh() is broken, replace its logic with reopen(), following IR.reopen pattern
[ https://issues.apache.org/jira/browse/LUCENE-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13151188#comment-13151188 ] Doron Cohen commented on LUCENE-3573: - Hmm, now that there is a test for LTW.rollback(), my changes fail LTW's testRollback(), because LTW.close() now may call IW.commit(Map) (which it did not do before my changes). To fix this: - added a private doClose() which closes IW and nullifies it, and calls closeResources(). - rollback() calls doClose() instead of close(). Also, rollback() is now synchronized. TaxonomyReader.refresh() is broken, replace its logic with reopen(), following IR.reopen pattern Key: LUCENE-3573 URL: https://issues.apache.org/jira/browse/LUCENE-3573 Project: Lucene - Java Issue Type: Bug Components: modules/facet Reporter: Doron Cohen Assignee: Doron Cohen Priority: Minor Attachments: LUCENE-3573.patch, LUCENE-3573.patch When recreating the taxonomy index, TR's assumption that categories are only added does not hold anymore. As a result, calling TR.refresh() will be incorrect at best, but usually throw an AIOOBE.
[jira] [Commented] (LUCENE-3573) TaxonomyReader.refresh() is broken, replace its logic with reopen(), following IR.reopen pattern
[ https://issues.apache.org/jira/browse/LUCENE-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13149684#comment-13149684 ] Doron Cohen commented on LUCENE-3573: - I agree about keeping the same notions as IR. bq. returns null (no changes, or the taxonomy wasn't recreated) In fact I was thinking of a different contract. So we have two approaches here for the returned value: * Option A: ## *new TR* - if the taxonomy was recreated. ## *null* - if the taxonomy was either not modified or just grew. * Option B: ## *new TR* - if the taxonomy was modified (either recreated or just grew) ## *null* - if the taxonomy was not modified. Option A is simpler to implement, but I think it has two drawbacks: * it is confusingly different from that of IR * the fact that the TR was refreshed is hidden from the caller. Option B is a bit more involved to implement: * would need to copy arrays' data from the old TR to the new one in case the taxonomy only grew I started to implement option B but I am now rethinking this... bq. Was there any reason to add it to TestTaxonomyCombined? Good point, I should probably move this to TestDirectoryTaxonomyReader. TaxonomyReader.refresh() is broken, replace its logic with reopen(), following IR.reopen pattern Key: LUCENE-3573 URL: https://issues.apache.org/jira/browse/LUCENE-3573 Project: Lucene - Java Issue Type: Bug Components: modules/facet Reporter: Doron Cohen Assignee: Doron Cohen Priority: Minor Attachments: LUCENE-3573.patch When recreating the taxonomy index, TR's assumption that categories are only added does not hold anymore. As a result, calling TR.refresh() will be incorrect at best, but usually throw an AIOOBE.
[jira] [Commented] (LUCENE-3573) TaxonomyReader.refresh() is broken, replace its logic with reopen(), following IR.reopen pattern
[ https://issues.apache.org/jira/browse/LUCENE-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13149687#comment-13149687 ] Doron Cohen commented on LUCENE-3573: - One more thing - In approach B, the fact that the taxonomy just grew simply allows an optimization (read only the new ordinals), and so it is not a part of the API logic, and the only logic is - was the taxonomy modified or not. - In approach A, this fact is part of the API logic. TaxonomyReader.refresh() is broken, replace its logic with reopen(), following IR.reopen pattern Key: LUCENE-3573 URL: https://issues.apache.org/jira/browse/LUCENE-3573 Project: Lucene - Java Issue Type: Bug Components: modules/facet Reporter: Doron Cohen Assignee: Doron Cohen Priority: Minor Attachments: LUCENE-3573.patch When recreating the taxonomy index, TR's assumption that categories are only added does not hold anymore. As a result, calling TR.refresh() will be incorrect at best, but usually throw an AIOOBE.
[jira] [Commented] (LUCENE-3573) TaxonomyReader.refresh() is broken, replace its logic with reopen(), following IR.reopen pattern
[ https://issues.apache.org/jira/browse/LUCENE-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150279#comment-13150279 ] Doron Cohen commented on LUCENE-3573: - bq. So in fact, let's not call it openIfChanged, because may not be meaningful. Yes this bothered me too. bq. so maybe refreshIfChanged? ... let's stick to refresh() (but...) The current refresh impl is efficient in that (1) arrays only grow if needed and (2) caches are only cleaned of 'invalid ordinals'. In that, it relies on the fact that the taxonomy can only grow (unless it is recreated, hence this issue). So I now think it would be best to modify refresh() slightly - in case it detects that the taxonomy was recreated, it will throw a new (checked) exception - telling the application that this TR cannot be refreshed, but the app can open a new TR. This way there is no 3-way logic - either the TR was refreshed or it was not. And while we are at it, refresh() is void. I think it would be useful to return a boolean, indicating whether any refresh took place. TaxonomyReader.refresh() is broken, replace its logic with reopen(), following IR.reopen pattern Key: LUCENE-3573 URL: https://issues.apache.org/jira/browse/LUCENE-3573 Project: Lucene - Java Issue Type: Bug Components: modules/facet Reporter: Doron Cohen Assignee: Doron Cohen Priority: Minor Attachments: LUCENE-3573.patch When recreating the taxonomy index, TR's assumption that categories are only added does not hold anymore. As a result, calling TR.refresh() will be incorrect at best, but usually throw an AIOOBE.
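The contract proposed above (refresh() returns whether anything changed, and throws a checked exception when it detects the taxonomy was recreated, so the caller must open a fresh reader) could look roughly like the sketch below. The epoch and size fields are hypothetical stand-ins for however DTR would actually detect recreation and growth; the names are illustrative, not Lucene API.

```java
// Checked exception: "this TR cannot be refreshed, open a new one".
class TaxonomyRecreatedException extends Exception {
    TaxonomyRecreatedException(String msg) { super(msg); }
}

public class TaxoReaderSketch {
    private final long indexEpoch;  // changes when the taxonomy is recreated
    private int size;               // number of categories seen so far

    TaxoReaderSketch(long epoch, int size) {
        this.indexEpoch = epoch;
        this.size = size;
    }

    /** @return true if new categories were picked up, false if unchanged. */
    boolean refresh(long currentEpoch, int currentSize)
            throws TaxonomyRecreatedException {
        if (currentEpoch != indexEpoch) {
            // Grow-only assumption broken: refuse instead of corrupting state.
            throw new TaxonomyRecreatedException(
                "taxonomy was recreated; open a new reader");
        }
        if (currentSize == size) {
            return false;           // nothing changed, nothing to do
        }
        size = currentSize;         // grow-only update (read new ordinals)
        return true;
    }
}
```

This keeps the two-way logic argued for above: the reader either refreshed or it did not, and recreation is an explicit, checked failure rather than a silent third state.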
[jira] [Commented] (LUCENE-3564) rename IndexWriter.rollback to .rollbackAndClose
[ https://issues.apache.org/jira/browse/LUCENE-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13145045#comment-13145045 ] Doron Cohen commented on LUCENE-3564: - My personal preference for this API is the current simple and short name *rollback()*. rename IndexWriter.rollback to .rollbackAndClose Key: LUCENE-3564 URL: https://issues.apache.org/jira/browse/LUCENE-3564 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.5, 4.0 Spinoff from LUCENE-3454, where Shai noticed that rollback is trappy since it [unexpectedly] closes the IW. I think we should rename it to rollbackAndClose.
[jira] [Commented] (LUCENE-3454) rename optimize to a less cool-sounding name
[ https://issues.apache.org/jira/browse/LUCENE-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13145047#comment-13145047 ] Doron Cohen commented on LUCENE-3454: - bq. Perhaps I am the only one, but I find these ifNeeded, maybeThis, maybeThat method names so ugly. I prefer JavaDoc for trying to catch the subtleties. I feel that way too. But a name change here seems in order, because as pointed out above, there is an issue with the current catchy name *optimize()*. My personal preference among the names suggested above is Mike's last one: *forceMerge(int)*: - it describes what's done - does not suggest it does wonders - requires the caller to think twice, since they are deciding to force a certain behavior rename optimize to a less cool-sounding name Key: LUCENE-3454 URL: https://issues.apache.org/jira/browse/LUCENE-3454 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.4, 4.0 Reporter: Robert Muir Assignee: Michael McCandless Attachments: LUCENE-3454.patch I think users see the name optimize and feel they must do this, because who wants a suboptimal system? But this probably just results in wasted time and resources. Maybe rename to collapseSegments or something?
[jira] [Commented] (LUCENE-3506) tests for verifying that assertions are enabled do nothing since they ignore AssertionError
[ https://issues.apache.org/jira/browse/LUCENE-3506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13136823#comment-13136823 ] Doron Cohen commented on LUCENE-3506: - bq. I just committed this change to the IntelliJ IDEA configuration Thanks for fixing for IntelliJ! tests for verifying that assertions are enabled do nothing since they ignore AssertionError --- Key: LUCENE-3506 URL: https://issues.apache.org/jira/browse/LUCENE-3506 Project: Lucene - Java Issue Type: Bug Components: general/test Reporter: Doron Cohen Assignee: Doron Cohen Priority: Minor Attachments: LUCENE-3506.patch, LUCENE-3506.patch Follow-up from LUCENE-3501
[jira] [Commented] (LUCENE-3506) tests for verifying that assertions are enabled do nothing since they ignore AssertionError
[ https://issues.apache.org/jira/browse/LUCENE-3506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13136826#comment-13136826 ] Doron Cohen commented on LUCENE-3506: - {quote} bq.Also, we've often done performance tests as unit tests in the past. Is there an easy way to disable this assertions enabled test? You can also enable assertions just for the class/package which checks if assertions are enabled, Yonik. This should make the check pass and disable all other assertions (for benchmarking). I don't remember the syntax off the top of my head though. {quote} Yonik, is this sufficient for running the perf tests? Otherwise I can add a -D flag for disabling testing this in LTC. tests for verifying that assertions are enabled do nothing since they ignore AssertionError --- Key: LUCENE-3506 URL: https://issues.apache.org/jira/browse/LUCENE-3506 Project: Lucene - Java Issue Type: Bug Components: general/test Reporter: Doron Cohen Assignee: Doron Cohen Priority: Minor Attachments: LUCENE-3506.patch, LUCENE-3506.patch Follow-up from LUCENE-3501
[jira] [Commented] (LUCENE-3506) tests for verifying that assertions are enabled do nothing since they ignore AssertionError
[ https://issues.apache.org/jira/browse/LUCENE-3506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13136915#comment-13136915 ] Doron Cohen commented on LUCENE-3506: - For easier perf testing I added a -D flag to tell LTC not to fail each and every test if Java assertions are not enabled: {noformat} -Dtests.asserts.gracious=true {noformat} (Tests requiring Java assertions - e.g. TestAssertions - will still fail, on purpose.) - r1189655 - trunk - r1189663 - 3x tests for verifying that assertions are enabled do nothing since they ignore AssertionError --- Key: LUCENE-3506 URL: https://issues.apache.org/jira/browse/LUCENE-3506 Project: Lucene - Java Issue Type: Bug Components: general/test Reporter: Doron Cohen Assignee: Doron Cohen Priority: Minor Attachments: LUCENE-3506.patch, LUCENE-3506.patch Follow-up from LUCENE-3501
[jira] [Commented] (LUCENE-3506) tests for verifying that assertions are enabled do nothing since they ignore AssertionError
[ https://issues.apache.org/jira/browse/LUCENE-3506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13135086#comment-13135086 ]

Doron Cohen commented on LUCENE-3506:
-------------------------------------

bq. (Whereas today if you run that test w/o assertions you get a failure, albeit a confusing one).

Actually, today when you run the tests - with assertions or without them - you get no failures at all, which is what I was trying to fix here (unless I missed something seriously). This is because the original tests, after deciding to fail, invoked fail(); this threw AssertionError, which was then ignored as part of their wrong logic.

bq. I'm confused here - the changes to TestSegmentMerger look like they'll allow the test to pass when assertions are disabled?

Right. I fixed it such that *only if* assertions are enabled do the tests verify that the expected assertion errors are not thrown, so they allow you to run the tests also without enabling assertions. See my comment above re only one test. I take it that this kind of flexibility is not required, so I will change it so that these tests fail if assertions are not enabled.

bq. The other day I committed an accidental change to common-build that disabled assertions, and it was a little confusing to track down.

I see, so we make the entire test framework fail if assertions are not enabled. I'll update the patch.
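The wrong logic being discussed can be shown as a minimal, self-contained sketch. The class and method names here are hypothetical (the real tests use JUnit's fail() inside Lucene test classes); the point is only the catch-swallows-the-failure pattern:

```java
// Hypothetical sketch of the broken pattern: a check that is supposed to fail
// when Java assertions are disabled, but swallows the very AssertionError
// that signals the failure.
public class AssertionCheckDemo {

    // Broken: returns true whether or not the JVM runs with -ea.
    public static boolean brokenAssertionsEnabledCheck() {
        try {
            assert false; // throws AssertionError only when -ea is set
            // Assertions are disabled, so we "fail" - but JUnit 3's fail()
            // also throws AssertionError...
            throw new AssertionError("assertions are not enabled");
        } catch (AssertionError e) {
            // ...and this catch silently swallows both cases, hiding the failure.
            return true;
        }
    }

    // Correct: use the assert's side effect; no AssertionError to mis-catch.
    public static boolean assertionsEnabled() {
        boolean enabled = false;
        assert enabled = true; // assignment only evaluated when -ea is set
        return enabled;
    }

    public static void main(String[] args) {
        System.out.println("broken check says enabled: " + brokenAssertionsEnabledCheck());
        System.out.println("assertions actually enabled: " + assertionsEnabled());
    }
}
```

Running with and without -ea shows the broken check reporting "enabled" in both cases, while the side-effect version tracks the real JVM setting.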
[jira] [Commented] (LUCENE-3501) random sampler is not random (and so facet SamplingWrapperTest occasionally fails)
[ https://issues.apache.org/jira/browse/LUCENE-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13124997#comment-13124997 ]

Doron Cohen commented on LUCENE-3501:
-------------------------------------

Fixed in trunk: r1181760

Shai's comment on catching AssertionError made me search for other cases of catching this error in Lucene. A few such cases exist, and they all seem wrong: they call fail() when assertions turn out not to be enabled, but then fail to detect that failure, since they silently ignore the AssertionError thrown by fail() itself. Opened LUCENE-3506 for this.

random sampler is not random (and so facet SamplingWrapperTest occasionally fails)
----------------------------------------------------------------------------------

                Key: LUCENE-3501
                URL: https://issues.apache.org/jira/browse/LUCENE-3501
            Project: Lucene - Java
         Issue Type: Bug
         Components: modules/facet
           Reporter: Doron Cohen
           Assignee: Doron Cohen
           Priority: Minor
        Attachments: LUCENE-3501.patch

RandomSample is not random at all: it does not even import java.util.Random, and its behavior is deterministic. In addition, the test testCountUsingSamping() never retries as it was supposed to (for taking care of the hoped-for randomness).
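The fix direction implied by the issue description - actually using java.util.Random so sampling is random across seeds yet reproducible for a fixed seed - can be sketched with illustrative names (this is not the real RandomSample API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sampler sketch: selects roughly ratio of the input doc ids
// using an injected java.util.Random, so a test can seed it (e.g. from the
// seed printed by -Dtests.seed) and reproduce a failing sample exactly.
public class RandomSampleSketch {

    public static List<Integer> sample(List<Integer> docIds, double ratio, Random random) {
        List<Integer> sampled = new ArrayList<>();
        for (int doc : docIds) {
            if (random.nextDouble() < ratio) { // independent coin flip per doc
                sampled.add(doc);
            }
        }
        return sampled;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) docs.add(i);
        List<Integer> a = sample(docs, 0.1, new Random(42));
        List<Integer> b = sample(docs, 0.1, new Random(42));
        System.out.println(a.equals(b)); // same seed, same sample: reproducible
    }
}
```

Injecting the Random (rather than constructing one internally) is what lets a retrying test like testCountUsingSamping() try a fresh seed on each retry while still being reproducible from a logged seed.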
[jira] [Commented] (LUCENE-3501) random sampler is not random (and so facet SamplingWrapperTest occasionally fails)
[ https://issues.apache.org/jira/browse/LUCENE-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13123703#comment-13123703 ]

Doron Cohen commented on LUCENE-3501:
-------------------------------------

The error (from Jenkins) was:
{noformat}
junit.framework.AssertionFailedError: Results are not the same!
	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:149)
	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:51)
	at org.apache.lucene.facet.FacetTestBase.assertSameResults(FacetTestBase.java:316)
	at org.apache.lucene.facet.search.sampling.BaseSampleTestTopK.assertSampling(BaseSampleTestTopK.java:93)
	at org.apache.lucene.facet.search.sampling.BaseSampleTestTopK.testCountUsingSamping(BaseSampleTestTopK.java:76)
	at org.apache.lucene.util.LuceneTestCase$2$1.evaluate(LuceneTestCase.java:610)

reproduce with: ant test -Dtestcase=SamplingWrapperTest -Dtestmethod=testCountUsingSamping -Dtests.seed=39c6b88dcada2192:-cf936a4278714b1:-770b2814b4a6acd7
{noformat}
[jira] [Commented] (LUCENE-3262) Facet benchmarking
[ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13123714#comment-13123714 ]

Doron Cohen commented on LUCENE-3262:
-------------------------------------

bq. I reduced those to 1-20 per document with depth of 1-3 and got results I could live with.

I agree; I tried this too now and the comparison is more reasonable. Perhaps what reasonable numbers are (for #facets/doc and their depth) is debatable, but I agree that 200 facets per document is too many. Changing the defaults to 20/3 and preparing to commit.

Facet benchmarking
------------------

                Key: LUCENE-3262
                URL: https://issues.apache.org/jira/browse/LUCENE-3262
            Project: Lucene - Java
         Issue Type: New Feature
         Components: modules/benchmark, modules/facet
           Reporter: Shai Erera
           Assignee: Doron Cohen
        Attachments: CorpusGenerator.java, LUCENE-3262.patch, LUCENE-3262.patch, LUCENE-3262.patch, TestPerformanceHack.java

A spin-off from LUCENE-3079. We should define a few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implementing FST-based caches, etc.). Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point. We've also done some preliminary work on extending Benchmark for faceting, so I'll attach that here as well. We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.
[jira] [Commented] (LUCENE-3262) Facet benchmarking
[ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13123735#comment-13123735 ]

Doron Cohen commented on LUCENE-3262:
-------------------------------------

Committed to 3x in r1180637, thanks Gilad!

Now porting to trunk; it is more involved than anticipated, because of contrib/modules differences. Managed to make the tests pass, and the benchmark alg of choice to run. However, I noticed that in 3x that alg - when indexing Reuters - added the entire collection, that is 21578 docs, while in trunk it only added about 400 docs. Might be something in my set-up, digging...
[jira] [Commented] (LUCENE-3501) random sampler is not random (and so facet SamplingWrapperTest occasionally fails)
[ https://issues.apache.org/jira/browse/LUCENE-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13123744#comment-13123744 ]

Doron Cohen commented on LUCENE-3501:
-------------------------------------

Thanks for reviewing, Shai! I'll change as you propose (confirming your understanding) and commit tomorrow.
[jira] [Commented] (LUCENE-3262) Facet benchmarking
[ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13122580#comment-13122580 ]

Doron Cohen commented on LUCENE-3262:
-------------------------------------

bq. changes entry

Right, I always forget to include it in the patch and add it only afterward - I should change that...

Also, I am not comfortable with the use of a config property in AddDocTask to tell it that facets should be added - it seems too implicit to me, all of a sudden... So I think it would be better to refactor the doc creation in AddDoc into a method, and add an AddFacetedDocTask that extends AddDoc and overrides the creation of the doc to be added, calling super and then adding the facets into it.
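The proposed refactoring - factor doc creation into an overridable method, then have the faceted task call super and stack facets on top - can be sketched with simplified stand-in classes (the real benchmark tasks build Lucene Documents from a PerfRunData; all names below are illustrative only):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the subclass-override refactoring with hypothetical stand-ins
// for the benchmark's Document/task classes.
public class FacetTaskSketch {

    static class Doc {
        final List<String> fields = new ArrayList<>();
    }

    static class AddDocTask {
        // Doc creation factored out so subclasses can extend rather than copy it.
        protected Doc createDocument() {
            Doc doc = new Doc();
            doc.fields.add("body");
            return doc;
        }

        public Doc run() {
            return createDocument();
        }
    }

    static class AddFacetedDocTask extends AddDocTask {
        @Override
        protected Doc createDocument() {
            Doc doc = super.createDocument();  // reuse the plain-doc creation
            doc.fields.add("facet:root/child"); // then add the facet fields
            return doc;
        }
    }

    public static void main(String[] args) {
        System.out.println(new AddDocTask().run().fields);        // [body]
        System.out.println(new AddFacetedDocTask().run().fields); // [body, facet:root/child]
    }
}
```

The override keeps the facets-vs-no-facets decision explicit in the task name rather than hidden behind a config property.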
[jira] [Commented] (LUCENE-3262) Facet benchmarking
[ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13122598#comment-13122598 ]

Doron Cohen commented on LUCENE-3262:
-------------------------------------

Actually, since the doc is created at setup(), it is sufficient to make the doc protected (it was private). Also, that with.facets property is useful for comparisons, so I kept it (now used only in AddFacetedDocTask) but changed its default to true.
[jira] [Commented] (LUCENE-3262) Facet benchmarking
[ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13123134#comment-13123134 ]

Doron Cohen commented on LUCENE-3262:
-------------------------------------

bq. Someone can use AddFacetedDocTask w/ and w/o facets? What for?

It is useful for specifying the property like this:
{code}
with.facets=facets:true:false
...
{ MAddDocs AddFacetedDoc : 400
{code}
and then getting in the report something like this:
{noformat}
Report sum by Prefix (MAddDocs) and Round (4 about 4 out of 42)
Operation      round   facets   runCnt   recsPerRun      rec/s   elapsedSec
MAddDocs_400       0     true        1          400     246.61         1.62
MAddDocs_400 -     1 -  false -      1 -        400 - 1,801.80 - -     0.22
MAddDocs_400       2     true        1          400     412.80         0.97
MAddDocs_400 -     3 -  false -      1 -        400 - 2,139.04 - -     0.19
{noformat}
[jira] [Commented] (LUCENE-3262) Facet benchmarking
[ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13120003#comment-13120003 ]

Doron Cohen commented on LUCENE-3262:
-------------------------------------

I am working on a patch for this, much along the lines of the Solr benchmark patch in SOLR-2646. Currently the direction is:
- Add to PerfRunData:
-- Taxonomy Directory
-- Taxonomy Writer
-- Taxonomy Reader
- Add tasks for manipulating facets and taxonomies:
-- create/open/commit/close Taxonomy Index
-- open/close Taxonomy Reader
-- AddDoc with facets
- FacetDocMaker will also build the categories into the document
- FacetSource will bring back categories to be added to the current doc
- ReadTask will be extended to also support faceted search. This is different from the Solr benchmark approach, where a SolrSearchTask does not extend ReadTask but rather extends PerfTask. Not sure yet if this is the way to go - still work to be done here.

Should have a starting patch in a day or two.