Solr-trunk - Build # 1389 - Failure
Build: https://hudson.apache.org/hudson/job/Solr-trunk/1389/ All tests passed Build Log (for compile errors): [...truncated 18778 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2868) It should be easy to make use of TermState; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986281#action_12986281 ]

Simon Willnauer commented on LUCENE-2868:

bq. Here's my take on the patch, including ability to cache weight objects.

I have a couple of comments here. First, I cannot apply your patch to the current trunk; can you update it?

* You keep a cache per IndexSearcher (btw. QueryDataCache is missing in the patch) which is used to cache several things across searches. This is very dangerous! While I don't know how it is implemented, I would guess you need to synchronize access to it, so it would slow down searches, eh?
* Caching Scorers is going to break, since Scorers are stateful and might be advanced to different documents. Yet I can see what you are trying to do here: doing work in a scorer is costly, so common TermQueries, for instance, should not need to load the same posting list twice. Two things come to my mind right away:
  1. Posting list caching - should be done on a codec level IMO.
  2. Building PerReaderTermState only once for a common TermQuery.
  While caching posting lists is going to be tricky and quite a task, reusing PerReaderTermState could work fine as far as I can see if you are in the same searcher.
* Caching Weights is kind of weird - what is the reason for this again? The only thing you really save here is setup costs, which are generally very low.

Overall I don't like that this way you tightly couple something to Weight / Query etc. for a single purpose that could be solved with some kind of query optimization phase, similar to what I had in my last patch and Earwin has proposed. I think we should not tightly couple things like that into Lucene. This is really extremely application dependent in most cases, and we should only provide the infrastructure to do it.

bq. Earwin - I think we should make a new issue and get something like that implemented in there which is more general than what I just sketched out. If you could share your code that would be awesome!

Earwin, any news on this - shall I open an issue for that?

bq. It occurs to me that the name of the common class that gets created in IndexSearcher and passed around should probably be named something more appropriate, like QueryContext. That way people will feel free to extend it to hold all sorts of query-local data, in time. Thoughts?

You refer to ScorerContext? This class was actually not intended to be extensible; it has been public final until now. I am not sure if we should open that up, though.

It should be easy to make use of TermState; rewritten queries should be shared automatically

Key: LUCENE-2868
URL: https://issues.apache.org/jira/browse/LUCENE-2868
Project: Lucene - Java
Issue Type: Improvement
Components: Query/Scoring
Reporter: Karl Wright
Attachments: lucene-2868.patch, query-rewriter.patch

When you have the same query in a query hierarchy multiple times, tremendous savings can now be had if the user knows enough to share the rewritten queries in the hierarchy, due to the TermState addition. But this is clumsy and requires a lot of coding by the user to take advantage of. Lucene should be smart enough to share the rewritten queries automatically.

This can be most readily (and powerfully) done by introducing a new method to Query.java:

    Query rewriteUsingCache(IndexReader indexReader)

... and including a caching implementation right in Query.java which would then work for all. Of course, all callers would want to use this new method rather than the current rewrite().

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
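To make the rewrite-sharing idea in the issue description concrete, here is a minimal sketch of memoizing a rewrite step for the duration of one search. It is not taken from either attached patch: the class name RewriteCache, the generic parameter Q and the rewriter function are all hypothetical stand-ins, and the sketch assumes equal queries have consistent equals()/hashCode().

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical sketch: memoizes an expensive rewrite step so that equal
 * queries appearing several times in one hierarchy are rewritten once.
 * Q stands in for a query type; this is not Lucene's Query API.
 */
class RewriteCache<Q> {
    private final Map<Q, Q> cache = new HashMap<>();
    private final Function<Q, Q> rewriter;
    private int rewriteCalls = 0;

    RewriteCache(Function<Q, Q> rewriter) {
        this.rewriter = rewriter;
    }

    /** Returns the shared rewritten form, invoking the rewriter only on a cache miss. */
    Q rewriteUsingCache(Q query) {
        Q cached = cache.get(query);
        if (cached == null) {
            rewriteCalls++;
            cached = rewriter.apply(query);
            cache.put(query, cached);
        }
        return cached;
    }

    /** Number of times the underlying rewriter actually ran. */
    int rewriteCalls() {
        return rewriteCalls;
    }
}
```

The map is deliberately unsynchronized, so one instance would have to be created per search rather than kept on a shared searcher; that sidesteps the synchronization cost Simon raises above, at the price of no reuse across searches.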
RE: Solr-trunk - Build # 1389 - Failure
F**ck! I am posting a comment, the stack trace looks different!

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: Apache Hudson Server [mailto:hud...@hudson.apache.org]
Sent: Tuesday, January 25, 2011 9:53 AM
To: dev@lucene.apache.org
Subject: Solr-trunk - Build # 1389 - Failure

Build: https://hudson.apache.org/hudson/job/Solr-trunk/1389/

All tests passed

Build Log (for compile errors):
[...truncated 18778 lines...]
[jira] Commented: (LUCENE-2010) Remove segments with all documents deleted in commit/flush/close of IndexWriter instead of waiting until a merge occurs.
[ https://issues.apache.org/jira/browse/LUCENE-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986346#action_12986346 ]

Michael McCandless commented on LUCENE-2010:

bq. Do you want to fix the rest of the tests and remove the test-only keepAllSegments method?

It's actually only the QueryUtils test class that uses this... it makes an empty index by adding N docs and then deleting them all. So the test-only API needs to be public (QueryUtils is in oal.search). I'll mark it as lucene.internal...

Remove segments with all documents deleted in commit/flush/close of IndexWriter instead of waiting until a merge occurs.

Key: LUCENE-2010
URL: https://issues.apache.org/jira/browse/LUCENE-2010
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Michael McCandless
Fix For: 3.1, 4.0
Attachments: LUCENE-2010.patch

I do not know if this is a bug in 2.9.0, but it seems that segments with all documents deleted are not automatically removed:

{noformat}
4 of 14: name=_dlo docCount=5
  compound=true
  hasProx=true
  numFiles=2
  size (MB)=0.059
  diagnostics = {java.version=1.5.0_21, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, os=SunOS, os.arch=amd64, java.vendor=Sun Microsystems Inc., os.version=5.10, source=flush}
  has deletions [delFileName=_dlo_1.del]
  test: open reader.........OK [5 deleted docs]
  test: fields..............OK [136 fields]
  test: field norms.........OK [136 fields]
  test: terms, freq, prox...OK [1698 terms; 4236 terms/docs pairs; 0 tokens]
  test: stored fields.......OK [0 total field count; avg ? fields per doc]
  test: term vectors........OK [0 total vector count; avg ? term/freq vector fields per doc]
{noformat}

Shouldn't such segments be removed automatically during the next commit/close of IndexWriter?

*Mike McCandless:* Lucene doesn't actually short-circuit this case, ie, if every single doc in a given segment has been deleted, it will still merge it [away] like normal, rather than simply dropping it immediately from the index, which I agree would be a simple optimization. Can you open a new issue? I would think IW can drop such a segment immediately (ie not wait for a merge or optimize) on flushing new deletes.
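The optimization discussed above amounts to filtering the segment list at commit time. The following is a minimal sketch under stated assumptions: Segment is a hypothetical stand-in holding only a name, a document count and a deletion count (it is not Lucene's SegmentInfo), and pruneOnCommit is an invented helper name.

```java
import java.util.ArrayList;
import java.util.List;

class DropEmptySegments {
    /** Hypothetical per-segment metadata; not Lucene's SegmentInfo. */
    static final class Segment {
        final String name;
        final int docCount;
        final int delCount;

        Segment(String name, int docCount, int delCount) {
            this.name = name;
            this.docCount = docCount;
            this.delCount = delCount;
        }

        /** True when every document in the segment is marked deleted. */
        boolean allDeleted() {
            return delCount == docCount;
        }
    }

    /** Returns the segments worth keeping, dropping those with no live documents. */
    static List<Segment> pruneOnCommit(List<Segment> segments) {
        List<Segment> kept = new ArrayList<>();
        for (Segment s : segments) {
            if (!s.allDeleted()) {
                kept.add(s);
            }
        }
        return kept;
    }
}
```

With the segment from the report (_dlo, 5 docs, 5 deleted) and a second, made-up segment that still has live documents, only the second one survives the prune.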
[jira] Updated: (LUCENE-2868) It should be easy to make use of TermState; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated LUCENE-2868:

Attachment: lucene-2868.patch

Oops, forgot to add a key file.

Seriously, the weight caching is of minor utility. The scorer caching is not enabled. So all that this patch does differently is try to define a broader concept of query context, rather than the narrow fix Simon proposes.

It should be easy to make use of TermState; rewritten queries should be shared automatically

Key: LUCENE-2868
URL: https://issues.apache.org/jira/browse/LUCENE-2868
Project: Lucene - Java
Issue Type: Improvement
Components: Query/Scoring
Reporter: Karl Wright
Attachments: lucene-2868.patch, lucene-2868.patch, query-rewriter.patch

When you have the same query in a query hierarchy multiple times, tremendous savings can now be had if the user knows enough to share the rewritten queries in the hierarchy, due to the TermState addition. But this is clumsy and requires a lot of coding by the user to take advantage of. Lucene should be smart enough to share the rewritten queries automatically.

This can be most readily (and powerfully) done by introducing a new method to Query.java:

    Query rewriteUsingCache(IndexReader indexReader)

... and including a caching implementation right in Query.java which would then work for all. Of course, all callers would want to use this new method rather than the current rewrite().
Re: [jira] Created: (LUCENE-2886) Adaptive Frame Of Reference
Hi Paul,

This is a good question. The two methods, VSE and AFOR, are very similar. Both can be considered extensions of FOR that make it less sensitive to outliers by adapting the encoding to the value distribution. To achieve this, both methods encode a list of values by
- partitioning it into frames (sequences of consecutive integers) of variable length,
- encoding each frame using a different bit frame (the minimum number of bits required to encode any integer in the frame while still being able to distinguish them),
- relying on algorithms to automatically find a good list partitioning.

Apart from minor differences in the implementation design (which I discuss below), the main difference is that VSE is optimised for a high compression rate and fast decompression but disregards the efficiency of compression, while AFOR is optimised for a high compression rate, fast decompression and also fast compression.

VSE uses a dynamic programming method to find the *optimal partitioning* of a list (optimal in terms of compression rate). While this approach provides a higher compression rate than the one proposed in AFOR, the complexity of such a partitioning algorithm is O(n * k), where n is the number of values and k the size of the largest frame, which might greatly impact compression performance. In AFOR, we instead use a local optimisation algorithm that is less effective in terms of compression rate but faster to compute.

In terms of implementation details, there are a few differences.

1) VSE allows frames of length 1, 2, 4, 6, 8, 12, 16 and 32. The current implementation of AFOR restricts the length of a frame to a multiple of 8 so that it is aligned with the start and end of a byte boundary (and also to minimise the number of loop-unrolled, highly optimised routines). More precisely, AFOR-2 uses three frame lengths: 8, 16 and 32.

2) To allow the *optimal partitioning* of a list, the original implementation of VSE needs to operate on the full list. In contrast, AFOR has been developed to operate on small subsets of the list, so that it can be applied during incremental construction of the compressed list (it does not require the full list, but works on small blocks of 32 or more integers). However, we can think of applying VSE to small subsets, as in AFOR. In this case, VSE does not compute the optimal partition of the list, but only the optimal partition of each subset.

VSE and AFOR encode a frame in a similar way: first a header (1 byte) which provides the bit frame and the frame length, then the encoded frame.

So, as you can see, in essence the two models are very similar. For the background, I know Fabrizio Silvestri (co-author of VSE) well, and he was my PhD thesis examiner (the AFOR compression scheme is a chapter of my thesis). The funny thing is that we came up with these two models at the same time, this summer, without knowing we were working on something similar ;o). However, he was luckier than I was and published his findings before me.

I hope this answers your question. Feel free to ask if you have any other questions.

Regards,
--
Renaud Delbru

On 24/01/11 22:02, Paul Elschot wrote:

Any idea on how this compares to the vector split encoding here:
http://puma.isti.cnr.it/publichtml/section_cnr_isti/cnr_isti_2010-TR-016.html ?

Regards,
Paul Elschot

On Monday 24 January 2011 19:32:44 Renaud Delbru (JIRA) wrote:

Adaptive Frame Of Reference

Key: LUCENE-2886
URL: https://issues.apache.org/jira/browse/LUCENE-2886
Project: Lucene - Java
Issue Type: New Feature
Components: Codecs
Reporter: Renaud Delbru
Fix For: 4.0

We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch. I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch.

I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64-bit words) that have been used in the experiments in [1].

[1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf
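The frame and bit-frame notions described in this thread can be illustrated with a small cost model. This is only a sketch under stated assumptions, not the AFOR or VSE partitioning algorithm: it assumes non-negative values, a 1-byte header per frame as described above, frame lengths of 8, 16 and 32, and it simply compares encoding a block whole against splitting it in halves. The names FrameCostSketch, bitFrame, frameBytes and bestBytes are invented for the example.

```java
class FrameCostSketch {
    /** Minimum number of bits able to represent every (non-negative) value in values[from, from + len). */
    static int bitFrame(int[] values, int from, int len) {
        int or = 0;
        for (int i = from; i < from + len; i++) {
            or |= values[i];
        }
        // At least 1 bit, so that a frame of zeros still has a payload.
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(or));
    }

    /** Bytes for one frame: a 1-byte header plus len * bitFrame bits of payload. */
    static int frameBytes(int[] values, int from, int len) {
        // len is a multiple of 8, so the payload always ends on a byte boundary.
        return 1 + (len * bitFrame(values, from, len)) / 8;
    }

    /** Cheapest cost of the range when it may be kept whole or split recursively into halves down to 8 values. */
    static int bestBytes(int[] values, int from, int len) {
        int whole = frameBytes(values, from, len);
        if (len == 8) {
            return whole;
        }
        int half = len / 2;
        int split = bestBytes(values, from, half) + bestBytes(values, from + half, half);
        return Math.min(whole, split);
    }
}
```

For a block of 32 values where one value needs 21 bits and the rest need 1 bit, a single frame costs 85 bytes in this model, while isolating the outlier in a frame of 8 brings the block down to 27 bytes. That is the sensitivity to outliers that variable-length frames are meant to reduce.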
[jira] Commented: (LUCENE-2887) Remove/deprecate IndexReader.undeleteAll
[ https://issues.apache.org/jira/browse/LUCENE-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986364#action_12986364 ]

Doron Cohen commented on LUCENE-2887:

I think it is correct to say that if the result of ir.numDeletedDocs() is N, then calling ir.undeleteAll() will delete exactly N documents... or am I missing it?

Because if a merge was invoked for the segments seen by this reader, I see two options:
# A merge is on going, or the merge is done but uncommitted yet. This means that an index writer has a lock on the index, hence ir.undeleteAll() will fail to get the lock.
# The a merge was already committed. This means that the index reader will fail to get write permission for being Stale.

So I think this method behaves deterministically - perhaps its jdoc should say something like:
*Undeletes all #numDeletedDocs() documents currently marked as deleted in this index.* ?

Remove/deprecate IndexReader.undeleteAll

Key: LUCENE-2887
URL: https://issues.apache.org/jira/browse/LUCENE-2887
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.1

This API is rather dangerous in that it's best effort since it can only un-delete docs that have not yet been merged away, or, dropped (as of LUCENE-2010). Given that it exposes impl details of how Lucene prunes deleted docs, I think we should remove this API. Are there legitimate use cases?
[jira] Issue Comment Edited: (LUCENE-2887) Remove/deprecate IndexReader.undeleteAll
[ https://issues.apache.org/jira/browse/LUCENE-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986364#action_12986364 ]

Doron Cohen edited comment on LUCENE-2887 at 1/25/11 7:49 AM:

I think it is correct to say that if the result of ir.numDeletedDocs() is *N*, then calling ir.undeleteAll() will undelete exactly *N* documents... or am I missing it?

Because if a merge was invoked for the segments seen by this reader, I see two options:
# A merge is on going, or the merge is done but uncommitted yet. This means that an index writer has a lock on the index, hence ir.undeleteAll() will fail to get the lock.
# The merge was already committed. This means that the index reader will fail to get write permission for being Stale.

So I think this method behaves deterministically - perhaps its jdoc should say something like:
*Undeletes all #numDeletedDocs() documents currently marked as deleted in this index.* ?

was (Author: doronc):

I think it is correct to say that if the result of ir.numDeletedDocs() is N, then calling ir.undeleteAll() will delete exactly N documents... or am I missing it?

Because if a merge was invoked for the segments seen by this reader, I see two options:
# A merge is on going, or the merge is done but uncommitted yet. This means that an index writer has a lock on the index, hence ir.undeleteAll() will fail to get the lock.
# The a merge was already committed. This means that the index reader will fail to get write permission for being Stale.

So I think this method behaves deterministically - perhaps its jdoc should say something like:
*Undeletes all #numDeletedDocs() documents currently marked as deleted in this index.* ?

Remove/deprecate IndexReader.undeleteAll

Key: LUCENE-2887
URL: https://issues.apache.org/jira/browse/LUCENE-2887
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.1

This API is rather dangerous in that it's best effort since it can only un-delete docs that have not yet been merged away, or, dropped (as of LUCENE-2010). Given that it exposes impl details of how Lucene prunes deleted docs, I think we should remove this API. Are there legitimate use cases?
[jira] Closed: (LUCENE-942) TopDocCollector.topDocs throws ArrayIndexOutOfBoundsException when called twice
[ https://issues.apache.org/jira/browse/LUCENE-942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera closed LUCENE-942.

Resolution: Not A Problem

TopDocsCollector documents that you cannot call topDocs() more than once for each search execution.

TopDocCollector.topDocs throws ArrayIndexOutOfBoundsException when called twice

Key: LUCENE-942
URL: https://issues.apache.org/jira/browse/LUCENE-942
Project: Lucene - Java
Issue Type: Bug
Components: Search
Affects Versions: 2.2
Reporter: Aaron Isotton
Priority: Minor

Here's the implementation of TopDocCollector.topDocs():

    public TopDocs topDocs() {
      ScoreDoc[] scoreDocs = new ScoreDoc[hq.size()];
      for (int i = hq.size()-1; i >= 0; i--)      // put docs in array
        scoreDocs[i] = (ScoreDoc)hq.pop();
      float maxScore = (totalHits==0) ? Float.NEGATIVE_INFINITY : scoreDocs[0].score;
      return new TopDocs(totalHits, scoreDocs, maxScore);
    }

When you call topDocs(), hq gets emptied. Thus the second time you call it scoreDocs.length will be 0 and scoreDocs[0] will throw an ArrayIndexOutOfBoundsException.

I don't know whether this 'call only once' semantics is intended behavior or not; if not, it should be fixed, if yes it should be documented.

Thanks a lot for an absolutely fantastic product,
Aaron
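The failure mode Aaron describes can be reproduced without Lucene. Below is a minimal sketch, assuming a plain java.util.PriorityQueue of scores in place of Lucene's hit queue; DrainOnceSketch and its method names are hypothetical. It drains the queue the same way, but guards the maximum-score lookup on the array length instead of on the total hit count, so a second call returns an empty result rather than throwing.

```java
import java.util.PriorityQueue;

class DrainOnceSketch {
    // Min-heap: the lowest score is polled first, as with a hit queue.
    private final PriorityQueue<Float> hq = new PriorityQueue<>();

    void collect(float score) {
        hq.add(score);
    }

    /** Drains the queue into an array ordered best-first; the queue is empty afterwards. */
    float[] topScores() {
        float[] scores = new float[hq.size()];
        for (int i = hq.size() - 1; i >= 0; i--) {
            scores[i] = hq.poll();
        }
        return scores;
    }

    /** Guarding on the array length avoids indexing into an empty array on a repeated call. */
    static float maxScore(float[] scores) {
        return scores.length == 0 ? Float.NEGATIVE_INFINITY : scores[0];
    }
}
```

Whether a repeated call should return an empty result, return a cached result, or stay documented as unsupported is a design choice; the issue was closed on the basis that the one-call restriction is documented.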
[jira] Closed: (LUCENE-423) thread pool implementation of parallel queries
[ https://issues.apache.org/jira/browse/LUCENE-423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir closed LUCENE-423.

Resolution: Fixed
Fix Version/s: 3.1
Assignee: (was: Lucene Developers)

You can provide an ExecutorService now, so I think this one is resolved.

thread pool implementation of parallel queries

Key: LUCENE-423
URL: https://issues.apache.org/jira/browse/LUCENE-423
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 1.4
Environment: Operating System: other
Platform: Other
Reporter: Randy Puttick
Priority: Minor
Fix For: 3.1
Attachments: ConcurrentMultiSearcher.java

This component is a replacement for ParallelMultiQuery that runs a thread pool with queue instead of starting threads for every query execution (so its performance is better).
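The difference between starting a thread per query and reusing a pool can be sketched with java.util.concurrent alone. This is an illustration of the general technique, not the attached ConcurrentMultiSearcher and not Lucene's own code; PooledSearchSketch, searchAll and demo are hypothetical names, and each Callable stands in for searching one sub-index and returning its hit count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PooledSearchSketch {
    /** Runs every sub-search on the shared pool and sums the hit counts. */
    static int searchAll(List<Callable<Integer>> subSearches, ExecutorService pool) {
        try {
            int total = 0;
            // invokeAll queues the tasks on existing worker threads and waits for all of them.
            for (Future<Integer> hits : pool.invokeAll(subSearches)) {
                total += hits.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    /** Three stand-in sub-searches returning 1, 2 and 3 hits, run on a pool of two threads. */
    static int demo() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Callable<Integer>> subSearches = new ArrayList<>();
            for (int hits = 1; hits <= 3; hits++) {
                final int h = hits;
                subSearches.add(() -> h);
            }
            return searchAll(subSearches, pool);
        } finally {
            pool.shutdown();
        }
    }
}
```

The pool is created once and reused across calls to searchAll, which is where the saving over per-query thread creation comes from.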
[jira] Resolved: (LUCENE-522) SpanFuzzyQuery
[ https://issues.apache.org/jira/browse/LUCENE-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-522.

Resolution: Duplicate

This is fixed in LUCENE-2754, you can use any multitermquery in spans.

SpanFuzzyQuery

Key: LUCENE-522
URL: https://issues.apache.org/jira/browse/LUCENE-522
Project: Lucene - Java
Issue Type: New Feature
Components: Search
Affects Versions: 1.9
Reporter: Karl Wettin
Priority: Minor

This is my SpanFuzzyQuery. It is released under the Apache license. Just paste it in.

    package se.snigel.lucene;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;
    import org.apache.lucene.search.spans.SpanOrQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;
    import org.apache.lucene.search.spans.Spans;

    import java.io.IOException;
    import java.util.Collection;
    import java.util.LinkedList;

    /**
     * @author Karl Wettin ka...@snigel.net
     */
    public class SpanFuzzyQuery extends SpanQuery {

        public final static float defaultMinSimilarity = 0.7f;
        public final static int defaultPrefixLength = 0;

        private final Term term;
        private final float minimumSimilarity;
        private final int prefixLength;

        private BooleanQuery rewrittenFuzzyQuery;

        public SpanFuzzyQuery(Term term) {
            this(term, defaultMinSimilarity, defaultPrefixLength);
        }

        public SpanFuzzyQuery(Term term, float minimumSimilarity, int prefixLength) {
            this.term = term;
            this.minimumSimilarity = minimumSimilarity;
            this.prefixLength = prefixLength;

            if (minimumSimilarity >= 1.0f) {
                throw new IllegalArgumentException("minimumSimilarity >= 1");
            } else if (minimumSimilarity < 0.0f) {
                throw new IllegalArgumentException("minimumSimilarity < 0");
            }
            if (prefixLength < 0) {
                throw new IllegalArgumentException("prefixLength < 0");
            }
        }

        public Query rewrite(IndexReader reader) throws IOException {
            FuzzyQuery fuzzyQuery = new FuzzyQuery(term, minimumSimilarity, prefixLength);
            rewrittenFuzzyQuery = (BooleanQuery) fuzzyQuery.rewrite(reader);
            BooleanClause[] clauses = rewrittenFuzzyQuery.getClauses();
            SpanQuery[] spanQueries = new SpanQuery[clauses.length];
            for (int i = 0; i < clauses.length; i++) {
                BooleanClause clause = clauses[i];
                TermQuery termQuery = (TermQuery) clause.getQuery();
                spanQueries[i] = new SpanTermQuery(termQuery.getTerm());
                spanQueries[i].setBoost(termQuery.getBoost());
            }
            SpanOrQuery query = new SpanOrQuery(spanQueries);
            query.setBoost(fuzzyQuery.getBoost());
            return query;
        }

        /** Expert: Returns the matches for this query in an index. Used internally
         * to search for spans. */
        public Spans getSpans(IndexReader reader) throws IOException {
            throw new UnsupportedOperationException("Query should have been rewritten");
        }

        /** Returns the name of the field matched by this query.*/
        public String getField() {
            return term.field();
        }

        /** Returns a collection of all terms matched by this query.*/
        public Collection getTerms() {
            if (rewrittenFuzzyQuery == null) {
                throw new RuntimeException("Query must be rewritten prior to calling getTerms()!");
            } else {
                LinkedList<Term> terms = new LinkedList<Term>();
                BooleanClause[] clauses = rewrittenFuzzyQuery.getClauses();
                for (int i = 0; i < clauses.length; i++) {
                    BooleanClause clause = clauses[i];
                    TermQuery termQuery = (TermQuery) clause.getQuery();
                    terms.add(termQuery.getTerm());
                }
                return terms;
            }
        }

        /** Prints a query to a string, with <code>field</code> as the default field
         * for terms. <p>The representation used is one that is supposed to be readable
         * by {@link org.apache.lucene.queryParser.QueryParser QueryParser}. However,
         * there are the following limitations:
         * <ul>
         * <li>If the query was created by the parser, the printed
         * representation may not be exactly what was parsed. For example,
         * characters that need to be escaped will be represented without
         * the required backslash.</li>
         * <li>Some of the more complicated queries (e.g. span queries)
         * don't have a representation that can be parsed by QueryParser.</li>
         * </ul>
         */
        public String toString(String field) {
            return "spans(" + rewrittenFuzzyQuery.toString() + ")";
        }
    }
[jira] Resolved: (LUCENE-538) Using WildcardQuery with MultiSearcher, and Boolean MUST_NOT clause
[ https://issues.apache.org/jira/browse/LUCENE-538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-538.

Resolution: Fixed
Fix Version/s: 3.1

This is now fixed by Mike's cleanup to MultiSearcher etc, which fixes this combine/rewrite bug.

Using WildcardQuery with MultiSearcher, and Boolean MUST_NOT clause

Key: LUCENE-538
URL: https://issues.apache.org/jira/browse/LUCENE-538
Project: Lucene - Java
Issue Type: Bug
Components: Search
Affects Versions: 1.9
Environment: Ubuntu Linux, java version 1.5.0_04
Reporter: Helen Warren
Priority: Minor
Fix For: 3.1
Attachments: TestMultiSearchWildCard.java

We are searching across multiple indices using a MultiSearcher. There seems to be a problem when we use a WildcardQuery to exclude documents from the result set. I attach a set of unit tests illustrating the problem.

In these tests, we have two indices. Each index contains a set of documents with fields for 'title', 'section' and 'index'. The final aim is to do a keyword search, across both indices, on the title field and be able to exclude documents from certain sections (and their subsections) using a WildcardQuery on the section field.

e.g. return documents from both indices which have the string 'xyzpqr' in their title but which do not lie in the news section or its subsections (section = /news/*).

The first unit test (testExcludeSectionsWildCard) fails trying to do this. If we relax any of the constraints made above, tests pass:

* Don't use WildcardQuery, but pass in the news section and its child section to exclude explicitly (testExcludeSectionsExplicit).
* Exclude results from just one section, not its children too, i.e. don't use WildcardQuery (testExcludeSingleSection).
* Do use WildcardQuery, and exclude a section and its children, but just use one index, thereby using the simple IndexReader and IndexSearcher objects (testExcludeSectionsOneIndex).
* Try the boolean MUST clause rather than MUST_NOT using the WildcardQuery, i.e. only include results from the /news/ section and its children.
[jira] Assigned: (LUCENE-1250) Some equals methods do not check for null argument
[ https://issues.apache.org/jira/browse/LUCENE-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera reassigned LUCENE-1250: -- Assignee: Shai Erera Some equals methods do not check for null argument -- Key: LUCENE-1250 URL: https://issues.apache.org/jira/browse/LUCENE-1250 Project: Lucene - Java Issue Type: Bug Components: Index, Search Reporter: David Dillard Assignee: Shai Erera Priority: Minor Fix For: 3.1, 4.0 The equals methods in the following classes do not check for a null argument and thus would incorrectly fail with a null pointer exception if passed null: - org.apache.lucene.index.SegmentInfo - org.apache.lucene.search.function.CustomScoreQuery - org.apache.lucene.search.function.OrdFieldSource - org.apache.lucene.search.function.ReverseOrdFieldSource - org.apache.lucene.search.function.ValueSourceQuery If a null parameter is passed to equals() then false should be returned. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
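The contract requested in LUCENE-1250 (return false, rather than throw, when equals() receives null) can be sketched as below. This is only an illustration under stated assumptions: FieldSourceSketch is a hypothetical stand-in class with a single non-null field name, not the actual OrdFieldSource or ReverseOrdFieldSource code, and it does not show what was committed for this issue.

```java
// Hypothetical stand-in for a value-source class; NOT Lucene's OrdFieldSource.
final class FieldSourceSketch {
    // Assumed to be non-null for the purposes of this sketch.
    private final String field;

    FieldSourceSketch(String field) {
        this.field = field;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        // A null argument (or an object of another class) yields false
        // instead of a NullPointerException or ClassCastException.
        if (o == null || getClass() != o.getClass()) {
            return false;
        }
        FieldSourceSketch other = (FieldSourceSketch) o;
        return field.equals(other.field);
    }

    @Override
    public int hashCode() {
        // Kept consistent with equals(): equal fields give equal hash codes.
        return field.hashCode();
    }
}
```

The null test comes before any cast or field access, so a caller that passes null gets false back without the method dereferencing the argument.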
[jira] Updated: (LUCENE-1250) Some equals methods do not check for null argument
[ https://issues.apache.org/jira/browse/LUCENE-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1250: --- Lucene Fields: [New, Patch Available] (was: [New]) Affects Version/s: (was: 2.3.2) (was: 2.3.1) Fix Version/s: 4.0 3.1 This is now only applicable to OrdFieldSource and ReverseOrdFieldSource. I'll fix both of them. Some equals methods do not check for null argument -- Key: LUCENE-1250 URL: https://issues.apache.org/jira/browse/LUCENE-1250 Project: Lucene - Java Issue Type: Bug Components: Index, Search Reporter: David Dillard Assignee: Shai Erera Priority: Minor Fix For: 3.1, 4.0 The equals methods in the following classes do not check for a null argument and thus would incorrectly fail with a null pointer exception if passed null: - org.apache.lucene.index.SegmentInfo - org.apache.lucene.search.function.CustomScoreQuery - org.apache.lucene.search.function.OrdFieldSource - org.apache.lucene.search.function.ReverseOrdFieldSource - org.apache.lucene.search.function.ValueSourceQuery If a null parameter is passed to equals() then false should be returned. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-901) DefaultSimilarity.queryNorm() should never return Infinity
[ https://issues.apache.org/jira/browse/LUCENE-901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-901. Resolution: Fixed Fix Version/s: 3.1 This one is fixed (there is a Nan/Inf check in queryNorm added fairly recently) DefaultSimilarity.queryNorm() should never return Infinity -- Key: LUCENE-901 URL: https://issues.apache.org/jira/browse/LUCENE-901 Project: Lucene - Java Issue Type: Bug Components: Search Reporter: Michael Busch Priority: Trivial Fix For: 3.1 Currently DefaultSimilarity.queryNorm() returns Infinity if sumOfSquaredWeights=0. This can result in a score of NaN (e. g. in TermScorer) if boost=0.0f. A simple fix would be to return 1.0f in case zero is passed in. See LUCENE-698 for discussions about this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
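The arithmetic behind LUCENE-901 can be reproduced without Lucene. The sketch below assumes queryNorm is computed as 1/sqrt(sumOfSquaredWeights), as the description implies; QueryNormSketch and its two methods are hypothetical names, and the guarded variant simply combines the 1.0f fallback suggested in the issue with a NaN/Inf test. It is not the code that was committed.

```java
// Hypothetical illustration of the queryNorm problem; not Lucene's DefaultSimilarity.
final class QueryNormSketch {

    // Unguarded form: a zero sum of squared weights gives 1/0 = Infinity.
    static float unguarded(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    // Guarded form: fall back to 1.0f whenever the result is not a finite number.
    static float guarded(float sumOfSquaredWeights) {
        float norm = (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
        if (Float.isInfinite(norm) || Float.isNaN(norm)) {
            return 1.0f;
        }
        return norm;
    }

    public static void main(String[] args) {
        float boost = 0.0f;
        // Infinity * 0.0f is NaN, which is how a NaN score can arise downstream.
        System.out.println(unguarded(0.0f) * boost);
        System.out.println(guarded(0.0f) * boost);
    }
}
```

With a zero boost, the unguarded norm multiplies out to NaN, while the guarded norm yields a score of 0.0.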
[jira] Resolved: (LUCENE-1148) Create a new sub-class of SpanQuery to enable use of a RangeQuery within a SpanQuery
[ https://issues.apache.org/jira/browse/LUCENE-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-1148. - Resolution: Fixed Fix Version/s: 3.1 This one is fixed by LUCENE-2754, you can just wrap a RangeQuery (or any other MultiTermQuery) as a SpanQuery Create a new sub-class of SpanQuery to enable use of a RangeQuery within a SpanQuery Key: LUCENE-1148 URL: https://issues.apache.org/jira/browse/LUCENE-1148 Project: Lucene - Java Issue Type: New Feature Components: Search Affects Versions: 2.4 Reporter: Michael Goddard Priority: Minor Fix For: 3.1 Attachments: span_range_query_01.24.2008.patch Original Estimate: 1h Remaining Estimate: 1h Our users express queries using a syntax which enables them to embed various query types within SpanQuery instances. One feature they've been asking for is the ability to embed a numeric range query so they could, for example, find documents matching [2.0 2.75]MHz. The attached patch adds the capability and I hope others will find it useful. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-522) SpanFuzzyQuery
[ https://issues.apache.org/jira/browse/LUCENE-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-522:
Fix Version/s: 3.1

SpanFuzzyQuery
Key: LUCENE-522
URL: https://issues.apache.org/jira/browse/LUCENE-522
Project: Lucene - Java
Issue Type: New Feature
Components: Search
Affects Versions: 1.9
Reporter: Karl Wettin
Priority: Minor
Fix For: 3.1

This is my SpanFuzzyQuery. It is released under the Apache license. Just paste it in.

package se.snigel.lucene;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

import java.io.IOException;
import java.util.Collection;
import java.util.LinkedList;

/**
 * @author Karl Wettin ka...@snigel.net
 */
public class SpanFuzzyQuery extends SpanQuery {

    public final static float defaultMinSimilarity = 0.7f;
    public final static int defaultPrefixLength = 0;

    private final Term term;
    private final float minimumSimilarity;
    private final int prefixLength;

    private BooleanQuery rewrittenFuzzyQuery;

    public SpanFuzzyQuery(Term term) {
        this(term, defaultMinSimilarity, defaultPrefixLength);
    }

    public SpanFuzzyQuery(Term term, float minimumSimilarity, int prefixLength) {
        this.term = term;
        this.minimumSimilarity = minimumSimilarity;
        this.prefixLength = prefixLength;

        if (minimumSimilarity >= 1.0f) {
            throw new IllegalArgumentException("minimumSimilarity >= 1");
        } else if (minimumSimilarity < 0.0f) {
            throw new IllegalArgumentException("minimumSimilarity < 0");
        }
        if (prefixLength < 0) {
            throw new IllegalArgumentException("prefixLength < 0");
        }
    }

    public Query rewrite(IndexReader reader) throws IOException {
        FuzzyQuery fuzzyQuery = new FuzzyQuery(term, minimumSimilarity, prefixLength);
        rewrittenFuzzyQuery = (BooleanQuery) fuzzyQuery.rewrite(reader);

        BooleanClause[] clauses = rewrittenFuzzyQuery.getClauses();
        SpanQuery[] spanQueries = new SpanQuery[clauses.length];
        for (int i = 0; i < clauses.length; i++) {
            BooleanClause clause = clauses[i];
            TermQuery termQuery = (TermQuery) clause.getQuery();
            spanQueries[i] = new SpanTermQuery(termQuery.getTerm());
            spanQueries[i].setBoost(termQuery.getBoost());
        }

        SpanOrQuery query = new SpanOrQuery(spanQueries);
        query.setBoost(fuzzyQuery.getBoost());
        return query;
    }

    /** Expert: Returns the matches for this query in an index. Used internally
     * to search for spans. */
    public Spans getSpans(IndexReader reader) throws IOException {
        throw new UnsupportedOperationException("Query should have been rewritten");
    }

    /** Returns the name of the field matched by this query.*/
    public String getField() {
        return term.field();
    }

    /** Returns a collection of all terms matched by this query.*/
    public Collection getTerms() {
        if (rewrittenFuzzyQuery == null) {
            throw new RuntimeException("Query must be rewritten prior to calling getTerms()!");
        } else {
            LinkedList<Term> terms = new LinkedList<Term>();
            BooleanClause[] clauses = rewrittenFuzzyQuery.getClauses();
            for (int i = 0; i < clauses.length; i++) {
                BooleanClause clause = clauses[i];
                TermQuery termQuery = (TermQuery) clause.getQuery();
                terms.add(termQuery.getTerm());
            }
            return terms;
        }
    }

    /** Prints a query to a string, with <code>field</code> as the default field
     * for terms. <p>The representation used is one that is supposed to be readable
     * by {@link org.apache.lucene.queryParser.QueryParser QueryParser}. However,
     * there are the following limitations:
     * <ul>
     * <li>If the query was created by the parser, the printed
     * representation may not be exactly what was parsed. For example,
     * characters that need to be escaped will be represented without
     * the required backslash.</li>
     * <li>Some of the more complicated queries (e.g. span queries)
     * don't have a representation that can be parsed by QueryParser.</li>
     * </ul>
     */
    public String toString(String field) {
        return "spans(" + rewrittenFuzzyQuery.toString() + ")";
    }
}

[jira] Resolved: (LUCENE-943) ComparatorKey in Locale based sorting
[ https://issues.apache.org/jira/browse/LUCENE-943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-943. Resolution: Fixed This one is available as CollationKeyAnalyzer/ICUCollationKeyAnalyzer ComparatorKey in Locale based sorting - Key: LUCENE-943 URL: https://issues.apache.org/jira/browse/LUCENE-943 Project: Lucene - Java Issue Type: New Feature Components: Search Reporter: Ronnie Kolehmainen Priority: Minor Attachments: LocaleBasedSortComparator.diff This is a reply/follow-up on Chris Hostetter's message on Lucene developers list (aug 2006): http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200608.mbox/%3cpine.lnx.4.58.0608211050330.5...@hal.rescomp.berkeley.edu%3e perhaps it would be worthwhile for comparatorStringLocale to convert the String[] it gets back from FieldCache.DEFAULT.getStrings to a new CollationKey[]? or maybe even for FieldCache.DEFAULT.getStrings to be deprecated, and replaced with a FieldCache.DEFAULT.getCollationKeys(reader,field,Collator)? I think the best is to keep the default behavior as it is today. There is a cost of building caches for sort fields which I think not everyone wants. However for some international production environments there are indeed possible performance gains in comparing precalculated keys instead of comparing strings with rulebased collators. Since Lucene's Sort architecture is pluggable it is easy to create a custom locale-based comparator, which utilizes the built-in caching/warming mechanism of FieldCache, and may be used in SortField constructor. I'm not sure whether there should be classes for this in Lucene core or not, but it could be nice to have the option of performance vs. memory consumption in localized sorting without having to use additional jars. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
[jira] Commented: (LUCENE-1360) A Similarity class which has unique length norms for numTerms <= 10
[ https://issues.apache.org/jira/browse/LUCENE-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986378#action_12986378 ]

Robert Muir commented on LUCENE-1360:

Now that we have custom norm encoders, is this one obsolete? You can just use SmallFloat.floatToByte52 to enc/dec your norms?

A Similarity class which has unique length norms for numTerms <= 10
Key: LUCENE-1360
URL: https://issues.apache.org/jira/browse/LUCENE-1360
Project: Lucene - Java
Issue Type: Improvement
Components: Query/Scoring
Reporter: Sean Timm
Assignee: Otis Gospodnetic
Priority: Trivial
Attachments: LUCENE-1380 visualization.pdf, ShortFieldNormSimilarity.java

A Similarity class which extends DefaultSimilarity and simply overrides lengthNorm. lengthNorm is implemented as a lookup for numTerms <= 10, else as {{1/sqrt(numTerms)}}. This prevents term counts below 11 from having the same lengthNorm after being stored as a single byte in the index. This is useful if your search is only on short fields such as titles or product descriptions.

See mailing list discussion: http://www.nabble.com/How-to-boost-the-score-higher-in-case-user-query-matches-entire-field-value-than-just-some-words-within-a-field-td19079221.html
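The lookup-then-formula shape described for lengthNorm in LUCENE-1360 can be sketched roughly as follows. Everything here is an assumption made for illustration: ShortFieldNormSketch is a hypothetical class, the table entries are evenly spaced placeholders rather than the values in the attached ShortFieldNormSimilarity.java, and whether they stay distinct after one-byte norm encoding depends on the encoder and is not checked here.

```java
// Hypothetical sketch of a lookup-based lengthNorm; NOT the attached ShortFieldNormSimilarity.
final class ShortFieldNormSketch {

    // Placeholder norms for numTerms = 1..10 (index 0 is unused).
    // The values only need to be strictly decreasing and to stay above 1/sqrt(11).
    private static final float[] SHORT_NORMS = {
        Float.NaN, 1.00f, 0.93f, 0.86f, 0.79f, 0.72f, 0.65f, 0.58f, 0.51f, 0.44f, 0.37f
    };

    static float lengthNorm(int numTerms) {
        if (numTerms < 1) {
            // Assumption of this sketch: treat empty fields like one-term fields.
            numTerms = 1;
        }
        if (numTerms <= 10) {
            return SHORT_NORMS[numTerms];
        }
        // Longer fields fall back to the usual 1/sqrt(numTerms).
        return (float) (1.0 / Math.sqrt(numTerms));
    }
}
```

Keeping the short-field values in an explicit table makes it possible to pick norms that a given byte encoding can tell apart, which the 1/sqrt(numTerms) curve alone does not guarantee.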
[jira] Resolved: (LUCENE-1250) Some equals methods do not check for null argument
[ https://issues.apache.org/jira/browse/LUCENE-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera resolved LUCENE-1250. Resolution: Fixed Committed revision 1063271 (3x). Committed revision 1063272 (trunk). Thanks David ! Some equals methods do not check for null argument -- Key: LUCENE-1250 URL: https://issues.apache.org/jira/browse/LUCENE-1250 Project: Lucene - Java Issue Type: Bug Components: Index, Search Reporter: David Dillard Assignee: Shai Erera Priority: Minor Fix For: 3.1, 4.0 The equals methods in the following classes do not check for a null argument and thus would incorrectly fail with a null pointer exception if passed null: - org.apache.lucene.index.SegmentInfo - org.apache.lucene.search.function.CustomScoreQuery - org.apache.lucene.search.function.OrdFieldSource - org.apache.lucene.search.function.ReverseOrdFieldSource - org.apache.lucene.search.function.ValueSourceQuery If a null parameter is passed to equals() then false should be returned. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1165) Reduce exposure of nightly build documentation
[ https://issues.apache.org/jira/browse/LUCENE-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986385#action_12986385 ] Uwe Schindler commented on LUCENE-1165: --- This was once fixed by adding a robots.txt to Hudson. But since move of Hudson to new machines this is an issue again. Reduce exposure of nightly build documentation -- Key: LUCENE-1165 URL: https://issues.apache.org/jira/browse/LUCENE-1165 Project: Lucene - Java Issue Type: Improvement Components: Website Reporter: Doron Cohen Assignee: Uwe Schindler Priority: Minor From LUCENE-1157 - ..the nightly build documentation is too prominent. A search for indexwriter api on Google or Yahoo! returns nightly documentation before released documentation. (https://issues.apache.org/jira/browse/LUCENE-1157?focusedCommentId=12565820#action_12565820) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1165) Reduce exposure of nightly build documentation
[ https://issues.apache.org/jira/browse/LUCENE-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1165: -- Component/s: (was: Javadocs) Website Assignee: Uwe Schindler Reduce exposure of nightly build documentation -- Key: LUCENE-1165 URL: https://issues.apache.org/jira/browse/LUCENE-1165 Project: Lucene - Java Issue Type: Improvement Components: Website Reporter: Doron Cohen Assignee: Uwe Schindler Priority: Minor From LUCENE-1157 - ..the nightly build documentation is too prominent. A search for indexwriter api on Google or Yahoo! returns nightly documentation before released documentation. (https://issues.apache.org/jira/browse/LUCENE-1157?focusedCommentId=12565820#action_12565820) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Closed: (LUCENE-83) ESCAPING BUG \(abc\) and \(a*c\) in v1.2
[ https://issues.apache.org/jira/browse/LUCENE-83?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera closed LUCENE-83.
Resolution: Not A Problem
Assignee: (was: Lucene Developers)

I verified on both 3x and trunk that queries like \\(a?c\\) and \\(a*c\\) work (return the correct result). I guess it was a problem fixed already in QP at some point.

ESCAPING BUG \(abc\) and \(a*c\) in v1.2
Key: LUCENE-83
URL: https://issues.apache.org/jira/browse/LUCENE-83
Project: Lucene - Java
Issue Type: Bug
Components: QueryParser
Affects Versions: 1.2
Environment: Operating System: Windows XP Platform: All
Reporter: Lukas Zapletal
Priority: Minor

PLEASE TEST THIS CODE:

import junit.framework.*;
import org.apache.lucene.index.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.store.*;
import org.apache.lucene.document.*;
import org.apache.lucene.search.*;
import org.apache.lucene.queryParser.*;

/**
 * Escape bug (now with same analyzers). By l...@root.cz.
 * Here is the description:
 *
 * When searching for \(abc\) everything is ok. But let`s search for: \(a?c\)
 * YES! Nothing found! It`s same with \" and maybe other escaped characters.
 *
 * User: Lukas Zapletal
 * Date: Feb 1, 2003
 *
 * JUnit test case follows:
 */
public class juEscapeBug extends TestCase {
    Directory dir = new RAMDirectory();
    String testText = "This is a test. (abc) Is there a bug OR not? \"Question\"!";

    public juEscapeBug(String tn) {
        super(tn);
    }

    protected void setUp() throws Exception {
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Text("contents", testText));
        writer.addDocument(doc);
        writer.optimize();
        writer.close();
    }

    private boolean doQuery(String queryString) throws Exception {
        Searcher searcher = new IndexSearcher(dir);
        Analyzer analyzer = new StandardAnalyzer();
        Query query = QueryParser.parse(queryString, "contents", analyzer);
        Hits hits = searcher.search(query);
        searcher.close();
        return (hits.length() == 1);
    }

    public void testBugOk1() throws Exception {
        assertTrue(doQuery("Test"));
    }

    public void testBugOk2() throws Exception {
        assertFalse(doQuery("This is not there"));
    }

    public void testBugOk3() throws Exception {
        assertTrue(doQuery("abc"));
    }

    public void testBugOk4() throws Exception {
        assertTrue(doQuery("\\(abc\\)"));
    }

    public void testBugHere1() throws Exception {
        assertTrue(doQuery("\\(a?c\\)")); // BUG HERE !!!
    }

    public void testBugHere2() throws Exception {
        assertTrue(doQuery("\\(a*\\)")); // BUG HERE !!!
    }

    public void testBugHere3() throws Exception {
        assertTrue(doQuery("\\\"qu*on\\\"")); // BUG HERE !!!
    }
}
[jira] Closed: (LUCENE-507) CLONE -[PATCH] remove unused variables
[ https://issues.apache.org/jira/browse/LUCENE-507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera closed LUCENE-507.
Resolution: Not A Problem
Assignee: (was: Lucene Developers)

This is not a problem. First, many of the mentions in the patch file are no longer relevant, b/c this issue is old. Second, we're doing this sort of cleanup from time to time, and those unused variables will keep popping in, and we'll keep cleaning them. So I see no reason to keep this issue open anymore.

CLONE -[PATCH] remove unused variables
Key: LUCENE-507
URL: https://issues.apache.org/jira/browse/LUCENE-507
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: unspecified
Environment: Operating System: other Platform: Other
Reporter: Steven Tamm
Priority: Minor
Attachments: Unused.patch

Seems I'm the only person who has the unused variable warning turned on in Eclipse :-) This patch removes those unused variables and imports (for now only in the search package). This doesn't introduce changes in functionality, but it should be reviewed anyway: there might be cases where the variables *should* be used, but they are not because of a bug.
[jira] Closed: (LUCENE-1074) Workaround in Searcher.java for gcj bug#15411 no longer needed
[ https://issues.apache.org/jira/browse/LUCENE-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera closed LUCENE-1074. -- Resolution: Not A Problem Searcher is removed from trunk and deprecated in 3x. Also, I see the comment was removed from 3x, and the methods are still there. Given that this class is going away, and that this issue is way too old, I'll close it. Workaround in Searcher.java for gcj bug#15411 no longer needed -- Key: LUCENE-1074 URL: https://issues.apache.org/jira/browse/LUCENE-1074 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Paul Elschot Priority: Minor Attachments: LUCENE-1074.patch This gcj bug has meanwhile been fixed, see: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15411 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1165) Reduce exposure of nightly build documentation
[ https://issues.apache.org/jira/browse/LUCENE-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986396#action_12986396 ] Uwe Schindler commented on LUCENE-1165: --- I opened INFRA-3389. Reduce exposure of nightly build documentation -- Key: LUCENE-1165 URL: https://issues.apache.org/jira/browse/LUCENE-1165 Project: Lucene - Java Issue Type: Improvement Components: Website Reporter: Doron Cohen Assignee: Uwe Schindler Priority: Minor From LUCENE-1157 - ..the nightly build documentation is too prominent. A search for indexwriter api on Google or Yahoo! returns nightly documentation before released documentation. (https://issues.apache.org/jira/browse/LUCENE-1157?focusedCommentId=12565820#action_12565820) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1391) Token type and flags values get lost when using ShingleMatrixFilter
[ https://issues.apache.org/jira/browse/LUCENE-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1391: -- Affects Version/s: 2.9 3.0 This issue is still valid, ShingleMatrixFilter still sets its class name as type attribute for all tokens and resets flags to 0. Furthermore, ShingleMatrixFilter does not respect custom/new attributes at all (like KeywordAttribute). Token type and flags values get lost when using ShingleMatrixFilter --- Key: LUCENE-1391 URL: https://issues.apache.org/jira/browse/LUCENE-1391 Project: Lucene - Java Issue Type: Bug Components: contrib/analyzers Affects Versions: 2.4, 2.9, 3.0 Reporter: Wouter Heijke Assignee: Karl Wettin Fix For: 3.1, 4.0 While using the new ShingleMatrixFilter I noticed that a token's type and flags get lost while using this filter. ShingleFilter does respect these values like the other filters I know. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1391) Token type and flags values get lost when using ShingleMatrixFilter
[ https://issues.apache.org/jira/browse/LUCENE-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1391: -- Fix Version/s: 4.0 3.1 Token type and flags values get lost when using ShingleMatrixFilter --- Key: LUCENE-1391 URL: https://issues.apache.org/jira/browse/LUCENE-1391 Project: Lucene - Java Issue Type: Bug Components: contrib/analyzers Affects Versions: 2.4, 2.9, 3.0 Reporter: Wouter Heijke Assignee: Karl Wettin Fix For: 3.1, 4.0 While using the new ShingleMatrixFilter I noticed that a token's type and flags get lost while using this filter. ShingleFilter does respect these values like the other filters I know. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-1391) Token type and flags values get lost when using ShingleMatrixFilter
[ https://issues.apache.org/jira/browse/LUCENE-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reassigned LUCENE-1391: - Assignee: Uwe Schindler (was: Karl Wettin) Token type and flags values get lost when using ShingleMatrixFilter --- Key: LUCENE-1391 URL: https://issues.apache.org/jira/browse/LUCENE-1391 Project: Lucene - Java Issue Type: Bug Components: contrib/analyzers Affects Versions: 2.4, 2.9, 3.0 Reporter: Wouter Heijke Assignee: Uwe Schindler Fix For: 3.1, 4.0 While using the new ShingleMatrixFilter I noticed that a token's type and flags get lost while using this filter. ShingleFilter does respect these values like the other filters I know. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2326) Replication command indexversion fails to return index version
[ https://issues.apache.org/jira/browse/SOLR-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986402#action_12986402 ]

Eric Pugh commented on SOLR-2326:

So I did discover one odd thing. If I don't have a /update update requesthandler listed in the solrconfig.xml, then the commitPoint is ALWAYS null; it's almost like having that in the stack causes the commitPoint to be done.

My other datapoint, which I think is right but haven't verified, is that if you don't have replicate on startup set, then it *seems*, but I am not positive, to give that result.

One question I have is why is there that race condition? I mean, if command=details works, then shouldn't indexversion work the same, or raise an error, versus returning a rather unhelpful 0? Maybe just logging "no commitPoint found" would help.

Replication command indexversion fails to return index version
Key: SOLR-2326
URL: https://issues.apache.org/jira/browse/SOLR-2326
Project: Solr
Issue Type: Bug
Components: replication (java)
Environment: Branch 3x latest
Reporter: Eric Pugh
Assignee: Mark Miller
Fix For: 3.1

To test this, I took the /example/multicore/core0 solrconfig and added a simple replication handler:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">startup</str>
    <str name="confFiles">schema.xml</str>
  </lst>
</requestHandler>

When I query the handler for details I get back the indexVersion that I expect:
http://localhost:8983/solr/core0/replication?command=details&wt=json&indent=true

But when I ask for just the indexVersion I get back a 0, which prevents the slaves from pulling updates:
http://localhost:8983/solr/core0/replication?command=indexversion&wt=json&indent=true
[jira] Closed: (LUCENE-1263) NullPointerException in java.util.Hashtable from executing a Query
[ https://issues.apache.org/jira/browse/LUCENE-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera closed LUCENE-1263.
Resolution: Cannot Reproduce

This problem could not be reproduced, and the person reporting it has not provided any information on how to reproduce it since Nov-2008. Closing.

NullPointerException in java.util.Hashtable from executing a Query
Key: LUCENE-1263
URL: https://issues.apache.org/jira/browse/LUCENE-1263
Project: Lucene - Java
Issue Type: Bug
Components: Search
Affects Versions: 2.1
Reporter: Benjamin Pasero
Priority: Minor

Lately we are seeing this stacktrace showing up when executing a Query. Any ideas?

java.lang.NullPointerException
    at java.util.Hashtable.get(Hashtable.java:482)
    at org.apache.lucene.index.MultiReader.norms(MultiReader.java:167)
    at org.apache.lucene.search.spans.SpanWeight.scorer(SpanWeight.java:72)
    at org.apache.lucene.search.DisjunctionMaxQuery$DisjunctionMaxWeight.scorer(DisjunctionMaxQuery.java:131)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:130)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:100)
    at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:192)
    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:66)
    at org.apache.lucene.search.Hits.<init>(Hits.java:45)
    at org.apache.lucene.search.Searcher.search(Searcher.java:45)
    at org.apache.lucene.search.Searcher.search(Searcher.java:37)
[jira] Closed: (LUCENE-487) Database as a lucene index target
[ https://issues.apache.org/jira/browse/LUCENE-487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera closed LUCENE-487. - Resolution: Not A Problem Not active since 2006 and we already have DBDirectory. Database as a lucene index target - Key: LUCENE-487 URL: https://issues.apache.org/jira/browse/LUCENE-487 Project: Lucene - Java Issue Type: New Feature Components: Store Affects Versions: 1.9 Environment: MySql (version 4.1 an up), Oracle (version 8.1.7 and up) Reporter: Amir Kibbar Priority: Minor Attachments: files.zip I've written an extension for the Directory object called DBDirectory, that allows you to read and write a Lucene index to a database instead of a file system. This is done using blobs. Each blob represents a file. Also, each blob has a name which is equivalent to the filename and a prefix, which is equivalent to a directory on a file system. This allows you to create multiple Lucene indexes in a single database schema. The solution uses two tables: LUCENE_INDEX - which holds the index files as blobs LUCENE_LOCK - holds the different locks Attached is my proposed solution. This solution is still very basic, but it does the job. The solution supports Oracle and mysql To use this solution: 1. Place the files: - DBDirectory in src/java/org/apache/lucene/store - TestDBIndex in src/test/org/apache/lucene/index - objects-mysql.sql in src/db - objects-oracle.sql in src/db 2. Edit the parameters for the database connection in TestDBIndex 3. Create the database tables using the objects-mysql.sql script (assuming you're using mysql) 4. Build Lucene 5. Run TestDBIndex with the database driver in the classpath I've tested the solution on mysql, but it *should* work on Oracle, I will test that in a few days. Amir -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
[jira] Commented: (LUCENE-1263) NullPointerException in java.util.Hashtable from executing a Query
[ https://issues.apache.org/jira/browse/LUCENE-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986419#action_12986419 ] Uwe Schindler commented on LUCENE-1263: --- The issue is definitely fixed: - since 2.9 we do per-segment searches, so MultiReader's norms cache is no longer used - and even before 2.9, at some time we changed the Hashtable to a HashMap that allowed null keys and null values. NullPointerException in java.util.Hashtable from executing a Query -- Key: LUCENE-1263 URL: https://issues.apache.org/jira/browse/LUCENE-1263 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.1 Reporter: Benjamin Pasero Priority: Minor Lately we are seeing this stacktrace showing up when executing a Query. Any ideas? java.lang.NullPointerException at java.util.Hashtable.get(Hashtable.java:482) at org.apache.lucene.index.MultiReader.norms(MultiReader.java:167) at org.apache.lucene.search.spans.SpanWeight.scorer(SpanWeight.java:72) at org.apache.lucene.search.DisjunctionMaxQuery$DisjunctionMaxWeight.scorer(DisjunctionMaxQuery.java:131) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:130) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:100) at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:192) at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:66) at org.apache.lucene.search.Hits.<init>(Hits.java:45) at org.apache.lucene.search.Searcher.search(Searcher.java:45) at org.apache.lucene.search.Searcher.search(Searcher.java:37) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-856) Optimize segment merging
[ https://issues.apache.org/jira/browse/LUCENE-856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-856. --- Resolution: Not A Problem We've already made good improvements here, with stored fields term vectors being bulk merged. Postings are still costly to merge -- even on a fast machine I see merging CPU bound. It's possible a codec could bulk-copy the postings, if eg there are no (or, not too many) deletions. I think we can open separate issues in the future for that... Optimize segment merging Key: LUCENE-856 URL: https://issues.apache.org/jira/browse/LUCENE-856 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.1 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor With LUCENE-843, the time spent indexing documents has been substantially reduced and now the time spent merging is a sizable portion of indexing time. I ran a test using the patch for LUCENE-843, building an index of 10 million docs, each with ~5,500 byte plain text, with term vectors (positions + offsets) on and with 2 small stored fields per document. RAM buffer size was 32 MB. I didn't optimize the index in the end, though optimize speed would also improve if we optimize segment merging. Index size is 86 GB. Total time to build the index was 8 hrs 38 minutes, 5 hrs 40 minutes of which was spent merging. That's 65.6% of the time! Most of this time is presumably IO which probably can't be reduced much unless we improve overall merge policy and experiment with values for mergeFactor / buffer size. These tests were run on a Mac Pro with 2 dual-core Intel CPUs. The IO system is RAID 0 of 4 drives, so, these times are probably better than the more common case of a single hard drive which would likely be slower IO. I think there are some simple things we could do to speed up merging: * Experiment with buffer sizes -- maybe larger buffers for the IndexInputs used during merging could help? 
Because at a default mergeFactor of 10, the disk heads must do a lot of seeking back and forth between these 10 files (and then to the 11th file where we are writing). * Use byte copying when possible, eg if there are no deletions on a segment we can almost (I think?) just copy things like prox postings, stored fields, term vectors, instead of full parsing to Java objects and then re-serializing them. * Experiment with mergeFactor / different merge policies. For example I think LUCENE-854 would reduce time spent merging for a given index size. This is currently just a place to list ideas for optimizing segment merges. I don't plan on working on this until after LUCENE-843. Note that for autoCommit=false, this optimization is somewhat less important, depending on how often you actually close/open a new IndexWriter. In the extreme case, if you open a writer, add 100 MM docs, close the writer, then no segment merges happen at all. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Closed: (LUCENE-401) [PATCH] fixes for gcj target.
[ https://issues.apache.org/jira/browse/LUCENE-401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler closed LUCENE-401. Resolution: Fixed Assignee: (was: Lucene Developers) Closing, because we no longer support GCJ. [PATCH] fixes for gcj target. - Key: LUCENE-401 URL: https://issues.apache.org/jira/browse/LUCENE-401 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: unspecified Environment: Operating System: Linux Platform: Other Reporter: Robert Newson Priority: Minor Attachments: gcj.patch I've modified the Makefile so that it compiles with GCJ-4.0. This involved fixing the CORE_OBJ macro to match the generated jar file as well as excluding FieldCacheImpl from being used from its .java source (GCJ has problems with anonymous inner classes, I guess). Also, I changed the behaviour of FieldInfos.fieldInfo(int). It depended on catching IndexOutOfBoundsException exception. I've modified it to test the bounds first, returning -1 in that case. This helps with gcj since we build with -fno-bounds-check. I compiled with; GCJ=gcj-4.0 GCJH=gcjh-4.0 GPLUSPLUS=g++-4.0 ant clean gcj patch to follow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [DISCUSSION] Trunk and Stable release strategy
+1 Makes sense to me. On Jan 24, 2011, at 4:07 AM, Shai Erera wrote: Hi Few days ago Robert and I discussed this matter over IRC and thought it's something we should bring forward to the list. This issue arise due to recent index format change introduced in LUCENE-2720, and the interesting question was if we say 4.0 is required to read all 3x indexes, how would 4.0 support a future version of 3x, that did not even exist when 4.0 was released. Trunk means the 'unstable' branch (today's 4.0) and Stable is today's 3.0, but the same issue will arise after we make 4.0 Stable and 5.0 Trunk. After some discussion we came to a solution that we would like to propose to the list: we continue to release 3x until we stabilize trunk. When we're happy with trunk, we release it, say 4.0, and the last 3x release becomes the bug fix release for 3x and from that point we maintain 4.0 (new features and all, while maintaining API back-compat) and Trunk becomes the next big thing (5.0). There won't be interleaving 4.0 and 3x releases and we won't reach the situation where we released 4.0 and then release 3.2, w/ say index format change (that we just had to make). While we can say 3x can be released after 4.0 w/ no index format changes whatsoever, we think this proposal makes sense. There's no point maintaining 2 stable branches (3x and 4x) and an unstable Trunk. This will allow us to release 3x as frequent as we want, hold on w/ trunk as much as we want, and at some point cut over to 4.0 and think about the next big things we'd like to bring to Lucene. What do you think? Shai - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Closed: (LUCENE-505) MultiReader.norm() takes up too much memory: norms byte[] should be made into an Object
[ https://issues.apache.org/jira/browse/LUCENE-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler closed LUCENE-505. Resolution: Fixed Fix Version/s: 2.9 Since Lucene 2.9 we search on each segment separately, so MultiReader's norms cache would never be used, except in custom code that calls norms() on the MultiReader/DirectoryReader. Since Lucene 4.0 this is also not allowed anymore, non-atomic readers don't support norms. If you still need to get global norms, you can use MultiNorms but that is discouraged. See also: LUCENE-2771 MultiReader.norm() takes up too much memory: norms byte[] should be made into an Object --- Key: LUCENE-505 URL: https://issues.apache.org/jira/browse/LUCENE-505 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.0.0 Environment: Patch is against Lucene 1.9 trunk (as of Mar 1 06) Reporter: Steven Tamm Priority: Minor Fix For: 2.9 Attachments: LazyNorms.patch, NormFactors.patch, NormFactors.patch, NormFactors20.patch MultiReader.norms() is very inefficient: it has to construct a byte array that's as long as all the documents in every segment. This doubles the memory requirement for scoring MultiReaders vs. Segment Readers. Although this is cached, it's still a baseline of memory that is unnecessary. The problem is that the Normalization Factors are passed around as a byte[]. If it were instead replaced with an Object, you could perform a whole host of optimizations a. When reading, you wouldn't have to construct a fakeNorms array of all 1.0fs. You could instead return a singleton object that would just return 1.0f. b. MultiReader could use an object that could delegate to NormFactors of the subreaders c. You could write an implementation that could use mmap to access the norm factors. Or if the index isn't long lived, you could use an implementation that reads directly from the disk. 
The patch provided here replaces the use of byte[] with a new abstract class called NormFactors. NormFactors has two methods on it: public abstract byte getByte(int doc) throws IOException; // Returns the byte[doc] public float getFactor(int doc) throws IOException; // Calls Similarity.decodeNorm(getByte(doc)) There are four implementations of this abstract class 1. NormFactors.EmptyNormFactors - This replaces the fakeNorms with a singleton that only returns 1.0 2. NormFactors.ByteNormFactors - Converts a byte[] to a NormFactors for backwards compatibility in constructors. 3. MultiNormFactors - Multiplexes the NormFactors in MultiReader to prevent the need to construct the gigantic norms array. 4. SegmentReader.Norm - Same class, but now extends NormFactors to provide the same access. In addition, many of the Query and Scorer classes were changed to pass around NormFactors instead of byte[], and to call getFactor() instead of using the byte[]. I have kept around IndexReader.norms(String) for backwards compatibility, but marked it as deprecated. I believe that the use of ByteNormFactors in IndexReader.getNormFactors() will keep backward compatibility with other IndexReader implementations, but I don't know how to test that. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
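The class layout described in LUCENE-505 can be sketched in plain Java. This is only an illustration of the idea, not the attached patch: decodeNorm below is a made-up stand-in for Similarity.decodeNorm, the starts array of per-segment document offsets is an assumption, and the throws IOException clauses from the issue's signatures are dropped so the example stays self-contained.

```java
/** Sketch of the NormFactors idea; anything not named in the issue is hypothetical. */
final class NormFactorsSketch {

    /** Hypothetical stand-in for Similarity.decodeNorm: byte 100 decodes to 1.0f. */
    static float decodeNorm(byte b) {
        return (b & 0xFF) / 100.0f;
    }

    /** Norms are handed around as an object instead of a byte[]. */
    abstract static class NormFactors {
        abstract byte getByte(int doc);

        float getFactor(int doc) {
            return decodeNorm(getByte(doc));
        }
    }

    /** Singleton replacing a fakeNorms array filled with 1.0f. */
    static final class EmptyNormFactors extends NormFactors {
        static final EmptyNormFactors INSTANCE = new EmptyNormFactors();

        private EmptyNormFactors() {}

        @Override byte getByte(int doc) { return (byte) 100; }

        @Override float getFactor(int doc) { return 1.0f; }
    }

    /** Adapter that wraps an existing byte[] of encoded norms. */
    static final class ByteNormFactors extends NormFactors {
        private final byte[] norms;

        ByteNormFactors(byte[] norms) { this.norms = norms; }

        @Override byte getByte(int doc) { return norms[doc]; }
    }

    /** Delegates to per-segment NormFactors instead of building one big array. */
    static final class MultiNormFactors extends NormFactors {
        private final NormFactors[] subs;
        private final int[] starts; // starts[i] = first global doc id of segment i, starts[0] == 0

        MultiNormFactors(NormFactors[] subs, int[] starts) {
            this.subs = subs;
            this.starts = starts;
        }

        @Override byte getByte(int doc) {
            int i = starts.length - 1;
            while (starts[i] > doc) {
                i--; // walk back to the segment that contains doc
            }
            return subs[i].getByte(doc - starts[i]);
        }
    }

    /** Two segments: docs 0-1 have stored norms, docs 2 and up have none. */
    static NormFactors example() {
        NormFactors seg0 = new ByteNormFactors(new byte[] {50, 100});
        return new MultiNormFactors(
                new NormFactors[] {seg0, EmptyNormFactors.INSTANCE},
                new int[] {0, 2});
    }

    public static void main(String[] args) {
        NormFactors multi = example();
        for (int doc = 0; doc < 5; doc++) {
            System.out.println(doc + " -> " + multi.getFactor(doc));
        }
    }
}
```

The point of the design is that the multi-segment view allocates nothing per document; each lookup costs one segment search plus one delegated call.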
[jira] Resolved: (LUCENE-406) sort missing string fields last
[ https://issues.apache.org/jira/browse/LUCENE-406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-406. -- Resolution: Fixed Fix Version/s: 4.0 This is resolved / is being resolved by the new FieldCache deleted docs support: LUCENE-2671, LUCENE-2649 sort missing string fields last --- Key: LUCENE-406 URL: https://issues.apache.org/jira/browse/LUCENE-406 Project: Lucene - Java Issue Type: New Feature Components: Search Affects Versions: 1.4 Environment: Operating System: All Platform: All Reporter: Yonik Seeley Assignee: Hoss Man Priority: Minor Fix For: 4.0 Attachments: MissingStringLastComparatorSource.java, MissingStringLastComparatorSource.java, TestMissingStringLastComparatorSource.java A SortComparatorSource for string fields that orders documents with the sort field missing after documents with the field. This is the reverse of the default Lucene implementation. The concept and first-pass implementation was done by Chris Hostetter. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-770) CfsExtractor tool
[ https://issues.apache.org/jira/browse/LUCENE-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986438#action_12986438 ] Uwe Schindler commented on LUCENE-770: -- In my opinion, this tool is not needed and does not really help, because it would not de-compound an index successfully. The correct way to decompound is: Create a new IndexWriter on a empty directory, set CFS to off and then use addIndexes(IndexReader...) to force a merge over to the new dir. CfsExtractor tool - Key: LUCENE-770 URL: https://issues.apache.org/jira/browse/LUCENE-770 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.1 Reporter: Otis Gospodnetic Priority: Minor Attachments: LUCENE-770.patch A tool for extracting the content of a CFS file, in order to go from a compound index to a multi-file index. This may be handy for people who want to go back to multi-file index format now that field norms are in a single file - LUCENE-756. Most of this code already existed and was hiding in IndexReader.main. I'll commit tomorrow, unless I hear otherwise. I think I should also remove IndexReader.main then. Ja? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-770) CfsExtractor tool
[ https://issues.apache.org/jira/browse/LUCENE-770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-770. -- Resolution: Not A Problem CfsExtractor tool - Key: LUCENE-770 URL: https://issues.apache.org/jira/browse/LUCENE-770 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.1 Reporter: Otis Gospodnetic Priority: Minor Attachments: LUCENE-770.patch A tool for extracting the content of a CFS file, in order to go from a compound index to a multi-file index. This may be handy for people who want to go back to multi-file index format now that field norms are in a single file - LUCENE-756. Most of this code already existed and was hiding in IndexReader.main. I'll commit tomorrow, unless I hear otherwise. I think I should also remove IndexReader.main then. Ja? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Closed: (LUCENE-1418) QueryParser can throw NullPointerException during parsing of some queries in case if default field passed to constructor is null
[ https://issues.apache.org/jira/browse/LUCENE-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera closed LUCENE-1418. -- Resolution: Not A Problem I don't think QP should support 'null' passed as the default field, and I doubt if people really pass null as the default field. True, we can add a null check to the ctor, but due to long inactivity, I think it's not a problem people hit, so closing. QueryParser can throw NullPointerException during parsing of some queries in case if default field passed to constructor is null Key: LUCENE-1418 URL: https://issues.apache.org/jira/browse/LUCENE-1418 Project: Lucene - Java Issue Type: Bug Components: QueryParser Affects Versions: 2.4 Environment: CentOS 5.2 (probably any applies) Reporter: Alexei Dets Priority: Minor In case if QueryParser was constructed using QueryParser(String f, Analyzer a) constructor and f equals null then QueryParser can fail with NullPointerException during parsing of some queries that _does_ contain field name but have unbalanced parenthesis. 
Example 1: Query: field:(expr1) expr2) Result: java.lang.NullPointerException at org.apache.lucene.index.Term.init(Term.java:50) at org.apache.lucene.index.Term.init(Term.java:36) at org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:543) at org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1324) at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1211) at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1168) at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1128) at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:170) Example2: Query: field:(expr1) expr2) Result: java.lang.NullPointerException at org.apache.lucene.index.Term.init(Term.java:50) at org.apache.lucene.index.Term.init(Term.java:36) at org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:543) at org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:612) at org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1459) at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1211) at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1168) at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1128) at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:170) Workaround: pass in constructor empty string as a default field name - in this case QueryParser.parse method will throw ParseException (expected result because query string is wrong) instead of NullPointerException. It is not obvious to me how to fix this so I'll describe my usecase, may be I'm doing something completely wrong. Basically I have a set of per-field queries entered by user and need to programmatically construct (after some preprocessing) one real Lucene query combined from these user-entered per-field subqueries. 
To achieve this I basically do the following (simplified a bit): QueryParser parser = new QueryParser(null, analyzer); // I'll always provide a field name in a query string as it is different each time and I don't have any default BooleanQuery query = new BooleanQuery(); Query subQuery1 = parser.parse(field1 + ":(" + queryString1 + ')'); query.add(subQuery1, operator1); // operator = BooleanClause.Occur.MUST, BooleanClause.Occur.MUST_NOT or BooleanClause.Occur.SHOULD Query subQuery2 = parser.parse(field2 + ":(" + queryString2 + ')'); query.add(subQuery2, operator2); Query subQuery3 = parser.parse(field3 + ":(" + queryString3 + ')'); query.add(subQuery3, operator3); ... IMHO either the QueryParser constructor should be changed to throw NullPointerException/IllegalArgumentException when a null field is passed (and the API documentation updated) or QueryParser.parse behavior should be fixed to correctly throw ParseException instead of NullPointerException. Also, IMHO, _public_ setField/getField methods on QueryParser (that set/get the field) would be a great help in use cases like mine: QueryParser parser = new QueryParser(null, analyzer); // or add constructor with analyzer _only_ for such cases BooleanQuery query = new BooleanQuery(); parser.setField(field1); Query subQuery1 = parser.parse(queryString1); query.add(subQuery1, operator1); parser.setField(field2); Query subQuery2 = parser.parse(queryString2); query.add(subQuery2, operator2); ... -- This message is automatically generated by JIRA. - You can reply to this email to add a
[jira] Commented: (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans
[ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986442#action_12986442 ] Grant Ingersoll commented on LUCENE-2878: - I haven't looked at the patch, but one of the biggest issues with Spans is the duality within the spans themselves. The whole point of spans is that you care about position information. However, in order to get both the search results and the positions, you have to, effectively, execute the query twice, once to get the results and once to get the positions. A Collector like interface, IMO, would be ideal because it would allow applications to leverage position information as the queries are being scored and hits being collected. In other words, if we are rethinking how we handle position based queries, let's get it right this time and make it so it is actually useful for people who need the functionality. As for PayloadSpanUtil, I think that was primarily put in to help w/ highlighting at the time, but if it has outlived its usefulness, then dump it. If we are consolidating all queries to support positions and payloads, then it shouldn't be needed, right? Allow Scorer to expose positions and payloads aka. nuke spans -- Key: LUCENE-2878 URL: https://issues.apache.org/jira/browse/LUCENE-2878 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: Bulk Postings branch Reporter: Simon Willnauer Assignee: Simon Willnauer Attachments: LUCENE-2878.patch, LUCENE-2878.patch Currently we have two somewhat separate types of queries, the one which can make use of positions (mainly spans) and payloads (spans). Yet Span*Query doesn't really do scoring comparable to what other queries do and at the end of the day they are duplicating a lot of code all over lucene. Span*Queries are also limited to other Span*Query instances such that you can not use a TermQuery or a BooleanQuery with SpanNear or anything like that. 
Besides the Span*Query limitation, other queries lack a quite interesting feature: they cannot score based on term proximity, since scorers don't expose any positional information. All those problems bugged me for a while now so I started working on that using the bulkpostings API. I would have done that first cut on trunk but TermScorer is working on BlockReader that does not expose positions while the one in this branch does. I started adding a new Positions class which users can pull from a scorer, to prevent unnecessary positions enums I added ScorerContext#needsPositions and eventually Scorer#needsPayloads to create the corresponding enum on demand. Yet, currently only TermQuery / TermScorer implements this API and others simply return null instead. To show that the API really works and our BulkPostings work fine too with positions I cut over TermSpanQuery to use a TermScorer under the hood and nuked TermSpans entirely. A nice side effect of this was that the Position BulkReading implementation got some exercise which now :) work all with positions while Payloads for bulkreading are kind of experimental in the patch and those only work with Standard codec. So all spans now work on top of TermScorer ( I truly hate spans since today ) including the ones that need Payloads (StandardCodec ONLY)!! I didn't bother to implement the other codecs yet since I want to get feedback on the API and on this first cut before I go on with it. I will upload the corresponding patch in a minute. I also had to cut over SpanQuery.getSpans(IR) to SpanQuery.getSpans(AtomicReaderContext) which I should probably do on trunk first but after that pain today I need a break first :). The patch passes all core tests (org.apache.lucene.search.highlight.HighlighterTest still fails but I didn't look into the MemoryIndex BulkPostings API yet) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
[jira] Closed: (LUCENE-663) New feature rich higlighter for Lucene.
[ https://issues.apache.org/jira/browse/LUCENE-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler closed LUCENE-663. Resolution: Fixed Fix Version/s: 2.9 Since Lucene 2.9 we have FastVectorHighlighter which uses TermVectors to highlight. Also the conventional Highlighter was extended to support more query types. New feature rich higlighter for Lucene. --- Key: LUCENE-663 URL: https://issues.apache.org/jira/browse/LUCENE-663 Project: Lucene - Java Issue Type: New Feature Components: Search Reporter: Karel Tejnora Priority: Minor Fix For: 2.9 Attachments: lucene-hlt-src.jar Well, I refactored (took) some code from two previous highlighters. This highlighter: + use TermPositionVector where available + use Analyzer if no TermPositionVector found or is forced to use it. + support for all lucene queries (Term, Phrase with slops, Prefix, Wildcard, Range) except Fuzzy Query (can be implemented easily) - has no support for scoring (yet) - use same prefix,postfix for accepted terms (yet) ? It's written in Java5 In next release I'd like to add support for Fuzzy, coloring f.e. different color for terms btw. phrase terms (slops), scoring of fragments It's apache licensed - I hope so :-) I put license statement in every file -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2888) Several DocsEnum / DocsAndPositionsEnum return wrong docID when next() / advance(int) return NO_MORE_DOCS
Several DocsEnum / DocsAndPositionsEnum return wrong docID when next() / advance(int) return NO_MORE_DOCS - Key: LUCENE-2888 URL: https://issues.apache.org/jira/browse/LUCENE-2888 Project: Lucene - Java Issue Type: Bug Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Fix For: 4.0 During work on LUCENE-2878 I found some minor problems in PreFlex and Pulsing Codec - they are not returning NO_MORE_DOCS but the last docID instead from DocsEnum#docID() when next() or advance(int) returned NO_MORE_DOCS. The JavaDoc clearly says that it should return NO_MORE_DOCS. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
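The contract at stake in LUCENE-2888 can be shown with a small stand-alone iterator. The class below is hypothetical and is not Lucene's DocsEnum or either of the affected codecs; it only mirrors the rule that once iteration is exhausted, docID() must report NO_MORE_DOCS rather than the last real document. The method the issue calls next() is named nextDoc() here, and the sentinel is Integer.MAX_VALUE as in Lucene's DocIdSetIterator.

```java
/** Hypothetical array-backed enum illustrating the docID() contract. */
final class ArrayDocsEnum {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs; // doc ids in ascending order
    private int idx = -1;     // position in docs, -1 before the first call
    private int doc = -1;     // value reported by docID()

    ArrayDocsEnum(int... docs) {
        this.docs = docs;
    }

    /** Current doc id, or NO_MORE_DOCS once the enum is exhausted. */
    int docID() {
        return doc;
    }

    /** Moves to the next doc and records the result so docID() agrees with it. */
    int nextDoc() {
        if (idx < docs.length) {
            idx++;
        }
        // The bug described in the issue amounts to leaving 'doc' at the
        // last real id here instead of storing the sentinel.
        doc = idx < docs.length ? docs[idx] : NO_MORE_DOCS;
        return doc;
    }

    /** Moves to the first doc >= target; same bookkeeping as nextDoc(). */
    int advance(int target) {
        int d;
        do {
            d = nextDoc();
        } while (d < target);
        return d;
    }

    public static void main(String[] args) {
        ArrayDocsEnum e = new ArrayDocsEnum(3, 7);
        while (e.nextDoc() != NO_MORE_DOCS) {
            System.out.println("doc " + e.docID());
        }
        System.out.println("exhausted: " + (e.docID() == NO_MORE_DOCS));
    }
}
```

Keeping the return value and the stored state in one assignment is what makes the two impossible to disagree, whichever of the two methods hits the end first.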
[jira] Resolved: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput
[ https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-753. -- Resolution: Fixed This issue was resolved a long time ago, but left open for the stupid Windows Sun JRE bug which was never resolved. With Lucene 3.x and trunk we have better defaults (use e.g. MMapDirectory on Windows-64). Users should default to FSDirectory.open() and use the returned directory for best performance. Use NIO positional read to avoid synchronization in FSIndexInput Key: LUCENE-753 URL: https://issues.apache.org/jira/browse/LUCENE-753 Project: Lucene - Java Issue Type: New Feature Components: Store Reporter: Yonik Seeley Assignee: Michael McCandless Attachments: FileReadTest.java, FileReadTest.java, FileReadTest.java, FileReadTest.java, FileReadTest.java, FileReadTest.java, FileReadTest.java, FileReadTest.java, FSDirectoryPool.patch, FSIndexInput.patch, FSIndexInput.patch, LUCENE-753.patch, LUCENE-753.patch, LUCENE-753.patch, LUCENE-753.patch, LUCENE-753.patch, lucene-753.patch, lucene-753.patch As suggested by Doug, we could use NIO pread to avoid synchronization on the underlying file. This could mitigate any MT performance drop caused by reducing the number of files in the index format. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Closed: (LUCENE-72) [PATCH] Query parser inconsistency when using terms to exclude.
[ https://issues.apache.org/jira/browse/LUCENE-72?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera closed LUCENE-72. Resolution: Won't Fix Assignee: (was: Lucene Developers) As per the discussion, this should have been closed a long time ago. [PATCH] Query parser inconsistency when using terms to exclude. --- Key: LUCENE-72 URL: https://issues.apache.org/jira/browse/LUCENE-72 Project: Lucene - Java Issue Type: Bug Components: QueryParser Affects Versions: 1.2 Environment: Operating System: All Platform: PC Reporter: Carlos Priority: Minor Attachments: patch6.txt, patch7.txt, TestRegressionLucene72.java, TestRegressionLucene72.java Hi. The problem I am having occurs when using queryparser and also when building the query using the API. Assume that we want to look for documents about fruits or vegetables but excluding tomatoes and bananas. I suppose the right query should be: +(fruits vegetables) AND (-tomatoes -bananas) which I think is equivalent to (if you parse it and then print the query.toString () result that is what you get) +(fruits vegetables) +(-tomatoes -bananas) but the query doesn't work as expected, in fact the query that works is +(fruits vegetables) -(-tomatoes -bananas) which doesn't really make much sense, because the second part seems to say: All documents where the condition tomatoes is not present and bananas is not present is false, which means the opposite. In fact, second query works as (even if they look quite opposite): +(fruits vegetables) -tomatoes -bananas Hope someone could help, thanks -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2888) Several DocsEnum / DocsAndPositionsEnum return wrong docID when next() / advance(int) return NO_MORE_DOCS
[ https://issues.apache.org/jira/browse/LUCENE-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2888: Attachment: LUCENE-2888.patch here is a patch including the ported testcase from LUCENE-2878 Several DocsEnum / DocsAndPositionsEnum return wrong docID when next() / advance(int) return NO_MORE_DOCS - Key: LUCENE-2888 URL: https://issues.apache.org/jira/browse/LUCENE-2888 Project: Lucene - Java Issue Type: Bug Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Fix For: 4.0 Attachments: LUCENE-2888.patch During work on LUCENE-2878 I found some minor problems in PreFlex and Pulsing Codec - they are not returning NO_MORE_DOCS but the last docID instead from DocsEnum#docID() when next() or advance(int) returned NO_MORE_DOCS. The JavaDoc clearly says that it should return NO_MORE_DOCS. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Assigned: (SOLR-445) XmlUpdateRequestHandler bad documents mid batch aborts rest of batch
[ https://issues.apache.org/jira/browse/SOLR-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned SOLR-445: Assignee: Grant Ingersoll (was: Erick Erickson) XmlUpdateRequestHandler bad documents mid batch aborts rest of batch Key: SOLR-445 URL: https://issues.apache.org/jira/browse/SOLR-445 Project: Solr Issue Type: Bug Components: update Affects Versions: 1.3 Reporter: Will Johnson Assignee: Grant Ingersoll Fix For: Next Attachments: SOLR-445-3_x.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, solr-445.xml, SOLR-445_3x.patch
Has anyone run into the problem of handling bad documents / failures mid batch? I.e.:
<add>
  <doc>
    <field name="id">1</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="myDateField">I_AM_A_BAD_DATE</field>
  </doc>
  <doc>
    <field name="id">3</field>
  </doc>
</add>
Right now Solr adds the first doc and then aborts. It would seem like it should either fail the entire batch or log a message/return a code and then continue on to add doc 3. Option 1 would seem to be much harder to accomplish and possibly require more memory, while Option 2 would require more information to come back from the API. I'm about to dig into this but I thought I'd ask to see if anyone had any suggestions, thoughts or comments.
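A rough sketch of what "Option 2" (record the failure and continue) could look like, outside of Solr. Everything here is hypothetical: `TolerantBatch`, `Result` and `doc` are illustration names, documents are plain maps, and date validation via `LocalDate.parse` stands in for whatever field validation the real update handler performs.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical continue-on-error batch loop; not the XmlUpdateRequestHandler code. */
final class TolerantBatch {
    static final class Result {
        final List<String> added = new ArrayList<>();
        final Map<String, String> failed = new LinkedHashMap<>(); // id -> error message
    }

    /** Builds a document from alternating field names and values. */
    static Map<String, String> doc(String... nameValuePairs) {
        Map<String, String> d = new LinkedHashMap<>();
        for (int i = 0; i + 1 < nameValuePairs.length; i += 2) {
            d.put(nameValuePairs[i], nameValuePairs[i + 1]);
        }
        return d;
    }

    static Result addAll(List<Map<String, String>> docs) {
        Result result = new Result();
        for (Map<String, String> d : docs) {
            String id = d.get("id");
            try {
                String date = d.get("myDateField");
                if (date != null) {
                    LocalDate.parse(date); // stand-in validation; throws on a bad date
                }
                result.added.add(id);
            } catch (DateTimeParseException e) {
                // Report the bad document and keep going with the rest of the batch.
                result.failed.put(id, e.getMessage());
            }
        }
        return result;
    }
}
```

With the three documents from the example above, such a loop would add documents 1 and 3 and report document 2 as failed, which is the extra information the API would have to return to the client.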
[jira] Closed: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera closed LUCENE-855. - Resolution: Duplicate We already have FieldCacheRangeFilter (introduced in LUCENE-1461), so closing as duplicate. MemoryCachedRangeFilter to boost performance of Range queries - Key: LUCENE-855 URL: https://issues.apache.org/jira/browse/LUCENE-855 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.1 Reporter: Andy Liu Attachments: contrib-filters.tar.gz, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter_Lucene_2.3.0.patch, MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, TestRangeFilterPerformanceComparison.java, TestRangeFilterPerformanceComparison.java Currently RangeFilter uses TermEnum and TermDocs to find documents that fall within the specified range. This requires iterating through every single term in the index and can get rather slow for large document sets. MemoryCachedRangeFilter reads all (docId, value) pairs of a given field, sorts them by value, and stores them in a SortedFieldCache. During bits(), binary searches are used to find the start and end indices of the lower and upper bound values. The BitSet is populated with all the docId values that fall between the start and end indices. TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed index with random date values within a 5 year range. Executing bits() 1000 times on standard RangeQuery using random date intervals took 63904ms. Using MemoryCachedRangeFilter, it took 876ms. The performance increase is less dramatic when a field has fewer unique terms or when the index has fewer documents. Currently MemoryCachedRangeFilter only works with numeric values (values are stored in a long[] array) but it can be easily changed to support Strings.
A side benefit of storing the values as longs is that there is no longer a need to make the values lexicographically comparable, i.e. by padding numeric values with zeros. The downside of using MemoryCachedRangeFilter is that there's a fairly significant memory requirement. So it's designed to be used in situations where range filter performance is critical and memory consumption is not an issue. The memory requirements are: (sizeof(int) + sizeof(long)) * numDocs. MemoryCachedRangeFilter also requires a warmup step which can take a while to run on large datasets (it took 40s to run on a 3M document corpus). Warmup can be called explicitly or is automatically called the first time MemoryCachedRangeFilter is applied using a given field. So in summary, MemoryCachedRangeFilter can be useful when:
- Performance is critical
- Memory is not an issue
- Field contains many unique numeric values
- Index contains a large number of documents
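The approach described above (sort the (docId, value) pairs once, then binary-search the bounds on every bits() call) can be sketched as follows. This is not the attached patch: `SortedFieldCacheSketch` and its methods are hypothetical names, both bounds are assumed to be inclusive, and the values are passed in as a plain array indexed by docId instead of being read from an index.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Comparator;

/** Hypothetical sketch of a sorted value cache with binary-searched range lookups. */
final class SortedFieldCacheSketch {
    private final long[] values; // field values, sorted ascending
    private final int[] docIds;  // docIds[i] is the document holding values[i]

    /** "Warmup": valueByDoc[docId] is the numeric field value of that document. */
    SortedFieldCacheSketch(long[] valueByDoc) {
        int n = valueByDoc.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingLong((Integer d) -> valueByDoc[d]));
        values = new long[n];
        docIds = new int[n];
        for (int i = 0; i < n; i++) {
            docIds[i] = order[i];
            values[i] = valueByDoc[order[i]];
        }
    }

    /** First index whose value is >= key (strict == false) or > key (strict == true). */
    private int firstIndex(long key, boolean strict) {
        int lo = 0, hi = values.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            boolean before = strict ? values[mid] <= key : values[mid] < key;
            if (before) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /** Documents whose value lies in [lower, upper]. */
    BitSet bits(long lower, long upper) {
        BitSet bits = new BitSet(values.length);
        int start = firstIndex(lower, false);
        int end = firstIndex(upper, true); // exclusive
        for (int i = start; i < end; i++) bits.set(docIds[i]);
        return bits;
    }

    /** (sizeof(int) + sizeof(long)) * numDocs, as stated in the issue. */
    static long memoryBytes(long numDocs) {
        return (Integer.BYTES + Long.BYTES) * numDocs;
    }
}
```

By the formula quoted in the issue, each document costs 12 bytes, so the 3M document corpus mentioned above would need about 36 MB for one cached field.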
[jira] Closed: (LUCENE-320) [PATCH] Increases visibility of methods/classes from protected/package level to public
[ https://issues.apache.org/jira/browse/LUCENE-320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera closed LUCENE-320. - Resolution: Not A Problem Assignee: (was: Lucene Developers) This API is already public, so I don't think there's a problem anymore. [PATCH] Increases visibility of methods/classes from protected/package level to public -- Key: LUCENE-320 URL: https://issues.apache.org/jira/browse/LUCENE-320 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: CVS Nightly - Specify date in submission Environment: Operating System: All Platform: All Reporter: Alexey Panchenko Priority: Minor Attachments: lucene-more-public.patch I am building a Query implementation which should match documents that are matched by specified number of subqueries. It works very much the same as BooleanQuery, but checks the number of matched subqueries which should be greater than or equal to the specified value. The patch is needed to allow access to these classes/members from other packages, not just org.apache.lucene.search.
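The matching rule the reporter describes (a document matches when at least a given number of subqueries match it) can be modelled without any Lucene classes. The sketch below is only an illustration: `MinMatchSketch.atLeast` is a hypothetical helper that uses plain predicates in place of subqueries and ignores scoring entirely.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical "at least N subqueries must match" combinator; not a Lucene Query. */
final class MinMatchSketch {
    static <D> Predicate<D> atLeast(int required, List<Predicate<D>> subQueries) {
        return doc -> {
            int matched = 0;
            for (Predicate<D> q : subQueries) {
                // Stop as soon as enough subqueries have matched.
                if (q.test(doc) && ++matched >= required) return true;
            }
            return required <= 0;
        };
    }
}
```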
[jira] Closed: (LUCENE-988) Benchmarker tasks for the TPB data collection
[ https://issues.apache.org/jira/browse/LUCENE-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera closed LUCENE-988. - Resolution: Not A Problem Closing because I'm not sure what the license of The Pirate Bay DB is, and I'm also not sure that we want to have such a DB in Lucene. Benchmark's API allows someone to write a ContentSource that reads whatever source they want and converts it to DocData that is later fed to DocMaker and indexed. Benchmarker tasks for the TPB data collection - Key: LUCENE-988 URL: https://issues.apache.org/jira/browse/LUCENE-988 Project: Lucene - Java Issue Type: New Feature Components: contrib/benchmark Affects Versions: 2.3 Reporter: Karl Wettin Priority: Trivial Attachments: LUCENE-988.txt Very simple DocMaker and QueryMaker for the TPB data collection (~150,000 content items, ~500,000 comments to the contents and ~3,700,000 user queries). URL to dataset: http://thepiratebay.org/tor/3783572/db_dump_and_query_log_from_piratebay.org__summer_of_2006
[jira] Commented: (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans
[ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986457#action_12986457 ] Simon Willnauer commented on LUCENE-2878: - {quote} I haven't looked at the patch, but one of the biggest issues with Spans is the duality within the spans themselves. The whole point of spans is that you care about position information. However, in order to get both the search results and the positions, you have to, effectively, execute the query twice, once to get the results and once to get the positions. A Collector like interface, IMO, would be ideal because it would allow applications to leverage position information as the queries are being scored and hits being collected. In other words, if we are rethinking how we handle position based queries, let's get it right this time and make it so it is actually useful for people who need the functionality. {quote} Grant, I completely agree! Any help here is very much welcome. I am so busy fixing all the BulkEnums and spin-offs from this issue, but I hope to have a first sketch of how I think this should work by the end of the week! bq. As for PayloadSpanUtil, I think that was primarily put in to help w/ highlighting at the time, but if it has outlived its usefulness, then dump it. If we are consolidating all queries to support positions and payloads, then it shouldn't be needed, right? Yeah! Allow Scorer to expose positions and payloads aka. nuke spans -- Key: LUCENE-2878 URL: https://issues.apache.org/jira/browse/LUCENE-2878 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: Bulk Postings branch Reporter: Simon Willnauer Assignee: Simon Willnauer Attachments: LUCENE-2878.patch, LUCENE-2878.patch Currently we have two somewhat separate types of queries, the one which can make use of positions (mainly spans) and payloads (spans).
Yet Span*Query doesn't really do scoring comparable to what other queries do, and at the end of the day they are duplicating a lot of code all over Lucene. Span*Queries are also limited to other Span*Query instances, such that you cannot use a TermQuery or a BooleanQuery with SpanNear or anything like that. Besides the Span*Query limitation, other queries lack a quite interesting feature: they cannot score based on term proximity, since scorers don't expose any positional information. All those problems have bugged me for a while now, so I started working on that using the bulkpostings API. I would have done that first cut on trunk, but TermScorer there works on a BlockReader that does not expose positions, while the one in this branch does. I started adding a new Positions class which users can pull from a scorer; to prevent unnecessary positions enums I added ScorerContext#needsPositions and eventually Scorere#needsPayloads to create the corresponding enum on demand. Yet, currently only TermQuery / TermScorer implements this API and the others simply return null instead. To show that the API really works and that our BulkPostings work fine with positions too, I cut over TermSpanQuery to use a TermScorer under the hood and nuked TermSpans entirely. A nice side effect of this was that the Position BulkReading implementation got some exercise, and it now :) all works with positions, while Payloads for bulkreading are kind of experimental in the patch and only work with the Standard codec. So all spans now work on top of TermScorer (I truly hate spans since today), including the ones that need Payloads (StandardCodec ONLY)!! I didn't bother to implement the other codecs yet, since I want to get feedback on the API and on this first cut before I go on with it. I will upload the corresponding patch in a minute. I also had to cut over SpanQuery.getSpans(IR) to SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk first, but after that pain today I need a break first :).
The patch passes all core tests (org.apache.lucene.search.highlight.HighlighterTest still fails, but I didn't look into the MemoryIndex BulkPostings API yet)
[jira] Resolved: (LUCENE-666) TERM1 OR NOT TERM2 does not perform as expected
[ https://issues.apache.org/jira/browse/LUCENE-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-666. -- Resolution: Not A Problem This is not a problem of QueryParser; it's more a problem of the combination of SHOULD and MUST_NOT clauses in a single BooleanQuery. The first clause must be required to have the wanted effect. To prevent such a thing at all, I would tend to disallow MUST/SHOULD clauses in a BooleanQuery. No need to add ParseExceptions to QueryParser, as the same problem would also happen to users constructing BooleanQuery programmatically. TERM1 OR NOT TERM2 does not perform as expected --- Key: LUCENE-666 URL: https://issues.apache.org/jira/browse/LUCENE-666 Project: Lucene - Java Issue Type: Bug Components: QueryParser Affects Versions: 2.0.0 Environment: Windows XP, JavaCC 4.0, JDK 1.5 Reporter: Dejan Nenov Attachments: TestAornotB.java
test:
[junit] Testsuite: org.apache.lucene.search.TestAornotB
[junit] Tests run: 3, Failures: 1, Errors: 0, Time elapsed: 0.39 sec
[junit] - Standard Output ---
[junit] Doc1 = A B C
[junit] Doc2 = A B C D
[junit] Doc3 = A C D
[junit] Doc4 = B C D
[junit] Doc5 = C D
[junit] -
[junit] With query A OR NOT B we expect to hit
[junit] all documents EXCEPT Doc4, instead we only match on Doc3.
[junit] While LUCENE currently explicitly does not support queries of
[junit] the type find docs that do not contain TERM - this explains
[junit] not finding Doc5, but does not justify elimnating Doc1 and Doc2
[junit] -
[junit] the fix shoould likely require a modification to QueryParser.jj
[junit] around the method:
[junit] protected void addClause(Vector clauses, int conj, int mods, Query q)
[junit] Query:c:a -c:b hits.length=1
[junit] Query Found:Doc[0]= A C D
[junit] 0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
[junit] 0.6115718 = (MATCH) fieldWeight(c:a in 1), product of:
[junit] 1.0 = tf(termFreq(c:a)=1)
[junit] 1.2231436 = idf(docFreq=3)
[junit] 0.5 = fieldNorm(field=c, doc=1)
[junit] 0.0 = match on prohibited clause (c:b)
[junit] 0.6115718 = (MATCH) fieldWeight(c:b in 1), product of:
[junit] 1.0 = tf(termFreq(c:b)=1)
[junit] 1.2231436 = idf(docFreq=3)
[junit] 0.5 = fieldNorm(field=c, doc=1)
[junit] 0.6115718 = (MATCH) sum of:
[junit] 0.6115718 = (MATCH) fieldWeight(c:a in 2), product of:
[junit] 1.0 = tf(termFreq(c:a)=1)
[junit] 1.2231436 = idf(docFreq=3)
[junit] 0.5 = fieldNorm(field=c, doc=2)
[junit] 0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
[junit] 0.0 = match on prohibited clause (c:b)
[junit] 0.6115718 = (MATCH) fieldWeight(c:b in 3), product of:
[junit] 1.0 = tf(termFreq(c:b)=1)
[junit] 1.2231436 = idf(docFreq=3)
[junit] 0.5 = fieldNorm(field=c, doc=3)
[junit] Query:c:a (-c:b) hits.length=3
[junit] Query Found:Doc[0]= A B C
[junit] Query Found:Doc[1]= A B C D
[junit] Query Found:Doc[2]= A C D
[junit] 0.3057859 = (MATCH) product of:
[junit] 0.6115718 = (MATCH) sum of:
[junit] 0.6115718 = (MATCH) fieldWeight(c:a in 1), product of:
[junit] 1.0 = tf(termFreq(c:a)=1)
[junit] 1.2231436 = idf(docFreq=3)
[junit] 0.5 = fieldNorm(field=c, doc=1)
[junit] 0.5 = coord(1/2)
[junit] 0.3057859 = (MATCH) product of:
[junit] 0.6115718 = (MATCH) sum of:
[junit] 0.6115718 = (MATCH) fieldWeight(c:a in 2), product of:
[junit] 1.0 = tf(termFreq(c:a)=1)
[junit] 1.2231436 = idf(docFreq=3)
[junit] 0.5 = fieldNorm(field=c, doc=2)
[junit] 0.5 = coord(1/2)
[junit] 0.0 = (NON-MATCH) product of:
[junit] 0.0 = (NON-MATCH) sum of:
[junit] 0.0 = coord(0/2)
[junit] - ---
[junit] Testcase: testFAIL(org.apache.lucene.search.TestAornotB): FAILED
[junit] resultDocs =A C D expected:3 but was:1
[junit] junit.framework.AssertionFailedError: resultDocs =A C D expected:3 but was:1
[junit] at org.apache.lucene.search.TestAornotB.testFAIL(TestAornotB.java:137)
[junit] Test org.apache.lucene.search.TestAornotB FAILED
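To make the expectation in the test output concrete, the five documents can be modelled as bit sets. This is plain set algebra, not Lucene code: `AOrNotB` and its helpers are hypothetical, and document numbers 1 to 5 are used directly as bit positions. It shows what the reporter expects from "A OR NOT B" next to what a required-A, prohibited-B combination selects.

```java
import java.util.BitSet;

/** Hypothetical set-algebra view of the five test documents; not Lucene code. */
final class AOrNotB {
    static BitSet docs(int... ids) {
        BitSet b = new BitSet();
        for (int id : ids) b.set(id);
        return b;
    }

    /** A OR (NOT B), with the complement taken over documents 1..maxDoc. */
    static BitSet aOrNotB(BitSet a, BitSet b, int maxDoc) {
        BitSet result = (BitSet) b.clone();
        result.flip(1, maxDoc + 1); // complement of B within the collection
        result.or(a);
        return result;
    }

    /** A AND (NOT B): B only removes documents already selected by A. */
    static BitSet aAndNotB(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.andNot(b);
        return result;
    }
}
```

With A in Doc1, Doc2, Doc3 and B in Doc1, Doc2, Doc4, the first method yields documents 1, 2, 3 and 5 (everything except Doc4), while the second yields only Doc3, which is the single hit reported for `c:a -c:b`.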
[jira] Updated: (LUCENE-666) TERM1 OR NOT TERM2 does not perform as expected
[ https://issues.apache.org/jira/browse/LUCENE-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-666: - Comment: was deleted (was: This is not a problem of QueryParser, its more a problem of the combination of SHOULD and MUST_NOT clauses in a single BooleanQuery. The first clause must be required to have the wanted effect. To prevent such a thing at all, I would tend to disallow MUST/SHOULD clauses in a BooleanQuery. No need to add ParseExceptions to QueryParser as the same problem would also happen to users constructing BooleanQuery programatically.) TERM1 OR NOT TERM2 does not perform as expected --- Key: LUCENE-666 URL: https://issues.apache.org/jira/browse/LUCENE-666 Project: Lucene - Java Issue Type: Bug Components: QueryParser Affects Versions: 2.0.0 Environment: Windows XP, JavaCC 4.0, JDK 1.5 Reporter: Dejan Nenov Attachments: TestAornotB.java
[jira] Reopened: (LUCENE-666) TERM1 OR NOT TERM2 does not perform as expected
[ https://issues.apache.org/jira/browse/LUCENE-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reopened LUCENE-666: -- Sorry, misunderstood the issue! TERM1 OR NOT TERM2 does not perform as expected --- Key: LUCENE-666 URL: https://issues.apache.org/jira/browse/LUCENE-666 Project: Lucene - Java Issue Type: Bug Components: QueryParser Affects Versions: 2.0.0 Environment: Windows XP, JavaCC 4.0, JDK 1.5 Reporter: Dejan Nenov Attachments: TestAornotB.java
[jira] Commented: (LUCENE-547) Directory implementation for Applets
[ https://issues.apache.org/jira/browse/LUCENE-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986465#action_12986465 ] Andre Schild commented on LUCENE-547: - The reason for this implementation is the following: We have built a QM documentation system which generates static PDF and HTML pages, with a tree navigation. It also generates a full-text Lucene index to be able to do a full-text search. We don't require a server to deliver the content; instead we can just start the documentation system from a local hard disk, or even a CD-ROM drive. So, since we don't have a server on hand, we can't use REST. Directory implementation for Applets Key: LUCENE-547 URL: https://issues.apache.org/jira/browse/LUCENE-547 Project: Lucene - Java Issue Type: Improvement Components: Store Affects Versions: 1.9 Environment: Applets Reporter: Andre Schild Priority: Minor Attachments: AppletDirectory.zip This directory implementation can be used inside of applets, where the index files are located on the server. Also, the applet is not required to be signed, as no calls to System.getProperty are made.
[jira] Closed: (LUCENE-1155) BoostingTermQuery#defaultTermBoost
[ https://issues.apache.org/jira/browse/LUCENE-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera closed LUCENE-1155. -- Resolution: Won't Fix We don't have BoostingTermQuery anymore, and there was never consensus here to fix it within Lucene, vs. e.g. the workarounds Grant proposed. Given that, and the fact that the issue has been inactive since Sep-2008, and that today we give enough API for someone to write this sort of capability in his application, I'm closing the issue. BoostingTermQuery#defaultTermBoost -- Key: LUCENE-1155 URL: https://issues.apache.org/jira/browse/LUCENE-1155 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Karl Wettin Priority: Trivial This patch allows a null payload to mean something different than 1f. (I have this use case where 99% of my tokens share the same rather large token position payload boost.)
{code}
Index: src/java/org/apache/lucene/search/payloads/BoostingTermQuery.java
===
--- src/java/org/apache/lucene/search/payloads/BoostingTermQuery.java (revision 615215)
+++ src/java/org/apache/lucene/search/payloads/BoostingTermQuery.java (working copy)
@@ -41,11 +41,16 @@
  */
 public class BoostingTermQuery extends SpanTermQuery{
+  private Float defaultTermBoost = null;
   public BoostingTermQuery(Term term) {
     super(term);
   }
+  public BoostingTermQuery(Term term, Float defaultTermBoost) {
+    super(term);
+    this.defaultTermBoost = defaultTermBoost;
+  }
   protected Weight createWeight(Searcher searcher) throws IOException {
     return new BoostingTermWeight(this, searcher);
@@ -107,7 +112,9 @@
           payload = positions.getPayload(payload, 0);
           payloadScore += similarity.scorePayload(term.field(), payload, 0, positions.getPayloadLength());
           payloadsSeen++;
-
+        } else if (defaultTermBoost != null) {
+          payloadScore += defaultTermBoost;
+          payloadsSeen++;
         } else {
           //zero out the payload?
         }
@@ -146,7 +153,14 @@
   }
+  public Float getDefaultTermBoost() {
+    return defaultTermBoost;
+  }
+  public void setDefaultTermBoost(Float defaultTermBoost) {
+    this.defaultTermBoost = defaultTermBoost;
+  }
+
   public boolean equals(Object o) {
     if (!(o instanceof BoostingTermQuery))
       return false;
{code}
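The effect of the proposed defaultTermBoost can be shown with a stand-alone calculation. This is a sketch under assumptions, not the BoostingTermQuery code: `PayloadBoostSketch.payloadFactor` is a hypothetical helper, each position is reduced to an already-decoded boost (null meaning "no payload"), and the final factor is assumed to be the average over the positions that were counted, falling back to 1 when none were.

```java
/** Hypothetical model of averaging payload boosts with an optional default; not Lucene code. */
final class PayloadBoostSketch {
    /**
     * @param payloadBoostPerPosition decoded boost per term position, null where no payload exists
     * @param defaultTermBoost        boost to count for positions without payload, or null to skip them
     */
    static float payloadFactor(Float[] payloadBoostPerPosition, Float defaultTermBoost) {
        float payloadScore = 0f;
        int payloadsSeen = 0;
        for (Float boost : payloadBoostPerPosition) {
            if (boost != null) {
                payloadScore += boost;
                payloadsSeen++;
            } else if (defaultTermBoost != null) {
                // The branch added by the patch: a missing payload counts as the default boost.
                payloadScore += defaultTermBoost;
                payloadsSeen++;
            }
        }
        // Assumption: average over counted positions, neutral factor 1 when nothing was counted.
        return payloadsSeen > 0 ? payloadScore / payloadsSeen : 1f;
    }
}
```

Under these assumptions, storing a payload only on the 1% of tokens that deviate and passing the common boost as the default would give the same factor as storing the payload on every token.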
[jira] Commented: (LUCENE-547) Directory implementation for Applets
[ https://issues.apache.org/jira/browse/LUCENE-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986471#action_12986471 ] Shai Erera commented on LUCENE-547: --- If you don't have a server, where does the Directory take its files from? If it's from the local hard disk, you can use RAMDirectory to load the files from an FSDirectory. Directory implementation for Applets Key: LUCENE-547 URL: https://issues.apache.org/jira/browse/LUCENE-547 Project: Lucene - Java Issue Type: Improvement Components: Store Affects Versions: 1.9 Environment: Applets Reporter: Andre Schild Priority: Minor Attachments: AppletDirectory.zip This directory implementation can be used inside of applets, where the index files are located on the server. Also, the applet is not required to be signed, as no calls to System.getProperty are made.
[jira] Updated: (LUCENE-2888) Several DocsEnum / DocsAndPositionsEnum return wrong docID when next() / advance(int) return NO_MORE_DOCS
[ https://issues.apache.org/jira/browse/LUCENE-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2888: Attachment: LUCENE-2888.patch there was a wrong assignment in the last patch... I will go ahead and commit that one soon Several DocsEnum / DocsAndPositionsEnum return wrong docID when next() / advance(int) return NO_MORE_DOCS - Key: LUCENE-2888 URL: https://issues.apache.org/jira/browse/LUCENE-2888 Project: Lucene - Java Issue Type: Bug Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Fix For: 4.0 Attachments: LUCENE-2888.patch, LUCENE-2888.patch During work on LUCENE-2878 I found some minor problems in PreFlex and Pulsing Codec - they are not returning NO_MORE_DOCS but the last docID instead from DocsEnum#docID() when next() or advance(int) returned NO_MORE_DOCS. The JavaDoc clearly says that it should return NO_MORE_DOCS.
[jira] Closed: (LUCENE-798) Factory for RangeFilters that caches sections of ranges to reduce disk reads
[ https://issues.apache.org/jira/browse/LUCENE-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler closed LUCENE-798. Resolution: Not A Problem This patch does not apply anymore, as Filters no longer use BitSets, but DocIdSets. Also, this issue is solved by NumericRangeQuery, NumericRangeFilter and FieldCacheRangeFilter - one of these classes should meet your requirements. Factory for RangeFilters that caches sections of ranges to reduce disk reads Key: LUCENE-798 URL: https://issues.apache.org/jira/browse/LUCENE-798 Project: Lucene - Java Issue Type: New Feature Components: Search Reporter: Mark Harwood Attachments: CachedRangesFilterFactory.java RangeFilters can be cached using CachingWrapperFilter but are only re-used if a user happens to use *exactly* the same upper/lower bounds. This class demonstrates a caching approach where *sections* of ranges are cached as bitsets and these are re-used/combined to construct large range filters if they fall within the required range. This can improve the cache hit ratio and avoid going to disk to read large lists of doc ids from TermDocs. This class needs some more work to add thread safety, but I'm making it available to gather feedback on the design at this early stage before making it robust.
[jira] Resolved: (LUCENE-2888) Several DocsEnum / DocsAndPositionsEnum return wrong docID when next() / advance(int) return NO_MORE_DOCS
[ https://issues.apache.org/jira/browse/LUCENE-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-2888. - Resolution: Fixed Committed revision 1063332.
[jira] Updated: (SOLR-482) Error handling in CSVLoader
[ https://issues.apache.org/jira/browse/SOLR-482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated SOLR-482: - Attachment: SOLR-482.patch Working through some old patches, this one is pretty tame and gives a little more info when an error is encountered than it used to. Will commit shortly. Error handling in CSVLoader --- Key: SOLR-482 URL: https://issues.apache.org/jira/browse/SOLR-482 Project: Solr Issue Type: Improvement Components: update Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Attachments: SOLR-482.patch, SOLR-482.patch Sometimes the underlying CSV parser can't read a line and throws an exception. Solr currently just passes the exception out to the client. Wrapping this in a SolrException allows us to pass out information about what line failed (which isn't always in the CSV IOException thrown).
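The wrapping idea described in this issue might look roughly like the sketch below. It is not the SOLR-482 patch: LoadException is a hypothetical stand-in for SolrException, and parseLine is a toy parser invented for the example. What it shows is only the pattern of tracking the current line number and attaching it when the underlying parser's IOException is rethrown.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

final class CsvLineErrorSketch {
    /** Hypothetical exception type standing in for SolrException. */
    static final class LoadException extends RuntimeException {
        final int lineNumber;

        LoadException(int lineNumber, Throwable cause) {
            super("CSV error on line " + lineNumber + ": " + cause.getMessage(), cause);
            this.lineNumber = lineNumber;
        }
    }

    /** Toy parser (an assumption for this sketch): rejects lines with an unbalanced double quote. */
    static String[] parseLine(String line) throws IOException {
        long quotes = line.chars().filter(c -> c == '"').count();
        if (quotes % 2 != 0) {
            throw new IOException("unterminated quoted field");
        }
        return line.split(",", -1);
    }

    /** Reads all lines, wrapping any parser failure together with the failing line number. */
    static List<String[]> load(String csv) {
        List<String[]> rows = new ArrayList<>();
        int lineNumber = 0;
        try (BufferedReader reader = new BufferedReader(new StringReader(csv))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                rows.add(parseLine(line));
            }
        } catch (IOException e) {
            // The original exception may not say which line failed, so add it here.
            throw new LoadException(lineNumber, e);
        }
        return rows;
    }

    public static void main(String[] args) {
        try {
            load("a,b\nc,\"d");
        } catch (LoadException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the original exception as the cause preserves the parser's own message while the wrapper supplies the line number the client would otherwise not see.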
[jira] Closed: (LUCENE-218) Query Parser flags clauses with explicit OR as required when followed by explicit AND.
[ https://issues.apache.org/jira/browse/LUCENE-218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera closed LUCENE-218. - Resolution: Not A Problem Assignee: (was: Lucene Developers) Note that the query has OR FIVE AND SIX and hence FIVE and SIX are required. Same for the last OR FOUR AND FIVE. If you want to get exact boolean ordering, you should use clauses. Query Parser flags clauses with explicit OR as required when followed by explicit AND. -- Key: LUCENE-218 URL: https://issues.apache.org/jira/browse/LUCENE-218 Project: Lucene - Java Issue Type: Bug Components: QueryParser Affects Versions: 1.0.2 Environment: Operating System: other Platform: PC Reporter: David Mabe Priority: Minor When the following string is parsed: ONE NOT TWO OR THREE NOT FOUR OR FIVE AND SIX SEVEN OR THRE OR FIVEE OR FOUR AND FIVE SIXX The following query is returned: +ONE -TWO THREE -FOUR +FIVE +SIX SEVEN THRE FIVEE +FOUR +FIVE +SIXX Note that the first FIVE is required when it should not be. Also note that the first THREE is calculated correctly with the explicit OR.
[jira] Closed: (LUCENE-1159) jarify target gives misleading message when svnversion doesn't exist
[ https://issues.apache.org/jira/browse/LUCENE-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera closed LUCENE-1159. -- Resolution: Not A Problem This seems to be fixed already. From common-build.xml: {noformat} <!-- If possible, include the svnversion --> <exec dir="." executable="${svnversion.exe}" outputproperty="svnversion" failifexecutionfails="false"> <arg line="."/> </exec> {noformat} jarify target gives misleading message when svnversion doesn't exist Key: LUCENE-1159 URL: https://issues.apache.org/jira/browse/LUCENE-1159 Project: Lucene - Java Issue Type: Bug Components: Build Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Trivial The jarify command in common-build.xml seems to indicate failure when it can't find svnversion, but this is, in fact, just a warning. We should check to see if svnversion exists before attempting the command at all, if possible. The message looks something like: [exec] Execute failed: java.io.IOException: java.io.IOException: svnversion: not found Which is understandable, but it is not clear what the ramifications are of this missing.
[jira] Commented: (LUCENE-547) Directory implementation for Applets
[ https://issues.apache.org/jira/browse/LUCENE-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986490#action_12986490 ] Andre Schild commented on LUCENE-547: - The problem with RAMDirectory is that it uses java.io.File. When you work in an applet environment, you don't have explicit java.io.File objects (for security reasons); instead you have to use java.net.URL to get access to the files. So inheriting from RAMDirectory won't do. But you can leave it closed, as I have a working implementation, and since nobody seems to have had the need for it in over 5 years. Thanks anyway for your work. Directory implementation for Applets Key: LUCENE-547 URL: https://issues.apache.org/jira/browse/LUCENE-547 Project: Lucene - Java Issue Type: Improvement Components: Store Affects Versions: 1.9 Environment: Applets Reporter: Andre Schild Priority: Minor Attachments: AppletDirectory.zip This directory implementation can be used inside of applets, where the index files are located on the server. Also the applet is not required to be signed, as no calls to System.getProperty are made.
[jira] Closed: (LUCENE-149) [PATCH] URLDirectory implementation
[ https://issues.apache.org/jira/browse/LUCENE-149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera closed LUCENE-149. - Resolution: Not A Problem Assignee: (was: Lucene Developers) This looks more like a tool to construct a Directory from a zipped file than a Directory implementation. The Directory extension is just forced here - unzipping the files and populating a RAMDirectory will achieve the same effect. Anyway, idle for too many years :). [PATCH] URLDirectory implementation --- Key: LUCENE-149 URL: https://issues.apache.org/jira/browse/LUCENE-149 Project: Lucene - Java Issue Type: Improvement Components: Store Affects Versions: unspecified Environment: Operating System: other Platform: Other Reporter: Otis Gospodnetic Priority: Minor Attachments: URLDirectory.zip August 15th, 2003 contribution from Lukas Zapletal zaple...@inf.upol.cz Suitable for Lucene Sandbox contribution containing alternate Directory implementations.
[jira] Closed: (LUCENE-150) [PATCH] DBDirectory implementation
[ https://issues.apache.org/jira/browse/LUCENE-150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera closed LUCENE-150. - Resolution: Not A Problem Assignee: (was: Lucene Developers) We have DBDirectory in contrib. [PATCH] DBDirectory implementation -- Key: LUCENE-150 URL: https://issues.apache.org/jira/browse/LUCENE-150 Project: Lucene - Java Issue Type: Improvement Components: Store Affects Versions: unspecified Environment: Operating System: other Platform: Other Reporter: Otis Gospodnetic Priority: Minor Attachments: lucene-dbdirectory-1.0.zip Implementation of the Lucene Directory interface which stores data in a JDBC-accessible database. June 2nd, 2003, a contribution from Anthony Eden m...@anthonyeden.com. Original email: Version 1.0 of the DBDirectory library, which implements a Directory which can store indices in a database, is now available for download. There are two versions: Tar GZIP: http://www.anthonyeden.com/download/lucene-dbdirectory-1.0.tar.gz ZIP: http://www.anthonyeden.com/download/lucene-dbdirectory-1.0.zip The source code is included. Please read the README file for instructions on using DBDirectory. I have only tested it with MySQL but would be happy to add other database scripts if anyone would like to submit them. Please post any questions here on the mailing list. Otis, is there anything left to do to get this into the sandbox? Additionally, how will I maintain the code if it is in the sandbox? Will I get write access to the part of the CVS repository which would house DBDirectory? I currently have all of the code in my private CVS. Sincerely, Anthony Eden
[jira] Closed: (LUCENE-151) [PATCH] Clonable RAMDirectory
[ https://issues.apache.org/jira/browse/LUCENE-151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera closed LUCENE-151. - Resolution: Not A Problem Assignee: (was: Lucene Developers) RAMDirectory has a ctor which takes a Directory, which can be used for cloning. [PATCH] Clonable RAMDirectory - Key: LUCENE-151 URL: https://issues.apache.org/jira/browse/LUCENE-151 Project: Lucene - Java Issue Type: Improvement Components: Store Affects Versions: unspecified Environment: Operating System: other Platform: Other Reporter: Otis Gospodnetic Priority: Minor Attachments: ramdir.diff, RamDirectory-clonable.patch A patch for RAMDirectory that makes it clonable. May 22nd, 2003 contribution from Nick Smith nick.sm...@techop.ch Original email: Hi Lucene Developers, Thanks for a great product! I need to be able to 'snapshot' our in-memory indices (RAMDirectory instances). I have been using: RAMDirectory activeDir = new RAMDirectory(); // many inserts, deletes etc RAMDirectory cloneDir = new RAMDirectory(activeDir); but unfortunately this is rather slow for large indices. I have a suggestion - implement the java.lang.Cloneable interface in RAMDirectory, i.e. to be able to call: RAMDirectory cloneDir = (RAMDirectory)activeDir.clone(); This bypasses the input/output stream handling of the copy constructor by cloning the underlying buffers that form the directory and is much faster. (Diff attached). Any comments? Regards, Nick
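The "clone the underlying buffers" suggestion in the original email can be sketched on a toy in-memory store. This is an illustration only: InMemoryStoreSketch is a hypothetical class and says nothing about how RAMDirectory or the attached diff is actually implemented. It shows a clone() that deep-copies each file's byte buffer directly instead of streaming the contents through a copy constructor.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory file store; this is NOT Lucene's RAMDirectory. */
final class InMemoryStoreSketch implements Cloneable {
    private Map<String, byte[]> files = new HashMap<>();

    void write(String name, byte[] data) {
        files.put(name, data.clone());
    }

    byte[] read(String name) {
        byte[] data = files.get(name);
        return data == null ? null : data.clone();
    }

    /** Snapshot: copies every buffer so later writes to the original do not leak into the copy. */
    @Override
    public InMemoryStoreSketch clone() {
        try {
            InMemoryStoreSketch copy = (InMemoryStoreSketch) super.clone();
            copy.files = new HashMap<>();
            for (Map.Entry<String, byte[]> entry : files.entrySet()) {
                copy.files.put(entry.getKey(), entry.getValue().clone());
            }
            return copy;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: the class implements Cloneable
        }
    }

    public static void main(String[] args) {
        InMemoryStoreSketch active = new InMemoryStoreSketch();
        active.write("segments", new byte[] {1, 2, 3});
        InMemoryStoreSketch snapshot = active.clone();
        active.write("segments", new byte[] {9});
        System.out.println(snapshot.read("segments").length);
    }
}
```

A shallow clone would share the map and its buffers with the original, which defeats the purpose of a snapshot; the per-buffer copy is what makes the result independent.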
[jira] Commented: (LUCENE-2883) Consolidate Solr Lucene FunctionQuery into modules
[ https://issues.apache.org/jira/browse/LUCENE-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986502#action_12986502 ] Yonik Seeley commented on LUCENE-2883: -- One issue here is the different purposes for lucene and solr function queries. Solr's function queries have always evolved at a rapid pace (and are continuing to evolve) to support higher level features and interfaces in Solr. They are able to evolve rapidly because they are seen more as an implementation detail rather than interface classes, and I'd hate to lose that. So if we do try to make Solr's function queries more accessible to lucene users (again), it should be as a Solr module. As we can see from history and usage, function queries are critically important to Solr, but are obviously not to Lucene. Consolidate Solr Lucene FunctionQuery into modules - Key: LUCENE-2883 URL: https://issues.apache.org/jira/browse/LUCENE-2883 Project: Lucene - Java Issue Type: Task Components: Search Affects Versions: 4.0 Reporter: Simon Willnauer Fix For: 4.0 Spin-off from the [dev list | http://www.mail-archive.com/dev@lucene.apache.org/msg13261.html]
[jira] Resolved: (LUCENE-1188) equals and hashCode implementation in org.apache.lucene.search.* package
[ https://issues.apache.org/jira/browse/LUCENE-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-1188. --- Resolution: Fixed Fix Version/s: 2.9 The equals and hashCode implementations in Query subclasses were already fixed to use getClass() and not instanceof in 2.9 by various other issues. Also the boost comparison was mostly removed by calling super. equals and hashCode implementation in org.apache.lucene.search.* package Key: LUCENE-1188 URL: https://issues.apache.org/jira/browse/LUCENE-1188 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.2, 2.3, 2.3.1 Environment: All Reporter: Chandan Raj Rupakheti Fix For: 2.9 Original Estimate: 0.5h Remaining Estimate: 0.5h I would like to talk about the implementation of equals and hashCode method in org.apache.lucene.search.* package. Example One: org.apache.lucene.search.spans.SpanTermQuery (Super Class) - org.apache.lucene.search.payloads.BoostingTermQuery (Sub Class) Observation: * BoostingTermQuery defines equals but inherits hashCode from SpanTermQuery. Definition of equals is a code clone of SpanTermQuery with a change in class name. Intention: I believe the intention of equals redefinition in BoostingTermQuery is not to make the objects of SpanTermQuery and BoostingTermQuery comparable. ie. spanTermQuery.equals(boostingTermQuery) == false boostingTermQuery.equals(spanTermQuery) == false. Problem: With current implementation, the intention might not be respected as a result of symmetric property violation of equals contract i.e. spanTermQuery.equals(boostingTermQuery) == true (can be) boostingTermQuery.equals(spanTermQuery) == false. 
(always) (Note: Provided their state variables are equal) Solution: Change implementation of equals in SpanTermQuery from: {code:title=SpanTermQuery.java|borderStyle=solid} public boolean equals(Object o) { if (!(o instanceof SpanTermQuery)) return false; SpanTermQuery other = (SpanTermQuery)o; return (this.getBoost() == other.getBoost()) && this.term.equals(other.term); } {code} To: {code:title=SpanTermQuery.java|borderStyle=solid} public boolean equals(Object o) { if(o == this) return true; if(o == null || o.getClass() != this.getClass()) return false; //if (!(o instanceof SpanTermQuery)) // return false; SpanTermQuery other = (SpanTermQuery)o; return (this.getBoost() == other.getBoost()) && this.term.equals(other.term); } {code} Advantage: * BoostingTermQuery.equals and BoostingTermQuery.hashCode is not needed while still preserving the same intention as before. * Any further subclassing that does not add new state variables in the extended classes of SpanTermQuery, does not have to redefine equals and hashCode. * Even if a new state variable is added in a subclass, the symmetric property of equals contract will still be respected irrespective of implementation (i.e. instanceof / getClass) of equals and hashCode in the subclasses. Example Two: org.apache.lucene.search.CachingWrapperFilter (Super Class) - org.apache.lucene.search.CachingWrapperFilterHelper (Sub Class) Observation: Same as Example One. Problem: Same as Example one. 
Solution: Change equals in CachingWrapperFilter from: {code:title=CachingWrapperFilter.java|borderStyle=solid} public boolean equals(Object o) { if (!(o instanceof CachingWrapperFilter)) return false; return this.filter.equals(((CachingWrapperFilter)o).filter); } {code} To: {code:title=CachingWrapperFilter.java|borderStyle=solid} public boolean equals(Object o) { //if (!(o instanceof CachingWrapperFilter)) return false; if(o == this) return true; if(o == null || o.getClass() != this.getClass()) return false; return this.filter.equals(((CachingWrapperFilter)o).filter); } {code} Advantage: Same as Example One. Here, CachingWrapperFilterHelper.equals and CachingWrapperFilterHelper.hashCode is not needed. Example Three: org.apache.lucene.search.MultiTermQuery (Abstract Parent) - org.apache.lucene.search.FuzzyQuery (Concrete Sub) - org.apache.lucene.search.WildcardQuery (Concrete Sub) Observation (Not a problem): * WildcardQuery defines equals but inherits hashCode from MultiTermQuery. Definition of equals contains just super.equals invocation. * FuzzyQuery has few state variables added that are referenced in its equals and hashCode. Intention: I believe the intention here is not to make objects of FuzzyQuery and WildcardQuery comparable. ie. fuzzyQuery.equals(wildCardQuery) == false wildCardQuery.equals(fuzzyQuery) == false. Proposed
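The symmetry violation described in Examples One and Two can be reproduced with a small stand-alone sketch. The classes below are hypothetical and merely mimic the SpanTermQuery / BoostingTermQuery relationship; they are not the Lucene classes. The first pair uses instanceof in both equals methods, the second pair uses the getClass() comparison proposed in the issue.

```java
final class EqualsSymmetrySketch {
    /** Superclass with an instanceof-based equals (the pattern criticized in the issue). */
    static class TermQ {
        final String term;

        TermQ(String term) {
            this.term = term;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof TermQ)) return false;
            return term.equals(((TermQ) o).term);
        }

        @Override
        public int hashCode() {
            return term.hashCode();
        }
    }

    /** Subclass that re-declares equals with its own instanceof check and inherits hashCode. */
    static class BoostingTermQ extends TermQ {
        BoostingTermQ(String term) {
            super(term);
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof BoostingTermQ)) return false;
            return term.equals(((BoostingTermQ) o).term);
        }
    }

    /** Superclass using the getClass() comparison proposed in the issue. */
    static class SafeTermQ {
        final String term;

        SafeTermQ(String term) {
            this.term = term;
        }

        @Override
        public boolean equals(Object o) {
            if (o == this) return true;
            if (o == null || o.getClass() != this.getClass()) return false;
            return term.equals(((SafeTermQ) o).term);
        }

        @Override
        public int hashCode() {
            return term.hashCode();
        }
    }

    /** Subclass adds no state, so it does not need its own equals or hashCode. */
    static class SafeBoostingTermQ extends SafeTermQ {
        SafeBoostingTermQ(String term) {
            super(term);
        }
    }

    public static void main(String[] args) {
        TermQ a = new TermQ("x");
        TermQ b = new BoostingTermQ("x");
        System.out.println(a.equals(b) + " vs " + b.equals(a)); // asymmetric pair

        SafeTermQ c = new SafeTermQ("x");
        SafeTermQ d = new SafeBoostingTermQ("x");
        System.out.println(c.equals(d) + " vs " + d.equals(c)); // symmetric pair
    }
}
```

With the instanceof version the superclass accepts the subclass instance while the subclass rejects the superclass instance; with getClass() both directions agree, and two subclass instances with the same term still compare equal through the inherited method.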
[jira] Resolved: (SOLR-2320) ReplicationHandler doesn't return master details unless it's also configured as a slave
[ https://issues.apache.org/jira/browse/SOLR-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man resolved SOLR-2320. Resolution: Fixed Committed revision 1063339. - trunk Committed revision 1063343. - 3x ReplicationHandler doesn't return master details unless it's also configured as a slave --- Key: SOLR-2320 URL: https://issues.apache.org/jira/browse/SOLR-2320 Project: Solr Issue Type: Bug Components: replication (java) Affects Versions: 1.4, 1.4.1 Reporter: Hoss Man Assignee: Hoss Man Fix For: 3.1, 4.0 Attachments: SOLR-2320.patch, SOLR-2320.patch, SOLR-2320.patch While investigating SOLR-2314 i found a bug which seems to be the opposite of the behavior described there -- so i'm filing a separate bug to track it. if ReplicationHandler is only configured as a master, command=details requests won't include the master section. that section is only output if it is also configured as a slave. the method responsible for the details command generates the master details just fine, but the code to add it to the response seems to have erroneously been nested inside an if that only evaluates to true if there is a non-null SnapPuller (ie: it's also a slave)
[jira] Commented: (SOLR-2317) Slaves have leftover index.xxxxx directories, and leftover files in index/ directory
[ https://issues.apache.org/jira/browse/SOLR-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986514#action_12986514 ] Jayendra Patil commented on SOLR-2317: -- For the extra index. you can try the patch @ https://issues.apache.org/jira/browse/SOLR-2156 Slaves have leftover index.x directories, and leftover files in index/ directory Key: SOLR-2317 URL: https://issues.apache.org/jira/browse/SOLR-2317 Project: Solr Issue Type: Bug Affects Versions: 3.1 Reporter: Bill Bell When replicating, we are getting leftover files on slaves. Some slaves are getting index.number with files leftover. And more concerning, the index/ directory has leftover files from previous replicated runs. This is a pain to keep cleaning up. Bill
[jira] Resolved: (SOLR-482) Error handling in CSVLoader
[ https://issues.apache.org/jira/browse/SOLR-482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved SOLR-482. -- Resolution: Fixed Fix Version/s: 4.0 3.1 committed on trunk and 3.x Error handling in CSVLoader --- Key: SOLR-482 URL: https://issues.apache.org/jira/browse/SOLR-482 Project: Solr Issue Type: Improvement Components: update Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 3.1, 4.0 Attachments: SOLR-482.patch, SOLR-482.patch Sometimes the underlying CSV parser can't read a line and throws an exception. Solr currently just passes the exception out to the client. Wrapping this in a SolrException allows us to pass out information about what line failed (which isn't always in the CSV IOException thrown).
[jira] Updated: (LUCENE-2723) Speed up Lucene's low level bulk postings read API
[ https://issues.apache.org/jira/browse/LUCENE-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2723: Attachment: LUCENE-2723-BulkEnumWrapper.patch This patch adds a BulkPostingsEnumWrapper that implements DocsAndPositionsEnum by using the bulk postings. I first added this as a class just to ease testing for PositionDeltaBulks, but it seems that it could be useful for more than just testing. Codecs that don't want to implement the DocsAndPositionsEnum API can just use this wrapper to provide the functionality. I also added a testcase for MemoryIndex that uses this wrapper. Speed up Lucene's low level bulk postings read API -- Key: LUCENE-2723 URL: https://issues.apache.org/jira/browse/LUCENE-2723 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2723-BulkEnumWrapper.patch, LUCENE-2723-termscorer.patch, LUCENE-2723-termscorer.patch, LUCENE-2723-termscorer.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723_bulkvint.patch, LUCENE-2723_facetPerSeg.patch, LUCENE-2723_facetPerSeg.patch, LUCENE-2723_openEnum.patch, LUCENE-2723_termscorer.patch, LUCENE-2723_wastedint.patch Spinoff from LUCENE-1410. The flex DocsEnum has a simple bulk-read API that reads the next chunk of docs/freqs. But it's a poor fit for intblock codecs like FOR/PFOR (from LUCENE-1410). This is not unlike sucking coffee through those tiny plastic coffee stirrers they hand out on airplanes that, surprisingly, also happen to function as a straw. As a result we see no perf gain from using FOR/PFOR. I had hacked up a fix for this, described in my blog post at http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html I'm opening this issue to get that work to a committable point. So... 
I've worked out a new bulk-read API to address this performance bottleneck. It has some big changes over the current bulk-read API: * You can now also bulk-read positions (but not payloads), but, I have yet to cutover positional queries. * The buffer contains doc deltas, not absolute values, for docIDs and positions (freqs are absolute). * Deleted docs are not filtered out. * The doc and freq buffers need not be aligned. For fixed intblock codecs (FOR/PFOR) they will be, but for varint codecs (Simple9/16, Group varint, etc.) they won't be. It's still a work in progress...
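Two of the points above (the buffer holds doc deltas rather than absolute docIDs, and deleted docs are not filtered out) imply some work on the consumer's side. The sketch below illustrates that work on plain arrays; it is not the LUCENE-2723 API. The class and method names are hypothetical, and it assumes the first delta is relative to a caller-supplied previous docID and that deletions are given as a java.util.BitSet.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical consumer-side helpers; these are NOT part of the bulk postings API. */
final class DeltaDecodeSketch {
    /** Turns the first 'count' doc deltas into absolute docIDs, starting from previousDoc. */
    static int[] toAbsolute(int[] deltas, int count, int previousDoc) {
        int[] docs = new int[count];
        int doc = previousDoc;
        for (int i = 0; i < count; i++) {
            doc += deltas[i]; // each entry is the gap to the preceding docID
            docs[i] = doc;
        }
        return docs;
    }

    /** Same decoding, but skips deleted docs, since the buffer itself does not filter them. */
    static int[] liveDocs(int[] deltas, int count, int previousDoc, BitSet deleted) {
        int[] live = new int[count];
        int n = 0;
        int doc = previousDoc;
        for (int i = 0; i < count; i++) {
            doc += deltas[i];
            if (!deleted.get(doc)) {
                live[n++] = doc;
            }
        }
        return Arrays.copyOf(live, n);
    }

    public static void main(String[] args) {
        int[] deltas = {5, 3, 1, 10};
        BitSet deleted = new BitSet();
        deleted.set(8);
        System.out.println(Arrays.toString(toAbsolute(deltas, deltas.length, 0)));
        System.out.println(Arrays.toString(liveDocs(deltas, deltas.length, 0, deleted)));
    }
}
```

Note that the running docID must still be advanced for deleted entries, otherwise every later docID in the block would be decoded incorrectly.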
[jira] Commented: (LUCENE-2883) Consolidate Solr Lucene FunctionQuery into modules
[ https://issues.apache.org/jira/browse/LUCENE-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986529#action_12986529 ] Simon Willnauer commented on LUCENE-2883: - bq. One issue here is the different purposes for lucene and solr function queries. Yonik, if that is your only issue then we are good to go. I don't think that moving stuff to modules changes anything about how we develop software. Modularization, decoupling, interfaces etc. you know how to work with those ey? So hey, what is really the point here? This modularization is a key point of merging development with lucene, and every time somebody proposes something like this you fear that that monolithic thing under /solr could become more modular and decoupled. I don't know why this is the case but we should and will move on with modularization. Folks will use it once it's there, that's for sure. Same is true for faceting, replication, queryparsers, functionparser... those are on the list! Consolidate Solr Lucene FunctionQuery into modules - Key: LUCENE-2883 URL: https://issues.apache.org/jira/browse/LUCENE-2883 Project: Lucene - Java Issue Type: Task Components: Search Affects Versions: 4.0 Reporter: Simon Willnauer Fix For: 4.0 Spin-off from the [dev list | http://www.mail-archive.com/dev@lucene.apache.org/msg13261.html]
[jira] Updated: (LUCENE-2392) Enable flexible scoring
[ https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2392: Attachment: LUCENE-2392_take2.patch here's a really really rough take 2 at the problem. The general idea is to take a smaller baby-step as Mike calls it, to the problem. Really we have been working our way towards this anyway, exposing additional statistics, making Similarity per-field, fixing up inconsistencies... and this is the way I prefer, as we get things actually committed and moving. So whatever is in this patch (which is full of nocommits, but all tests pass and all queries work with it), we could possibly then split up into other issues and continue slowly proceeding, or maybe create a branch, whatever. My problem with the other patch is it requires a ton more work to make any progress on it... and things don't even compile with it, forget about tests. The basics here are to: # Split the matching and scoring calculations of Scorer. All responsibility of calculations belongs in the Similarity, the Scorer should be matching positions, working docsEnums, etc etc. # Similarity as we know it now, gets a more low-level API, and TFIDFSimilarity implements this API, but exposes its customizations via the tf(), idf(), etc we know now. # Things like score-caching and specialization of calculations are the responsibility of the Similarity, as these depend upon the formula being used. For TFIDFSimilarity, i added some optimizations here, for example it specializes its norms == null case away to remove the per-doc if. # Since all Weights create PerReaderTermState (-- this one needs a new name), to separate the seeking/stats collection from the calculations, i also optimized PhraseQuery's Weight/Scorer construction to be single-pass. 
Also I like to benchmark every step of the way, so we don't come up with this design that won't be performant: here are the scores for lucene's default Sim with the patch:
||Query||QPS trunk||QPS patch||Pct diff||
|spanNear([unit, state], 10, true)|3.04|2.92|{color:red}-4.0%{color}|
|doctitle:.*[Uu]nited.*|4.00|3.99|{color:red}-0.1%{color}|
|+unit +state|8.11|8.12|{color:green}0.2%{color}|
|united~2.0|4.36|4.40|{color:green}1.0%{color}|
|united~1.0|18.70|18.93|{color:green}1.2%{color}|
|unit~2.0|8.54|8.71|{color:green}2.1%{color}|
|spanFirst(unit, 5)|11.35|11.59|{color:green}2.2%{color}|
|unit~1.0|8.69|8.91|{color:green}2.6%{color}|
|unit state|7.03|7.23|{color:green}2.8%{color}|
|unit state~3|3.74|3.86|{color:green}3.2%{color}|
|u*d|16.72|17.30|{color:green}3.5%{color}|
|state|19.24|20.04|{color:green}4.1%{color}|
|un*d|49.42|51.55|{color:green}4.3%{color}|
|unit state|5.99|6.31|{color:green}5.3%{color}|
|+nebraska +state|140.74|151.85|{color:green}7.9%{color}|
|uni*|10.66|11.55|{color:green}8.4%{color}|
|unit*|18.77|20.41|{color:green}8.7%{color}|
|doctimesecnum:[1 TO 6]|6.97|7.70|{color:green}10.4%{color}|
All Lucene/Solr tests pass, but there are lots of nocommits, especially # No Javadocs # Explains need to be fixed: in general the explanation of matching belongs where it is now, but the explanation of score calculations belongs in the Similarity. # need to refactor more out of Weight, currently we pass it to the docscorer, but its the wrong object, as it can only hold a single float. Anyway, its gonna take some time to rough all this out I'm sure, but I wanted to show some progress/invite ideas, and also show we can do this stuff without losing performance. I have separate patches that need to be integrated/relevance tested e.g. for average doc length... maybe i'll do that next so we can get some concrete alternate sims in here before going any further. 
Enable flexible scoring --- Key: LUCENE-2392 URL: https://issues.apache.org/jira/browse/LUCENE-2392 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2392.patch, LUCENE-2392.patch, LUCENE-2392_take2.patch This is a first step (nowhere near committable!), implementing the design iterated to in the recent Baby steps towards making Lucene's scoring more flexible java-dev thread. The idea is (if you turn it on for your Field; it's off by default) to store full stats in the index, into a new _X.sts file, per doc (X field) in the index. And then have FieldSimilarityProvider impls that compute doc's boost bytes (norms) from these stats. The patch is able to index the stats, merge them when segments are merged, and provides an iterator-only API. It also has starting point for per-field Sims that use the stats iterator API to compute boost bytes. But it's not at all tied into actual searching! There's still tons left to do, eg, how does
[jira] Commented: (LUCENE-403) Alternate Lucene Query Highlighter
[ https://issues.apache.org/jira/browse/LUCENE-403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986542#action_12986542 ] Uwe Schindler commented on LUCENE-403: -- Mark Miller: What do you think, is this issue still relevant? If not, we should close it and say: resolved by FastVectorHighlighter or because recent improvements in standard highlighter? Alternate Lucene Query Highlighter -- Key: LUCENE-403 URL: https://issues.apache.org/jira/browse/LUCENE-403 Project: Lucene - Java Issue Type: Improvement Components: contrib/highlighter Affects Versions: 1.4 Environment: Operating System: All Platform: All Reporter: David Bohl Priority: Minor Attachments: HighlighterTest.java, HighlighterTest.java, QueryHighlighter.java, QueryHighlighter.java, QueryHighlighter.java, QuerySpansExtractor.java I created a lucene query highlighter (borrowing some code from the one in the sandbox) that my company is using. It better handles phrase queries, doesn't break HTML entities, and has the ability to either highlight terms in an entire document or to highlight fragments from the document. I would like to make it available to anyone who wants it.
[jira] Commented: (LUCENE-403) Alternate Lucene Query Highlighter
[ https://issues.apache.org/jira/browse/LUCENE-403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986544#action_12986544 ] Mark Miller commented on LUCENE-403: Yeah - I would totally close this. This work has been superseded - and it looks like highlighting may be able to take another leap forward soon.
[jira] Resolved: (LUCENE-403) Alternate Lucene Query Highlighter
[ https://issues.apache.org/jira/browse/LUCENE-403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller resolved LUCENE-403. Resolution: Won't Fix Assignee: Mark Miller Some of this work moved into other issues. Some of it is just too old now. I think this issue has served its purpose.
[jira] Closed: (LUCENE-403) Alternate Lucene Query Highlighter
[ https://issues.apache.org/jira/browse/LUCENE-403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller closed LUCENE-403.
[jira] Closed: (LUCENE-990) ParallelMultiSearcher.search with a custom HitCollector should run parallel
[ https://issues.apache.org/jira/browse/LUCENE-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler closed LUCENE-990. Resolution: Won't Fix ParallelMultiSearcher was dropped along with MultiSearcher in Lucene trunk (because of too many unsolvable scoring and deMorgan bugs). The replacement is a parallelized IndexSearcher on MultiReaders. It's not possible to solve this even for the new one, as it would need Collector to be synchronized. ParallelMultiSearcher.search with a custom HitCollector should run parallel --- Key: LUCENE-990 URL: https://issues.apache.org/jira/browse/LUCENE-990 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.2, 2.3 Reporter: Jan-Pascal Priority: Minor The ParallelMultiSearcher.search(Weight weight, Filter filter, final HitCollector results) should search over its underlying Searchers in parallel, like the TopDocs versions of the search() method. There's a @todo for this in the method's Javadoc comment.
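To illustrate what "Collector would need to be synchronized" could mean in practice, here is a minimal sketch. It does not use Lucene's API: `HitSink`, `CountingSink`, and `searchInParallel` are hypothetical stand-ins, and the "search" is just a loop that feeds doc IDs from several threads into one shared sink.

```java
import java.util.ArrayList;
import java.util.List;

final class SynchronizedCollectDemo {
    // Hypothetical stand-in for a collector callback; not Lucene's Collector.
    interface HitSink {
        void collect(int doc);
    }

    // A user-supplied sink that is not thread-safe on its own.
    static final class CountingSink implements HitSink {
        int count;

        public void collect(int doc) {
            count++;
        }
    }

    // Wraps a sink so that every collect() call takes the same lock.
    static HitSink synchronizedSink(final HitSink delegate) {
        return new HitSink() {
            public synchronized void collect(int doc) {
                delegate.collect(doc);
            }
        };
    }

    // Simulates one search thread per slice, all feeding one shared sink.
    static int searchInParallel(int slices, final int docsPerSlice) throws InterruptedException {
        final CountingSink sink = new CountingSink();
        final HitSink shared = synchronizedSink(sink);
        List<Thread> threads = new ArrayList<Thread>();
        for (int s = 0; s < slices; s++) {
            final int base = s * docsPerSlice;
            Thread t = new Thread(new Runnable() {
                public void run() {
                    for (int d = 0; d < docsPerSlice; d++) {
                        shared.collect(base + d);
                    }
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join(); // join() makes the final count visible to this thread
        }
        return sink.count;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(searchInParallel(4, 10000));
    }
}
```

The wrapper makes the shared count correct, but every hit from every thread now passes through a single lock, which is the cost the closing comment is pointing at.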
[jira] Resolved: (LUCENE-1264) Use of IOException in analysis component method signatures leads to poor error management
[ https://issues.apache.org/jira/browse/LUCENE-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-1264. --- Resolution: Won't Fix This issue is quite old and no response was given to Hoss' comment. In general this is not an issue, as you can also always throw RuntimeExceptions. IOException is only listed in throws there because it is unfortunately checked and needed by Tokenizer as it works on java.io.Reader. Use of IOException in analysis component method signatures leads to poor error management - Key: LUCENE-1264 URL: https://issues.apache.org/jira/browse/LUCENE-1264 Project: Lucene - Java Issue Type: Bug Components: contrib/analyzers Affects Versions: 2.3.1 Reporter: Benson Margulies Methods such as 'next' and 'reset' are defined to throw only IOException. IOException, as one of the older and dustier Java exceptions, lacks a constructor over a 'cause' exception. So, if a Tokenizer (for example) uses some complex underlying facility that throws arbitrary exceptions, the coder has two bad choices: wrap an IOException around some string derived from the real problem, or throw an unchecked wrapper. Please consider adding a new checked exception to the signature of these methods that implements the 'cause' pattern. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
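For readers weighing the two "bad choices" in the report, a small sketch of a third option follows. It keeps the `throws IOException` signature but attaches the real failure with `Throwable.initCause`, which has been part of the JDK since 1.4. The class `CauseWrapping` and its `riskyTokenize` and `next` methods are hypothetical and only stand in for a tokenizer that calls into some other facility.

```java
import java.io.IOException;

final class CauseWrapping {
    // Hypothetical stand-in for "some complex underlying facility".
    static String riskyTokenize(String text) {
        if (text.isEmpty()) {
            throw new IllegalStateException("empty input");
        }
        return text.trim();
    }

    // Declares only IOException, yet preserves the original exception as the cause.
    static String next(String text) throws IOException {
        try {
            return riskyTokenize(text);
        } catch (RuntimeException e) {
            IOException wrapped = new IOException("tokenizer failed: " + e.getMessage());
            wrapped.initCause(e); // works even without a (String, Throwable) constructor
            throw wrapped;
        }
    }

    public static void main(String[] args) {
        try {
            next("");
        } catch (IOException e) {
            System.out.println(e.getMessage() + " / cause: " + e.getCause());
        }
    }
}
```

On Java 6 and later, `new IOException(message, cause)` achieves the same in one step.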
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986556#action_12986556 ] Jason Rutherglen commented on LUCENE-2324: -- The compilation errors are gone, but TestNRTThreads and TestStressIndexing2 are still failing. I think we need to implement Mike's idea: https://issues.apache.org/jira/browse/LUCENE-2324?focusedCommentId=12984285page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12984285 then retest. Is a test deadlocking somewhere? Ant hasn't returned. Per thread DocumentsWriters that write their own private segments - Key: LUCENE-2324 URL: https://issues.apache.org/jira/browse/LUCENE-2324 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: Realtime Branch Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324.patch, LUCENE-2324.patch, LUCENE-2324.patch, lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out, test.out, test.out See LUCENE-2293 for motivation and more details. I'm copying here Mike's summary he posted on 2293: Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores and normal segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO and CPU.
[jira] Commented: (LUCENE-2883) Consolidate Solr Lucene FunctionQuery into modules
[ https://issues.apache.org/jira/browse/LUCENE-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986558#action_12986558 ] Michael McCandless commented on LUCENE-2883: Can't we consolidate them under a new toplevel module? modules/queries? We can mark the classes as lucene.experimental? Then we are free to iterate quickly. Does that address your concern Yonik? Consolidate Solr Lucene FunctionQuery into modules - Key: LUCENE-2883 URL: https://issues.apache.org/jira/browse/LUCENE-2883 Project: Lucene - Java Issue Type: Task Components: Search Affects Versions: 4.0 Reporter: Simon Willnauer Fix For: 4.0 Spin-off from the [dev list | http://www.mail-archive.com/dev@lucene.apache.org/msg13261.html] -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2883) Consolidate Solr Lucene FunctionQuery into modules
[ https://issues.apache.org/jira/browse/LUCENE-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986561#action_12986561 ] Yonik Seeley commented on LUCENE-2883: -- Not sure if I communicated the issue clearly: taking what is essentially implementation and trying to make it interface clearly has a cost. Function queries and the solr qparser architecture are constantly evolving, and wind all through solr. If we attempt to make this easier to use by lucene users by moving it out to a module then:
- it should be a solr module... keep the solr package names and make it clear that its primary purpose is supporting higher level features in solr
- we should make it such that java interface back compatibility is not a requirement, even for point releases
The other approach is to make a Lucene function query module (actually, we already have that), try to update it with stuff from solr, but make its primary purpose to support the Java interfaces. Consolidate Solr Lucene FunctionQuery into modules - Key: LUCENE-2883 URL: https://issues.apache.org/jira/browse/LUCENE-2883 Project: Lucene - Java Issue Type: Task Components: Search Affects Versions: 4.0 Reporter: Simon Willnauer Fix For: 4.0 Spin-off from the [dev list | http://www.mail-archive.com/dev@lucene.apache.org/msg13261.html]
[jira] Commented: (LUCENE-2666) ArrayIndexOutOfBoundsException when iterating over TermDocs
[ https://issues.apache.org/jira/browse/LUCENE-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986566#action_12986566 ] Michael McCandless commented on LUCENE-2666: Hmmm --- given that exception, I would expect CheckIndex to have also seen this issue. Searching at the same time as indexing shouldn't cause this. Lucene doesn't cache postings, but does cache metadata for the term, though I can't see how that could lead to this exception. This could also be a hardware issue? Do you see the problem on more than one machine? ArrayIndexOutOfBoundsException when iterating over TermDocs --- Key: LUCENE-2666 URL: https://issues.apache.org/jira/browse/LUCENE-2666 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 3.0.2 Reporter: Shay Banon Attachments: checkindex-out.txt A user got this very strange exception, and I managed to get the index that it happens on. Basically, iterating over the TermDocs causes an AIOOB exception. I easily reproduced it using the FieldCache which does exactly that (the field in question is indexed as numeric). Here is the exception:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 114
    at org.apache.lucene.util.BitVector.get(BitVector.java:104)
    at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127)
    at org.apache.lucene.search.FieldCacheImpl$LongCache.createValue(FieldCacheImpl.java:501)
    at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:183)
    at org.apache.lucene.search.FieldCacheImpl.getLongs(FieldCacheImpl.java:470)
    at TestMe.main(TestMe.java:56)
It happens on the following segment: _26t docCount: 914 delCount: 1 delFileName: _26t_1.del And as you can see, it smells like a corner case (it fails for document number 912, the AIOOB happens from the deleted docs).
The code to recreate it is simple:
    FSDirectory dir = FSDirectory.open(new File("index"));
    IndexReader reader = IndexReader.open(dir, true);
    IndexReader[] subReaders = reader.getSequentialSubReaders();
    for (IndexReader subReader : subReaders) {
        // read the private SegmentInfo of each sub-reader via reflection
        Field field = subReader.getClass().getSuperclass().getDeclaredField("si");
        field.setAccessible(true);
        SegmentInfo si = (SegmentInfo) field.get(subReader);
        System.out.println("-- " + si);
        if (si.getDocStoreSegment().contains("_26t")) {
            // this is the problematic one...
            System.out.println("problematic one...");
            FieldCache.DEFAULT.getLongs(subReader, "__documentdate", FieldCache.NUMERIC_UTILS_LONG_PARSER);
        }
    }
Here is the result of a check index on that segment:
    8 of 10: name=_26t docCount=914
      compound=true
      hasProx=true
      numFiles=2
      size (MB)=1.641
      diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.18-194.11.1.el5.centos.plus, os=Linux, mergeDocStores=true, lucene.version=3.0.2 953716 - 2010-06-11 17:13:53, source=merge, os.arch=amd64, java.version=1.6.0, java.vendor=Sun Microsystems Inc.}
      has deletions [delFileName=_26t_1.del]
      test: open reader.OK [1 deleted docs]
      test: fields..OK [32 fields]
      test: field norms.OK [32 fields]
      test: terms, freq, prox...ERROR [114]
      java.lang.ArrayIndexOutOfBoundsException: 114
          at org.apache.lucene.util.BitVector.get(BitVector.java:104)
          at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127)
          at org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:102)
          at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:616)
          at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:509)
          at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:299)
          at TestMe.main(TestMe.java:47)
      test: stored fields...ERROR [114]
      java.lang.ArrayIndexOutOfBoundsException: 114
          at org.apache.lucene.util.BitVector.get(BitVector.java:104)
          at org.apache.lucene.index.ReadOnlySegmentReader.isDeleted(ReadOnlySegmentReader.java:34)
          at org.apache.lucene.index.CheckIndex.testStoredFields(CheckIndex.java:684)
          at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:512)
          at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:299)
          at TestMe.main(TestMe.java:47)
      test: term vectorsERROR [114]
      java.lang.ArrayIndexOutOfBoundsException: 114
          at org.apache.lucene.util.BitVector.get(BitVector.java:104)
          at org.apache.lucene.index.ReadOnlySegmentReader.isDeleted(ReadOnlySegmentReader.java:34)
          at
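The link between "document number 912" and the index 114 in the exception can be checked with a little arithmetic. The sketch below is not Lucene's BitVector; `DeletedDocsBits` is a hypothetical class that assumes deletions are stored one bit per document, eight documents per byte, so that document d lives in byte d >> 3.

```java
final class DeletedDocsBits {
    private final byte[] bits;

    // One bit per document, rounded up to whole bytes (an assumption of this sketch).
    DeletedDocsBits(int docCount) {
        bits = new byte[(docCount + 7) / 8];
    }

    // Byte that holds the deletion bit of the given document.
    static int byteIndex(int docId) {
        return docId >> 3;
    }

    boolean get(int docId) {
        return (bits[docId >> 3] & (1 << (docId & 7))) != 0;
    }

    int byteLength() {
        return bits.length;
    }

    public static void main(String[] args) {
        // Document 912 maps to byte index 912 / 8.
        System.out.println("byte index for doc 912: " + byteIndex(912));

        // Sized for all 914 docs, the lookup is in range.
        DeletedDocsBits full = new DeletedDocsBits(914);
        System.out.println("bytes for 914 docs: " + full.byteLength() + ", get(912) = " + full.get(912));

        // Sized for only 912 docs, the same lookup falls off the end of the array.
        DeletedDocsBits tooSmall = new DeletedDocsBits(912);
        try {
            tooSmall.get(912);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("lookup failed: " + e);
        }
    }
}
```

Under these assumptions, a vector sized for all 914 documents would be long enough for document 912, so one possible reading of the trace is that the deletions file covers fewer documents than the segment's docCount. That is only an inference from the numbers in the report, not a confirmed diagnosis.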
[jira] Commented: (LUCENE-2723) Speed up Lucene's low level bulk postings read API
[ https://issues.apache.org/jira/browse/LUCENE-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986568#action_12986568 ] Robert Muir commented on LUCENE-2723: - Simon, just took a quick glance (not a serious review, all the bulkpostings stuff is heavy). I agree with the idea that Codecs should only need to implement the bulk api at a minimum: if all serious stuff (queries) is using these bulk apis, then the friendly iterator methods can simply be a wrapper over it. But separately, I know there are some performance degradations with the bulk APIs today versus trunk... (with the same index). I know if I use other fixed-int codecs I see these same problems, so I don't think it's just Standard's implementation: pretty sure the issue is somewhere with advance()/jump(). I really wish we could debug whatever this performance problem is, just in case the bulk APIs themselves need changing... a little concerned about them at the moment, that's all... not sure it should stand in the way of your patch, just saying I don't like the performance regression. Speed up Lucene's low level bulk postings read API -- Key: LUCENE-2723 URL: https://issues.apache.org/jira/browse/LUCENE-2723 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2723-BulkEnumWrapper.patch, LUCENE-2723-termscorer.patch, LUCENE-2723-termscorer.patch, LUCENE-2723-termscorer.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723_bulkvint.patch, LUCENE-2723_facetPerSeg.patch, LUCENE-2723_facetPerSeg.patch, LUCENE-2723_openEnum.patch, LUCENE-2723_termscorer.patch, LUCENE-2723_wastedint.patch Spinoff from LUCENE-1410. The flex DocsEnum has a simple bulk-read API that reads the next chunk of docs/freqs. But it's a poor fit for intblock codecs like FOR/PFOR (from LUCENE-1410). This is not unlike sucking coffee through those tiny plastic coffee stirrers they hand out on airplanes that, surprisingly, also happen to function as a straw. As a result we see no perf gain from using FOR/PFOR. I had hacked up a fix for this, described in my blog post at http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html I'm opening this issue to get that work to a committable point. So... I've worked out a new bulk-read API to address this performance bottleneck. It has some big changes over the current bulk-read API:
* You can now also bulk-read positions (but not payloads), but, I have yet to cutover positional queries.
* The buffer contains doc deltas, not absolute values, for docIDs and positions (freqs are absolute).
* Deleted docs are not filtered out.
* The doc freq buffers need not be aligned. For fixed intblock codecs (FOR/PFOR) they will be, but for varint codecs (Simple9/16, Group varint, etc.) they won't be.
It's still a work in progress...
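Two of the listed changes, delta-encoded docIDs and unfiltered deleted docs, shift work onto the consumer of the buffer. The sketch below shows roughly what that consumer-side loop could look like. It is not the API from the patch: `DocDeltaDecoder`, `liveDocs`, and the use of `java.util.BitSet` for deletions are all assumptions made for illustration, as is the convention that the first delta is relative to docID 0.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class DocDeltaDecoder {
    // Rebuilds absolute docIDs from deltas and skips documents marked deleted.
    static List<Integer> liveDocs(int[] deltas, int count, BitSet deleted) {
        List<Integer> out = new ArrayList<Integer>();
        int doc = 0; // assumed starting point: the first delta is relative to 0
        for (int i = 0; i < count; i++) {
            doc += deltas[i];         // running sum turns deltas into absolute docIDs
            if (!deleted.get(doc)) {  // the buffer is unfiltered, so the caller checks deletions
                out.add(doc);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet();
        deleted.set(5);
        // Deltas 3, 2, 7, 1 describe docs 3, 5, 12, 13; doc 5 is deleted.
        System.out.println(liveDocs(new int[] {3, 2, 7, 1}, 4, deleted));
    }
}
```

The `count` parameter is there because a reusable buffer is typically only partly filled on the last read.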
[jira] Commented: (LUCENE-2883) Consolidate Solr Lucene FunctionQuery into modules
[ https://issues.apache.org/jira/browse/LUCENE-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986569#action_12986569 ] Robert Muir commented on LUCENE-2883: - Wait, why again did we merge lucene and solr? This is crazy-talk. I don't see a single valid reason why queries should be in solr-only. Consolidate Solr Lucene FunctionQuery into modules - Key: LUCENE-2883 URL: https://issues.apache.org/jira/browse/LUCENE-2883 Project: Lucene - Java Issue Type: Task Components: Search Affects Versions: 4.0 Reporter: Simon Willnauer Fix For: 4.0 Spin-off from the [dev list | http://www.mail-archive.com/dev@lucene.apache.org/msg13261.html] -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2883) Consolidate Solr Lucene FunctionQuery into modules
[ https://issues.apache.org/jira/browse/LUCENE-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986571#action_12986571 ] Yonik Seeley commented on LUCENE-2883: -- bq. We can mark the classes as lucene.experimental? If they remain experimental I suppose, but lucene.internal would be a more accurate description. Consolidate Solr Lucene FunctionQuery into modules - Key: LUCENE-2883 URL: https://issues.apache.org/jira/browse/LUCENE-2883 Project: Lucene - Java Issue Type: Task Components: Search Affects Versions: 4.0 Reporter: Simon Willnauer Fix For: 4.0 Spin-off from the [dev list | http://www.mail-archive.com/dev@lucene.apache.org/msg13261.html] -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-445) XmlUpdateRequestHandler bad documents mid batch aborts rest of batch
[ https://issues.apache.org/jira/browse/SOLR-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986581#action_12986581 ] Grant Ingersoll commented on SOLR-445: -- This patch looks pretty reasonable from the details of the implementation, but I don't think it's quite ready for commit yet. First, we should be able to extend this to all that implement ContentStreamLoader (JSONLoader, CSVLoader) if they want it (it doesn't make sense for the SolrCell stuff). As I see it, we can do this by putting some base functionality into ContentStreamLoader which does what is done in this patch. I think we need two methods, one that handles the immediate error (takes in a StringBuilder and the info about the doc that failed) and decides whether to abort or buffer the error for later reporting depending on the configuration setting. I don't think the configuration of the item belongs in the UpdateHandler. Erik H. meant that it goes in the configuration of the /update RequestHandler in the config, not the DirectUpdateHandler2, as in {code}<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />{code} This config could be a request param just like any other (such that one could even override it per request, or set it via the defaults, appends, invariants). Also, I know it is tempting to do so, but please don't reformat the code in the patch. It slows down review significantly. In general, I try to reformat right before committing as do most committers. XmlUpdateRequestHandler bad documents mid batch aborts rest of batch Key: SOLR-445 URL: https://issues.apache.org/jira/browse/SOLR-445 Project: Solr Issue Type: Bug Components: update Affects Versions: 1.3 Reporter: Will Johnson Assignee: Grant Ingersoll Fix For: Next Attachments: SOLR-445-3_x.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, solr-445.xml, SOLR-445_3x.patch Has anyone run into the problem of handling bad documents / failures mid batch? I.e.:
<add>
  <doc> <field name="id">1</field> </doc>
  <doc> <field name="id">2</field> <field name="myDateField">I_AM_A_BAD_DATE</field> </doc>
  <doc> <field name="id">3</field> </doc>
</add>
Right now solr adds the first doc and then aborts. It would seem like it should either fail the entire batch or log a message/return a code and then continue on to add doc 3. Option 1 would seem to be much harder to accomplish and possibly require more memory while Option 2 would require more information to come back from the API. I'm about to dig into this but I thought I'd ask to see if anyone had any suggestions, thoughts or comments.
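The "abort or buffer" decision described in the comment can be sketched independently of Solr. Everything in the block below is hypothetical: `LenientBatchLoader`, its `abortOnError` flag, and the date check stand in for a real loader, a real configuration setting, and real field validation. Only the shape of `handleError` (a StringBuilder plus information about the failed doc) follows the comment.

```java
import java.util.ArrayList;
import java.util.List;

final class LenientBatchLoader {
    private final boolean abortOnError; // hypothetical configuration setting
    private final StringBuilder errors = new StringBuilder();
    private final List<String> added = new ArrayList<String>();

    LenientBatchLoader(boolean abortOnError) {
        this.abortOnError = abortOnError;
    }

    // Handles one failed document: abort the batch, or buffer the error for later reporting.
    void handleError(StringBuilder buffer, String docId, Exception cause) {
        if (abortOnError) {
            throw new IllegalArgumentException("aborting batch at doc " + docId, cause);
        }
        buffer.append("doc ").append(docId).append(": ").append(cause.getMessage()).append('\n');
    }

    // Reports everything that was buffered while the batch kept going.
    String bufferedErrors() {
        return errors.toString();
    }

    List<String> added() {
        return added;
    }

    // Each doc is {id, dateOrNull}; a stand-in for parsing the <add> batch.
    void load(String[][] docs) {
        for (String[] doc : docs) {
            try {
                addDoc(doc[0], doc[1]);
            } catch (IllegalArgumentException e) {
                handleError(errors, doc[0], e);
            }
        }
    }

    private void addDoc(String id, String date) {
        if (date != null && !date.matches("\\d{4}-\\d{2}-\\d{2}")) {
            throw new IllegalArgumentException("bad date '" + date + "'");
        }
        added.add(id);
    }

    public static void main(String[] args) {
        String[][] batch = {{"1", null}, {"2", "I_AM_A_BAD_DATE"}, {"3", null}};
        LenientBatchLoader loader = new LenientBatchLoader(false);
        loader.load(batch);
        System.out.println("added " + loader.added() + ", errors:\n" + loader.bufferedErrors());
    }
}
```

With `abortOnError` set to true the same batch stops at doc 2 after doc 1 has been added, which mirrors the behavior the issue reports; with false, docs 1 and 3 go in and the failure is reported afterwards.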
[jira] Commented: (SOLR-445) XmlUpdateRequestHandler bad documents mid batch aborts rest of batch
[ https://issues.apache.org/jira/browse/SOLR-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986584#action_12986584 ] Grant Ingersoll commented on SOLR-445: -- Oh, one other thing. You don't need to produce a 3.x patch. We can just do an SVN merge.
[jira] Resolved: (SOLR-2171) Using stats feature over a function, Function returning as a field value
[ https://issues.apache.org/jira/browse/SOLR-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved SOLR-2171. --- Resolution: Duplicate See SOLR-1298 Using stats feature over a function, Function returning as a field value Key: SOLR-2171 URL: https://issues.apache.org/jira/browse/SOLR-2171 Project: Solr Issue Type: New Feature Components: Schema and Analysis, search Environment: All Reporter: Tanguy Moal Priority: Minor In order to be able to take big advantage of the stats component, it would be great to be able to define a function as a field. Returning the result of a function as a virtual field for each document for example, would enable us to have a much more advanced use of the stats component. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-2010) Remove segments with all documents deleted in commit/flush/close of IndexWriter instead of waiting until a merge occurs.
[ https://issues.apache.org/jira/browse/LUCENE-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-2010. Resolution: Fixed Remove segments with all documents deleted in commit/flush/close of IndexWriter instead of waiting until a merge occurs. Key: LUCENE-2010 URL: https://issues.apache.org/jira/browse/LUCENE-2010 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Michael McCandless Fix For: 3.1, 4.0 Attachments: LUCENE-2010.patch I do not know if this is a bug in 2.9.0, but it seems that segments with all documents deleted are not automatically removed:
{noformat}
4 of 14: name=_dlo docCount=5
  compound=true
  hasProx=true
  numFiles=2
  size (MB)=0.059
  diagnostics = {java.version=1.5.0_21, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, os=SunOS, os.arch=amd64, java.vendor=Sun Microsystems Inc., os.version=5.10, source=flush}
  has deletions [delFileName=_dlo_1.del]
  test: open reader.OK [5 deleted docs]
  test: fields..OK [136 fields]
  test: field norms.OK [136 fields]
  test: terms, freq, prox...OK [1698 terms; 4236 terms/docs pairs; 0 tokens]
  test: stored fields...OK [0 total field count; avg ? fields per doc]
  test: term vectorsOK [0 total vector count; avg ? term/freq vector fields per doc]
{noformat}
Shouldn't such segments be removed automatically during the next commit/close of IndexWriter? *Mike McCandless:* Lucene doesn't actually short-circuit this case, ie, if every single doc in a given segment has been deleted, it will still merge it [away] like normal, rather than simply dropping it immediately from the index, which I agree would be a simple optimization. Can you open a new issue? I would think IW can drop such a segment immediately (ie not wait for a merge or optimize) on flushing new deletes.
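The optimization discussed here boils down to a check of delCount against docCount per segment. The following sketch shows that check in isolation; `SegmentPruner` and its `Seg` class are hypothetical and are not IndexWriter's data structures or the committed patch.

```java
import java.util.ArrayList;
import java.util.List;

final class SegmentPruner {
    // Hypothetical per-segment summary: name, total docs, deleted docs.
    static final class Seg {
        final String name;
        final int docCount;
        final int delCount;

        Seg(String name, int docCount, int delCount) {
            this.name = name;
            this.docCount = docCount;
            this.delCount = delCount;
        }
    }

    // Keeps only segments that still contain at least one live document.
    static List<Seg> dropFullyDeleted(List<Seg> segments) {
        List<Seg> kept = new ArrayList<Seg>();
        for (Seg s : segments) {
            if (s.delCount < s.docCount) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Seg> segments = new ArrayList<Seg>();
        segments.add(new Seg("_dlo", 5, 5)); // every doc deleted, as in the CheckIndex output
        segments.add(new Seg("_abc", 100, 3));
        for (Seg s : dropFullyDeleted(segments)) {
            System.out.println("keeping " + s.name);
        }
    }
}
```

Running such a filter when deletes are flushed, rather than waiting for a merge, is the behavior the issue asks for.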
[jira] Resolved: (SOLR-2177) Add More Facet demonstrations to the /browse example
[ https://issues.apache.org/jira/browse/SOLR-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved SOLR-2177. --- Resolution: Fixed Add More Facet demonstrations to the /browse example Key: SOLR-2177 URL: https://issues.apache.org/jira/browse/SOLR-2177 Project: Solr Issue Type: Improvement Components: Response Writers Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Trivial Attachments: SOLR-2177.patch, SOLR-2177.patch Demonstrate other faceting techniques in the /browse example: range, date, pivot, etc. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Closed: (LUCENE-805) New Lucene Demo
[ https://issues.apache.org/jira/browse/LUCENE-805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll closed LUCENE-805. -- Resolution: Won't Fix New Lucene Demo --- Key: LUCENE-805 URL: https://issues.apache.org/jira/browse/LUCENE-805 Project: Lucene - Java Issue Type: Improvement Components: Examples Reporter: Grant Ingersoll Priority: Minor The much maligned demo, while useful, could use a breath of fresh air. This issue is to start collecting requirements about what people would like to see in a demo and what they don't like in the current one. Ideas (not necessarily in order of importance): 1. More in-depth tutorial explaining indexing/searching 2. Multilingual support/demonstration 3. Better demonstration of querying capabilities: Spans, Phrases, Wildcards, Filters, sorting, etc. 4. Dealing with different content types and pointers to resources 5. Wiki use cases links -- I think it would be cool to solicit people to contribute use cases to the docs. 6. Demonstration of contrib packages, esp. Highlighter 7. Performance issues/factors/tradeoffs. Lucene lessons learned and best practices Advanced tutorials: 1. Hadoop + Lucene 2. Writing custom analyzers/filters/tokenizers 3. Changing Scoring 4. Payloads (when they are committed) Please contribute what else you would like to see. I may be able to address some of these issues for my ApacheCon talk, but not all of them. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Threading of JIRA e-mails in gmail?
Hi everyone, There's a fair bit of info on the internet about this; apparently Gmail groups by subject only, and JIRA includes varying content in an issue's subject depending on the action (comment, update, etc.). Did anybody find a solution to thread ALL of an issue's messages into a single thread (other than hacking through a proxy account and rewriting message subjects)? :) Dawid
Re: Threading of JIRA e-mails in gmail?
This is an awful problem! I made a Python script to work around this... it's kinda scary: it logs in (over IMAP), finds the messages, removes the old ones, and puts back new ones with the corrected subject line so that gmail groups them properly. If you want I can send the Python script... but it's pretty scary. If it has bugs it can delete your emails! And it requires you to put your IMAP credentials into a Python source... etc. I wish there were a cleaner solution :) Mike On Tue, Jan 25, 2011 at 2:14 PM, Dawid Weiss dawid.we...@gmail.com wrote: Hi everyone, There's a fair bit of info on the internet about this, apparently gmail groups by subject only and JIRA includes varying content in an issue's subject, depending on the action (comment, update, etc.). Did anybody find a solution to thread ALL of an issue's messages into a single thread (other than hacking through a proxy account and rewriting message subjects? :) Dawid
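For readers curious what the "corrected subject line" step might look like, here is a minimal sketch. It is not Mike's script, which was not posted to the list. It covers only the subject rewrite and deliberately leaves out the IMAP login, delete, and re-upload steps, since those are the parts that can destroy mail. The subject shape "[jira] <Action>: (<KEY>) <summary>" is an assumption taken from the subjects visible in this archive, and `normalize_subject` is a hypothetical name.

```python
import re

# Assumed subject shape, inferred from this archive only:
#   "[jira] <Action>: (<PROJECT-123>) <summary>"
# Optional leading "Re:" prefixes are tolerated and dropped.
JIRA_SUBJECT = re.compile(
    r"^(?:Re:\s*)*\[jira\]\s+[A-Za-z ]+:\s+\((?P<key>[A-Za-z]+-\d+)\)\s*(?P<summary>.*)$",
    re.IGNORECASE,
)


def normalize_subject(subject: str) -> str:
    """Drop the varying action word so every mail for an issue shares one subject.

    Subjects that do not match the assumed JIRA shape are returned unchanged.
    """
    match = JIRA_SUBJECT.match(subject.strip())
    if match is None:
        return subject
    key = match.group("key").upper()
    summary = match.group("summary").strip()
    return f"[jira] ({key}) {summary}".rstrip()


if __name__ == "__main__":
    for raw in (
        "[jira] Closed: (LUCENE-805) New Lucene Demo",
        "[jira] Commented: (LUCENE-805) New Lucene Demo",
        "Threading of JIRA e-mails in gmail?",
    ):
        print(normalize_subject(raw))
```

With this rewrite, the "Closed" and "Commented" notifications for LUCENE-805 end up with the same subject, which is what subject-based grouping would need; whether Gmail then threads them together depends on its grouping behaviour, which the thread only describes as "apparently" subject-based.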
[jira] Commented: (LUCENE-1574) PooledSegmentReader, pools SegmentReader underlying byte arrays
[ https://issues.apache.org/jira/browse/LUCENE-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986630#action_12986630 ] Michael McCandless commented on LUCENE-1574: I've been testing on a 25M doc index (all of en Wikipedia, at least as of March 2010). Yes, I think allocating a big BitVector, running System.arraycopy, and destroying it is likely a fairly low cost compared to Lucene resolving the deleted term, indexing the doc, flushing the tiny segment, etc. PooledSegmentReader, pools SegmentReader underlying byte arrays --- Key: LUCENE-1574 URL: https://issues.apache.org/jira/browse/LUCENE-1574 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 4.0 Attachments: LUCENE-1574.patch Original Estimate: 168h Remaining Estimate: 168h PooledSegmentReader pools the underlying byte arrays of deleted docs and norms for realtime search. It is designed for use with IndexReader.clone, which can create many copies of byte arrays that are all the same length for a given segment. When pooled, they can be reused, which could save memory. Do we want to benchmark the memory usage of PooledSegmentReader against plain GC? Many times GC is enough for these smaller objects.
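The pooling idea in the issue description (arrays of one fixed length per segment, handed back and reused instead of reallocated) can be sketched roughly as below. This is an illustration only, written in Python for brevity; it is not the code in LUCENE-1574.patch, and `ByteArrayPool`, `acquire`, and `release` are hypothetical names.

```python
from collections import defaultdict


class ByteArrayPool:
    """Hypothetical length-keyed pool; not the API from LUCENE-1574.patch."""

    def __init__(self):
        # Maps array length -> list of idle arrays of exactly that length.
        self._free = defaultdict(list)

    def acquire(self, length: int) -> bytearray:
        """Return a zeroed array of the given length, reusing an idle one if possible."""
        idle = self._free[length]
        if idle:
            buf = idle.pop()
            buf[:] = bytes(length)  # clear stale contents in place, same object
            return buf
        return bytearray(length)

    def release(self, buf: bytearray) -> None:
        """Hand an array back so a later acquire of the same length can reuse it."""
        self._free[len(buf)].append(buf)


if __name__ == "__main__":
    pool = ByteArrayPool()
    first = pool.acquire(8)
    first[0] = 1
    pool.release(first)
    second = pool.acquire(8)
    print(second is first, second == bytearray(8))
```

Whether a pool like this beats simply letting the garbage collector reclaim the arrays is exactly the open question raised in the issue; the sketch shows the mechanism, not a measured benefit.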