Re: [Lucene.Net] Contribution
thank you stefan, will do so. best regards, danijel

On Fri, Sep 23, 2011 at 6:03 AM, Stefan Bodewig bode...@apache.org wrote:
> On 2011-09-22, Danijel Kecman wrote:
>> i would like to contribute.
> Welcome Danijel. The best way to start contributing is by looking at the issues in JIRA: pick one and start providing patches there, as well as engaging in discussion on this list. Cheers, Stefan
[JENKINS] Solr-trunk - Build # 1647 - Failure
Build: https://builds.apache.org/job/Solr-trunk/1647/

1 test failed.

REGRESSION: org.apache.solr.search.TestRealTimeGet.testStressGetRealtime

Error Message:
java.lang.AssertionError: Some threads threw uncaught exceptions!

Stack Trace:
java.lang.RuntimeException: java.lang.AssertionError: Some threads threw uncaught exceptions!
	at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:739)
	at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:89)
	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:149)
	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:51)
	at org.apache.lucene.util.LuceneTestCase.checkUncaughtExceptionsAfter(LuceneTestCase.java:767)
	at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:711)

Build Log (for compile errors):
[...truncated 28831 lines...]

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128744#comment-13128744 ]

Chris Male commented on LUCENE-1536:
------------------------------------

Is the only question mark remaining around the BooleanWeight work? If so, I think it's definitely worth examining that in a wider, separate issue after this is committed.

if a filter can support random access API, we should use it
-----------------------------------------------------------

                Key: LUCENE-1536
                URL: https://issues.apache.org/jira/browse/LUCENE-1536
            Project: Lucene - Java
         Issue Type: Improvement
         Components: core/search
   Affects Versions: 2.4
           Reporter: Michael McCandless
           Assignee: Michael McCandless
           Priority: Minor
             Labels: gsoc2011, lucene-gsoc-11, mentor
            Fix For: 4.0
        Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch (multiple revisions), LUCENE-1536.patch (multiple revisions), LUCENE-1536_hack.patch, changes-yonik-uwe.patch, luceneutil.patch

I ran some performance tests, comparing applying a filter via the random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to an iterator was a very sizable performance hit.

Some notes on the test:

* Index is the first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad-core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
* I test across multiple queries. 1-X means an OR query, e.g. 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, i.e. 1 AND 2 AND 3 AND 4. "u s" means "united states" (phrase search).
* I test with multiple filter densities: 0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control).
* Method "high" means I use the random-access filter API in IndexSearcher's main loop. Method "low" means I use the random-access filter API down in SegmentTermDocs (just like deleted docs today).
* Baseline (QPS) is current trunk, where the filter is applied as an iterator up high (i.e. in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
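The "high" vs. iterator distinction above is the heart of the benchmark. As a rough, self-contained illustration (not Lucene's actual API; the method and class names here are invented for the sketch), the two ways of applying a filter to a stream of matching docs look like this:

```java
import java.util.BitSet;

public class FilterAccessDemo {
    // Random-access style: ask the filter directly whether each candidate
    // doc is allowed -- an O(1) membership test per doc.
    static int countHitsRandomAccess(int[] matchingDocs, BitSet filterBits) {
        int hits = 0;
        for (int doc : matchingDocs) {
            if (filterBits.get(doc)) {
                hits++;
            }
        }
        return hits;
    }

    // Iterator style: advance two sorted doc streams (query matches and
    // filter's set bits) in lockstep, keeping docs present in both.
    static int countHitsIterator(int[] matchingDocs, BitSet filterBits) {
        int hits = 0;
        int filterDoc = filterBits.nextSetBit(0);
        for (int doc : matchingDocs) {
            while (filterDoc != -1 && filterDoc < doc) {
                filterDoc = filterBits.nextSetBit(filterDoc + 1);
            }
            if (filterDoc == doc) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        BitSet filter = new BitSet();
        filter.set(2); filter.set(3); filter.set(7);
        int[] matches = {1, 3, 5, 7};
        System.out.println(countHitsRandomAccess(matches, filter)); // docs 3 and 7 pass -> 2
        System.out.println(countHitsIterator(matches, filter));     // same answer
    }
}
```

Both give the same answer; the benchmark's point is that for dense filters the constant-time `get(doc)` check tends to be much cheaper than advancing a second iterator, which is why deletions (a very dense "filter") regressed when switched to the iterator form.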
[jira] [Commented] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128749#comment-13128749 ]

Uwe Schindler commented on LUCENE-1536:
---------------------------------------

bq. Is the only question mark remaining around the BooleanWeight work? If so, I think its definitely worth examining that in a wider separate issue after this is committed.

The patch requests scorers always in order for now, so BooleanWeight is not mixed up for different segments. This is no different than in current trunk, as Scorers are always requested in order if filters are used. The optimization in the future would be to use out-of-order scoring if random access bits are used.
[jira] [Commented] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128754#comment-13128754 ]

Chris Male commented on LUCENE-1536:
------------------------------------

I was more referring to Robert's comment:

bq. It seems to me these parameters (topLevel/scoresInOrder) really shouldn't be parameters to weight.scorer()!
[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments
[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128762#comment-13128762 ]

sebastian L. commented on LUCENE-3440:
--------------------------------------

Hi Koji,

the patch doesn't work because of https://issues.apache.org/jira/browse/LUCENE-3513.

bq. And I found a lot of test errors...

Frankly, I didn't run the tests because I thought the changes provided with the last patch shouldn't affect the original behavior. I'll have a look into it, but this may take some time, due to the fact that I have no knowledge of the test framework.

FastVectorHighlighter: IDF-weighted terms for ordered fragments
---------------------------------------------------------------

                Key: LUCENE-3440
                URL: https://issues.apache.org/jira/browse/LUCENE-3440
            Project: Lucene - Java
         Issue Type: Improvement
         Components: modules/highlighter
   Affects Versions: 3.5, 4.0
           Reporter: sebastian L.
           Priority: Minor
             Labels: FastVectorHighlighter
            Fix For: 3.5, 4.0
        Attachments: LUCENE-3.5-SNAPSHOT-3440-8.patch, LUCENE-3440.patch, LUCENE-4.0-SNAPSHOT-3440-9.patch, weight-vs-boost_table01.html, weight-vs-boost_table02.html

The FastVectorHighlighter uses an equal weight for every term found in a fragment, which causes a higher ranking for fragments with a high number of words (or, in the worst case, a high number of very common words) than for fragments that contain *all* of the terms used in the original query. This patch provides ordered fragments with IDF-weighted terms:

total weight = total weight + IDF for unique term per fragment * boost of query

The ranking formula should be the same as, or at least similar to, the one used in org.apache.lucene.search.highlight.QueryTermScorer. The patch is simple, but it works for us.

Some ideas:
- A better approach would be moving the whole fragment scoring into a separate class.
- Switch scoring via parameter.
- Exact phrases should be given an even better score, regardless of whether a phrase query was executed or not.
- The edismax/dismax parameters pf, ps, and pf^boost should be observed, and corresponding fragments should be ranked higher.
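The weighting formula in the description ("total weight += IDF for unique term per fragment * boost of query") can be sketched in a few lines. This is an illustrative standalone version, not the patch's actual code; in particular the idf formula below (the classic 1 + ln(numDocs / (docFreq + 1)) shape) is an assumption:

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class FragmentScoreDemo {
    // Classic Lucene-style idf; the exact formula is an assumption here.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Sum idf * boost once per *unique* matched term in the fragment, so a
    // fragment covering many distinct query terms outranks one that merely
    // repeats a single very common term.
    static double fragmentWeight(String[] matchedTerms, Map<String, Integer> docFreqs,
                                 int numDocs, double queryBoost) {
        Set<String> unique = new LinkedHashSet<>();
        for (String t : matchedTerms) {
            unique.add(t);
        }
        double total = 0.0;
        for (String t : unique) {
            total += idf(docFreqs.getOrDefault(t, 0), numDocs) * queryBoost;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> df = Map.of("lucene", 50, "the", 90_000);
        int numDocs = 100_000;
        // The rare term contributes far more weight than the common one,
        // and repeating "the" adds nothing because terms count once each.
        System.out.println(fragmentWeight(new String[]{"lucene", "the", "the"}, df, numDocs, 1.0));
    }
}
```

Counting each unique term once is what prevents the failure mode the issue describes, where a fragment stuffed with one common word outranks a fragment containing all the query terms.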
[jira] [Commented] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128767#comment-13128767 ]

Uwe Schindler commented on LUCENE-1536:
---------------------------------------

Yes, that should be sorted out in another issue. We have a working fix; the rest is optimization and unrelated API changes.
[jira] [Commented] (SOLR-2842) Re-factor UpdateChain and UpdateProcessor interfaces
[ https://issues.apache.org/jira/browse/SOLR-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128793#comment-13128793 ]

Jan Høydahl commented on SOLR-2842:
-----------------------------------

Some valid points there. I thought I saw a possibility for a generalization that would help solve SOLR-1526, but wanted to flesh out the feasibility here. So far I do not see any example other than Tika extraction which could really benefit from being done client-side. There may be others, but perhaps not enough to justify this change.

Another option for SOLR-1526 could be to provide a ClientExtractingUpdateProcessorFactory class which instantiates the ExtractingUpdateProcessor for client-side use. Then, if other processors are useful on the client side as well, people simply write a Client factory for them?

Re-factor UpdateChain and UpdateProcessor interfaces
----------------------------------------------------

                Key: SOLR-2842
                URL: https://issues.apache.org/jira/browse/SOLR-2842
            Project: Solr
         Issue Type: Improvement
         Components: update
           Reporter: Jan Høydahl

The UpdateChain's main task is to send SolrInputDocuments through a chain of UpdateRequestProcessors in order to transform them in some way and then (typically) index them. This generic pipeline concept would also be useful on the client side (SolrJ), so that we could choose to do part or all of the processing on the client. The most prominent use case is extracting text (Tika) from large binary documents residing on local storage on the client(s). Streaming hundreds of MB over to Solr for processing is not efficient. See SOLR-1526.

We're already implementing Tika as an UpdateProcessor in SOLR-1763, and what would be more natural than reusing this - and any other processor - on the client side? However, for this to be possible, some interfaces need to change slightly.
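The pipeline idea being discussed (run the same ordered chain of document processors on either the client or the server) can be sketched as follows. This is a hypothetical, self-contained illustration of the concept only; the interface and method names are invented and are not Solr's actual UpdateRequestProcessor API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of SOLR-2842's idea: a document-processing chain
// with no server-side dependencies, so the same processors could run in
// SolrJ before documents ever cross the wire.
public class ClientChainDemo {
    interface UpdateProcessor {
        Map<String, Object> process(Map<String, Object> doc);
    }

    // Run each processor in order, feeding each one's output to the next,
    // analogous to Solr's UpdateRequestProcessorChain.
    static Map<String, Object> runChain(List<UpdateProcessor> chain, Map<String, Object> doc) {
        for (UpdateProcessor p : chain) {
            doc = p.process(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        // e.g. a client-side "extraction" step that derives a field locally,
        // standing in for something heavy like Tika text extraction
        UpdateProcessor extract = doc -> {
            doc.put("content_length", String.valueOf(doc.get("content")).length());
            return doc;
        };
        Map<String, Object> doc = new HashMap<>();
        doc.put("content", "hello world");
        System.out.println(runChain(List.of(extract), doc).get("content_length")); // 11
    }
}
```

The design point under debate is exactly this decoupling: if a processor only depends on an interface like the one above rather than on server-side request objects, a client-side factory becomes trivial to write.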
[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments
[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128795#comment-13128795 ]

Koji Sekiguchi commented on LUCENE-3440:
----------------------------------------

Hi sebastian,

{quote}
Frankly, I didn't run the tests because I thought the changes provided with the last patch shouldn't affect the original behavior. I'll have a look into it. But this may take some time, due to the fact that I have no knowledge about the test-framework.
{quote}

Ok, no problem. I'll see the test case (hopefully next week or so). But can you take care of the following to go forward?

{quote}
Ah, sebastian, I think you needed to check "Grant license to ASF for inclusion in ASF works" when you attach your patch. Can you remove the latest patches and reattach them with that flag? Thanks!
{quote}
[jira] [Updated] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments
[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-3440:
-----------------------------------

    Attachment:     (was: LUCENE-3440.patch)
[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments
[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128799#comment-13128799 ]

Koji Sekiguchi commented on LUCENE-3440:
----------------------------------------

I've removed my latest patch: it carried the "Grant license to ASF" flag, but that was not right, because it was based entirely on sebastian's patch, which had not been granted to ASF.
[jira] [Updated] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments
[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sebastian L. updated LUCENE-3440:
---------------------------------

    Attachment: LUCENE-4.0-SNAPSHOT-3440-9.patch
[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments
[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128802#comment-13128802 ]

sebastian L. commented on LUCENE-3440:
--------------------------------------

bq. Ah, sebastian, I think you needed to check "Grant license to ASF for inclusion in ASF works" when you attach your patch. Can you remove the latest patches and reattach them with that flag? Thanks!

Sorry, I forgot that. Done.
[jira] [Assigned] (LUCENE-3522) TermsFilter.getDocIdSet(context) NPE on missing field
[ https://issues.apache.org/jira/browse/LUCENE-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-3522:
------------------------------------------

    Assignee: Michael McCandless

TermsFilter.getDocIdSet(context) NPE on missing field
-----------------------------------------------------

                Key: LUCENE-3522
                URL: https://issues.apache.org/jira/browse/LUCENE-3522
            Project: Lucene - Java
         Issue Type: Bug
         Components: modules/other
   Affects Versions: 4.0
           Reporter: Dan Climan
           Assignee: Michael McCandless
           Priority: Minor
        Attachments: LUCENE-3522.patch

If the context does not contain the field for a term when calling TermsFilter.getDocIdSet(AtomicReaderContext context), then a NullPointerException is thrown, due to not checking for null Terms before getting the iterator.
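The fix described (check the per-field Terms lookup for null before asking it for an iterator) follows a common defensive pattern. A standalone analog, not Lucene's actual TermsFilter code (the class, method, and data-structure names below are invented for illustration):

```java
import java.util.BitSet;
import java.util.List;
import java.util.Map;

// Standalone analog of the LUCENE-3522 fix: when the requested field is
// absent from a segment, the per-field terms lookup returns null, and the
// caller must treat that as "no matches" instead of dereferencing it.
public class MissingFieldDemo {
    // field -> (term -> docs containing it); a toy stand-in for segment terms
    static BitSet docsMatching(Map<String, Map<String, List<Integer>>> index,
                               String field, String term, int maxDoc) {
        BitSet result = new BitSet(maxDoc);
        Map<String, List<Integer>> terms = index.get(field);
        if (terms == null) {
            return result; // field not in this segment: empty set, not an NPE
        }
        List<Integer> docs = terms.get(term);
        if (docs != null) {
            for (int d : docs) {
                result.set(d);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Map<String, List<Integer>>> index =
                Map.of("f1", Map.of("content", List.of(0)));
        // requesting the absent field "f2" used to NPE; now it yields an empty set
        System.out.println(docsMatching(index, "f2", "content", 1).cardinality()); // 0
        System.out.println(docsMatching(index, "f1", "content", 1).cardinality()); // 1
    }
}
```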
[jira] [Commented] (LUCENE-3522) TermsFilter.getDocIdSet(context) NPE on missing field
[ https://issues.apache.org/jira/browse/LUCENE-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13128812#comment-13128812 ] Shai Erera commented on LUCENE-3522: Good catch Dan! Patch looks good, but I have some comments about the test: # You don't close the directories at the end of it, and the test fails because of that. # I think it can be simplified to create just one Directory, with f1:content, and request f2:content. I actually tried it and the test still fails (NPE reproduced without your fix). Here is the modified, more compact, test: {code}
public void testMissingField() throws Exception {
  // LUCENE-3522: if the requested field does not exist in the index, TermsFilter threw NPE.
  Directory dir = newDirectory();
  RandomIndexWriter writer = new RandomIndexWriter(random, dir);
  Document doc = new Document();
  doc.add(newField("f1", "content", StringField.TYPE_STORED));
  writer.addDocument(doc);
  IndexReader reader = writer.getReader();
  writer.close();
  TermsFilter tf = new TermsFilter();
  tf.addTerm(new Term("f2", "content"));
  FixedBitSet bits = (FixedBitSet) tf.getDocIdSet(reader.getTopReaderContext().leaves()[0]);
  assertTrue("Must be == 0", bits.cardinality() == 0);
  reader.close();
  dir.close();
}
{code} Would you mind changing the test case to this compact one? Or did you want to demonstrate something else with the two readers? TermsFilter.getDocIdSet(context) NPE on missing field - Key: LUCENE-3522 URL: https://issues.apache.org/jira/browse/LUCENE-3522 Project: Lucene - Java Issue Type: Bug Components: modules/other Affects Versions: 4.0 Reporter: Dan Climan Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-3522.patch If the context does not contain the field for a term when calling TermsFilter.getDocIdSet(AtomicReaderContext context) then a NullPointerException is thrown due to not checking for null Terms before getting the iterator. -- This message is automatically generated by JIRA. 
[jira] [Resolved] (LUCENE-3522) TermsFilter.getDocIdSet(context) NPE on missing field
[ https://issues.apache.org/jira/browse/LUCENE-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-3522. Resolution: Fixed Fix Version/s: 4.0 Thanks Dan! I committed to trunk and backported the test case to 3.x. I had to add missing rd1/2.close() at the end of the test case. TermsFilter.getDocIdSet(context) NPE on missing field - Key: LUCENE-3522 URL: https://issues.apache.org/jira/browse/LUCENE-3522 Project: Lucene - Java Issue Type: Bug Components: modules/other Affects Versions: 4.0 Reporter: Dan Climan Assignee: Michael McCandless Priority: Minor Fix For: 4.0 Attachments: LUCENE-3522.patch If the context does not contain the field for a term when calling TermsFilter.getDocIdSet(AtomicReaderContext context) then a NullPointerException is thrown due to not checking for null Terms before getting iterator. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3522) TermsFilter.getDocIdSet(context) NPE on missing field
[ https://issues.apache.org/jira/browse/LUCENE-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13128816#comment-13128816 ] Michael McCandless commented on LUCENE-3522: Dan/Shai feel free to fix the test case if you want... didn't see your comments here until after I committed! TermsFilter.getDocIdSet(context) NPE on missing field - Key: LUCENE-3522 URL: https://issues.apache.org/jira/browse/LUCENE-3522 Project: Lucene - Java Issue Type: Bug Components: modules/other Affects Versions: 4.0 Reporter: Dan Climan Assignee: Michael McCandless Priority: Minor Fix For: 4.0 Attachments: LUCENE-3522.patch If the context does not contain the field for a term when calling TermsFilter.getDocIdSet(AtomicReaderContext context) then a NullPointerException is thrown due to not checking for null Terms before getting iterator. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3522) TermsFilter.getDocIdSet(context) NPE on missing field
[ https://issues.apache.org/jira/browse/LUCENE-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-3522: --- Fix Version/s: 3.5 Added 3.5 as a fix version as well TermsFilter.getDocIdSet(context) NPE on missing field - Key: LUCENE-3522 URL: https://issues.apache.org/jira/browse/LUCENE-3522 Project: Lucene - Java Issue Type: Bug Components: modules/other Affects Versions: 4.0 Reporter: Dan Climan Assignee: Michael McCandless Priority: Minor Fix For: 3.5, 4.0 Attachments: LUCENE-3522.patch If the context does not contain the field for a term when calling TermsFilter.getDocIdSet(AtomicReaderContext context) then a NullPointerException is thrown due to not checking for null Terms before getting iterator. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3522) TermsFilter.getDocIdSet(context) NPE on missing field
[ https://issues.apache.org/jira/browse/LUCENE-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13128820#comment-13128820 ] Michael McCandless commented on LUCENE-3522: OK that's fine but technically this bug didn't exist on 3.5... I only backported the test case. TermsFilter.getDocIdSet(context) NPE on missing field - Key: LUCENE-3522 URL: https://issues.apache.org/jira/browse/LUCENE-3522 Project: Lucene - Java Issue Type: Bug Components: modules/other Affects Versions: 4.0 Reporter: Dan Climan Assignee: Michael McCandless Priority: Minor Fix For: 3.5, 4.0 Attachments: LUCENE-3522.patch If the context does not contain the field for a term when calling TermsFilter.getDocIdSet(AtomicReaderContext context) then a NullPointerException is thrown due to not checking for null Terms before getting iterator. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[JENKINS-MAVEN] Lucene-Solr-Maven-3.x #271: POMs out of sync
Build: https://builds.apache.org/job/Lucene-Solr-Maven-3.x/271/ No tests ran. Build Log (for compile errors): [...truncated 19697 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-434) Lucene database bindings
[ https://issues.apache.org/jira/browse/LUCENE-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13128904#comment-13128904 ] Mark Harwood commented on LUCENE-434: - Note: Lucene 4.0's DocValueFields (aka column stride fields) will be of benefit here for efficient KeyMap implementations - e.g. updating the load routines in CachedKeyMapImpl. An overhaul is also needed, e.g. to support segment-level mappings. Lucene database bindings Key: LUCENE-434 URL: https://issues.apache.org/jira/browse/LUCENE-434 Project: Lucene - Java Issue Type: New Feature Components: modules/other Reporter: Mark Harwood Priority: Minor Attachments: LuceneDb.zip Code and examples for embedding Lucene in HSQLDB and Derby relational databases. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[JENKINS-MAVEN] Lucene-Solr-Maven-trunk #268: POMs out of sync
Build: https://builds.apache.org/job/Lucene-Solr-Maven-trunk/268/ No tests ran. Build Log (for compile errors): [...truncated 18924 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[JENKINS] Lucene-Solr-tests-only-trunk - Build # 10867 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/10867/ 1 tests failed. REGRESSION: org.apache.solr.cloud.ZkSolrClientTest.testReconnect Error Message: Node does not exist, but it should Stack Trace: junit.framework.AssertionFailedError: Node does not exist, but it should at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:149) at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:51) at org.apache.solr.cloud.ZkSolrClientTest.testReconnect(ZkSolrClientTest.java:149) at org.apache.lucene.util.LuceneTestCase$2$1.evaluate(LuceneTestCase.java:610) Build Log (for compile errors): [...truncated 7783 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2841) Scriptable UpdateRequestChain
[ https://issues.apache.org/jira/browse/SOLR-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13128966#comment-13128966 ] Jan Høydahl commented on SOLR-2841: --- The DSL could be based on Groovy, JRuby, Jython or JS. Here's my rough sketch of a Groovy example from 2823: ...This approach also solves another wish of mine, namely being able to define chains outside of solrconfig.xml. Logically, configuring schema and document processing is done by a content guy, but configuring solrconfig.xml is done by the hardware/operations guys. Imagine a solr/conf/pipeline.groovy defined in solrconfig.xml: {code:xml}
<updateProcessorChain class="solr.ScriptedUpdateProcessorChainFactory" file="updateprocessing.groovy" />
{code} updateprocessing.groovy: {code}
chain("simple") {
  process("langid")
  process("copyfield")
  chain("logAndRun")
}

chain("moreComplex") {
  process("langid")
  if (doc.getFieldValue("employees") > 10)
    process("copyfield")
  else
    chain("myOtherProcesses")
  doc.deleteField("title")
  chain("logAndRun")
}

chain("logAndRun") {
  process("log")
  process("run")
}

processor("langid") {
  class = "solr.LanguageIdentifierUpdateProcessorFactory"
  config("langid.fl", "title,body")
  config("langid.langField", "language")
  config("map", "true")
}

processor("copyfield") {
  script = "copyfield.groovy"
  config("from", "title")
  config("to", "title_en")
}
{code} I don't know what it takes to code such a thing, but if we had it, I'd never go back to defining pipelines in XML :) Scriptable UpdateRequestChain - Key: SOLR-2841 URL: https://issues.apache.org/jira/browse/SOLR-2841 Project: Solr Issue Type: New Feature Components: update Reporter: Jan Høydahl UpdateProcessorChains must currently be defined with XML in solrconfig.xml. We should explore a scriptable chain implementation with a DSL that allows for full flexibility. The first step would be to make UpdateChain implementations pluggable in solrconfig.xml, for backward compat support. 
Benefits and possibilities with a Scriptable UpdateChain: * A compact DSL for defining Processors and Chains (Workflows would be a better, less limited term here) * Keeping update processor config separate from solrconfig.xml gives better separation of roles * Use this as an opportunity to natively support scripting-language Processors (ideas from SOLR-1725) This issue is spun off from SOLR-2823. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2155) Geospatial search using geohash prefixes
[ https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13128980#comment-13128980 ] Olivier Jacquet commented on SOLR-2155: --- I just wanted to mention another use case for multivalued point fields, since everyone is always talking about this in a location context. The PointType can also be used to categorize other stuff. In my case we're storing qualifications of persons as a tuple of experience, function and skill (e.g. senior, developer, java), which are internally represented by numerical ids. Now with Solr I would like to be able to do the query: return everything that is a "java developer", which would be the same as asking for all points on a certain line. Geospatial search using geohash prefixes Key: SOLR-2155 URL: https://issues.apache.org/jira/browse/SOLR-2155 Project: Solr Issue Type: Improvement Reporter: David Smiley Assignee: Grant Ingersoll Attachments: GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, SOLR-2155_GeoHashPrefixFilter_with_sorting_no_poly.patch, SOLR.2155.p3.patch, SOLR.2155.p3tests.patch, Solr2155-1.0.2-project.zip, Solr2155-1.0.3-project.zip, Solr2155-for-1.0.2-3.x-port.patch There currently isn't a solution in Solr for doing geospatial filtering on documents that have a variable number of points. This scenario occurs when there is location extraction (i.e. via a gazetteer) occurring on free text. None, one, or many geospatial locations might be extracted from any given document and users want to limit their search results to those occurring in a user-specified area. I've implemented this by furthering the GeoHash based work in Lucene/Solr with a geohash prefix based filter. A geohash refers to a lat-lon box on the earth. Each successive character added further subdivides the box into a 4x8 (or 8x4 depending on the even/odd length of the geohash) grid. 
The first step in this scheme is figuring out which geohash grid squares cover the user's search query. I've added various extra methods to GeoHashUtils (and added tests) to assist with this. The next step is an actual Lucene Filter, GeoHashPrefixFilter, that uses these geohash prefixes in TermsEnum.seek() to skip to relevant grid squares in the index. Once a matching geohash grid is found, the points therein are compared against the user's query to see if they match. I created an abstraction GeoShape extended by subclasses named PointDistance... and CartesianBox to support different queried shapes so that the filter need not care about these details. This work was presented at LuceneRevolution in Boston on October 8th. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
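The prefix property the filter relies on (each added geohash character subdivides the parent cell, so a short geohash is a prefix of every longer geohash inside it) can be seen in a minimal encoder. This is a stand-alone illustration of the standard geohash base-32 encoding, not the GeoHashUtils code from the patch.

```java
// Minimal geohash encoder illustrating the prefix property used by
// GeoHashPrefixFilter: the geohash of a point at precision n is a prefix
// of its geohash at any precision > n, so seeking to a prefix in the
// terms dictionary covers every point inside that grid cell.
public class GeoHashSketch {
    static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    static String encode(double lat, double lon, int precision) {
        double latMin = -90, latMax = 90, lonMin = -180, lonMax = 180;
        StringBuilder hash = new StringBuilder();
        boolean evenBit = true; // geohash interleaves lon, lat, lon, ...
        int bit = 0, ch = 0;
        while (hash.length() < precision) {
            if (evenBit) {
                double mid = (lonMin + lonMax) / 2;
                if (lon >= mid) { ch = (ch << 1) | 1; lonMin = mid; }
                else            { ch = (ch << 1);     lonMax = mid; }
            } else {
                double mid = (latMin + latMax) / 2;
                if (lat >= mid) { ch = (ch << 1) | 1; latMin = mid; }
                else            { ch = (ch << 1);     latMax = mid; }
            }
            evenBit = !evenBit;
            if (++bit == 5) { // 5 bits per base-32 character
                hash.append(BASE32.charAt(ch));
                bit = 0;
                ch = 0;
            }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        String coarse = encode(42.36, -71.06, 4);
        String fine = encode(42.36, -71.06, 8);
        // the longer hash refines, never replaces, the shorter one
        System.out.println(fine.startsWith(coarse)); // prints true
    }
}
```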
[jira] [Issue Comment Edited] (SOLR-2155) Geospatial search using geohash prefixes
[ https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13128980#comment-13128980 ] Olivier Jacquet edited comment on SOLR-2155 at 10/17/11 4:30 PM: - I just wanted to mention another use case for multivalued point fields, since everyone is always talking about this in a location context. The PointType can also be used to categorize other stuff. In my case we're storing qualifications of persons as a tuple of experience, function and skill (e.g. senior, developer, java), which are internally represented by numerical ids. Now with Solr I would like to be able to do the query: return everything that is a "java developer", which would be the same as asking for all points on a certain plane. was (Author: ojacquet): I just wanted to mention another use case for multivalued point fields, since everyone is always talking about this in a location context. The PointType can also be used to categorize other stuff. In my case we're storing qualifications of persons as a tuple of experience, function and skill (e.g. senior, developer, java), which are internally represented by numerical ids. Now with Solr I would like to be able to do the query: return everything that is a "java developer", which would be the same as asking for all points on a certain line. Geospatial search using geohash prefixes Key: SOLR-2155 URL: https://issues.apache.org/jira/browse/SOLR-2155 Project: Solr Issue Type: Improvement Reporter: David Smiley Assignee: Grant Ingersoll Attachments: GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, SOLR-2155_GeoHashPrefixFilter_with_sorting_no_poly.patch, SOLR.2155.p3.patch, SOLR.2155.p3tests.patch, Solr2155-1.0.2-project.zip, Solr2155-1.0.3-project.zip, Solr2155-for-1.0.2-3.x-port.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Issue Comment Edited] (SOLR-2155) Geospatial search using geohash prefixes
[ https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13128980#comment-13128980 ] Olivier Jacquet edited comment on SOLR-2155 at 10/17/11 4:32 PM: - I just wanted to mention another use case for multivalued point fields, since everyone is always talking about this in a location context. The PointType can also be used to categorize other stuff. In my case we're storing qualifications of persons as a tuple of experience, function and skill (e.g. senior, developer, java), which are internally represented by numerical ids. Now with Solr I would like to be able to do the query: return everything that is a "java developer", which would be the same as asking for all points on a certain line. was (Author: ojacquet): I just wanted to mention another use case for multivalued point fields, since everyone is always talking about this in a location context. The PointType can also be used to categorize other stuff. In my case we're storing qualifications of persons as a tuple of experience, function and skill (e.g. senior, developer, java), which are internally represented by numerical ids. Now with Solr I would like to be able to do the query: return everything that is a "java developer", which would be the same as asking for all points on a certain plane. Geospatial search using geohash prefixes Key: SOLR-2155 URL: https://issues.apache.org/jira/browse/SOLR-2155 Project: Solr Issue Type: Improvement Reporter: David Smiley Assignee: Grant Ingersoll Attachments: GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, SOLR-2155_GeoHashPrefixFilter_with_sorting_no_poly.patch, SOLR.2155.p3.patch, SOLR.2155.p3tests.patch, Solr2155-1.0.2-project.zip, Solr2155-1.0.3-project.zip, Solr2155-for-1.0.2-3.x-port.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2155) Geospatial search using geohash prefixes
[ https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13129002#comment-13129002 ] David Smiley commented on SOLR-2155: Olivier: Your scenario is interesting, but I wouldn't recommend spatial for that. A key part of spatial is the use of a numerical range; in your case the values are discrete. Instead, I recommend you experiment with phrase queries, and if you are expert in Lucene, then span queries. As a toy hack example, imagine indexing each of these values in the form "senior developer java" (3 words, one for each part). We assume each value tokenizes as one token. Then search for "the developer java", in which "the" was substituted as a kind of wildcard for the first position, to find java developers at all levels of experience. "the" is a stopword and in effect creates a wildcard placeholder. If you search the solr-user list then you will see information on this topic. I've solved this problem in a different, more difficult way because my values were not single tokens, but based on the example you present, the solution I present here isn't bad. If you want to discuss this further, I recommend the solr-user list. Geospatial search using geohash prefixes Key: SOLR-2155 URL: https://issues.apache.org/jira/browse/SOLR-2155 Project: Solr Issue Type: Improvement Reporter: David Smiley Assignee: Grant Ingersoll Attachments: GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, SOLR-2155_GeoHashPrefixFilter_with_sorting_no_poly.patch, SOLR.2155.p3.patch, SOLR.2155.p3tests.patch, Solr2155-1.0.2-project.zip, Solr2155-1.0.3-project.zip, Solr2155-for-1.0.2-3.x-port.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
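The trick David Smiley describes, matching a fixed-position tuple with a "hole" at one position, can be modeled outside Lucene as a phrase constraint that only pins some positions. This is a stand-alone illustration of the idea (in Lucene the hole would come from the stopword's position gap in a PhraseQuery); all names here are illustrative.

```java
import java.util.*;

// Models the SOLR-2155 comment's suggestion: index each qualification as a
// fixed-position phrase ("senior developer java") and query with a wildcard
// hole at position 0 ("<any> developer java") to match all experience
// levels. In Lucene the unconstrained position comes from a stopword gap.
public class TupleMatchSketch {

    // constraint: position -> required token; unconstrained positions match anything
    static boolean matches(String[] indexedTuple, Map<Integer, String> constraint) {
        for (Map.Entry<Integer, String> e : constraint.entrySet()) {
            int pos = e.getKey();
            if (pos >= indexedTuple.length
                    || !indexedTuple[pos].equals(e.getValue())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "java developer of any seniority": positions 1 and 2 pinned, 0 free
        Map<Integer, String> javaDev = Map.of(1, "developer", 2, "java");
        System.out.println(matches(new String[] {"senior", "developer", "java"}, javaDev)); // prints true
        System.out.println(matches(new String[] {"junior", "developer", "java"}, javaDev)); // prints true
        System.out.println(matches(new String[] {"senior", "developer", "php"}, javaDev));  // prints false
    }
}
```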
[jira] [Commented] (SOLR-2842) Re-factor UpdateChain and UpdateProcessor interfaces
[ https://issues.apache.org/jira/browse/SOLR-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13129030#comment-13129030 ] Ryan McKinley commented on SOLR-2842: - Rather than using UpdateProcessor directly, what about adding a simple interface like: {code:java} SolrInputDocument transform(SolrInputDocument) {code} and using simple bean getters/setters -- perhaps also respecting the 'aware' interfaces (SolrCoreAware, SchemaAware, ResourceLoaderAware). It seems like most of the custom things we would want to do only care about 'add' and don't care about commit, delete, merge, or rollback. Starting with a simple interface like this would give us lots of flexibility to integrate wherever it feels most appropriate -- client/server or any other pipeline framework (I've been using commons-pipeline with pretty reasonable success). Re-factor UpdateChain and UpdateProcessor interfaces Key: SOLR-2842 URL: https://issues.apache.org/jira/browse/SOLR-2842 Project: Solr Issue Type: Improvement Components: update Reporter: Jan Høydahl The UpdateChain's main task is to send SolrInputDocuments through a chain of UpdateRequestProcessors in order to transform them in some way and then (typically) index them. This generic pipeline concept would also be useful on the client side (SolrJ), so that we could choose to do part or all of the processing on the client. The most prominent use case is extracting text (Tika) from large binary documents residing on local storage on the client(s). Streaming hundreds of MB over to Solr for processing is not efficient. See SOLR-1526. We're already implementing Tika as an UpdateProcessor in SOLR-1763, and what would be more natural than reusing this - and any other processor - on the client side? However, for this to be possible, some interfaces need to change slightly. -- This message is automatically generated by JIRA. 
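The one-method interface Ryan sketches composes naturally into a chain that could run on either the client or the server side. Below is a stand-alone sketch of that composition; the class names are illustrative, and a plain `Map` stands in for `SolrInputDocument` so the example has no Solr dependency.

```java
import java.util.*;

// Sketch of the single-method transform interface proposed in the SOLR-2842
// discussion, and how such transforms compose into a chain. A plain Map
// stands in for SolrInputDocument; all names are illustrative.
public class TransformChainSketch {

    interface DocTransformer {
        Map<String, Object> transform(Map<String, Object> doc);
    }

    // A chain is itself just a transformer, so chains can nest.
    static DocTransformer chain(List<DocTransformer> stages) {
        return doc -> {
            Map<String, Object> current = doc;
            for (DocTransformer stage : stages) {
                current = stage.transform(current);
            }
            return current;
        };
    }

    static String demo() {
        DocTransformer copyField = doc -> {      // hypothetical copyfield stage
            Map<String, Object> out = new HashMap<>(doc);
            out.put("title_en", doc.get("title"));
            return out;
        };
        DocTransformer lowercaseTitle = doc -> { // hypothetical normalizing stage
            Map<String, Object> out = new HashMap<>(doc);
            out.put("title", doc.get("title").toString().toLowerCase(Locale.ROOT));
            return out;
        };
        Map<String, Object> result = chain(List.of(copyField, lowercaseTitle))
            .transform(new HashMap<>(Map.of("title", "Lucene")));
        return result.get("title") + " / " + result.get("title_en");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints lucene / Lucene
    }
}
```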
[JENKINS] Lucene-Solr-tests-only-trunk-java7 - Build # 647 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk-java7/647/ 1 tests failed. REGRESSION: org.apache.solr.client.solrj.embedded.LargeVolumeJettyTest.testMultiThreaded Error Message: java.lang.AssertionError: Some threads threw uncaught exceptions! Stack Trace: java.lang.RuntimeException: java.lang.AssertionError: Some threads threw uncaught exceptions! at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:739) at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:89) at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:149) at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:51) at org.apache.lucene.util.LuceneTestCase.checkUncaughtExceptionsAfter(LuceneTestCase.java:767) at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:711) Build Log (for compile errors): [...truncated 12202 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3522) TermsFilter.getDocIdSet(context) NPE on missing field
[ https://issues.apache.org/jira/browse/LUCENE-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-3522: --- Fix Version/s: (was: 3.5) Ah. I thought that we need the Fix Version to properly track which issues are part of a release. But you're right - if this bug didn't exist in 3.x, then we better not mark that it was fixed there. TermsFilter.getDocIdSet(context) NPE on missing field - Key: LUCENE-3522 URL: https://issues.apache.org/jira/browse/LUCENE-3522 Project: Lucene - Java Issue Type: Bug Components: modules/other Affects Versions: 4.0 Reporter: Dan Climan Assignee: Michael McCandless Priority: Minor Fix For: 4.0 Attachments: LUCENE-3522.patch If the context does not contain the field for a term when calling TermsFilter.getDocIdSet(AtomicReaderContext context) then a NullPointerException is thrown due to not checking for null Terms before getting iterator. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[JENKINS] Lucene-Solr-tests-only-trunk - Build # 10870 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/10870/

1 tests failed.

REGRESSION:  org.apache.solr.update.AutoCommitTest.testMaxDocs

Error Message:
should find one
query failed XPath: //result[@numFound=1]
xml response was:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">2</int></lst><result name="response" numFound="0" start="0"></result>
</response>

request was: start=0&q=id:14&qt=standard&rows=20&version=2.2

Stack Trace:
junit.framework.AssertionFailedError: should find one
query failed XPath: //result[@numFound=1]
xml response was:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">2</int></lst><result name="response" numFound="0" start="0"></result>
</response>

request was: start=0&q=id:14&qt=standard&rows=20&version=2.2
        at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:149)
        at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:51)
        at org.apache.solr.util.AbstractSolrTestCase.assertQ(AbstractSolrTestCase.java:262)
        at org.apache.solr.update.AutoCommitTest.testMaxDocs(AutoCommitTest.java:182)
        at org.apache.lucene.util.LuceneTestCase$2$1.evaluate(LuceneTestCase.java:610)

Build Log (for compile errors):
[...truncated 7776 lines...]
[jira] [Created] (SOLR-2843) WordBreakSpellChecker
WordBreakSpellChecker
---------------------
                Key: SOLR-2843
                URL: https://issues.apache.org/jira/browse/SOLR-2843
            Project: Solr
         Issue Type: Improvement
         Components: spellchecker
   Affects Versions: 3.5, 4.0
           Reporter: James Dyer
           Priority: Minor
            Fix For: 3.5, 4.0

A spellchecker that generates suggestions by combining two or more terms and/or breaking terms into multiple words. This would typically be used in addition to one of the existing spell checkers to get both traditional and word-break suggestions for the end user.
[jira] [Closed] (SOLR-2843) WordBreakSpellChecker
[ https://issues.apache.org/jira/browse/SOLR-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Dyer closed SOLR-2843.
----------------------------
    Resolution: Won't Fix

This is a LUCENE- issue, not a SOLR- issue.
[jira] [Created] (LUCENE-3523) WordBreakSpellChecker
WordBreakSpellChecker
---------------------
                Key: LUCENE-3523
                URL: https://issues.apache.org/jira/browse/LUCENE-3523
            Project: Lucene - Java
         Issue Type: Improvement
         Components: modules/spellchecker
   Affects Versions: 3.5, 4.0
           Reporter: James Dyer
           Priority: Minor
            Fix For: 3.5, 4.0
        Attachments: LUCENE-3523.patch

A spellchecker that generates suggestions by combining two or more terms and/or breaking terms into multiple words. This would typically be used in addition to one of the existing spell checkers to get both traditional and word-break suggestions for the end user.
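As a rough illustration of the two suggestion modes named in the description, here is a self-contained sketch (WordBreakSketch and its method names are invented, and a plain dictionary set stands in for the index; a real spellchecker would also weigh term frequencies and edit limits): break a term at every position and keep splits where both halves are known words, and combine two adjacent terms when their concatenation is a known word.

```java
import java.util.*;

class WordBreakSketch {
    // Break one term into two dictionary words, e.g. "helloworld" -> "hello world".
    static List<String> breakSuggestions(String term, Set<String> dict) {
        List<String> out = new ArrayList<>();
        for (int i = 1; i < term.length(); i++) {
            String left = term.substring(0, i), right = term.substring(i);
            if (dict.contains(left) && dict.contains(right)) {
                out.add(left + " " + right);
            }
        }
        return out;
    }

    // Combine two adjacent terms into one dictionary word, e.g. "air port" -> "airport".
    static List<String> combineSuggestions(String left, String right, Set<String> dict) {
        String joined = left + right;
        return dict.contains(joined)
                ? Collections.singletonList(joined)
                : Collections.<String>emptyList();
    }
}
```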
[jira] [Updated] (LUCENE-3523) WordBreakSpellChecker
[ https://issues.apache.org/jira/browse/LUCENE-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Dyer updated LUCENE-3523:
-------------------------------
    Attachment: LUCENE-3523.patch
[jira] [Commented] (SOLR-2842) Re-factor UpdateChain and UpdateProcessor interfaces
[ https://issues.apache.org/jira/browse/SOLR-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129112#comment-13129112 ]

Jan Høydahl commented on SOLR-2842:
-----------------------------------

Interesting thought, Ryan. I already commonly isolate processing in a similar method to simplify unit testing, so this is useful in several ways. Do you suggest letting the UpdateProcessor base class implement this interface? But you still need to construct and initialize the processors even if they are wrapped in the interface; thus my suggestion for a client-side version of the factory.

Re-factor UpdateChain and UpdateProcessor interfaces
----------------------------------------------------
                Key: SOLR-2842
                URL: https://issues.apache.org/jira/browse/SOLR-2842
            Project: Solr
         Issue Type: Improvement
         Components: update
           Reporter: Jan Høydahl

The UpdateChain's main task is to send SolrInputDocuments through a chain of UpdateRequestProcessors in order to transform them in some way and then (typically) index them. This generic pipeline concept would also be useful on the client side (SolrJ), so that we could choose to do parts or all of the processing on the client. The most prominent use case is extracting text (Tika) from large binary documents residing on local storage on the client(s). Streaming hundreds of MB over to Solr for processing is not efficient. See SOLR-1526. We're already implementing Tika as an UpdateProcessor in SOLR-1763, and what would be more natural than reusing this - and any other processor - on the client side? However, for this to be possible, some interfaces need to change slightly..
[JENKINS] Lucene-Solr-tests-only-trunk - Build # 10871 - Still Failing
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/10871/

1 tests failed.

REGRESSION:  org.apache.solr.search.TestRealTimeGet.testStressGetRealtime

Error Message:
java.lang.AssertionError: Some threads threw uncaught exceptions!

Stack Trace:
java.lang.RuntimeException: java.lang.AssertionError: Some threads threw uncaught exceptions!
        at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:739)
        at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:89)
        at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:149)
        at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:51)
        at org.apache.lucene.util.LuceneTestCase.checkUncaughtExceptionsAfter(LuceneTestCase.java:767)
        at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:711)

Build Log (for compile errors):
[...truncated 7942 lines...]
[JENKINS] Lucene-Solr-tests-only-3.x - Build # 10891 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/10891/

1 tests failed.

REGRESSION:  org.apache.solr.core.TestJmxIntegration.testJmxOnCoreReload

Error Message:
Number of registered MBeans is not the same as info registry size expected:<58> but was:<55>

Stack Trace:
junit.framework.AssertionFailedError: Number of registered MBeans is not the same as info registry size expected:<58> but was:<55>
        at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:147)
        at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:50)
        at org.apache.solr.core.TestJmxIntegration.testJmxOnCoreReload(TestJmxIntegration.java:137)
        at org.apache.lucene.util.LuceneTestCase$2$1.evaluate(LuceneTestCase.java:435)

Build Log (for compile errors):
[...truncated 14016 lines...]
[jira] [Updated] (SOLR-2829) Filter queries have false-positive matches. Exposed by user's list titled Regarding geodist and multiple location fields
[ https://issues.apache.org/jira/browse/SOLR-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Emmanuel Espina updated SOLR-2829:
----------------------------------
    Attachment: SOLR-2829.patch

I modified the tests to reproduce the issue in the mailing list. The suggestion Erick made about adding this.origField.equals(other.origField) solves the problem. That line is included in the patch.

Filter queries have false-positive matches. Exposed by user's list titled "Regarding geodist and multiple location fields"
--------------------------------------------------------------------------------------------------------------------------
                Key: SOLR-2829
                URL: https://issues.apache.org/jira/browse/SOLR-2829
            Project: Solr
         Issue Type: Bug
         Components: search
   Affects Versions: 3.4, 4.0
        Environment: N/A
           Reporter: Erick Erickson
        Attachments: SOLR-2829.patch

I don't know how generic this is, whether it's just a problem with fqs when combined with spatial or whether it has wider applicability, but here's what I know so far. Marc Tinnemeyer, in a post titled "Regarding geodist and multiple location fields", outlines this. I checked this on 3.4 and trunk and it's weird in both cases. HOLD THE PRESSES: After looking at this a bit more, it looks like a caching issue, NOT a geodist issue. When I bounce Solr between changing the sfield from home to work, it seems to work as expected. Hmmm, very strange. If I comment out BOTH the filterCache and queryResultCache then it works fine: switching from home to work in the query finds/fails to find the document. But commenting out only one of those caches doesn't fix the problem.

On trunk I used this query, just flipping home to work and back:
http://localhost:8983/solr/select?q=id:1&fq={!geofilt sfield=home pt=52.67,7.30 d=5}

The info below is what I used to test. From Marc's posts:

<field name="home" type="location" indexed="true" stored="true"/>
<field name="work" type="location" indexed="true" stored="true"/>
<field name="elsewhere" type="location" indexed="true" stored="true"/>

At first I thought so too. Here is a simple document:

<add>
  <doc>
    <field name="id">1</field>
    <field name="name">first</field>
    <field name="work">48.60,11.61</field>
    <field name="home">52.67,7.30</field>
  </doc>
</add>

and here is the result that shouldn't be:

<response>
...
<str name="q">*:*</str>
<str name="fq">{!geofilt sfield=work pt=52.67,7.30 d=5}</str>
...
</lst>
</lst>
<result name="response" numFound="1" start="0">
  <doc>
    <str name="home">52.67,7.30</str>
    <str name="id">1</str>
    <str name="name">first</str>
    <str name="work">48.60,11.61</str>
  </doc>
</result>
</response>

Yonik's comment:
It's going to be a bug in an equals() implementation somewhere in the query. The top-level equals will be SpatialDistanceQuery.equals() (from LatLonField.java). On trunk, I already see a bug introduced when the new Lucene field cache stuff was done. DoubleValueSource now just inherits its equals method from NumericFieldCacheSource... and that equals() method only tests whether the CachedArrayCreator.getClass() is the same! That's definitely wrong. I don't know why 3x would also have this behavior (unless there's more than one bug!) Anyway, the first step is to modify the spatial tests to catch the bug... from there it should be pretty easy to debug.
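The equals() omission described here can be reproduced with a self-contained toy (GeoFilterKey is invented for illustration and is not Solr's SpatialDistanceQuery): if a cache key omits the source field from equals()/hashCode(), filters over home and work look identical to a cache keyed on them, so the second query is served the first query's cached result.

```java
import java.util.*;

// Toy stand-in for a spatial-filter cache key. The origField comparison is
// the analogue of the one-line fix from the patch: without it, filters over
// different location fields would compare equal and collide in a cache.
class GeoFilterKey {
    final String origField;
    final double lat, lon, d;

    GeoFilterKey(String origField, double lat, double lon, double d) {
        this.origField = origField; this.lat = lat; this.lon = lon; this.d = d;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof GeoFilterKey)) return false;
        GeoFilterKey k = (GeoFilterKey) o;
        return origField.equals(k.origField)   // the crucial field comparison
                && lat == k.lat && lon == k.lon && d == k.d;
    }

    @Override public int hashCode() {
        return Objects.hash(origField, lat, lon, d);
    }
}
```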
[jira] [Updated] (SOLR-2382) DIH Cache Improvements
[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Dyer updated SOLR-2382: - Attachment: SOLR-2382-entities.patch Noble, Here is a version of the entities patch using .iterator() methods as you suggest. Let me know if this is what you had in mind and also if there is anything else you'd like to address. DIH Cache Improvements -- Key: SOLR-2382 URL: https://issues.apache.org/jira/browse/SOLR-2382 Project: Solr Issue Type: New Feature Components: contrib - DataImportHandler Reporter: James Dyer Priority: Minor Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-properties.patch, SOLR-2382-properties.patch, SOLR-2382-solrwriter-verbose-fix.patch, SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch Functionality: 1. Provide a pluggable caching framework for DIH so that users can choose a cache implementation that best suits their data and application. 2. Provide a means to temporarily cache a child Entity's data without needing to create a special cached implementation of the Entity Processor (such as CachedSqlEntityProcessor). 3. Provide a means to write the final (root entity) DIH output to a cache rather than to Solr. Then provide a way for a subsequent DIH call to use the cache as an Entity input. Also provide the ability to do delta updates on such persistent caches. 4. Provide the ability to partition data across multiple caches that can then be fed back into DIH and indexed either to varying Solr Shards, or to the same Core in parallel. Use Cases: 1. We needed a flexible scalable way to temporarily cache child-entity data prior to joining to parent entities. 
- Using SqlEntityProcessor with Child Entities can cause an n+1 select problem. - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching mechanism and does not scale. - There is no way to cache non-SQL inputs (ex: flat files, xml, etc). 2. We needed the ability to gather data from long-running entities by a process that runs separate from our main indexing process. 3. We wanted the ability to do a delta import of only the entities that changed. - Lucene/Solr requires entire documents to be re-indexed, even if only a few fields changed. - Our data comes from 50+ complex sql queries and/or flat files. - We do not want to incur overhead re-gathering all of this data if only 1 entity's data changed. - Persistent DIH caches solve this problem. 4. We want the ability to index several documents in parallel (using 1.4.1, which did not have the threads parameter). 5. In the future, we may need to use Shards, creating a need to easily partition our source data into Shards. Implementation Details: 1. De-couple EntityProcessorBase from caching. - Created a new interface, DIHCache two implementations: - SortedMapBackedCache - An in-memory cache, used as default with CachedSqlEntityProcessor (now deprecated). - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested with je-4.1.6.jar - NOTE: the existing Lucene Contrib db project uses je-3.3.93.jar. I believe this may be incompatible due to Generic Usage. - NOTE: I did not modify the ant script to automatically get this jar, so to use or evaluate this patch, download bdb-je from http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html 2. Allow Entity Processors to take a cacheImpl parameter to cause the entity data to be cached (see EntityProcessorBase DIHCacheProperties). 3. Partially De-couple SolrWriter from DocBuilder - Created a new interface DIHWriter, two implementations: - SolrWriter (refactored) - DIHCacheWriter (allows DIH to write ultimately to a Cache). 4. 
Create a new Entity Processor, DIHCacheProcessor, which reads a persistent Cache as DIH Entity Input. 5. Support a partition parameter with both DIHCacheWriter and DIHCacheProcessor to allow for easy partitioning of source entity data. 6. Change the semantics of entity.destroy() - Previously, it was being called on each iteration of DocBuilder.buildDocument(). - Now it is does one-time cleanup tasks (like closing or deleting a disk-backed cache) once the entity processor is completed. - The only out-of-the-box entity processor that previously implemented destroy() was LineEntitiyProcessor, so this is not a
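The pluggable cache described in items 1-3 above can be sketched with a toy interface (ToyDIHCache and SortedMapBackedToyCache are invented names, not the actual DIHCache API from the patch): each key maps to the rows gathered for it, so a child entity's data can be fetched once and then joined to parent rows by lookup, avoiding one SQL select per parent (the n+1 problem).

```java
import java.util.*;

// Minimal sketch of a pluggable DIH-style cache: key -> child-entity rows.
interface ToyDIHCache {
    void add(String key, Map<String, Object> row);
    List<Map<String, Object>> lookup(String key);
}

// In-memory implementation backed by a sorted map, loosely echoing the
// SortedMapBackedCache idea from the patch description.
class SortedMapBackedToyCache implements ToyDIHCache {
    private final SortedMap<String, List<Map<String, Object>>> data = new TreeMap<>();

    public void add(String key, Map<String, Object> row) {
        data.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
    }

    public List<Map<String, Object>> lookup(String key) {
        return data.getOrDefault(key, Collections.emptyList());
    }
}
```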
[jira] [Commented] (SOLR-2842) Re-factor UpdateChain and UpdateProcessor interfaces
[ https://issues.apache.org/jira/browse/SOLR-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129229#comment-13129229 ]

Ryan McKinley commented on SOLR-2842:
-------------------------------------

I don't have a real proposal... just thinking about generally reusable pipeline code.

bq. Do you suggest to let UpdateProcessor base class implement this interface?

No. Since most domain-specific UpdateProcessors can be boiled down to this (tika, langid, geonames, etc), I don't think they need access to the whole UpdateProcessor -- only sometimes do they need access to SolrCore/Schema/ResourceLoader etc. With minimal dependencies, moving them around would be easy. I was thinking we could have a general TransformingUpdateProcessor that could take a list of transformers (or something), rather than having all the dependencies.

bq. But you still need to construct and initialize the processors even if they are wrapped in the interface, thus my suggestion for a client side version of the factory.

I'm not convinced that a client-side framework is necessary if the interfaces were easy enough to deal with directly. I can see where a DSL would be cool, but having a client-side NamedListInitializedPlugin seems like a can of worms.
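Ryan's "list of transformers" idea might look roughly like this self-contained sketch (DocTransformer and TransformChain are invented names for illustration, and a plain field map stands in for SolrInputDocument): each step takes a document, mutates or augments it, and passes it along, with no dependency on SolrCore or the rest of the server-side plumbing, so the same chain could run in SolrJ or inside Solr.

```java
import java.util.*;

// A transformer step: document in, document out. Being a single-method
// interface, it can be implemented with a lambda.
interface DocTransformer {
    Map<String, Object> transform(Map<String, Object> doc);
}

// Runs a list of transformers in order over one document.
class TransformChain {
    private final List<DocTransformer> steps;

    TransformChain(List<DocTransformer> steps) {
        this.steps = steps;
    }

    Map<String, Object> process(Map<String, Object> doc) {
        for (DocTransformer t : steps) {
            doc = t.transform(doc);
        }
        return doc;
    }
}
```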
[JENKINS] Lucene-Solr-tests-only-trunk-java7 - Build # 649 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk-java7/649/

1 tests failed.

REGRESSION:  org.apache.solr.core.TestJmxIntegration.testJmxOnCoreReload

Error Message:
Number of registered MBeans is not the same as info registry size expected:<56> but was:<51>

Stack Trace:
junit.framework.AssertionFailedError: Number of registered MBeans is not the same as info registry size expected:<56> but was:<51>
        at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:149)
        at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:51)
        at org.apache.solr.core.TestJmxIntegration.testJmxOnCoreReload(TestJmxIntegration.java:134)
        at org.apache.lucene.util.LuceneTestCase$2$1.evaluate(LuceneTestCase.java:610)

Build Log (for compile errors):
[...truncated 11102 lines...]
[JENKINS] Lucene-Solr-tests-only-trunk - Build # 10876 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/10876/

1 tests failed.

REGRESSION:  org.apache.lucene.analysis.th.TestThaiAnalyzer.testRandomStrings

Error Message:
255

Stack Trace:
java.lang.ArrayIndexOutOfBoundsException: 255
        at java.text.DictionaryBasedBreakIterator.lookupCategory(DictionaryBasedBreakIterator.java:319)
        at java.text.RuleBasedBreakIterator.handleNext(RuleBasedBreakIterator.java:903)
        at java.text.DictionaryBasedBreakIterator.handleNext(DictionaryBasedBreakIterator.java:281)
        at java.text.RuleBasedBreakIterator.next(RuleBasedBreakIterator.java:621)
        at org.apache.lucene.analysis.th.ThaiWordFilter.incrementToken(ThaiWordFilter.java:85)
        at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:49)
        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:280)
        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:247)
        at org.apache.lucene.analysis.th.TestThaiAnalyzer.testRandomStrings(TestThaiAnalyzer.java:151)
        at org.apache.lucene.util.LuceneTestCase$2$1.evaluate(LuceneTestCase.java:610)
        at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:149)
        at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:51)

Build Log (for compile errors):
[...truncated 3569 lines...]
[jira] [Commented] (SOLR-2842) Re-factor UpdateChain and UpdateProcessor interfaces
[ https://issues.apache.org/jira/browse/SOLR-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129448#comment-13129448 ]

Chris Male commented on SOLR-2842:
----------------------------------

Ryan is on the right track here. Really, many of these changes are just related to the 'Add' update logic. UpdateProcessor provides a lot of integration with Solr's CRUD features, and I think we should leave it that way. If we want to provide a cleaner system for doing Document manipulations as part of adding a Document, then let's keep the focus on that. Having a simple interface like Ryan suggests seems the best way forward. Then all the actual meat of the Document manipulation logic can go there. Integration with UpdateProcessor seems pretty straightforward. It then frees us up to tackle configuration / DSLs / reuse / whatever other buzzword, without changing the already powerful and functional UpdateProcessor framework.
[jira] [Commented] (SOLR-2382) DIH Cache Improvements
[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129493#comment-13129493 ]

Noble Paul commented on SOLR-2382:
----------------------------------

A few more points:
- Let us avoid methods which are not used, for example SolrWriter.setDeltaKeys(). It is not implemented, and I am not even clear why it is there.
- Now that we have a concept of DIHCache, move all cache-related logic from EntityProcessorBase to another class, probably a base DIHCache.
- Let us put in the minimum amount of changes: remove the DIHCacheProperties class and inline the constants. That is the way it is done everywhere else.
- I don't understand the need for DocBuilder.resetEntity(). According to me, the DataConfig state must not be changed between runs.