[jira] [Commented] (LUCENE-3296) Enable passing a config into PKIndexSplitter
[ https://issues.apache.org/jira/browse/LUCENE-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062862#comment-13062862 ] Jason Rutherglen commented on LUCENE-3296: -- Uwe, the first patch [1] is implemented with CURRENT. 1. https://issues.apache.org/jira/secure/attachment/12485805/LUCENE-3296.patch Enable passing a config into PKIndexSplitter Key: LUCENE-3296 URL: https://issues.apache.org/jira/browse/LUCENE-3296 Project: Lucene - Java Issue Type: Improvement Components: core/other Affects Versions: 3.3, 4.0 Reporter: Jason Rutherglen Assignee: Simon Willnauer Priority: Trivial Attachments: LUCENE-3296.patch, LUCENE-3296.patch I need to be able to pass the IndexWriterConfig into the IW used by PKIndexSplitter. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: New facet module
Actually I think the faceting module is per-segment? That would be very cool. I reviewed the user guide and it is ambiguous on this topic. Eg, why does the facet taxonomy need to be committed for every IW commit? Mapping that to [N]RT will be tricky. Page 17: In faceted search, we complicate things somewhat by adding a second index – the taxonomy index. The taxonomy API also follows point-in-time semantics, but this is not quite enough. Some attention must be paid by the user to keep those two indexes consistently in sync: The main index refers to category numbers defined in the taxonomy index. Therefore, it is important that we open the TaxonomyReader after opening the IndexReader. Moreover, every time an IndexReader is reopen()ed, the TaxonomyReader needs to be refresh()ed as well. But there is one extra caution: whenever the application deems it has written enough information worthy of a commit, it must first call commit() for the TaxonomyWriter and only after that call commit() for the IndexWriter. Closing the indices should also be done in this order – first close the taxonomy, and only after that close the index. On Sat, Jul 9, 2011 at 4:13 AM, Michael McCandless luc...@mikemccandless.com wrote: Actually I think the faceting module is per-segment? The facets are encoded into payloads, and then it visits the payload of each hit right per segment, and aggregates the counts. Like, on reopen (NRT or not) of a reader, there are no global data structures that must be recomputed. EG, this facets impl doesn't use FieldCache on the global reader (leading to insanity). Mike McCandless http://blog.mikemccandless.com On Sat, Jul 9, 2011 at 12:40 AM, Shai Erera ser...@gmail.com wrote: Well, the approach is entirely different, and the new module introduces features not available in the other impls (and I imagine vice versa). The taxonomy is managed on the side, hence why it is global to the 'content' index.
It plays very well with NRT, and we in fact have several apps that use the module in an NRT environment. The taxonomy index supports NRT by itself, by using the IR.open(IW) API, and then it's up to the application to manage its content index search as NRT. I think you should read the high-level description I put on LUCENE-3079 and the userguide I put on LUCENE-3261. As I said, the approach is quite different than the bitset and FieldCache ones. Shai On Saturday, July 9, 2011, Jason Rutherglen jason.rutherg...@gmail.com wrote: The taxonomy is global to the index, but I think it will be interesting to explore per-segment taxonomy, and how it can be used to improve indexing or search perf (hopefully both) Right so with NRT this'll be an issue. Is there a write up on this? It sounds fairly radical in design. Eg, I'm curious as to how it compares with the bit set and un-inverted field cache based faceting systems. On Fri, Jul 8, 2011 at 8:44 PM, Shai Erera ser...@gmail.com wrote: Currently it doesn't facet per segment, because the approach it uses is irrelevant to per segment. It maintains a count array the size of the taxonomy, and every matching document contributes to the weight of the categories it is associated with, regardless of the segment it is found in. The taxonomy is global to the index, but I think it will be interesting to explore per-segment taxonomy, and how it can be used to improve indexing or search perf (hopefully both). Shai On Saturday, July 9, 2011, Jason Rutherglen jason.rutherg...@gmail.com wrote: Is it faceting per-segment?
- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import
[ https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062866#comment-13062866 ] Shalin Shekhar Mangar commented on SOLR-2551: - I'm taking a look. Thanks Chris. Check dataimport.properties for write access before starting import --- Key: SOLR-2551 URL: https://issues.apache.org/jira/browse/SOLR-2551 Project: Solr Issue Type: Improvement Components: contrib - DataImportHandler Affects Versions: 1.4.1, 3.1 Reporter: C S Assignee: Shalin Shekhar Mangar Priority: Minor Fix For: 3.3, 4.0 Attachments: SOLR-2551.patch A common mistake is that the /conf directory (respectively the dataimport.properties file) is not writable for Solr. It would be great if that were detected on starting a dataimport job. Currently an import might grind away for days and fail if it can't write its timestamp to the dataimport.properties file. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
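The pre-flight check proposed in SOLR-2551 can be sketched as follows. This is an illustrative, standalone example — the class and method names are hypothetical and not the actual patch: the idea is simply to verify writability of dataimport.properties up front instead of discovering the failure only when the import tries to persist its timestamp at the end.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical sketch of a write-access pre-flight check for
// DataImportHandler's dataimport.properties (not the SOLR-2551 patch itself).
public class WriteAccessCheck {

    // Returns true when the file is writable; for a file that does not exist
    // yet, checks whether its parent directory allows creating it.
    static boolean canPersist(File propsFile) {
        if (propsFile.exists()) {
            return propsFile.canWrite();
        }
        File dir = propsFile.getParentFile();
        return dir != null && dir.exists() && dir.canWrite();
    }

    public static void main(String[] args) throws IOException {
        // A freshly created temp file should normally be writable.
        File f = File.createTempFile("dataimport", ".properties");
        System.out.println(canPersist(f));
        f.delete();
    }
}
```

Running such a check at the start of a delta/full import would fail fast with a clear error rather than after days of grinding.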
Re: New facet module
Hi Jason, The reason the taxonomy and content indexes need to be in sync is that the taxonomy index manages the categories and their ordinals. The ordinals are written in a special posting list in the content index (I think we should cut over this part to use DocValues). Now imagine that you only commit to the content index, but not to the taxonomy index. If the system crashes, the content index will refer to ordinals which the taxonomy index does not know about. NRT-wise, the taxonomy index is much smaller than the content index. Imagine what it takes to manage NRT over regular content. Every document you add probably includes a moderate amount of text that's parsed, stored fields, term vectors and what not. Flushing that data (during getReader()), whether to FSDir or RAMDir, is much more expensive than flushing the information in the taxonomy index, where every document contains a single term with the category label. So I don't think we should be worried too much about the taxonomy index's NRT support and performance. It is orders of magnitude smaller than the other index. There is one thing we should improve about per-segment faceting in the new module -- by default categories are read from the posting list's payload, but there is a way to load all categories into RAM and fetch them from there during search. Today that code is not per-segment, and I think it should be. Shai On Mon, Jul 11, 2011 at 9:27 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Actually I think the faceting module is per-segment? That would be very cool. I reviewed the user guide and it is ambiguous on this topic. Eg, why does the facet taxonomy need to be committed for every IW commit? Mapping that to [N]RT will be tricky. Page 17: In faceted search, we complicate things somewhat by adding a second index – the taxonomy index. The taxonomy API also follows point-in-time semantics, but this is not quite enough.
Some attention must be paid by the user to keep those two indexes consistently in sync: The main index refers to category numbers defined in the taxonomy index. Therefore, it is important that we open the TaxonomyReader after opening the IndexReader. Moreover, every time an IndexReader is reopen()ed, the TaxonomyReader needs to be refresh()ed as well. But there is one extra caution: whenever the application deems it has written enough information worthy of a commit, it must first call commit() for the TaxonomyWriter and only after that call commit() for the IndexWriter. Closing the indices should also be done in this order – first close the taxonomy, and only after that close the index. On Sat, Jul 9, 2011 at 4:13 AM, Michael McCandless luc...@mikemccandless.com wrote: Actually I think the faceting module is per-segment? The facets are encoded into payloads, and then it visits the payload of each hit right per segment, and aggregates the counts. Like, on reopen (NRT or not) of a reader, there are no global data structures that must be recomputed. EG, this facets impl doesn't use FieldCache on the global reader (leading to insanity). Mike McCandless http://blog.mikemccandless.com On Sat, Jul 9, 2011 at 12:40 AM, Shai Erera ser...@gmail.com wrote: Well, the approach is entirely different, and the new module introduces features not available in the other impls (and I imagine vice versa). The taxonomy is managed on the side, hence why it is global to the 'content' index. It plays very well with NRT, and we in fact have several apps that use the module in an NRT environment. The taxonomy index supports NRT by itself, by using the IR.open(IW) API, and then it's up to the application to manage its content index search as NRT. I think you should read the high-level description I put on LUCENE-3079 and the userguide I put on LUCENE-3261. As I said, the approach is quite different than the bitset and FieldCache ones.
Shai On Saturday, July 9, 2011, Jason Rutherglen jason.rutherg...@gmail.com wrote: The taxonomy is global to the index, but I think it will be interesting to explore per-segment taxonomy, and how it can be used to improve indexing or search perf (hopefully both) Right so with NRT this'll be an issue. Is there a write up on this? It sounds fairly radical in design. Eg, I'm curious as to how it compares with the bit set and un-inverted field cache based faceting systems. On Fri, Jul 8, 2011 at 8:44 PM, Shai Erera ser...@gmail.com wrote: Currently it doesn't facet per segment, because the approach it uses is irrelevant to per segment. It maintains a count array the size of the taxonomy, and every matching document contributes to the weight of the categories it is associated with, regardless of the segment it is found in. The taxonomy is global to the index, but I think it will be interesting to explore per-segment taxonomy, and how it can be used to improve indexing or search perf (hopefully both).
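The commit-ordering rule discussed in this thread (taxonomy first, content second) can be illustrated with a toy model. This is plain Java, not the Lucene facet API — it only models the invariant: the content index stores ordinals that only the taxonomy can resolve, so if a crash happens between the two commits, committing the taxonomy first leaves at worst some unused ordinals, while committing the content first leaves dangling ones.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the taxonomy/content commit-ordering invariant
// (not Lucene code; all names are illustrative).
public class CommitOrderDemo {

    static List<String> taxonomyCommitted = new ArrayList<>();  // ordinal -> category label
    static List<Integer> contentCommitted = new ArrayList<>();  // docs, stored as ordinal refs

    static List<String> taxonomyPending = new ArrayList<>();
    static List<Integer> contentPending = new ArrayList<>();

    static void addDocument(String category) {
        // The taxonomy assigns the next ordinal; the content index stores only it.
        int ordinal = taxonomyCommitted.size() + taxonomyPending.size();
        taxonomyPending.add(category);
        contentPending.add(ordinal);
    }

    // Safe order: commit the taxonomy first. A crash after this call but
    // before commitContent() still leaves every committed ordinal resolvable.
    static void commitTaxonomy() { taxonomyCommitted.addAll(taxonomyPending); taxonomyPending.clear(); }
    static void commitContent()  { contentCommitted.addAll(contentPending);  contentPending.clear(); }

    static boolean allOrdinalsResolvable() {
        for (int ord : contentCommitted) {
            if (ord >= taxonomyCommitted.size()) return false;  // dangling ordinal
        }
        return true;
    }

    public static void main(String[] args) {
        addDocument("Authors/Shai Erera");
        commitTaxonomy();   // correct order: taxonomy first ...
        commitContent();    // ... then content
        System.out.println(allOrdinalsResolvable());
    }
}
```

Reversing the two commit calls and "crashing" in between leaves `allOrdinalsResolvable()` false — exactly the corruption scenario described above.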
Re: New facet module
On Sat, 2011-07-09 at 05:44 +0200, Shai Erera wrote: The taxonomy is global to the index, but I think it will be interesting to explore per-segment taxonomy, and how it can be used to improve indexing or search perf (hopefully both). I have struggled with this for some time and still haven't found a real solution. Distributed faceting, with the special case of segment-based faceting, is hard to do without a central taxonomy. The new faceting module is explicit about the central taxonomy. My experiments with https://issues.apache.org/jira/browse/LUCENE-2369 compute it at index open time. None of them work very well, if at all, for a real distributed environment. The problem is the same for flat faceting but is magnified with hierarchical faceting: When the sorting order of facet elements is popularity based, computing the correct counts for a top-X might potentially involve comparison of the whole result from each part. A pathological case for flat faceting is: Part 1: A1(2), A2(2) ... An(2); Part 2: B1(3), B2(2), B3(2) ... Bn(2), An(1), where the correct top-3 answer is An(3), B1(3), A2(2), which requires the full part results to get to the An(2) and An(1), as they are the last elements. For real-world use, we can do clever counting so that we only return what is necessary, but it does not change the worst case. To ensure that we don't hit any million-entries merge situations, we must cheat and make a cutoff point. With a multi-level faceting result (state/town/street expanded to top 5 elements on all levels) we must resolve quite a lot of elements to ensure a high chance of getting the right elements with the right counts. We can avoid this by drilling down one level at a time, but that is just replacing bulk transfers with multiple requests: 1*5*5 is the unrealistically low minimum for the address case. - Toke - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
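Toke's pathological case can be reproduced with a small sketch of a naive distributed merge (hypothetical helper code, not Solr/Lucene internals): each shard returns only its local top-k facet counts, and the merger sums them. With a cutoff of k=3, "An" never makes Part 2's local top-3, so its global count comes out as 2 instead of the correct 3.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Demonstrates why merging per-shard top-k facet counts can undercount:
// a term that is mediocre on every shard can still be a global winner.
public class FacetMergeDemo {

    // Naive merge: sum counts across shards, but only for terms that
    // survived each shard's local top-k cutoff.
    static Map<String, Integer> naiveMerge(List<Map<String, Integer>> shards, int k) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> shard : shards) {
            shard.entrySet().stream()
                 .sorted((a, b) -> b.getValue() - a.getValue())
                 .limit(k)
                 .forEach(e -> merged.merge(e.getKey(), e.getValue(), Integer::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        // The pathological case from the email: "An" is globally tied for
        // first (2 + 1 = 3) but unremarkable on each individual part.
        Map<String, Integer> part1 = new HashMap<>();
        part1.put("A1", 2); part1.put("A2", 2); part1.put("An", 2);
        Map<String, Integer> part2 = new HashMap<>();
        part2.put("B1", 3); part2.put("B2", 2); part2.put("B3", 2);
        part2.put("Bn", 2); part2.put("An", 1);

        Map<String, Integer> merged = naiveMerge(Arrays.asList(part1, part2), 3);
        // On part2, An(1) is strictly the smallest of five entries, so the
        // k=3 cutoff always drops it and its contribution is lost.
        System.out.println(merged.get("An"));
    }
}
```

Getting the correct An(3) requires pulling more than the top-3 from each part, which is exactly the bulk-transfer cost the email describes.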
[JENKINS-MAVEN] Lucene-Solr-Maven-trunk #184: POMs out of sync
Build: https://builds.apache.org/job/Lucene-Solr-Maven-trunk/184/ 1 tests failed. REGRESSION: org.apache.lucene.index.TestIndexWriter.testThreadInterruptDeadlock Error Message: Stack Trace: java.lang.AssertionError: at org.junit.Assert.fail(Assert.java:91) at org.junit.Assert.assertTrue(Assert.java:43) at org.junit.Assert.assertFalse(Assert.java:68) at org.junit.Assert.assertFalse(Assert.java:79) at org.apache.lucene.index.TestIndexWriter.testThreadInterruptDeadlock(TestIndexWriter.java:1204) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.rules.TestWatchman$1.evaluate(TestWatchman.java:48) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184) at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31) at org.junit.runners.ParentRunner.run(ParentRunner.java:236) at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:35) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:146) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:97) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at org.apache.maven.surefire.booter.ProviderFactory$ClassLoaderProxy.invoke(ProviderFactory.java:103) at $Proxy0.invoke(Unknown Source) at org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:145) at org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcess(SurefireStarter.java:87) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:69) Build Log (for compile errors): [...truncated 20110 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[JENKINS] Lucene-Solr-tests-only-trunk - Build # 9492 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9492/ 1 tests failed. REGRESSION: org.apache.solr.handler.dataimport.TestSqlEntityProcessorDelta2.testCompositePk_DeltaImport_delete Error Message: Exception during query Stack Trace: java.lang.RuntimeException: Exception during query at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:405) at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:372) at org.apache.solr.handler.dataimport.TestSqlEntityProcessorDelta2.testCompositePk_DeltaImport_delete(TestSqlEntityProcessorDelta2.java:113) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382) Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//*[@numFound='0'] xml response was: <?xml version="1.0" encoding="UTF-8"?> <response> <lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="start">0</str><str name="q">*:* OR testCompositePk_DeltaImport_delete</str><str name="qt">standard</str><str name="rows">20</str><str name="version">2.2</str></lst></lst><result name="response" numFound="1" start="0"><doc><str name="solr_id">prefix-1</str><arr name="desc"><str>hello</str></arr><date name="timestamp">2011-07-11T08:09:46.491Z</date></doc></result> </response> request was: start=0&q=*:*+OR+testCompositePk_DeltaImport_delete&qt=standard&rows=20&version=2.2 at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:398) Build Log (for compile errors): [...truncated 11997 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-2647) DOMUtilTestBase should be abstract
DOMUtilTestBase should be abstract -- Key: SOLR-2647 URL: https://issues.apache.org/jira/browse/SOLR-2647 Project: Solr Issue Type: Improvement Reporter: Chris Male Priority: Trivial Attachments: SOLR-2647.patch It serves as a base for other test classes that use the DOM. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2647) DOMUtilTestBase should be abstract
[ https://issues.apache.org/jira/browse/SOLR-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Male updated SOLR-2647: - Attachment: SOLR-2647.patch Patch. DOMUtilTestBase should be abstract -- Key: SOLR-2647 URL: https://issues.apache.org/jira/browse/SOLR-2647 Project: Solr Issue Type: Improvement Reporter: Chris Male Priority: Trivial Attachments: SOLR-2647.patch It serves as a base for other test classes that use the DOM. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2564) Integrating grouping module into Solr 4.0
[ https://issues.apache.org/jira/browse/SOLR-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063231#comment-13063231 ] Martijn van Groningen commented on SOLR-2564: - Since Lucene is now also Java 6, we can just change the code in AbstractFirstPassGroupingCollector, and the TermFirstPassGroupingCollectorJava6 in grouping.java is no longer needed, right? Integrating grouping module into Solr 4.0 - Key: SOLR-2564 URL: https://issues.apache.org/jira/browse/SOLR-2564 Project: Solr Issue Type: Improvement Reporter: Martijn van Groningen Assignee: Martijn van Groningen Priority: Blocker Fix For: 4.0 Attachments: LUCENE-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564_performance_loss_fix.patch Since work on the grouping module is going well, I think it is time to wire this up in Solr. Besides the current grouping features Solr provides, Solr will then also support second-pass caching and total count based on groups. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[JENKINS] Lucene-Solr-tests-only-trunk - Build # 9493 - Still Failing
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9493/ 1 tests failed. REGRESSION: org.apache.solr.handler.dataimport.TestSqlEntityProcessorDelta3.testCompositePk_DeltaImport_delete Error Message: Exception during query Stack Trace: java.lang.RuntimeException: Exception during query at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:405) at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:372) at org.apache.solr.handler.dataimport.TestSqlEntityProcessorDelta3.testCompositePk_DeltaImport_delete(TestSqlEntityProcessorDelta3.java:111) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382) Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//*[@numFound='0'] xml response was: <?xml version="1.0" encoding="UTF-8"?> <response> <lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="start">0</str><str name="q">*:* OR testCompositePk_DeltaImport_delete</str><str name="qt">standard</str><str name="rows">20</str><str name="version">2.2</str></lst></lst><result name="response" numFound="1" start="0"><doc><arr name="desc"><str>d1</str></arr><str name="id">2</str><date name="timestamp">2011-07-11T09:22:55.278Z</date></doc></result> </response> request was: start=0&q=*:*+OR+testCompositePk_DeltaImport_delete&qt=standard&rows=20&version=2.2 at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:398) Build Log (for compile errors): [...truncated 12030 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2564) Integrating grouping module into Solr 4.0
[ https://issues.apache.org/jira/browse/SOLR-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063244#comment-13063244 ] Simon Willnauer commented on SOLR-2564: --- bq. Since Lucene is now also Java 6 we can just change the code in AbstractFirstPassGroupingCollector and the TermFirstPassGroupingCollectorJava6 in grouping.java is no longer needed, right? Yes, that's right. Integrating grouping module into Solr 4.0 - Key: SOLR-2564 URL: https://issues.apache.org/jira/browse/SOLR-2564 Project: Solr Issue Type: Improvement Reporter: Martijn van Groningen Assignee: Martijn van Groningen Priority: Blocker Fix For: 4.0 Attachments: LUCENE-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564_performance_loss_fix.patch Since work on the grouping module is going well, I think it is time to wire this up in Solr. Besides the current grouping features Solr provides, Solr will then also support second-pass caching and total count based on groups. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3295) BitVector never skips fully populated bytes when writing ClearedDgaps
[ https://issues.apache.org/jira/browse/LUCENE-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063245#comment-13063245 ] Simon Willnauer commented on LUCENE-3295: - Thanks for resolving this, Mike. BitVector never skips fully populated bytes when writing ClearedDgaps - Key: LUCENE-3295 URL: https://issues.apache.org/jira/browse/LUCENE-3295 Project: Lucene - Java Issue Type: Bug Components: core/other Affects Versions: 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Fix For: 4.0 Attachments: LUCENE-3295.patch, LUCENE-3295.patch When writing cleared DGaps in BitVector we compare a byte against 0xFF (255), yet the byte is promoted to an int (-1), so the comparison will never succeed. We should mask the byte with 0xFF before comparing, or compare against -1. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
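The sign-extension pitfall behind this bug can be shown in isolation (a standalone demo of the comparison pattern, not the BitVector code itself): a Java byte is signed, so a fully populated byte widens to the int -1, never to 255.

```java
// The bug pattern from LUCENE-3295 in isolation: comparing a (signed) byte
// against the int literal 0xFF after implicit widening.
public class SignExtensionDemo {
    public static void main(String[] args) {
        byte fullByte = (byte) 0xFF;   // all eight bits set

        // Broken comparison: the byte is promoted to int -1, 0xFF is int 255.
        System.out.println(fullByte == 0xFF);          // false

        // Fix 1: mask to an unsigned value before comparing.
        System.out.println((fullByte & 0xFF) == 0xFF); // true

        // Fix 2: compare against -1 directly.
        System.out.println(fullByte == -1);            // true
    }
}
```

Either fix makes the "fully populated byte" branch reachable again, which is exactly what the patch restores.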
[JENKINS] Lucene-Solr-tests-only-3.x - Build # 9505 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/9505/ All tests passed Build Log (for compile errors): [...truncated 17572 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[JENKINS] Lucene-Solr-tests-only-trunk - Build # 9494 - Still Failing
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9494/ 1 tests failed. REGRESSION: org.apache.solr.handler.dataimport.TestSqlEntityProcessor2.testCompositePk_DeltaImport Error Message: Exception during query Stack Trace: java.lang.RuntimeException: Exception during query at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:405) at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:372) at org.apache.solr.handler.dataimport.TestSqlEntityProcessor2.testCompositePk_DeltaImport(TestSqlEntityProcessor2.java:129) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382) Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//*[@numFound='1'] xml response was: <?xml version="1.0" encoding="UTF-8"?> <response> <lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="start">0</str><str name="q">id:5</str><str name="qt">standard</str><str name="rows">20</str><str name="version">2.2</str></lst></lst><result name="response" numFound="0" start="0"/> </response> request was: start=0&q=id:5&qt=standard&rows=20&version=2.2 at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:398) Build Log (for compile errors): [...truncated 12034 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3296) Enable passing a config into PKIndexSplitter
[ https://issues.apache.org/jira/browse/LUCENE-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-3296: Attachment: LUCENE-3296.patch Here is a new patch. I added a second IWC since we cannot reuse IWC instances across IWs due to SetOnce restrictions. I also moved out the VERSION_CURRENT and made it a ctor argument. We should not randomly use VERSION_CURRENT but rather be consistent about which version we use. Enable passing a config into PKIndexSplitter Key: LUCENE-3296 URL: https://issues.apache.org/jira/browse/LUCENE-3296 Project: Lucene - Java Issue Type: Improvement Components: core/other Affects Versions: 3.3, 4.0 Reporter: Jason Rutherglen Assignee: Simon Willnauer Priority: Trivial Attachments: LUCENE-3296.patch, LUCENE-3296.patch, LUCENE-3296.patch I need to be able to pass the IndexWriterConfig into the IW used by PKIndexSplitter. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
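The reason the patch needs two IWC instances can be illustrated with a minimal model of the SetOnce behavior mentioned in the comment. This is a standalone sketch, not Lucene's org.apache.lucene.util.SetOnce: the point is only that a value bound once (e.g. a config claimed by an IndexWriter) cannot be rebound, so each of the splitter's two writers needs its own config.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal model of set-once semantics (illustrative; not Lucene code).
// Once a value is set, any second attempt fails -- which is why an
// IndexWriterConfig cannot be shared between two IndexWriters.
public class SetOnceDemo<T> {
    private final AtomicBoolean set = new AtomicBoolean(false);
    private volatile T value;

    public void set(T v) {
        if (!set.compareAndSet(false, true)) {
            throw new IllegalStateException("The object cannot be set twice!");
        }
        value = v;
    }

    public T get() { return value; }

    public static void main(String[] args) {
        SetOnceDemo<String> writerSlot = new SetOnceDemo<>();
        writerSlot.set("writer-1");                // first writer claims the config
        try {
            writerSlot.set("writer-2");            // a second writer cannot reuse it...
        } catch (IllegalStateException expected) {
            System.out.println("reuse rejected");  // ...hence the second IWC in the patch
        }
    }
}
```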
[jira] [Issue Comment Edited] (LUCENE-3296) Enable passing a config into PKIndexSplitter
[ https://issues.apache.org/jira/browse/LUCENE-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063249#comment-13063249 ] Simon Willnauer edited comment on LUCENE-3296 at 7/11/11 9:54 AM: -- Here is a new patch. I added a second IWC since we cannot reuse IWC instances across IWs due to SetOnce restrictions. I also moved out the VERSION_CURRENT and made it a ctor argument. We should not randomly use VERSION_CURRENT but rather be consistent about which version we use. bq. Simon: The Version.LUCENE_CURRENT is not important here, for easier porting, the version should be LUCENE_CURRENT (and it was before Jason's patch). Else we will have to always upgrade it with every new release. The same applies to the IndexUpdater class in core, it also uses LUCENE_CURRENT when you do not pass in anything (as the version is completely useless for simple merge operations - like here). Not entirely true, we use the index splitter in 3.x, and if you upgrade from 3.1 to 3.2 you get a new merge policy by default which doesn't merge in order. I think it's a problem that this version is not in 3.x yet, so let's fix it properly and backport. Simon was (Author: simonw): Here is a new patch. I added a second IWC since we cannot reuse IWC instances across IWs due to SetOnce restrictions. I also moved out the VERSION_CURRENT and made it a ctor argument. We should not randomly use VERSION_CURRENT but rather be consistent about which version we use. Enable passing a config into PKIndexSplitter Key: LUCENE-3296 URL: https://issues.apache.org/jira/browse/LUCENE-3296 Project: Lucene - Java Issue Type: Improvement Components: core/other Affects Versions: 3.3, 4.0 Reporter: Jason Rutherglen Assignee: Simon Willnauer Priority: Trivial Attachments: LUCENE-3296.patch, LUCENE-3296.patch, LUCENE-3296.patch I need to be able to pass the IndexWriterConfig into the IW used by PKIndexSplitter. -- This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3296) Enable passing a config into PKIndexSplitter
[ https://issues.apache.org/jira/browse/LUCENE-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063259#comment-13063259 ] Uwe Schindler commented on LUCENE-3296: --- bq. not entirely true, we use the index splitter in 3.x and if you upgrade from 3.1 to 3.2 you get a new merge policy by default which doesn't merge in order. I think it's a problem that this version is not in 3.x yet so let's fix it properly and backport. PKIndexSplitter is new in 3.3, so you would never have used it with older versions... Enable passing a config into PKIndexSplitter Key: LUCENE-3296 URL: https://issues.apache.org/jira/browse/LUCENE-3296 Project: Lucene - Java Issue Type: Improvement Components: core/other Affects Versions: 3.3, 4.0 Reporter: Jason Rutherglen Assignee: Simon Willnauer Priority: Trivial Attachments: LUCENE-3296.patch, LUCENE-3296.patch, LUCENE-3296.patch I need to be able to pass the IndexWriterConfig into the IW used by PKIndexSplitter. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2564) Integrating grouping module into Solr 4.0
[ https://issues.apache.org/jira/browse/SOLR-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063272#comment-13063272 ] Martijn van Groningen commented on SOLR-2564: - Hi Matteo, I can also confirm the bug; it only happens when group.main=true. I also think that this error occurs on the 3x code base. I'll provide a fix for this issue soon. Integrating grouping module into Solr 4.0 - Key: SOLR-2564 URL: https://issues.apache.org/jira/browse/SOLR-2564 Project: Solr Issue Type: Improvement Reporter: Martijn van Groningen Assignee: Martijn van Groningen Priority: Blocker Fix For: 4.0 Attachments: LUCENE-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564_performance_loss_fix.patch Since work on the grouping module is going well, I think it is time to wire this up in Solr. Besides the current grouping features Solr provides, Solr will then also support second pass caching and total count based on groups. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2524) Adding grouping to Solr 3x
[ https://issues.apache.org/jira/browse/SOLR-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063284#comment-13063284 ] Yuriy Akopov commented on SOLR-2524: I suppose I'm late with these questions, but could you please acknowledge whether the following is correct: 1) The functionality from this patch was included in Solr 3.3, so there is no need to apply it to any version >= 3.3. 2) This patch (as well as the collapsing functionality in 3.3) doesn't allow calculation of facet numbers after collapsing. Faceting is still possible for collapsed results, but the numbers returned for facets are always calculated before collapsing the results. 3) In order to calculate facets after collapsing, LUCENE-3097 must be applied to Solr 3.3. Thanks. Adding grouping to Solr 3x -- Key: SOLR-2524 URL: https://issues.apache.org/jira/browse/SOLR-2524 Project: Solr Issue Type: New Feature Reporter: Martijn van Groningen Assignee: Martijn van Groningen Fix For: 3.3 Attachments: SOLR-2524.patch, SOLR-2524.patch, SOLR-2524.patch, SOLR-2524.patch, SOLR-2524.patch, SOLR-2524.patch Grouping was recently added to Lucene 3x. See LUCENE-1421 for more information. I think it would be nice if we expose this functionality also to the Solr users that are bound to a 3.x version. The grouping feature added to Lucene is currently a subset of the functionality that Solr 4.0-trunk offers. Mainly it doesn't support grouping by function / query. The work involved getting the grouping contrib to work on Solr 3x is acceptable. I have it more or less running here. It supports the response format and request parameters (except: group.query and group.func) described in the FieldCollapse page on the Solr wiki. I think it would be great if this is included in the Solr 3.2 release. Many people are using grouping as a patch now and this would help them a lot. Any thoughts? -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[JENKINS] Lucene-Solr-tests-only-trunk - Build # 9496 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9496/

2 tests failed.

REGRESSION: org.apache.solr.handler.dataimport.TestSqlEntityProcessor2.testCompositePk_FullImport

Error Message: Exception during query

Stack Trace:
java.lang.RuntimeException: Exception during query
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:405)
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:372)
at org.apache.solr.handler.dataimport.TestSqlEntityProcessor2.testCompositePk_FullImport(TestSqlEntityProcessor2.java:66)
at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464)
at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382)
Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//*[@numFound='1']
xml response was:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">0</int><lst name="params"><str name="start">0</str><str name="q">id:1</str><str name="qt">standard</str><str name="rows">20</str><str name="version">2.2</str></lst></lst><result name="response" numFound="0" start="0"/>
</response>
request was: start=0&q=id:1&qt=standard&rows=20&version=2.2
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:398)

REGRESSION: org.apache.solr.handler.dataimport.TestSqlEntityProcessorDelta3.testCompositePk_FullImport

Error Message: Exception during query

Stack Trace:
java.lang.RuntimeException: Exception during query
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:405)
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:372)
at org.apache.solr.handler.dataimport.TestSqlEntityProcessorDelta3.add1document(TestSqlEntityProcessorDelta3.java:83)
at org.apache.solr.handler.dataimport.TestSqlEntityProcessorDelta3.testCompositePk_FullImport(TestSqlEntityProcessorDelta3.java:92)
at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464)
at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382)
Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//*[@numFound='1']
xml response was:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="start">0</str><str name="q">*:* OR add1document</str><str name="qt">standard</str><str name="rows">20</str><str name="version">2.2</str></lst></lst><result name="response" numFound="0" start="0"/>
</response>
request was: start=0&q=*:*+OR+add1document&qt=standard&rows=20&version=2.2
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:398)

Build Log (for compile errors): [...truncated 12155 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-3233) HuperDuperSynonymsFilter™
[ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned LUCENE-3233: --- Assignee: Robert Muir HuperDuperSynonymsFilter™ - Key: LUCENE-3233 URL: https://issues.apache.org/jira/browse/LUCENE-3233 Project: Lucene - Java Issue Type: Improvement Reporter: Robert Muir Assignee: Robert Muir Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip The current synonymsfilter uses a lot of ram and cpu, especially at build time. I think yesterday I heard about huge synonyms files three times. So, I think we should use an FST-based structure, sharing the inputs and outputs. And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes() -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2647) DOMUtilTestBase should be abstract
[ https://issues.apache.org/jira/browse/SOLR-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063303#comment-13063303 ] Steven Rowe commented on SOLR-2647: --- +1 This is my mistake - thanks for fixing! DOMUtilTestBase should be abstract -- Key: SOLR-2647 URL: https://issues.apache.org/jira/browse/SOLR-2647 Project: Solr Issue Type: Improvement Reporter: Chris Male Priority: Trivial Attachments: SOLR-2647.patch It serves as a base for other test classes that use the DOM. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3295) BitVector never skips fully populated bytes when writing ClearedDgaps
[ https://issues.apache.org/jira/browse/LUCENE-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063312#comment-13063312 ] Michael McCandless commented on LUCENE-3295: Thank you for catching that something was amiss in the first place ;) That's the hardest part. BitVector never skips fully populated bytes when writing ClearedDgaps - Key: LUCENE-3295 URL: https://issues.apache.org/jira/browse/LUCENE-3295 Project: Lucene - Java Issue Type: Bug Components: core/other Affects Versions: 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Fix For: 4.0 Attachments: LUCENE-3295.patch, LUCENE-3295.patch When writing cleared DGaps in BitVector we compare a byte against 0xFF (255) yet the byte is casted into an int (-1) and the comparison will never succeed. We should mask the byte with 0xFF before comparing or compare against -1 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
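The promotion bug described in LUCENE-3295 is easy to demonstrate in isolation: a Java byte with all bits set widens to the int -1, never 255, so comparing it against the literal 0xFF always fails. A small sketch of the broken check and the two proposed fixes:

```java
public class ByteCompareDemo {
    public static void main(String[] args) {
        byte fullByte = (byte) 0xFF; // all eight bits set, like a fully populated byte in BitVector

        // Broken: the byte widens to int -1, and -1 == 255 is always false.
        System.out.println(fullByte == 0xFF);          // false

        // Fix 1: mask to the unsigned value before comparing.
        System.out.println((fullByte & 0xFF) == 0xFF); // true

        // Fix 2: compare against -1 directly.
        System.out.println(fullByte == -1);            // true
    }
}
```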
[jira] [Resolved] (SOLR-2647) DOMUtilTestBase should be abstract
[ https://issues.apache.org/jira/browse/SOLR-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Male resolved SOLR-2647. -- Resolution: Fixed Fix Version/s: 4.0 Assignee: Chris Male Committed revision 1145154. DOMUtilTestBase should be abstract -- Key: SOLR-2647 URL: https://issues.apache.org/jira/browse/SOLR-2647 Project: Solr Issue Type: Improvement Reporter: Chris Male Assignee: Chris Male Priority: Trivial Fix For: 4.0 Attachments: SOLR-2647.patch It serves as a base for other test classes that use the DOM. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2644) DIH handler - when using threads=2 the default logging is set too high
[ https://issues.apache.org/jira/browse/SOLR-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-2644: Attachment: SOLR-2644.patch This was probably added for debugging. Attached patch to remove the extra logging. I'll commit shortly. DIH handler - when using threads=2 the default logging is set too high -- Key: SOLR-2644 URL: https://issues.apache.org/jira/browse/SOLR-2644 Project: Solr Issue Type: Bug Components: contrib - DataImportHandler Affects Versions: 3.3 Reporter: Bill Bell Fix For: 3.4, 4.0 Attachments: SOLR-2644.patch Setting threads parameter in DIH handler, every add outputs to the log in INFO level. The only current solution is to set the following in log4j.properties: log4j.rootCategory=INFO, logfile log4j.logger.org.apache.solr.handler.dataimport.DocBuilder=FATAL log4j.logger.org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper=FATAL These 2 log messages need to be changed to INFO. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2644) DIH handler - when using threads=2 the default logging is set too high
[ https://issues.apache.org/jira/browse/SOLR-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-2644: Fix Version/s: 4.0 3.4 Assignee: Shalin Shekhar Mangar DIH handler - when using threads=2 the default logging is set too high -- Key: SOLR-2644 URL: https://issues.apache.org/jira/browse/SOLR-2644 Project: Solr Issue Type: Bug Components: contrib - DataImportHandler Affects Versions: 3.3 Reporter: Bill Bell Assignee: Shalin Shekhar Mangar Fix For: 3.4, 4.0 Attachments: SOLR-2644.patch Setting threads parameter in DIH handler, every add outputs to the log in INFO level. The only current solution is to set the following in log4j.properties: log4j.rootCategory=INFO, logfile log4j.logger.org.apache.solr.handler.dataimport.DocBuilder=FATAL log4j.logger.org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper=FATAL These 2 log messages need to be changed to INFO. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import
[ https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063346#comment-13063346 ] Shalin Shekhar Mangar commented on SOLR-2551: - bq. Doesn't every test of delta functionality write to the dataimport.properties file? Yes, it does, but I don't think any of our tests rely on the contents of the properties file. Ironically, the fact that the tests failed is proof that this feature works :) Check dataimport.properties for write access before starting import --- Key: SOLR-2551 URL: https://issues.apache.org/jira/browse/SOLR-2551 Project: Solr Issue Type: Improvement Components: contrib - DataImportHandler Affects Versions: 1.4.1, 3.1 Reporter: C S Assignee: Shalin Shekhar Mangar Priority: Minor Fix For: 3.3, 4.0 Attachments: SOLR-2551.patch A common mistake is that the /conf (respectively the dataimport.properties) file is not writable for Solr. It would be great if that were detected on starting a dataimport job. Currently an import might grind away for days and fail if it can't write its timestamp to the dataimport.properties file. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-3233) HuperDuperSynonymsFilter™
[ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-3233. - Resolution: Fixed Fix Version/s: 4.0 3.4 HuperDuperSynonymsFilter™ - Key: LUCENE-3233 URL: https://issues.apache.org/jira/browse/LUCENE-3233 Project: Lucene - Java Issue Type: Improvement Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.4, 4.0 Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip The current synonymsfilter uses a lot of ram and cpu, especially at build time. I think yesterday I heard about huge synonyms files three times. So, I think we should use an FST-based structure, sharing the inputs and outputs. And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes() -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import
[ https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063352#comment-13063352 ] Chris Male commented on SOLR-2551: -- Okay, so, given that basically every test writes to this file, what are our options? To me it seems that since the file is getting written to (whether we rely on the contents or not), this could get in the way of another test. So perhaps we need to pull the checkWritablePersistFile method out for a while and re-assess how to achieve the same functionality in a way the tests can handle? Check dataimport.properties for write access before starting import --- Key: SOLR-2551 URL: https://issues.apache.org/jira/browse/SOLR-2551 Project: Solr Issue Type: Improvement Components: contrib - DataImportHandler Affects Versions: 1.4.1, 3.1 Reporter: C S Assignee: Shalin Shekhar Mangar Priority: Minor Fix For: 3.3, 4.0 Attachments: SOLR-2551.patch A common mistake is that the /conf (respectively the dataimport.properties) file is not writable for Solr. It would be great if that were detected on starting a dataimport job. Currently an import might grind away for days and fail if it can't write its timestamp to the dataimport.properties file. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import
[ https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063354#comment-13063354 ] Chris Male commented on SOLR-2551: -- Or alternatively, we could make the DIH tests run sequentially, so we don't hit this problem. Check dataimport.properties for write access before starting import --- Key: SOLR-2551 URL: https://issues.apache.org/jira/browse/SOLR-2551 Project: Solr Issue Type: Improvement Components: contrib - DataImportHandler Affects Versions: 1.4.1, 3.1 Reporter: C S Assignee: Shalin Shekhar Mangar Priority: Minor Fix For: 3.3, 4.0 Attachments: SOLR-2551.patch A common mistake is that the /conf (respectively the dataimport.properties) file is not writable for Solr. It would be great if that were detected on starting a dataimport job. Currently an import might grind away for days and fail if it can't write its timestamp to the dataimport.properties file. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import
[ https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063355#comment-13063355 ] Shalin Shekhar Mangar commented on SOLR-2551: - Yes, let's disable this test for now. I don't think it is even worth testing. I guess I just had too much time that day :) Another option could be to run the DIH tests sequentially. Check dataimport.properties for write access before starting import --- Key: SOLR-2551 URL: https://issues.apache.org/jira/browse/SOLR-2551 Project: Solr Issue Type: Improvement Components: contrib - DataImportHandler Affects Versions: 1.4.1, 3.1 Reporter: C S Assignee: Shalin Shekhar Mangar Priority: Minor Fix For: 3.3, 4.0 Attachments: SOLR-2551.patch A common mistake is that the /conf (respectively the dataimport.properties) file is not writable for Solr. It would be great if that were detected on starting a dataimport job. Currently an import might grind away for days and fail if it can't write its timestamp to the dataimport.properties file. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import
[ https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063366#comment-13063366 ] Steven Rowe commented on SOLR-2551: --- I'll switch the DIH tests to run sequentially. The benchmark module does this by setting the {{tests.threadspercpu}} property to zero. Here's the patch:

{code}
Index: solr/contrib/dataimporthandler/build.xml
===
--- solr/contrib/dataimporthandler/build.xml (revision 1145189)
+++ solr/contrib/dataimporthandler/build.xml (working copy)
@@ -23,6 +23,9 @@
     Data Import Handler
   </description>
+
+  <!-- the tests have some parallel problems: writability to single copy of dataimport.properties -->
+  <property name="tests.threadspercpu" value="0"/>
   <import file="../contrib-build.xml"/>
 </project>
{code}

Committing shortly. Check dataimport.properties for write access before starting import --- Key: SOLR-2551 URL: https://issues.apache.org/jira/browse/SOLR-2551 Project: Solr Issue Type: Improvement Components: contrib - DataImportHandler Affects Versions: 1.4.1, 3.1 Reporter: C S Assignee: Shalin Shekhar Mangar Priority: Minor Fix For: 3.3, 4.0 Attachments: SOLR-2551.patch A common mistake is that the /conf (respectively the dataimport.properties) file is not writable for Solr. It would be great if that were detected on starting a dataimport job. Currently an import might grind away for days and fail if it can't write its timestamp to the dataimport.properties file. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import
[ https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063369#comment-13063369 ] Steven Rowe commented on SOLR-2551: --- Committed the patch to run DIH tests sequentially: - r1145194: trunk - r1145196: branch_3x Check dataimport.properties for write access before starting import --- Key: SOLR-2551 URL: https://issues.apache.org/jira/browse/SOLR-2551 Project: Solr Issue Type: Improvement Components: contrib - DataImportHandler Affects Versions: 1.4.1, 3.1 Reporter: C S Assignee: Shalin Shekhar Mangar Priority: Minor Fix For: 3.3, 4.0 Attachments: SOLR-2551.patch A common mistake is that the /conf (respectively the dataimport.properties) file is not writable for Solr. It would be great if that were detected on starting a dataimport job. Currently an import might grind away for days and fail if it can't write its timestamp to the dataimport.properties file. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2382) DIH Cache Improvements
[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063372#comment-13063372 ] James Dyer commented on SOLR-2382: -- Noble, Are you still able to work with me on this issue? Is there anything else you are waiting for from me? The patch I submitted on June 24 passes parameters via the Context object as you requested. Also, I previously separated BerkleyBackedCache out into a separate issue to (SOLR-2613) so we won't run into licensing issues here. Let me know what else you think we need to do. Thanks. DIH Cache Improvements -- Key: SOLR-2382 URL: https://issues.apache.org/jira/browse/SOLR-2382 Project: Solr Issue Type: New Feature Components: contrib - DataImportHandler Reporter: James Dyer Priority: Minor Attachments: SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch Functionality: 1. Provide a pluggable caching framework for DIH so that users can choose a cache implementation that best suits their data and application. 2. Provide a means to temporarily cache a child Entity's data without needing to create a special cached implementation of the Entity Processor (such as CachedSqlEntityProcessor). 3. Provide a means to write the final (root entity) DIH output to a cache rather than to Solr. Then provide a way for a subsequent DIH call to use the cache as an Entity input. Also provide the ability to do delta updates on such persistent caches. 4. Provide the ability to partition data across multiple caches that can then be fed back into DIH and indexed either to varying Solr Shards, or to the same Core in parallel. Use Cases: 1. We needed a flexible scalable way to temporarily cache child-entity data prior to joining to parent entities. - Using SqlEntityProcessor with Child Entities can cause an n+1 select problem. 
- CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching mechanism and does not scale. - There is no way to cache non-SQL inputs (ex: flat files, xml, etc). 2. We needed the ability to gather data from long-running entities by a process that runs separate from our main indexing process. 3. We wanted the ability to do a delta import of only the entities that changed. - Lucene/Solr requires entire documents to be re-indexed, even if only a few fields changed. - Our data comes from 50+ complex sql queries and/or flat files. - We do not want to incur overhead re-gathering all of this data if only 1 entity's data changed. - Persistent DIH caches solve this problem. 4. We want the ability to index several documents in parallel (using 1.4.1, which did not have the threads parameter). 5. In the future, we may need to use Shards, creating a need to easily partition our source data into Shards. Implementation Details: 1. De-couple EntityProcessorBase from caching. - Created a new interface, DIHCache two implementations: - SortedMapBackedCache - An in-memory cache, used as default with CachedSqlEntityProcessor (now deprecated). - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested with je-4.1.6.jar - NOTE: the existing Lucene Contrib db project uses je-3.3.93.jar. I believe this may be incompatible due to Generic Usage. - NOTE: I did not modify the ant script to automatically get this jar, so to use or evaluate this patch, download bdb-je from http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html 2. Allow Entity Processors to take a cacheImpl parameter to cause the entity data to be cached (see EntityProcessorBase DIHCacheProperties). 3. Partially De-couple SolrWriter from DocBuilder - Created a new interface DIHWriter, two implementations: - SolrWriter (refactored) - DIHCacheWriter (allows DIH to write ultimately to a Cache). 4. 
Create a new Entity Processor, DIHCacheProcessor, which reads a persistent Cache as DIH Entity Input. 5. Support a partition parameter with both DIHCacheWriter and DIHCacheProcessor to allow for easy partitioning of source entity data. 6. Change the semantics of entity.destroy() - Previously, it was being called on each iteration of DocBuilder.buildDocument(). - Now it does one-time cleanup tasks (like closing or deleting a disk-backed cache) once the entity processor is completed. - The only out-of-the-box entity processor that previously implemented destroy() was LineEntityProcessor, so this is not a very invasive change. General Notes: We are near completion in converting our search functionality from a legacy search engine to Solr. However, I found that DIH did
[jira] [Resolved] (SOLR-2615) Have LogUpdateProcessor log each command (add, delete, ...) at debug/FINE level
[ https://issues.apache.org/jira/browse/SOLR-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley resolved SOLR-2615. Resolution: Fixed Have LogUpdateProcessor log each command (add, delete, ...) at debug/FINE level --- Key: SOLR-2615 URL: https://issues.apache.org/jira/browse/SOLR-2615 Project: Solr Issue Type: Improvement Components: update Reporter: David Smiley Assignee: Yonik Seeley Priority: Minor Fix For: 3.4, 4.0 Attachments: SOLR-2615_LogUpdateProcessor_debug_logging.patch, SOLR-2615_LogUpdateProcessor_debug_logging.patch It would be great if the LogUpdateProcessor logged each command (add, delete, ...) at debug (Fine) level. Presently it only logs a summary of 8 commands and it does so at the very end. The attached patch implements this. * I moved the LogUpdateProcessor ahead of RunUpdateProcessor so that the debug level log happens before Solr does anything with it. It should not affect the ordering of the existing summary log which happens at finish(). * I changed UpdateRequestProcessor's static log variable to be an instance variable that uses the current class name. I think this makes much more sense since I want to be able to alter logging levels for a specific processor without doing it for all of them. This change did require me to tweak the factory's detection of the log level which avoids creating the LogUpdateProcessor. * There was an NPE bug in AddUpdateCommand.getPrintableId() in the event there is no schema unique field. I fixed that. You may notice I use SLF4J's nifty log.debug(message blah {} blah, var) syntax, which is both performant and concise as there's no point in guarding the debug message with an isDebugEnabled() since debug() will internally check this any way and there is no string concatenation if debug isn't enabled. -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
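The parameterized-logging point in SOLR-2615 is worth a concrete illustration: the level check happens inside SLF4J's debug() call itself, so the message string is only built when debug is enabled and callers need no isDebugEnabled() guard. A self-contained mock of that behavior (not the real org.slf4j.Logger, just an illustration of the pattern):

```java
// Mock logger illustrating SLF4J-style parameterized logging
// (a hypothetical stand-in; the real interface is org.slf4j.Logger).
public class LazyLogDemo {
    private final boolean debugEnabled;
    int formatCalls = 0; // counts how often the message string was actually built

    LazyLogDemo(boolean debugEnabled) { this.debugEnabled = debugEnabled; }

    void debug(String template, Object arg) {
        // The level check lives inside debug(), so the caller pays no
        // concatenation/formatting cost when debug is off.
        if (!debugEnabled) return;
        formatCalls++;
        System.out.println(template.replace("{}", String.valueOf(arg)));
    }

    public static void main(String[] args) {
        LazyLogDemo off = new LazyLogDemo(false);
        off.debug("add {}", "doc1"); // no formatting happens

        LazyLogDemo on = new LazyLogDemo(true);
        on.debug("add {}", "doc1");  // prints: add doc1
    }
}
```

Note that the argument expression itself is still evaluated at the call site; the saving is in string construction, which is exactly why the guard becomes unnecessary for simple messages.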
[jira] [Commented] (SOLR-2452) rewrite solr build system
[ https://issues.apache.org/jira/browse/SOLR-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063394#comment-13063394 ] Yonik Seeley commented on SOLR-2452: bq. What's the right thing to do here in terms of a patch against the old file structure? Is it reasonable to check out fresh code, hack the patch file to reflect the new paths and apply it to the new structure or must I re-edit the source? That's what I did. bq. And is SVN merge smart enough to deal when merging from trunk to 3x when 3x hasn't been changed, or is it better to just wait on it all until the back-port is done? Apply the changes in 3x however you can (i.e. patch, etc) and then use svn merge --record-only. http://wiki.apache.org/lucene-java/SvnMerge rewrite solr build system - Key: SOLR-2452 URL: https://issues.apache.org/jira/browse/SOLR-2452 Project: Solr Issue Type: Task Components: Build Reporter: Robert Muir Assignee: Steven Rowe Fix For: 3.4, 4.0 Attachments: SOLR-2452-post-reshuffling.patch, SOLR-2452-post-reshuffling.patch, SOLR-2452-post-reshuffling.patch, SOLR-2452.diffSource.py.patch.zip, SOLR-2452.dir.reshuffle.sh, SOLR-2452.dir.reshuffle.sh As discussed some in SOLR-2002 (but that issue is long and hard to follow), I think we should rewrite the solr build system. Its slow, cumbersome, and messy, and makes it hard for us to improve things. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
heads up: reindex trunk indexes
I just committed https://issues.apache.org/jira/browse/LUCENE-3233, which includes improvements that change the format of the terms index. You should reindex. -- lucidimagination.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2452) rewrite solr build system
[ https://issues.apache.org/jira/browse/SOLR-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063418#comment-13063418 ] Yonik Seeley commented on SOLR-2452: The script produced output like this:
{code}
Index: solr/core/src/java/org/apache/solr/core/SolrCore.java
===
--- solr/src/java/org/apache/solr/core/SolrCore.java (revision 80231429dc9c7680375a0a21b1886e59b194)
+++ solr/src/java/org/apache/solr/core/SolrCore.java (revision )
{code}
Notice that 'core' wasn't substituted on the lines starting with --- and +++. Trying to use the resulting patch file, I got:
{code}
/opt/code/lusolr$ patch -p0 < tt.patch
can't find file to patch at input line 5
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--
|Index: solr/core/src/java/org/apache/solr/core/SolrCore.java
|===
|--- solr/src/java/org/apache/solr/core/SolrCore.java (revision 80231429dc9c7680375a0a21b1886e59b194)
|+++ solr/src/java/org/apache/solr/core/SolrCore.java (revision )
--
{code}
rewrite solr build system - Key: SOLR-2452 URL: https://issues.apache.org/jira/browse/SOLR-2452 Project: Solr Issue Type: Task Components: Build Reporter: Robert Muir Assignee: Steven Rowe Fix For: 3.4, 4.0 Attachments: SOLR-2452-post-reshuffling.patch, SOLR-2452-post-reshuffling.patch, SOLR-2452-post-reshuffling.patch, SOLR-2452.diffSource.py.patch.zip, SOLR-2452.dir.reshuffle.sh, SOLR-2452.dir.reshuffle.sh As discussed some in SOLR-2002 (but that issue is long and hard to follow), I think we should rewrite the solr build system. It's slow, cumbersome, and messy, and makes it hard for us to improve things. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
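The bug Yonik hit — paths rewritten on `Index:` lines but not on `---`/`+++` lines — comes down to applying the old-to-new path mapping only to one of the three line prefixes a unified diff uses. A sketch of the fix (the single `solr/src/java/` → `solr/core/src/java/` mapping here is illustrative, not the full SOLR-2452 reshuffle mapping, and this is not the actual Perl script attached to the issue):

```java
// Rewrite pre-reshuffle paths in a patch file. The mapping must be applied
// on "Index:", "---", and "+++" lines alike, not just "Index:" lines.
public class PatchPathFix {
    static String rewriteLine(String line) {
        if (line.startsWith("Index: ") || line.startsWith("--- ")
                || line.startsWith("+++ ")) {
            // Illustrative single mapping; the real script carries a table
            // of old -> new directory mappings.
            return line.replace("solr/src/java/", "solr/core/src/java/");
        }
        return line; // hunk bodies and context lines are left untouched
    }
}
```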
[jira] [Commented] (SOLR-2452) rewrite solr build system
[ https://issues.apache.org/jira/browse/SOLR-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063419#comment-13063419 ] Steven Rowe commented on SOLR-2452: --- Thanks Yonik - I'll fix it rewrite solr build system - Key: SOLR-2452 URL: https://issues.apache.org/jira/browse/SOLR-2452 Project: Solr Issue Type: Task Components: Build Reporter: Robert Muir Assignee: Steven Rowe Fix For: 3.4, 4.0 Attachments: SOLR-2452-post-reshuffling.patch, SOLR-2452-post-reshuffling.patch, SOLR-2452-post-reshuffling.patch, SOLR-2452.diffSource.py.patch.zip, SOLR-2452.dir.reshuffle.sh, SOLR-2452.dir.reshuffle.sh, SOLR-2452.patch.hack.pl As discussed some in SOLR-2002 (but that issue is long and hard to follow), I think we should rewrite the solr build system. Its slow, cumbersome, and messy, and makes it hard for us to improve things. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2452) rewrite solr build system
[ https://issues.apache.org/jira/browse/SOLR-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated SOLR-2452: -- Attachment: SOLR-2452.patch.hack.pl This version of the patch hacking script is fixed so that all paths are modified instead of just the ones on 'Index:' lines rewrite solr build system - Key: SOLR-2452 URL: https://issues.apache.org/jira/browse/SOLR-2452 Project: Solr Issue Type: Task Components: Build Reporter: Robert Muir Assignee: Steven Rowe Fix For: 3.4, 4.0 Attachments: SOLR-2452-post-reshuffling.patch, SOLR-2452-post-reshuffling.patch, SOLR-2452-post-reshuffling.patch, SOLR-2452.diffSource.py.patch.zip, SOLR-2452.dir.reshuffle.sh, SOLR-2452.dir.reshuffle.sh, SOLR-2452.patch.hack.pl, SOLR-2452.patch.hack.pl As discussed some in SOLR-2002 (but that issue is long and hard to follow), I think we should rewrite the solr build system. Its slow, cumbersome, and messy, and makes it hard for us to improve things. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import
[ https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063424#comment-13063424 ] Shalin Shekhar Mangar commented on SOLR-2551: - Thanks Steven! Check dataimport.properties for write access before starting import --- Key: SOLR-2551 URL: https://issues.apache.org/jira/browse/SOLR-2551 Project: Solr Issue Type: Improvement Components: contrib - DataImportHandler Affects Versions: 1.4.1, 3.1 Reporter: C S Assignee: Shalin Shekhar Mangar Priority: Minor Fix For: 3.3, 4.0 Attachments: SOLR-2551.patch A common mistake is that the /conf (respectively the dataimport.properties) file is not writable for solr. It would be great if that were detected on starting a dataimport job. Currently and import might grind away for days and fail if it can't write its timestamp to the dataimport.properties file. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
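The pre-flight check SOLR-2551 asks for can be sketched as below: before a long import starts, verify that the persistence file — or, if it does not exist yet, its parent directory — is writable. This is a hedged illustration, not the committed patch; the file name and helper are assumptions for the example.

```java
import java.io.File;

// Sketch: fail fast if dataimport.properties cannot be written, instead of
// letting an import grind for days and fail at the final timestamp write.
public class WriteCheck {
    static boolean canPersist(File propsFile) {
        if (propsFile.exists()) {
            return propsFile.canWrite();
        }
        // File doesn't exist yet: creating it requires a writable directory.
        File dir = propsFile.getAbsoluteFile().getParentFile();
        return dir != null && dir.exists() && dir.canWrite();
    }
}
```

A DataImportHandler-style component would call this at the start of a full or delta import and abort with a clear error message when it returns false.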
Re: [Lucene.Net] Incubator Status Page
On Sun, Jul 10, 2011 at 6:24 PM, Stefan Bodewig bode...@apache.org wrote: Hi all, http://incubator.apache.org/projects/lucene.net.html contains quite a few blanks that I think we could easily fill. I intend to either add some N/A or real dates where I can during the coming week. On the IP issues part (copyright and distribution rights) I trust the Lucene PMC has been taking care of this before Lucene.NET headed back to the Incubator and after that all contributions have come either directly by people with a CLA on file or as patches via JIRA where the ASF may use this checkbox has been checked - is this correct? absolutely. For the project specific tasks I'd ask all of you to fill in whatever you feel like adding. All Lucene.NET committers should be able to modify the status page. Stefan DIGY
[jira] [Created] (SOLR-2648) improve interaction of synonymsfilterfactory with analysis chain
improve interaction of synonymsfilterfactory with analysis chain Key: SOLR-2648 URL: https://issues.apache.org/jira/browse/SOLR-2648 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 3.4, 4.0 Reporter: Robert Muir Spinoff of LUCENE-3233 (there is a TODO here), this was also mentioned by Otis on the mailing list: http://www.lucidimagination.com/search/document/8e91f858314562e/automatic_synonyms_for_multiple_variations_of_a_word#76c3d09f95f7a58f As of LUCENE-3233, the builder for the synonyms structure uses an Analyzer behind the scenes to actually tokenize the synonyms in your synonyms file. Currently the solr factory uses a WhitespaceTokenizer, unless you supply the tokenizerchain parameter, which lets you specify a tokenizer. If there were some way to instead specify a chain to this factory (e.g. charfilters, tokenizer, tokenfilters such as stemmers) versus just a tokenizerfactory, it would be a lot more flexible (e.g. it would stem your synonyms for you), and would solve this use case. Personally I think it would be most ideal if this just automatically worked, e.g. if you have a chain of A, B, SynonymsFilter, C, D: then in my opinion the synonyms should be analyzed with an analysis chain of A, B. This way the injected synonyms are processed as if they were in the tokenstream to begin with. Note: there are some limitations here to what the chain can do, e.g. you can't put WDF before synonyms or other things that muck with positions, and you can't have a synonym that analyzes to nothing at all, but the parser checks for all these conditions and throws a syntax error so it would be clear to the user that they put the synonymsfilter in the wrong place in their chain. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
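The "analyze synonyms with the A, B sub-chain" proposal can be illustrated with a toy example. Here the A, B portion of the chain is modeled as a single lowercasing normalize step applied both to synonym-file entries and (implicitly) to stream tokens, so injected synonyms line up with the tokens the filter actually sees. This is a sketch of the idea only, not Solr's factory API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy model: normalize synonym entries with the same analysis the tokens
// upstream of the SynonymFilter receive, so lookups match at runtime.
public class SynNormalize {
    // Stands in for the A, B sub-chain (e.g. lowercasing, stemming).
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    // Build the synonym map with both sides normalized.
    static Map<String, String> buildMap(Map<String, String> rawSynonyms) {
        Map<String, String> out = new HashMap<>();
        for (Map.Entry<String, String> e : rawSynonyms.entrySet()) {
            out.put(normalize(e.getKey()), normalize(e.getValue()));
        }
        return out;
    }
}
```

Without the shared normalize step, a file entry like "Colour" would never match the lowercased token "colour" produced by the upstream filters — which is exactly the mismatch the issue wants to eliminate.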
[jira] [Resolved] (LUCENE-3280) Add new bit set impl for caching filters
[ https://issues.apache.org/jira/browse/LUCENE-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-3280. Resolution: Fixed Add new bit set impl for caching filters Key: LUCENE-3280 URL: https://issues.apache.org/jira/browse/LUCENE-3280 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.4, 4.0 Attachments: LUCENE-3280.patch, LUCENE-3280.patch I think OpenBitSet is trying to satisfy too many audiences, and it's confusing/error-prone as a result. It has int/long variants of many methods. Some methods require in-bounds access, others don't; of those others, some methods auto-grow the bits, some don't. OpenBitSet doesn't always know its numBits. I'd like to factor out a more focused bit set impl whose primary target usage is a cached Lucene Filter, ie a bit set indexed by docID (int, not long) whose size is known and fixed up front (backed by a final long[]) and is always accessed in-bounds. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
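The issue description is concrete enough to sketch: a bit set indexed by an int docID, size fixed at construction, backed by a final long[], with access always in-bounds by contract. This is a minimal illustration of the design the issue proposes, not the class that was actually committed.

```java
// Minimal sketch of the LUCENE-3280 idea: no auto-grow, no int/long method
// duplication, and the size (numBits) is always known.
public class FixedBits {
    private final long[] bits;
    private final int numBits;

    FixedBits(int numBits) {
        this.numBits = numBits;
        this.bits = new long[(numBits + 63) >>> 6]; // one long per 64 bits
    }

    void set(int docID) {
        assert docID >= 0 && docID < numBits; // in-bounds by contract
        bits[docID >>> 6] |= 1L << (docID & 63);
    }

    boolean get(int docID) {
        assert docID >= 0 && docID < numBits;
        return (bits[docID >>> 6] & (1L << (docID & 63))) != 0;
    }

    int length() {
        return numBits; // unlike OpenBitSet, the size is always known
    }
}
```

For a cached filter, the fixed size falls naturally out of the segment's maxDoc, which is why the caching use case doesn't need auto-grow at all.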
[jira] [Commented] (LUCENE-2048) Omit positions but keep termFreq
[ https://issues.apache.org/jira/browse/LUCENE-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063455#comment-13063455 ] Robert Muir commented on LUCENE-2048: - i created a throwaway branch: branches/omitp, to hopefully sucker mike into helping me with some random fails (always pulsing is involved!) in general the pulsing cutover was tricky for me. Omit positions but keep termFreq Key: LUCENE-2048 URL: https://issues.apache.org/jira/browse/LUCENE-2048 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.1 Reporter: Andrzej Bialecki Assignee: Robert Muir Fix For: 4.0 Attachments: LUCENE-2048.patch it would be useful to have an option to discard positional information but still keep the term frequency - currently setOmitTermFreqAndPositions discards both. Even though position-dependent queries wouldn't work in such case, still any other queries would work fine and we would get the right scoring. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3293) Use IOContext.READONCE in VarGapTermsIndexReader to load FST
[ https://issues.apache.org/jira/browse/LUCENE-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Thacker updated LUCENE-3293: -- Attachment: LUCENE-3293.patch Also edited SegmentReader#loadLiveDocs Use IOContext.READONCE in VarGapTermsIndexReader to load FST Key: LUCENE-3293 URL: https://issues.apache.org/jira/browse/LUCENE-3293 Project: Lucene - Java Issue Type: Task Components: core/codecs Affects Versions: 4.0 Reporter: Simon Willnauer Assignee: Varun Thacker Priority: Minor Fix For: 4.0 Attachments: LUCENE-3293.patch VarGapTermsIndexReader should pass READONCE context down when it opens/reads the FST. Yet, it should just replace the ctx passed in, ie if we are merging vs reading we want to differentiate. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2452) rewrite solr build system
[ https://issues.apache.org/jira/browse/SOLR-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063488#comment-13063488 ] Steven Rowe commented on SOLR-2452: --- If there are no objections, I plan on committing the patch hacking script to {{dev-tools/scripts/}} later today. rewrite solr build system - Key: SOLR-2452 URL: https://issues.apache.org/jira/browse/SOLR-2452 Project: Solr Issue Type: Task Components: Build Reporter: Robert Muir Assignee: Steven Rowe Fix For: 3.4, 4.0 Attachments: SOLR-2452-post-reshuffling.patch, SOLR-2452-post-reshuffling.patch, SOLR-2452-post-reshuffling.patch, SOLR-2452.diffSource.py.patch.zip, SOLR-2452.dir.reshuffle.sh, SOLR-2452.dir.reshuffle.sh, SOLR-2452.patch.hack.pl, SOLR-2452.patch.hack.pl As discussed some in SOLR-2002 (but that issue is long and hard to follow), I think we should rewrite the solr build system. Its slow, cumbersome, and messy, and makes it hard for us to improve things. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[JENKINS] Lucene-Solr-tests-only-3.x - Build # 9511 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/9511/

No tests ran.

Build Log (for compile errors):
[...truncated 3060 lines...]
[javac] found   : java.util.Collection
[javac] required: java.util.Collection<java.lang.String>
[javac] public Collection getFileNames() throws IOException {
[javac]        ^
[javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/core/IndexDeletionPolicyWrapper.java:211: warning: getUserData() in org.apache.solr.core.IndexDeletionPolicyWrapper.IndexCommitWrapper overrides getUserData() in org.apache.lucene.index.IndexCommit; return type requires unchecked conversion
[javac] found   : java.util.Map
[javac] required: java.util.Map<java.lang.String,java.lang.String>
[javac] public Map getUserData() throws IOException {
[javac]        ^
[javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.java:173: warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("handlerStart",handlerStart);
[javac]        ^
[javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.java:174: warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("requests", numRequests);
[javac]        ^
[javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.java:175: warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("errors", numErrors);
[javac]        ^
[javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.java:176: warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("timeouts", numTimeouts);
[javac]        ^
[javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.java:177: warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("totalTime",totalTime);
[javac]        ^
[javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.java:178: warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("avgTimePerRequest", (float) totalTime / (float) this.numRequests);
[javac]        ^
[javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.java:179: warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("avgRequestsPerSecond", (float) numRequests*1000 / (float)(System.currentTimeMillis()-handlerStart));
[javac]        ^
[javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/admin/CoreAdminHandler.java:213: warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.util.RefCounted[]
[javac] required: org.apache.solr.util.RefCounted<org.apache.solr.search.SolrIndexSearcher>[]
[javac] searchers = new RefCounted[sourceCores.length];
[javac]             ^
[javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/component/ResponseBuilder.java:291: warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of the raw type org.apache.solr.common.util.NamedList
[javac] rsp.getResponseHeader().add( "partialResults", Boolean.TRUE );
[javac]                         ^
[javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/search/FunctionQParser.java:254: warning: [unchecked] unchecked conversion
[javac] found   : java.util.HashMap
[javac] required: java.util.Map<java.lang.String,java.lang.String>
[javac] int end = QueryParsing.parseLocalParams(qs, start, nestedLocalParams, getParams());
[javac]                                                    ^
[javac]
[jira] [Resolved] (LUCENE-3289) FST should allow controlling how hard builder tries to share suffixes
[ https://issues.apache.org/jira/browse/LUCENE-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-3289. Resolution: Fixed FST should allow controlling how hard builder tries to share suffixes - Key: LUCENE-3289 URL: https://issues.apache.org/jira/browse/LUCENE-3289 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.4, 4.0 Attachments: LUCENE-3289.patch, LUCENE-3289.patch Today we have a boolean option to the FST builder telling it whether it should share suffixes. If you turn this off, building is much faster, uses much less RAM, and the resulting FST is a prefix trie. But, the FST is larger than it needs to be. When it's on, the builder maintains a node hash holding every node seen so far in the FST -- this uses up RAM and slows things down. On a dataset that Elmer (see the java-user thread "Autocompletion on large index" on Jul 6 2011) provided (thank you!), which is 1.32 M titles, avg 67.3 chars per title, building with suffix sharing on took 22.5 seconds, required 1.25 GB heap, and produced a 91.6 MB FST. With suffix sharing off, it was 8.2 seconds, 450 MB heap and a 129 MB FST. I think we should allow this boolean to be a shade-of-gray instead: usually, how well suffixes can share is a function of how far they are from the end of the string, so, by adding a tunable N to only share when suffix length <= N, we can let the caller make reasonable tradeoffs. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
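The tradeoff the issue describes can be shown with a toy model: share a suffix across words only when its length is within a threshold N (sharing pays off most near the ends of strings). The counter below uses total stored characters as a rough proxy for FST size; it is purely illustrative and is not the actual FST builder or its node hash.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the "shade-of-gray" suffix-sharing option.
public class SuffixShareDemo {
    // Returns a rough "stored size": per-word prefix characters plus the
    // deduplicated shared tails (tails of length <= maxSharedTailLength).
    static int storedEntries(List<String> words, int maxSharedTailLength) {
        Set<String> shared = new HashSet<>();
        int unshared = 0;
        for (String w : words) {
            int split = Math.max(0, w.length() - maxSharedTailLength);
            unshared += split;                  // prefix chars stored per word
            if (split < w.length()) {
                shared.add(w.substring(split)); // tail deduplicated across words
            }
        }
        int sharedChars = 0;
        for (String s : shared) sharedChars += s.length();
        return unshared + sharedChars;
    }
}
```

With N = 0 nothing is shared (the fast, memory-light, larger-output mode); raising N shrinks the output at the cost of tracking more shareable tails — the same knob the issue adds to the builder.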
RE: [JENKINS] Lucene-Solr-tests-only-3.x - Build # 9511 - Failure
This compilation failure is down to @Override annotations - I've committed the fix (removing the annotations): [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/analysis/FSTSynonymFilterFactory.java:57: method does not override a method from its superclass [javac] @Override [javac]^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/analysis/FSTSynonymFilterFactory.java:62: method does not override a method from its superclass [javac] @Override [javac]^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/analysis/SynonymFilterFactory.java:43: method does not override a method from its superclass [javac] @Override [javac]^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/analysis/SynonymFilterFactory.java:49: method does not override a method from its superclass [javac] @Override [javac]^ -Original Message- From: Apache Jenkins Server [mailto:jenk...@builds.apache.org] Sent: Monday, July 11, 2011 3:43 PM To: dev@lucene.apache.org Subject: [JENKINS] Lucene-Solr-tests-only-3.x - Build # 9511 - Failure Build: https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/9511/ No tests ran. Build Log (for compile errors): [...truncated 3060 lines...] 
[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans
[ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063499#comment-13063499 ] Mike Sokolov commented on LUCENE-2878: -- OK I think I brushed by some of your comments, Simon, in my hasty response, sorry. Here's a little more thought, I hope: bq. So bottom line here is that we need an api that is capable of collecting fine grained parts of the scorer tree. The only way I see doing this is 1. have a subscribe / register method and 2. do this subscription during scorer creation. Once we have this we can implement very simple collect methods that only collect positions for the current match like in a near query, while the current matching document is collected all contributing TermScorers have their positioninterval ready for collection. The collect method can then be called from the consumer instead of in the loop this way we only get the positions we need since we know the document we are collecting. I *think* it's necessary to have both a callback from within the scoring loop, and a mechanism for iterating over the current state of the iterator. For boolean queries, the positions will never be iterated in the scoring loop (all you care about is the frequencies, positions are ignored), so some new process: either the position collector (highlighter, say), or a loop in the scorer that knows positions are being consumed (needsPositions==true) has to cause the iteration to be performed. But for position-aware queries (like phrases), the scorer *will* iterate over positions, and in order to score properly, I think the Scorer has to drive the iteration? I tried a few different approaches at this before deciding to just push the iteration into the Scorer, but none of them really worked properly. Let's say, for example that a document is collected. 
Then the position consumer comes in to find out what positions were matched - it may already too late, because during scoring, some of the positions may have been consumed (eg to score phrases)? It's possible I may be suffering from some delusion, though :) But if I'm right, then it means there has to be some sort of callback mechanism in place *during scoring*, or else we have to resign ourselves to scoring first, and then re-setting and iterating positions in a second pass. I actually think that if we follow through with the registration-during-construction idea, we can have the tests done in an efficient way during scoring (with final boolean properties of the scorers), and it can be OK to have them in the scoring loop. Allow Scorer to expose positions and payloads aka. nuke spans -- Key: LUCENE-2878 URL: https://issues.apache.org/jira/browse/LUCENE-2878 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: Bulk Postings branch Reporter: Simon Willnauer Assignee: Simon Willnauer Labels: gsoc2011, lucene-gsoc-11, mentor Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, PosHighlighter.patch, PosHighlighter.patch Currently we have two somewhat separate types of queries, the one which can make use of positions (mainly spans) and payloads (spans). Yet Span*Query doesn't really do scoring comparable to what other queries do and at the end of the day they are duplicating lot of code all over lucene. Span*Queries are also limited to other Span*Query instances such that you can not use a TermQuery or a BooleanQuery with SpanNear or anthing like that. Beside of the Span*Query limitation other queries lacking a quiet interesting feature since they can not score based on term proximity since scores doesn't expose any positional information. 
All those problems bugged me for a while now so I stared working on that using the bulkpostings API. I would have done that first cut on trunk but TermScorer is working on BlockReader that do not expose positions while the one in this branch does. I started adding a new Positions class which users can pull from a scorer, to prevent unnecessary positions enums I added ScorerContext#needsPositions and eventually Scorere#needsPayloads to create the corresponding enum on demand. Yet, currently only TermQuery / TermScorer implements this API and other simply return null instead. To show that the API really works and our BulkPostings work fine too with positions I cut over TermSpanQuery to use a TermScorer under the hood and nuked TermSpans entirely. A nice sideeffect of this was that the Position BulkReading implementation got some exercise which now :) work all with
[JENKINS] Lucene-Solr-tests-only-3.x - Build # 9512 - Still Failing
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/9512/

No tests ran.

Build Log (for compile errors):
[...truncated 3549 lines...]
[javac] NamedList<NamedList> fieldTypes = result.get("field_types");
[javac]                                             ^
[javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:133: warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList>
[javac] NamedList<NamedList> textType = fieldTypes.get("text");
[javac]                                            ^
[javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:136: warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>>
[javac] NamedList<List<NamedList>> indexPart = textType.get("index");
[javac]                                                 ^
[javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:201: warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>>
[javac] NamedList<List<NamedList>> queryPart = textType.get("query");
[javac]                                                 ^
[javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:230: warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList>
[javac] NamedList<NamedList> nameTextType = fieldTypes.get("nametext");
[javac]                                                ^
[javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:233: warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>>
[javac] indexPart = nameTextType.get("index");
[javac]                          ^
[javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:250: warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>>
[javac] queryPart = nameTextType.get("query");
[javac]                          ^
[javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:256: warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList>
[javac] NamedList<NamedList> fieldNames = result.get("field_names");
[javac]                                             ^
[javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:259: warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList>
[javac] NamedList<NamedList> whitetok = fieldNames.get("whitetok");
[javac]                                            ^
[javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:262: warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>>
[javac] indexPart = whitetok.get("index");
[javac]                      ^
[javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:279: warning: [unchecked]
RE: [JENKINS] Lucene-Solr-tests-only-3.x - Build # 9512 - Still Failing
More @Override annotations - I've again committed the fix (removing the annotations):

    [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/analysis/TestSynonymMap.java:274: method does not override a method from its superclass
    [javac]     @Override
    [javac]     ^
    [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/analysis/TestSynonymMap.java:284: method does not override a method from its superclass
    [javac]     @Override
    [javac]     ^
    [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/analysis/TestSynonymMap.java:289: method does not override a method from its superclass
    [javac]     @Override
    [javac]     ^

-----Original Message-----
From: Apache Jenkins Server [mailto:jenk...@builds.apache.org]
Sent: Monday, July 11, 2011 4:11 PM
To: dev@lucene.apache.org
Subject: [JENKINS] Lucene-Solr-tests-only-3.x - Build # 9512 - Still Failing

Build: https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/9512/

No tests ran.

Build Log (for compile errors):
[...truncated 3549 lines...]
[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import
[ https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063511#comment-13063511 ] Steven Rowe commented on SOLR-2551: --- The [Lucene-Solr-tests-only-trunk Jenkins job|https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/] has run only once since the DIH tests were made to run sequentially (https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9500/), so I'll delay closing this issue until it's successfully run 15 or 20 more times, which should take less than one day.

Check dataimport.properties for write access before starting import --- Key: SOLR-2551 URL: https://issues.apache.org/jira/browse/SOLR-2551 Project: Solr Issue Type: Improvement Components: contrib - DataImportHandler Affects Versions: 1.4.1, 3.1 Reporter: C S Assignee: Shalin Shekhar Mangar Priority: Minor Fix For: 3.3, 4.0 Attachments: SOLR-2551.patch

A common mistake is that the /conf directory (or the dataimport.properties file itself) is not writable for Solr. It would be great if that were detected on starting a dataimport job. Currently an import might grind away for days and then fail because it can't write its timestamp to the dataimport.properties file. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
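The pre-check proposed in this issue can be as simple as testing the properties file (or, if it does not exist yet, its parent directory) for write access before the import begins. A minimal sketch in plain Java - `canWriteProperties` is a hypothetical helper for illustration, not DIH code:

```java
import java.io.File;
import java.io.IOException;

public class WritableCheck {
    // True if the properties file can be written: either the file itself is
    // writable, or it does not exist yet but its parent directory allows
    // creating files in it.
    static boolean canWriteProperties(File propsFile) {
        if (propsFile.exists()) {
            return propsFile.canWrite();
        }
        File dir = propsFile.getAbsoluteFile().getParentFile();
        return dir != null && dir.exists() && dir.canWrite();
    }

    public static void main(String[] args) throws IOException {
        File ok = File.createTempFile("dataimport", ".properties");
        ok.deleteOnExit();
        System.out.println(canWriteProperties(ok));
        System.out.println(canWriteProperties(new File("/no-such-dir/dataimport.properties")));
    }
}
```

Running this check once at import start, and failing fast with a clear error, avoids the days-long-import-then-fail scenario described above.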
[jira] [Created] (LUCENE-3304) Allow WeightedSpanTermExtractor to collect positions for TermQuerys
Allow WeightedSpanTermExtractor to collect positions for TermQuerys --- Key: LUCENE-3304 URL: https://issues.apache.org/jira/browse/LUCENE-3304 Project: Lucene - Java Issue Type: Improvement Components: modules/highlighter Affects Versions: 3.3 Reporter: Jahangir Anwari Priority: Trivial

Spinoff from this thread: http://www.gossamer-threads.com/lists/lucene/java-user/129668 Currently WeightedSpanTermExtractor only collects positions for position-sensitive queries. Allowing WeightedSpanTermExtractor to store positions for TermQuery would enable WeightedSpanTermExtractor to be used outside the highlighter, in custom plugins, to get position information. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3287) Allow ability to set maxDocCharsToAnalyze in WeightedSpanTermExtractor
[ https://issues.apache.org/jira/browse/LUCENE-3287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jahangir Anwari updated LUCENE-3287: Description: Spinoff from this thread: http://www.gossamer-threads.com/lists/lucene/java-user/129668 In WeightedSpanTermExtractor the default maxDocCharsToAnalyze value is 0. This inhibits us from getting the weighted span terms in any custom code (e.g. the attached CustomHighlighter.java) that uses WeightedSpanTermExtractor. Currently the setMaxDocCharsToAnalyze() method is protected, which prevents us from setting maxDocCharsToAnalyze to a value greater than 0. Changing the method to public would give us the ability to set the maxDocCharsToAnalyze. was: In WeightedSpanTermExtractor the default maxDocCharsToAnalyze value is 0. This inhibits us from getting the weighted span terms in any custom code (e.g. the attached CustomHighlighter.java) that uses WeightedSpanTermExtractor. Currently the setMaxDocCharsToAnalyze() method is protected, which prevents us from setting maxDocCharsToAnalyze to a value greater than 0. Changing the method to public would give us the ability to set the maxDocCharsToAnalyze.

Allow ability to set maxDocCharsToAnalyze in WeightedSpanTermExtractor -- Key: LUCENE-3287 URL: https://issues.apache.org/jira/browse/LUCENE-3287 Project: Lucene - Java Issue Type: Improvement Components: modules/highlighter Affects Versions: 3.3 Reporter: Jahangir Anwari Priority: Trivial Attachments: CustomHighlighter.java, WeightedSpanTermExtractor.patch

Spinoff from this thread: http://www.gossamer-threads.com/lists/lucene/java-user/129668 In WeightedSpanTermExtractor the default maxDocCharsToAnalyze value is 0. This inhibits us from getting the weighted span terms in any custom code (e.g. the attached CustomHighlighter.java) that uses WeightedSpanTermExtractor. Currently the setMaxDocCharsToAnalyze() method is protected, which prevents us from setting maxDocCharsToAnalyze to a value greater than 0.
Changing the method to public would give us the ability to set the maxDocCharsToAnalyze. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2308: --- Attachment: LUCENE-2308-ltc.patch Small patch to fix LTC.newField to again randomly add in term vectors when they are disabled.

Separately specify a field's type - Key: LUCENE-2308 URL: https://issues.apache.org/jira/browse/LUCENE-2308 Project: Lucene - Java Issue Type: Improvement Components: core/index Reporter: Michael McCandless Assignee: Michael McCandless Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: LUCENE-2308-2.patch, LUCENE-2308-3.patch, LUCENE-2308-4.patch, LUCENE-2308-4.patch, LUCENE-2308-5.patch, LUCENE-2308-6.patch, LUCENE-2308-7.patch, LUCENE-2308-8.patch, LUCENE-2308-9.patch, LUCENE-2308-ltc.patch, LUCENE-2308.patch, LUCENE-2308.patch

This came up from discussions on IRC. I'm summarizing here... Today when you make a Field to add to a document you can set things like indexed or not, stored or not, analyzed or not, details like omitTfAP, omitNorms, index term vectors (separately controlling offsets/positions), etc. I think we should factor these out into a new class (FieldType?). Then you could re-use this FieldType instance across multiple fields. The Field instance would still hold the actual value. We could then do per-field analyzers by adding a setAnalyzer on the FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise for per-field codecs (with flex), where we now have PerFieldCodecWrapper). This would NOT be a schema! It's just refactoring what we already specify today. EG it's not serialized into the index. This has been discussed before, and I know Michael Busch opened a more ambitious (I think?) issue. I think this is a good first baby step. We could consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold off on that for starters... -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
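The refactoring described in this issue - moving indexing options off Field into a reusable type object - can be pictured with a self-contained sketch. Class and field names here are illustrative only, not the final Lucene API:

```java
import java.util.ArrayList;
import java.util.List;

public class FieldTypeSketch {
    // Immutable bundle of per-field indexing options, shared across fields.
    static final class FieldType {
        final boolean indexed, stored, tokenized, storeTermVectors, omitNorms;
        FieldType(boolean indexed, boolean stored, boolean tokenized,
                  boolean storeTermVectors, boolean omitNorms) {
            this.indexed = indexed; this.stored = stored; this.tokenized = tokenized;
            this.storeTermVectors = storeTermVectors; this.omitNorms = omitNorms;
        }
    }

    // A field holds only its name, value, and a reference to a shared type.
    static final class Field {
        final String name, value;
        final FieldType type;
        Field(String name, String value, FieldType type) {
            this.name = name; this.value = value; this.type = type;
        }
    }

    public static void main(String[] args) {
        // One type instance reused by many fields; nothing is serialized
        // into the index, so this is configuration reuse, not a schema.
        FieldType bodyType = new FieldType(true, false, true, true, false);
        List<Field> doc = new ArrayList<Field>();
        doc.add(new Field("title", "some text", bodyType));
        doc.add(new Field("body", "more text", bodyType));
        System.out.println(doc.get(0).type == doc.get(1).type);
    }
}
```

The point of the design is the identity sharing: many Field instances point at one FieldType, so per-field settings (and, later, per-field analyzers or codecs) hang off the type rather than being copied into every field.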
[jira] [Commented] (LUCENE-1768) NumericRange support for new query parser
[ https://issues.apache.org/jira/browse/LUCENE-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063555#comment-13063555 ] Uwe Schindler commented on LUCENE-1768: --- Vinicius, do you have any plans about backporting the stuff to Lucene 3.x - it should not be that hard :-)

bq. I am not sure about numeric support, Vinicius changed TermRangeQueryNode inheritance, which breaks the backwards compatibility. I am not saying the change is bad, I agree with the new structure, however Vinicius will need to find another solution before backporting it to 3.x.

I am not sure if this is really a break when you change inheritance. If code still compiles, it's no break; if classes were renamed, it's more serious. I am not sure if implementation classes (and -names) should be covered by the backwards compatibility. In my opinion, mainly the configuration and interfaces of the QP must be covered by the backwards policy. As we are now at mid-time, it would be a good idea to maybe add some extra syntax support for numerics, like < and > ? We should also add tests/support for half-open ranges, so syntax like [* TO 1.0] should also be supported (I am not sure if TermRangeQueryNode supports this, but numerics should do this in all cases) - the above syntax is also printed out by NumericRangeQuery.toString(), if one of the bounds is null. The latter could be easily implemented by checking for * as input to the range bounds and mapping those special values to NULL. Adding support for < and > (also <=, >=) needs knowledge of the JavaCC parser language. Vinicius, have you ever worked with JavaCC, so do you think you will be able to extend the syntax? 
NumericRange support for new query parser - Key: LUCENE-1768 URL: https://issues.apache.org/jira/browse/LUCENE-1768 Project: Lucene - Java Issue Type: New Feature Components: core/queryparser Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Uwe Schindler Labels: contrib, gsoc, gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: week-7.patch, week1.patch, week2.patch, week3.patch, week4.patch, week5-6.patch

It would be good to specify some type of schema for the query parser in the future, to automatically create NumericRangeQuery for different numeric types? It would then be possible to index a numeric value (double, float, long, int) using NumericField, and then the query parser knows which type of field this is and so it correctly creates a NumericRangeQuery for strings like [1.567..*] or (1.787..19.5]. There is currently no way to extract if a field is numeric from the index, so the user will have to configure the FieldConfig objects in the ConfigHandler. But if this is done, it will not be that difficult to implement the rest. The only difference from the current handling of RangeQuery is then the instantiation of the correct Query type and conversion of the entered numeric values (simple Number.valueOf(...) cast of the user-entered numbers). Everything else is identical, NumericRangeQuery also supports the MTQ rewrite modes (as it is a MTQ). Another thing is a change in Date semantics. There are some strange flags in the current parser that tell it how to handle dates. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
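The half-open-range idea discussed in this thread - mapping a `*` bound in `[* TO 1.0]` to a null bound - can be sketched without any Lucene dependency. The names below (`RangeBounds`, `parse`) are illustrative; the real query parser would hand the resulting nullable bounds to NumericRangeQuery, whose factory methods accept null for an open end:

```java
public class RangeBounds {
    final Double min, max;            // null = open bound, as in NumericRangeQuery
    final boolean minInclusive, maxInclusive;

    RangeBounds(Double min, Double max, boolean minInc, boolean maxInc) {
        this.min = min; this.max = max;
        this.minInclusive = minInc; this.maxInclusive = maxInc;
    }

    // Parses "[lo TO hi]", "(lo TO hi]", etc., mapping "*" to a null bound.
    static RangeBounds parse(String s) {
        s = s.trim();
        boolean minInc = s.charAt(0) == '[';            // '[' inclusive, '(' exclusive
        boolean maxInc = s.charAt(s.length() - 1) == ']';
        String[] parts = s.substring(1, s.length() - 1).split("\\s+TO\\s+");
        Double lo = parts[0].equals("*") ? null : Double.valueOf(parts[0]);
        Double hi = parts[1].equals("*") ? null : Double.valueOf(parts[1]);
        return new RangeBounds(lo, hi, minInc, maxInc);
    }

    public static void main(String[] args) {
        RangeBounds r = parse("[* TO 1.0]");
        System.out.println(r.min == null);   // open lower bound
        System.out.println(r.max);
    }
}
```

This is exactly the "check for * and map to NULL" step described above; the inclusive/exclusive flags carry through unchanged.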
[jira] [Commented] (LUCENE-1768) NumericRange support for new query parser
[ https://issues.apache.org/jira/browse/LUCENE-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063570#comment-13063570 ] Adriano Crestani commented on LUCENE-1768: -- {quote} I am not sure if this is really a break when you change inheritance. If code still compiles, it's no break; if classes were renamed, it's more serious. I am not sure if implementation classes (and -names) should be covered by the backwards compatibility. In my opinion, mainly the configuration and interfaces of the QP must be covered by the backwards policy. {quote} I didn't see any class renaming, I need to double check Vinicius's patches. But he did change the query node inheritance, which may affect how processors and builders (especially QueryNodeTreeBuilder) work. I am not saying it is not possible to implement his approach on 3.x, but he will need to deal differently with the query node classes he created. As I said before, what he did is good and clean, I like the way it is, but it will break someone's code if pushed to 3.x. So if you ask me whether to push it to 3.x, I say YES, just make sure not to break the query node structure that people may be relying on. {quote} As we are now at mid-time, it would be a good idea to maybe add some extra syntax support for numerics, like < and > ? We should also add tests/support for half-open ranges, so syntax like [* TO 1.0] should also be supported (I am not sure if TermRangeQueryNode supports this, but numerics should do this in all cases) - the above syntax is also printed out by NumericRangeQuery.toString(), if one of the bounds is null. The latter could be easily implemented by checking for * as input to the range bounds and mapping those special values to NULL. Adding support for < and > (also <=, >=) needs knowledge of the JavaCC parser language. Vinicius, have you ever worked with JavaCC, so do you think you will be able to extend the syntax? 
{quote} I still need to investigate the bugs Vinicius reported (a JIRA should have been created for that already), I never really tried open ranges in the contrib QP. And if Vinicius thinks he will have the time and skills to do the JavaCC change to support those new operators, go for it! And remember Vinicius, you don't need to do everything during GSoC, you are always welcome to contribute code whenever you want :)

NumericRange support for new query parser - Key: LUCENE-1768 URL: https://issues.apache.org/jira/browse/LUCENE-1768 Project: Lucene - Java Issue Type: New Feature Components: core/queryparser Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Uwe Schindler Labels: contrib, gsoc, gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: week-7.patch, week1.patch, week2.patch, week3.patch, week4.patch, week5-6.patch -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2048) Omit positions but keep termFreq
[ https://issues.apache.org/jira/browse/LUCENE-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2048: Attachment: LUCENE-2048.patch OK, here's an updated patch. I think it's ready to commit!

Omit positions but keep termFreq Key: LUCENE-2048 URL: https://issues.apache.org/jira/browse/LUCENE-2048 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.1 Reporter: Andrzej Bialecki Assignee: Robert Muir Fix For: 4.0 Attachments: LUCENE-2048.patch, LUCENE-2048.patch

It would be useful to have an option to discard positional information but still keep the term frequency - currently setOmitTermFreqAndPositions discards both. Even though position-dependent queries wouldn't work in such a case, any other queries would still work fine and we would get the right scoring. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
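The tradeoff this issue adds - keep term frequency, drop positions - can be pictured with a toy postings entry: frequency alone is enough for TF-based scoring, while a position-dependent query has nothing to work with. All names below are illustrative stand-ins, not Lucene internals:

```java
import java.util.List;

public class OmitPositionsSketch {
    // A toy posting: doc id and term frequency, with positions optionally absent.
    static final class Posting {
        final int doc, freq;
        final List<Integer> positions;   // null when positions were omitted
        Posting(int doc, int freq, List<Integer> positions) {
            this.doc = doc; this.freq = freq; this.positions = positions;
        }
        // Classic TF component of a score: needs only freq, never positions.
        double tfScore() { return Math.sqrt(freq); }
        // Phrase/span queries need positions, so they can't run on this posting.
        boolean supportsPositionQueries() { return positions != null; }
    }

    public static void main(String[] args) {
        Posting p = new Posting(42, 4, null);   // freq kept, positions omitted
        System.out.println(p.tfScore());
        System.out.println(p.supportsPositionQueries());
    }
}
```

That is the whole point of the option: correct scoring for non-positional queries at a fraction of the index size.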
[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans
[ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063582#comment-13063582 ] Robert Muir commented on LUCENE-2878: - {quote} But if I'm right, then it means there has to be some sort of callback mechanism in place during scoring, or else we have to resign ourselves to scoring first, and then re-setting and iterating positions in a second pass. {quote} But I think this is what we want? If there are 10 million documents that match a query, but our priority queue size is 20 (1 page), we only want to do the expensive highlighting on those 20 documents. It's the same for the positional scoring: it's too expensive to look at positions for all documents, so you re-order maybe the top 100 or so. Or maybe I'm totally confused by the comments!

Allow Scorer to expose positions and payloads aka. nuke spans -- Key: LUCENE-2878 URL: https://issues.apache.org/jira/browse/LUCENE-2878 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: Bulk Postings branch Reporter: Simon Willnauer Assignee: Simon Willnauer Labels: gsoc2011, lucene-gsoc-11, mentor Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, PosHighlighter.patch, PosHighlighter.patch

Currently we have two somewhat separate types of queries, the ones which can make use of positions (mainly spans) and payloads (spans). Yet Span*Query doesn't really do scoring comparable to what other queries do, and at the end of the day they are duplicating a lot of code all over Lucene. Span*Queries are also limited to other Span*Query instances, such that you can not use a TermQuery or a BooleanQuery with SpanNear or anything like that. Besides the Span*Query limitation, other queries lack quite an interesting feature: they cannot score based on term proximity, since scorers don't expose any positional information. All those problems bugged me for a while now, so I started working on that using the bulkpostings API. I would have done that first cut on trunk, but TermScorer there works on a BlockReader that does not expose positions, while the one in this branch does. I started adding a new Positions class which users can pull from a scorer; to prevent unnecessary positions enums I added ScorerContext#needsPositions and eventually Scorer#needsPayloads to create the corresponding enum on demand. Yet, currently only TermQuery / TermScorer implements this API and others simply return null instead. To show that the API really works and our BulkPostings work fine with positions too, I cut over TermSpanQuery to use a TermScorer under the hood and nuked TermSpans entirely. A nice side effect of this was that the Position BulkReading implementation got some exercise, which now all works with positions :), while Payloads for bulk reading are kind of experimental in the patch and only work with the Standard codec. So all spans now work on top of TermScorer (I truly hate spans since today), including the ones that need Payloads (StandardCodec ONLY)!! I didn't bother to implement the other codecs yet since I want to get feedback on the API and on this first cut before I go on with it. I will upload the corresponding patch in a minute. I also had to cut over SpanQuery.getSpans(IR) to SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk first, but after that pain today I need a break first :). The patch passes all core tests (org.apache.lucene.search.highlight.HighlighterTest still fails but I didn't look into the MemoryIndex BulkPostings API yet) -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans
[ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063595#comment-13063595 ] Mike Sokolov commented on LUCENE-2878: -- bq. But I think this is what we want? If there are 10 million documents that match a query, but our priority queue size is 20 (1 page), we only want to do the expensive highlighting on those 20 documents.

Yes - the comments may be getting lost in the weeds a bit here; sorry. I've been assuming you'd search once to collect documents, and then search again with the same query plus a constraint to limit to the gathered docids, with an indication that positions are required - this pushes you towards some sort of collector-style callback API. Maybe life would be simpler if instead you could just say getPositionIterator(docid, query). But that would force you actually into a third pass (I think), if you wanted positional scoring too, wouldn't it?

Allow Scorer to expose positions and payloads aka. nuke spans -- Key: LUCENE-2878 URL: https://issues.apache.org/jira/browse/LUCENE-2878 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: Bulk Postings branch Reporter: Simon Willnauer Assignee: Simon Willnauer Labels: gsoc2011, lucene-gsoc-11, mentor Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, PosHighlighter.patch, PosHighlighter.patch -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
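The two-pass idea in this exchange - collect cheaply first, then visit positions only for a small top-N - can be sketched in plain Java: pass 1 fills a bounded priority queue by a cheap score, and only the survivors would get the expensive position-based rescoring. The scoring arrays and names here are stand-ins for illustration, not Lucene's:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TwoPassRerank {
    // Pass 1: keep only the topN docs by the cheap score (no positions touched).
    static List<Integer> topByCheapScore(int numDocs, int topN, double[] cheapScore) {
        // Min-heap ordered by cheap score: the root is the current worst keeper.
        PriorityQueue<Integer> pq =
            new PriorityQueue<>(topN, Comparator.comparingDouble(d -> cheapScore[d]));
        for (int doc = 0; doc < numDocs; doc++) {
            pq.offer(doc);
            if (pq.size() > topN) pq.poll();   // evict the current worst
        }
        return new ArrayList<>(pq);
    }

    public static void main(String[] args) {
        double[] cheap = {0.1, 0.9, 0.4, 0.8, 0.2};
        List<Integer> top = topByCheapScore(cheap.length, 2, cheap);
        // Pass 2 would iterate positions only for these few docs and re-sort;
        // a third pass could then highlight the final page-1 set.
        System.out.println(top.contains(1) && top.contains(3));
    }
}
```

The cost argument from the thread falls out directly: positions are read for topN documents instead of all matches, which is why reranking the top 100 is cheap even when 10 million documents match.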
[jira] [Updated] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Tankovic updated LUCENE-2308: Attachment: LUCENE-2308-10.patch Solr cutover to FieldType. Getting repeated similar errors in tests; trying to debug. Help is appreciated :)

Separately specify a field's type - Key: LUCENE-2308 URL: https://issues.apache.org/jira/browse/LUCENE-2308 Project: Lucene - Java Issue Type: Improvement Components: core/index Reporter: Michael McCandless Assignee: Michael McCandless Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: LUCENE-2308-10.patch, LUCENE-2308-2.patch, LUCENE-2308-3.patch, LUCENE-2308-4.patch, LUCENE-2308-4.patch, LUCENE-2308-5.patch, LUCENE-2308-6.patch, LUCENE-2308-7.patch, LUCENE-2308-8.patch, LUCENE-2308-9.patch, LUCENE-2308-ltc.patch, LUCENE-2308.patch, LUCENE-2308.patch -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1768) NumericRange support for new query parser
[ https://issues.apache.org/jira/browse/LUCENE-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063607#comment-13063607 ] Vinicius Barros commented on LUCENE-1768: - Thanks for committing the patch, Uwe! I will review the code again looking for switch statements without a default case and fix them. I have never done anything with javacc; I just quickly looked at the code and it does not seem complicated. However, I have no idea how complex it is to run javacc and regenerate the java files. Does the lucene ant script do that automatically? I can try to fix open range queries on the contrib query parser, add =-like operators, or backport numeric support to 3.x. Just let me know the priorities and I will work on them. My suggestion is that the bug in open range queries is the most critical now, so I could start working on that. Your call, Uwe. NumericRange support for new query parser - Key: LUCENE-1768 URL: https://issues.apache.org/jira/browse/LUCENE-1768 Project: Lucene - Java Issue Type: New Feature Components: core/queryparser Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Uwe Schindler Labels: contrib, gsoc, gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: week-7.patch, week1.patch, week2.patch, week3.patch, week4.patch, week5-6.patch It would be good to specify some type of schema for the query parser in the future, to automatically create NumericRangeQuery for different numeric types. It would then be possible to index a numeric value (double, float, long, int) using NumericField; the query parser would then know which type the field is and correctly create a NumericRangeQuery for strings like [1.567..*] or (1.787..19.5]. There is currently no way to tell from the index whether a field is numeric, so the user will have to configure the FieldConfig objects in the ConfigHandler. But once this is done, it will not be that difficult to implement the rest.
The only difference from the current handling of RangeQuery is then the instantiation of the correct Query type and the conversion of the entered numeric values (a simple Number.valueOf(...) conversion of the user-entered numbers). Everything else is identical; NumericRangeQuery also supports the MTQ rewrite modes (as it is a MTQ). Another thing is a change in Date semantics: there are some strange flags in the current parser that tell it how to handle dates. -- This message is automatically generated by JIRA.
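The conversion step Uwe describes, parsing the entered bounds and checking open/closed brackets before instantiating the right Query type, can be sketched roughly like this. This is a standalone illustration, not the actual contrib query parser code; the class and field names are invented:

```java
// Standalone sketch of parsing a numeric range expression such as
// "[1.567..*]" or "(1.787..19.5]" into bounds, roughly as the query
// parser would before building a NumericRangeQuery. Hypothetical code,
// not the Lucene contrib query parser implementation.
public class NumericRangeParser {
    public static final class Range {
        public final Double lower, upper;              // null means open-ended ("*")
        public final boolean includeLower, includeUpper;
        Range(Double lo, Double hi, boolean incLo, boolean incHi) {
            lower = lo; upper = hi; includeLower = incLo; includeUpper = incHi;
        }
    }

    public static Range parse(String s) {
        s = s.trim();
        boolean incLo = s.charAt(0) == '[';            // '[' = inclusive, '(' = exclusive
        boolean incHi = s.charAt(s.length() - 1) == ']';
        String body = s.substring(1, s.length() - 1);
        int sep = body.indexOf("..");
        String lo = body.substring(0, sep), hi = body.substring(sep + 2);
        // Number.valueOf-style conversion of the user-entered numbers
        Double lower = lo.equals("*") ? null : Double.valueOf(lo);
        Double upper = hi.equals("*") ? null : Double.valueOf(hi);
        return new Range(lower, upper, incLo, incHi);
    }
}
```

From such a Range, the parser would then pick the numeric type from the configured FieldConfig and build the corresponding NumericRangeQuery with the matching inclusivity flags.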
[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans
[ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063610#comment-13063610 ] Robert Muir commented on LUCENE-2878: - {quote} But that would force you actually into a third pass (I think), if you wanted positional scoring too, wouldn't it? {quote} I think that's ok, because the two things are different: * in general I think you want to rerank more than just page 1 with scoring, e.g. maybe 100 or even 1000 documents versus the 20 that highlighting needs. * for scoring, we need to adjust our PQ, resulting in a (possibly) different set of page 1 documents for the highlighting process, so if we are doing both algorithms, we still don't yet know what to highlight anyway. * if we assume we are going to add offsets (optionally) to our postings lists in parallel with the positions, that's another difference: scoring doesn't care about offsets, but highlighting needs them. Allow Scorer to expose positions and payloads aka. nuke spans -- Key: LUCENE-2878 URL: https://issues.apache.org/jira/browse/LUCENE-2878 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: Bulk Postings branch Reporter: Simon Willnauer Assignee: Simon Willnauer Labels: gsoc2011, lucene-gsoc-11, mentor Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, PosHighlighter.patch, PosHighlighter.patch Currently we have two somewhat separate types of queries: the ones which can make use of positions (mainly spans) and payloads (spans). Yet Span*Query doesn't really do scoring comparable to what other queries do, and at the end of the day they duplicate a lot of code all over lucene. Span*Queries are also limited to other Span*Query instances, such that you can not use a TermQuery or a BooleanQuery with SpanNear or anything like that.
Besides the Span*Query limitation, other queries lack a quite interesting feature: they can not score based on term proximity, since scorers don't expose any positional information. All those problems bugged me for a while now, so I started working on this using the bulkpostings API. I would have done the first cut on trunk, but TermScorer there works on a BlockReader that does not expose positions, while the one in this branch does. I started adding a new Positions class which users can pull from a scorer; to prevent unnecessary positions enums I added ScorerContext#needsPositions and eventually Scorere#needsPayloads to create the corresponding enum on demand. Yet, currently only TermQuery / TermScorer implements this API and the others simply return null instead. To show that the API really works, and that our BulkPostings work fine with positions too, I cut over TermSpanQuery to use a TermScorer under the hood and nuked TermSpans entirely. A nice side effect of this was that the Position BulkReading implementation got some exercise and now :) works with positions, while Payloads for bulk reading are kind of experimental in the patch and only work with the Standard codec. So all spans now work on top of TermScorer (I truly hate spans since today), including the ones that need Payloads (StandardCodec ONLY)!! I didn't bother to implement the other codecs yet since I want to get feedback on the API and on this first cut before I go on with it. I will upload the corresponding patch in a minute. I also had to cut over SpanQuery.getSpans(IR) to SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk first, but after that pain today I need a break first :). The patch passes all core tests (org.apache.lucene.search.highlight.HighlighterTest still fails, but I didn't look into the MemoryIndex BulkPostings API yet). -- This message is automatically generated by JIRA.
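Robert's two-stage flow above can be sketched as follows. This is a standalone illustration with made-up score arrays, not Lucene code: the top R candidates are reranked with a positional contribution added, and only then are the top H picked for highlighting, so the reranked page 1 can differ from the original one.

```java
import java.util.Arrays;

// Standalone sketch of the rerank-then-highlight flow discussed above
// (illustrative, not Lucene code): stage 1 orders documents by base score,
// stage 2 reranks the top r by base + positional score, and only then are
// the top h chosen for highlighting.
public class RerankSketch {
    // returns doc ids of the top h documents after positional reranking of the top r
    public static int[] rerankThenTop(double[] baseScore, double[] posScore, int r, int h) {
        Integer[] docs = new Integer[baseScore.length];
        for (int i = 0; i < docs.length; i++) docs[i] = i;
        // stage 1: order by base score, keep the top r candidates
        Arrays.sort(docs, (a, b) -> Double.compare(baseScore[b], baseScore[a]));
        Integer[] top = Arrays.copyOf(docs, Math.min(r, docs.length));
        // stage 2: rerank the candidates with the positional contribution added
        Arrays.sort(top, (a, b) -> Double.compare(
                baseScore[b] + posScore[b], baseScore[a] + posScore[a]));
        int[] result = new int[Math.min(h, top.length)];
        for (int i = 0; i < result.length; i++) result[i] = top[i];
        return result;
    }
}
```

With baseScore {1.0, 0.9, 0.8} and posScore {0.0, 0.3, 0.0}, document 1 overtakes document 0 after reranking, which is exactly why the highlighting set is not known until the positional pass has run.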
[jira] [Commented] (LUCENE-2048) Omit positions but keep termFreq
[ https://issues.apache.org/jira/browse/LUCENE-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063612#comment-13063612 ] Michael McCandless commented on LUCENE-2048: Looks great! +1 to commit. Omit positions but keep termFreq Key: LUCENE-2048 URL: https://issues.apache.org/jira/browse/LUCENE-2048 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.1 Reporter: Andrzej Bialecki Assignee: Robert Muir Fix For: 4.0 Attachments: LUCENE-2048.patch, LUCENE-2048.patch It would be useful to have an option to discard positional information but still keep the term frequency - currently setOmitTermFreqAndPositions discards both. Even though position-dependent queries wouldn't work in such a case, all other queries would still work fine and we would get the right scoring. -- This message is automatically generated by JIRA.
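To illustrate why this middle option matters, here is a standalone sketch (not Lucene code; the class and method names are invented): a postings list without positions can still carry per-document term frequencies, which is all a tf-based score needs, whereas phrase and proximity queries would additionally require the positions that are absent here.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone illustration (not Lucene code): with positions omitted, a
// postings list can still record a per-document term frequency, which is
// all tf-based scoring needs. Phrase/proximity queries would additionally
// need positions, which this structure does not store.
public class FreqOnlyPostings {
    // term -> (docId -> term frequency); no positional data stored
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    public void add(String term, int doc, int freq) {
        postings.computeIfAbsent(term, t -> new HashMap<>()).put(doc, freq);
    }

    // simple tf scoring: sqrt(tf), the shape classic Lucene's default Similarity uses for tf
    public double tfScore(String term, int doc) {
        Map<Integer, Integer> docs = postings.get(term);
        if (docs == null || !docs.containsKey(doc)) return 0.0;
        return Math.sqrt(docs.get(doc));
    }
}
```

Under setOmitTermFreqAndPositions, only the doc ids would survive and every tf would effectively be 1; this issue adds the option of keeping the frequencies while still saving the space the positions would take.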
[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans
[ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063614#comment-13063614 ] Grant Ingersoll commented on LUCENE-2878: - FWIW, I do think there are use cases where one wants positions over all hits (or most such that you might as well do all), so if it doesn't cause problems for the main use case, it would be nice to support it. In fact, in these scenarios, you usually care less about the PQ and more about the positions.
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063618#comment-13063618 ] Michael McCandless commented on LUCENE-2308: Nikola tracked this down -- it's because we're not reading numeric fields back properly from stored fields. Separately specify a field's type - Key: LUCENE-2308 URL: https://issues.apache.org/jira/browse/LUCENE-2308 Project: Lucene - Java Issue Type: Improvement Components: core/index Reporter: Michael McCandless Assignee: Michael McCandless Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: LUCENE-2308-10.patch, LUCENE-2308-2.patch, LUCENE-2308-3.patch, LUCENE-2308-4.patch, LUCENE-2308-4.patch, LUCENE-2308-5.patch, LUCENE-2308-6.patch, LUCENE-2308-7.patch, LUCENE-2308-8.patch, LUCENE-2308-9.patch, LUCENE-2308-ltc.patch, LUCENE-2308.patch, LUCENE-2308.patch This came up from discussions on IRC. I'm summarizing here... Today when you make a Field to add to a document you can set things like indexed or not, stored or not, analyzed or not, details like omitTfAP, omitNorms, index term vectors (separately controlling offsets/positions), etc. I think we should factor these out into a new class (FieldType?). Then you could re-use this FieldType instance across multiple fields. The Field instance would still hold the actual value. We could then do per-field analyzers by adding a setAnalyzer on the FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise for per-field codecs (with flex), where we now have PerFieldCodecWrapper). This would NOT be a schema! It's just refactoring what we already specify today. EG it's not serialized into the index. This has been discussed before, and I know Michael Busch opened a more ambitious (I think?) issue. I think this is a good first baby step. We could consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold off on that for starters... -- This message is automatically generated by JIRA.
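The refactoring Michael describes, moving the per-field flags out of Field into a reusable type object, can be sketched as follows. This is a conceptual sketch with illustrative names, not the API that was eventually committed to Lucene:

```java
// Conceptual sketch of the proposed refactoring (illustrative names, not
// the committed Lucene API): index-time flags move from Field into a
// reusable FieldType, and Field keeps only the name, the value, and a
// reference to a shared FieldType instance.
public class FieldTypeSketch {
    public static final class FieldType {
        public boolean indexed, stored, tokenized, omitNorms, storeTermVectors;
    }
    public static final class Field {
        public final String name, value;
        public final FieldType type;       // shared across many Field instances
        public Field(String name, String value, FieldType type) {
            this.name = name; this.value = value; this.type = type;
        }
    }
}
```

The point of the design is that one FieldType instance configured once (indexed, tokenized, not stored, say) is reused for the same logical field across every document, rather than repeating the flags on every Field constructor call; per-field analyzers or codecs would then hang off that shared object instead of a wrapper class.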
[jira] [Commented] (LUCENE-3282) BlockJoinQuery: Allow to add a custom child collector, and customize the parent bitset extraction
[ https://issues.apache.org/jira/browse/LUCENE-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063619#comment-13063619 ] Shay Banon commented on LUCENE-3282: Heya, In my app, I have a wrapper around OBS with a common interface that allows accessing bits by index (similar to Bits in trunk), so I need to extract the OBS from it. Regarding the Collector, I will work on a CollectorProvider interface. I liked the NoOpCollector option since then you don't have to check for nulls each time... BlockJoinQuery: Allow to add a custom child collector, and customize the parent bitset extraction - Key: LUCENE-3282 URL: https://issues.apache.org/jira/browse/LUCENE-3282 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 3.4, 4.0 Reporter: Shay Banon Attachments: LUCENE-3282.patch It would be nice to allow adding a custom child collector to the BlockJoinQuery, to be called on every matching doc (so we can do things with it, like counts and such). Also, allow extending BlockJoinQuery with custom code that converts the filter bitset to an OpenBitSet. -- This message is automatically generated by JIRA.
[jira] [Updated] (LUCENE-3282) BlockJoinQuery: Allow to add a custom child collector, and customize the parent bitset extraction
[ https://issues.apache.org/jira/browse/LUCENE-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shay Banon updated LUCENE-3282: --- Attachment: LUCENE-3282.patch New version, with CollectorProvider.
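The provider-plus-no-op pattern Shay mentions can be sketched like this. These are hypothetical interfaces for illustration, not the join module's actual API: the query asks a provider for a per-segment child collector, and a no-op implementation means the hot loop never has to null-check before calling collect().

```java
// Sketch of the CollectorProvider / NoOpCollector idea discussed above
// (hypothetical interfaces, not the Lucene join module's API): the query
// obtains a child collector from a provider, and a no-op collector lets
// the matching loop call collect() unconditionally instead of checking
// for null on every hit.
public class ChildCollectorSketch {
    public interface ChildCollector {
        void collect(int childDoc);
    }
    public interface CollectorProvider {
        ChildCollector forSegment(int segmentBase);
    }
    // collecting is "disabled" by substituting this, never by passing null
    public static final ChildCollector NO_OP = doc -> { };

    public static int countMatches(int[] childDocs, ChildCollector collector) {
        int count = 0;
        for (int doc : childDocs) {
            collector.collect(doc);   // always safe to call, even when NO_OP
            count++;
        }
        return count;
    }
}
```

The design choice is the usual null-object trade-off: one virtual call per hit that the JIT can often inline, in exchange for a branch-free inner loop and no null handling at every call site.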
[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans
[ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063622#comment-13063622 ] Robert Muir commented on LUCENE-2878: - {quote} FWIW, I do think there are use cases where one wants positions over all hits (or most such that you might as well do all), so if it doesn't cause problems for the main use case, it would be nice to support it. In fact, in these scenarios, you usually care less about the PQ and more about the positions. {quote} I don't think this issue should try to solve that problem: if you are doing that, it sounds like you are using the wrong Query!
[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans
[ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063625#comment-13063625 ] Grant Ingersoll commented on LUCENE-2878: - bq. I don't think this issue should try to solve that problem: if you are doing that, it sounds like you are using the wrong Query! It's basically a boolean match on any arbitrary Query where you care about the positions. Pretty common in e-discovery and other areas. You have a query that tells you all the matches and you want to operate over the positions. Right now, it's a pain as you have to execute the query twice: once to get the scores and once to get the positions/spans. If you have a callback mechanism, one can do both at once.
[jira] [Updated] (LUCENE-2048) Omit positions but keep termFreq
[ https://issues.apache.org/jira/browse/LUCENE-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2048: Fix Version/s: 3.4
[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans
[ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063626#comment-13063626 ] Robert Muir commented on LUCENE-2878: - I don't understand the exact use case... it still sounds like the wrong query? What operating over the positions do you need to do?
[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans
[ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063630#comment-13063630 ] Grant Ingersoll commented on LUCENE-2878: - In the cases where I've both done this and seen it done, you often have an arbitrary query that matches X docs. You then want to know where exactly the matches occur, and then you often want to do something in a window around those matches. Right now, w/ Spans, you have to run the query once to get the scores and then run a second time to get the windows. The times I've seen it, the result is most often given to some downstream process that does deeper analysis of the window, so in these cases X can be quite large (1000's if not more). In those cases, some people care about the score, some do not. For instance, if one is analyzing all the words around the name of a company, your search term would be the company name and you want to iterate over all the positions where it matched, looking for other words near it (perhaps sentiment words or other things).
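The pattern Grant describes, locating each match position and examining a window of surrounding terms, can be sketched independently of Lucene like this. This is a standalone illustration over an in-memory token array; a real implementation would read the positions from the index (via Spans today, or the positions API proposed in this issue):

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the use case described above: find every position
// of a query term in a tokenized document and collect the words in a
// window around each match, e.g. for downstream sentiment analysis.
// Illustrative code; a real implementation would pull positions from the
// index rather than scanning tokens.
public class PositionWindows {
    public static List<List<String>> windows(String[] tokens, String term, int width) {
        List<List<String>> result = new ArrayList<>();
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].equals(term)) {
                int from = Math.max(0, pos - width);
                int to = Math.min(tokens.length, pos + width + 1);
                List<String> window = new ArrayList<>();
                for (int i = from; i < to; i++) {
                    if (i != pos) window.add(tokens[i]);   // surrounding words only
                }
                result.add(window);
            }
        }
        return result;
    }
}
```

A callback mechanism on the scorer, as discussed in this thread, would let this windowing run during the single scoring pass instead of requiring a second query execution to recover the positions.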
[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans
[ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063635#comment-13063635 ] Robert Muir commented on LUCENE-2878: - {quote} In those cases, some people care about the score, some do not. For instance, if one is analyzing all the words around the name of a company, your search term would be the company name and you want to iterate over all the positions where it matched, looking for other words near it {quote} Grant, I'm not sure this sounds like an inverted index is even the best data structure for what you describe. I just don't want us to confuse the issue with the nuking-of-spans/speeding-up-highlighting/enabling-positional-scoring use cases, which are core to search.

Allow Scorer to expose positions and payloads aka. nuke spans
--
Key: LUCENE-2878
URL: https://issues.apache.org/jira/browse/LUCENE-2878
Project: Lucene - Java
Issue Type: Improvement
Components: core/search
Affects Versions: Bulk Postings branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Labels: gsoc2011, lucene-gsoc-11, mentor
Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, PosHighlighter.patch, PosHighlighter.patch

Currently we have two somewhat separate types of queries: the ones which can make use of positions (mainly spans) and payloads (spans). Yet Span*Query doesn't really do scoring comparable to what other queries do, and at the end of the day they duplicate a lot of code all over Lucene. Span*Queries are also limited to other Span*Query instances, such that you cannot use a TermQuery or a BooleanQuery with SpanNear or anything like that. Besides the Span*Query limitation, other queries lack a quite interesting feature: they cannot score based on term proximity, since scorers don't expose any positional information.

All those problems bugged me for a while now, so I started working on this using the bulk postings API. I would have done the first cut on trunk, but TermScorer there works on BlockReaders that do not expose positions, while the one in this branch does. I started adding a new Positions class which users can pull from a scorer; to prevent unnecessary positions enums I added ScorerContext#needsPositions and eventually ScorerContext#needsPayloads to create the corresponding enum on demand. Yet, currently only TermQuery / TermScorer implements this API, and others simply return null instead. To show that the API really works, and that our BulkPostings work fine with positions too, I cut over TermSpanQuery to use a TermScorer under the hood and nuked TermSpans entirely. A nice side effect of this was that the Positions BulkReading implementation got some exercise, which now all works with positions :), while payloads for bulk reading are kind of experimental in the patch and only work with the Standard codec. So all spans now work on top of TermScorer (I truly hate spans since today), including the ones that need payloads (StandardCodec ONLY)!! I didn't bother to implement the other codecs yet, since I want to get feedback on the API and on this first cut before I go on with it. I will upload the corresponding patch in a minute. I also had to cut over SpanQuery.getSpans(IR) to SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk first, but after that pain today I need a break first :). The patch passes all core tests (org.apache.lucene.search.highlight.HighlighterTest still fails, but I didn't look into the MemoryIndex BulkPostings API yet).

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
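The API shape described above, where a consumer pulls positions from a scorer only when it asked for them up front, can be sketched roughly as follows. This is a toy illustration over an in-memory postings map; the names (PositionsEnum, the constructor flag standing in for ScorerContext#needsPositions) are illustrative, not code from the patch:

```java
import java.util.*;

public class PositionsSketch {
    // Minimal positions iterator over one document's postings for a term.
    interface PositionsEnum {
        int nextPosition(); // returns -1 when exhausted
    }

    // Toy "scorer" over an in-memory index: doc -> positions for one term.
    static class TermScorer {
        private final Map<Integer, int[]> postings;
        private final boolean needsPositions;

        TermScorer(Map<Integer, int[]> postings, boolean needsPositions) {
            this.postings = postings;
            this.needsPositions = needsPositions;
        }

        // Positions are only materialized when the consumer asked for them
        // up front, mirroring the on-demand enum creation in the description.
        PositionsEnum positions(int doc) {
            if (!needsPositions || !postings.containsKey(doc)) return null;
            final int[] pos = postings.get(doc);
            return new PositionsEnum() {
                int i = 0;
                public int nextPosition() { return i < pos.length ? pos[i++] : -1; }
            };
        }
    }

    // Drain the enum for one document into a list.
    static List<Integer> collectPositions(TermScorer scorer, int doc) {
        List<Integer> out = new ArrayList<>();
        PositionsEnum e = scorer.positions(doc);
        if (e == null) return out;
        for (int p = e.nextPosition(); p != -1; p = e.nextPosition()) out.add(p);
        return out;
    }

    public static void main(String[] args) {
        TermScorer scorer = new TermScorer(Map.of(0, new int[]{2, 7, 11}), true);
        System.out.println(collectPositions(scorer, 0)); // [2, 7, 11]
    }
}
```

When needsPositions is false the scorer hands back null instead of an enum, which is the cost-saving point of the on-demand design.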
[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans
[ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063644#comment-13063644 ] Grant Ingersoll commented on LUCENE-2878: - bq. I'm not sure this sounds like an inverted index is even the best data structure for what you describe The key is you usually have a fairly complex Query to begin with, so I do think it is legitimate and it is the right data structure. It is always driven by the search results. I've seen this use case multiple times, where multiple is more than 10, so I am pretty convinced it is beyond just me. I think if you are taking away the ability to create windows around a match (if you read my early comments on this issue I brought it up from the beginning), that is a pretty big loss. I don't think the two things are mutually exclusive. As long as I have a way to get at the positions for all matches, I don't care how it's done. A collector-type callback interface or a way for one to iterate all positions for a given match should be sufficient. That being said, if Mike's comments about a collector-like API are how it is implemented, I think it should work. In reality, I think one would just need a way to, for whatever number of results, be told about positions as they happen. Naturally, the default should be to only do this after the top X are retrieved, when X is small, but I could see implementing it in the scoring loop on certain occasions (and I'm not saying Lucene need have first-order support for that). As long as you don't preclude me from doing that, it should be fine. I'll try to find time to review the patch in more depth in the coming day or so. Allow Scorer to expose positions and payloads aka. 
nuke spans
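The "collector type callback interface" Grant suggests, where the consumer is told about each matching position as it happens, could look roughly like this. Everything here (PositionCollector, matchTerm, windows) is a hypothetical sketch over toy data, not an API from the patch:

```java
import java.util.*;

public class PositionCollectorSketch {
    // The callback: invoked once per (doc, position) match.
    interface PositionCollector {
        void collect(int doc, int position);
    }

    // Toy matcher: for each doc, report every position where the term occurs.
    static void matchTerm(Map<Integer, int[]> postings, PositionCollector collector) {
        for (Map.Entry<Integer, int[]> e : postings.entrySet()) {
            for (int pos : e.getValue()) {
                collector.collect(e.getKey(), pos);
            }
        }
    }

    // Example consumer: record a +/-slop position "window" around each match,
    // the use case described (inspecting words near a company name).
    static List<int[]> windows(Map<Integer, int[]> postings, int slop) {
        List<int[]> out = new ArrayList<>();
        matchTerm(postings, (doc, pos) ->
            out.add(new int[]{doc, Math.max(0, pos - slop), pos + slop}));
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, int[]> postings = new TreeMap<>(Map.of(3, new int[]{5}));
        for (int[] w : windows(postings, 2)) {
            System.out.println(Arrays.toString(w)); // [3, 3, 7]
        }
    }
}
```

The point of the callback shape is that the matcher stays in control of iteration while the consumer decides what to do per position, so windowing is one consumer among many.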
[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans
[ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063657#comment-13063657 ] Robert Muir commented on LUCENE-2878: - {quote} The key is you usually have a fairly complex Query to begin with, so I do think it is legitimate and it is the right data structure. {quote} Really, just because it's complicated? Accessing other terms 'around the position' seems like accessing the document in a non-inverted way. {quote} I've seen this use case multiple times, where multiple is more than 10, so I am pretty convinced it is beyond just me. {quote} Really? If this is so common, why do the spans get so little attention? If the queries are so complex, how is this even possible now, given that spans have so many problems, even basic ones (e.g. discarding boosts)? If performance here is so important towards looking at these 'windows around a match' (which is gonna be slow as shit via term vectors), why don't I see codecs that e.g. deduplicate terms and store pointers to the term windows around themselves in payloads, and things like that for this use case? I don't think we need to lock ourselves into a particular solution (such as a per-position callback API) for something that sounds like it's really slow already. Allow Scorer to expose positions and payloads aka. 
nuke spans
[JENKINS] Lucene-Solr-tests-only-trunk - Build # 9507 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9507/

2 tests failed.

REGRESSION: org.apache.lucene.queryParser.standard.TestNumericQueryParser.testInclusiveNumericRange

Error Message: null

Stack Trace:
java.lang.NullPointerException
	at org.apache.lucene.queryParser.standard.config.NumberDateFormat.parse(NumberDateFormat.java:50)
	at java.text.NumberFormat.parse(NumberFormat.java:348)
	at org.apache.lucene.queryParser.standard.processors.NumericRangeQueryNodeProcessor.postProcessNode(NumericRangeQueryNodeProcessor.java:72)
	at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:98)
	at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processChildren(QueryNodeProcessorImpl.java:124)
	at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:96)
	at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processChildren(QueryNodeProcessorImpl.java:124)
	at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:96)
	at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.process(QueryNodeProcessorImpl.java:89)
	at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorPipeline.process(QueryNodeProcessorPipeline.java:88)
	at org.apache.lucene.queryParser.core.QueryParserHelper.parse(QueryParserHelper.java:254)
	at org.apache.lucene.queryParser.standard.StandardQueryParser.parse(StandardQueryParser.java:166)
	at org.apache.lucene.queryParser.standard.TestNumericQueryParser.testQuery(TestNumericQueryParser.java:385)
	at org.apache.lucene.queryParser.standard.TestNumericQueryParser.assertRangeQuery(TestNumericQueryParser.java:356)
	at org.apache.lucene.queryParser.standard.TestNumericQueryParser.testInclusiveNumericRange(TestNumericQueryParser.java:282)
	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464)
	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382)

REGRESSION: org.apache.lucene.queryParser.standard.TestNumericQueryParser.testExclusiveNumericRange

Error Message: null

Stack Trace:
java.lang.NullPointerException
	at org.apache.lucene.queryParser.standard.config.NumberDateFormat.parse(NumberDateFormat.java:50)
	at java.text.NumberFormat.parse(NumberFormat.java:348)
	at org.apache.lucene.queryParser.standard.processors.NumericRangeQueryNodeProcessor.postProcessNode(NumericRangeQueryNodeProcessor.java:72)
	at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:98)
	at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processChildren(QueryNodeProcessorImpl.java:124)
	at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:96)
	at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processChildren(QueryNodeProcessorImpl.java:124)
	at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:96)
	at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.process(QueryNodeProcessorImpl.java:89)
	at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorPipeline.process(QueryNodeProcessorPipeline.java:88)
	at org.apache.lucene.queryParser.core.QueryParserHelper.parse(QueryParserHelper.java:254)
	at org.apache.lucene.queryParser.standard.StandardQueryParser.parse(StandardQueryParser.java:166)
	at org.apache.lucene.queryParser.standard.TestNumericQueryParser.testQuery(TestNumericQueryParser.java:385)
	at org.apache.lucene.queryParser.standard.TestNumericQueryParser.assertRangeQuery(TestNumericQueryParser.java:356)
	at org.apache.lucene.queryParser.standard.TestNumericQueryParser.testExclusiveNumericRange(TestNumericQueryParser.java:311)
	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464)
	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382)

Build Log (for compile errors): [...truncated 3344 lines...]
[jira] [Updated] (SOLR-2644) DIH handler - when using threads=2 the default logging is set too high
[ https://issues.apache.org/jira/browse/SOLR-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Bell updated SOLR-2644:

Description: Setting the threads parameter in the DIH handler, every add outputs to the log at INFO level. The only current solution is to set the following in log4j.properties:

log4j.rootCategory=INFO, logfile
log4j.logger.org.apache.solr.handler.dataimport.DocBuilder=FATAL
log4j.logger.org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper=FATAL

These 2 log messages need to be changed to DEBUG.

was: Setting the threads parameter in the DIH handler, every add outputs to the log at INFO level. The only current solution is to set the following in log4j.properties:

log4j.rootCategory=INFO, logfile
log4j.logger.org.apache.solr.handler.dataimport.DocBuilder=FATAL
log4j.logger.org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper=FATAL

These 2 log messages need to be changed to INFO.

DIH handler - when using threads=2 the default logging is set too high
--
Key: SOLR-2644
URL: https://issues.apache.org/jira/browse/SOLR-2644
Project: Solr
Issue Type: Bug
Components: contrib - DataImportHandler
Affects Versions: 3.3
Reporter: Bill Bell
Assignee: Shalin Shekhar Mangar
Fix For: 3.4, 4.0
Attachments: SOLR-2644.patch
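The requested fix is to demote the per-add messages from INFO to DEBUG so they are filtered out at the default log level. The effect can be illustrated with java.util.logging, where FINE plays the role of DEBUG; this is a toy sketch, not the actual DocBuilder code (Solr's own logging goes through slf4j):

```java
import java.util.logging.*;

public class DihLogLevelSketch {
    static final Logger LOG = Logger.getLogger("org.apache.solr.handler.dataimport.DocBuilder");

    // Count how many of n per-document messages actually reach the handler
    // when the logger is configured at the given level.
    static int emitted(Level loggerLevel, int n) {
        LOG.setUseParentHandlers(false);
        LOG.setLevel(loggerLevel);
        final int[] count = {0};
        Handler h = new Handler() {
            public void publish(LogRecord r) { count[0]++; }
            public void flush() {}
            public void close() {}
        };
        LOG.addHandler(h);
        for (int i = 0; i < n; i++) {
            // The issue's point: per-add messages belong at FINE (DEBUG),
            // not INFO, so they stay quiet under the default configuration.
            LOG.fine("added document " + i);
        }
        LOG.removeHandler(h);
        return count[0];
    }

    public static void main(String[] args) {
        System.out.println(emitted(Level.INFO, 1000)); // 0: per-add spam suppressed
        System.out.println(emitted(Level.FINE, 1000)); // 1000: visible when debugging
    }
}
```

With the messages at DEBUG, an operator who actually wants them can still opt in per logger, which is the inverse of the FATAL workaround quoted above.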
[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans
[ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063662#comment-13063662 ] Grant Ingersoll commented on LUCENE-2878: - bq. Really, just because its complicated? Accessing other terms 'around the position' seems like accessing the document in a non-inverted way. Isn't that what highlighting does? This is just highlighting on a much bigger set of documents. I don't see why we should prevent users from doing it just b/c you don't see the use case. bq. Really? If this is so common, why do the spans get so little attention? if the queries are so complex, how is this even possible now given that spans have so many problems, even basic ones (e.g. discarding boosts) Isn't that the point of this whole patch? To bring spans into the fold and treat them as first-class citizens? I didn't say it happened all the time. I just said it happened enough that I think it warrants being covered before one nukes spans. bq. If performance here is so important towards looking at these 'windows around a match' (which is gonna be slow as shit via term vectors), why don't I see codecs that e.g. deduplicate terms and store pointers to the term windows around themselves in payloads, and things like that for this use case? Um, b/c it's open source and not everything gets implemented the minute you think of it? bq. I don't think we need to lock ourselves into a particular solution (such as per-position callback API) for something that sounds like its really slow already. Never said we did. Allow Scorer to expose positions and payloads aka. 
nuke spans
[jira] [Updated] (SOLR-2644) DIH handler - when using threads=2 the default logging is set too high
[ https://issues.apache.org/jira/browse/SOLR-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Bell updated SOLR-2644: Attachment: SOLR-2644-2.patch
[jira] [Updated] (SOLR-2644) DIH handler - when using threads=2 the default logging is set too high
[ https://issues.apache.org/jira/browse/SOLR-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Bell updated SOLR-2644: Attachment: (was: SOLR-2644-2.patch)
[jira] [Updated] (SOLR-2644) DIH handler - when using threads=2 the default logging is set too high
[ https://issues.apache.org/jira/browse/SOLR-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Bell updated SOLR-2644: Attachment: SOLR-2644-2.patch
[jira] [Commented] (SOLR-2644) DIH handler - when using threads=2 the default logging is set too high
[ https://issues.apache.org/jira/browse/SOLR-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063664#comment-13063664 ] Bill Bell commented on SOLR-2644: - New patch; you forgot solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DocBuilder.java. Also, I would rather change it to debug and leave it.
[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans
[ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063667#comment-13063667 ] Robert Muir commented on LUCENE-2878: - {quote} Isn't that what highlighting does? This is just highlighting on a much bigger set of documents. I don't see why we should prevent users from doing it just b/c you don't see the use case. {quote} Well, it is different: I'm not saying we should prevent users from doing it, but we shouldn't slow down normal use cases either. I think it's fine for this to be a 2-pass operation, because any performance differences from it being 2-pass across many documents are going to be completely dwarfed by the term vector access!
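The two-pass shape Robert argues for, running the scoring loop untouched and then revisiting only the top hits for positional data, might look like the following toy sketch. The maps here stand in for the index and for what would really be term-vector access; none of these names come from Lucene:

```java
import java.util.*;

public class TwoPassSketch {
    // Pass 1: score all docs, keep the top x ids (toy scoring: the stored value).
    static List<Integer> topDocs(Map<Integer, Double> scores, int x) {
        List<Integer> docs = new ArrayList<>(scores.keySet());
        docs.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return docs.subList(0, Math.min(x, docs.size()));
    }

    // Pass 2: positional data is fetched only for the few docs that survived
    // pass 1, so the per-hit cost never touches the main scoring loop.
    static Map<Integer, int[]> positionsFor(List<Integer> docs, Map<Integer, int[]> positionsIndex) {
        Map<Integer, int[]> out = new LinkedHashMap<>();
        for (int doc : docs) {
            out.put(doc, positionsIndex.getOrDefault(doc, new int[0]));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, Double> scores = Map.of(1, 0.2, 2, 0.9, 3, 0.5);
        Map<Integer, int[]> posIndex = Map.of(2, new int[]{4, 9}, 3, new int[]{1});
        List<Integer> top = topDocs(scores, 2); // [2, 3]
        System.out.println(positionsFor(top, posIndex).keySet()); // positions fetched for top hits only
    }
}
```

The argument in the comment is that since pass 2 runs over only x documents, its cost is dominated by the (slow) per-document data access anyway, so there is no need to bake a callback into the hot scoring loop.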
Re: [JENKINS] Lucene-Solr-tests-only-trunk - Build # 9507 - Failure
I'm seeing this locally as well.

On Tue, Jul 12, 2011 at 1:55 PM, Apache Jenkins Server jenk...@builds.apache.org wrote:
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9507/
2 tests failed.
REGRESSION: org.apache.lucene.queryParser.standard.TestNumericQueryParser.testInclusiveNumericRange
REGRESSION: org.apache.lucene.queryParser.standard.TestNumericQueryParser.testExclusiveNumericRange
[...truncated...]

-- Chris Male | Software Developer | JTeam BV. | www.jteam.nl
[jira] [Commented] (LUCENE-3285) Move QueryParsers from contrib/queryparser to queryparser module
[ https://issues.apache.org/jira/browse/LUCENE-3285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063671#comment-13063671 ] Chris Male commented on LUCENE-3285: Committed revision 1145430. Now moving onto flexible QP. Move QueryParsers from contrib/queryparser to queryparser module Key: LUCENE-3285 URL: https://issues.apache.org/jira/browse/LUCENE-3285 Project: Lucene - Java Issue Type: Sub-task Components: modules/queryparser Reporter: Chris Male Attachments: LUCENE-3285.patch Each of the QueryParsers will be ported across. Those which use the flexible parsing framework will be placed under the package flexible. The StandardQueryParser will be renamed to FlexibleQueryParser and surround.QueryParser will be renamed to SurroundQueryParser. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans
[ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063672#comment-13063672 ] Grant Ingersoll commented on LUCENE-2878: - Yeah, I agree. I don't want to block the primary use case; I'm just really hoping we can have a solution for the second one that elegantly falls out of the primary one and doesn't require a two-pass solution. You are correct on the term vector access, but for large enough sets the second search isn't trivial, even if it is dwarfed. Although, I think it may be possible to at least access them in document order. Allow Scorer to expose positions and payloads aka. nuke spans -- Key: LUCENE-2878 URL: https://issues.apache.org/jira/browse/LUCENE-2878 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: Bulk Postings branch Reporter: Simon Willnauer Assignee: Simon Willnauer Labels: gsoc2011, lucene-gsoc-11, mentor Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, PosHighlighter.patch, PosHighlighter.patch Currently we have two somewhat separate types of queries: those which can make use of positions (mainly spans) and payloads (spans). Yet Span*Query doesn't really do scoring comparable to what other queries do, and at the end of the day they duplicate a lot of code all over Lucene. Span*Queries are also limited to other Span*Query instances, such that you cannot use a TermQuery or a BooleanQuery with SpanNear or anything like that. Besides the Span*Query limitation, other queries lack a quite interesting feature: they cannot score based on term proximity, since Scorers don't expose any positional information. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [JENKINS] Lucene-Solr-tests-only-trunk - Build # 9507 - Failure
I think this test has incorrect randomization, because it initializes its random locale and timezone statically (not in @BeforeClass). You can see this by running the test: it has the same timezone every time. On Mon, Jul 11, 2011 at 10:18 PM, Chris Male gento...@gmail.com wrote: I'm seeing this locally as well. On Tue, Jul 12, 2011 at 1:55 PM, Apache Jenkins Server jenk...@builds.apache.org wrote: Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9507/ 2 tests failed. [...] Build Log (for compile errors): [...truncated 3344 lines...]
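The diagnosis above — random locale/timezone captured in static field initializers at class-load time, before the test framework installs its per-run seed — can be reproduced with a toy sketch. The class and method names below are mine, not LuceneTestCase's:

```java
import java.util.Random;

public class StaticInitPitfall {
    static Random random = new Random(); // the framework later reseeds this in @BeforeClass

    // BROKEN pattern from the test: captured once at class-load time, before any
    // per-run seed is installed, so it never varies with (or reproduces from) the seed.
    static final int BROKEN_STYLE = random.nextInt(4);

    static int goodStyle; // GOOD pattern: assigned inside the setup hook

    /** Mimics the framework installing the per-run seed, then @BeforeClass running. */
    static void beforeClass(long seed) {
        random = new Random(seed);
        goodStyle = random.nextInt(4);
    }

    public static void main(String[] args) {
        beforeClass(42L);
        int first = goodStyle;
        beforeClass(42L);
        // goodStyle reproduces under the same seed; BROKEN_STYLE ignored the seed entirely.
        System.out.println(first == goodStyle);
    }
}
```

This is why the failing locale/timezone combination could not be reproduced from the printed seed: the offending values were decided before the seed ever took effect.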
Re: [JENKINS] Lucene-Solr-tests-only-trunk - Build # 9507 - Failure
here's a change that makes the test reproducible (run it a few times and eventually you get a problematic locale/tz, then the seed will reproduce the problem):

Index: lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/standard/TestNumericQueryParser.java
===================================================================
--- lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/standard/TestNumericQueryParser.java (revision 1145431)
+++ lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/standard/TestNumericQueryParser.java (working copy)
@@ -63,26 +63,30 @@
   final private static int PRECISION_STEP = 8;
   final private static String FIELD_NAME = "field";
-  final private static Locale LOCALE = randomLocale(random);
-  final private static TimeZone TIMEZONE = randomTimeZone(random);
-  final private static Map<String,Number> RANDOM_NUMBER_MAP;
+  private static Locale LOCALE;
+  private static TimeZone TIMEZONE;
+  private static Map<String,Number> RANDOM_NUMBER_MAP;
   final private static EscapeQuerySyntax ESCAPER = new EscapeQuerySyntaxImpl();
   final private static String DATE_FIELD_NAME = "date";
-  final private static int DATE_STYLE = randomDateStyle(random);
-  final private static int TIME_STYLE = randomDateStyle(random);
+  private static int DATE_STYLE;
+  private static int TIME_STYLE;
+  private static Analyzer ANALYZER;
-  final private static Analyzer ANALYZER = new MockAnalyzer(random);
+  private static NumberFormat NUMBER_FORMAT;
-  final private static NumberFormat NUMBER_FORMAT = NumberFormat
-      .getNumberInstance(LOCALE);
+  private static StandardQueryParser qp;
-  final private static StandardQueryParser qp = new StandardQueryParser(
-      ANALYZER);
+  private static NumberDateFormat DATE_FORMAT;
-  final private static NumberDateFormat DATE_FORMAT;
-
-  static {
+  static void initFormats() {
     try {
+      LOCALE = randomLocale(random);
+      TIMEZONE = randomTimeZone(random);
+      DATE_STYLE = randomDateStyle(random);
+      TIME_STYLE = randomDateStyle(random);
+      ANALYZER = new MockAnalyzer(random);
+      NUMBER_FORMAT = NumberFormat.getNumberInstance(LOCALE);
+      qp = new StandardQueryParser(ANALYZER);
       NUMBER_FORMAT.setMaximumFractionDigits((random.nextInt() 20) + 1);
       NUMBER_FORMAT.setMinimumFractionDigits((random.nextInt() 20) + 1);
       NUMBER_FORMAT.setMaximumIntegerDigits((random.nextInt() 20) + 1);
@@ -145,6 +149,7 @@
   @BeforeClass
   public static void beforeClass() throws Exception {
+    initFormats();
     directory = newDirectory();
     RandomIndexWriter writer = new RandomIndexWriter(random, directory,
         newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random))

On Mon, Jul 11, 2011 at 10:30 PM, Robert Muir rcm...@gmail.com wrote: I think this test has incorrect randomization, because it initializes its random locale and timezone statically (not in @beforeclass). You can see this by running the test, it has the same timezone every time. On Mon, Jul 11, 2011 at 10:18 PM, Chris Male gento...@gmail.com wrote: I'm seeing this locally as well. On Tue, Jul 12, 2011 at 1:55 PM, Apache Jenkins Server jenk...@builds.apache.org wrote: Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9507/ 2 tests failed.
[...]
[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans
[ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063689#comment-13063689 ] Mike Sokolov commented on LUCENE-2878: -- I hope you all will review the patch and see what you think. My gut at the moment tells me we can have it both ways with a bit more tinkering. I think that as it stands now, if you ask for positions you get them in more or less the most efficient way we know how. At the moment there is some performance hit when you don't want positions, but I think we can deal with that. Simon had the idea that we could rely on the JIT compiler to optimize away the test we have if we set it up as a final false boolean (totally doable if we set up the state during Scorer construction), which would be great and convenient. I'm no compiler expert, so I'm not sure how reliable that is - is it? But we could also totally separate the two cases (say with a wrapping Scorer? - no need for compiler tricks) while still allowing us to retrieve positions while querying, collecting docs, and scoring. Allow Scorer to expose positions and payloads aka. nuke spans -- Key: LUCENE-2878 URL: https://issues.apache.org/jira/browse/LUCENE-2878 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: Bulk Postings branch Reporter: Simon Willnauer Assignee: Simon Willnauer Labels: gsoc2011, lucene-gsoc-11, mentor Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, PosHighlighter.patch, PosHighlighter.patch -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
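The "final false boolean" idea can be sketched in isolation. Whether HotSpot actually folds the branch away after inlining is exactly the open question in the thread; the class below (a hypothetical name, not from the patch) only shows the shape of the pattern:

```java
public class PositionFlagScorer {
    // The flag is final and decided once at construction; when it is false the JIT
    // can, in principle, treat the branch as dead code after inlining. Whether it
    // reliably does so is not guaranteed here - this just demonstrates the pattern.
    private final boolean needsPositions;
    private int lastPosition = -1;

    public PositionFlagScorer(boolean needsPositions) {
        this.needsPositions = needsPositions;
    }

    /** Scores one posting; records the position only when the query asked for it. */
    public float score(int position, int freq) {
        if (needsPositions) {   // constant per instance, set during construction
            lastPosition = position;
        }
        return (float) Math.sqrt(freq);
    }

    public int lastPosition() {
        return lastPosition;
    }

    public static void main(String[] args) {
        PositionFlagScorer s = new PositionFlagScorer(true);
        s.score(7, 4);
        System.out.println(s.lastPosition());
    }
}
```

The wrapping-Scorer alternative mentioned in the comment would remove the branch entirely by choosing a different concrete class at construction time, trading a per-call test for a virtual dispatch.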
[jira] [Commented] (SOLR-2641) Auto Facet Selection component
[ https://issues.apache.org/jira/browse/SOLR-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063722#comment-13063722 ] Toke Eskildsen commented on SOLR-2641: -- This looks like a variant of hierarchical faceting. With popularity count as the selector, paths like color/green and memory_size/4GB would produce the desired result. Auto Facet Selection component -- Key: SOLR-2641 URL: https://issues.apache.org/jira/browse/SOLR-2641 Project: Solr Issue Type: Improvement Components: SearchComponents - other Reporter: Erik Hatcher Assignee: Erik Hatcher Priority: Minor Attachments: SOLR_2641.patch It sure would be nice if you could have Solr automatically select field(s) for faceting dynamically, based on the profile of the results. For example, you're indexing disparate types of products, all with varying attributes (color, size - like for apparel, memory_size - for electronics, subject - for books, etc), and a user searches for "ipod" where most matching products have color and memory_size attributes... let's automatically facet on those fields. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
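The suggestion above amounts to profiling which attribute fields occur most often across the matched documents and faceting on the winners. Below is a minimal, Solr-free sketch of that selection step; all class and method names are hypothetical, and a real SearchComponent would consult index field statistics rather than per-document maps:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AutoFacetSelector {

    /** Picks the n attribute fields that occur in the most matched documents. */
    public static List<String> selectTopFields(List<Map<String, String>> matchedDocs, int n) {
        Map<String, Integer> fieldDocCount = new HashMap<>();
        for (Map<String, String> doc : matchedDocs) {
            for (String field : doc.keySet()) {
                fieldDocCount.merge(field, 1, Integer::sum); // one vote per doc per field
            }
        }
        List<String> fields = new ArrayList<>(fieldDocCount.keySet());
        fields.sort((a, b) -> fieldDocCount.get(b) - fieldDocCount.get(a));
        return fields.subList(0, Math.min(n, fields.size()));
    }

    /** Three matched "ipod"-style hits: two share color and memory_size, one is a book. */
    public static List<String> demo() {
        List<Map<String, String>> hits = Arrays.asList(
                mapOf("color", "white", "memory_size", "4GB"),
                mapOf("color", "black", "memory_size", "8GB"),
                mapOf("subject", "music"));
        return selectTopFields(hits, 2);
    }

    private static Map<String, String> mapOf(String... kv) {
        Map<String, String> m = new LinkedHashMap<>();
        for (int i = 0; i < kv.length; i += 2) m.put(kv[i], kv[i + 1]);
        return m;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With popularity count as the selector, color and memory_size win over subject for this result set, which matches the hierarchical-faceting reading in the comment.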
[jira] [Commented] (SOLR-2382) DIH Cache Improvements
[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063727#comment-13063727 ] Noble Paul commented on SOLR-2382: -- My apologies for the delay. The problem w/ the patch is its size/scope. You may not need to open up other issues, but stuff like abstracting DIHWriter, DIHPropertiesWriter, etc. can be given as separate patches in the same issue and I can commit them straight away. Though the issue is about cache improvements, it goes far beyond that scope; committing it in as a whole is difficult.

DIH Cache Improvements -- Key: SOLR-2382 URL: https://issues.apache.org/jira/browse/SOLR-2382 Project: Solr Issue Type: New Feature Components: contrib - DataImportHandler Reporter: James Dyer Priority: Minor Attachments: SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch

Functionality:
1. Provide a pluggable caching framework for DIH so that users can choose a cache implementation that best suits their data and application.
2. Provide a means to temporarily cache a child entity's data without needing to create a special cached implementation of the entity processor (such as CachedSqlEntityProcessor).
3. Provide a means to write the final (root entity) DIH output to a cache rather than to Solr. Then provide a way for a subsequent DIH call to use the cache as an entity input. Also provide the ability to do delta updates on such persistent caches.
4. Provide the ability to partition data across multiple caches that can then be fed back into DIH and indexed either to varying Solr shards, or to the same core in parallel.

Use Cases:
1. We needed a flexible, scalable way to temporarily cache child-entity data prior to joining to parent entities.
- Using SqlEntityProcessor with child entities can cause an n+1 select problem.
- CachedSqlEntityProcessor only supports an in-memory HashMap as a caching mechanism and does not scale.
- There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
2. We needed the ability to gather data from long-running entities by a process that runs separate from our main indexing process.
3. We wanted the ability to do a delta import of only the entities that changed.
- Lucene/Solr requires entire documents to be re-indexed, even if only a few fields changed.
- Our data comes from 50+ complex sql queries and/or flat files.
- We do not want to incur overhead re-gathering all of this data if only 1 entity's data changed.
- Persistent DIH caches solve this problem.
4. We want the ability to index several documents in parallel (using 1.4.1, which did not have the threads parameter).
5. In the future, we may need to use shards, creating a need to easily partition our source data into shards.

Implementation Details:
1. De-couple EntityProcessorBase from caching.
- Created a new interface, DIHCache, and two implementations:
- SortedMapBackedCache - an in-memory cache, used as default with CachedSqlEntityProcessor (now deprecated).
- BerkleyBackedCache - a disk-backed cache, dependent on bdb-je, tested with je-4.1.6.jar.
- NOTE: the existing Lucene contrib db project uses je-3.3.93.jar. I believe this may be incompatible due to Generic Usage.
- NOTE: I did not modify the ant script to automatically get this jar, so to use or evaluate this patch, download bdb-je from http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html
2. Allow entity processors to take a cacheImpl parameter to cause the entity data to be cached (see EntityProcessorBase DIHCacheProperties).
3. Partially de-couple SolrWriter from DocBuilder - created a new interface, DIHWriter, and two implementations:
- SolrWriter (refactored)
- DIHCacheWriter (allows DIH to write ultimately to a cache).
4. Create a new entity processor, DIHCacheProcessor, which reads a persistent cache as DIH entity input.
5. Support a partition parameter with both DIHCacheWriter and DIHCacheProcessor to allow for easy partitioning of source entity data.
6. Change the semantics of entity.destroy():
- Previously, it was being called on each iteration of DocBuilder.buildDocument().
- Now it does one-time cleanup tasks (like closing or deleting a disk-backed cache) once the entity processor is completed.
- The only out-of-the-box entity processor that previously implemented destroy() was LineEntityProcessor, so this is not a very invasive change.

General Notes: We are near completion in converting our search functionality from a legacy search engine to Solr. However, I found that DIH did not