[Lucene.Net] [jira] [Commented] (LUCENENET-412) Replacing ArrayLists, Hashtables etc. with appropriate Generics.
[ https://issues.apache.org/jira/browse/LUCENENET-412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035092#comment-13035092 ] Digy commented on LUCENENET-412: One more sample

{code}
From:

class AnonymousFilterCache : FilterCache<DocIdSet>
{
    class AnonymousFilteredDocIdSet : FilteredDocIdSet
    {
        IndexReader r;

        public AnonymousFilteredDocIdSet(DocIdSet innerSet, IndexReader r) : base(innerSet)
        {
            this.r = r;
        }

        public override bool Match(int docid)
        {
            return !r.IsDeleted(docid);
        }
    }

    public AnonymousFilterCache(DeletesMode deletesMode) : base(deletesMode)
    {
    }

    protected override object MergeDeletes(IndexReader reader, object docIdSet)
    {
        return new AnonymousFilteredDocIdSet((DocIdSet)docIdSet, reader);
    }
}
...
cache = new AnonymousFilterCache(deletesMode);

To:

cache = new FilterCache<DocIdSet>(deletesMode, (reader, docIdSet) =>
{
    return new FilteredDocIdSet((DocIdSet)docIdSet, (docid) =>
    {
        return !reader.IsDeleted(docid);
    });
});
{code}

DIGY

Replacing ArrayLists, Hashtables etc. with appropriate Generics. Key: LUCENENET-412 URL: https://issues.apache.org/jira/browse/LUCENENET-412 Project: Lucene.Net Issue Type: Improvement Affects Versions: Lucene.Net 2.9.4 Reporter: Digy Priority: Minor Fix For: Lucene.Net 2.9.4 Attachments: IEquatable for QuerySubclasses.patch, LUCENENET-412.patch, lucene_2.9.4g_exceptions_fix This will move Lucene.Net.2.9.4 closer to lucene.3.0.3 and allow some performance gains. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: [Lucene.Net] [jira] [Commented] (LUCENENET-412) Replacing ArrayLists, Hashtables etc. with appropriate Generics.
This is a great improvement, but why not also remove the braces and returns?

var cache = new FilterCache<DocIdSet>(deletesMode,
    (reader, docIdSet) => new FilteredDocIdSet(
        (DocIdSet)docIdSet,
        docid => !reader.IsDeleted(docid)));

On Tue, May 17, 2011 at 3:01 PM, Digy (JIRA) j...@apache.org wrote:
[jira] [Resolved] (LUCENE-3090) DWFlushControl does not take active DWPT out of the loop on fullFlush
[ https://issues.apache.org/jira/browse/LUCENE-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-3090. - Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [New]) Committed in revision 1104026. DWFlushControl does not take active DWPT out of the loop on fullFlush - Key: LUCENE-3090 URL: https://issues.apache.org/jira/browse/LUCENE-3090 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Critical Fix For: 4.0 Attachments: LUCENE-3090.patch, LUCENE-3090.patch, LUCENE-3090.patch We have seen several OOMs on TestNRTThreads and all of them are caused by DWFlushControl missing DWPT that are set as flushPending but can't flush due to a full flush going on. Yet that means that those DWPT are filling up in the background while they should actually be checked out and blocked until the full flush finishes. Even further, we currently stall on the maxNumThreadStates while we should stall on the num of active thread states. I will attach a patch tomorrow. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2736) Wrong implementation of DocIdSetIterator.advance
[ https://issues.apache.org/jira/browse/LUCENE-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034618#comment-13034618 ] Doron Cohen commented on LUCENE-2736: - Shai, with the modified text the NOTE on implementations' freedom not to advance beyond in some situations becomes strange... I think the original text stresses that the real intended behavior is to advance beyond the current doc; it is just that, for performance reasons, the decision whether to advance beyond in some situations is left to the implementation, and so, if the caller provides a target which is not greater than the current doc, it should be aware of this possibility. So I think it is perhaps better to either not modify this at all, or at most, to add (see NOTE below) just after beyond:

{noformat}
- * Advances to the first beyond the current whose document number is greater
+ * Advances to the first beyond (see NOTE below) the current whose document number is greater
{noformat}

This would prevent the confusion, I think?

Wrong implementation of DocIdSetIterator.advance - Key: LUCENE-2736 URL: https://issues.apache.org/jira/browse/LUCENE-2736 Project: Lucene - Java Issue Type: Bug Components: core/search Affects Versions: 3.2, 4.0 Reporter: Hardy Ferentschik Assignee: Shai Erera Attachments: LUCENE-2736.patch Implementations of {{DocIdSetIterator}} behave differently when advance is called. Taking the following test for {{OpenBitSet}}, {{DocIdBitSet}} and {{SortedVIntList}}, only {{SortedVIntList}} passes the test:

{code:title=org.apache.lucene.search.TestDocIdSet.java|borderStyle=solid}
...
public void testAdvanceWithOpenBitSet() throws IOException {
  DocIdSet idSet = new OpenBitSet(new long[] { 1121 }, 1); // bits 0, 5, 6, 10
  assertAdvance(idSet);
}

public void testAdvanceDocIdBitSet() throws IOException {
  BitSet bitSet = new BitSet();
  bitSet.set(0);
  bitSet.set(5);
  bitSet.set(6);
  bitSet.set(10);
  DocIdSet idSet = new DocIdBitSet(bitSet);
  assertAdvance(idSet);
}

public void testAdvanceWithSortedVIntList() throws IOException {
  DocIdSet idSet = new SortedVIntList(0, 5, 6, 10);
  assertAdvance(idSet);
}

private void assertAdvance(DocIdSet idSet) throws IOException {
  DocIdSetIterator iter = idSet.iterator();
  int docId = iter.nextDoc();
  assertEquals("First doc id should be 0", 0, docId);
  docId = iter.nextDoc();
  assertEquals("Second doc id should be 5", 5, docId);
  docId = iter.advance(5);
  assertEquals("Advancing iterator should return the next doc id", 6, docId);
}
{code}

The javadoc for {{advance}} says: {quote} Advances to the first *beyond* the current whose document number is greater than or equal to _target_. {quote} This seems to indicate that {{SortedVIntList}} behaves correctly, whereas the other two don't. Just looking at the {{DocIdBitSet}} implementation, advance is implemented as:

{code}
bitSet.nextSetBit(target);
{code}

where the docs of {{nextSetBit}} say: {quote} Returns the index of the first bit that is set to true that occurs *on or after* the specified starting index {quote} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
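For what it's worth, the mismatch is reproducible with plain java.util.BitSet, outside Lucene. Here is a minimal sketch (hypothetical helper names, not Lucene code) contrasting nextSetBit()-style "on or after" semantics with the "beyond the current" semantics the javadoc describes:

```java
import java.util.BitSet;

// Hypothetical helpers (not Lucene code) contrasting the two advance()
// semantics discussed in this issue.
public class AdvanceSemantics {

    // What DocIdBitSet.advance() effectively does: delegate to
    // BitSet.nextSetBit(), which lands "on or after" the target.
    static int advanceOnOrAfter(BitSet bits, int target) {
        return bits.nextSetBit(target);
    }

    // What the javadoc describes: advance to the first doc *beyond*
    // the current one whose number is >= target.
    static int advanceBeyond(BitSet bits, int current, int target) {
        return bits.nextSetBit(Math.max(current + 1, target));
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet();
        bits.set(0); bits.set(5); bits.set(6); bits.set(10);
        // Iterator is on doc 5; caller asks to advance(5):
        System.out.println(advanceOnOrAfter(bits, 5));   // 5 -- stays put
        System.out.println(advanceBeyond(bits, 5, 5));   // 6 -- moves on
    }
}
```

With the bits 0, 5, 6, 10 from the test above, the "on or after" variant returns 5 and the "beyond" variant returns 6, which is exactly the disagreement assertAdvance() trips over.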
[jira] [Created] (LUCENE-3108) Land DocValues on trunk
Land DocValues on trunk --- Key: LUCENE-3108 URL: https://issues.apache.org/jira/browse/LUCENE-3108 Project: Lucene - Java Issue Type: Task Components: core/index, core/search, core/store Affects Versions: CSF branch, 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Fix For: 4.0 It's time to move another feature from branch to trunk. I want to start this process now while a couple of issues still remain on the branch. Currently I am down to a single nocommit (javadocs on DocValues.java) and a couple of testing TODOs (explicit multithreaded tests and unoptimized with deletions) but I think those are not worth separate issues so we can resolve them as we go. The already created issues (LUCENE-3075 and LUCENE-3074) should not block this process here IMO, we can fix them once we are on trunk. Here is a quick feature overview of what has been implemented:

* DocValues implementations for Ints (based on PackedInts), Float 32 / 64, Bytes (fixed / variable size, each in sorted, straight and deref variations)
* Integration into the flex API; a Codec provides a PerDocConsumer-DocValuesConsumer (write) / PerDocValues-DocValues (read)
* Enabled by default in all codecs except PreFlex
* Follows other flex-API patterns, e.g. non-segment readers throw UOE, forcing MultiPerDocValues if on a DirReader etc.
* Integration into IndexWriter, FieldInfos etc.
* Random testing enabled via RandomIW, injecting random DocValues into documents
* Basic checks in CheckIndex (which runs after each test)
* FieldComparator for int and float variants (sorting; currently directly integrated into SortField, this might go into a separate DocValuesSortField eventually)
* Extended TestSort for DocValues
* RAM-resident random access API plus on-disk DocValuesEnum (currently only sequential access) - Source.java / DocValuesEnum.java
* Extensible cache implementation for RAM-resident DocValues (by default loaded into RAM only once and freed once the IR is closed) - SourceCache.java

PS: Currently the RAM-resident API is named Source (Source.java), which seems too generic. I think we should rename it to RamDocValues or something like that; suggestions welcome! Any comments, questions (rants :)) are very much appreciated. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Moving towards Lucene 4.0
On Mon, May 16, 2011 at 5:24 PM, Shai Erera ser...@gmail.com wrote: We anyway seem to mark every new API as @lucene.experimental these days, so we shouldn't have too much problem when 4.0 is out :). Experimental API is subject to change at any time. We can consider that as an option as well (maybe it adds another option to Robert's?). Though personally, I'm not a big fan of this notion - I think we deceive ourselves and users when we have @experimental on a stable branch. Any @experimental API on trunk today falls into this bucket after 4.0 is out. And I'm sure there are a couple in 3.x already. Don't get me wrong - I don't suggest we should stop using it. But I think we should consider to review the @experimental API before every stable release, and reduce it over time, not increase it. +1 Shai On Mon, May 16, 2011 at 4:20 PM, Robert Muir rcm...@gmail.com wrote: On Mon, May 16, 2011 at 9:12 AM, Simon Willnauer simon.willna...@googlemail.com wrote: I have to admit that branch is very rough and the API is super hard to use. For now! Lets not be dragged away into discussion how this API should look like there will be time for that. +1, this is what i really meant by decide how to handle. I don't think we will be able to quickly decide how to fix the branch itself, i think its really complicated. But we can admit its really complicated and won't be solved very soon, and try to figure out a release strategy with this in mind. (p.s. sorry simon, you got two copies of this message i accidentally hit reply instead of reply-all) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3109) Rename FieldsConsumer to InvertedFieldsConsumer
Rename FieldsConsumer to InvertedFieldsConsumer --- Key: LUCENE-3109 URL: https://issues.apache.org/jira/browse/LUCENE-3109 Project: Lucene - Java Issue Type: Task Components: core/codecs Affects Versions: 4.0 Reporter: Simon Willnauer Priority: Minor Fix For: 4.0 The name FieldsConsumer is misleading; it really is an InvertedFieldsConsumer, and since we are extending codecs to consume non-inverted fields we should be clear here. The same applies to Fields.java as well as FieldsProducer. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3110) ASCIIFoldingFilter wrongly folds german Umlauts
ASCIIFoldingFilter wrongly folds german Umlauts --- Key: LUCENE-3110 URL: https://issues.apache.org/jira/browse/LUCENE-3110 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Affects Versions: 3.1 Reporter: Michael Gaber The German umlauts are currently mapped as follows:

Ä/ä = A/a
Ö/ö = O/o
Ü/ü = U/u

The correct mapping would be:

Ä/ä = Ae/ae
Ö/ö = Oe/oe
Ü/ü = Ue/ue

So the corresponding rows in the switch statement should be moved down to the ae/oe/ue positions. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
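The requested transliteration is easy to sketch outside the filter; a minimal stand-alone helper (hypothetical, not the actual ASCIIFoldingFilter switch) folding umlauts to their two-letter forms:

```java
// A hypothetical stand-alone folder (not the actual ASCIIFoldingFilter
// code) showing the proposed two-letter transliterations.
public class UmlautFolder {

    static String fold(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            switch (c) {
                case 'ä': out.append("ae"); break;
                case 'ö': out.append("oe"); break;
                case 'ü': out.append("ue"); break;
                case 'Ä': out.append("Ae"); break;
                case 'Ö': out.append("Oe"); break;
                case 'Ü': out.append("Ue"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("Müller")); // Mueller, not Muller
    }
}
```

Note that this folding produces "Mueller" rather than "Muller", which matches how Germans themselves write umlauts when restricted to ASCII.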
[jira] [Created] (LUCENE-3111) TestFSTs.testRandomWords failure
TestFSTs.testRandomWords failure Key: LUCENE-3111 URL: https://issues.apache.org/jira/browse/LUCENE-3111 Project: Lucene - Java Issue Type: Bug Reporter: selckin Priority: Minor Was running some while(1) tests on the docvalues branch (r1103705) and the following test failed:

{code}
[junit] Testsuite: org.apache.lucene.util.automaton.fst.TestFSTs
[junit] Testcase: testRandomWords(org.apache.lucene.util.automaton.fst.TestFSTs): FAILED
[junit] expected:<771> but was:<TwoLongs:771,771>
[junit] junit.framework.AssertionFailedError: expected:<771> but was:<TwoLongs:771,771>
[junit]   at org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.verifyUnPruned(TestFSTs.java:540)
[junit]   at org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.doTest(TestFSTs.java:496)
[junit]   at org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.doTest(TestFSTs.java:359)
[junit]   at org.apache.lucene.util.automaton.fst.TestFSTs.doTest(TestFSTs.java:319)
[junit]   at org.apache.lucene.util.automaton.fst.TestFSTs.testRandomWords(TestFSTs.java:940)
[junit]   at org.apache.lucene.util.automaton.fst.TestFSTs.testRandomWords(TestFSTs.java:915)
[junit]   at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
[junit]   at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
[junit]
[junit] Tests run: 7, Failures: 1, Errors: 0, Time elapsed: 7.628 sec
[junit]
[junit] - Standard Error -
[junit] NOTE: Ignoring nightly-only test method 'testBigSet'
[junit] NOTE: reproduce with: ant test -Dtestcase=TestFSTs -Dtestmethod=testRandomWords -Dtests.seed=-269475578956012681:0
[junit] NOTE: test params are: codec=PreFlex, locale=ar, timezone=America/Blanc-Sablon
[junit] NOTE: all tests run in this JVM:
[junit] [TestToken, TestCodecs, TestIndexReaderReopen, TestIndexWriterMerging, TestNoDeletionPolicy, TestParallelReaderEmptyIndex, TestParallelTermEnum, TestPerSegmentDeletes, TestSegmentReader, TestSegmentTermDocs, TestStressAdvance, TestTermVectorsReader, TestSurrogates, TestMultiFieldQueryParser, TestAutomatonQuery, TestBooleanScorer, TestFuzzyQuery, TestMultiTermConstantScore, TestNumericRangeQuery64, TestPositiveScoresOnlyCollector, TestPrefixFilter, TestQueryTermVector, TestScorerPerf, TestSloppyPhraseQuery, TestSpansAdvanced, TestWindowsMMap, TestRamUsageEstimator, TestSmallFloat, TestUnicodeUtil, TestFSTs]
[junit] NOTE: Linux 2.6.37-gentoo amd64/Sun Microsystems Inc. 1.6.0_25 (64-bit)/cpus=8,threads=1,free=137329960,total=208207872
[junit] - ---
[junit] TEST org.apache.lucene.util.automaton.fst.TestFSTs FAILED
{code}

I am not able to reproduce. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3105) String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names
[ https://issues.apache.org/jira/browse/LUCENE-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034639#comment-13034639 ] Earwin Burrfoot commented on LUCENE-3105: - StringInterner is in fact faster than CHM. And is compatible with String.intern(), ie - it returns the same String instances. It also won't eat up memory if spammed with numerous unique strings (which is a strange feature, but people requested that). In Lucene 4.0 all of this is moot anyway, fields there are strongly separated and intern() is not used. String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names Key: LUCENE-3105 URL: https://issues.apache.org/jira/browse/LUCENE-3105 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.1 Reporter: Mark Kristensson Attachments: LUCENE-3105.patch We have one index with several hundred thousand unique field names (we're optimistic that Lucene 4.0 is flexible enough to allow us to change our index design...) and found that opening an index writer and closing an index reader results in horribly slow performance on that one index. I have isolated the problem down to the calls to String.intern() that are used to allow for quick string comparisons of field names throughout Lucene. These String.intern() calls are unnecessary and can be replaced with a hashmap lookup. In fact, StringHelper.java has its own hashmap implementation that it uses in conjunction with String.intern(). Rather than using a one-off hashmap, I've elected to use a ConcurrentHashMap in this patch. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
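A ConcurrentHashMap-based interner of the kind the patch description suggests can be sketched in a few lines (a hypothetical helper, not the actual LUCENE-3105 patch): putIfAbsent() makes the first caller's instance canonical, so equal strings always intern to the same reference.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical CHM-based interner (not the actual LUCENE-3105 patch):
// returns one canonical instance per distinct string so field names can
// be compared with ==, without the contention of String.intern().
public class FieldNameInterner {
    private final ConcurrentHashMap<String, String> pool =
        new ConcurrentHashMap<String, String>();

    public String intern(String s) {
        // putIfAbsent returns null when we won the race and s became
        // canonical; otherwise it returns the existing canonical copy.
        String prev = pool.putIfAbsent(s, s);
        return prev == null ? s : prev;
    }

    public static void main(String[] args) {
        FieldNameInterner interner = new FieldNameInterner();
        String a = interner.intern(new String("title"));
        String b = interner.intern(new String("title"));
        System.out.println(a == b); // true -- same canonical instance
    }
}
```

Unlike String.intern(), instances interned this way are not shared with the JVM-wide pool, which is exactly Earwin's compatibility point above.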
[jira] [Commented] (LUCENE-3105) String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names
[ https://issues.apache.org/jira/browse/LUCENE-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034640#comment-13034640 ] Earwin Burrfoot commented on LUCENE-3105: - Hmm.. Ok, it *is* still used, but that's gonna be fixed, mm? String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names Key: LUCENE-3105 URL: https://issues.apache.org/jira/browse/LUCENE-3105 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.1 Reporter: Mark Kristensson Attachments: LUCENE-3105.patch We have one index with several hundred thousand unique field names (we're optimistic that Lucene 4.0 is flexible enough to allow us to change our index design...) and found that opening an index writer and closing an index reader results in horribly slow performance on that one index. I have isolated the problem down to the calls to String.intern() that are used to allow for quick string comparisons of field names throughout Lucene. These String.intern() calls are unnecessary and can be replaced with a hashmap lookup. In fact, StringHelper.java has its own hashmap implementation that it uses in conjunction with String.intern(). Rather than using a one-off hashmap, I've elected to use a ConcurrentHashMap in this patch. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3100) IW.commit() writes but fails to fsync the N.fnx file
[ https://issues.apache.org/jira/browse/LUCENE-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034642#comment-13034642 ] Michael McCandless commented on LUCENE-3100: Patch looks good Simon! IW.commit() writes but fails to fsync the N.fnx file Key: LUCENE-3100 URL: https://issues.apache.org/jira/browse/LUCENE-3100 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-3100.patch In making a unit test for NRTCachingDir (LUCENE-3092) I hit this surprising bug! Because the new N.fnx file is written at the last minute along with the segments file, it's not included in the sis.files() that IW uses to figure out which files to sync. This bug means one could call IW.commit(), successfully, return, and then the machine could crash and when it comes back up your index could be corrupted. We should hopefully first fix TestCrash so that it hits this bug (maybe it needs more/better randomization?), then fix the bug -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-3092: --- Attachment: LUCENE-3092.patch New patch, fixes the issue Simon hit (it was just a bug in the test -- it was using a silly MergePolicy that ignored partial optimize). Test now passes w/ the patch from LUCENE-3100. I think this is ready to commit, after LUCENE-3100 is in. NRTCachingDirectory, to buffer small segments in a RAMDir - Key: LUCENE-3092 URL: https://issues.apache.org/jira/browse/LUCENE-3092 Project: Lucene - Java Issue Type: Improvement Components: core/store Reporter: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch I created this simple Directory impl, whose goal is to reduce IO contention in a frequent-reopen NRT use case. The idea is: when reopening quickly, but not indexing that much content, you wind up with many small files created over time, which can stress the IO system, e.g. if merges and searching are also fighting for IO. So, NRTCachingDirectory puts these newly created files into a RAMDir, and only when they are merged into a too-large segment does it then write through to the real (delegate) directory. This lets you spend some RAM to reduce IO. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
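The write-through policy described in the issue is essentially a size-threshold decision; a toy model of that routing logic (hypothetical names and thresholds, not the actual NRTCachingDirectory code) might look like:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the caching policy described above (hypothetical; not
// the actual NRTCachingDirectory code): small, freshly flushed files go
// to a RAM cache, while large merged segments, or anything that would
// overflow the cache budget, write through to the delegate directory.
public class CacheRouter {
    private final long maxMergeSizeBytes; // segments above this go to disk
    private final long maxCachedBytes;    // total RAM budget for the cache
    private final Map<String, Long> ramFiles = new HashMap<String, Long>();
    private long cachedBytes;

    public CacheRouter(long maxMergeSizeBytes, long maxCachedBytes) {
        this.maxMergeSizeBytes = maxMergeSizeBytes;
        this.maxCachedBytes = maxCachedBytes;
    }

    // Decide whether a new file of the given estimated size is cached
    // in RAM (true) or written through to the delegate (false).
    public boolean writeToRam(String name, long estimatedBytes) {
        if (estimatedBytes > maxMergeSizeBytes
                || cachedBytes + estimatedBytes > maxCachedBytes) {
            return false;
        }
        ramFiles.put(name, estimatedBytes);
        cachedBytes += estimatedBytes;
        return true;
    }

    public long cachedBytes() { return cachedBytes; }
}
```

The trade-off is visible in the two constructor parameters: raising maxCachedBytes spends more RAM to absorb more of the small-file churn from frequent NRT reopens.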
[jira] [Commented] (LUCENE-3108) Land DocValues on trunk
[ https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034645#comment-13034645 ] Michael McCandless commented on LUCENE-3108: +1, excellent! Land DocValues on trunk --- Key: LUCENE-3108 URL: https://issues.apache.org/jira/browse/LUCENE-3108 Project: Lucene - Java Issue Type: Task Components: core/index, core/search, core/store Affects Versions: CSF branch, 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Fix For: 4.0 It's time to move another feature from branch to trunk. I want to start this process now while a couple of issues still remain on the branch. Currently I am down to a single nocommit (javadocs on DocValues.java) and a couple of testing TODOs (explicit multithreaded tests and unoptimized with deletions) but I think those are not worth separate issues so we can resolve them as we go. The already created issues (LUCENE-3075 and LUCENE-3074) should not block this process here IMO, we can fix them once we are on trunk. Here is a quick feature overview of what has been implemented:

* DocValues implementations for Ints (based on PackedInts), Float 32 / 64, Bytes (fixed / variable size, each in sorted, straight and deref variations)
* Integration into the flex API; a Codec provides a PerDocConsumer-DocValuesConsumer (write) / PerDocValues-DocValues (read)
* Enabled by default in all codecs except PreFlex
* Follows other flex-API patterns, e.g. non-segment readers throw UOE, forcing MultiPerDocValues if on a DirReader etc.
* Integration into IndexWriter, FieldInfos etc.
* Random testing enabled via RandomIW, injecting random DocValues into documents
* Basic checks in CheckIndex (which runs after each test)
* FieldComparator for int and float variants (sorting; currently directly integrated into SortField, this might go into a separate DocValuesSortField eventually)
* Extended TestSort for DocValues
* RAM-resident random access API plus on-disk DocValuesEnum (currently only sequential access) - Source.java / DocValuesEnum.java
* Extensible cache implementation for RAM-resident DocValues (by default loaded into RAM only once and freed once the IR is closed) - SourceCache.java

PS: Currently the RAM-resident API is named Source (Source.java), which seems too generic. I think we should rename it to RamDocValues or something like that; suggestions welcome! Any comments, questions (rants :)) are very much appreciated. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-3100) IW.commit() writes but fails to fsync the N.fnx file
[ https://issues.apache.org/jira/browse/LUCENE-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-3100. - Resolution: Fixed Committed in revision 1104090. IW.commit() writes but fails to fsync the N.fnx file Key: LUCENE-3100 URL: https://issues.apache.org/jira/browse/LUCENE-3100 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-3100.patch In making a unit test for NRTCachingDir (LUCENE-3092) I hit this surprising bug! Because the new N.fnx file is written at the last minute along with the segments file, it's not included in the sis.files() that IW uses to figure out which files to sync. This bug means one could call IW.commit(), successfully, return, and then the machine could crash and when it comes back up your index could be corrupted. We should hopefully first fix TestCrash so that it hits this bug (maybe it needs more/better randomization?), then fix the bug -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034661#comment-13034661 ] Simon Willnauer commented on LUCENE-3092: - Mike I committed LUCENE-3100, you can go ahead :) NRTCachingDirectory, to buffer small segments in a RAMDir - Key: LUCENE-3092 URL: https://issues.apache.org/jira/browse/LUCENE-3092 Project: Lucene - Java Issue Type: Improvement Components: core/store Reporter: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch I created this simple Directory impl, whose goal is to reduce IO contention in a frequent-reopen NRT use case. The idea is: when reopening quickly, but not indexing that much content, you wind up with many small files created over time, which can stress the IO system, e.g. if merges and searching are also fighting for IO. So, NRTCachingDirectory puts these newly created files into a RAMDir, and only when they are merged into a too-large segment does it then write through to the real (delegate) directory. This lets you spend some RAM to reduce IO. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1421) Ability to group search results by field
[ https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034691#comment-13034691 ] Michael McCandless commented on LUCENE-1421: I added grouping queries to the nightly benchmarks (http://people.apache.org/~mikemccand/lucenebench) -- see TermGroup100/10K/1M. The F annotation is the day grouping queries first ran. Those queries are the same queries running as TermQuery, just with grouping turned on, on 3 randomly generated fields with 100, 10,000 and 1 million unique values. So we can gauge the perf hit by comparing to TermQuery each night. I use the CachingCollector. First off, I'm impressed that the perf hit for grouping is not too bad:

||Query||QPS||Slowdown||
|TermQuery (baseline)|30.72|0|
|TermGroup100|13.59|2.26|
|TermGroup10K|13.2|2.34|
|TermGroup1M|12.15|2.53|

I had expected we'd pay a bigger perf hit! Second, the more unique groups you have, the slower grouping gets, but that multiplier really isn't so bad -- the 1M unique groups case is only 10.6% slower than the 100 unique groups case. Remember, though, that these groups are randomly generated full-unicode strings, so real data could very well produce different results... Third, and this is insanity, the addition of grouping caused other unexpected changes. Most horribly, SpanNearQuery slowed down by ~12.2% (http://people.apache.org/~mikemccand/lucenebench/SpanNear.html), while other queries seem to get a bit faster. I think this is [frustratingly!] due to hotspot making different decisions about which code to optimize/inline. Similarly strange, when I added sorting (TermQuery sorting by title and date/time, E annotation in all graphs), I saw the variance in the unsorted TermQuery performance drop substantially.
I'm pretty sure this wide variance was due to hotspot's erratic decision making, but somehow the addition of sorting, while not changing TermQuery's mean QPS, caused hotspot to at least be somewhat more consistent in how it compiled the code. Maybe as we add more and more diverse queries to the benchmark we'll see hotspot behave more reasonably. Ability to group search results by field Key: LUCENE-1421 URL: https://issues.apache.org/jira/browse/LUCENE-1421 Project: Lucene - Java Issue Type: New Feature Components: core/search Reporter: Artyom Sokolov Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-1421.patch, LUCENE-1421.patch, lucene-grouping.patch It would be awesome to group search results by a specified field. Some functionality was provided for Apache Solr, but I think it should be done in Core Lucene. There could be some useful information about collapsed data, like total hits, total count and so on. Thanks, Artyom -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3112) Add IW.add/updateDocuments to support nested documents
Add IW.add/updateDocuments to support nested documents -- Key: LUCENE-3112 URL: https://issues.apache.org/jira/browse/LUCENE-3112 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 I think nested documents (LUCENE-2454) is a very compelling addition to Lucene. It's also a popular (many votes) issue. Beyond supporting nested document querying, which is already an incredible addition since it preserves the relational model on indexing normalized content (eg, DB tables, XML docs), LUCENE-2454 should also enable speedups in grouping implementation when you group by a nested field. For the same reason, it can also enable very fast post-group facet counting impl (LUCENE-3097) when you want to count(distinct(nestedField)), instead of unique documents, as your identifier. I expect many apps that use faceting need this ability (to count(distinct(nestedField)) not distinct(docID)). To support these use cases, I believe the only core change needed is the ability to atomically add or update multiple documents, which you cannot do today since in between add/updateDocument calls a flush (eg due to commit or getReader()) could occur. This new API (addDocuments(Iterable<Document>), updateDocuments(Term delTerm, Iterable<Document>)) would also further guarantee that the documents are assigned sequential docIDs in the order the iterator provided them, and that the docIDs all reside in one segment. Segment merging never splits segments apart, so this invariant would hold even as merges/optimizes take place.
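The atomicity guarantee described above can be modeled with a small standalone Java sketch (illustrative class names, not the real IndexWriter API): a batch add happens under one lock, so no flush can land between the documents of a batch, and their docIDs stay sequential within one segment.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only -- not the real IndexWriter. It models the
// guarantee the proposed addDocuments(Iterable<Document>) API would give:
// all documents of one batch get sequential docIDs and land in one segment,
// because no flush can occur in the middle of the batch.
class SketchWriter {
    static class Segment { final List<String> docs = new ArrayList<>(); }

    private final List<Segment> segments = new ArrayList<>();
    private Segment current = new Segment();
    private int nextDocId = 0;

    // Atomically add a batch: synchronized so a concurrent flush() cannot
    // split the batch across segments.
    synchronized int[] addDocuments(Iterable<String> docs) {
        List<Integer> ids = new ArrayList<>();
        for (String doc : docs) {
            current.docs.add(doc);
            ids.add(nextDocId++);
        }
        return ids.stream().mapToInt(Integer::intValue).toArray();
    }

    // A flush starts a new segment; with per-document adds a flush could
    // fire between two related documents, which addDocuments prevents.
    synchronized void flush() {
        segments.add(current);
        current = new Segment();
    }

    synchronized int segmentCount() { return segments.size() + 1; }
}
```

With per-document addDocument calls, a flush triggered by commit() or getReader() could land between the parent and its children; the batched call makes that window disappear.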
[jira] [Updated] (LUCENE-3112) Add IW.add/updateDocuments to support nested documents
[ https://issues.apache.org/jira/browse/LUCENE-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-3112: --- Attachment: LUCENE-3112.patch Initial patch. It's not done yet (needs tests, and the nocommit needs to be addressed). Add IW.add/updateDocuments to support nested documents -- Key: LUCENE-3112 URL: https://issues.apache.org/jira/browse/LUCENE-3112 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3112.patch
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034702#comment-13034702 ] Michael McCandless commented on LUCENE-2454: I think this is a very important addition to Lucene, so let's get this done! I just opened LUCENE-3112, to add IW.add/updateDocuments, which would atomically add Documents produced by an iterator, and ensure they all wind up in the same segment. I think this is the only core change necessary for this feature? Ie, all else can be built on top of Lucene once LUCENE-3112 is committed? Nested Document query support - Key: LUCENE-2454 URL: https://issues.apache.org/jira/browse/LUCENE-2454 Project: Lucene - Java Issue Type: New Feature Components: core/search Affects Versions: 3.0.2 Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: LuceneNestedDocumentSupport.zip A facility for querying nested documents in a Lucene index as outlined in http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene
[jira] [Commented] (LUCENE-3110) ASCIIFoldingFilter wrongly folds german Umlauts
[ https://issues.apache.org/jira/browse/LUCENE-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034709#comment-13034709 ] Robert Muir commented on LUCENE-3110: - Hi, these characters are not German umlauts. They are Unicode characters used by a number of languages. The purpose of ASCIIFolding is to do simple accent-stripping. ASCIIFoldingFilter wrongly folds german Umlauts --- Key: LUCENE-3110 URL: https://issues.apache.org/jira/browse/LUCENE-3110 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Affects Versions: 3.1 Reporter: Michael Gaber The German umlauts are currently mapped as follows: Ä/ä = A/a, Ö/ö = O/o, Ü/ü = U/u. The correct mapping would be: Ä/ä = Ae/ae, Ö/ö = Oe/oe, Ü/ü = Ue/ue. So the corresponding rows in the switch statement should be moved down to the ae/oe/ue positions.
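The distinction Robert draws can be illustrated with a standalone sketch (a hypothetical helper, not the ASCIIFoldingFilter code): plain accent stripping maps ä to a, while the German transliteration convention the reporter wants maps ä to ae.

```java
import java.util.Map;

// Sketch contrasting the two folding conventions discussed above.
// ASCIIFoldingFilter-style folding is plain accent stripping (ä -> a),
// while German orthography expects transliteration (ä -> ae).
// This standalone helper is illustrative, not Lucene code.
class UmlautFolding {
    static final Map<Character, String> ACCENT_STRIP =
        Map.of('ä', "a", 'ö', "o", 'ü', "u", 'Ä', "A", 'Ö', "O", 'Ü', "U");
    static final Map<Character, String> GERMAN =
        Map.of('ä', "ae", 'ö', "oe", 'ü', "ue",
               'Ä', "Ae", 'Ö', "Oe", 'Ü', "Ue", 'ß', "ss");

    // Replace each character via the chosen table; pass others through.
    static String fold(String s, Map<Character, String> table) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(table.getOrDefault(c, String.valueOf(c)));
        }
        return sb.toString();
    }
}
```

The same input string produces different index terms under the two tables, which is why a language-neutral filter cannot adopt the German-specific mapping without breaking other languages.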
[jira] [Commented] (LUCENE-3112) Add IW.add/updateDocuments to support nested documents
[ https://issues.apache.org/jira/browse/LUCENE-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034711#comment-13034711 ] Simon Willnauer commented on LUCENE-3112: - bq. Initial patch. Nice simple idea! I like the refactorings into pre/postUpdate - looks much cleaner. Yet, I think you should push the document iteration etc. into DWPT to actually apply the delTerm only once, to make it really atomic. I also wonder if we should allow multiple delTerms, e.g. Tuple<DelTerm, Document>; otherwise you would be bound to one delTerm per collection, but what if you want to remove only one of the sub-documents? So if we had those tuples, you would really want to push the iteration into DWPT to make a final finishDocument(Term[] terms) call pushing the terms into a single DeleteItem. Add IW.add/updateDocuments to support nested documents -- Key: LUCENE-3112 URL: https://issues.apache.org/jira/browse/LUCENE-3112 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3112.patch
[jira] [Commented] (LUCENE-1421) Ability to group search results by field
[ https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034714#comment-13034714 ] Martijn van Groningen commented on LUCENE-1421: --- bq. I added grouping queries to the nightly benchmarks Nice! Are the regular sort and group sort different in these test cases? Do you think that when new features are added, these also need to be added to this test suite? Or is this performance test suite just for the basic features? Ability to group search results by field Key: LUCENE-1421 URL: https://issues.apache.org/jira/browse/LUCENE-1421 Project: Lucene - Java Issue Type: New Feature Components: core/search Reporter: Artyom Sokolov Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-1421.patch, LUCENE-1421.patch, lucene-grouping.patch
[jira] [Commented] (LUCENE-3105) String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names
[ https://issues.apache.org/jira/browse/LUCENE-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034723#comment-13034723 ] Uwe Schindler commented on LUCENE-3105: --- Yes, it's gonna be fixed, see the linked issue LUCENE-2548. The biggest problem is Solr at the moment. The other things are minor identity-vs.-equals issues in FieldCache. String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names Key: LUCENE-3105 URL: https://issues.apache.org/jira/browse/LUCENE-3105 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.1 Reporter: Mark Kristensson Attachments: LUCENE-3105.patch We have one index with several hundred thousand unique field names (we're optimistic that Lucene 4.0 is flexible enough to allow us to change our index design...) and found that opening an index writer and closing an index reader results in horribly slow performance on that one index. I have isolated the problem down to the calls to String.intern() that are used to allow for quick string comparisons of field names throughout Lucene. These String.intern() calls are unnecessary and can be replaced with a hashmap lookup. In fact, StringHelper.java has its own hashmap implementation that it uses in conjunction with String.intern(). Rather than using a one-off hashmap, I've elected to use a ConcurrentHashMap in this patch.
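The replacement the reporter describes — a map-based intern pool instead of String.intern() — can be sketched as follows (illustrative class, not the actual StringHelper patch):

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a ConcurrentHashMap-based intern pool. Interned strings can
// still be compared by identity (==) as Lucene's field-name comparisons
// require, but lookups avoid the JVM's global intern table.
// Illustrative only; the real change is in StringHelper.java.
class SimpleStringInterner {
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    // Returns the canonical instance for s, installing s if absent.
    // putIfAbsent is atomic, so concurrent callers agree on one instance.
    String intern(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing == null ? s : existing;
    }
}
```

Because putIfAbsent is atomic, two threads interning equal strings always get back the same object, preserving the == comparisons downstream code relies on.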
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034726#comment-13034726 ] Mark Harwood commented on LUCENE-2454: -- bq. I think this is the only core change necessary for this feature? Yup. A same-segment indexing guarantee is all that is required. Nested Document query support - Key: LUCENE-2454 URL: https://issues.apache.org/jira/browse/LUCENE-2454 Project: Lucene - Java Issue Type: New Feature Components: core/search Affects Versions: 3.0.2 Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: LuceneNestedDocumentSupport.zip
[jira] [Resolved] (LUCENE-2736) Wrong implementation of DocIdSetIterator.advance
[ https://issues.apache.org/jira/browse/LUCENE-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera resolved LUCENE-2736. Resolution: Fixed Fix Version/s: 4.0, 3.2 Lucene Fields: [New, Patch Available] (was: [New]) Thanks Doron - I changed the javadocs as you suggest. Committed revision 1104159 (3x). Committed revision 1104167 (trunk). Thanks Hardy for reporting that! Wrong implementation of DocIdSetIterator.advance - Key: LUCENE-2736 URL: https://issues.apache.org/jira/browse/LUCENE-2736 Project: Lucene - Java Issue Type: Bug Components: core/search Affects Versions: 3.2, 4.0 Reporter: Hardy Ferentschik Assignee: Shai Erera Fix For: 3.2, 4.0 Attachments: LUCENE-2736.patch Implementations of {{DocIdSetIterator}} behave differently when advance is called. Taking the following test for {{OpenBitSet}}, {{DocIdBitSet}} and {{SortedVIntList}}, only {{SortedVIntList}} passes the test:
{code:title=org.apache.lucene.search.TestDocIdSet.java|borderStyle=solid}
...
public void testAdvanceWithOpenBitSet() throws IOException {
  DocIdSet idSet = new OpenBitSet(new long[] { 1121 }, 1); // bits 0, 5, 6, 10
  assertAdvance(idSet);
}

public void testAdvanceDocIdBitSet() throws IOException {
  BitSet bitSet = new BitSet();
  bitSet.set(0);
  bitSet.set(5);
  bitSet.set(6);
  bitSet.set(10);
  DocIdSet idSet = new DocIdBitSet(bitSet);
  assertAdvance(idSet);
}

public void testAdvanceWithSortedVIntList() throws IOException {
  DocIdSet idSet = new SortedVIntList(0, 5, 6, 10);
  assertAdvance(idSet);
}

private void assertAdvance(DocIdSet idSet) throws IOException {
  DocIdSetIterator iter = idSet.iterator();
  int docId = iter.nextDoc();
  assertEquals("First doc id should be 0", 0, docId);
  docId = iter.nextDoc();
  assertEquals("Second doc id should be 5", 5, docId);
  docId = iter.advance(5);
  assertEquals("Advancing iterator should return the next doc id", 6, docId);
}
{code}
The javadoc for {{advance}} says: {quote} Advances to the first *beyond* the current whose document number is greater than or equal to _target_. {quote} This seems to indicate that {{SortedVIntList}} behaves correctly, whereas the other two don't. Just looking at the {{DocIdBitSet}} implementation, advance is implemented as: {code} bitSet.nextSetBit(target); {code} where the docs of {{nextSetBit}} say: {quote} Returns the index of the first bit that is set to true that occurs *on or after* the specified starting index {quote}
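The contract quoted above — advance past the current doc, then to the first doc whose ID is greater than or equal to target — can be sketched over a plain sorted int[] (illustrative, not the Lucene iterator):

```java
// Sketch of the advance() contract from the javadoc quoted above, over a
// sorted int[] of doc IDs: always move past the current doc, then to the
// first doc whose ID is >= target. With docs {0, 5, 6, 10}: after nextDoc()
// returns 5, advance(5) must return 6 -- the SortedVIntList behavior the
// test expects, and what nextSetBit(target) gets wrong (it can return the
// current doc again). Illustrative only, not Lucene code.
class IntArrayDocIdIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int[] docs;
    private int pos = -1;

    IntArrayDocIdIterator(int... sortedDocs) { this.docs = sortedDocs; }

    int nextDoc() {
        pos++;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    int advance(int target) {
        do {
            pos++; // step beyond the current doc first
        } while (pos < docs.length && docs[pos] < target);
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }
}
```

The bug in the bitset-backed implementations is exactly the missing "step beyond current" before the >= target scan.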
[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-3102: --- Attachment: LUCENE-3102-factory.patch Patch against 3x which: * Adds factory method to CachingCollector, specializing on cacheScores * Clarify Collector.needScores() TODO There are two remaining issues, let's address them after we iterate on this patch. Few issues with CachingCollector Key: LUCENE-3102 URL: https://issues.apache.org/jira/browse/LUCENE-3102 Project: Lucene - Java Issue Type: Bug Components: core/search Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3102-factory.patch, LUCENE-3102.patch, LUCENE-3102.patch CachingCollector (introduced in LUCENE-1421) has a few issues: # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it. # It does not clear cachedScores + cachedSegs upon exceeding RAM limits # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity # This check if (base + nextLength > maxDocsToCache) (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them? Also: * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if the CachingCollector ctor already takes a boolean cacheScores? I think it's better defined explicitly than implicitly? * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189) * I think it's a useful collector, which stands on its own and is not specific to grouping. Can we move it to core? 
* How about using OpenBitSet instead of int[] for doc IDs? ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries ** NOTE: OpenBitSet can be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such a Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking we can make CachingCollector *optionally* wrap another Collector, and then someone can reuse it by setting the RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores. * I think a set of dedicated unit tests for this class alone would be good. That's it so far. Perhaps, if we do all of the above, more things will pop up.
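The caching behavior under discussion — record matching doc IDs up to a limit, invalidate the cache when the limit is exceeded, replay later — can be sketched standalone (a hypothetical class, not the real CachingCollector):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

// Minimal standalone sketch of the CachingCollector idea: record matching
// doc IDs up to a cap, then replay them to a downstream consumer. If the
// cap is exceeded the cache is invalidated and freed (mirroring the "clear
// cached data upon exceeding RAM limits" point above). Not the Lucene class.
class DocIdCache {
    private final int maxDocsToCache;
    private List<Integer> cached = new ArrayList<>();

    DocIdCache(int maxDocsToCache) { this.maxDocsToCache = maxDocsToCache; }

    void collect(int docId) {
        if (cached == null) return;            // already over the limit
        if (cached.size() >= maxDocsToCache) { // stop caching, free memory
            cached = null;
            return;
        }
        cached.add(docId);
    }

    boolean isCached() { return cached != null; }

    // Replay forwards doc IDs in collection order; an out-of-order wrapped
    // collector would cache them out of order too, which is issue #1 above.
    void replay(IntConsumer downstream) {
        if (cached == null) throw new IllegalStateException("cache exceeded limit");
        for (int d : cached) downstream.accept(d);
    }
}
```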
[jira] [Updated] (LUCENE-3084) MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3084: -- Attachment: LUCENE-3084-trunk-only.patch Further refactoring: - I was able to move more internal ArrayList-modifying code out of IndexWriter - the returned List view is now unmodifiable! - It's now possible to also add a Set view for better contains(). ...working... MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos -- Key: LUCENE-3084 URL: https://issues.apache.org/jira/browse/LUCENE-3084 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084.patch SegmentInfos carries a bunch of fields beyond the list of SIs, but for merging purposes these fields are unused. We should cut over to List<SI> instead.
[jira] [Commented] (LUCENE-3112) Add IW.add/updateDocuments to support nested documents
[ https://issues.apache.org/jira/browse/LUCENE-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034730#comment-13034730 ] Robert Muir commented on LUCENE-3112: - We should really think through the consequences of this, though. If core features of Lucene become implemented in a way that they rely upon these sequential docids, we then lock ourselves out of future optimizations such as reordering docids for optimal index compression. Add IW.add/updateDocuments to support nested documents -- Key: LUCENE-3112 URL: https://issues.apache.org/jira/browse/LUCENE-3112 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3112.patch
[jira] [Commented] (LUCENE-3112) Add IW.add/updateDocuments to support nested documents
[ https://issues.apache.org/jira/browse/LUCENE-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034734#comment-13034734 ] Jason Rutherglen commented on LUCENE-3112: -- I think perhaps like a Hadoop input format split, we can define meta-data at the segment level as to where the documents live so that if one is 'splitting' the index, as is being implemented with HBase, the 'splitter' can be 'smart'. Add IW.add/updateDocuments to support nested documents -- Key: LUCENE-3112 URL: https://issues.apache.org/jira/browse/LUCENE-3112 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3112.patch
[jira] [Assigned] (LUCENE-3111) TestFSTs.testRandomWords failure
[ https://issues.apache.org/jira/browse/LUCENE-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-3111: -- Assignee: Michael McCandless TestFSTs.testRandomWords failure Key: LUCENE-3111 URL: https://issues.apache.org/jira/browse/LUCENE-3111 Project: Lucene - Java Issue Type: Bug Reporter: selckin Assignee: Michael McCandless Priority: Minor Was running some while(1) tests on the docvalues branch (r1103705) and the following test failed: {code} [junit] Testsuite: org.apache.lucene.util.automaton.fst.TestFSTs [junit] Testcase: testRandomWords(org.apache.lucene.util.automaton.fst.TestFSTs): FAILED [junit] expected:<771> but was:<TwoLongs:771,771> [junit] junit.framework.AssertionFailedError: expected:<771> but was:<TwoLongs:771,771> [junit] at org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.verifyUnPruned(TestFSTs.java:540) [junit] at org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.doTest(TestFSTs.java:496) [junit] at org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.doTest(TestFSTs.java:359) [junit] at org.apache.lucene.util.automaton.fst.TestFSTs.doTest(TestFSTs.java:319) [junit] at org.apache.lucene.util.automaton.fst.TestFSTs.testRandomWords(TestFSTs.java:940) [junit] at org.apache.lucene.util.automaton.fst.TestFSTs.testRandomWords(TestFSTs.java:915) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211) [junit] [junit] [junit] Tests run: 7, Failures: 1, Errors: 0, Time elapsed: 7.628 sec [junit] [junit] - Standard Error - [junit] NOTE: Ignoring nightly-only test method 'testBigSet' [junit] NOTE: reproduce with: ant test -Dtestcase=TestFSTs -Dtestmethod=testRandomWords -Dtests.seed=-269475578956012681:0 [junit] NOTE: test params are: codec=PreFlex, locale=ar, timezone=America/Blanc-Sablon [junit] NOTE: all tests run in this JVM: [junit] 
[TestToken, TestCodecs, TestIndexReaderReopen, TestIndexWriterMerging, TestNoDeletionPolicy, TestParallelReaderEmptyIndex, TestParallelTermEnum, TestPerSegmentDeletes, TestSegmentReader, TestSegmentTermDocs, TestStressAdvance, TestTermVectorsReader, TestSurrogates, TestMultiFieldQueryParser, TestAutomatonQuery, TestBooleanScorer, TestFuzzyQuery, TestMultiTermConstantScore, TestNumericRangeQuery64, TestPositiveScoresOnlyCollector, TestPrefixFilter, TestQueryTermVector, TestScorerPerf, TestSloppyPhraseQuery, TestSpansAdvanced, TestWindowsMMap, TestRamUsageEstimator, TestSmallFloat, TestUnicodeUtil, TestFSTs] [junit] NOTE: Linux 2.6.37-gentoo amd64/Sun Microsystems Inc. 1.6.0_25 (64-bit)/cpus=8,threads=1,free=137329960,total=208207872 [junit] - --- [junit] TEST org.apache.lucene.util.automaton.fst.TestFSTs FAILED {code} I am not able to reproduce it.
[jira] [Updated] (SOLR-2119) IndexSchema should log warning if analyzer is declared with charfilter/tokenizer/tokenfilter out of order
[ https://issues.apache.org/jira/browse/SOLR-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated SOLR-2119: - Fix Version/s: 4.0 3.2 IndexSchema should log warning if analyzer is declared with charfilter/tokenizer/tokenfilter out of order -- Key: SOLR-2119 URL: https://issues.apache.org/jira/browse/SOLR-2119 Project: Solr Issue Type: Improvement Components: Schema and Analysis Reporter: Hoss Man Fix For: 3.2, 4.0 There seems to be a segment of the user population that has a hard time understanding the distinction between a charfilter, a tokenizer, and a tokenfilter -- while we can certainly try to improve the documentation about what exactly each does, and when they take effect in the analysis chain, one other thing we should do is try to educate people when they construct their analyzer in a way that doesn't make any sense. At the moment, some people are attempting to do things like move the Foo <tokenFilter/> before the <tokenizer/> to try and get certain behavior ... at a minimum we should log a warning in this case that doing that doesn't have the desired effect (we could easily make such a situation fail to initialize, but I'm not convinced that would be the best course of action, since some people may have schemas where they have declared a charFilter or tokenizer out of order relative to their tokenFilters, but are still getting correct results that work for them, and breaking their instance on upgrade doesn't seem like it would be productive)
[jira] [Commented] (SOLR-2119) IndexSchema should log warning if analyzer is declared with charfilter/tokenizer/tokenfilter out of order
[ https://issues.apache.org/jira/browse/SOLR-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034740#comment-13034740 ] Michael McCandless commented on SOLR-2119: -- +1 for hard error. In general, for problems we can detect at startup we should not start the server. Users rarely see/do something about the warnings. I think this would be a good service to those users who trip the hard error on upgrade: it means Solr is not doing what they thought they asked it to do. IndexSchema should log warning if analyzer is declared with charfilter/tokenizer/tokenfilter out of order -- Key: SOLR-2119 URL: https://issues.apache.org/jira/browse/SOLR-2119 Project: Solr Issue Type: Improvement Components: Schema and Analysis Reporter: Hoss Man Fix For: 3.2, 4.0
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Solr Config XML DTD's
https://issues.apache.org/jira/browse/SOLR-2119 is a good example where we are failing to catch mis-configuration on startup. Is there some way we can baby-step here? E.g. use one of these XML validation packages, incrementally, on only sub-strings from the XML? (Or, simpler, just do the checking ourselves with custom code.) Mike http://blog.mikemccandless.com On Wed, May 4, 2011 at 10:50 PM, Michael Sokolov soko...@ifactory.com wrote: I'm not sure you will find anyone wanting to put in this effort now, but another suggestion for a general approach might be: 1) very basic static analysis to catch what you can - this should be a pretty minimal effort only, given what can reasonably be achieved; 2) throw runtime errors as Hoss says (probably already doing this well enough, but maybe some incremental improvements are needed?); 3) an option to run a configtest like httpd provides, which preloads all declared handlers/plugins/modules etc., instantiates them, and gives them an opportunity to read their config and throw whatever errors they find. This way you can set a standard (error on unrecognized parameter, say) in some core areas, and distribute the effort. This is a hugely useful sanity check to be able to run when you want to make config changes and not have your server fall over when it starts (or worse - later). -Mike kibitzer Sokolov On 5/4/2011 6:55 PM, Chris Hostetter wrote: As I said: any improvements to help catch the mistakes we can identify would be great, but we should maintain perspective of the effort/gain tradeoff given that there is likely nothing we can do about the basic problem of a string that won't be evaluated until runtime - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
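Mike's "simpler" option — checking sub-structures ourselves with custom code rather than full DTD/schema validation — can be sketched with nothing beyond the JDK's own XML parser. The element names below mirror Solr's analyzer config, but this checker is only an illustration of the idea, not Solr code:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

// Toy startup check for the SOLR-2119 problem: in an <analyzer> element,
// charFilters must come before the (single) tokenizer, which must come
// before any filters. Unknown or out-of-order children fail the check.
public class AnalyzerOrderCheck {

    private static int rank(String tag) {
        switch (tag) {
            case "charFilter": return 0;
            case "tokenizer":  return 1;
            case "filter":     return 2;
            default:           return -1;  // unrecognized element
        }
    }

    public static boolean isWellOrdered(String analyzerXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                    analyzerXml.getBytes(StandardCharsets.UTF_8)));
            NodeList kids = doc.getDocumentElement().getChildNodes();
            int last = -1, tokenizers = 0;
            for (int i = 0; i < kids.getLength(); i++) {
                if (kids.item(i).getNodeType() != Node.ELEMENT_NODE) continue;
                int r = rank(kids.item(i).getNodeName());
                if (r < 0 || r < last) return false;  // unknown or out of order
                if (r == 1) tokenizers++;
                last = r;
            }
            return tokenizers == 1;  // exactly one tokenizer required
        } catch (Exception e) {
            return false;  // malformed XML also fails the startup check
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellOrdered(
            "<analyzer><charFilter/><tokenizer/><filter/></analyzer>"));
        System.out.println(isWellOrdered(
            "<analyzer><filter/><tokenizer/></analyzer>"));
    }
}
```

A real implementation would hook a check like this into schema loading and (per the "hard error" consensus above) refuse to start rather than log.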
[jira] [Commented] (LUCENE-3112) Add IW.add/updateDocuments to support nested documents
[ https://issues.apache.org/jira/browse/LUCENE-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034750#comment-13034750 ] Michael McCandless commented on LUCENE-3112: bq. Yet, I think you should push the document iteration etc. into DWPT to actually apply the delTerm only once to make it really atomic. Ahh good point -- it's wrong just passing that delTerm down N times, too. I'll fix. bq. I also wonder if we should allow multiple delTerms, e.g. Tuple<DelTerm, Document>; otherwise you would be bound to one delTerm per collection, but what if you want to remove only one of the sub-documents? So, this won't work today w/ nested querying, if I understand it right. Ie, if you only update one of the subs, now your subdocs are no longer sequential (nor in one segment). So I think design for today here...? Someday, when we implement incremental field updates correctly, so that updates are written as stacked segments against the original segment containing the document, at that point I think we can add an API that lets you update multiple docs atomically? Add IW.add/updateDocuments to support nested documents -- Key: LUCENE-3112 URL: https://issues.apache.org/jira/browse/LUCENE-3112 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3112.patch I think nested documents (LUCENE-2454) is a very compelling addition to Lucene. It's also a popular (many votes) issue. Beyond supporting nested document querying, which is already an incredible addition since it preserves the relational model on indexing normalized content (eg, DB tables, XML docs), LUCENE-2454 should also enable speedups in grouping implementation when you group by a nested field.
For the same reason, it can also enable very fast post-group facet counting impl (LUCENE-3097) when you want to count(distinct(nestedField)), instead of unique documents, as your identifier. I expect many apps that use faceting need this ability (to count(distinct(nestedField)) not distinct(docID)). To support these use cases, I believe the only core change needed is the ability to atomically add or update multiple documents, which you cannot do today since in between add/updateDocument calls a flush (eg due to commit or getReader()) could occur. This new API (addDocuments(Iterable<Document>), updateDocuments(Term delTerm, Iterable<Document>)) would also further guarantee that the documents are assigned sequential docIDs in the order the iterator provided them, and that the docIDs all reside in one segment. Segment merging never splits segments apart, so this invariant would hold even as merges/optimizes take place. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
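The sequential-docID guarantee proposed above is easy to picture with a toy writer. This is a model of the proposed semantics only — not Lucene's IndexWriter — where "atomic" means one call reserves a contiguous docID block with no interleaving (in the real API the corresponding hazard is a flush occurring between addDocument calls):

```java
// Toy model of the proposed addDocuments() guarantee: every document passed
// in one call receives contiguous, sequential docIDs, and concurrent calls
// can never interleave their blocks.
public class ToyWriter {
    private int nextDocId = 0;

    // Atomically reserves and returns a contiguous block of docIDs.
    public synchronized int[] addDocuments(int count) {
        int[] ids = new int[count];
        for (int i = 0; i < count; i++) {
            ids[i] = nextDocId++;
        }
        return ids;
    }

    public static void main(String[] args) {
        ToyWriter w = new ToyWriter();
        int[] block = w.addDocuments(3);  // e.g. a parent doc plus two children
        System.out.println(block[0] + ".." + block[2]);
    }
}
```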
[jira] [Resolved] (LUCENE-3110) ASCIIFoldingFilter wrongly folds german Umlauts
[ https://issues.apache.org/jira/browse/LUCENE-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe resolved LUCENE-3110. - Resolution: Won't Fix See LUCENE-1696, where Robert Muir advocates using an ICU collation filter instead of locale-sensitive accent stripping. ASCIIFoldingFilter wrongly folds german Umlauts --- Key: LUCENE-3110 URL: https://issues.apache.org/jira/browse/LUCENE-3110 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Affects Versions: 3.1 Reporter: Michael Gaber the german umlauts are currently mapped as follows. Ä/ä = A/a Ö/ö = O/o Ü/ü = U/u the correct mapping would be Ä/ä = Ae/ae Ö/ö = Oe/oe Ü/ü = Ue/ue so the corresponding rows in the switch statement should be moved down to the ae/oe/ue positions. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3110) ASCIIFoldingFilter wrongly folds german Umlauts
[ https://issues.apache.org/jira/browse/LUCENE-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034764#comment-13034764 ] Robert Muir commented on LUCENE-3110: - Another option is to use the German2 stemmer from Snowball, which is a variation on the German stemmer designed to handle these cases. If you use GermanAnalyzer in 3.1, it uses this stemmer by default. -- This message is automatically generated by JIRA.
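For reference, the ae/oe/ue expansion the reporter asks for is a simple character mapping. The sketch below illustrates that mapping (plus the customary ss for the sharp s, which is my addition, not part of the issue); it is not the ASCIIFoldingFilter change itself, and as the comments above note, locale-blind folding is exactly why this was resolved Won't Fix:

```java
// Illustrative German-aware folding: expands umlauts the way LUCENE-3110
// proposes (a-umlaut -> ae, etc.) plus sharp-s -> ss. Everything else is
// passed through. A sketch only, not Lucene's ASCIIFoldingFilter.
public class GermanFold {
    public static String fold(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\u00E4': out.append("ae"); break;  // a-umlaut
                case '\u00F6': out.append("oe"); break;  // o-umlaut
                case '\u00FC': out.append("ue"); break;  // u-umlaut
                case '\u00C4': out.append("Ae"); break;  // A-umlaut
                case '\u00D6': out.append("Oe"); break;  // O-umlaut
                case '\u00DC': out.append("Ue"); break;  // U-umlaut
                case '\u00DF': out.append("ss"); break;  // sharp s
                default:       out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("M\u00FCller Stra\u00DFe"));
    }
}
```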
[jira] [Reopened] (SOLR-2445) unknown handler: standard
[ https://issues.apache.org/jira/browse/SOLR-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi reopened SOLR-2445: -- Seems that no one objects about applying the patch to 3.1.1. Reopening. unknown handler: standard - Key: SOLR-2445 URL: https://issues.apache.org/jira/browse/SOLR-2445 Project: Solr Issue Type: Bug Affects Versions: 1.4.1, 3.1, 3.2, 4.0 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi Priority: Minor Fix For: 3.2, 4.0 Attachments: SOLR-2445.patch, qt-form-jsp.patch To reproduce the problem using example config, go form.jsp, use standard for qt (it is default) then click Search. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3113) fix analyzer bugs found by MockTokenizer
fix analyzer bugs found by MockTokenizer Key: LUCENE-3113 URL: https://issues.apache.org/jira/browse/LUCENE-3113 Project: Lucene - Java Issue Type: Bug Reporter: Robert Muir Attachments: LUCENE-3113.patch In LUCENE-3064, we beefed up MockTokenizer with assertions, and I've switched over the analysis tests to use MockTokenizer for better coverage. However, this found a few bugs (one of which is LUCENE-3106): * incrementToken() after it returns false in CommonGramsQueryFilter, HyphenatedWordsFilter, ShingleFilter, SynonymFilter * missing end() implementation for PrefixAwareTokenFilter * double reset() in QueryAutoStopWordAnalyzer and ReusableAnalyzerBase * missing correctOffset()s in MockTokenizer itself. I think it would be nice to just fix all the bugs on one issue... I've fixed everything except Shingle and Synonym -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3113) fix analyzer bugs found by MockTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-3113: Component/s: modules/analysis Fix Version/s: 4.0 3.2
[jira] [Updated] (LUCENE-3113) fix analyzer bugs found by MockTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-3113: Attachment: LUCENE-3113.patch Attached is a patch; the synonyms and shingles tests still fail.
[jira] [Resolved] (SOLR-2445) unknown handler: standard
[ https://issues.apache.org/jira/browse/SOLR-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi resolved SOLR-2445. -- Resolution: Fixed Fix Version/s: 3.1.1 Committed revision 1104270 for 3.1.1. Thanks Gabriele for your patience!
[jira] [Updated] (LUCENE-3113) fix analyzer bugs found by MockTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-3113: Attachment: LUCENE-3113.patch Updated patch, fixing the bugs in SynonymFilter and ShingleFilter. Also, I found two more bugs: ShingleAnalyzerWrapper was double-resetting, and PrefixAndSuffixAwareTokenFilter was missing end() as well.
SpanNearQuery - inOrder parameter
I attach a JUnit test which shows strange behaviour of the inOrder parameter on the SpanNearQuery constructor, using Lucene 2.9.4. My understanding of this parameter is that true forces the order and false doesn't care about the order. Using true always works. However, using false works fine when the terms in the query are distinct, but if they are equivalent, e.g. searching for "john john", I do not get the expected results. The workaround seems to be to always use true for queries with repeated terms. Any help? Thanks Greg

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocsCollector;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Assert;
import org.junit.Test;

public class TestSpanNearQueryInOrder {

  @Test
  public void testSpanNearQueryInOrder() throws Exception {
    RAMDirectory directory = new RAMDirectory();
    IndexWriter writer = new IndexWriter(directory,
        new StandardAnalyzer(Version.LUCENE_29), true,
        IndexWriter.MaxFieldLength.UNLIMITED);
    TopDocsCollector collector = TopScoreDocCollector.create(3, false);

    Document doc = new Document(); // DOC1
    doc.add(new Field("text", /* doc text elided in the archive */ "",
        Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
    doc = new Document(); // DOC2
    doc.add(new Field("text", /* doc text elided in the archive */ "",
        Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
    doc = new Document(); // DOC3
    doc.add(new Field("text", /* doc text elided in the archive */ "",
        Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.optimize();
    writer.close();

    IndexSearcher searcher = new IndexSearcher(directory, false);
    SpanQuery[] clauses = new SpanQuery[2];
    clauses[0] = new SpanTermQuery(new Term("text", "john"));
    clauses[1] = new SpanTermQuery(new Term("text", "john"));
    // Don't care about order, so setting inOrder = false
    SpanNearQuery q = new SpanNearQuery(clauses, 1, false);
    searcher.search(q, collector);
    // This assert fails -- 3 docs are returned. Expecting only DOC2 and DOC3
    Assert.assertEquals("Check 2 results", 2, collector.getTotalHits());

    collector = TopScoreDocCollector.create(3, false);
    clauses = new SpanQuery[2];
    clauses[0] = new SpanTermQuery(new Term("text", "john"));
    clauses[1] = new SpanTermQuery(new Term("text", "john"));
    // Don't care about order, so setting inOrder = false
    q = new SpanNearQuery(clauses, 0, false);
    searcher.search(q, collector);
    // This assert fails -- 3 docs are returned. Expecting only DOC2
    Assert.assertEquals("Check 1 result", 1, collector.getTotalHits());
  }
}

TestSpanNearQueryInOrder.java Please consider the environment before printing this email. This message should be regarded as confidential. If you have received this email in error please notify the sender and destroy it immediately. Statements of intent shall only become binding when confirmed in hard copy by an authorised signatory. The contents of this email may relate to dealings with other companies within the Detica Limited group of companies. Detica Limited is registered in England under No: 1337451. Registered offices: Surrey Research Park, Guildford, Surrey, GU2 7YP, England. TestSpanNearQueryInOrder.java Description: TestSpanNearQueryInOrder.java - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
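The behaviour Greg reports is consistent with unordered near-matching pairing a repeated query term with the *same* token position twice. The toy position checker below is not Lucene's Spans code — it is a minimal model showing why a distinct-positions requirement matters when both clauses are the same term:

```java
// Toy model of unordered two-clause proximity matching over term positions.
// With a repeated query term ("john john"), both clauses share one position
// list; if the matcher may pair a position with itself, a document with a
// single "john" spuriously matches.
public class UnorderedNear {

    // positionsA/positionsB: token positions of each clause's term in a doc.
    public static boolean matches(int[] positionsA, int[] positionsB,
                                  int slop, boolean requireDistinctPositions) {
        for (int a : positionsA) {
            for (int b : positionsB) {
                if (requireDistinctPositions && a == b) continue;
                int gap = Math.abs(a - b) - 1;  // tokens between the two terms
                if (gap <= slop) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] john = {5};  // document contains "john" once, at position 5
        // Naive pairing lets "john john" match a single "john":
        System.out.println(matches(john, john, 0, false));
        System.out.println(matches(john, john, 0, true));
    }
}
```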
[jira] [Commented] (LUCENE-3113) fix analyzer bugs found by MockTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034806#comment-13034806 ] Robert Muir commented on LUCENE-3113: - I think this patch is ready to commit; I'll wait and see if anyone feels like reviewing it :)
[jira] [Commented] (LUCENE-2091) Add BM25 Scoring to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034815#comment-13034815 ] Shrinath commented on LUCENE-2091: -- Hi, Don't be harsh if I am asking this in a wrong place, but could someone tell me if the linked patch is better than http://nlp.uned.es/~jperezi/Lucene-BM25/ Add BM25 Scoring to Lucene -- Key: LUCENE-2091 URL: https://issues.apache.org/jira/browse/LUCENE-2091 Project: Lucene - Java Issue Type: New Feature Components: modules/other Reporter: Yuval Feinstein Priority: Minor Fix For: 4.0 Attachments: BM25SimilarityProvider.java, LUCENE-2091.patch, persianlucene.jpg Original Estimate: 48h Remaining Estimate: 48h http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of Okapi-BM25 scoring in the Lucene framework, as an alternative to the standard Lucene scoring (which is a version of mixed boolean/TFIDF). I have refactored this a bit, added unit tests and improved the runtime somewhat. I would like to contribute the code to Lucene under contrib. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3113) fix analyzer bugs found by MockTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034816#comment-13034816 ] Uwe Schindler commented on LUCENE-3113: --- A quick check on the fixes in the implementations: all fine. I was just confused about PrefixAndSuffixAwareTF, but that's fine (Robert explained it to me -- these filters are very complicated from the code/class-hierarchy design *g*). I did not verify the tests; I assume it's just dumb search-and-replace.
[jira] [Commented] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034818#comment-13034818 ] Mark Miller commented on SOLR-2193: --- I've got some fixes for this, and I've started on some tests and other minor steps forward. I'll put it up before too long. Re-architect Update Handler --- Key: SOLR-2193 URL: https://issues.apache.org/jira/browse/SOLR-2193 Project: Solr Issue Type: Improvement Reporter: Mark Miller Assignee: Mark Miller Fix For: 4.0 Attachments: SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch The update handler needs an overhaul. A few goals I think we might want to look at: 1. Cleanup - drop DirectUpdateHandler(2) line - move to something like UpdateHandler, DefaultUpdateHandler 2. Expose the SolrIndexWriter in the api or add the proper abstractions to get done what we now do with special casing: if (directupdatehandler2) success else failish 3. Stop closing the IndexWriter and start using commit (still lazy IW init though). 4. Drop iwAccess, iwCommit locks and sync mostly at the Lucene level. 5. Keep NRT support in mind. 6. Keep microsharding in mind (maintain logical index as multiple physical indexes) 7. Address the current issues we face because multiple original/'reloaded' cores can have a different IndexWriter on the same index. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1421) Ability to group search results by field
[ https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034828#comment-13034828 ] Michael McCandless commented on LUCENE-1421: I'm only testing groupSort and sort by relevance now in the nightly bench. I'll add sort-by-title and groupSort-by-relevance cases too, so we test that. Hmm, though: this content set is alphabetized by title, I believe, so it's not really a good test. (I suspect that's why the TermQuery sorting by title is faster.) bq. Do you think when new features are added that these also need to be added to this test suite? Or is this performance test suite just for the basic features? Well, in general I'd love to have wider coverage in the nightly perf test... really it's only a start now. But there's no hard rule that we have to add new functions into the nightly bench... Ability to group search results by field Key: LUCENE-1421 URL: https://issues.apache.org/jira/browse/LUCENE-1421 Project: Lucene - Java Issue Type: New Feature Components: core/search Reporter: Artyom Sokolov Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-1421.patch, LUCENE-1421.patch, lucene-grouping.patch It would be awesome to group search results by a specified field. Some functionality was provided for Apache Solr, but I think it should be done in core Lucene. There could be some useful information about the collapsed data, like total hits and total count and so on. Thanks, Artyom -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034835#comment-13034835 ] Michael McCandless commented on LUCENE-3092: Thanks Simon; I'll commit soon... NRTCachingDirectory, to buffer small segments in a RAMDir - Key: LUCENE-3092 URL: https://issues.apache.org/jira/browse/LUCENE-3092 Project: Lucene - Java Issue Type: Improvement Components: core/store Reporter: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch I created this simple Directory impl, whose goal is to reduce IO contention in a frequent-reopen NRT use case. The idea is, when reopening quickly but not indexing that much content, you wind up with many small files created over time, which can stress the IO system, e.g. if merges and searching are also fighting for IO. So, NRTCachingDirectory puts these newly created files into a RAMDir, and only when they are merged into a too-large segment does it then write through to the real (delegate) directory. This lets you spend some RAM to reduce IO. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
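The write-through idea is easy to model outside Lucene: small newly created files live in RAM and spill to the real directory only when too large or on flush. The class and method names below are illustrative, not NRTCachingDirectory's actual API (which delegates to a real Directory and decides per segment size):

```java
import java.util.HashMap;
import java.util.Map;

// Toy write-through cache in the spirit of NRTCachingDirectory: files small
// enough to cache stay in RAM; larger ones go straight to "disk"; flush()
// migrates everything, as a commit would. Illustrative only.
public class ToyCachingDir {
    private final Map<String, byte[]> ram  = new HashMap<>();
    private final Map<String, byte[]> disk = new HashMap<>();
    private final int maxCachedFileBytes;

    public ToyCachingDir(int maxCachedFileBytes) {
        this.maxCachedFileBytes = maxCachedFileBytes;
    }

    public void createFile(String name, byte[] contents) {
        if (contents.length <= maxCachedFileBytes) {
            ram.put(name, contents);   // small: keep in RAM, no disk IO
        } else {
            disk.put(name, contents);  // too big: write through to disk
        }
    }

    public void flush() {              // e.g. on commit or merge
        disk.putAll(ram);
        ram.clear();
    }

    public boolean inRam(String name)  { return ram.containsKey(name); }
    public boolean onDisk(String name) { return disk.containsKey(name); }

    public static void main(String[] args) {
        ToyCachingDir d = new ToyCachingDir(1024);
        d.createFile("_0.frq", new byte[100]);
        System.out.println(d.inRam("_0.frq"));
    }
}
```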
[jira] [Commented] (LUCENE-3097) Post grouping faceting
[ https://issues.apache.org/jira/browse/LUCENE-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034836#comment-13034836 ] Michael McCandless commented on LUCENE-3097: Right, this'd mean all docs sharing a given group value are contiguous and in the same segment. The app would have to ensure this, in order to use a collector that takes advantage of it. Post grouping faceting -- Key: LUCENE-3097 URL: https://issues.apache.org/jira/browse/LUCENE-3097 Project: Lucene - Java Issue Type: New Feature Reporter: Martijn van Groningen Priority: Minor Fix For: 3.2, 4.0 This issue focuses on implementing post-grouping faceting. * How to handle multivalued fields. What field value to show with the facet. * What the facet counts should be based on: ** Facet counts can be based on the normal documents. Ungrouped counts. ** Facet counts can be based on the groups. Grouped counts. ** Facet counts can be based on the combination of group value and facet value. Matrix counts. And probably more implementation options. The first two methods are implemented in the SOLR-236 patch. For the first option it calculates a DocSet based on the individual documents from the query result. For the second option it calculates a DocSet for all the most relevant documents of a group. Once the DocSet is computed, the FacetComponent and StatsComponent use the DocSet to create facets and statistics. The last one is a bit more complex. I think it is best explained with an example. Let's say we search on travel offers:
||hotel||departure_airport||duration||
|Hotel a|AMS|5|
|Hotel a|DUS|10|
|Hotel b|AMS|5|
|Hotel b|AMS|10|
If we group by hotel and have a facet for airport, most end users expect (according to my experience, of course) the following airport facet: AMS: 2, DUS: 1. The above result can't be achieved by the first two methods. You either get counts AMS:3 and DUS:1, or 1 for both airports. -- This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034841#comment-13034841 ] Michael McCandless commented on LUCENE-3102: Patch looks great! But, can we rename curupto -> curUpto (and same for curbase)? Ie, so it matches the other camelCaseVariables we have here... Thank you! Few issues with CachingCollector Key: LUCENE-3102 URL: https://issues.apache.org/jira/browse/LUCENE-3102 Project: Lucene - Java Issue Type: Bug Components: core/search Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3102-factory.patch, LUCENE-3102.patch, LUCENE-3102.patch CachingCollector (introduced in LUCENE-1421) has a few issues: # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it. # It does not clear cachedScores + cachedSegs upon exceeding RAM limits. # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity. # This check, if (base + nextLength > maxDocsToCache) (line 168), can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them? Also: * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if the CachingCollector ctor already takes a boolean cacheScores? I think it's better defined explicitly than implicitly? * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189). * I think it's a useful collector, which stands on its own and is not specific to grouping. Can we move it to core? * How about using OpenBitSet instead of int[] for doc IDs?
** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries ** NOTE: OpenBitSet can be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such a Collector, which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting the RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores. * I think a set of dedicated unit tests for this class alone would be good. That's it so far. Perhaps, if we do all of the above, more things will pop up. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
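The OpenBitSet suggestion above comes down to a RAM tradeoff. A back-of-the-envelope sketch (not Lucene code; it assumes 4 bytes per cached doc ID and 1 bit per document in a bit set, ignoring object overhead):

```java
// Hypothetical sizing helper: compares caching the hits as an int[] of doc IDs
// versus a bit set spanning the whole maxDoc range.
public class CacheSizing {
    static long intArrayBytes(long numHits) { return numHits * 4; }     // 4 bytes per doc ID
    static long bitSetBytes(long maxDoc)    { return (maxDoc + 7) / 8; } // 1 bit per doc
    // Under these assumptions the bit set uses less RAM once more than
    // roughly 1/32 of the documents match.
    static boolean bitSetWins(long numHits, long maxDoc) {
        return bitSetBytes(maxDoc) < intArrayBytes(numHits);
    }
    public static void main(String[] args) {
        long maxDoc = 10_000_000L;
        System.out.println(bitSetWins(1_000_000L, maxDoc)); // 10% of docs match: bit set wins
        System.out.println(bitSetWins(100_000L, maxDoc));   // 1% of docs match: int[] wins
    }
}
```

The NOTE above still applies either way: a bit set carries no collection order, so it is only usable for in-order collection.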
[jira] [Commented] (LUCENE-3113) fix analyzer bugs found by MockTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034846#comment-13034846 ] Robert Muir commented on LUCENE-3113: - Uwe, I think I'll open a follow-up issue to clean up the code around PrefixAndSuffixAwareTF. I don't like how tricky it is. fix analyzer bugs found by MockTokenizer Key: LUCENE-3113 URL: https://issues.apache.org/jira/browse/LUCENE-3113 Project: Lucene - Java Issue Type: Bug Components: modules/analysis Reporter: Robert Muir Fix For: 3.2, 4.0 Attachments: LUCENE-3113.patch, LUCENE-3113.patch In LUCENE-3064, we beefed up MockTokenizer with assertions, and I've switched the analysis tests over to use MockTokenizer for better coverage. However, this found a few bugs (one of which is LUCENE-3106): * incrementToken() called after it returns false in CommonGramsQueryFilter, HyphenatedWordsFilter, ShingleFilter, SynonymFilter * missing end() implementation for PrefixAwareTokenFilter * double reset() in QueryAutoStopWordAnalyzer and ReusableAnalyzerBase * missing correctOffset()s in MockTokenizer itself. I think it would be nice to just fix all the bugs in one issue... I've fixed everything except Shingle and Synonym -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
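The first class of bugs listed above (incrementToken() after it returns false) boils down to a simple contract: once a stream reports exhaustion it must stay exhausted. A minimal sketch of the idea, using plain Java rather than Lucene's actual TokenStream API (all names here are illustrative):

```java
// Toy model of the contract MockTokenizer enforces: after incrementToken()
// returns false once, every later call must also return false.
public class StreamContract {
    interface Stream { boolean incrementToken(); }

    // A correct filter records exhaustion and never "revives" the stream,
    // even if incrementToken() is called again by a buggy consumer.
    static class SafeFilter implements Stream {
        private final java.util.Iterator<String> input;
        private boolean exhausted = false;
        SafeFilter(java.util.List<String> tokens) { this.input = tokens.iterator(); }
        public boolean incrementToken() {
            if (exhausted) return false;          // stay exhausted forever
            if (input.hasNext()) { input.next(); return true; }
            exhausted = true;
            return false;
        }
    }

    public static void main(String[] args) {
        SafeFilter f = new SafeFilter(java.util.List.of("a", "b"));
        System.out.println(f.incrementToken()); // true
        System.out.println(f.incrementToken()); // true
        System.out.println(f.incrementToken()); // false
        System.out.println(f.incrementToken()); // still false
    }
}
```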
[jira] [Commented] (LUCENE-3112) Add IW.add/updateDocuments to support nested documents
[ https://issues.apache.org/jira/browse/LUCENE-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034850#comment-13034850 ] Michael McCandless commented on LUCENE-3112: {quote} We should really think through the consequences of this though. If core features of lucene become implemented in a way that they rely upon these sequential docids, we then lock ourselves out of future optimizations such as reordering docids for optimal index compression. {quote} I agree it's somewhat dangerous that we are making an (experimental) guarantee that these docIDs will remain adjacent forever. We normally are very protective about letting apps rely on docID assignment/order. But, I think this will not be core functionality that relies on sub-docs (adjacent docs), but rather modules -- grouping, faceting, nested queries. And, even if you use these modules, it's optional whether the app did sub-docs. Ie we would still have the 'generic' grouping collector, but then also an optimized one that takes advantage of sub-docs. Finally, I think doing this today would not preclude doing docID reordering in the future, because the sub docs would be recomputable based on the identifier field which grouped them in the first place. Ie the worst case future scenario (an app uses this new sub-docs feature, but then has a big index they don't want to reindex and wants to take advantage of a future docID reordering compression we add) would still be solvable, because we could use this identifier field to find blocks of sub-docs. I suppose we could consider changing the index format today to record which docs are subs... but I think we don't need to. Maybe I should strengthen the @experimental to explain the risk that a future reindexing could be required? 
Add IW.add/updateDocuments to support nested documents -- Key: LUCENE-3112 URL: https://issues.apache.org/jira/browse/LUCENE-3112 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3112.patch I think nested documents (LUCENE-2454) is a very compelling addition to Lucene. It's also a popular (many votes) issue. Beyond supporting nested document querying, which is already an incredible addition since it preserves the relational model on indexing normalized content (eg, DB tables, XML docs), LUCENE-2454 should also enable speedups in the grouping implementation when you group by a nested field. For the same reason, it can also enable a very fast post-group facet counting impl (LUCENE-3097) when you want to count(distinct(nestedField)), instead of unique documents, as your identifier. I expect many apps that use faceting need this ability (to count(distinct(nestedField)) not distinct(docID)). To support these use cases, I believe the only core change needed is the ability to atomically add or update multiple documents, which you cannot do today since in between add/updateDocument calls a flush (eg due to commit or getReader()) could occur. This new API (addDocuments(Iterable<Document>), updateDocuments(Term delTerm, Iterable<Document>)) would also further guarantee that the documents are assigned sequential docIDs in the order the iterator provided them, and that the docIDs all reside in one segment. Segment merging never splits segments apart, so this invariant would hold even as merges/optimizes take place. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
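The guarantee described above (every document in one addDocuments() call gets consecutive doc IDs, with no flush interleaved) can be modeled with a toy writer. This is an illustration of the invariant only, not Lucene's implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the addDocuments() block guarantee: the whole block is
// assigned a run of sequential IDs atomically, so no flush (or concurrent
// add from another thread) can split it.
public class BlockWriter {
    private int nextDocId = 0;
    private final List<int[]> blocks = new ArrayList<>(); // remembered for illustration

    // Synchronized: the entire block gets its IDs in one atomic step.
    synchronized int[] addDocuments(List<String> docs) {
        int[] ids = new int[docs.size()];
        for (int i = 0; i < ids.length; i++) ids[i] = nextDocId++;
        blocks.add(ids);
        return ids;
    }

    public static void main(String[] args) {
        BlockWriter w = new BlockWriter();
        int[] ids = w.addDocuments(List.of("child1", "child2", "parent"));
        // IDs are adjacent, so a nested-document query can locate the
        // children of a parent doc by simple doc ID arithmetic.
        System.out.println(ids[2] - ids[0]); // 2
    }
}
```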
[jira] [Created] (LUCENE-3114) PrefixAndSuffixAwareTokenFilter code cleanup
PrefixAndSuffixAwareTokenFilter code cleanup Key: LUCENE-3114 URL: https://issues.apache.org/jira/browse/LUCENE-3114 Project: Lucene - Java Issue Type: Task Components: modules/analysis Reporter: Robert Muir as noted on LUCENE-3113, I think this tokenstream is difficult to review. In my opinion just changing the 'private PrefixAwareTokenFilter suffix' to 'private PrefixAwareTokenFilter prefixAndSuffix' would work wonders. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3112) Add IW.add/updateDocuments to support nested documents
[ https://issues.apache.org/jira/browse/LUCENE-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034854#comment-13034854 ] Robert Muir commented on LUCENE-3112: - {quote} I suppose we could consider changing the index format today to record which docs are subs... but I think we don't need to. Maybe I should strengthen the @experimental to explain the risk that a future reindexing could be required? {quote} I think this would be perfect. I certainly don't want to hold up this improvement; I just didn't want us to be in a situation in the future where we say 'well, if only we had recorded this information; now it's not possible to do XYZ because someone COULD have used add/updateDocuments() for some arbitrary reason and we will split their grouped ids'. We could also include in the note that various existing IndexSorters/Splitters are unaware of this, so use with caution :) Add IW.add/updateDocuments to support nested documents -- Key: LUCENE-3112 URL: https://issues.apache.org/jira/browse/LUCENE-3112 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3112.patch -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3111) TestFSTs.testRandomWords failure
[ https://issues.apache.org/jira/browse/LUCENE-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034868#comment-13034868 ] Michael McCandless commented on LUCENE-3111: I'm also not able to reproduce... TestFSTs.testRandomWords failure Key: LUCENE-3111 URL: https://issues.apache.org/jira/browse/LUCENE-3111 Project: Lucene - Java Issue Type: Bug Reporter: selckin Assignee: Michael McCandless Priority: Minor Was running some while(1) tests on the docvalues branch (r1103705) and the following test failed: {code} [junit] Testsuite: org.apache.lucene.util.automaton.fst.TestFSTs [junit] Testcase: testRandomWords(org.apache.lucene.util.automaton.fst.TestFSTs): FAILED [junit] expected:<771> but was:<TwoLongs:771,771> [junit] junit.framework.AssertionFailedError: expected:<771> but was:<TwoLongs:771,771> [junit] at org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.verifyUnPruned(TestFSTs.java:540) [junit] at org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.doTest(TestFSTs.java:496) [junit] at org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.doTest(TestFSTs.java:359) [junit] at org.apache.lucene.util.automaton.fst.TestFSTs.doTest(TestFSTs.java:319) [junit] at org.apache.lucene.util.automaton.fst.TestFSTs.testRandomWords(TestFSTs.java:940) [junit] at org.apache.lucene.util.automaton.fst.TestFSTs.testRandomWords(TestFSTs.java:915) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211) [junit] [junit] [junit] Tests run: 7, Failures: 1, Errors: 0, Time elapsed: 7.628 sec [junit] [junit] - Standard Error - [junit] NOTE: Ignoring nightly-only test method 'testBigSet' [junit] NOTE: reproduce with: ant test -Dtestcase=TestFSTs -Dtestmethod=testRandomWords -Dtests.seed=-269475578956012681:0 [junit] NOTE: test params are: codec=PreFlex, locale=ar, timezone=America/Blanc-Sablon 
[junit] NOTE: all tests run in this JVM: [junit] [TestToken, TestCodecs, TestIndexReaderReopen, TestIndexWriterMerging, TestNoDeletionPolicy, TestParallelReaderEmptyIndex, TestParallelTermEnum, TestPerSegmentDeletes, TestSegmentReader, TestSegmentTermDocs, TestStressAdvance, TestTermVectorsReader, TestSurrogates, TestMultiFieldQueryParser, TestAutomatonQuery, TestBooleanScorer, TestFuzzyQuery, TestMultiTermConstantScore, TestNumericRangeQuery64, TestPositiveScoresOnlyCollector, TestPrefixFilter, TestQueryTermVector, TestScorerPerf, TestSloppyPhraseQuery, TestSpansAdvanced, TestWindowsMMap, TestRamUsageEstimator, TestSmallFloat, TestUnicodeUtil, TestFSTs] [junit] NOTE: Linux 2.6.37-gentoo amd64/Sun Microsystems Inc. 1.6.0_25 (64-bit)/cpus=8,threads=1,free=137329960,total=208207872 [junit] - --- [junit] TEST org.apache.lucene.util.automaton.fst.TestFSTs FAILED {code} I am not able to reproduce -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[Lucene.Net] [jira] [Resolved] (LUCENENET-410) Lucene In Action (LIA book) samples for .NET.
[ https://issues.apache.org/jira/browse/LUCENENET-410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prescott Nasser resolved LUCENENET-410. --- Resolution: Not A Problem Lucene In Action (LIA book) samples for .NET. - Key: LUCENENET-410 URL: https://issues.apache.org/jira/browse/LUCENENET-410 Project: Lucene.Net Issue Type: New Feature Reporter: Pasha Bizhan Priority: Minor Attachments: liabook1_net_samples.zip First edition, Lucene.Net 1.4. Not all samples from the book are included, only those suitable for .NET; for example, the Nutch samples are excluded. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (SOLR-2521) TestJoin.testRandom fails
[ https://issues.apache.org/jira/browse/SOLR-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley reassigned SOLR-2521: -- Assignee: Yonik Seeley TestJoin.testRandom fails - Key: SOLR-2521 URL: https://issues.apache.org/jira/browse/SOLR-2521 Project: Solr Issue Type: Bug Reporter: Michael McCandless Assignee: Yonik Seeley Fix For: 4.0 Hit this random failure; it reproduces on trunk: {noformat} [junit] Testsuite: org.apache.solr.TestJoin [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 4.512 sec [junit] [junit] - Standard Error - [junit] 2011-05-16 12:51:46 org.apache.solr.TestJoin testRandomJoin [junit] SEVERE: GROUPING MISMATCH: mismatch: '0'!='1' @ response/numFound [junit] request=LocalSolrQueryRequest{echoParams=allindent=trueq={!join+from%3Dsmall_i+to%3Dsmall3_is}*:*wt=json} [junit] result={ [junit] responseHeader:{ [junit] status:0, [junit] QTime:0, [junit] params:{ [junit] echoParams:all, [junit] indent:true, [junit] q:{!join from=small_i to=small3_is}*:*, [junit] wt:json}}, [junit] response:{numFound:1,start:0,docs:[ [junit] { [junit] id:NXEA, [junit] score_f:87.90162, [junit] small3_ss:[N, [junit] v, [junit] n], [junit] small_i:4, [junit] small2_i:1, [junit] small2_is:[2], [junit] small3_is:[69, [junit] 88, [junit] 54, [junit] 80, [junit] 75, [junit] 83, [junit] 57, [junit] 73, [junit] 85, [junit] 52, [junit] 50, [junit] 88, [junit] 51, [junit] 89, [junit] 12, [junit] 8, [junit] 19, [junit] 23, [junit] 53, [junit] 75, [junit] 26, [junit] 99, [junit] 0, [junit] 44]}] [junit] }} [junit] expected={numFound:0,start:0,docs:[]} [junit] model={NXEA:Doc(0):[id=NXEA, score_f=87.90162, small3_ss=[N, v, n], small_i=4, small2_i=1, small2_is=2, small3_is=[69, 88, 54, 80, 75, 83, 57, 73, 85, 52, 50, 88, 51, 89, 12, 8, 19, 23, 53, 75, 26, 99, 0, 44]],JSLZ:Doc(1):[id=JSLZ, score_f=11.198811, small2_ss=[c, d], small3_ss=[b, R, H, Q, O, f, C, e, Z, u, z, u, w, I, f, _, Y, r, w, u], small_i=6, small2_is=[2, 3], small3_is=[22, 
1]],FAWX:Doc(2):[id=FAWX, score_f=25.524109, small_s=d, small3_ss=[O, D, X, `, W, z, k, M, j, m, r, [, E, P, w, ^, y, T, e, R, V, H, g, e, I], small_i=2, small2_is=[2, 1], small3_is=[95, 42]],GDDZ:Doc(3):[id=GDDZ, score_f=8.483642, small2_ss=[b, e], small3_ss=[o, i, y, l, I, O, r, O, f, d, E, e, d, f, b, P], small2_is=[6, 6], small3_is=[36, 48, 9, 8, 40, 40, 68]],RBIQ:Doc(4):[id=RBIQ, score_f=97.06258, small_s=b, small2_s=c, small2_ss=[e, e], small_i=2, small2_is=6, small3_is=[13, 77, 96, 45]],LRDM:Doc(5):[id=LRDM, score_f=82.302124, small_s=b, small2_s=a, small2_ss=d, small3_ss=[H, m, O, D, I, J, U, D, f, N, ^, m, I, j, L, s, F, h, A, `, c, j], small2_i=2, small2_is=[2, 7], small3_is=[81, 31, 78, 23, 88, 1, 7, 86, 20, 7, 40, 52, 100, 81, 34, 45, 87, 72, 14, 5]]} [junit] NOTE: reproduce with: ant test -Dtestcase=TestJoin -Dtestmethod=testRandomJoin -Dtests.seed=-4998031941344546449:8541928265064992444 [junit] NOTE: test params are: codec=RandomCodecProvider: {id=MockRandom, small2_ss=Standard, small2_is=MockFixedIntBlock(blockSize=1738), small2_s=MockFixedIntBlock(blockSize=1738), small3_is=MockVariableIntBlock(baseBlockSize=77), small_i=MockFixedIntBlock(blockSize=1738), small_s=MockVariableIntBlock(baseBlockSize=77), score_f=MockSep, small2_i=Pulsing(freqCutoff=9), small3_ss=SimpleText}, locale=sr_BA, timezone=America/Barbados [junit] NOTE: all tests run in this JVM: [junit] [TestJoin] [junit] NOTE: Linux 2.6.33.6-147.fc13.x86_64 amd64/Sun Microsystems Inc. 1.6.0_21 (64-bit)/cpus=24,threads=1,free=252342544,total=308084736 [junit] - --- [junit] Testcase: testRandomJoin(org.apache.solr.TestJoin): FAILED [junit] mismatch: '0'!='1' @ response/numFound [junit] junit.framework.AssertionFailedError: mismatch: '0'!='1' @ response/numFound [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211) [junit] at
[jira] [Commented] (LUCENE-3111) TestFSTs.testRandomWords failure
[ https://issues.apache.org/jira/browse/LUCENE-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034876#comment-13034876 ] Robert Muir commented on LUCENE-3111: - This sounds like a bug in either the test or test-infra. I'm not able to reproduce, but if I run this test with -Dtests.iter=100, I'm able to produce a similar failure (again not reproducible). So first I'd like to see if we can find the reproducibility bug. This is the most important to me :) TestFSTs.testRandomWords failure Key: LUCENE-3111 URL: https://issues.apache.org/jira/browse/LUCENE-3111 Project: Lucene - Java Issue Type: Bug Reporter: selckin Assignee: Michael McCandless Priority: Minor -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3111) TestFSTs.testRandomWords failure
[ https://issues.apache.org/jira/browse/LUCENE-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034880#comment-13034880 ] Robert Muir commented on LUCENE-3111: - OK, the problem is the test overrides setUp() but doesn't call super.setUp(), and it does the same with tearDown(). Currently the way LuceneTestCase checks this is very crude; in other words, if you make this mistake with one or the other, but not both, it will catch it! The only workaround I know of to find test bugs like this is to install findbugs; it has a specific check for this exact test bug! We could run it on all of our tests. TestFSTs.testRandomWords failure Key: LUCENE-3111 URL: https://issues.apache.org/jira/browse/LUCENE-3111 Project: Lucene - Java Issue Type: Bug Reporter: selckin Assignee: Michael McCandless Priority: Minor -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
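The test bug Robert describes is easy to reproduce in miniature: a subclass overrides setUp() without chaining to super, so whatever state the base class initializes (such as a reproducible random seed) is silently missing. The class names below are illustrative, not Lucene's:

```java
// Sketch of the missing-super.setUp() bug from LUCENE-3111.
public class SetUpBug {
    static class BaseTestCase {
        protected Long seed;                    // framework state set in setUp()
        protected void setUp() { seed = 42L; }
    }
    static class BrokenTest extends BaseTestCase {
        @Override protected void setUp() { /* forgot super.setUp() */ }
    }
    static class FixedTest extends BaseTestCase {
        @Override protected void setUp() { super.setUp(); }
    }
    public static void main(String[] args) {
        BrokenTest broken = new BrokenTest(); broken.setUp();
        FixedTest fixed = new FixedTest(); fixed.setUp();
        System.out.println(broken.seed); // null: seed never set, failure not reproducible
        System.out.println(fixed.seed);  // 42
    }
}
```

This is exactly the pattern findbugs flags, and why the reported seed could not reproduce the failure.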
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034884#comment-13034884 ] David Smiley commented on LUCENE-3092: -- This looks cool. Any performance measurements? Perhaps a forthcoming post on Mike's blog? :-) NRTCachingDirectory, to buffer small segments in a RAMDir - Key: LUCENE-3092 URL: https://issues.apache.org/jira/browse/LUCENE-3092 Project: Lucene - Java Issue Type: Improvement Components: core/store Reporter: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch I created this simple Directory impl, whose goal is to reduce IO contention in a frequent-reopen NRT use case. The idea is, when reopening quickly but not indexing that much content, you wind up with many small files created over time, which can stress the IO system, eg if merges and searching are also fighting for IO. So, NRTCachingDirectory puts these newly created files into a RAMDir, and only when they are merged into a too-large segment does it then write through to the real (delegate) directory. This lets you spend some RAM to reduce IO. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
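The write-through idea behind NRTCachingDirectory is simple to sketch. This toy model (not Lucene's implementation; byte arrays stand in for files and the delegate directory) keeps small new files in RAM and sends anything over a size threshold straight through:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the NRTCachingDirectory idea: small newly created files live
// in RAM, while files over a size threshold are written through to the real
// (delegate) directory.
public class CachingDir {
    private final long maxCachedBytes;
    final Map<String, byte[]> ram = new HashMap<>();   // small NRT files
    final Map<String, byte[]> disk = new HashMap<>();  // stands in for the delegate dir

    CachingDir(long maxCachedBytes) { this.maxCachedBytes = maxCachedBytes; }

    void createFile(String name, byte[] data) {
        if (data.length <= maxCachedBytes) ram.put(name, data); // cheap: stays in RAM
        else disk.put(name, data);                              // big: write through
    }

    public static void main(String[] args) {
        CachingDir dir = new CachingDir(1024);
        dir.createFile("_0.tiny", new byte[100]);    // cached in RAM
        dir.createFile("_1.merged", new byte[4096]); // too big: goes to disk
        System.out.println(dir.ram.containsKey("_0.tiny"));
        System.out.println(dir.disk.containsKey("_1.merged"));
    }
}
```

The real class also migrates cached files to the delegate once they are merged away, which this sketch omits.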
[jira] [Commented] (SOLR-2424) extracted text from tika has no spaces
[ https://issues.apache.org/jira/browse/SOLR-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034886#comment-13034886 ] Andrzej Bialecki commented on SOLR-2424: - Liam, what version of the cmd-line tika app did you use for this test? Was it the exact same version as the one in Solr? extracted text from tika has no spaces -- Key: SOLR-2424 URL: https://issues.apache.org/jira/browse/SOLR-2424 Project: Solr Issue Type: Bug Components: contrib - Solr Cell (Tika extraction) Affects Versions: 3.1 Reporter: Yonik Seeley Attachments: ET2000 Service Manual.pdf Try this: curl "http://localhost:8983/solr/update/extract?extractOnly=true&wt=json&indent=true" -F tutorial=@tutorial.pdf And you get text output w/o spaces: ThisdocumentcoversthebasicsofrunningSolru... -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1395) Integrate Katta
[ https://issues.apache.org/jira/browse/SOLR-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034887#comment-13034887 ] Jamie Johnson commented on SOLR-1395: - Is there any updated documentation for how to do this? I've attempted to run through the patching process but the exact steps are not clear since the versions have changed significantly. Integrate Katta --- Key: SOLR-1395 URL: https://issues.apache.org/jira/browse/SOLR-1395 Project: Solr Issue Type: New Feature Affects Versions: 1.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 3.2 Attachments: SOLR-1395.patch, SOLR-1395.patch, SOLR-1395.patch, back-end.log, front-end.log, hadoop-core-0.19.0.jar, katta-core-0.6-dev.jar, katta-solrcores.jpg, katta.node.properties, katta.zk.properties, log4j-1.2.13.jar, solr-1395-1431-3.patch, solr-1395-1431-4.patch, solr-1395-1431-katta0.6.patch, solr-1395-1431-katta0.6.patch, solr-1395-1431.patch, solr-1395-katta-0.6.2-1.patch, solr-1395-katta-0.6.2-2.patch, solr-1395-katta-0.6.2-3.patch, solr-1395-katta-0.6.2.patch, test-katta-core-0.6-dev.jar, zkclient-0.1-dev.jar, zookeeper-3.2.1.jar Original Estimate: 336h Remaining Estimate: 336h We'll integrate Katta into Solr so that: * Distributed search uses Hadoop RPC * Shard/SolrCore distribution and management * Zookeeper based failover * Indexes may be built using Hadoop -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3111) TestFSTs.testRandomWords failure
[ https://issues.apache.org/jira/browse/LUCENE-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034895#comment-13034895 ] Michael McCandless commented on LUCENE-3111: Doh! +1 for findbugs. TestFSTs.testRandomWords failure Key: LUCENE-3111 URL: https://issues.apache.org/jira/browse/LUCENE-3111 Project: Lucene - Java Issue Type: Bug Reporter: selckin Assignee: Michael McCandless Priority: Minor -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3111) TestFSTs.testRandomWords failure
[ https://issues.apache.org/jira/browse/LUCENE-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034899#comment-13034899 ] Michael McCandless commented on LUCENE-3111: OK this reproduces the bug, once you add the missing calls to super.setUp/tearDown: {noformat} ant test -Dtestcase=TestFSTs -Dtestmethod=testRandomWords -Dtests.seed=6166279653770643480:6589011488658196383 {noformat} TestFSTs.testRandomWords failure Key: LUCENE-3111 URL: https://issues.apache.org/jira/browse/LUCENE-3111 Project: Lucene - Java Issue Type: Bug Reporter: selckin Assignee: Michael McCandless Priority: Minor -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3111) TestFSTs.testRandomWords failure
[ https://issues.apache.org/jira/browse/LUCENE-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034900#comment-13034900 ] Robert Muir commented on LUCENE-3111: - I have an idea for how to make LuceneTestCase fail if a test does this... I'll see if I can improve the setUp/tearDown checking this way so we don't have this issue again. TestFSTs.testRandomWords failure Key: LUCENE-3111 URL: https://issues.apache.org/jira/browse/LUCENE-3111 Project: Lucene - Java Issue Type: Bug Reporter: selckin Assignee: Michael McCandless Priority: Minor -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
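The check Robert Muir describes can be sketched in plain Java. This is an illustrative mock, not Lucene's actual test infrastructure: the base class sets a flag in `setUp()` and the runner verifies it before invoking the test body, so a subclass that overrides `setUp()` without calling `super.setUp()` fails fast. All class and method names here are made up for the sketch.

```java
// Hypothetical sketch of detecting a missing super.setUp() call:
// the base setUp() records that it ran; the runner checks the flag.
public class SetupCheckSketch {
    static class BaseTestCase {
        private boolean setUpCalled;

        protected void setUp() { setUpCalled = true; }

        // The runner calls this instead of invoking the test method directly.
        final void runTest(Runnable testBody) {
            setUpCalled = false;
            setUp(); // dynamic dispatch: runs the subclass override, if any
            if (!setUpCalled)
                throw new IllegalStateException("setUp() did not call super.setUp()");
            testBody.run();
        }
    }

    static class GoodTest extends BaseTestCase {
        @Override protected void setUp() { super.setUp(); /* extra per-test setup */ }
    }

    static class BadTest extends BaseTestCase {
        @Override protected void setUp() { /* forgot super.setUp() */ }
    }

    public static void main(String[] args) {
        new GoodTest().runTest(() -> {}); // passes: the flag was set
        try {
            new BadTest().runTest(() -> {});
            throw new AssertionError("expected the missing-super check to trip");
        } catch (IllegalStateException expected) {
            System.out.println("ok");
        }
    }
}
```

The same flag trick works for tearDown(); the runner just checks the flag after the test body instead of before.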
[jira] [Resolved] (LUCENE-3098) Grouped total count
[ https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-3098. Resolution: Fixed Committed. I made a small change to TestGrouping (renamed one variable) and tweaked the javadocs a bit on AllGroupsCollector. This is a great addition to the grouping module -- thanks Martijn! Grouped total count --- Key: LUCENE-3098 URL: https://issues.apache.org/jira/browse/LUCENE-3098 Project: Lucene - Java Issue Type: New Feature Reporter: Martijn van Groningen Assignee: Michael McCandless Fix For: 3.2, 4.0 Attachments: LUCENE-3098-3x.patch, LUCENE-3098-3x.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch With grouping you currently get two counts: * Total hit count, which counts all documents that matched the query. * Total grouped hit count, which counts all documents that were grouped into the top N groups. Since with grouping the end user gets groups in the search result instead of plain documents, the total number of groups often makes more sense as the total count. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated SOLR-2193: -- Attachment: SOLR-2193.patch Here is a new patch - couple tests, couple fixes, etc, etc. Still has no commitWithin type support for soft commits. Tested and made auto soft commit code work. I spent some time today firing documents rapidly at Solr with a soft commit max time of 1 second. Fantastic results at about 100 wikipedia documents per second. Didn't change any other example settings this time. Re-architect Update Handler --- Key: SOLR-2193 URL: https://issues.apache.org/jira/browse/SOLR-2193 Project: Solr Issue Type: Improvement Reporter: Mark Miller Assignee: Mark Miller Fix For: 4.0 Attachments: SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch The update handler needs an overhaul. A few goals I think we might want to look at: 1. Cleanup - drop DirectUpdateHandler(2) line - move to something like UpdateHandler, DefaultUpdateHandler 2. Expose the SolrIndexWriter in the api or add the proper abstractions to get done what we now do with special casing: if (directupdatehandler2) success else failish 3. Stop closing the IndexWriter and start using commit (still lazy IW init though). 4. Drop iwAccess, iwCommit locks and sync mostly at the Lucene level. 5. Keep NRT support in mind. 6. Keep microsharding in mind (maintain logical index as multiple physical indexes) 7. Address the current issues we face because multiple original/'reloaded' cores can have a different IndexWriter on the same index. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034911#comment-13034911 ] Michael McCandless commented on LUCENE-3092: Alas, I haven't had time to really dig into the perf gains here... but I suspect that on systems where IO is in contention (due to ongoing cold searching, or merging) and the reopen rate is high, this should be a decent win, since we don't burden the IO system with many tiny files. NRTCachingDirectory, to buffer small segments in a RAMDir - Key: LUCENE-3092 URL: https://issues.apache.org/jira/browse/LUCENE-3092 Project: Lucene - Java Issue Type: Improvement Components: core/store Reporter: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch I created this simple Directory impl, whose goal is to reduce IO contention in a frequent-reopen NRT use case. The idea is, when reopening quickly but not indexing that much content, you wind up with many small files created over time, which can stress the IO system, e.g. if merges and searching are also fighting for IO. So NRTCachingDirectory puts these newly created files into a RAMDir, and only when they are merged into a too-large segment does it then write through to the real (delegate) directory. This lets you spend some RAM to reduce IO. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
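The write-through idea behind NRTCachingDirectory can be sketched in plain Java without any Lucene types. This is a minimal illustration, not Lucene's API: the "real directory" is simulated with a second map, and the size threshold and all names are made up for the sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the NRTCachingDirectory idea: small, freshly written
// "files" stay in a RAM map; large files, and cached files whose segment gets
// merged away, are written through to the (simulated) real directory.
public class WriteThroughSketch {
    private final Map<String, byte[]> ramDir = new HashMap<>();
    private final Map<String, byte[]> realDir = new HashMap<>(); // stands in for the delegate FSDirectory
    private final int maxCachedBytes;

    public WriteThroughSketch(int maxCachedBytes) {
        this.maxCachedBytes = maxCachedBytes;
    }

    // Small files (e.g. tiny NRT segments) are cached; large ones go straight through.
    public void writeFile(String name, byte[] contents) {
        if (contents.length <= maxCachedBytes) {
            ramDir.put(name, contents);
        } else {
            realDir.put(name, contents);
        }
    }

    // When a cached file must survive (e.g. its segment is merged), flush it to the real directory.
    public void flush(String name) {
        byte[] cached = ramDir.remove(name);
        if (cached != null) {
            realDir.put(name, cached);
        }
    }

    public boolean isCached(String name) { return ramDir.containsKey(name); }
    public boolean onDisk(String name)   { return realDir.containsKey(name); }

    public static void main(String[] args) {
        WriteThroughSketch dir = new WriteThroughSketch(4);
        dir.writeFile("_0.tis", new byte[]{1, 2});            // small: cached in RAM
        dir.writeFile("_big.cfs", new byte[]{1, 2, 3, 4, 5}); // large: written through
        if (!dir.isCached("_0.tis")) throw new AssertionError("small file should be cached");
        if (!dir.onDisk("_big.cfs")) throw new AssertionError("large file should be on disk");
        dir.flush("_0.tis");
        if (dir.isCached("_0.tis") || !dir.onDisk("_0.tis"))
            throw new AssertionError("flush should move the file to disk");
        System.out.println("ok");
    }
}
```

The point of the design is that short-lived files created by frequent NRT reopens never touch the IO system at all; only data that outlives a merge pays the disk-write cost.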
[jira] [Commented] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034918#comment-13034918 ] Mark Miller commented on SOLR-2193: --- Next I need to look at the thread safety of CommitTracker under the new locking system. Re-architect Update Handler --- Key: SOLR-2193 URL: https://issues.apache.org/jira/browse/SOLR-2193 Project: Solr Issue Type: Improvement Reporter: Mark Miller Assignee: Mark Miller Fix For: 4.0 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-3092. Resolution: Fixed NRTCachingDirectory, to buffer small segments in a RAMDir - Key: LUCENE-3092 URL: https://issues.apache.org/jira/browse/LUCENE-3092 Project: Lucene - Java Issue Type: Improvement Components: core/store Reporter: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034927#comment-13034927 ] Shai Erera commented on LUCENE-3092: Mike, this is a great idea! If there is any chance it will be released in 3.2, I think one of our NRT apps can make good use of it. Question - I see that the NRTCD ctor takes a Directory. Is there any reason to pass a RAMDir to NRTCD? I assume you use a Directory to allow any other Dir impls out there that may not sub-class e.g. FSDir, which is ok - so can we at least document that this Dir is not useful if you intend to pass a RAMDir to it? Unless I am wrong and it is useful w/ RAMDir as well. NRTCachingDirectory, to buffer small segments in a RAMDir - Key: LUCENE-3092 URL: https://issues.apache.org/jira/browse/LUCENE-3092 Project: Lucene - Java Issue Type: Improvement Components: core/store Reporter: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Bulk changing issues in JIRA
Thanks Shai! Would make a great addition to the wiki ;) On May 16, 2011, at 11:47 PM, Shai Erera wrote: Hi If you ever wondered how to bulk change issues in JIRA, here's the procedure: * View a list of issues, e.g. by query/filter * At the top-right you'll find this: * Click on Tools and select * The screen changes so that next to each issue there's a check box. * Mark all the issues you want to change and click Next * Select the operation (e.g. Edit) * The next screen (followed by choosing operation Edit) lets you edit the issues. Note this at the bottom: Deselect if you don't want to spam the list :). FYI, Shai - Mark Miller lucidimagination.com Lucene/Solr User Conference May 25-26, San Francisco www.lucenerevolution.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Lucene/Solr JIRA
Hi Today we have separate JIRA projects for Lucene and Solr. This, IMO, is starting to become confusing and difficult to maintain. I'll explain: * With modules, we now have components in the Lucene JIRA project for different modules (some under modules/*, some under lucene/contrib/*). Will we have the same component duplication in the Solr JIRA project? * Where do users go to open a bug report for a module - the Lucene or Solr project? I'd hate to see them open it under their favorite (or worse, randomly picked) project. If so, it'll become a mess. * Administration -- everything needs to be done twice. Create versions (the same ones!) on both projects, close issues (after release), etc. * Managing a release now means I should monitor two JIRA projects for the 3.2 (an example) version issues. Why? I guess I'm not too sure what two JIRA projects give us. Now that it is the same project, why not make our (committers' and contributors') lives easier by having one JIRA project w/ components: lucene/core lucene/contrib/xyz modules/xyz solr/core solr/contrib/xyz general/* (test, build) It's already becoming confusing: LUCENE-3097: post grouping faceting -- a great example of a module that both Lucene and Solr users can use. Opened under the Lucene project, and depends on Solr issues (not a big deal). LUCENE-3104: could easily have been opened under the Solr project. I don't know why it was opened under Lucene (random, maybe?) Can we merge the two? Shai
[jira] [Commented] (SOLR-2119) IndexSchema should log warning if analyzer is declared with charfilter/tokenizer/tokenfilter out of order
[ https://issues.apache.org/jira/browse/SOLR-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034939#comment-13034939 ] Mark Miller commented on SOLR-2119: --- bq. I think this would be a good service to those users who trip the hard error on upgrade: it means Solr is not doing what they thought they asked it to do. +1 IndexSchema should log warning if analyzer is declared with charfilter/tokenizer/tokenfilter out of order -- Key: SOLR-2119 URL: https://issues.apache.org/jira/browse/SOLR-2119 Project: Solr Issue Type: Improvement Components: Schema and Analysis Reporter: Hoss Man Fix For: 3.2, 4.0 There seems to be a segment of the user population that has a hard time understanding the distinction between a charfilter, a tokenizer, and a tokenfilter -- while we can certainly try to improve the documentation about what exactly each does, and when they take effect in the analysis chain, one other thing we should do is try to educate people when they construct their analyzer in a way that doesn't make any sense. At the moment, some people are attempting to do things like move the Foo tokenFilter/ before the tokenizer/ to try to get certain behavior ... at a minimum we should log a warning in this case that doing so doesn't have the desired effect (we could easily make such a situation fail to initialize, but I'm not convinced that would be the best course of action, since some people may have schemas where they have declared a charFilter or tokenizer out of order relative to their tokenFilters, but are still getting correct results that work for them, and breaking their instance on upgrade doesn't seem like it would be productive) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
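The ordering rule the issue describes (char filters before the tokenizer, token filters after it) can be sketched as a simple stage check. This is an illustrative mock, not Solr's IndexSchema code: the `Stage` enum and `warnOutOfOrder` helper are invented for the sketch.

```java
import java.util.List;

// Hypothetical sketch of the SOLR-2119 warning: each analyzer component has a
// legal stage (charfilter < tokenizer < tokenfilter), and a component declared
// before an earlier-stage one deserves a logged warning rather than a hard error.
public class AnalyzerOrderCheck {
    enum Stage { CHAR_FILTER, TOKENIZER, TOKEN_FILTER } // ordinal() encodes the legal order

    // Returns a warning message if the declared components are out of order, else null.
    static String warnOutOfOrder(List<Stage> declared) {
        for (int i = 1; i < declared.size(); i++) {
            if (declared.get(i).ordinal() < declared.get(i - 1).ordinal()) {
                return "analyzer component " + declared.get(i)
                     + " is declared after " + declared.get(i - 1)
                     + "; it will not run where you declared it";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // A tokenFilter declared before the tokenizer should produce a warning...
        String warn = warnOutOfOrder(List.of(Stage.TOKEN_FILTER, Stage.TOKENIZER));
        if (warn == null) throw new AssertionError("expected a warning");
        // ...while the correct order should not.
        if (warnOutOfOrder(List.of(Stage.CHAR_FILTER, Stage.TOKENIZER, Stage.TOKEN_FILTER)) != null)
            throw new AssertionError("expected no warning");
        System.out.println("ok");
    }
}
```

Logging rather than failing matches the issue's reasoning: schemas that are technically out of order but happen to produce results their owners accept keep working across an upgrade.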
Re: Lucene/Solr JIRA
On May 17, 2011, at 2:22 PM, Shai Erera wrote: Can we merge the two? +1. Due to history and other possible pain points, I don't know that it's the right practical idea at the end of the upcoming discussion, but it's certainly a good idea. - Mark Miller lucidimagination.com Lucene/Solr User Conference May 25-26, San Francisco www.lucenerevolution.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3111) TestFSTs.testRandomWords failure
[ https://issues.apache.org/jira/browse/LUCENE-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-3111: --- Attachment: LUCENE-3111.patch OK, I found this -- if you try to add the same output twice for the empty string, the builder fails to realize this is a TwoInts and makes a single-int output! Thank you, random testing :) I'll commit shortly... TestFSTs.testRandomWords failure Key: LUCENE-3111 URL: https://issues.apache.org/jira/browse/LUCENE-3111 Project: Lucene - Java Issue Type: Bug Reporter: selckin Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-3111.patch -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-3111) TestFSTs.testRandomWords failure
[ https://issues.apache.org/jira/browse/LUCENE-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-3111. Resolution: Fixed Fix Version/s: 4.0 TestFSTs.testRandomWords failure Key: LUCENE-3111 URL: https://issues.apache.org/jira/browse/LUCENE-3111 Project: Lucene - Java Issue Type: Bug Reporter: selckin Assignee: Michael McCandless Priority: Minor Fix For: 4.0 Attachments: LUCENE-3111.patch -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Lucene/Solr JIRA
Can we merge the two? gut reaction says +1, but after thinking about how it would work, i'm +0 Would we just stop accepting new tickets on one system, but still keep track of both? for how long? Would we move open issues from SOLR to LUCENE? migrate the comments/history/etc In the end I think the two systems are fine -- not ideal, and they should map (more or less) to where the entry should go in CHANGES.txt ryan - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Bulk changing issues in JIRA
Created http://wiki.apache.org/lucene-java/BulkIssuesUpdate Thanks Mark! Shai On Tue, May 17, 2011 at 9:01 PM, Mark Miller markrmil...@gmail.com wrote: Thanks Shai! Would make a great addition to the wiki ;) On May 16, 2011, at 11:47 PM, Shai Erera wrote: Hi If you ever wondered how to bulk change issues in JIRA, here's the procedure: * View a list of issues, e.g. by query/filter * At the top-right you'll find this: * Click on Tools and select * The screen changes so that next to each issue there's a check box. * Mark all the issues you want to change and click Next * Select the operation (e.g. Edit) * The next screen (followed by choosing operation Edit) lets you edit the issues. Note this at the bottom: Deselect if you don't want to spam the list :). FYI, Shai - Mark Miller lucidimagination.com Lucene/Solr User Conference May 25-26, San Francisco www.lucenerevolution.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3113) fix analyzer bugs found by MockTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034956#comment-13034956 ] Steven Rowe commented on LUCENE-3113: - +1 bq. the ShingleAnalyzerWrapper was double-resetting Your patch just removes the reset call: {noformat} @@ -201,7 +201,6 @@ TokenStream result = defaultAnalyzer.reusableTokenStream(fieldName, reader); if (result == streams.wrapped) { /* the wrapped analyzer reused the stream */ -streams.shingle.reset(); } else { /* the wrapped analyzer did not, create a new shingle around the new one */ streams.wrapped = result; {noformat} but inverting the condition would read better: {noformat} TokenStream result = defaultAnalyzer.reusableTokenStream(fieldName, reader); - if (result == streams.wrapped) { -/* the wrapped analyzer reused the stream */ -streams.shingle.reset(); - } else { -/* the wrapped analyzer did not, create a new shingle around the new one */ + if (result != streams.wrapped) { +// The wrapped analyzer did not reuse the stream. +// Wrap the new stream with a new ShingleFilter. streams.wrapped = result; streams.shingle = new ShingleFilter(streams.wrapped); } {noformat} fix analyzer bugs found by MockTokenizer Key: LUCENE-3113 URL: https://issues.apache.org/jira/browse/LUCENE-3113 Project: Lucene - Java Issue Type: Bug Components: modules/analysis Reporter: Robert Muir Fix For: 3.2, 4.0 Attachments: LUCENE-3113.patch, LUCENE-3113.patch In LUCENE-3064, we beefed up MockTokenizer with assertions, and I've switched over the analysis tests to use MockTokenizer for better coverage. However, this found a few bugs (one of which is LUCENE-3106): * incrementToken() after it returns false in CommonGramsQueryFilter, HyphenatedWordsFilter, ShingleFilter, SynonymFilter * missing end() implementation for PrefixAwareTokenFilter * double reset() in QueryAutoStopWordAnalyzer and ReusableAnalyzerBase * missing correctOffset()s in MockTokenizer itself. 
I think it would be nice to just fix all the bugs on one issue... I've fixed everything except Shingle and Synonym -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
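The state-machine contract those assertions enforce can be sketched in a few lines (an illustrative Python toy, not the actual Lucene MockTokenizer): once incrementToken() has returned false, a consumer must reset() before pulling again, and several of the filters listed above were violating exactly this.

```python
class CheckedTokenStream:
    """Toy state-checking token stream illustrating the contract
    MockTokenizer-style assertions enforce."""

    def __init__(self, tokens):
        self._tokens = list(tokens)
        self._pos = 0
        self._exhausted = False

    def increment_token(self):
        # Contract check: once this returns False, callers must reset()
        # before asking for more tokens.
        if self._exhausted:
            raise AssertionError("increment_token() called after it returned False")
        if self._pos < len(self._tokens):
            self._pos += 1
            return True
        self._exhausted = True
        return False

    def reset(self):
        self._pos = 0
        self._exhausted = False

stream = CheckedTokenStream(["calendar", "item"])
assert stream.increment_token()
assert stream.increment_token()
assert not stream.increment_token()
stream.reset()  # a well-behaved consumer resets before reuse
assert stream.increment_token()
```

A buggy filter (e.g. one that keeps pulling from a wrapped stream to build shingles) would trip the AssertionError above instead of silently misbehaving.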
[jira] [Commented] (PYLUCENE-9) QueryParser replacing stop words with wildcards
[ https://issues.apache.org/jira/browse/PYLUCENE-9?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034961#comment-13034961 ] Christopher Currens commented on PYLUCENE-9: We can close it. Thanks for the help. QueryParser replacing stop words with wildcards --- Key: PYLUCENE-9 URL: https://issues.apache.org/jira/browse/PYLUCENE-9 Project: PyLucene Issue Type: Bug Environment: Windows XP 32-bit SP3, Ubuntu 10.04.2 LTS i686 GNU/Linux, jdk1.6.0_23 Reporter: Christopher Currens I was using QueryParser to build a query. In Java Lucene (as well as Lucene.Net), the query "Calendar Item as Msg" (quotes included) is parsed properly as FullText:"calendar item msg". In pylucene, it is parsed as: FullText:"calendar item ? msg". This causes obvious problems when comparing search results from Python, Java and .NET. Initially, I thought it was the Analyzer I was using, but I've tried the StandardAnalyzer and StopAnalyzer, which work properly in Java and .NET, but not pylucene. Here is code I've used to reproduce the issue: from lucene import StandardAnalyzer, StopAnalyzer, QueryParser, Version analyzer = StandardAnalyzer(Version.LUCENE_30) query = QueryParser(Version.LUCENE_30, "FullText", analyzer) parsedQuery = query.parse('"Calendar Item as Msg"') parsedQuery Query: FullText:"calendar item ? msg" analyzer = StopAnalyzer(Version.LUCENE_30) query = QueryParser(Version.LUCENE_30, "FullText", analyzer) parsedQuery = query.parse('"Calendar Item as Msg"') parsedQuery Query: FullText:"calendar item ? msg" I've noticed this in pylucene 2.9.4, 2.9.3, and 3.0.3 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
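For background, the stray ? is how Lucene renders an empty position inside a PhraseQuery: the removed stop word leaves a position hole, and whether that hole is kept depends on the position-increment setting. A toy model of that mechanism (plain Python, no Lucene; the stop list and function names are illustrative):

```python
STOP = {"as", "the", "a"}

def analyze(text, enable_position_increments=True):
    """Toy stop filter: drop stop words, optionally preserving the
    position gap they leave behind (as Lucene's StopFilter does when
    enablePositionIncrements is on)."""
    out, pos = [], -1
    for word in text.lower().split():
        pos += 1
        if word in STOP:
            if not enable_position_increments:
                pos -= 1  # close the gap instead of leaving a hole
            continue
        out.append((word, pos))
    return out

def phrase_to_string(tokens):
    """Render like a phrase query's toString(): '?' marks an empty position."""
    last, parts = -1, []
    for term, pos in tokens:
        parts.extend("?" * (pos - last - 1))
        parts.append(term)
        last = pos
    return '"%s"' % " ".join(parts)

print(phrase_to_string(analyze("Calendar Item as Msg")))         # "calendar item ? msg"
print(phrase_to_string(analyze("Calendar Item as Msg", False)))  # "calendar item msg"
```

This reproduces the difference observed between pylucene and the Java/.NET parsers: same analyzer, different position-increment behavior.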
[jira] [Commented] (LUCENE-3104) Hook up Automated Patch Checking for Lucene/Solr
[ https://issues.apache.org/jira/browse/LUCENE-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034964#comment-13034964 ] Grant Ingersoll commented on LUCENE-3104: - General Docs started at http://wiki.apache.org/general/PreCommitBuilds Hook up Automated Patch Checking for Lucene/Solr Key: LUCENE-3104 URL: https://issues.apache.org/jira/browse/LUCENE-3104 Project: Lucene - Java Issue Type: Task Reporter: Grant Ingersoll It would be really great if we could get feedback to contributors sooner on many things that are basic (tests exist, patch applies cleanly, etc.) From Nigel Daley on builds@a.o {quote} I revamped the precommit testing in the fall so that it doesn't use Jira email anymore to trigger a build. The process is controlled by https://builds.apache.org/hudson/job/PreCommit-Admin/ which has some documentation up at the top of the job. You can look at the config of the job (do you have access?) to see what it's doing. Any project could use this same admin job -- you just need to ask me to add the project to the Jira filter used by the admin job (https://issues.apache.org/jira/sr/jira.issueviews:searchrequest-xml/12313474/SearchRequest-12313474.xml?tempMax=100 ) once you have the downstream job(s) setup for your specific project. For Hadoop we have 3 downstream builds configured which also have some documentation: https://builds.apache.org/hudson/job/PreCommit-HADOOP-Build/ https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/ https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/ {quote} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Apache Jenkins emails
Hmm... wouldn't this help to ignore build failures, while current situation encourages solving them? :) I mean, unlike threading JIRA issues which is more convenient now, for build failures this would hide some info - thread title would indicate the oldest failure no. In spite of the above, if others still like to change in this way, I'll be fine with it. Doron On Sun, May 15, 2011 at 6:16 PM, Shai Erera ser...@gmail.com wrote: Well, Gmail ignores (for grouping) everything that in between brackets []. That's how we made all issue emails appear under the same thread, the status (Commented, Created, Resolved etc.) now appears in brackets. So, I think that if we put the build # in brackets, the rest of the message is the same for all failures. So instead of: [JENKINS] Lucene-Solr-tests-only-trunk - Build # 8042 - Still Failing we write [JENKINS] Lucene-Solr-tests-only-trunk - [Build # 8042] - Still Failing Or [JENKINS] [Build # 8042] Lucene-Solr-tests-only-trunk Failed Remove the word still altogether (it's redundant) and move the build number to the start of the subject. Shai On Sun, May 15, 2011 at 6:08 PM, Uwe Schindler u...@thetaphi.de wrote: It’s possible to change the header, as the mails are already customized. How should it look like (I don’t use f*g Gmail) - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de *From:* Shai Erera [mailto:ser...@gmail.com] *Sent:* Sunday, May 15, 2011 5:02 PM *To:* dev@lucene.apache.org *Subject:* Apache Jenkins emails Hi Is it possible to change the subject format of the emails Jenkins server sends? I was thinking, if we put the build # in [], all failures will be grouped under one thread (in Gmail). Since we have so many of them, it will at least collapse all of them into a single thread. We can still tell the failure of each email as well as the build #. What do you think? Shai
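The subject rewrite Shai proposes is easy to prototype (a hypothetical helper, not Jenkins' actual mail configuration): move the build number into brackets so Gmail, which ignores bracketed text when grouping, threads every failure of a job together.

```python
import re

def group_friendly_subject(subject):
    """Rewrite '[JENKINS] <job> - Build # <n> - (Still) Failing' into
    '[JENKINS] [Build # <n>] <job> Failed', dropping the redundant
    'Still' and bracketing the build number for Gmail threading."""
    m = re.match(r"\[JENKINS\] (.+) - Build # (\d+) - (?:Still )?Failing", subject)
    if not m:
        return subject  # leave non-failure subjects untouched
    job, build = m.group(1), m.group(2)
    return "[JENKINS] [Build # %s] %s Failed" % (build, job)

print(group_friendly_subject(
    "[JENKINS] Lucene-Solr-tests-only-trunk - Build # 8042 - Still Failing"))
# -> [JENKINS] [Build # 8042] Lucene-Solr-tests-only-trunk Failed
```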
Re: Lucene/Solr JIRA
If we were starting from scratch, i'd agree with you that having a single Jira project makes more sense, but given where we are today, i think we should probably keep them distinct -- partly from a pain of migration standpoint on our end, but also from a user expectations standpoint -- i think the Solr users/community as a whole is used to the existence of the SOLR project in Jira, and used to the SOLR-* issue naming convention, and it would likely be more confusing for *them* to change now. : * With modules, we now have components in the Lucene JIRA project for : different modules (some under modules/* some under lucene/contrib/*). Will : we have the same components duplication in the Solr JIRA project? when we discussed this before, it seemed clear that top level modules should be tracked as LUCENE issues, so i see no reason why there would be duplications. : * Where do users go to open a bug report for a module - Lucene or Solr : projects? I'd hate to see that they open it under their favorite (or : worse, random picking) project. If so, it'll become a mess. the user bases tend to be very distinct -- if people are dealing with the lucene java API directly they file a LUCENE bug, if they are dealing with the Solr HTTP or client layer (SolrJ) APIs they file a Solr bug. If an issue is filed in a place where we think it doesn't make sense, the issue can easily be moved (and Jira does a redirect for anyone following old links) : * Administration -- everything needs to be done twice. Create versions (same : one !) on both projects, close issues (after release) etc. given the low overhead of this, it doesn't seem all that problematic. : * Managing a release now means I should monitor two JIRA projects for the : 3.2 (an example) version issues. Why? Here's an example of a filter that shows you all issues marked to be fixed in 3.2 in both projects... 
https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=%28project+%3D+SOLR+OR+project+%3D+LUCENE%29+AND+fixVersion+%3D+%223.2%22+AND+resolution+%3D+Unresolved+ORDER+BY+updated+DESC%2C+key+DESC%2C+priority+DESC : I guess I'm not too sure what do two JIRA projects give us. Now that it is : the same project, why not make our (committers and contributors) life easier Short answer: trade off ease of use for committers + pain of migration against ease of use for users ... doesn't seem like a strong need to change. : It's already becoming confusing: neither of these examples seem that confusing to me... : LUCENE-3097: post grouping faceting -- a great example for a module that : both Lucene and Solr users can use. Opened under Lucene project, and : depends on Solr issues (not a big deal) it's an issue for implementing a top level module, therefore it goes in LUCENE. it doesn't depend on any Solr issue, it's marked as being blocked by another issue about adding another top level module : LUCENE-3104: could easily have been opened under the Solr project. I : don't know why it was opened under Lucene (random maybe?) Because it's about improving the hudson build which operates at the top level of the tree -Hoss - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034981#comment-13034981 ] Michael McCandless commented on LUCENE-3092: I committed it to 3.x as well so this will be in 3.2 :) I can't think of any reason why you'd want to wrap another RAMDir with NRTCD? We can fix the docs to state this. Can you work out the wording/patch? Or just go ahead and commit a fix :) NRTCachingDirectory, to buffer small segments in a RAMDir - Key: LUCENE-3092 URL: https://issues.apache.org/jira/browse/LUCENE-3092 Project: Lucene - Java Issue Type: Improvement Components: core/store Reporter: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch I created this simple Directory impl, whose goal is to reduce IO contention in a frequent-reopen NRT use case. The idea is, when reopening quickly, but not indexing that much content, you wind up with many small files created over time, which can stress the IO system, e.g. if merges and searching are also fighting for IO. So, NRTCachingDirectory puts these newly created files into a RAMDir, and only when they are merged into a too-large segment does it write through to the real (delegate) directory. This lets you spend some RAM to reduce IO. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
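The write-through behavior described above can be sketched as a toy model (hypothetical Python, not the real NRTCachingDirectory API): newly created files stay in a RAM map, and a file crosses over to the delegate store only once it exceeds a size threshold.

```python
class CachingStore:
    """Toy write-through cache: keep small, freshly created files in RAM
    and only push a file to the (slow) backing store once it grows past
    max_cached_bytes -- e.g. when merged into a larger segment."""

    def __init__(self, backing, max_cached_bytes=1024):
        self.backing = backing          # e.g. a dict standing in for FSDirectory
        self.cache = {}                 # RAM-resident files
        self.max_cached_bytes = max_cached_bytes

    def write(self, name, data):
        if len(data) <= self.max_cached_bytes:
            self.cache[name] = data     # stays in RAM, no disk IO
        else:
            self.cache.pop(name, None)  # evict any cached copy
            self.backing[name] = data   # too big: write through

    def read(self, name):
        # Cache first, then the delegate store.
        return self.cache[name] if name in self.cache else self.backing.get(name)

disk = {}
store = CachingStore(disk, max_cached_bytes=8)
store.write("_0.cfs", b"tiny")      # small: cached in RAM only
store.write("_1.cfs", b"x" * 100)   # large: written through to disk
assert "_0.cfs" not in disk and "_1.cfs" in disk
```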
Re: Apache Jenkins emails
Yeah I agree... build failures should be as annoying as possible ;) Mike http://blog.mikemccandless.com On Tue, May 17, 2011 at 2:58 PM, Doron Cohen cdor...@gmail.com wrote: Hmm... wouldn't this help to ignore build failures, while current situation encourages solving them? :) I mean, unlike threading JIRA issues which is more convenient now, for build failures this would hide some info - thread title would indicate the oldest failure no. In spite of the above, if others still like to change in this way, I'll be fine with it. Doron On Sun, May 15, 2011 at 6:16 PM, Shai Erera ser...@gmail.com wrote: Well, Gmail ignores (for grouping) everything that in between brackets []. That's how we made all issue emails appear under the same thread, the status (Commented, Created, Resolved etc.) now appears in brackets. So, I think that if we put the build # in brackets, the rest of the message is the same for all failures. So instead of: [JENKINS] Lucene-Solr-tests-only-trunk - Build # 8042 - Still Failing we write [JENKINS] Lucene-Solr-tests-only-trunk - [Build # 8042] - Still Failing Or [JENKINS] [Build # 8042] Lucene-Solr-tests-only-trunk Failed Remove the word still altogether (it's redundant) and move the build number to the start of the subject. Shai On Sun, May 15, 2011 at 6:08 PM, Uwe Schindler u...@thetaphi.de wrote: It’s possible to change the header, as the mails are already customized. How should it look like (I don’t use f*g Gmail) - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de From: Shai Erera [mailto:ser...@gmail.com] Sent: Sunday, May 15, 2011 5:02 PM To: dev@lucene.apache.org Subject: Apache Jenkins emails Hi Is it possible to change the subject format of the emails Jenkins server sends? I was thinking, if we put the build # in [], all failures will be grouped under one thread (in Gmail). Since we have so many of them, it will at least collapse all of them into a single thread. 
We can still tell the failure of each email as well as the build #. What do you think? Shai - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034985#comment-13034985 ] Yonik Seeley commented on LUCENE-3092: -- bq. I can't think of any reason why you'd want to wrap another RAMDir with NRTCD? Tests? It's nice to have a test use a RAMDirectory for speed, but still follow the same code path as FSDirectory for debugging + orthogonality. AFAIK, most Solr tests use RAMDirectory by default. There's no benefit to restricting it, right?
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034989#comment-13034989 ] Michael McCandless commented on LUCENE-3092: That's a great point Yonik -- in fact the TestNRTCachingDirectory already relies on this generic-ness (pulls a newDirectory() from LuceneTestCase).
RE: Lucene/Solr JIRA
On 5/17/2011 at 3:02 PM, Chris Hostetter wrote: If we were starting from scratch, i'd agree with you that having a single Jira project makes more sense, but given where we are today, i think we should probably keep them distinct -- partly from a pain of migration standpoint on our end, but also from a user expectations standpoint -- i think the Solr users/community as a whole is used to the existence of the SOLR project in Jira, and used to the SOLR-* issue naming convention, and it would likely be more confusing for *them* to change now. +1
[jira] [Commented] (SOLR-2168) Velocity facet output for facet missing
[ https://issues.apache.org/jira/browse/SOLR-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034994#comment-13034994 ] Peter Wolanin commented on SOLR-2168: - Did this change to the templates get committed to the actual Solr repo? Velocity facet output for facet missing --- Key: SOLR-2168 URL: https://issues.apache.org/jira/browse/SOLR-2168 Project: Solr Issue Type: Bug Components: Response Writers Affects Versions: 3.1 Reporter: Peter Wolanin Priority: Minor Attachments: SOLR-2168.patch If I add facet.missing to the facet params for a field, the Velocity output has in the facet list: $facet.name (9220) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Apache Jenkins emails
Hmm... wouldn't this help to ignore build failures, while current situation encourages solving them? I don't think current situation encourages resolving the issues more than it would discourage if we grouped all emails together. And I don't believe people will ignore a Jenkins failure thread, if they don't ignore the separate emails today. True, for those who ignore build failures - it will help them ignore them more easily :) Those who don't ignore will continue to monitor. And from what I can tell, many failures are not due to code issues, but Jenkins server issues. Shai On Tuesday, May 17, 2011, Doron Cohen cdor...@gmail.com wrote: Hmm... wouldn't this help to ignore build failures, while current situation encourages solving them? :) I mean, unlike threading JIRA issues which is more convenient now, for build failures this would hide some info - thread title would indicate the oldest failure no. In spite of the above, if others still like to change in this way, I'll be fine with it. Doron On Sun, May 15, 2011 at 6:16 PM, Shai Erera ser...@gmail.com wrote: Well, Gmail ignores (for grouping) everything that in between brackets []. That's how we made all issue emails appear under the same thread, the status (Commented, Created, Resolved etc.) now appears in brackets. So, I think that if we put the build # in brackets, the rest of the message is the same for all failures. So instead of: [JENKINS] Lucene-Solr-tests-only-trunk - Build # 8042 - Still Failing we write [JENKINS] Lucene-Solr-tests-only-trunk - [Build # 8042] - Still Failing Or [JENKINS] [Build # 8042] Lucene-Solr-tests-only-trunk Failed Remove the word still altogether (it's redundant) and move the build number to the start of the subject. Shai On Sun, May 15, 2011 at 6:08 PM, Uwe Schindler u...@thetaphi.de wrote: It’s possible to change the header, as the mails are already customized. 
How should it look like (I don’t use f*g Gmail) - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremenhttp://www.thetaphi.de http://www.thetaphi.de/ eMail: u...@thetaphi.de From: Shai Erera [mailto:ser...@gmail.com] Sent: Sunday, May 15, 2011 5:02 PM To: dev@lucene.apache.org Subject: Apache Jenkins emails Hi Is it possible to change the subject format of the emails Jenkins server sends? I was thinking, if we put the build # in [], all failures will be grouped under one thread (in Gmail). Since we have so many of them, it will at least collapse all of them into a single thread. We can still tell the failure of each email as well as the build #. What do you think? Shai - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034999#comment-13034999 ] Fuad Efendi commented on LUCENE-2230: - I believe this issue should be closed due to significant performance improvements related to LUCENE-2089 and LUCENE-2258. I don't think there is any interest from the community to continue with this (BK Tree and Strike a Match) naive approach, although some people found it useful. Of course we might have a few more distance implementations as a separate improvement. Please close it. Thanks Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times. Key: LUCENE-2230 URL: https://issues.apache.org/jira/browse/LUCENE-2230 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 3.0 Environment: Lucene currently uses a brute force full-terms scanner and calculates distance for each term. The new BKTree structure improves performance on average 20 times when distance is 1, and 3 times when distance is 3. I tested with an index of several million docs, and 250,000 terms. The new algo uses integer distances between objects. Reporter: Fuad Efendi Attachments: BKTree.java, Distance.java, DistanceImpl.java, FuzzyTermEnumNEW.java, FuzzyTermEnumNEW.java Original Estimate: 1m Remaining Estimate: 1m W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM, 1973 http://portal.acm.org/citation.cfm?doid=362003.362025 I was inspired by http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick Johnson, Google). Additionally, the simplified algorithm at http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more logically correct than Levenshtein distance, and it is 3-5 times faster (isolated tests). Big list of distance implementations: http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
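For readers unfamiliar with the Burkhard-Keller structure, here is a minimal illustrative sketch (a Python toy, unrelated to the attached Java files): each edge is labeled with the distance between parent and child term, and the triangle inequality lets a search at radius d skip any subtree whose edge label lies outside [dist - d, dist + d].

```python
def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, words):
        it = iter(words)
        self.root = [next(it), {}]  # node = [term, {edge_distance: child}]
        for w in it:
            self.add(w)

    def add(self, word):
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d == 0:
                return  # already present
            child = node[1].get(d)
            if child is None:
                node[1][d] = [word, {}]
                return
            node = child

    def search(self, word, max_dist):
        """All stored terms within max_dist of word; the triangle
        inequality prunes subtrees that cannot contain a match."""
        out, stack = [], [self.root]
        while stack:
            term, children = stack.pop()
            d = edit_distance(word, term)
            if d <= max_dist:
                out.append(term)
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return out

tree = BKTree(["book", "books", "cake", "boo", "cape"])
assert sorted(tree.search("bo", 2)) == ["boo", "book"]
```

Unlike the brute-force full-terms scan, only a fraction of the tree is visited per query, which is where the reported 3-20x speedup comes from.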
Re: Apache Jenkins emails
On Tue, May 17, 2011 at 03:09:31PM -0400, Michael McCandless wrote: Yeah I agree... build failures should be as annoying as possible ;) Congratulations -- mission accomplished! They are certainly annoying to me, and probably to anyone else subscribed to dev who isn't a committer. Marvin Humphrey - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3084) MergePolicy.OneMerge.segments should be List&lt;SegmentInfo&gt; not SegmentInfos
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3084: -- Attachment: LUCENE-3084-trunk-only.patch Now I improved SegmentInfos more: - It now uses a Map/Set to enforce that the SI contains each segment only once. - Faster contains() because it is Set-backed As said before: asList() and asSet() are unmodifiable, so consistency between List and Set/Map is enforced. The Set is itself a Map&lt;SI,Integer&gt;. The values contain the index of each segment in the infos. This speeds up indexOf() calls, needed for asserts and remove(SI). Since on remove or reorder operations the indexes are no longer correct, a separate boolean is used to mark the Map as inconsistent. It is then regenerated on the next indexOf() call. indexOf() is called seldom, but the keySet() is still consistent, so delaying this update is fine. All tests pass. I think the cleanup of SegmentInfos is ready to commit. MergePolicy.OneMerge.segments should be List&lt;SegmentInfo&gt; not SegmentInfos -- Key: LUCENE-3084 URL: https://issues.apache.org/jira/browse/LUCENE-3084 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084.patch SegmentInfos carries a bunch of fields beyond the list of SI, but for merging purposes these fields are unused. We should cutover to List&lt;SI&gt; instead. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
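The lazy regeneration trick Uwe describes can be illustrated with a small analogue (hypothetical Python, not the actual SegmentInfos code): removals only mark the position map dirty, membership checks stay O(1) throughout, and the next indexOf() call rebuilds the positions.

```python
class IndexedList:
    """List with an auxiliary {item: position} map for O(1) index_of().
    Removals just set a dirty flag; the map is rebuilt lazily on the
    next index_of() call, since membership tests (the common case)
    need the key set but never the positions."""

    def __init__(self, items):
        self._items = list(items)
        self._pos = {x: i for i, x in enumerate(self._items)}
        self._dirty = False

    def __contains__(self, item):
        return item in self._pos      # keys stay consistent, always valid

    def remove(self, item):
        self._items.remove(item)
        del self._pos[item]           # keep the key set consistent...
        self._dirty = True            # ...but positions may now be stale

    def index_of(self, item):
        if self._dirty:               # regenerate positions on demand
            self._pos = {x: i for i, x in enumerate(self._items)}
            self._dirty = False
        return self._pos.get(item, -1)

infos = IndexedList(["_0", "_1", "_2"])
infos.remove("_0")
assert "_1" in infos and infos.index_of("_2") == 1
```

The design choice mirrors the comment above: consistency of the key set is maintained eagerly, while the rarely needed index values are recomputed only when asked for.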
[jira] [Commented] (LUCENE-3113) fix analyzer bugs found by MockTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035007#comment-13035007 ] Robert Muir commented on LUCENE-3113: - thanks for reviewing Steven, I agree! I've made this change and will commit shortly.
Re: Apache Jenkins emails
On Tue, May 17, 2011 at 3:38 PM, Marvin Humphrey mar...@rectangular.com wrote: On Tue, May 17, 2011 at 03:09:31PM -0400, Michael McCandless wrote: Yeah I agree... build failures should be as annoying as possible ;) Congratulations -- mission accomplished! They are certainly annoying to me, and probably to anyone else subscribed to dev who isn't a committer. Marvin, I'm not sure you can really assume that. If a test fails anyone who wants to contribute can look at the failure and try to create a jira issue/patch, I don't think they need to be a committer. Additionally due to the nature of our tests, anyone who wants to contribute to the project can simply download the tests and try to find failures, opening jira issues for ones that they find (for example selckin does this, and has found a lot of good ones lately). If you don't care about tests at all, you can easily filter this stuff with your email client by looking for [JENKINS]. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2168) Velocity facet output for facet missing
[ https://issues.apache.org/jira/browse/SOLR-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035015#comment-13035015 ] Erik Hatcher commented on SOLR-2168: Alas not, Peter. Sorry.
[jira] [Issue Comment Edited] (SOLR-2168) Velocity facet output for facet missing
[ https://issues.apache.org/jira/browse/SOLR-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035015#comment-13035015 ] Erik Hatcher edited comment on SOLR-2168 at 5/17/11 8:05 PM: - Alas not yet, Peter. Sorry. was (Author: ehatcher): Alas not, Peter. Sorry.