[Lucene.Net] [jira] [Commented] (LUCENENET-412) Replacing ArrayLists, Hashtables etc. with appropriate Generics.

2011-05-17 Thread Digy (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENENET-412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035092#comment-13035092
 ] 

Digy commented on LUCENENET-412:


One more sample
{code}
From:
class AnonymousFilterCache : FilterCache
{
    class AnonymousFilteredDocIdSet : FilteredDocIdSet
    {
        IndexReader r;

        public AnonymousFilteredDocIdSet(DocIdSet innerSet, IndexReader r) : base(innerSet)
        {
            this.r = r;
        }

        public override bool Match(int docid)
        {
            return !r.IsDeleted(docid);
        }
    }

    public AnonymousFilterCache(DeletesMode deletesMode) : base(deletesMode)
    {
    }

    protected override object MergeDeletes(IndexReader reader, object docIdSet)
    {
        return new AnonymousFilteredDocIdSet((DocIdSet)docIdSet, reader);
    }
}
...
cache = new AnonymousFilterCache(deletesMode);


To:
cache = new FilterCacheDocIdSet(deletesMode,
    (reader, docIdSet) =>
    {
        return new FilteredDocIdSet((DocIdSet)docIdSet,
            (docid) =>
            {
                return !reader.IsDeleted(docid);
            });
    });
{code}

DIGY
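The before/after above is an instance of a general pattern: a one-method anonymous subclass collapses into a lambda passed where a delegate is expected. A minimal self-contained sketch of that pattern, in Java with hypothetical names (`FilterDemo` and `filter()` merely stand in for `FilteredDocIdSet`; none of this is actual Lucene or Lucene.Net API):

```java
import java.util.Arrays;
import java.util.function.IntPredicate;

// Sketch of the refactoring pattern: the named anonymous subclass
// collapses into a lambda passed as a delegate. FilterDemo, filter(),
// and the array-based doc id set are illustrative stand-ins only.
public class FilterDemo {
    // Stands in for FilteredDocIdSet: keeps only doc ids the predicate accepts.
    public static int[] filter(int[] docIds, IntPredicate match) {
        return Arrays.stream(docIds).filter(match).toArray();
    }

    public static void main(String[] args) {
        boolean[] deleted = new boolean[16];
        deleted[3] = true;
        // The lambda plays the role of AnonymousFilteredDocIdSet.Match():
        int[] live = filter(new int[] {1, 3, 5}, docid -> !deleted[docid]);
        System.out.println(Arrays.toString(live)); // [1, 5]
    }
}
```

The lambda captures the `deleted` state just as the anonymous class carried its `IndexReader r` field.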

 Replacing ArrayLists, Hashtables etc. with appropriate Generics.
 

 Key: LUCENENET-412
 URL: https://issues.apache.org/jira/browse/LUCENENET-412
 Project: Lucene.Net
  Issue Type: Improvement
Affects Versions: Lucene.Net 2.9.4
Reporter: Digy
Priority: Minor
 Fix For: Lucene.Net 2.9.4

 Attachments: IEquatable for QuerySubclasses.patch, 
 LUCENENET-412.patch, lucene_2.9.4g_exceptions_fix


 This will move Lucene.Net.2.9.4 closer to lucene.3.0.3 and allow some 
 performance gains.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: [Lucene.Net] [jira] [Commented] (LUCENENET-412) Replacing ArrayLists, Hashtables etc. with appropriate Generics.

2011-05-17 Thread Rory Plaire
This is a great improvement, but why not also remove the braces and returns?


var cache = new FilterCacheDocIdSet(deletesMode,
  (reader, docIdSet) => new FilteredDocIdSet(
   (DocIdSet)docIdSet, docid => !reader.IsDeleted(docid)));


On Tue, May 17, 2011 at 3:01 PM, Digy (JIRA) j...@apache.org wrote:




[jira] [Resolved] (LUCENE-3090) DWFlushControl does not take active DWPT out of the loop on fullFlush

2011-05-17 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer resolved LUCENE-3090.
-

   Resolution: Fixed
Lucene Fields: [New, Patch Available]  (was: [New])

Committed in revision 1104026.

 DWFlushControl does not take active DWPT out of the loop on fullFlush
 -

 Key: LUCENE-3090
 URL: https://issues.apache.org/jira/browse/LUCENE-3090
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/index
Affects Versions: 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Critical
 Fix For: 4.0

 Attachments: LUCENE-3090.patch, LUCENE-3090.patch, LUCENE-3090.patch


 We have seen several OOM errors on TestNRTThreads, and all of them are caused 
 by DWFlushControl missing DWPTs that are set as flushPending but can't flush 
 due to a full flush going on. Yet that means that those DWPTs are filling up 
 in the background while they should actually be checked out and blocked until 
 the full flush finishes. Furthermore, we currently stall on 
 maxNumThreadStates while we should stall on the number of active thread 
 states. I will attach a patch tomorrow.




[jira] [Commented] (LUCENE-2736) Wrong implementation of DocIdSetIterator.advance

2011-05-17 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034618#comment-13034618
 ] 

Doron Cohen commented on LUCENE-2736:
-

Shai, with the modified text, the NOTE about implementations' freedom not to 
advance beyond the current doc in some situations becomes strange... I think 
the original text stresses that the real intended behavior is to advance 
beyond the current doc; it is just that, for performance reasons, the decision 
whether to advance beyond in some situations is left to the implementation, 
and so, if the caller provides a target which is not greater than the current 
doc, it should be aware of this possibility.

So I think it is perhaps better to either not modify this at all, or at most, 
to add "(see NOTE below)" just after "beyond":

{noformat}
-   * Advances to the first beyond the current whose document number is greater
+   * Advances to the first beyond (see NOTE below) the current whose document 
number is greater
{noformat}

This would prevent the confusion, I think?

 Wrong implementation of DocIdSetIterator.advance 
 -

 Key: LUCENE-2736
 URL: https://issues.apache.org/jira/browse/LUCENE-2736
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/search
Affects Versions: 3.2, 4.0
Reporter: Hardy Ferentschik
Assignee: Shai Erera
 Attachments: LUCENE-2736.patch


 Implementations of {{DocIdSetIterator}} behave differently when advanced is 
 called. Taking the following test for {{OpenBitSet}}, {{DocIdBitSet}} and 
 {{SortedVIntList}} only {{SortedVIntList}} passes the test:
 {code:title=org.apache.lucene.search.TestDocIdSet.java|borderStyle=solid}
 ...
   public void testAdvanceWithOpenBitSet() throws IOException {
     DocIdSet idSet = new OpenBitSet( new long[] { 1121 }, 1 );  // bits 0, 5, 6, 10
     assertAdvance( idSet );
   }

   public void testAdvanceDocIdBitSet() throws IOException {
     BitSet bitSet = new BitSet();
     bitSet.set( 0 );
     bitSet.set( 5 );
     bitSet.set( 6 );
     bitSet.set( 10 );
     DocIdSet idSet = new DocIdBitSet(bitSet);
     assertAdvance( idSet );
   }

   public void testAdvanceWithSortedVIntList() throws IOException {
     DocIdSet idSet = new SortedVIntList( 0, 5, 6, 10 );
     assertAdvance( idSet );
   }

   private void assertAdvance(DocIdSet idSet) throws IOException {
     DocIdSetIterator iter = idSet.iterator();
     int docId = iter.nextDoc();
     assertEquals( "First doc id should be 0", 0, docId );
     docId = iter.nextDoc();
     assertEquals( "Second doc id should be 5", 5, docId );
     docId = iter.advance( 5 );
     assertEquals( "Advancing iterator should return the next doc id", 6, docId );
   }
 {code}
 The javadoc for {{advance}} says:
 {quote}
 Advances to the first *beyond* the current whose document number is greater 
 than or equal to _target_.
 {quote}
 This seems to indicate that {{SortedVIntList}} behaves correctly, whereas the 
 other two don't. 
 Just looking at the {{DocIdBitSet}} implementation advance is implemented as:
 {code}
 bitSet.nextSetBit(target);
 {code}
 where the docs of {{nextSetBit}} say:
 {quote}
 Returns the index of the first bit that is set to true that occurs *on or 
 after* the specified starting index
 {quote}




[jira] [Created] (LUCENE-3108) Land DocValues on trunk

2011-05-17 Thread Simon Willnauer (JIRA)
Land DocValues on trunk
---

 Key: LUCENE-3108
 URL: https://issues.apache.org/jira/browse/LUCENE-3108
 Project: Lucene - Java
  Issue Type: Task
  Components: core/index, core/search, core/store
Affects Versions: CSF branch, 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
 Fix For: 4.0


It's time to move another feature from branch to trunk. I want to start this 
process now while a couple of issues still remain on the branch. Currently I am 
down to a single nocommit (javadocs on DocValues.java) and a couple of testing 
TODOs (explicit multithreaded tests and unoptimized with deletions), but I think 
those are not worth separate issues so we can resolve them as we go. 
The already created issues (LUCENE-3075 and LUCENE-3074) should not block this 
process here IMO; we can fix them once we are on trunk. 

Here is a quick feature overview of what has been implemented:
 * DocValues implementations for Ints (based on PackedInts), Float 32 / 64, 
Bytes (fixed / variable size each in sorted, straight and deref variations)
 * Integration into Flex-API, Codec provides a 
PerDocConsumer-DocValuesConsumer (write) / PerDocValues-DocValues (read) 
 * Enabled by default in all codecs except PreFlex
 * Follows other flex-API patterns, e.g. non-segment readers throw UOE, forcing 
MultiPerDocValues on a DirectoryReader, etc.
 * Integration into IndexWriter, FieldInfos etc.
 * Random-testing enabled via RandomIW - injecting random DocValues into 
documents
 * Basic checks in CheckIndex (which runs after each test)
 * FieldComparator for int and float variants (Sorting, currently directly 
integrated into SortField, this might go into a separate DocValuesSortField 
eventually)
 * Extended TestSort for DocValues
 * RAM-Resident random access API plus on-disk DocValuesEnum (currently only 
sequential access) - Source.java / DocValuesEnum.java
 * Extensible Cache implementation for RAM-Resident DocValues (by-default 
loaded into RAM only once and freed once IR is closed) - SourceCache.java
 
PS: Currently the RAM-resident API is named Source (Source.java), which seems 
too generic. I think we should rename it to RamDocValues or something like 
that; suggestions welcome!


Any comments, questions (rants :)) are very much appreciated.




Re: Moving towards Lucene 4.0

2011-05-17 Thread Simon Willnauer
On Mon, May 16, 2011 at 5:24 PM, Shai Erera ser...@gmail.com wrote:
 We anyway seem to mark every new API as @lucene.experimental these days, so
 we shouldn't have too much problem when 4.0 is out :).

 Experimental API is subject to change at any time. We can consider that as
 an option as well (maybe it adds another option to Robert's?).

 Though personally, I'm not a big fan of this notion - I think we deceive
 ourselves and users when we have @experimental on a stable branch. Any
 @experimental API on trunk today falls into this bucket after 4.0 is out.
 And I'm sure there are a couple in 3.x already.

 Don't get me wrong - I don't suggest we should stop using it. But I think we
 should consider to review the @experimental API before every stable
 release, and reduce it over time, not increase it.

+1

 Shai

 On Mon, May 16, 2011 at 4:20 PM, Robert Muir rcm...@gmail.com wrote:

 On Mon, May 16, 2011 at 9:12 AM, Simon Willnauer
 simon.willna...@googlemail.com wrote:
  I have to admit that branch is very rough and the API is super hard to
  use. For now!
  Let's not be dragged into a discussion of how this API should look;
  there will be time for that.

 +1, this is what i really meant by decide how to handle. I don't
 think we will be able to quickly decide how to fix the branch
 itself, i think its really complicated. But we can admit its really
 complicated and won't be solved very soon, and try to figure out a
 release strategy with this in mind.

 (p.s. sorry simon, you got two copies of this message i accidentally
 hit reply instead of reply-all)




[jira] [Created] (LUCENE-3109) Rename FieldsConsumer to InvertedFieldsConsumer

2011-05-17 Thread Simon Willnauer (JIRA)
Rename FieldsConsumer to InvertedFieldsConsumer
---

 Key: LUCENE-3109
 URL: https://issues.apache.org/jira/browse/LUCENE-3109
 Project: Lucene - Java
  Issue Type: Task
  Components: core/codecs
Affects Versions: 4.0
Reporter: Simon Willnauer
Priority: Minor
 Fix For: 4.0


The name FieldsConsumer is misleading; here it really is an 
InvertedFieldsConsumer, and since we are extending codecs to consume 
non-inverted fields we should be clear here. The same applies to Fields.java as 
well as FieldsProducer.




[jira] [Created] (LUCENE-3110) ASCIIFoldingFilter wrongly folds german Umlauts

2011-05-17 Thread Michael Gaber (JIRA)
ASCIIFoldingFilter wrongly folds german Umlauts
---

 Key: LUCENE-3110
 URL: https://issues.apache.org/jira/browse/LUCENE-3110
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.1
Reporter: Michael Gaber


The German umlauts are currently mapped as follows:

Ä/ä -> A/a
Ö/ö -> O/o
Ü/ü -> U/u

The correct mapping would be:

Ä/ä -> Ae/ae
Ö/ö -> Oe/oe
Ü/ü -> Ue/ue

so the corresponding rows in the switch statement should be moved down to the 
ae/oe/ue positions.
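A sketch of the proposed folding as a standalone function (an illustration only; the actual fix edits ASCIIFoldingFilter's large switch statement in place):

```java
// Sketch of the proposed German umlaut folding: one input char expands to
// two output chars. Standalone illustration, not the ASCIIFoldingFilter patch.
public class UmlautFold {
    public static String fold(String in) {
        StringBuilder out = new StringBuilder();
        for (char c : in.toCharArray()) {
            switch (c) {
                case 'Ä': out.append("Ae"); break;
                case 'ä': out.append("ae"); break;
                case 'Ö': out.append("Oe"); break;
                case 'ö': out.append("oe"); break;
                case 'Ü': out.append("Ue"); break;
                case 'ü': out.append("ue"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("Müller")); // Mueller
    }
}
```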




[jira] [Created] (LUCENE-3111) TestFSTs.testRandomWords failure

2011-05-17 Thread selckin (JIRA)
TestFSTs.testRandomWords failure


 Key: LUCENE-3111
 URL: https://issues.apache.org/jira/browse/LUCENE-3111
 Project: Lucene - Java
  Issue Type: Bug
Reporter: selckin
Priority: Minor


Was running some while(1) tests on the docvalues branch (r1103705) and the 
following test failed:

{code}
[junit] Testsuite: org.apache.lucene.util.automaton.fst.TestFSTs
[junit] Testcase: 
testRandomWords(org.apache.lucene.util.automaton.fst.TestFSTs):   FAILED
[junit] expected:<771> but was:<TwoLongs:771,771>
[junit] junit.framework.AssertionFailedError: expected:<771> but 
was:<TwoLongs:771,771>
[junit] at 
org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.verifyUnPruned(TestFSTs.java:540)
[junit] at 
org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.doTest(TestFSTs.java:496)
[junit] at 
org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.doTest(TestFSTs.java:359)
[junit] at 
org.apache.lucene.util.automaton.fst.TestFSTs.doTest(TestFSTs.java:319)
[junit] at 
org.apache.lucene.util.automaton.fst.TestFSTs.testRandomWords(TestFSTs.java:940)
[junit] at 
org.apache.lucene.util.automaton.fst.TestFSTs.testRandomWords(TestFSTs.java:915)
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
[junit] 
[junit] 
[junit] Tests run: 7, Failures: 1, Errors: 0, Time elapsed: 7.628 sec
[junit] 
[junit] - Standard Error -
[junit] NOTE: Ignoring nightly-only test method 'testBigSet'
[junit] NOTE: reproduce with: ant test -Dtestcase=TestFSTs 
-Dtestmethod=testRandomWords -Dtests.seed=-269475578956012681:0
[junit] NOTE: test params are: codec=PreFlex, locale=ar, 
timezone=America/Blanc-Sablon
[junit] NOTE: all tests run in this JVM:
[junit] [TestToken, TestCodecs, TestIndexReaderReopen, 
TestIndexWriterMerging, TestNoDeletionPolicy, TestParallelReaderEmptyIndex, 
TestParallelTermEnum, TestPerSegmentDeletes, TestSegmentReader, 
TestSegmentTermDocs, TestStressAdvance, TestTermVectorsReader, TestSurrogates, 
TestMultiFieldQueryParser, TestAutomatonQuery, TestBooleanScorer, 
TestFuzzyQuery, TestMultiTermConstantScore, TestNumericRangeQuery64, 
TestPositiveScoresOnlyCollector, TestPrefixFilter, TestQueryTermVector, 
TestScorerPerf, TestSloppyPhraseQuery, TestSpansAdvanced, TestWindowsMMap, 
TestRamUsageEstimator, TestSmallFloat, TestUnicodeUtil, TestFSTs]
[junit] NOTE: Linux 2.6.37-gentoo amd64/Sun Microsystems Inc. 1.6.0_25 
(64-bit)/cpus=8,threads=1,free=137329960,total=208207872
[junit] -  ---
[junit] TEST org.apache.lucene.util.automaton.fst.TestFSTs FAILED
{code}

I am not able to reproduce it.




[jira] [Commented] (LUCENE-3105) String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names

2011-05-17 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034639#comment-13034639
 ] 

Earwin Burrfoot commented on LUCENE-3105:
-

StringInterner is in fact faster than CHM. And it is compatible with 
String.intern(), i.e. it returns the same String instances. It also won't eat 
up memory if spammed with numerous unique strings (which is a strange feature, 
but people requested it).
In Lucene 4.0 all of this is moot anyway; fields there are strongly separated 
and intern() is not used.

 String.intern() calls slow down IndexWriter.close() and IndexReader.open() 
 for index with large number of unique field names
 

 Key: LUCENE-3105
 URL: https://issues.apache.org/jira/browse/LUCENE-3105
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/index
Affects Versions: 3.1
Reporter: Mark Kristensson
 Attachments: LUCENE-3105.patch


 We have one index with several hundred thousand unique field names (we're 
 optimistic that Lucene 4.0 is flexible enough to allow us to change our index 
 design...) and found that opening an index writer and closing an index reader 
 results in horribly slow performance on that one index. I have isolated the 
 problem down to the calls to String.intern() that are used to allow for quick 
 string comparisons of field names throughout Lucene. These String.intern() 
 calls are unnecessary and can be replaced with a hashmap lookup. In fact, 
 StringHelper.java has its own hashmap implementation that it uses in 
 conjunction with String.intern(). Rather than using a one-off hashmap, I've 
 elected to use a ConcurrentHashMap in this patch.
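The ConcurrentHashMap replacement described above can be sketched as follows. This is an illustration of the idea, not the attached patch; `MapInterner` is a hypothetical name. It keeps the "same instance for equal strings" property that callers of String.intern() rely on:

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a ConcurrentHashMap-backed interner, as the description
// suggests; an illustration, not the attached LUCENE-3105.patch.
public class MapInterner {
    private final ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();

    public String intern(String s) {
        // putIfAbsent returns the previously mapped instance, if any,
        // so equal strings always intern to the same object.
        String prev = map.putIfAbsent(s, s);
        return prev == null ? s : prev;
    }

    public static void main(String[] args) {
        MapInterner interner = new MapInterner();
        String a = interner.intern(new String("title"));
        String b = interner.intern(new String("title"));
        System.out.println(a == b); // true: same instance for equal strings
    }
}
```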




[jira] [Commented] (LUCENE-3105) String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names

2011-05-17 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034640#comment-13034640
 ] 

Earwin Burrfoot commented on LUCENE-3105:
-

Hmm.. Ok, it *is* still used, but that's gonna be fixed, mm?

 String.intern() calls slow down IndexWriter.close() and IndexReader.open() 
 for index with large number of unique field names
 

 Key: LUCENE-3105
 URL: https://issues.apache.org/jira/browse/LUCENE-3105
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/index
Affects Versions: 3.1
Reporter: Mark Kristensson
 Attachments: LUCENE-3105.patch






[jira] [Commented] (LUCENE-3100) IW.commit() writes but fails to fsync the N.fnx file

2011-05-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034642#comment-13034642
 ] 

Michael McCandless commented on LUCENE-3100:


Patch looks good Simon!

 IW.commit() writes but fails to fsync the N.fnx file
 

 Key: LUCENE-3100
 URL: https://issues.apache.org/jira/browse/LUCENE-3100
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Simon Willnauer
 Fix For: 4.0

 Attachments: LUCENE-3100.patch


 In making a unit test for NRTCachingDir (LUCENE-3092) I hit this surprising 
 bug!
 Because the new N.fnx file is written at the last minute along with the 
 segments file, it's not included in the sis.files() that IW uses to figure 
 out which files to sync.
 This bug means one could call IW.commit(), successfully, return, and then the 
 machine could crash and when it comes back up your index could be corrupted.
 We should hopefully first fix TestCrash so that it hits this bug (maybe it 
 needs more/better randomization?), then fix the bug.




[jira] [Updated] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir

2011-05-17 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-3092:
---

Attachment: LUCENE-3092.patch

New patch, fixes the issue Simon hit (was just a bug in the test -- it was 
using a silly MergePolicy that ignored partial optimize).

Test now passes w/ the patch from LUCENE-3100.

I think this is ready to commit, after LUCENE-3100 is in.

 NRTCachingDirectory, to buffer small segments in a RAMDir
 -

 Key: LUCENE-3092
 URL: https://issues.apache.org/jira/browse/LUCENE-3092
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/store
Reporter: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch, 
 LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch


 I created this simple Directory impl, whose goal is to reduce IO
 contention in a frequent-reopen NRT use case.
 The idea is, when reopening quickly, but not indexing that much
 content, you wind up with many small files created over time, that can
 possibly stress the IO system, e.g. if merges and searching are also
 fighting for IO.
 So, NRTCachingDirectory puts these newly created files into a RAMDir,
 and only when they are merged into a too-large segment does it then
 write through to the real (delegate) directory.
 This lets you spend some RAM to reduce IO.
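The caching decision can be sketched independently of Lucene's Directory API (all names below are illustrative, not NRTCachingDirectory's actual code): small newly flushed files stay in RAM, anything over a size threshold goes straight to the delegate.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the write-through idea. Class and method names are
// illustrative; a Map stands in for the real delegate directory.
public class CachingSketch {
    private final Map<String, byte[]> ramFiles = new HashMap<>();
    private final Map<String, byte[]> diskFiles = new HashMap<>();
    private final int maxCachedBytes;

    public CachingSketch(int maxCachedBytes) {
        this.maxCachedBytes = maxCachedBytes;
    }

    public void writeFile(String name, byte[] contents) {
        if (contents.length <= maxCachedBytes) {
            ramFiles.put(name, contents);   // small: keep in RAM, no disk IO
        } else {
            diskFiles.put(name, contents);  // too large: write through
        }
    }

    public boolean inRam(String name) {
        return ramFiles.containsKey(name);
    }
}
```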




[jira] [Commented] (LUCENE-3108) Land DocValues on trunk

2011-05-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034645#comment-13034645
 ] 

Michael McCandless commented on LUCENE-3108:


+1, excellent!

 Land DocValues on trunk
 ---

 Key: LUCENE-3108
 URL: https://issues.apache.org/jira/browse/LUCENE-3108
 Project: Lucene - Java
  Issue Type: Task
  Components: core/index, core/search, core/store
Affects Versions: CSF branch, 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
 Fix For: 4.0






[jira] [Resolved] (LUCENE-3100) IW.commit() writes but fails to fsync the N.fnx file

2011-05-17 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer resolved LUCENE-3100.
-

Resolution: Fixed

Committed in revision 1104090.

 IW.commit() writes but fails to fsync the N.fnx file
 

 Key: LUCENE-3100
 URL: https://issues.apache.org/jira/browse/LUCENE-3100
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Simon Willnauer
 Fix For: 4.0

 Attachments: LUCENE-3100.patch






[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir

2011-05-17 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034661#comment-13034661
 ] 

Simon Willnauer commented on LUCENE-3092:
-

Mike I committed LUCENE-3100 you can go ahead :)

 NRTCachingDirectory, to buffer small segments in a RAMDir
 -

 Key: LUCENE-3092
 URL: https://issues.apache.org/jira/browse/LUCENE-3092
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/store
Reporter: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch, 
 LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch






[jira] [Commented] (LUCENE-1421) Ability to group search results by field

2011-05-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034691#comment-13034691
 ] 

Michael McCandless commented on LUCENE-1421:


I added grouping queries to the nightly benchmarks
(http://people.apache.org/~mikemccand/lucenebench) -- see
TermGroup100/10K/1M.  The F annotation is the day grouping queries
first ran.

Those queries are the same queries running as TermQuery, just with
grouping turned on, over 3 randomly generated fields with 100, 10,000
and 1 million unique values.  So we can gauge the perf hit by
comparing to TermQuery each night.

I use the CachingCollector.

First off, I'm impressed that the perf hit for grouping is not too
bad:

||Query||QPS||Slowdown||
|TermQuery (baseline)|30.72|0|
|TermGroup100|13.59|2.26|
|TermGroup10K|13.2|2.34|
|TermGroup1M|12.15|2.53|

I had expected we'd pay a bigger perf hit!

Second, the more unique groups you have, the slower grouping gets,
but the multiplier really isn't so bad -- the 1M unique groups case
is only 10.6% slower than the 100 unique groups case.

Remember, though, that these groups are randomly generated
full-unicode strings, so real data could very well produce different
results...

Third, and this is insanity, the addition of grouping caused other
unexpected changes.  Most horribly, SpanNearQuery slowed down
by ~12.2%
(http://people.apache.org/~mikemccand/lucenebench/SpanNear.html),
while other queries seem to get a bit faster.  I think this is
[frustratingly!] due to hotspot making different decisions about which
code to optimize/inline.

Similarly strange: when I added sorting (TermQuery sorting by title
and date/time, the E annotation in all graphs), I saw the variance in
the unsorted TermQuery performance drop substantially.  I'm pretty
sure this wide variance was due to hotspot's erratic decision making,
but somehow the addition of sorting, while not changing TermQuery's mean
QPS, caused hotspot to at least be somewhat more consistent in how it
compiled the code.  Maybe as we add more and more diverse queries to
the benchmark we'll see hotspot behave more reasonably.


 Ability to group search results by field
 

 Key: LUCENE-1421
 URL: https://issues.apache.org/jira/browse/LUCENE-1421
 Project: Lucene - Java
  Issue Type: New Feature
  Components: core/search
Reporter: Artyom Sokolov
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-1421.patch, LUCENE-1421.patch, 
 lucene-grouping.patch


 It would be awesome to group search results by a specified field. Some 
 functionality was provided for Apache Solr, but I think it should be done in 
 Core Lucene. There could be some useful information about the collapsed 
 data, like the total hit count and so on.
 Thanks,
 Artyom




[jira] [Created] (LUCENE-3112) Add IW.add/updateDocuments to support nested documents

2011-05-17 Thread Michael McCandless (JIRA)
Add IW.add/updateDocuments to support nested documents
--

 Key: LUCENE-3112
 URL: https://issues.apache.org/jira/browse/LUCENE-3112
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0


I think nested documents (LUCENE-2454) is a very compelling addition
to Lucene.  It's also a popular (many votes) issue.

Beyond supporting nested document querying, which is already an
incredible addition since it preserves the relational model on
indexing normalized content (eg, DB tables, XML docs), LUCENE-2454
should also enable speedups in grouping implementation when you group
by a nested field.

For the same reason, it can also enable very fast post-group facet
counting impl (LUCENE-3097) when you want to
count(distinct(nestedField)), instead of unique documents, as your
identifier.  I expect many apps that use faceting need this ability
(to count(distinct(nestedField)), not distinct(docID)).

To support these use cases, I believe the only core change needed is
the ability to atomically add or update multiple documents, which you
cannot do today since in between add/updateDocument calls a flush (eg
due to commit or getReader()) could occur.

This new API (addDocuments(Iterable<Document>), updateDocuments(Term
delTerm, Iterable<Document>)) would also further guarantee that the
documents are assigned sequential docIDs in the order the iterator
provided them, and that the docIDs all reside in one segment.

Segment merging never splits segments apart, so this invariant would
hold even as merges/optimizes take place.
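The atomicity guarantee above can be illustrated with a toy model: all documents from one addDocuments(...) call get consecutive IDs, because a flush can never interleave with the iteration. This is not Lucene's IndexWriter; the class and its Document stand-in (plain String) are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of atomic multi-document add: addDocuments and flushSegment
// synchronize on the same lock, so a flush (e.g. from commit or
// getReader) can only happen between addDocuments calls, never inside
// one -- hence the block's doc IDs are sequential and co-resident.
class ToyWriter {
    private final List<String> pending = new ArrayList<>();
    private int nextDocId = 0;

    // Atomically adds a block of documents; returns the assigned IDs.
    synchronized List<Integer> addDocuments(Iterable<String> docs) {
        List<Integer> ids = new ArrayList<>();
        for (String doc : docs) {
            pending.add(doc);
            ids.add(nextDocId++);   // sequential; no flush can intervene
        }
        return ids;
    }

    // Drains the pending block into a new "segment".
    synchronized List<String> flushSegment() {
        List<String> segment = new ArrayList<>(pending);
        pending.clear();
        return segment;
    }
}
```

A nested-document query layer could then rely on a parent and its children occupying a contiguous docID range in one segment.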





[jira] [Updated] (LUCENE-3112) Add IW.add/updateDocuments to support nested documents

2011-05-17 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-3112:
---

Attachment: LUCENE-3112.patch

Initial patch.

It's not done yet (needs tests, and the nocommit needs to be addressed).

 Add IW.add/updateDocuments to support nested documents
 --

 Key: LUCENE-3112
 URL: https://issues.apache.org/jira/browse/LUCENE-3112
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3112.patch


 I think nested documents (LUCENE-2454) is a very compelling addition
 to Lucene.  It's also a popular (many votes) issue.
 Beyond supporting nested document querying, which is already an
 incredible addition since it preserves the relational model on
 indexing normalized content (eg, DB tables, XML docs), LUCENE-2454
 should also enable speedups in grouping implementation when you group
 by a nested field.
 For the same reason, it can also enable very fast post-group facet
 counting impl (LUCENE-3097) when you want to
 count(distinct(nestedField)), instead of unique documents, as your
 identifier.  I expect many apps that use faceting need this ability
 (to count(distinct(nestedField)), not distinct(docID)).
 To support these use cases, I believe the only core change needed is
 the ability to atomically add or update multiple documents, which you
 cannot do today since in between add/updateDocument calls a flush (eg
 due to commit or getReader()) could occur.
 This new API (addDocuments(Iterable<Document>), updateDocuments(Term
 delTerm, Iterable<Document>)) would also further guarantee that the
 documents are assigned sequential docIDs in the order the iterator
 provided them, and that the docIDs all reside in one segment.
 Segment merging never splits segments apart, so this invariant would
 hold even as merges/optimizes take place.




[jira] [Commented] (LUCENE-2454) Nested Document query support

2011-05-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034702#comment-13034702
 ] 

Michael McCandless commented on LUCENE-2454:


I think this is a very important addition to Lucene, so let's get this
done!

I just opened LUCENE-3112, to add IW.add/updateDocuments, which would
atomically add the Documents produced by an iterator, and ensure they all
wind up in the same segment.  I think this is the only core change
necessary for this feature?  Ie, all else can be built on top of Lucene
once LUCENE-3112 is committed?


 Nested Document query support
 -

 Key: LUCENE-2454
 URL: https://issues.apache.org/jira/browse/LUCENE-2454
 Project: Lucene - Java
  Issue Type: New Feature
  Components: core/search
Affects Versions: 3.0.2
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: LuceneNestedDocumentSupport.zip


 A facility for querying nested documents in a Lucene index as outlined in 
 http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene




[jira] [Commented] (LUCENE-3110) ASCIIFoldingFilter wrongly folds german Umlauts

2011-05-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034709#comment-13034709
 ] 

Robert Muir commented on LUCENE-3110:
-

Hi,

these characters are not German umlauts; they are Unicode characters used
by a number of languages. The purpose of ASCIIFoldingFilter is to do simple 
accent-stripping.

 ASCIIFoldingFilter wrongly folds german Umlauts
 ---

 Key: LUCENE-3110
 URL: https://issues.apache.org/jira/browse/LUCENE-3110
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.1
Reporter: Michael Gaber

 The German umlauts are currently mapped as follows:
 Ä/ä = A/a
 Ö/ö = O/o
 Ü/ü = U/u
 the correct mapping would be
 Ä/ä = Ae/ae
 Ö/ö = Oe/oe
 Ü/ü = Ue/ue
 so the corresponding rows in the switch statement should be moved down to the 
 ae/oe/ue positions.
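The mapping the reporter requests (which, per the comment above, is a German-specific transliteration rather than generic accent-stripping) can be sketched in plain Java. This is illustrative only, not ASCIIFoldingFilter itself.

```java
// Sketch of expanding German umlauts to two-letter transliterations
// instead of bare accent stripping. A real analysis chain would do this
// with a char filter or token filter; this standalone version just
// shows the character mapping.
class UmlautFolder {
    static String fold(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            switch (c) {
                case 'Ä': out.append("Ae"); break;
                case 'ä': out.append("ae"); break;
                case 'Ö': out.append("Oe"); break;
                case 'ö': out.append("oe"); break;
                case 'Ü': out.append("Ue"); break;
                case 'ü': out.append("ue"); break;
                case 'ß': out.append("ss"); break; // often wanted alongside umlauts
                default:  out.append(c);
            }
        }
        return out.toString();
    }
}
```

Because ä also occurs in other languages where "ae" is wrong, this belongs in a language-specific filter rather than in the generic folding filter.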




[jira] [Commented] (LUCENE-3112) Add IW.add/updateDocuments to support nested documents

2011-05-17 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034711#comment-13034711
 ] 

Simon Willnauer commented on LUCENE-3112:
-

bq. Initial patch.

nice simple idea! I like the refactorings into pre/postUpdate - looks much 
cleaner. Yet, I think you should push the document iteration etc. into DWPT to 
actually apply the delTerm only once, to make it really atomic. I also wonder 
whether we should allow multiple delTerms, e.g. Tuple<DelTerm, Document>; 
otherwise you would be bound to one delTerm per collection - but what if you 
want to remove only one of the sub-documents? If we had those tuples, you would 
really want to push the iteration into DWPT to make a final 
finishDocument(Term[] terms) call, pushing the terms into a single DeleteItem.



 Add IW.add/updateDocuments to support nested documents
 --

 Key: LUCENE-3112
 URL: https://issues.apache.org/jira/browse/LUCENE-3112
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3112.patch


 I think nested documents (LUCENE-2454) is a very compelling addition
 to Lucene.  It's also a popular (many votes) issue.
 Beyond supporting nested document querying, which is already an
 incredible addition since it preserves the relational model on
 indexing normalized content (eg, DB tables, XML docs), LUCENE-2454
 should also enable speedups in grouping implementation when you group
 by a nested field.
 For the same reason, it can also enable very fast post-group facet
 counting impl (LUCENE-3097) when you want to
 count(distinct(nestedField)), instead of unique documents, as your
 identifier.  I expect many apps that use faceting need this ability
 (to count(distinct(nestedField)), not distinct(docID)).
 To support these use cases, I believe the only core change needed is
 the ability to atomically add or update multiple documents, which you
 cannot do today since in between add/updateDocument calls a flush (eg
 due to commit or getReader()) could occur.
 This new API (addDocuments(Iterable<Document>), updateDocuments(Term
 delTerm, Iterable<Document>)) would also further guarantee that the
 documents are assigned sequential docIDs in the order the iterator
 provided them, and that the docIDs all reside in one segment.
 Segment merging never splits segments apart, so this invariant would
 hold even as merges/optimizes take place.




[jira] [Commented] (LUCENE-1421) Ability to group search results by field

2011-05-17 Thread Martijn van Groningen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034714#comment-13034714
 ] 

Martijn van Groningen commented on LUCENE-1421:
---

bq. I added grouping queries to the nightly benchmarks
Nice!

Are the regular sort and group sort different in these test cases?

Do you think that when new features are added, they also need to be added to this 
test suite? Or is this performance test suite just for the basic features?

 Ability to group search results by field
 

 Key: LUCENE-1421
 URL: https://issues.apache.org/jira/browse/LUCENE-1421
 Project: Lucene - Java
  Issue Type: New Feature
  Components: core/search
Reporter: Artyom Sokolov
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-1421.patch, LUCENE-1421.patch, 
 lucene-grouping.patch


 It would be awesome to group search results by a specified field. Some 
 functionality was provided for Apache Solr, but I think it should be done in 
 Core Lucene. There could be some useful information about the collapsed 
 data, like the total hit count and so on.
 Thanks,
 Artyom




[jira] [Commented] (LUCENE-3105) String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names

2011-05-17 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034723#comment-13034723
 ] 

Uwe Schindler commented on LUCENE-3105:
---

Yes, it's gonna be fixed - see the linked issue LUCENE-2548. The biggest problem 
at the moment is Solr. The other things are minor identity-vs.-equals issues in 
FieldCache.

 String.intern() calls slow down IndexWriter.close() and IndexReader.open() 
 for index with large number of unique field names
 

 Key: LUCENE-3105
 URL: https://issues.apache.org/jira/browse/LUCENE-3105
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/index
Affects Versions: 3.1
Reporter: Mark Kristensson
 Attachments: LUCENE-3105.patch


 We have one index with several hundred thousand unique field names (we're 
 optimistic that Lucene 4.0 is flexible enough to allow us to change our index 
 design...) and found that opening an index writer and closing an index reader 
 results in horribly slow performance on that one index. I have isolated the 
 problem down to the calls to String.intern() that are used to allow for quick 
 string comparisons of field names throughout Lucene. These String.intern() 
 calls are unnecessary and can be replaced with a hashmap lookup. In fact, 
 StringHelper.java has its own hashmap implementation that it uses in 
 conjunction with String.intern(). Rather than using a one-off hashmap, I've 
 elected to use a ConcurrentHashMap in this patch.
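The replacement described above can be sketched as a ConcurrentHashMap-based intern cache: canonicalized strings can still be compared by identity (==) without the cost of the JVM's intern table. This is illustrative, not the exact patch.

```java
import java.util.concurrent.ConcurrentHashMap;

// Intern cache backed by ConcurrentHashMap instead of String.intern().
// putIfAbsent makes the canonicalization race-free: whichever thread
// wins the insert defines the canonical instance, and every caller
// gets back that same object.
class MapInterner {
    private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();

    String intern(String s) {
        String prev = cache.putIfAbsent(s, s);
        return prev == null ? s : prev;   // always return the canonical instance
    }
}
```

Unlike String.intern(), this cache lives on the ordinary heap and its size is bounded only by the set of distinct field names actually seen.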




[jira] [Commented] (LUCENE-2454) Nested Document query support

2011-05-17 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034726#comment-13034726
 ] 

Mark Harwood commented on LUCENE-2454:
--

bq.  I think this is the only core change necessary for this feature?

Yup. A same-segment indexing guarantee is all that is required.

 Nested Document query support
 -

 Key: LUCENE-2454
 URL: https://issues.apache.org/jira/browse/LUCENE-2454
 Project: Lucene - Java
  Issue Type: New Feature
  Components: core/search
Affects Versions: 3.0.2
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: LuceneNestedDocumentSupport.zip


 A facility for querying nested documents in a Lucene index as outlined in 
 http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene




[jira] [Resolved] (LUCENE-2736) Wrong implementation of DocIdSetIterator.advance

2011-05-17 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera resolved LUCENE-2736.


   Resolution: Fixed
Fix Version/s: 4.0
   3.2
Lucene Fields: [New, Patch Available]  (was: [New])

Thanks Doron - I changed the javadocs as you suggest.

Committed revision 1104159 (3x).
Committed revision 1104167 (trunk).

Thanks Hardy for reporting that !

 Wrong implementation of DocIdSetIterator.advance 
 -

 Key: LUCENE-2736
 URL: https://issues.apache.org/jira/browse/LUCENE-2736
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/search
Affects Versions: 3.2, 4.0
Reporter: Hardy Ferentschik
Assignee: Shai Erera
 Fix For: 3.2, 4.0

 Attachments: LUCENE-2736.patch


 Implementations of {{DocIdSetIterator}} behave differently when advance() is 
 called. Taking the following test for {{OpenBitSet}}, {{DocIdBitSet}} and 
 {{SortedVIntList}}, only {{SortedVIntList}} passes the test:
 {code:title=org.apache.lucene.search.TestDocIdSet.java|borderStyle=solid}
 ...
   public void testAdvanceWithOpenBitSet() throws IOException {
     DocIdSet idSet = new OpenBitSet( new long[] { 1121 }, 1 ); // bits 0, 5, 6, 10
     assertAdvance( idSet );
   }

   public void testAdvanceDocIdBitSet() throws IOException {
     BitSet bitSet = new BitSet();
     bitSet.set( 0 );
     bitSet.set( 5 );
     bitSet.set( 6 );
     bitSet.set( 10 );
     DocIdSet idSet = new DocIdBitSet( bitSet );
     assertAdvance( idSet );
   }

   public void testAdvanceWithSortedVIntList() throws IOException {
     DocIdSet idSet = new SortedVIntList( 0, 5, 6, 10 );
     assertAdvance( idSet );
   }

   private void assertAdvance(DocIdSet idSet) throws IOException {
     DocIdSetIterator iter = idSet.iterator();
     int docId = iter.nextDoc();
     assertEquals( "First doc id should be 0", 0, docId );
     docId = iter.nextDoc();
     assertEquals( "Second doc id should be 5", 5, docId );
     docId = iter.advance( 5 );
     assertEquals( "Advancing iterator should return the next doc id", 6, docId );
   }
 {code}
 The javadoc for {{advance}} says:
 {quote}
 Advances to the first *beyond* the current whose document number is greater 
 than or equal to _target_.
 {quote}
 This seems to indicate that {{SortedVIntList}} behaves correctly, whereas the 
 other two don't. 
 Just looking at the {{DocIdBitSet}} implementation advance is implemented as:
 {code}
 bitSet.nextSetBit(target);
 {code}
 where the docs of {{nextSetBit}} say:
 {quote}
 Returns the index of the first bit that is set to true that occurs *on or 
 after* the specified starting index
 {quote}




[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

2011-05-17 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-3102:
---

Attachment: LUCENE-3102-factory.patch

Patch against 3x which:

* Adds a factory method to CachingCollector, specializing on cacheScores
* Clarifies the Collector.needsScores() TODO

There are two remaining issues, let's address them after we iterate on this 
patch.

 Few issues with CachingCollector
 

 Key: LUCENE-3102
 URL: https://issues.apache.org/jira/browse/LUCENE-3102
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/search
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3102-factory.patch, LUCENE-3102.patch, 
 LUCENE-3102.patch


 CachingCollector (introduced in LUCENE-1421) has few issues:
 # Since the wrapped Collector may support out-of-order collection, the 
 document IDs cached may be out-of-order (depends on the Query) and thus 
 replay(Collector) will forward document IDs out-of-order to a Collector that 
 may not support it.
 # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
 # I think that instead of comparing curScores to null, in order to determine 
 if scores are requested, we should have a specific boolean - for clarity
 # Can this check, if (base + nextLength > maxDocsToCache) (line 168), be 
 relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the 
 maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want 
 to try and cache them?
 Also:
 * The TODO in line 64 (having Collector specify needsScores()) -- why do we 
 need that if CachingCollector ctor already takes a boolean cacheScores? I 
 think it's better defined explicitly than implicitly?
 * Let's introduce a factory method for creating a specialized version if 
 scoring is requested / not (i.e., impl the TODO in line 189)
 * I think it's a useful collector, which stands on its own and not specific 
 to grouping. Can we move it to core?
 * How about using OpenBitSet instead of int[] for doc IDs?
 ** If the number of hits is big, we'd gain some RAM back, and be able to 
 cache more entries
 ** NOTE: OpenBitSet can only be used for in-order collection only. So we can 
 use that if the wrapped Collector does not support out-of-order
 * Do you think we can modify this Collector to not necessarily wrap another 
 Collector? We have such Collector which stores (in-memory) all matching doc 
 IDs + scores (if required). Those are later fed into several processes that 
 operate on them (e.g. fetch more info from the index etc.). I am thinking, we 
 can make CachingCollector *optionally* wrap another Collector and then 
 someone can reuse it by setting RAM limit to unlimited (we should have a 
 constant for that) in order to simply collect all matching docs + scores.
 * I think a set of dedicated unit tests for this class alone would be good.
 That's it so far. Perhaps, if we do all of the above, more things will pop up.
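The caching-collector pattern under discussion can be sketched in plain Java: record each collected doc ID up to a cache limit, then replay the cached IDs into a downstream consumer. This toy code is not Lucene's CachingCollector API; it also shows the fix for issue #2 above (clearing the cache when the limit is exceeded).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

// Toy caching collector: caches collected doc IDs until a limit is hit,
// at which point the cache is cleared and marked invalid (it would be a
// bug to replay a partial cache as if it were complete).
class CachingSink {
    private final List<Integer> cached = new ArrayList<>();
    private final int maxDocsToCache;
    private boolean cacheValid = true;

    CachingSink(int maxDocsToCache) { this.maxDocsToCache = maxDocsToCache; }

    void collect(int docId) {
        if (cacheValid && cached.size() < maxDocsToCache) {
            cached.add(docId);
        } else if (cacheValid) {
            cached.clear();       // exceeded the limit: drop the cache entirely
            cacheValid = false;
        }
    }

    // Replays cached doc IDs, in collection order, to a downstream
    // consumer; returns false if the cache was invalidated.
    boolean replay(IntConsumer downstream) {
        if (!cacheValid) return false;
        for (int docId : cached) downstream.accept(docId);
        return true;
    }
}
```

Replaying in collection order is exactly where issue #1 above bites: if the upstream collected out of order, the downstream consumer must tolerate out-of-order IDs too.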




[jira] [Updated] (LUCENE-3084) MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos

2011-05-17 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-3084:
--

Attachment: LUCENE-3084-trunk-only.patch

Further refactoring:
- I was able to move more internal ArrayList-modifying code out of IndexWriter.
- The returned List view is now unmodifiable!
- It's now possible to also add a Set view, for faster contains() checks.

...working...
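The two views mentioned above can be sketched as follows: callers get an unmodifiable List view of the segments, plus a Set kept in sync so contains() is O(1) instead of a linear scan. Names here are illustrative, not the patch itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy segment list with an unmodifiable List view and a parallel Set
// for fast membership checks; all mutation stays internal.
class SegmentList {
    private final List<String> segments = new ArrayList<>();
    private final Set<String> segmentSet = new HashSet<>();

    void add(String segmentName) {
        segments.add(segmentName);
        segmentSet.add(segmentName);
    }

    // Callers can iterate but not mutate.
    List<String> asList() {
        return Collections.unmodifiableList(segments);
    }

    boolean contains(String segmentName) {
        return segmentSet.contains(segmentName);   // O(1) lookup
    }
}
```

The unmodifiable view means external code can no longer corrupt IndexWriter's internal segment bookkeeping by mutating a returned list.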

 MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos
 --

 Key: LUCENE-3084
 URL: https://issues.apache.org/jira/browse/LUCENE-3084
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3084-trunk-only.patch, 
 LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, 
 LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, 
 LUCENE-3084-trunk-only.patch, LUCENE-3084.patch


 SegmentInfos carries a bunch of fields beyond the list of SIs, but for merging 
 purposes these fields are unused.
 We should cut over to List<SI> instead.




[jira] [Commented] (LUCENE-3112) Add IW.add/updateDocuments to support nested documents

2011-05-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034730#comment-13034730
 ] 

Robert Muir commented on LUCENE-3112:
-

We should really think through the consequences of this, though.

If core features of Lucene become implemented in a way that relies upon
these sequential docids, we lock ourselves out of future optimizations,
such as reordering docids for optimal index compression.


 Add IW.add/updateDocuments to support nested documents
 --

 Key: LUCENE-3112
 URL: https://issues.apache.org/jira/browse/LUCENE-3112
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3112.patch


 I think nested documents (LUCENE-2454) is a very compelling addition
 to Lucene.  It's also a popular (many votes) issue.
 Beyond supporting nested document querying, which is already an
 incredible addition since it preserves the relational model on
 indexing normalized content (eg, DB tables, XML docs), LUCENE-2454
 should also enable speedups in grouping implementation when you group
 by a nested field.
 For the same reason, it can also enable very fast post-group facet
 counting impl (LUCENE-3097) when you want to
 count(distinct(nestedField)), instead of unique documents, as your
 identifier.  I expect many apps that use faceting need this ability
 (to count(distinct(nestedField)), not distinct(docID)).
 To support these use cases, I believe the only core change needed is
 the ability to atomically add or update multiple documents, which you
 cannot do today since in between add/updateDocument calls a flush (eg
 due to commit or getReader()) could occur.
 This new API (addDocuments(Iterable<Document>), updateDocuments(Term
 delTerm, Iterable<Document>)) would also further guarantee that the
 documents are assigned sequential docIDs in the order the iterator
 provided them, and that the docIDs all reside in one segment.
 Segment merging never splits segments apart, so this invariant would
 hold even as merges/optimizes take place.




[jira] [Commented] (LUCENE-3112) Add IW.add/updateDocuments to support nested documents

2011-05-17 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034734#comment-13034734
 ] 

Jason Rutherglen commented on LUCENE-3112:
--

I think that, perhaps like a Hadoop input-format split, we can define metadata at 
the segment level as to where the documents live, so that if one is 'splitting' 
the index, as is being implemented with HBase, the 'splitter' can be 'smart'.

 Add IW.add/updateDocuments to support nested documents
 --

 Key: LUCENE-3112
 URL: https://issues.apache.org/jira/browse/LUCENE-3112
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3112.patch


 I think nested documents (LUCENE-2454) is a very compelling addition
 to Lucene.  It's also a popular (many votes) issue.
 Beyond supporting nested document querying, which is already an
 incredible addition since it preserves the relational model on
 indexing normalized content (eg, DB tables, XML docs), LUCENE-2454
 should also enable speedups in grouping implementation when you group
 by a nested field.
 For the same reason, it can also enable very fast post-group facet
 counting impl (LUCENE-3097) when you want to
 count(distinct(nestedField)), instead of unique documents, as your
 identifier.  I expect many apps that use faceting need this ability
 (to count(distinct(nestedField)), not distinct(docID)).
 To support these use cases, I believe the only core change needed is
 the ability to atomically add or update multiple documents, which you
 cannot do today since in between add/updateDocument calls a flush (eg
 due to commit or getReader()) could occur.
 This new API (addDocuments(Iterable<Document>), updateDocuments(Term
 delTerm, Iterable<Document>)) would also further guarantee that the
 documents are assigned sequential docIDs in the order the iterator
 provided them, and that the docIDs all reside in one segment.
 Segment merging never splits segments apart, so this invariant would
 hold even as merges/optimizes take place.




[jira] [Assigned] (LUCENE-3111) TestFSTs.testRandomWords failure

2011-05-17 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-3111:
--

Assignee: Michael McCandless

 TestFSTs.testRandomWords failure
 

 Key: LUCENE-3111
 URL: https://issues.apache.org/jira/browse/LUCENE-3111
 Project: Lucene - Java
  Issue Type: Bug
Reporter: selckin
Assignee: Michael McCandless
Priority: Minor

 Was running some while(1) tests on the docvalues branch (r1103705) and the 
 following test failed:
 {code}
 [junit] Testsuite: org.apache.lucene.util.automaton.fst.TestFSTs
 [junit] Testcase: 
 testRandomWords(org.apache.lucene.util.automaton.fst.TestFSTs): FAILED
 [junit] expected:<771> but was:<TwoLongs:771,771>
 [junit] junit.framework.AssertionFailedError: expected:<771> but 
 was:<TwoLongs:771,771>
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.verifyUnPruned(TestFSTs.java:540)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.doTest(TestFSTs.java:496)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.doTest(TestFSTs.java:359)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs.doTest(TestFSTs.java:319)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs.testRandomWords(TestFSTs.java:940)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs.testRandomWords(TestFSTs.java:915)
 [junit]   at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
 [junit]   at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
 [junit] 
 [junit] 
 [junit] Tests run: 7, Failures: 1, Errors: 0, Time elapsed: 7.628 sec
 [junit] 
 [junit] - Standard Error -
 [junit] NOTE: Ignoring nightly-only test method 'testBigSet'
 [junit] NOTE: reproduce with: ant test -Dtestcase=TestFSTs 
 -Dtestmethod=testRandomWords -Dtests.seed=-269475578956012681:0
 [junit] NOTE: test params are: codec=PreFlex, locale=ar, 
 timezone=America/Blanc-Sablon
 [junit] NOTE: all tests run in this JVM:
 [junit] [TestToken, TestCodecs, TestIndexReaderReopen, 
 TestIndexWriterMerging, TestNoDeletionPolicy, TestParallelReaderEmptyIndex, 
 TestParallelTermEnum, TestPerSegmentDeletes, TestSegmentReader, 
 TestSegmentTermDocs, TestStressAdvance, TestTermVectorsReader, 
 TestSurrogates, TestMultiFieldQueryParser, TestAutomatonQuery, 
 TestBooleanScorer, TestFuzzyQuery, TestMultiTermConstantScore, 
 TestNumericRangeQuery64, TestPositiveScoresOnlyCollector, TestPrefixFilter, 
 TestQueryTermVector, TestScorerPerf, TestSloppyPhraseQuery, 
 TestSpansAdvanced, TestWindowsMMap, TestRamUsageEstimator, TestSmallFloat, 
 TestUnicodeUtil, TestFSTs]
 [junit] NOTE: Linux 2.6.37-gentoo amd64/Sun Microsystems Inc. 1.6.0_25 
 (64-bit)/cpus=8,threads=1,free=137329960,total=208207872
 [junit] -  ---
 [junit] TEST org.apache.lucene.util.automaton.fst.TestFSTs FAILED
 {code}
 I am not able to reproduce

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2119) IndexSchema should log warning if analyzer is declared with charfilter/tokenizer/tokenfiler out of order

2011-05-17 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated SOLR-2119:
-

Fix Version/s: 4.0
   3.2

 IndexSchema should log warning if analyzer is declared with 
 charfilter/tokenizer/tokenfiler out of order
 --

 Key: SOLR-2119
 URL: https://issues.apache.org/jira/browse/SOLR-2119
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Hoss Man
 Fix For: 3.2, 4.0


 There seems to be a segment of the user population that has a hard time 
 understanding the distinction between a charfilter, a tokenizer, and a 
 tokenfilter -- while we can certainly try to improve the documentation about 
 what exactly each does, and when they take effect in the analysis chain, one 
 other thing we should do is try to educate people when they construct their 
 analyzer in a way that doesn't make any sense.
 At the moment, some people are attempting to do things like move the Foo 
 <tokenFilter/> before the <tokenizer/> to try and get certain behavior ... 
 at a minimum we should log a warning in this case that doing that doesn't 
 have the desired effect.
 (We could easily make such a situation fail to initialize, but I'm not 
 convinced that would be the best course of action, since some people may have 
 schemas where they have declared a charFilter or tokenizer out of order 
 relative to their tokenFilters, but are still getting correct results that 
 work for them, and breaking their instance on upgrade doesn't seem like it 
 would be productive.)




[jira] [Commented] (SOLR-2119) IndexSchema should log warning if analyzer is declared with charfilter/tokenizer/tokenfiler out of order

2011-05-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034740#comment-13034740
 ] 

Michael McCandless commented on SOLR-2119:
--

+1 for hard error.

In general for problems we can detect at startup we should not start the 
server.  Users rarely see/do something about the warnings.

I think this would be a good service to those users who trip the hard error on 
upgrade: it means Solr is not doing what they thought they asked it to do.
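A minimal sketch of the kind of startup check being discussed (hypothetical helper, not Solr's actual schema-parsing code): walk the child elements of an `<analyzer>` declaration and fail when a `<charFilter>` or `<tokenizer>` appears after a `<filter>`.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class AnalyzerOrderCheck {
    // Returns true only if children appear in the legal order:
    // charFilter* tokenizer filter*  (element names as used in schema.xml).
    // Any parse error also counts as "not well ordered".
    static boolean isWellOrdered(String analyzerXml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(analyzerXml.getBytes("UTF-8")))
                    .getDocumentElement();
            int stage = 0; // 0 = charFilters, 1 = tokenizer seen, 2 = tokenFilters
            NodeList kids = root.getChildNodes();
            for (int i = 0; i < kids.getLength(); i++) {
                if (!(kids.item(i) instanceof Element)) continue;
                String name = ((Element) kids.item(i)).getTagName();
                if (name.equals("charFilter")) {
                    if (stage > 0) return false;  // charFilter after the tokenizer
                } else if (name.equals("tokenizer")) {
                    if (stage > 0) return false;  // duplicate or late tokenizer
                    stage = 1;
                } else if (name.equals("filter")) {
                    if (stage == 0) return false; // tokenFilter before the tokenizer
                    stage = 2;
                }
            }
            return stage >= 1; // must have seen exactly one tokenizer
        } catch (Exception e) {
            return false;
        }
    }
}
```

A startup hook could call this per field type and refuse to start (the "hard error" above) when it returns false.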

 IndexSchema should log warning if analyzer is declared with 
 charfilter/tokenizer/tokenfiler out of order
 --

 Key: SOLR-2119
 URL: https://issues.apache.org/jira/browse/SOLR-2119
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Hoss Man
 Fix For: 3.2, 4.0


 There seems to be a segment of the user population that has a hard time 
 understanding the distinction between a charfilter, a tokenizer, and a 
 tokenfilter -- while we can certainly try to improve the documentation about 
 what exactly each does, and when they take effect in the analysis chain, one 
 other thing we should do is try to educate people when they construct their 
 analyzer in a way that doesn't make any sense.
 At the moment, some people are attempting to do things like move the Foo 
 <tokenFilter/> before the <tokenizer/> to try and get certain behavior ... 
 at a minimum we should log a warning in this case that doing that doesn't 
 have the desired effect.
 (We could easily make such a situation fail to initialize, but I'm not 
 convinced that would be the best course of action, since some people may have 
 schemas where they have declared a charFilter or tokenizer out of order 
 relative to their tokenFilters, but are still getting correct results that 
 work for them, and breaking their instance on upgrade doesn't seem like it 
 would be productive.)




Re: Solr Config XML DTD's

2011-05-17 Thread Michael McCandless
https://issues.apache.org/jira/browse/SOLR-2119 is a good example
where we are failing to catch mis-configuration on startup.

Is there some way we can baby step here?  EG use one of these XML
validation packages, incrementally, on only sub-strings from the XML?
(Or simpler is to just do the checking ourselves w/ custom code).

Mike

http://blog.mikemccandless.com

On Wed, May 4, 2011 at 10:50 PM, Michael Sokolov soko...@ifactory.com wrote:
 I'm not sure you will find anyone wanting to put in this effort now, but
 another suggestion for a general approach might be:

 1) Very basic static analysis to catch what you can - this should be a pretty
 minimal effort only, given what can reasonably be achieved.

 2) Throw runtime errors as Hoss says (probably already doing this well
 enough, but maybe some incremental improvements are needed?)

 3) An option to run a "configtest" like httpd provides, that preloads all
 declared handlers/plugins/modules etc, instantiates them and gives them an
 opportunity to read their config and throw whatever errors they find.  This
 way you can set a standard (error on unrecognized parameter, say) in some
 core areas, and distribute the effort.  This is a hugely useful sanity check
 to be able to run when you want to make config changes and not have your
 server fall over when it starts (or worse - later).

 -Mike "kibitzer" Sokolov

 On 5/4/2011 6:55 PM, Chris Hostetter wrote:

 As i said: any improvements to help catch the mistakes we can identify
 would be great, but we should maintain perspective of the effort/gain
 tradeoff given that there is likely nothing we can do about the basic
 problem of a string that won't be evaluated until runtime






[jira] [Commented] (LUCENE-3112) Add IW.add/updateDocuments to support nested documents

2011-05-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034750#comment-13034750
 ] 

Michael McCandless commented on LUCENE-3112:


bq. Yet, I think you should push the document iteration etc into DWPT to 
actually apply the delterm only once to make it really atomic.

Ahh good point -- it's wrong just passing that delTerm down N times, too.  I'll 
fix.

bq. I also wonder if we should allow multiple delTerm e.g. TupleDelTerm, 
Document otherwise you would be bound to one delterm pre collection but what 
if you want to remove only one of the sub-documents?

So, this won't work today w/ nested querying, if I understand it right.  Ie, if 
you only update one of the subs, now your subdocs are no longer sequential (nor 
in one segment).  So I think design for today here...?

Someday, when we implement incremental field updates correctly, so that updates 
are written as stacked segments against the original segment containing the 
document, at that point I think we can add an API that lets you update multiple 
docs atomically?

 Add IW.add/updateDocuments to support nested documents
 --

 Key: LUCENE-3112
 URL: https://issues.apache.org/jira/browse/LUCENE-3112
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3112.patch


 I think nested documents (LUCENE-2454) is a very compelling addition
 to Lucene.  It's also a popular (many votes) issue.
 Beyond supporting nested document querying, which is already an
 incredible addition since it preserves the relational model on
 indexing normalized content (eg, DB tables, XML docs), LUCENE-2454
 should also enable speedups in grouping implementation when you group
 by a nested field.
 For the same reason, it can also enable very fast post-group facet
 counting impl (LUCENE-3097) when you what to
 count(distinct(nestedField)), instead of unique documents, as your
 identifier.  I expect many apps that use faceting need this ability
 (to count(distinct(nestedField)) not distinct(docID)).
 To support these use cases, I believe the only core change needed is
 the ability to atomically add or update multiple documents, which you
 cannot do today since in between add/updateDocument calls a flush (eg
 due to commit or getReader()) could occur.
 This new API (addDocuments(IterableDocument), updateDocuments(Term
 delTerm, IterableDocument) would also further guarantee that the
 documents are assigned sequential docIDs in the order the iterator
 provided them, and that the docIDs all reside in one segment.
 Segment merging never splits segments apart, so this invariant would
 hold even as merges/optimizes take place.
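The proposed API could be used roughly like this (a pseudocode sketch against the method names proposed above; the exact signatures had not been committed at this point):

```
// Pseudocode sketch of the proposed atomic block-add/update:
List<Document> block = new ArrayList<Document>();
block.add(parentDoc);   // e.g. the outer DB row / XML element
block.add(childDoc1);   // nested sub-documents...
block.add(childDoc2);

// All documents receive sequential docIDs in one segment; no flush
// (commit or getReader()) can occur in between, so nested-document
// queries can rely on the docs being adjacent:
writer.updateDocuments(new Term("blockId", "42"), block);
```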




[jira] [Resolved] (LUCENE-3110) ASCIIFoldingFilter wrongly folds german Umlauts

2011-05-17 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe resolved LUCENE-3110.
-

Resolution: Won't Fix

See LUCENE-1696, where Robert Muir advocates using an ICU collation filter 
instead of locale-sensitive accent stripping.

 ASCIIFoldingFilter wrongly folds german Umlauts
 ---

 Key: LUCENE-3110
 URL: https://issues.apache.org/jira/browse/LUCENE-3110
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.1
Reporter: Michael Gaber

 The German umlauts are currently mapped as follows:
 Ä/ä => A/a
 Ö/ö => O/o
 Ü/ü => U/u
 The correct mapping would be:
 Ä/ä => Ae/ae
 Ö/ö => Oe/oe
 Ü/ü => Ue/ue
 so the corresponding rows in the switch statement should be moved down to the 
 ae/oe/ue positions.
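The folding the reporter asks for can be sketched as a tiny self-contained helper (a hypothetical illustration, not the ASCIIFoldingFilter code itself):

```java
class GermanFolding {
    // Fold German umlauts the way the issue requests:
    // ä -> ae, ö -> oe, ü -> ue, plus the uppercase variants.
    static String fold(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            switch (c) {
                case 'ä': out.append("ae"); break;
                case 'ö': out.append("oe"); break;
                case 'ü': out.append("ue"); break;
                case 'Ä': out.append("Ae"); break;
                case 'Ö': out.append("Oe"); break;
                case 'Ü': out.append("Ue"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }
}
```

As the resolution notes, this mapping is language-specific, which is why ICU collation or the German2 stemmer is preferred over changing the locale-agnostic ASCIIFoldingFilter.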




[jira] [Commented] (LUCENE-3110) ASCIIFoldingFilter wrongly folds german Umlauts

2011-05-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034764#comment-13034764
 ] 

Robert Muir commented on LUCENE-3110:
-

Another option is to use the German2 stemmer from Snowball, which is a 
variation on the German stemmer designed to handle these cases.

If you use GermanAnalyzer in 3.1 it uses this stemmer by default.

 ASCIIFoldingFilter wrongly folds german Umlauts
 ---

 Key: LUCENE-3110
 URL: https://issues.apache.org/jira/browse/LUCENE-3110
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.1
Reporter: Michael Gaber

 The German umlauts are currently mapped as follows:
 Ä/ä => A/a
 Ö/ö => O/o
 Ü/ü => U/u
 The correct mapping would be:
 Ä/ä => Ae/ae
 Ö/ö => Oe/oe
 Ü/ü => Ue/ue
 so the corresponding rows in the switch statement should be moved down to the 
 ae/oe/ue positions.




[jira] [Reopened] (SOLR-2445) unknown handler: standard

2011-05-17 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi reopened SOLR-2445:
--


Seems that no one objects to applying the patch to 3.1.1. Reopening.

 unknown handler: standard
 -

 Key: SOLR-2445
 URL: https://issues.apache.org/jira/browse/SOLR-2445
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.4.1, 3.1, 3.2, 4.0
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: SOLR-2445.patch, qt-form-jsp.patch


 To reproduce the problem using the example config, go to form.jsp, use standard 
 for qt (it is the default), then click Search.




[jira] [Created] (LUCENE-3113) fix analyzer bugs found by MockTokenizer

2011-05-17 Thread Robert Muir (JIRA)
fix analyzer bugs found by MockTokenizer


 Key: LUCENE-3113
 URL: https://issues.apache.org/jira/browse/LUCENE-3113
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Robert Muir
 Attachments: LUCENE-3113.patch

In LUCENE-3064, we beefed up MockTokenizer with assertions, and I've switched 
over the analysis tests to use MockTokenizer for better coverage.

However, this found a few bugs (one of which is LUCENE-3106):
* incrementToken() after it returns false in CommonGramsQueryFilter, 
HyphenatedWordsFilter, ShingleFilter, SynonymFilter
* missing end() implementation for PrefixAwareTokenFilter
* double reset() in QueryAutoStopWordAnalyzer and ReusableAnalyzerBase
* missing correctOffset()s in MockTokenizer itself.

I think it would be nice to just fix all the bugs on one issue... I've fixed 
everything except Shingle and Synonym
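For context, the consumer contract that MockTokenizer now asserts can be modeled in a few lines (a hypothetical stand-in class, not Lucene's TokenStream API): reset() first, then incrementToken() until it returns false, then end(); once incrementToken() has returned false it must keep returning false.

```java
import java.util.Iterator;
import java.util.List;

// Minimal model of the TokenStream workflow whose violations the
// beefed-up MockTokenizer catches (hypothetical class for illustration).
class ModelTokenStream {
    private final List<String> tokens;
    private Iterator<String> it;
    private boolean exhausted = true; // consuming before reset() is a bug
    String current;

    ModelTokenStream(List<String> tokens) { this.tokens = tokens; }

    void reset() { it = tokens.iterator(); exhausted = false; }

    // Must keep returning false once exhausted; answering true again
    // after false is the kind of bug found in CommonGramsQueryFilter,
    // HyphenatedWordsFilter, ShingleFilter, and SynonymFilter.
    boolean incrementToken() {
        if (exhausted || !it.hasNext()) { exhausted = true; return false; }
        current = it.next();
        return true;
    }

    // Must be implemented to set the final offset; a missing end() was
    // the bug in PrefixAwareTokenFilter.
    void end() { current = null; }
}
```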




[jira] [Updated] (LUCENE-3113) fix analyzer bugs found by MockTokenizer

2011-05-17 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3113:


  Component/s: modules/analysis
Fix Version/s: 4.0
   3.2

 fix analyzer bugs found by MockTokenizer
 

 Key: LUCENE-3113
 URL: https://issues.apache.org/jira/browse/LUCENE-3113
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/analysis
Reporter: Robert Muir
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3113.patch


 In LUCENE-3064, we beefed up MockTokenizer with assertions, and I've switched 
 over the analysis tests to use MockTokenizer for better coverage.
 However, this found a few bugs (one of which is LUCENE-3106):
 * incrementToken() after it returns false in CommonGramsQueryFilter, 
 HyphenatedWordsFilter, ShingleFilter, SynonymFilter
 * missing end() implementation for PrefixAwareTokenFilter
 * double reset() in QueryAutoStopWordAnalyzer and ReusableAnalyzerBase
 * missing correctOffset()s in MockTokenizer itself.
 I think it would be nice to just fix all the bugs on one issue... I've fixed 
 everything except Shingle and Synonym




[jira] [Updated] (LUCENE-3113) fix analyzer bugs found by MockTokenizer

2011-05-17 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3113:


Attachment: LUCENE-3113.patch

Attached is a patch; the synonyms and shingles tests still fail.

 fix analyzer bugs found by MockTokenizer
 

 Key: LUCENE-3113
 URL: https://issues.apache.org/jira/browse/LUCENE-3113
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/analysis
Reporter: Robert Muir
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3113.patch


 In LUCENE-3064, we beefed up MockTokenizer with assertions, and I've switched 
 over the analysis tests to use MockTokenizer for better coverage.
 However, this found a few bugs (one of which is LUCENE-3106):
 * incrementToken() after it returns false in CommonGramsQueryFilter, 
 HyphenatedWordsFilter, ShingleFilter, SynonymFilter
 * missing end() implementation for PrefixAwareTokenFilter
 * double reset() in QueryAutoStopWordAnalyzer and ReusableAnalyzerBase
 * missing correctOffset()s in MockTokenizer itself.
 I think it would be nice to just fix all the bugs on one issue... I've fixed 
 everything except Shingle and Synonym




[jira] [Resolved] (SOLR-2445) unknown handler: standard

2011-05-17 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved SOLR-2445.
--

   Resolution: Fixed
Fix Version/s: 3.1.1

Committed revision 1104270 for 3.1.1.
Thanks Gabriele for your patience!

 unknown handler: standard
 -

 Key: SOLR-2445
 URL: https://issues.apache.org/jira/browse/SOLR-2445
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.4.1, 3.1, 3.2, 4.0
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 3.1.1, 3.2, 4.0

 Attachments: SOLR-2445.patch, qt-form-jsp.patch


 To reproduce the problem using the example config, go to form.jsp, use standard 
 for qt (it is the default), then click Search.




[jira] [Updated] (LUCENE-3113) fix analyzer bugs found by MockTokenizer

2011-05-17 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3113:


Attachment: LUCENE-3113.patch

Updated patch, fixing the bugs in Synonyms and ShingleFilter.

Also, I found two more bugs: the ShingleAnalyzerWrapper was double-resetting, 
and the PrefixAndSuffixAwareTokenFilter was also missing end().


 fix analyzer bugs found by MockTokenizer
 

 Key: LUCENE-3113
 URL: https://issues.apache.org/jira/browse/LUCENE-3113
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/analysis
Reporter: Robert Muir
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3113.patch, LUCENE-3113.patch


 In LUCENE-3064, we beefed up MockTokenizer with assertions, and I've switched 
 over the analysis tests to use MockTokenizer for better coverage.
 However, this found a few bugs (one of which is LUCENE-3106):
 * incrementToken() after it returns false in CommonGramsQueryFilter, 
 HyphenatedWordsFilter, ShingleFilter, SynonymFilter
 * missing end() implementation for PrefixAwareTokenFilter
 * double reset() in QueryAutoStopWordAnalyzer and ReusableAnalyzerBase
 * missing correctOffset()s in MockTokenizer itself.
 I think it would be nice to just fix all the bugs on one issue... I've fixed 
 everything except Shingle and Synonym




SpanNearQuery - inOrder parameter

2011-05-17 Thread Gregory Tarr
I attach a junit test which shows strange behaviour of the inOrder
parameter on the SpanNearQuery constructor, using Lucene 2.9.4.

My understanding of this parameter is that true forces the order and
false doesn't care about the order.

Using true always works. However, using false works fine when the terms
in the query are distinct, but if they are equivalent, e.g. searching
for "john john", I do not get the expected results. The workaround seems
to be to always use true for queries with repeated terms.

Any help?

Thanks

Greg

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocsCollector;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Assert;
import org.junit.Test; 
public class TestSpanNearQueryInOrder {

    @Test
    public void testSpanNearQueryInOrder() throws Exception {
        RAMDirectory directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(directory,
                new StandardAnalyzer(Version.LUCENE_29), true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        TopDocsCollector collector = TopScoreDocCollector.create(3, false);

        // NOTE: the original string literals were stripped by the mail
        // archive; the field values below are reconstructed from the
        // description (searching for "john john").
        Document doc = new Document();

        // DOC1
        doc.add(new Field("text", "john smith", Field.Store.YES,
                Field.Index.ANALYZED));
        writer.addDocument(doc);

        // DOC2
        doc = new Document();
        doc.add(new Field("text", "john john", Field.Store.YES,
                Field.Index.ANALYZED));
        writer.addDocument(doc);

        // DOC3
        doc = new Document();
        doc.add(new Field("text", "john smith john", Field.Store.YES,
                Field.Index.ANALYZED));
        writer.addDocument(doc);

        writer.optimize();
        writer.close();

        IndexSearcher searcher = new IndexSearcher(directory, false);
        SpanQuery[] clauses = new SpanQuery[2];
        clauses[0] = new SpanTermQuery(new Term("text", "john"));
        clauses[1] = new SpanTermQuery(new Term("text", "john"));

        // Don't care about order, so setting inOrder = false
        SpanNearQuery q = new SpanNearQuery(clauses, 1, false);
        searcher.search(q, collector);
        // This assert fails - 3 docs are returned. Expecting only DOC2 and DOC3
        Assert.assertEquals("Check 2 results", 2, collector.getTotalHits());

        collector = TopScoreDocCollector.create(3, false);
        clauses = new SpanQuery[2];
        clauses[0] = new SpanTermQuery(new Term("text", "john"));
        clauses[1] = new SpanTermQuery(new Term("text", "john"));

        // Don't care about order, so setting inOrder = false
        q = new SpanNearQuery(clauses, 0, false);
        searcher.search(q, collector);
        // This assert fails - 3 docs are returned. Expecting only DOC2
        Assert.assertEquals("Check 1 result", 1, collector.getTotalHits());
    }
}






[jira] [Commented] (LUCENE-3113) fix analyzer bugs found by MockTokenizer

2011-05-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034806#comment-13034806
 ] 

Robert Muir commented on LUCENE-3113:
-

I think this patch is ready to commit; I'll wait and see if anyone feels like 
reviewing it :)

 fix analyzer bugs found by MockTokenizer
 

 Key: LUCENE-3113
 URL: https://issues.apache.org/jira/browse/LUCENE-3113
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/analysis
Reporter: Robert Muir
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3113.patch, LUCENE-3113.patch


 In LUCENE-3064, we beefed up MockTokenizer with assertions, and I've switched 
 over the analysis tests to use MockTokenizer for better coverage.
 However, this found a few bugs (one of which is LUCENE-3106):
 * incrementToken() after it returns false in CommonGramsQueryFilter, 
 HyphenatedWordsFilter, ShingleFilter, SynonymFilter
 * missing end() implementation for PrefixAwareTokenFilter
 * double reset() in QueryAutoStopWordAnalyzer and ReusableAnalyzerBase
 * missing correctOffset()s in MockTokenizer itself.
 I think it would be nice to just fix all the bugs on one issue... I've fixed 
 everything except Shingle and Synonym




[jira] [Commented] (LUCENE-2091) Add BM25 Scoring to Lucene

2011-05-17 Thread Shrinath (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034815#comment-13034815
 ] 

Shrinath commented on LUCENE-2091:
--

Hi, 

Don't be harsh if I am asking this in the wrong place, 
but could someone tell me if the linked patch is better than 
http://nlp.uned.es/~jperezi/Lucene-BM25/ ?


 Add BM25 Scoring to Lucene
 --

 Key: LUCENE-2091
 URL: https://issues.apache.org/jira/browse/LUCENE-2091
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/other
Reporter: Yuval Feinstein
Priority: Minor
 Fix For: 4.0

 Attachments: BM25SimilarityProvider.java, LUCENE-2091.patch, 
 persianlucene.jpg

   Original Estimate: 48h
  Remaining Estimate: 48h

 http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of 
 Okapi-BM25 scoring in the Lucene framework,
 as an alternative to the standard Lucene scoring (which is a version of mixed 
 boolean/TFIDF).
 I have refactored this a bit, added unit tests and improved the runtime 
 somewhat.
 I would like to contribute the code to Lucene under contrib. 




[jira] [Commented] (LUCENE-3113) fix analyzer bugs found by MockTokenizer

2011-05-17 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034816#comment-13034816
 ] 

Uwe Schindler commented on LUCENE-3113:
---

A quick check on the fixes in the implementations: all fine. I was just 
confused about PrefixAndSuffixAwareTokenFilter, but that's fine (Robert 
explained it to me - these filters are very complicated from the 
code/class-hierarchy design *g*).

I did not verify the tests; I assume it's just dumb search-replacements.

 fix analyzer bugs found by MockTokenizer
 

 Key: LUCENE-3113
 URL: https://issues.apache.org/jira/browse/LUCENE-3113
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/analysis
Reporter: Robert Muir
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3113.patch, LUCENE-3113.patch


 In LUCENE-3064, we beefed up MockTokenizer with assertions, and I've switched 
 over the analysis tests to use MockTokenizer for better coverage.
 However, this found a few bugs (one of which is LUCENE-3106):
 * incrementToken() after it returns false in CommonGramsQueryFilter, 
 HyphenatedWordsFilter, ShingleFilter, SynonymFilter
 * missing end() implementation for PrefixAwareTokenFilter
 * double reset() in QueryAutoStopWordAnalyzer and ReusableAnalyzerBase
 * missing correctOffset()s in MockTokenizer itself.
 I think it would be nice to just fix all the bugs on one issue... I've fixed 
 everything except Shingle and Synonym




[jira] [Commented] (SOLR-2193) Re-architect Update Handler

2011-05-17 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034818#comment-13034818
 ] 

Mark Miller commented on SOLR-2193:
---

I've got some fixes for this, and I've started on some tests and other minor 
steps forward. I'll put it up before too long.

 Re-architect Update Handler
 ---

 Key: SOLR-2193
 URL: https://issues.apache.org/jira/browse/SOLR-2193
 Project: Solr
  Issue Type: Improvement
Reporter: Mark Miller
Assignee: Mark Miller
 Fix For: 4.0

 Attachments: SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch


 The update handler needs an overhaul.
 A few goals I think we might want to look at:
 1. Cleanup - drop DirectUpdateHandler(2) line - move to something like 
 UpdateHandler, DefaultUpdateHandler
 2. Expose the SolrIndexWriter in the api or add the proper abstractions to 
 get done what we now do with special casing:
 if (directupdatehandler2)
   success
 else
   failish
 3. Stop closing the IndexWriter and start using commit (still lazy IW init 
 though).
 4. Drop iwAccess, iwCommit locks and sync mostly at the Lucene level.
 5. Keep NRT support in mind.
 6. Keep microsharding in mind (maintain logical index as multiple physical 
 indexes)
 7. Address the current issues we face because multiple original/'reloaded' 
 cores can have a different IndexWriter on the same index.




[jira] [Commented] (LUCENE-1421) Ability to group search results by field

2011-05-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034828#comment-13034828
 ] 

Michael McCandless commented on LUCENE-1421:


I'm only testing groupSort and sort by relevance now in the nightly bench.

I'll add sort-by-title, groupSort-by-relevance cases too, so we test that.  
Hmm, though: this content set is alphabetized by title, I believe, so it's not 
really a good test.  (I suspect that's why the TermQuery sorting by title is 
faster.)

bq. Do you think when new features are added that these also need to be added to 
this test suite? Or is this performance test suite just for the basic features?

Well, in general I'd love to have wider coverage in the nightly perf test...  
really it's only a start now.  But there's no hard rule we have to add new 
functions into the nightly bench...

 Ability to group search results by field
 

 Key: LUCENE-1421
 URL: https://issues.apache.org/jira/browse/LUCENE-1421
 Project: Lucene - Java
  Issue Type: New Feature
  Components: core/search
Reporter: Artyom Sokolov
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-1421.patch, LUCENE-1421.patch, 
 lucene-grouping.patch


 It would be awesome to group search results by specified field. Some 
 functionality was provided for Apache Solr but I think it should be done in 
 Core Lucene. There could be some useful information like total hits about 
 collapsed data like total count and so on.
 Thanks,
 Artyom

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir

2011-05-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034835#comment-13034835
 ] 

Michael McCandless commented on LUCENE-3092:


Thanks Simon; I'll commit soon...

 NRTCachingDirectory, to buffer small segments in a RAMDir
 -

 Key: LUCENE-3092
 URL: https://issues.apache.org/jira/browse/LUCENE-3092
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/store
Reporter: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch, 
 LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch


 I created this simple Directory impl, whose goal is to reduce IO
 contention in a frequent-reopen NRT use case.
 The idea is, when reopening quickly but not indexing that much
 content, you wind up with many small files created over time, which can
 stress the IO system, e.g. if merges and searching are also
 fighting for IO.
 So, NRTCachingDirectory puts these newly created files into a RAMDir,
 and only when they are merged into a too-large segment does it
 write through to the real (delegate) directory.
 This lets you spend some RAM to reduce IO.
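The write-through idea above can be sketched as a stdlib-only simulation (hypothetical names, not the actual NRTCachingDirectory code; a plain Map stands in for both the RAMDir and the delegate directory):

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of the write-through caching idea: small newly created
// "files" stay in a RAM map; files over a size threshold are written
// through to the delegate (standing in for the real directory).
public class WriteThroughCacheSketch {
    private final Map<String, byte[]> ramCache = new HashMap<>();
    private final Map<String, byte[]> delegate;   // stands in for the FSDirectory
    private final int maxCachedBytes;

    public WriteThroughCacheSketch(Map<String, byte[]> delegate, int maxCachedBytes) {
        this.delegate = delegate;
        this.maxCachedBytes = maxCachedBytes;
    }

    public void writeFile(String name, byte[] contents) {
        if (contents.length <= maxCachedBytes) {
            ramCache.put(name, contents);     // small segment: keep in RAM
        } else {
            delegate.put(name, contents);     // too large: write through
        }
    }

    public boolean isCached(String name) {
        return ramCache.containsKey(name);
    }

    public static void main(String[] args) {
        WriteThroughCacheSketch dir = new WriteThroughCacheSketch(new HashMap<>(), 1024);
        dir.writeFile("_0.frq", new byte[100]);     // small: stays in RAM
        dir.writeFile("_1.cfs", new byte[4096]);    // large: written through
        System.out.println(dir.isCached("_0.frq")); // true
        System.out.println(dir.isCached("_1.cfs")); // false
    }
}
```

The real directory also considers which merge produced a file; here a plain size threshold is enough to illustrate the RAM-for-IO tradeoff.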

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3097) Post grouping faceting

2011-05-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034836#comment-13034836
 ] 

Michael McCandless commented on LUCENE-3097:


Right, this'd mean all docs sharing a given group value are contiguous and in 
the same segment.  The app would have to ensure this, in order to use a 
collector that takes advantage of it.
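A minimal sketch of why contiguity helps (plain Java with made-up names, not the actual grouping collector): when all docs sharing a group value are adjacent, one linear pass with O(1) extra state per group suffices, with no hash lookups:

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch: count group sizes in a single pass, relying on the
// invariant that docs sharing a group value arrive contiguously.
public class ContiguousGroupCounter {
    public static List<String> countGroups(String[] groupValuePerDoc) {
        List<String> result = new ArrayList<>();
        String current = null;
        int count = 0;
        for (String v : groupValuePerDoc) {
            if (!v.equals(current)) {
                // Group boundary: close out the previous group.
                if (current != null) result.add(current + ":" + count);
                current = v;
                count = 0;
            }
            count++;
        }
        if (current != null) result.add(current + ":" + count);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(countGroups(new String[] {"a", "a", "b", "b", "b", "c"}));
        // [a:2, b:3, c:1]
    }
}
```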


 Post grouping faceting
 --

 Key: LUCENE-3097
 URL: https://issues.apache.org/jira/browse/LUCENE-3097
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Martijn van Groningen
Priority: Minor
 Fix For: 3.2, 4.0


 This issue focuses on implementing post-grouping faceting.
 * How to handle multivalued fields. What field value to show with the facet.
 * What the facet counts should be based on:
 ** Facet counts can be based on the normal documents. Ungrouped counts. 
 ** Facet counts can be based on the groups. Grouped counts.
 ** Facet counts can be based on the combination of group value and facet 
 value. Matrix counts.   
 And probably more implementation options.
 The first two methods are implemented in the SOLR-236 patch. For the first 
 option it calculates a DocSet based on the individual documents from the 
 query result. For the second option it calculates a DocSet for all the most 
 relevant documents of a group. Once the DocSet is computed, the FacetComponent 
 and StatsComponent use the DocSet to create facets and statistics.  
 This last one is a bit more complex. I think it is best explained with an 
 example. Let's say we search on travel offers:
 ||hotel||departure_airport||duration||
 |Hotel a|AMS|5|
 |Hotel a|DUS|10|
 |Hotel b|AMS|5|
 |Hotel b|AMS|10|
 If we group by hotel and have a facet on airport, most end users expect 
 (according to my experience, of course) the following airport facet:
 AMS: 2
 DUS: 1
 The above result can't be achieved by the first two methods. You either get 
 counts AMS:3 and DUS:1 or 1 for both airports.
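 The "matrix counts" above amount to counting distinct group values (hotels) per 
 facet value (airport). A hedged, stdlib-only Java sketch (hypothetical names, 
 not the Solr implementation):

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hedged sketch of matrix counts: per facet value, count distinct group
// values instead of raw documents, so two offers for the same hotel at the
// same airport count once.
public class MatrixFacetCounts {
    // rows: {group (hotel), facetValue (airport)}
    public static Map<String, Integer> count(String[][] rows) {
        Map<String, Set<String>> groupsPerFacet = new LinkedHashMap<>();
        for (String[] row : rows) {
            groupsPerFacet.computeIfAbsent(row[1], k -> new HashSet<>()).add(row[0]);
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : groupsPerFacet.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        String[][] offers = {
            {"Hotel a", "AMS"}, {"Hotel a", "DUS"},
            {"Hotel b", "AMS"}, {"Hotel b", "AMS"},
        };
        System.out.println(count(offers)); // {AMS=2, DUS=1}
    }
}
```

 This reproduces the airport facet the description expects (AMS: 2, DUS: 1), 
 which the ungrouped (AMS: 3) and grouped-head counts cannot.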

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

2011-05-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034841#comment-13034841
 ] 

Michael McCandless commented on LUCENE-3102:


Patch looks great!  But, can we rename curupto -> curUpto (and same for 
curbase)?  Ie, so it matches the other camelCaseVariables we have here...

Thank you!

 Few issues with CachingCollector
 

 Key: LUCENE-3102
 URL: https://issues.apache.org/jira/browse/LUCENE-3102
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/search
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3102-factory.patch, LUCENE-3102.patch, 
 LUCENE-3102.patch


 CachingCollector (introduced in LUCENE-1421) has a few issues:
 # Since the wrapped Collector may support out-of-order collection, the 
 document IDs cached may be out-of-order (depends on the Query) and thus 
 replay(Collector) will forward document IDs out-of-order to a Collector that 
 may not support it.
 # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
 # I think that instead of comparing curScores to null, in order to determine 
 if scores are requested, we should have a specific boolean - for clarity
 # Can this check, if (base + nextLength > maxDocsToCache) (line 168), be 
 relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the 
 maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want 
 to try and cache them?
 Also:
 * The TODO in line 64 (having Collector specify needsScores()) -- why do we 
 need that if CachingCollector ctor already takes a boolean cacheScores? I 
 think it's better defined explicitly than implicitly?
 * Let's introduce a factory method for creating a specialized version if 
 scoring is requested / not (i.e., impl the TODO in line 189)
 * I think it's a useful collector, which stands on its own and not specific 
 to grouping. Can we move it to core?
 * How about using OpenBitSet instead of int[] for doc IDs?
 ** If the number of hits is big, we'd gain some RAM back, and be able to 
 cache more entries
 ** NOTE: OpenBitSet can be used for in-order collection only, so we can 
 use it when the wrapped Collector does not support out-of-order collection
 * Do you think we can modify this Collector to not necessarily wrap another 
 Collector? We have such Collector which stores (in-memory) all matching doc 
 IDs + scores (if required). Those are later fed into several processes that 
 operate on them (e.g. fetch more info from the index etc.). I am thinking, we 
 can make CachingCollector *optionally* wrap another Collector and then 
 someone can reuse it by setting RAM limit to unlimited (we should have a 
 constant for that) in order to simply collect all matching docs + scores.
 * I think a set of dedicated unit tests for this class alone would be good.
 That's it so far. Perhaps, if we do all of the above, more things will pop up.
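 On the OpenBitSet point: a rough illustration using java.util.BitSet as a 
 stdlib stand-in (assumed sizes, not the actual CachingCollector code) of the 
 RAM tradeoff, and of why a bit set only works for in-order collection:

```java
import java.util.BitSet;

// Rough RAM comparison (assumed sizes): a bit set costs ~1 bit per doc in
// the index, while an int[] costs 4 bytes per cached hit. For dense hit
// sets the bit set wins by a wide margin.
public class DocIdCacheSketch {
    public static long bitSetBytes(int maxDoc) {
        return maxDoc / 8L;       // ~1 bit per doc in the index
    }

    public static long intArrayBytes(int numHits) {
        return 4L * numHits;      // 4 bytes per cached doc ID
    }

    public static void main(String[] args) {
        int maxDoc = 10_000_000, numHits = 5_000_000;
        System.out.println(bitSetBytes(maxDoc));     // 1250000
        System.out.println(intArrayBytes(numHits));  // 20000000

        // The catch: replaying from a bit set always yields ascending doc
        // IDs, so any out-of-order collection order is lost.
        BitSet cached = new BitSet();
        cached.set(7);
        cached.set(3);                            // set out of order...
        System.out.println(cached.nextSetBit(0)); // ...but replay starts at 3
    }
}
```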

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3113) fix analyzer bugs found by MockTokenizer

2011-05-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034846#comment-13034846
 ] 

Robert Muir commented on LUCENE-3113:
-

Uwe, I think I'll open a followup issue to clean up the code around 
PrefixAndSuffixAwareTF. I don't like how tricky it is.


 fix analyzer bugs found by MockTokenizer
 

 Key: LUCENE-3113
 URL: https://issues.apache.org/jira/browse/LUCENE-3113
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/analysis
Reporter: Robert Muir
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3113.patch, LUCENE-3113.patch


 In LUCENE-3064, we beefed up MockTokenizer with assertions, and I've switched 
 over the analysis tests to use MockTokenizer for better coverage.
 However, this found a few bugs (one of which is LUCENE-3106):
 * incrementToken() after it returns false in CommonGramsQueryFilter, 
 HyphenatedWordsFilter, ShingleFilter, SynonymFilter
 * missing end() implementation for PrefixAwareTokenFilter
 * double reset() in QueryAutoStopWordAnalyzer and ReusableAnalyzerBase
 * missing correctOffset()s in MockTokenizer itself.
 I think it would be nice to just fix all the bugs on one issue... I've fixed 
 everything except Shingle and Synonym

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3112) Add IW.add/updateDocuments to support nested documents

2011-05-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034850#comment-13034850
 ] 

Michael McCandless commented on LUCENE-3112:


{quote}
We should really think through the consequences of this though.

If core features of lucene become implemented in a way that they rely upon 
these sequential docids, we then lock ourselves out of future optimizations 
such as reordering docids for optimal index compression.
{quote}

I agree it's somewhat dangerous we are making an (experimental)
guarantee that these docIDs will remain adjacent forever.  We
normally are very protective about letting apps rely on docID
assignment/order.

But, I think this will not be core functionality that relies on
sub-docs (adjacent docs), but rather modules -- grouping, faceting,
nested queries.  And, even if you use these modules, it's
optional whether the app did sub-docs.  Ie we would still have the
generic grouping collector, but then also an optimized one that
takes advantage of sub-docs.

Finally, I think doing this today would not preclude doing docID
reordering in the future, because the sub-docs would be recomputable
based on the identifier field which grouped them in the first
place.

Ie the worst case future scenario (an app uses this new sub-docs
feature, but then has a big index they don't want to reindex and wants
to take advantage of a future docID reordering compression we add) would
still be solvable because we could use this identifier field to find
blocks of sub-docs.

I suppose we could consider changing the index format today to record
which docs are subs... but I think we don't need to.  Maybe I should
strengthen the @experimental to explain the risk that a future
reindexing could be required?


 Add IW.add/updateDocuments to support nested documents
 --

 Key: LUCENE-3112
 URL: https://issues.apache.org/jira/browse/LUCENE-3112
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3112.patch


 I think nested documents (LUCENE-2454) is a very compelling addition
 to Lucene.  It's also a popular (many votes) issue.
 Beyond supporting nested document querying, which is already an
 incredible addition since it preserves the relational model on
 indexing normalized content (eg, DB tables, XML docs), LUCENE-2454
 should also enable speedups in grouping implementation when you group
 by a nested field.
 For the same reason, it can also enable very fast post-group facet
 counting impl (LUCENE-3097) when you what to
 count(distinct(nestedField)), instead of unique documents, as your
 identifier.  I expect many apps that use faceting need this ability
 (to count(distinct(nestedField)) not distinct(docID)).
 To support these use cases, I believe the only core change needed is
 the ability to atomically add or update multiple documents, which you
 cannot do today since in between add/updateDocument calls a flush (eg
 due to commit or getReader()) could occur.
 This new API (addDocuments(Iterable<Document>), updateDocuments(Term
 delTerm, Iterable<Document>)) would also further guarantee that the
 documents are assigned sequential docIDs in the order the iterator
 provided them, and that the docIDs all reside in one segment.
 Segment merging never splits segments apart, so this invariant would
 hold even as merges/optimizes take place.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-3114) PrefixAndSuffixAwareTokenFilter code cleanup

2011-05-17 Thread Robert Muir (JIRA)
PrefixAndSuffixAwareTokenFilter code cleanup


 Key: LUCENE-3114
 URL: https://issues.apache.org/jira/browse/LUCENE-3114
 Project: Lucene - Java
  Issue Type: Task
  Components: modules/analysis
Reporter: Robert Muir


as noted on LUCENE-3113, I think this tokenstream is difficult to review.

In my opinion just changing the 'private PrefixAwareTokenFilter suffix' to 
'private PrefixAwareTokenFilter prefixAndSuffix' would work wonders.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3112) Add IW.add/updateDocuments to support nested documents

2011-05-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034854#comment-13034854
 ] 

Robert Muir commented on LUCENE-3112:
-

{quote}
I suppose we could consider changing the index format today to record
which docs are subs... but I think we don't need to. Maybe I should
strengthen the @experimental to explain the risk that a future
reindexing could be required?
{quote}

I think this would be perfect. I certainly don't want to hold up this 
improvement; yet, in the future, I just didn't want us to be in a 
situation where we say 'well, if only we had recorded this information, 
now it's not possible to do XYZ because someone COULD have used 
add/updateDocuments() for some arbitrary reason and we will 'split' 
their grouped ids'.

We could also include in the note that various existing 
IndexSorters/Splitters are unaware about this, so use with caution :)


 Add IW.add/updateDocuments to support nested documents
 --

 Key: LUCENE-3112
 URL: https://issues.apache.org/jira/browse/LUCENE-3112
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3112.patch


 I think nested documents (LUCENE-2454) is a very compelling addition
 to Lucene.  It's also a popular (many votes) issue.
 Beyond supporting nested document querying, which is already an
 incredible addition since it preserves the relational model on
 indexing normalized content (eg, DB tables, XML docs), LUCENE-2454
 should also enable speedups in grouping implementation when you group
 by a nested field.
 For the same reason, it can also enable very fast post-group facet
 counting impl (LUCENE-3097) when you what to
 count(distinct(nestedField)), instead of unique documents, as your
 identifier.  I expect many apps that use faceting need this ability
 (to count(distinct(nestedField)) not distinct(docID)).
 To support these use cases, I believe the only core change needed is
 the ability to atomically add or update multiple documents, which you
 cannot do today since in between add/updateDocument calls a flush (eg
 due to commit or getReader()) could occur.
 This new API (addDocuments(Iterable<Document>), updateDocuments(Term
 delTerm, Iterable<Document>)) would also further guarantee that the
 documents are assigned sequential docIDs in the order the iterator
 provided them, and that the docIDs all reside in one segment.
 Segment merging never splits segments apart, so this invariant would
 hold even as merges/optimizes take place.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3111) TestFSTs.testRandomWords failure

2011-05-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034868#comment-13034868
 ] 

Michael McCandless commented on LUCENE-3111:


I'm also not able to reproduce...

 TestFSTs.testRandomWords failure
 

 Key: LUCENE-3111
 URL: https://issues.apache.org/jira/browse/LUCENE-3111
 Project: Lucene - Java
  Issue Type: Bug
Reporter: selckin
Assignee: Michael McCandless
Priority: Minor

 Was running some while(1) tests on the docvalues branch (r1103705) and the 
 following test failed:
 {code}
 [junit] Testsuite: org.apache.lucene.util.automaton.fst.TestFSTs
 [junit] Testcase: 
 testRandomWords(org.apache.lucene.util.automaton.fst.TestFSTs): FAILED
 [junit] expected:<771> but was:<TwoLongs:771,771>
 [junit] junit.framework.AssertionFailedError: expected:<771> but 
 was:<TwoLongs:771,771>
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.verifyUnPruned(TestFSTs.java:540)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.doTest(TestFSTs.java:496)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.doTest(TestFSTs.java:359)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs.doTest(TestFSTs.java:319)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs.testRandomWords(TestFSTs.java:940)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs.testRandomWords(TestFSTs.java:915)
 [junit]   at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
 [junit]   at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
 [junit] 
 [junit] 
 [junit] Tests run: 7, Failures: 1, Errors: 0, Time elapsed: 7.628 sec
 [junit] 
 [junit] - Standard Error -
 [junit] NOTE: Ignoring nightly-only test method 'testBigSet'
 [junit] NOTE: reproduce with: ant test -Dtestcase=TestFSTs 
 -Dtestmethod=testRandomWords -Dtests.seed=-269475578956012681:0
 [junit] NOTE: test params are: codec=PreFlex, locale=ar, 
 timezone=America/Blanc-Sablon
 [junit] NOTE: all tests run in this JVM:
 [junit] [TestToken, TestCodecs, TestIndexReaderReopen, 
 TestIndexWriterMerging, TestNoDeletionPolicy, TestParallelReaderEmptyIndex, 
 TestParallelTermEnum, TestPerSegmentDeletes, TestSegmentReader, 
 TestSegmentTermDocs, TestStressAdvance, TestTermVectorsReader, 
 TestSurrogates, TestMultiFieldQueryParser, TestAutomatonQuery, 
 TestBooleanScorer, TestFuzzyQuery, TestMultiTermConstantScore, 
 TestNumericRangeQuery64, TestPositiveScoresOnlyCollector, TestPrefixFilter, 
 TestQueryTermVector, TestScorerPerf, TestSloppyPhraseQuery, 
 TestSpansAdvanced, TestWindowsMMap, TestRamUsageEstimator, TestSmallFloat, 
 TestUnicodeUtil, TestFSTs]
 [junit] NOTE: Linux 2.6.37-gentoo amd64/Sun Microsystems Inc. 1.6.0_25 
 (64-bit)/cpus=8,threads=1,free=137329960,total=208207872
 [junit] -  ---
 [junit] TEST org.apache.lucene.util.automaton.fst.TestFSTs FAILED
 {code}
 I am not able to reproduce

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[Lucene.Net] [jira] [Resolved] (LUCENENET-410) Lucene In Action (LIA book) samples for .NET.

2011-05-17 Thread Prescott Nasser (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENENET-410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prescott Nasser resolved LUCENENET-410.
---

Resolution: Not A Problem

 Lucene In Action (LIA book) samples for .NET.
 -

 Key: LUCENENET-410
 URL: https://issues.apache.org/jira/browse/LUCENENET-410
 Project: Lucene.Net
  Issue Type: New Feature
Reporter: Pasha Bizhan
Priority: Minor
 Attachments: liabook1_net_samples.zip


 First edition, Lucene.Net 1.4
 Not all samples from the book, only those suitable for .NET. 
 For example, Nutch samples are excluded.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (SOLR-2521) TestJoin.testRandom fails

2011-05-17 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley reassigned SOLR-2521:
--

Assignee: Yonik Seeley

 TestJoin.testRandom fails
 -

 Key: SOLR-2521
 URL: https://issues.apache.org/jira/browse/SOLR-2521
 Project: Solr
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Yonik Seeley
 Fix For: 4.0


 Hit this random failure; it reproduces on trunk:
 {noformat}
 [junit] Testsuite: org.apache.solr.TestJoin
 [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 4.512 sec
 [junit] 
 [junit] - Standard Error -
 [junit] 2011-05-16 12:51:46 org.apache.solr.TestJoin testRandomJoin
 [junit] SEVERE: GROUPING MISMATCH: mismatch: '0'!='1' @ response/numFound
 [junit]   
 request=LocalSolrQueryRequest{echoParams=all&indent=true&q={!join+from%3Dsmall_i+to%3Dsmall3_is}*:*&wt=json}
 [junit]   result={
 [junit]   responseHeader:{
 [junit] status:0,
 [junit] QTime:0,
 [junit] params:{
 [junit]   echoParams:all,
 [junit]   indent:true,
 [junit]   q:{!join from=small_i to=small3_is}*:*,
 [junit]   wt:json}},
 [junit]   response:{numFound:1,start:0,docs:[
 [junit]   {
 [junit] id:NXEA,
 [junit] score_f:87.90162,
 [junit] small3_ss:[N,
 [junit]   v,
 [junit]   n],
 [junit] small_i:4,
 [junit] small2_i:1,
 [junit] small2_is:[2],
 [junit] small3_is:[69,
 [junit]   88,
 [junit]   54,
 [junit]   80,
 [junit]   75,
 [junit]   83,
 [junit]   57,
 [junit]   73,
 [junit]   85,
 [junit]   52,
 [junit]   50,
 [junit]   88,
 [junit]   51,
 [junit]   89,
 [junit]   12,
 [junit]   8,
 [junit]   19,
 [junit]   23,
 [junit]   53,
 [junit]   75,
 [junit]   26,
 [junit]   99,
 [junit]   0,
 [junit]   44]}]
 [junit]   }}
 [junit]   expected={numFound:0,start:0,docs:[]}
 [junit]   model={NXEA:Doc(0):[id=NXEA, score_f=87.90162, small3_ss=[N, 
 v, n], small_i=4, small2_i=1, small2_is=2, small3_is=[69, 88, 54, 80, 75, 83, 
 57, 73, 85, 52, 50, 88, 51, 89, 12, 8, 19, 23, 53, 75, 26, 99, 0, 
 44]],JSLZ:Doc(1):[id=JSLZ, score_f=11.198811, small2_ss=[c, d], 
 small3_ss=[b, R, H, Q, O, f, C, e, Z, u, z, u, w, I, f, _, Y, r, w, u], 
 small_i=6, small2_is=[2, 3], small3_is=[22, 1]],FAWX:Doc(2):[id=FAWX, 
 score_f=25.524109, small_s=d, small3_ss=[O, D, X, `, W, z, k, M, j, m, r, [, 
 E, P, w, ^, y, T, e, R, V, H, g, e, I], small_i=2, small2_is=[2, 1], 
 small3_is=[95, 42]],GDDZ:Doc(3):[id=GDDZ, score_f=8.483642, small2_ss=[b, 
 e], small3_ss=[o, i, y, l, I, O, r, O, f, d, E, e, d, f, b, P], small2_is=[6, 
 6], small3_is=[36, 48, 9, 8, 40, 40, 68]],RBIQ:Doc(4):[id=RBIQ, 
 score_f=97.06258, small_s=b, small2_s=c, small2_ss=[e, e], small_i=2, 
 small2_is=6, small3_is=[13, 77, 96, 45]],LRDM:Doc(5):[id=LRDM, 
 score_f=82.302124, small_s=b, small2_s=a, small2_ss=d, small3_ss=[H, m, O, D, 
 I, J, U, D, f, N, ^, m, I, j, L, s, F, h, A, `, c, j], small2_i=2, 
 small2_is=[2, 7], small3_is=[81, 31, 78, 23, 88, 1, 7, 86, 20, 7, 40, 52, 
 100, 81, 34, 45, 87, 72, 14, 5]]}
 [junit] NOTE: reproduce with: ant test -Dtestcase=TestJoin 
 -Dtestmethod=testRandomJoin 
 -Dtests.seed=-4998031941344546449:8541928265064992444
 [junit] NOTE: test params are: codec=RandomCodecProvider: {id=MockRandom, 
 small2_ss=Standard, small2_is=MockFixedIntBlock(blockSize=1738), 
 small2_s=MockFixedIntBlock(blockSize=1738), 
 small3_is=MockVariableIntBlock(baseBlockSize=77), 
 small_i=MockFixedIntBlock(blockSize=1738), 
 small_s=MockVariableIntBlock(baseBlockSize=77), score_f=MockSep, 
 small2_i=Pulsing(freqCutoff=9), small3_ss=SimpleText}, locale=sr_BA, 
 timezone=America/Barbados
 [junit] NOTE: all tests run in this JVM:
 [junit] [TestJoin]
 [junit] NOTE: Linux 2.6.33.6-147.fc13.x86_64 amd64/Sun Microsystems Inc. 
 1.6.0_21 (64-bit)/cpus=24,threads=1,free=252342544,total=308084736
 [junit] -  ---
 [junit] Testcase: testRandomJoin(org.apache.solr.TestJoin):   FAILED
 [junit] mismatch: '0'!='1' @ response/numFound
 [junit] junit.framework.AssertionFailedError: mismatch: '0'!='1' @ 
 response/numFound
 [junit]   at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
 [junit]   at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
 [junit]   at 

[jira] [Commented] (LUCENE-3111) TestFSTs.testRandomWords failure

2011-05-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034876#comment-13034876
 ] 

Robert Muir commented on LUCENE-3111:
-

This sounds like a bug in either the test or test-infra.

I'm not able to reproduce, but if I run this test with -Dtests.iter=100, I'm 
able to produce a similar failure (again not reproducible).

So first I'd like to see if we can find the reproducibility bug. This is the 
most important thing to me :)

 TestFSTs.testRandomWords failure
 

 Key: LUCENE-3111
 URL: https://issues.apache.org/jira/browse/LUCENE-3111
 Project: Lucene - Java
  Issue Type: Bug
Reporter: selckin
Assignee: Michael McCandless
Priority: Minor

 Was running some while(1) tests on the docvalues branch (r1103705) and the 
 following test failed:
 {code}
 [junit] Testsuite: org.apache.lucene.util.automaton.fst.TestFSTs
 [junit] Testcase: 
 testRandomWords(org.apache.lucene.util.automaton.fst.TestFSTs): FAILED
 [junit] expected:<771> but was:<TwoLongs:771,771>
 [junit] junit.framework.AssertionFailedError: expected:<771> but 
 was:<TwoLongs:771,771>
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.verifyUnPruned(TestFSTs.java:540)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.doTest(TestFSTs.java:496)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.doTest(TestFSTs.java:359)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs.doTest(TestFSTs.java:319)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs.testRandomWords(TestFSTs.java:940)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs.testRandomWords(TestFSTs.java:915)
 [junit]   at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
 [junit]   at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
 [junit] 
 [junit] 
 [junit] Tests run: 7, Failures: 1, Errors: 0, Time elapsed: 7.628 sec
 [junit] 
 [junit] - Standard Error -
 [junit] NOTE: Ignoring nightly-only test method 'testBigSet'
 [junit] NOTE: reproduce with: ant test -Dtestcase=TestFSTs 
 -Dtestmethod=testRandomWords -Dtests.seed=-269475578956012681:0
 [junit] NOTE: test params are: codec=PreFlex, locale=ar, 
 timezone=America/Blanc-Sablon
 [junit] NOTE: all tests run in this JVM:
 [junit] [TestToken, TestCodecs, TestIndexReaderReopen, 
 TestIndexWriterMerging, TestNoDeletionPolicy, TestParallelReaderEmptyIndex, 
 TestParallelTermEnum, TestPerSegmentDeletes, TestSegmentReader, 
 TestSegmentTermDocs, TestStressAdvance, TestTermVectorsReader, 
 TestSurrogates, TestMultiFieldQueryParser, TestAutomatonQuery, 
 TestBooleanScorer, TestFuzzyQuery, TestMultiTermConstantScore, 
 TestNumericRangeQuery64, TestPositiveScoresOnlyCollector, TestPrefixFilter, 
 TestQueryTermVector, TestScorerPerf, TestSloppyPhraseQuery, 
 TestSpansAdvanced, TestWindowsMMap, TestRamUsageEstimator, TestSmallFloat, 
 TestUnicodeUtil, TestFSTs]
 [junit] NOTE: Linux 2.6.37-gentoo amd64/Sun Microsystems Inc. 1.6.0_25 
 (64-bit)/cpus=8,threads=1,free=137329960,total=208207872
 [junit] -  ---
 [junit] TEST org.apache.lucene.util.automaton.fst.TestFSTs FAILED
 {code}
 I am not able to reproduce

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3111) TestFSTs.testRandomWords failure

2011-05-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034880#comment-13034880
 ] 

Robert Muir commented on LUCENE-3111:
-

ok, the problem is the test overrides setUp() but doesn't call super.setUp(), 
and it does the same with tearDown().

Currently the way LuceneTestCase checks for this is very crude; in other words, 
if you make this mistake with one or the other, but not both, it will catch it!

The only workaround I know of to find test bugs like this is to install 
FindBugs: it has a specific check for this exact test bug! We could run it on 
all of our tests.
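The setUp()/tearDown() bug described here can be illustrated with plain Java (no JUnit dependency; class names are made up):

```java
// Hedged illustration of the bug: a subclass that overrides setUp()
// without calling super.setUp() silently skips the base class's
// initialization, which only shows up later as mysterious failures.
public class SetUpOverrideDemo {
    static class Base {
        boolean baseInitialized;
        void setUp() { baseInitialized = true; }
    }

    static class Broken extends Base {
        @Override
        void setUp() { /* forgot super.setUp() */ }
    }

    static class Correct extends Base {
        @Override
        void setUp() {
            super.setUp();   // always delegate up first
            // ...own setup here...
        }
    }

    public static void main(String[] args) {
        Base broken = new Broken();
        broken.setUp();
        Base correct = new Correct();
        correct.setUp();
        System.out.println(broken.baseInitialized);  // false
        System.out.println(correct.baseInitialized); // true
    }
}
```

This is the pattern FindBugs flags: an override of a lifecycle method that never invokes its super implementation.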

 TestFSTs.testRandomWords failure
 

 Key: LUCENE-3111
 URL: https://issues.apache.org/jira/browse/LUCENE-3111
 Project: Lucene - Java
  Issue Type: Bug
Reporter: selckin
Assignee: Michael McCandless
Priority: Minor

 Was running some while(1) tests on the docvalues branch (r1103705) and the 
 following test failed:
 {code}
 [junit] Testsuite: org.apache.lucene.util.automaton.fst.TestFSTs
 [junit] Testcase: 
 testRandomWords(org.apache.lucene.util.automaton.fst.TestFSTs): FAILED
 [junit] expected:<771> but was:<TwoLongs:771,771>
 [junit] junit.framework.AssertionFailedError: expected:<771> but 
 was:<TwoLongs:771,771>
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.verifyUnPruned(TestFSTs.java:540)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.doTest(TestFSTs.java:496)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.doTest(TestFSTs.java:359)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs.doTest(TestFSTs.java:319)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs.testRandomWords(TestFSTs.java:940)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs.testRandomWords(TestFSTs.java:915)
 [junit]   at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
 [junit]   at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
 [junit] 
 [junit] 
 [junit] Tests run: 7, Failures: 1, Errors: 0, Time elapsed: 7.628 sec
 [junit] 
 [junit] - Standard Error -
 [junit] NOTE: Ignoring nightly-only test method 'testBigSet'
 [junit] NOTE: reproduce with: ant test -Dtestcase=TestFSTs 
 -Dtestmethod=testRandomWords -Dtests.seed=-269475578956012681:0
 [junit] NOTE: test params are: codec=PreFlex, locale=ar, 
 timezone=America/Blanc-Sablon
 [junit] NOTE: all tests run in this JVM:
 [junit] [TestToken, TestCodecs, TestIndexReaderReopen, 
 TestIndexWriterMerging, TestNoDeletionPolicy, TestParallelReaderEmptyIndex, 
 TestParallelTermEnum, TestPerSegmentDeletes, TestSegmentReader, 
 TestSegmentTermDocs, TestStressAdvance, TestTermVectorsReader, 
 TestSurrogates, TestMultiFieldQueryParser, TestAutomatonQuery, 
 TestBooleanScorer, TestFuzzyQuery, TestMultiTermConstantScore, 
 TestNumericRangeQuery64, TestPositiveScoresOnlyCollector, TestPrefixFilter, 
 TestQueryTermVector, TestScorerPerf, TestSloppyPhraseQuery, 
 TestSpansAdvanced, TestWindowsMMap, TestRamUsageEstimator, TestSmallFloat, 
 TestUnicodeUtil, TestFSTs]
 [junit] NOTE: Linux 2.6.37-gentoo amd64/Sun Microsystems Inc. 1.6.0_25 
 (64-bit)/cpus=8,threads=1,free=137329960,total=208207872
 [junit] -  ---
 [junit] TEST org.apache.lucene.util.automaton.fst.TestFSTs FAILED
 {code}
 I am not able to reproduce

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir

2011-05-17 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034884#comment-13034884
 ] 

David Smiley commented on LUCENE-3092:
--

This looks cool.  Any performance measurements?  Perhaps a forthcoming post on 
Mike's blog? :-)

 NRTCachingDirectory, to buffer small segments in a RAMDir
 -

 Key: LUCENE-3092
 URL: https://issues.apache.org/jira/browse/LUCENE-3092
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/store
Reporter: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch, 
 LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch


 I created this simple Directory impl, whose goal is to reduce IO
 contention in a frequently reopening NRT use case.
 The idea is, when reopening quickly but not indexing that much
 content, you wind up with many small files created over time, which can
 stress the IO system, e.g. if merges and searching are also
 fighting for IO.
 So, NRTCachingDirectory puts these newly created files into a RAMDir,
 and only when they are merged into a too-large segment does it then
 write through to the real (delegate) directory.
 This lets you spend some RAM to reduce IO.
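A minimal sketch of the write-through idea described above (invented names, not the real NRTCachingDirectory API): small newly created files live in RAM, large ones write through immediately, and a cached file can later be flushed to the backing store.

```java
import java.util.HashMap;
import java.util.Map;

class CachingStore {
    private final Map<String, byte[]> ram = new HashMap<>();   // stands in for the RAMDir
    private final Map<String, byte[]> disk = new HashMap<>();  // stands in for the delegate FSDirectory
    private final int maxCachedBytes;

    CachingStore(int maxCachedBytes) { this.maxCachedBytes = maxCachedBytes; }

    void createFile(String name, byte[] contents) {
        if (contents.length <= maxCachedBytes) {
            ram.put(name, contents);       // small file: keep in RAM
        } else {
            disk.put(name, contents);      // large file: write through
        }
    }

    // Called when a cached file must become durable, e.g. after it is
    // merged into a too-large segment.
    void sync(String name) {
        byte[] contents = ram.remove(name);
        if (contents != null) disk.put(name, contents);
    }

    boolean inRam(String name) { return ram.containsKey(name); }
    boolean onDisk(String name) { return disk.containsKey(name); }
}
```

The trade-off is exactly the one the description names: RAM spent on the cache buys fewer tiny files hitting the IO system.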




[jira] [Commented] (SOLR-2424) extracted text from tika has no spaces

2011-05-17 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034886#comment-13034886
 ] 

Andrzej Bialecki  commented on SOLR-2424:
-

Liam, what version of the command-line Tika app did you use for this test? Was 
it the exact same version as the one in Solr?

 extracted text from tika has no spaces
 --

 Key: SOLR-2424
 URL: https://issues.apache.org/jira/browse/SOLR-2424
 Project: Solr
  Issue Type: Bug
  Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 3.1
Reporter: Yonik Seeley
 Attachments: ET2000 Service Manual.pdf


 Try this:
 curl "http://localhost:8983/solr/update/extract?extractOnly=true&wt=json&indent=true" \
   -F "tutorial=@tutorial.pdf"
 And you get text output w/o spaces: 
 ThisdocumentcoversthebasicsofrunningSolru...




[jira] [Commented] (SOLR-1395) Integrate Katta

2011-05-17 Thread Jamie Johnson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034887#comment-13034887
 ] 

Jamie Johnson commented on SOLR-1395:
-

Is there any updated documentation for how to do this?  I've attempted to run 
through the patching process but the exact steps are not clear since the 
versions have changed significantly.  

 Integrate Katta
 ---

 Key: SOLR-1395
 URL: https://issues.apache.org/jira/browse/SOLR-1395
 Project: Solr
  Issue Type: New Feature
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 3.2

 Attachments: SOLR-1395.patch, SOLR-1395.patch, SOLR-1395.patch, 
 back-end.log, front-end.log, hadoop-core-0.19.0.jar, katta-core-0.6-dev.jar, 
 katta-solrcores.jpg, katta.node.properties, katta.zk.properties, 
 log4j-1.2.13.jar, solr-1395-1431-3.patch, solr-1395-1431-4.patch, 
 solr-1395-1431-katta0.6.patch, solr-1395-1431-katta0.6.patch, 
 solr-1395-1431.patch, solr-1395-katta-0.6.2-1.patch, 
 solr-1395-katta-0.6.2-2.patch, solr-1395-katta-0.6.2-3.patch, 
 solr-1395-katta-0.6.2.patch, test-katta-core-0.6-dev.jar, 
 zkclient-0.1-dev.jar, zookeeper-3.2.1.jar

   Original Estimate: 336h
  Remaining Estimate: 336h

 We'll integrate Katta into Solr so that:
 * Distributed search uses Hadoop RPC
 * Shard/SolrCore distribution and management
 * Zookeeper based failover
 * Indexes may be built using Hadoop




[jira] [Commented] (LUCENE-3111) TestFSTs.testRandomWords failure

2011-05-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034895#comment-13034895
 ] 

Michael McCandless commented on LUCENE-3111:


Doh!

+1 for findbugs.





[jira] [Commented] (LUCENE-3111) TestFSTs.testRandomWords failure

2011-05-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034899#comment-13034899
 ] 

Michael McCandless commented on LUCENE-3111:


OK this reproduces the bug, once you add the missing calls to 
super.setUp/tearDown:

{noformat}
ant test -Dtestcase=TestFSTs -Dtestmethod=testRandomWords 
-Dtests.seed=6166279653770643480:6589011488658196383
{noformat}





[jira] [Commented] (LUCENE-3111) TestFSTs.testRandomWords failure

2011-05-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034900#comment-13034900
 ] 

Robert Muir commented on LUCENE-3111:
-

I have an idea for how I think I can make LuceneTestCase fail if a test does 
this... I'll see if I can improve the setUp/tearDown checking this way so we 
don't have this issue again.





[jira] [Resolved] (LUCENE-3098) Grouped total count

2011-05-17 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-3098.


Resolution: Fixed

Committed.  I made a small change to TestGrouping (renamed one variable) and 
tweaked jdocs a bit on AllGroupsCollector.

This is a great addition to the grouping module -- thanks Martijn!

 Grouped total count
 ---

 Key: LUCENE-3098
 URL: https://issues.apache.org/jira/browse/LUCENE-3098
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Martijn van Groningen
Assignee: Michael McCandless
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3098-3x.patch, LUCENE-3098-3x.patch, 
 LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch, 
 LUCENE-3098.patch


 When grouping, you currently get two counts:
 * Total hit count, which counts all documents that matched the query.
 * Total grouped hit count, which counts all documents that have been grouped 
 into the top N groups.
 Since the end user gets groups in the search results instead of plain 
 documents when grouping, the total number of groups often makes more sense 
 as the total count. 
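A toy illustration (not the grouping module's API) of the three counts in play; for simplicity it ranks groups by first-encountered order, which stands in for real relevance ranking:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class GroupCounts {
    // Returns {totalHitCount, totalGroupCount, totalGroupedHitCount}
    // given each matching document's group key (e.g. an author field).
    static int[] counts(List<String> hitGroups, int topN) {
        int totalHitCount = hitGroups.size();

        // The new count this issue adds: number of distinct groups.
        Set<String> allGroups = new LinkedHashSet<>(hitGroups);
        int totalGroupCount = allGroups.size();

        // "Top N" groups, approximated here as the first N distinct groups.
        Set<String> topGroups = new LinkedHashSet<>();
        for (String g : hitGroups) {
            if (topGroups.size() < topN) topGroups.add(g);
        }

        // Total grouped hit count: hits falling inside the top N groups.
        int totalGroupedHitCount = 0;
        for (String g : hitGroups) {
            if (topGroups.contains(g)) totalGroupedHitCount++;
        }
        return new int[] { totalHitCount, totalGroupCount, totalGroupedHitCount };
    }

    public static void main(String[] args) {
        int[] c = counts(Arrays.asList("a", "b", "a", "c", "b", "a", "d"), 2);
        System.out.println(c[0] + " hits, " + c[1] + " groups, "
                + c[2] + " hits in top groups"); // 7 hits, 4 groups, 5 hits in top groups
    }
}
```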




[jira] [Updated] (SOLR-2193) Re-architect Update Handler

2011-05-17 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated SOLR-2193:
--

Attachment: SOLR-2193.patch

Here is a new patch - a couple of tests, a couple of fixes, etc. It still has 
no commitWithin-type support for soft commits.

Tested and made the auto soft commit code work.

I spent some time today firing documents rapidly at Solr with a soft commit max 
time of 1 second. Fantastic results at about 100 Wikipedia documents per 
second. Didn't change any other example settings this time.
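A hypothetical sketch (invented names, not Solr's CommitTracker API) of a max-time auto soft-commit policy like the one being exercised: the first uncommitted add starts a clock, and a soft commit fires once the configured max time has elapsed.

```java
class SoftCommitPolicy {
    private final long maxTimeMs;
    private long firstPendingAddMs = -1; // -1 means no uncommitted adds

    SoftCommitPolicy(long maxTimeMs) { this.maxTimeMs = maxTimeMs; }

    // Called for every document add; the clock starts at the first pending add.
    void onAdd(long nowMs) {
        if (firstPendingAddMs < 0) firstPendingAddMs = nowMs;
    }

    // True once the oldest uncommitted add has waited at least maxTimeMs.
    boolean shouldSoftCommit(long nowMs) {
        return firstPendingAddMs >= 0 && nowMs - firstPendingAddMs >= maxTimeMs;
    }

    // After the (cheap, non-fsync) soft commit and reopen, reset the clock.
    void onSoftCommit() { firstPendingAddMs = -1; }
}
```

With maxTimeMs = 1000, firing ~100 documents per second means each soft commit batches roughly 100 adds, which matches the scenario described.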

 Re-architect Update Handler
 ---

 Key: SOLR-2193
 URL: https://issues.apache.org/jira/browse/SOLR-2193
 Project: Solr
  Issue Type: Improvement
Reporter: Mark Miller
Assignee: Mark Miller
 Fix For: 4.0

 Attachments: SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, 
 SOLR-2193.patch


 The update handler needs an overhaul.
 A few goals I think we might want to look at:
 1. Cleanup - drop DirectUpdateHandler(2) line - move to something like 
 UpdateHandler, DefaultUpdateHandler
 2. Expose the SolrIndexWriter in the api or add the proper abstractions to 
 get done what we now do with special casing:
 if (directupdatehandler2)
   success
  else
   failish
 3. Stop closing the IndexWriter and start using commit (still lazy IW init 
 though).
 4. Drop iwAccess, iwCommit locks and sync mostly at the Lucene level.
 5. Keep NRT support in mind.
 6. Keep microsharding in mind (maintain logical index as multiple physical 
 indexes)
 7. Address the current issues we face because multiple original/'reloaded' 
 cores can have a different IndexWriter on the same index.




[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir

2011-05-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034911#comment-13034911
 ] 

Michael McCandless commented on LUCENE-3092:


Alas, I haven't had time to really dig into the perf gains here... but I 
suspect that on systems where IO is in contention (due to ongoing cold 
searching, or merging) and the reopen rate is highish, this should be a decent 
win, since we don't burden the IO system with many tiny files.





[jira] [Commented] (SOLR-2193) Re-architect Update Handler

2011-05-17 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034918#comment-13034918
 ] 

Mark Miller commented on SOLR-2193:
---

Next I need to look at the thread safety of CommitTracker under the new locking 
system.





[jira] [Resolved] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir

2011-05-17 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-3092.


Resolution: Fixed





[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir

2011-05-17 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034927#comment-13034927
 ] 

Shai Erera commented on LUCENE-3092:


Mike, this is a great idea! If there is any chance it will be released in 
3.2, I think one of our NRT apps can make good use of it.

Question - I see that the NRTCD ctor takes a Directory. Is there any reason to 
pass a RAMDir to NRTCD? I assume you take a Directory for any other Dir impls 
out there that may not subclass, e.g., FSDir, which is ok - so can we at least 
document that this Dir is not useful if you intend to pass a RAMDir to it?

Unless I am wrong and it is useful with a RAMDir as well. 





Re: Bulk changing issues in JIRA

2011-05-17 Thread Mark Miller
Thanks Shai! Would make a great addition to the wiki ;)

On May 16, 2011, at 11:47 PM, Shai Erera wrote:

 Hi
 
 If you ever wondered how to bulk change issues in JIRA, here's the procedure:
 
 * View a list of issues, e.g. by running a query/filter.
 
 * At the top-right, open the Tools menu and select the bulk-change operation.
 
 * The screen changes so that next to each issue there's a check box.
 
 * Mark all the issues you want to change and click Next.
 
 * Select the operation (e.g. Edit).
 
 * The next screen (after choosing the Edit operation) lets you edit the 
 issues. Note the notification check box at the bottom - deselect it if you 
 don't want to spam the list :).
 
 FYI,
 Shai

- Mark Miller
lucidimagination.com

Lucene/Solr User Conference
May 25-26, San Francisco
www.lucenerevolution.org









Lucene/Solr JIRA

2011-05-17 Thread Shai Erera
Hi

Today we have separate JIRA projects for Lucene and Solr. This, IMO, is
starting to become confusing and difficult to maintain. I'll explain:

* With modules, we now have components in the Lucene JIRA project for
different modules (some under modules/*, some under lucene/contrib/*). Will
we have the same component duplication in the Solr JIRA project?

* Where do users go to open a bug report for a module - the Lucene or Solr
project? I'd hate to see them open it under their favorite (or worse, a
randomly picked) project. If so, it'll become a mess.

* Administration - everything needs to be done twice: create versions (the
same ones!) in both projects, close issues (after release), etc.

* Managing a release now means monitoring two JIRA projects for the issues
of one version (3.2, for example). Why?

I'm not sure what two JIRA projects give us. Now that it is one project, why
not make our (committers' and contributors') lives easier by having one JIRA
project with components:
lucene/core
lucene/contrib/xyz
modules/xyz
solr/core
solr/contrib/xyz
general/* (test, build)

It's already becoming confusing:
LUCENE-3097: post-grouping faceting - a great example of a module that
both Lucene and Solr users can use. Opened under the Lucene project, yet it
depends on Solr issues (not a big deal).
LUCENE-3104: could easily have been opened under the Solr project. I don't
know why it was opened under Lucene (a random pick, maybe?).

Can we merge the two?

Shai


[jira] [Commented] (SOLR-2119) IndexSchema should log warning if analyzer is declared with charfilter/tokenizer/tokenfiler out of order

2011-05-17 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034939#comment-13034939
 ] 

Mark Miller commented on SOLR-2119:
---

bq. I think this would be a good service to those users who trip the hard error 
on upgrade: it means Solr is not doing what they thought they asked it to do.

+1

 IndexSchema should log warning if analyzer is declared with 
 charfilter/tokenizer/tokenfiler out of order
 --

 Key: SOLR-2119
 URL: https://issues.apache.org/jira/browse/SOLR-2119
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Hoss Man
 Fix For: 3.2, 4.0


 There seems to be a segment of the user population that has a hard time 
 understanding the distinction between a charfilter, a tokenizer, and a 
 tokenfilter -- while we can certainly try to improve the documentation about 
 what exactly each does, and when they take effect in the analysis chain, one 
 other thing we should do is try to educate people when they construct their 
 analyzer in a way that doesn't make any sense.
 At the moment, some people are attempting to do things like move the Foo 
 tokenFilter before the tokenizer to try and get certain behavior ... 
 at a minimum we should log a warning in this case that doing that doesn't 
 have the desired effect.
 (We could easily make such a situation fail to initialize, but I'm not 
 convinced that would be the best course of action, since some people may have 
 schemas where they have declared a charFilter or tokenizer out of order 
 relative to their tokenFilters, but are still getting correct results that 
 work for them, and breaking their instance on upgrade doesn't seem like it 
 would be productive.)
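A toy pipeline (invented names, not Solr's analysis API) showing why declaration order matters: char filters see the raw text, the tokenizer splits it, and token filters only ever see tokens, so a token filter declared "before" the tokenizer cannot change how the text is split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

public class ToyAnalyzer {
    static List<String> analyze(String text,
                                List<UnaryOperator<String>> charFilters,
                                List<UnaryOperator<String>> tokenFilters) {
        // 1. Char filters transform the raw character stream.
        for (UnaryOperator<String> f : charFilters) {
            text = f.apply(text);
        }
        // 2. The tokenizer splits the (filtered) text into tokens.
        List<String> tokens = new ArrayList<>(Arrays.asList(text.split("\\s+")));
        // 3. Token filters transform each token; they never see raw text.
        for (UnaryOperator<String> f : tokenFilters) {
            for (int i = 0; i < tokens.size(); i++) {
                tokens.set(i, f.apply(tokens.get(i)));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> out = analyze("Foo-Bar baz",
            Arrays.asList(s -> s.replace('-', ' ')),  // char filter: dash to space
            Arrays.asList(String::toLowerCase));      // token filter: lowercase
        System.out.println(out); // [foo, bar, baz]
    }
}
```

Swapping the lowercase filter "ahead" of the tokenizer would change nothing here, which is exactly the misunderstanding the proposed warning targets.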




Re: Lucene/Solr JIRA

2011-05-17 Thread Mark Miller

On May 17, 2011, at 2:22 PM, Shai Erera wrote:

 Can we merge the two?

+1. Due to history and other possible pain points, I don't know that it's the 
right practical idea at the end of the upcoming discussion, but it's certainly 
a good idea.

- Mark Miller
lucidimagination.com

Lucene/Solr User Conference
May 25-26, San Francisco
www.lucenerevolution.org






-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3111) TestFSTs.testRandomWords failure

2011-05-17 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-3111:
---

Attachment: LUCENE-3111.patch

OK I found this -- if you try to add the same output, twice, for the empty 
string, then the builder fails to realize this is a TwoInts and makes a single 
int output!

Thank you random testing :)

I'll commit shortly...
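The failure mode boils down to output merging for duplicate keys; a toy
sketch of the intended behavior (illustrative only -- this is not Lucene's
FST Builder API, and `add_output` is an invented name):

```python
def add_output(outputs, key, value):
    """Toy model of FST output accumulation: if the same key (e.g. the
    empty string) is added twice, the stored output must become a pair
    ("two longs"), not silently remain a single value."""
    if key in outputs:
        prev = outputs[key]
        # Promote a single value to a tuple before appending.
        pair = prev if isinstance(prev, tuple) else (prev,)
        outputs[key] = pair + (value,)
    else:
        outputs[key] = value
    return outputs
```

Adding 771 twice under the empty string should yield the pair (771, 771),
which matches the "TwoLongs:771,771" the test observed.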

 TestFSTs.testRandomWords failure
 

 Key: LUCENE-3111
 URL: https://issues.apache.org/jira/browse/LUCENE-3111
 Project: Lucene - Java
  Issue Type: Bug
Reporter: selckin
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-3111.patch


 Was running some while(1) tests on the docvalues branch (r1103705) and the 
 following test failed:
 {code}
 [junit] Testsuite: org.apache.lucene.util.automaton.fst.TestFSTs
 [junit] Testcase: 
 testRandomWords(org.apache.lucene.util.automaton.fst.TestFSTs): FAILED
 [junit] expected:<771> but was:<TwoLongs:771,771>
 [junit] junit.framework.AssertionFailedError: expected:<771> but 
 was:<TwoLongs:771,771>
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.verifyUnPruned(TestFSTs.java:540)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.doTest(TestFSTs.java:496)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.doTest(TestFSTs.java:359)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs.doTest(TestFSTs.java:319)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs.testRandomWords(TestFSTs.java:940)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs.testRandomWords(TestFSTs.java:915)
 [junit]   at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
 [junit]   at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
 [junit] 
 [junit] 
 [junit] Tests run: 7, Failures: 1, Errors: 0, Time elapsed: 7.628 sec
 [junit] 
 [junit] - Standard Error -
 [junit] NOTE: Ignoring nightly-only test method 'testBigSet'
 [junit] NOTE: reproduce with: ant test -Dtestcase=TestFSTs 
 -Dtestmethod=testRandomWords -Dtests.seed=-269475578956012681:0
 [junit] NOTE: test params are: codec=PreFlex, locale=ar, 
 timezone=America/Blanc-Sablon
 [junit] NOTE: all tests run in this JVM:
 [junit] [TestToken, TestCodecs, TestIndexReaderReopen, 
 TestIndexWriterMerging, TestNoDeletionPolicy, TestParallelReaderEmptyIndex, 
 TestParallelTermEnum, TestPerSegmentDeletes, TestSegmentReader, 
 TestSegmentTermDocs, TestStressAdvance, TestTermVectorsReader, 
 TestSurrogates, TestMultiFieldQueryParser, TestAutomatonQuery, 
 TestBooleanScorer, TestFuzzyQuery, TestMultiTermConstantScore, 
 TestNumericRangeQuery64, TestPositiveScoresOnlyCollector, TestPrefixFilter, 
 TestQueryTermVector, TestScorerPerf, TestSloppyPhraseQuery, 
 TestSpansAdvanced, TestWindowsMMap, TestRamUsageEstimator, TestSmallFloat, 
 TestUnicodeUtil, TestFSTs]
 [junit] NOTE: Linux 2.6.37-gentoo amd64/Sun Microsystems Inc. 1.6.0_25 
 (64-bit)/cpus=8,threads=1,free=137329960,total=208207872
 [junit] -  ---
 [junit] TEST org.apache.lucene.util.automaton.fst.TestFSTs FAILED
 {code}
 I am not able to reproduce

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-3111) TestFSTs.testRandomWords failure

2011-05-17 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-3111.


   Resolution: Fixed
Fix Version/s: 4.0

 TestFSTs.testRandomWords failure
 

 Key: LUCENE-3111
 URL: https://issues.apache.org/jira/browse/LUCENE-3111
 Project: Lucene - Java
  Issue Type: Bug
Reporter: selckin
Assignee: Michael McCandless
Priority: Minor
 Fix For: 4.0

 Attachments: LUCENE-3111.patch


 Was running some while(1) tests on the docvalues branch (r1103705) and the 
 following test failed:
 {code}
 [junit] Testsuite: org.apache.lucene.util.automaton.fst.TestFSTs
 [junit] Testcase: 
 testRandomWords(org.apache.lucene.util.automaton.fst.TestFSTs): FAILED
 [junit] expected:<771> but was:<TwoLongs:771,771>
 [junit] junit.framework.AssertionFailedError: expected:<771> but 
 was:<TwoLongs:771,771>
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.verifyUnPruned(TestFSTs.java:540)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.doTest(TestFSTs.java:496)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs$FSTTester.doTest(TestFSTs.java:359)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs.doTest(TestFSTs.java:319)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs.testRandomWords(TestFSTs.java:940)
 [junit]   at 
 org.apache.lucene.util.automaton.fst.TestFSTs.testRandomWords(TestFSTs.java:915)
 [junit]   at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
 [junit]   at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
 [junit] 
 [junit] 
 [junit] Tests run: 7, Failures: 1, Errors: 0, Time elapsed: 7.628 sec
 [junit] 
 [junit] - Standard Error -
 [junit] NOTE: Ignoring nightly-only test method 'testBigSet'
 [junit] NOTE: reproduce with: ant test -Dtestcase=TestFSTs 
 -Dtestmethod=testRandomWords -Dtests.seed=-269475578956012681:0
 [junit] NOTE: test params are: codec=PreFlex, locale=ar, 
 timezone=America/Blanc-Sablon
 [junit] NOTE: all tests run in this JVM:
 [junit] [TestToken, TestCodecs, TestIndexReaderReopen, 
 TestIndexWriterMerging, TestNoDeletionPolicy, TestParallelReaderEmptyIndex, 
 TestParallelTermEnum, TestPerSegmentDeletes, TestSegmentReader, 
 TestSegmentTermDocs, TestStressAdvance, TestTermVectorsReader, 
 TestSurrogates, TestMultiFieldQueryParser, TestAutomatonQuery, 
 TestBooleanScorer, TestFuzzyQuery, TestMultiTermConstantScore, 
 TestNumericRangeQuery64, TestPositiveScoresOnlyCollector, TestPrefixFilter, 
 TestQueryTermVector, TestScorerPerf, TestSloppyPhraseQuery, 
 TestSpansAdvanced, TestWindowsMMap, TestRamUsageEstimator, TestSmallFloat, 
 TestUnicodeUtil, TestFSTs]
 [junit] NOTE: Linux 2.6.37-gentoo amd64/Sun Microsystems Inc. 1.6.0_25 
 (64-bit)/cpus=8,threads=1,free=137329960,total=208207872
 [junit] -  ---
 [junit] TEST org.apache.lucene.util.automaton.fst.TestFSTs FAILED
 {code}
 I am not able to reproduce

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Lucene/Solr JIRA

2011-05-17 Thread Ryan McKinley
 Can we merge the two?

gut reaction says +1, but after thinking about how it would work, i'm +0

Would we just stop accepting new tickets on one system, but still keep
track of both?  for how long?
Would we move open issues from SOLR to LUCENE?  migrate the comments/history/etc

In the end I think the two systems are fine -- not ideal, and they
should map (more or less) to where the entry should go in CHANGES.txt

ryan

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Bulk changing issues in JIRA

2011-05-17 Thread Shai Erera
Created http://wiki.apache.org/lucene-java/BulkIssuesUpdate

Thanks Mark!

Shai

On Tue, May 17, 2011 at 9:01 PM, Mark Miller markrmil...@gmail.com wrote:

 Thanks Shai! Would make a great addition to the wiki ;)

 On May 16, 2011, at 11:47 PM, Shai Erera wrote:

  Hi
 
  If you ever wondered how to bulk change issues in JIRA, here's the
 procedure:
 
  * View a list of issues, e.g. by query/filter
 
  * At the top-right you'll find this:
 
 
  * Click on Tools and select
 
 
 
  * The screen changes so that next to each issue there's a check box.
 
  * Mark all the issues you want to change and click Next
 
  * Select the operation (e.g. Edit)
 
  * The next screen (followed by choosing operation Edit) lets you edit
 the issues. Note this at the bottom:
 
 
 
  Deselect if you don't want to spam the list :).
 
  FYI,
  Shai

 - Mark Miller
 lucidimagination.com

 Lucene/Solr User Conference
 May 25-26, San Francisco
 www.lucenerevolution.org






 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




[jira] [Commented] (LUCENE-3113) fix analyzer bugs found by MockTokenizer

2011-05-17 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034956#comment-13034956
 ] 

Steven Rowe commented on LUCENE-3113:
-

+1

bq. the ShingleAnalyzerWrapper was double-resetting

Your patch just removes the reset call:

{noformat}
@@ -201,7 +201,6 @@
   TokenStream result = defaultAnalyzer.reusableTokenStream(fieldName, 
reader);
   if (result == streams.wrapped) {
 /* the wrapped analyzer reused the stream */
-streams.shingle.reset(); 
   } else {
 /* the wrapped analyzer did not, create a new shingle around the new 
one */
 streams.wrapped = result;
{noformat}

but inverting the condition would read better:

{noformat}
   TokenStream result = defaultAnalyzer.reusableTokenStream(fieldName, 
reader);
-  if (result == streams.wrapped) {
-/* the wrapped analyzer reused the stream */
-streams.shingle.reset(); 
-  } else {
-/* the wrapped analyzer did not, create a new shingle around the new 
one */
+  if (result != streams.wrapped) {
+// The wrapped analyzer did not reuse the stream. 
+// Wrap the new stream with a new ShingleFilter.
 streams.wrapped = result;
 streams.shingle = new ShingleFilter(streams.wrapped);
   }
{noformat}


 fix analyzer bugs found by MockTokenizer
 

 Key: LUCENE-3113
 URL: https://issues.apache.org/jira/browse/LUCENE-3113
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/analysis
Reporter: Robert Muir
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3113.patch, LUCENE-3113.patch


 In LUCENE-3064, we beefed up MockTokenizer with assertions, and I've switched 
 over the analysis tests to use MockTokenizer for better coverage.
 However, this found a few bugs (one of which is LUCENE-3106):
 * incrementToken() after it returns false in CommonGramsQueryFilter, 
 HyphenatedWordsFilter, ShingleFilter, SynonymFilter
 * missing end() implementation for PrefixAwareTokenFilter
 * double reset() in QueryAutoStopWordAnalyzer and ReusableAnalyzerBase
 * missing correctOffset()s in MockTokenizer itself.
 I think it would be nice to just fix all the bugs on one issue... I've fixed 
 everything except Shingle and Synonym

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (PYLUCENE-9) QueryParser replacing stop words with wildcards

2011-05-17 Thread Christopher Currens (JIRA)

[ 
https://issues.apache.org/jira/browse/PYLUCENE-9?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034961#comment-13034961
 ] 

Christopher Currens commented on PYLUCENE-9:


We can close it.  Thanks for the help.

 QueryParser replacing stop words with wildcards
 ---

 Key: PYLUCENE-9
 URL: https://issues.apache.org/jira/browse/PYLUCENE-9
 Project: PyLucene
  Issue Type: Bug
 Environment: Windows XP 32-bit Sp3, Ubuntu 10.04.2 LTS i686 
 GNU/Linux, jdk1.6.0_23
Reporter: Christopher Currens

 Was using query parser to build a query.  In Java Lucene (as well as 
 Lucene.Net), the query "Calendar Item as Msg" (quotes included), is parsed 
 properly as FullText:"calendar item msg" in Java Lucene and Lucene.Net.  In 
 pylucene, it is parsed as: FullText:"calendar item ? msg".  This causes 
 obvious problems when comparing search results from python, java and .net.
 Initially, I thought it was the Analyzer I was using, but I've tried the 
 StandardAnalyzer and StopAnalyzer, which work properly in Java and .Net, but 
 not pylucene.
 Here is code I've used to reproduce the issue:
  >>> from lucene import StandardAnalyzer, StopAnalyzer, QueryParser, Version
  >>> analyzer = StandardAnalyzer(Version.LUCENE_30)
  >>> query = QueryParser(Version.LUCENE_30, "FullText", analyzer)
  >>> parsedQuery = query.parse(\"Calendar Item as Msg\")
  >>> parsedQuery
  <Query: FullText:"calendar item ? msg">
  >>> analyzer = StopAnalyzer(Version.LUCENE_30)
  >>> query = QueryParser(Version.LUCENE_30)
  >>> parsedQuery = query.parse(\"Calendar Item as Msg\")
  >>> parsedQuery
  <Query: FullText:"calendar item ? msg">
 I've noticed this in pylucene 2.9.4, 2.9.3, and 3.0.3

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (LUCENE-3104) Hook up Automated Patch Checking for Lucene/Solr

2011-05-17 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034964#comment-13034964
 ] 

Grant Ingersoll commented on LUCENE-3104:
-

General Docs started at http://wiki.apache.org/general/PreCommitBuilds

 Hook up Automated Patch Checking for Lucene/Solr
 

 Key: LUCENE-3104
 URL: https://issues.apache.org/jira/browse/LUCENE-3104
 Project: Lucene - Java
  Issue Type: Task
Reporter: Grant Ingersoll

 It would be really great if we could get feedback to contributors sooner on 
 many things that are basic (tests exist, patch applies cleanly, etc.)
 From Nigel Daley on builds@a.o
 {quote}
 I revamped the precommit testing in the fall so that it doesn't use Jira 
 email anymore to trigger a build.  The process is controlled by
 https://builds.apache.org/hudson/job/PreCommit-Admin/
 which has some documentation up at the top of the job.  You can look at the 
 config of the job (do you have access?) to see what it's doing.  Any project 
 could use this same admin job -- you just need to ask me to add the project 
 to the Jira filter used by the admin job 
 (https://issues.apache.org/jira/sr/jira.issueviews:searchrequest-xml/12313474/SearchRequest-12313474.xml?tempMax=100
  ) once you have the downstream job(s) setup for your specific project.  For 
 Hadoop we have 3 downstream builds configured which also have some 
 documentation:
 https://builds.apache.org/hudson/job/PreCommit-HADOOP-Build/
 https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/
 https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/
 {quote}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Apache Jenkins emails

2011-05-17 Thread Doron Cohen
Hmm... wouldn't this help to ignore build failures, while current
situation encourages solving them? :)

I mean, unlike threading JIRA issues which is more convenient now, for build
failures this would hide some info - thread title would indicate the oldest
failure no.

In spite of the above, if others still like to change in this way, I'll be
fine with it.

Doron

On Sun, May 15, 2011 at 6:16 PM, Shai Erera ser...@gmail.com wrote:

 Well, Gmail ignores (for grouping) everything that in between brackets [].
 That's how we made all issue emails appear under the same thread, the status
 (Commented, Created, Resolved etc.) now appears in brackets.

 So, I think that if we put the build # in brackets, the rest of the message
 is the same for all failures. So instead of:

 [JENKINS] Lucene-Solr-tests-only-trunk - Build # 8042 - Still Failing

 we write

 [JENKINS] Lucene-Solr-tests-only-trunk - [Build # 8042] - Still Failing

 Or

 [JENKINS] [Build # 8042] Lucene-Solr-tests-only-trunk Failed

 Remove the word still altogether (it's redundant) and move the build
 number to the start of the subject.

 Shai

 On Sun, May 15, 2011 at 6:08 PM, Uwe Schindler u...@thetaphi.de wrote:

 It’s possible to change the header, as the mails are already customized.
 How should it look like (I don’t use f*g Gmail)



 -

 Uwe Schindler

 H.-H.-Meier-Allee 63, D-28213 Bremen

 http://www.thetaphi.de

 eMail: u...@thetaphi.de



 *From:* Shai Erera [mailto:ser...@gmail.com]
 *Sent:* Sunday, May 15, 2011 5:02 PM
 *To:* dev@lucene.apache.org
 *Subject:* Apache Jenkins emails



 Hi

 Is it possible to change the subject format of the emails Jenkins server
 sends? I was thinking, if we put the build # in [], all failures will be
 grouped under one thread (in Gmail). Since we have so many of them, it will
 at least collapse all of them into a single thread. We can still tell the
 failure of each email as well as the build #.

 What do you think?

 Shai
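Mechanically, the subject change Shai proposes is a one-line rewrite; a
sketch (hypothetical -- the real change would live in the Jenkins email
template, not in code like this):

```python
import re

def group_friendly_subject(subject):
    # Wrap the build number in brackets so Gmail ignores it when
    # grouping, collapsing all failure mails into a single thread.
    return re.sub(r"Build # (\d+)", r"[Build # \1]", subject)
```

For example, "... - Build # 8042 - Still Failing" becomes
"... - [Build # 8042] - Still Failing", so only the build number varies
inside brackets and the rest of the subject threads together.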





Re: Lucene/Solr JIRA

2011-05-17 Thread Chris Hostetter

If we were starting from scratch, i'd agree with you that having a single 
Jira project makes more sense, but given where we are today, i think we 
should probably keep them distinct -- partly from a pain of migration 
standpoint on our end, but also from a user expectations standpoint -- i 
think the Solr users/community as a whole is used to the existence of the 
SOLR project in Jira, and used to the SOLR-* issue naming convention, and 
it would likely be more confusing for *them* to change now.

: * With modules, we now have components in the Lucene JIRA project for
: different modules (some under modules/* some under lucene/contrib/*). Will
: we have the same components duplication in the Solr JIRA project?

when we discussed this before, it seemed clear that top level modules 
should be tracked as LUCENE issues, so i see no reason why there would be 
duplications.

: * Where do users go to open a bug report for a module - Lucene or Solr
: projects? I'd hate to see that they open it under their favorite (or
: worse. random picking) project. If so, it'll become a mess.

the user bases tend to be very distinct -- if people are dealing with the 
lucene java API directly they file a LUCENE bug, if they are dealing with 
the Solr HTTP or client layer (SolrJ) APIs they file a Solr bug.

If an issue is filed in a place where we think it doesn't make sense, the 
issue can easily be moved (and Jira does a redirect for anyone following 
old links)

: * Administration -- everything needs to be done twice. Create versions (same
: one !) on both projects, close issues (after release) etc.

given the low overhead of this, it doesn't seem all that problematic.

: * Managing a release now means I should monitor two JIRA projects for the
: 3.2 (an example) version issues. Why?

Here's an example of a filter that shows you all issues marked to be fixed 
in 3.2 in both projects...

https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=truejqlQuery=%28project+%3D+SOLR+OR+project+%3D+LUCENE%29+AND+fixVersion+%3D+%223.2%22+AND+resolution+%3D+Unresolved+ORDER+BY+updated+DESC%2C+key+DESC%2C+priority+DESC
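URL-decoded, that filter's JQL is:

    (project = SOLR OR project = LUCENE) AND fixVersion = "3.2"
      AND resolution = Unresolved
      ORDER BY updated DESC, key DESC, priority DESC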

: I guess I'm not too sure what do two JIRA projects give us. Now that it is
: the same project, why not make our (committers and contributors) life easier

Short answer: trade off ease of use for committers + pain of 
migration against ease of use for users ... doesn't seem like a strong 
need to change.

: It's already becoming confusing:

neither of these examples seem that confusing to me...

: LUCENE-3097: post grouping faceting -- a great example for a module that 
: both Lucene and Solr users can use. Opened under Lucene project, and 
: depends on Solr issues (not a big deal)

it's an issue for implementing a top level module, therefore it goes in 
LUCENE.  it doesn't depend on any Solr issue, it's marked as being blocked 
by another issue about adding another top level module 

: LUCENE-3104: could easily have been opened under the Solr project. I 
: don't know why it was opened under Lucene (random maybe?)

Because it's about improving the hudson build which operates at the top 
level of the tree



-Hoss

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir

2011-05-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034981#comment-13034981
 ] 

Michael McCandless commented on LUCENE-3092:


I committed it to 3.x as well so this will be in 3.2 :)

I can't think of any reason why you'd want to wrap another RAMDir with 
NRTCD?  We can fix the docs to state this.  Can you work out the 
wording/patch?  Or just go ahead and commit a fix :)

 NRTCachingDirectory, to buffer small segments in a RAMDir
 -

 Key: LUCENE-3092
 URL: https://issues.apache.org/jira/browse/LUCENE-3092
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/store
Reporter: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch, 
 LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch


 I created this simple Directory impl, whose goal is to reduce IO
 contention in a frequent-reopen NRT use case.
 The idea is, when reopening quickly, but not indexing that much
 content, you wind up with many small files created over time, that can
 possibly stress the IO system, eg if merges and searching are also
 fighting for IO.
 So, NRTCachingDirectory puts these newly created files into a RAMDir,
 and only when they are merged into a too-large segment does it then
 write-through to the real (delegate) directory.
 This lets you spend some RAM to reduce IO.
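 A minimal, language-agnostic sketch of the write-through policy described 
 above (illustrative only -- `NRTCachePolicy` is an invented name, and the 
 real NRTCachingDirectory bases its decision on context such as merge size 
 and configured RAM limits):

```python
class NRTCachePolicy:
    """Toy model of NRTCachingDirectory's buffering: small flush-produced
    files live in RAM; large (merged) files and synced files go to the
    delegate (on-disk) directory."""

    def __init__(self, max_cached_bytes):
        self.max_cached_bytes = max_cached_bytes
        self.ram = {}    # name -> bytes (cached small files)
        self.disk = {}   # stands in for the delegate FSDirectory

    def create_file(self, name, data, from_merge=False):
        # Files produced by large merges bypass the cache entirely.
        if from_merge or len(data) > self.max_cached_bytes:
            self.disk[name] = data
        else:
            self.ram[name] = data

    def sync(self, name):
        # On sync/commit, spill the cached copy through to stable storage.
        if name in self.ram:
            self.disk[name] = self.ram.pop(name)
```

 Small NRT flush files stay in RAM and are only written through on merge or 
 sync, which is exactly the IO the real directory is trying to avoid.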

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Apache Jenkins emails

2011-05-17 Thread Michael McCandless
Yeah I agree... build failures should be as annoying as possible ;)

Mike

http://blog.mikemccandless.com

On Tue, May 17, 2011 at 2:58 PM, Doron Cohen cdor...@gmail.com wrote:
 Hmm... wouldn't this help to ignore build failures, while current
 situation encourages solving them? :)

 I mean, unlike threading JIRA issues which is more convenient now, for build
 failures this would hide some info - thread title would indicate the oldest
 failure no.

 In spite of the above, if others still like to change in this way, I'll be
 fine with it.

 Doron

 On Sun, May 15, 2011 at 6:16 PM, Shai Erera ser...@gmail.com wrote:

 Well, Gmail ignores (for grouping) everything that in between brackets [].
 That's how we made all issue emails appear under the same thread, the status
 (Commented, Created, Resolved etc.) now appears in brackets.

 So, I think that if we put the build # in brackets, the rest of the
 message is the same for all failures. So instead of:

 [JENKINS] Lucene-Solr-tests-only-trunk - Build # 8042 - Still Failing

 we write

 [JENKINS] Lucene-Solr-tests-only-trunk - [Build # 8042] - Still Failing

 Or

 [JENKINS] [Build # 8042] Lucene-Solr-tests-only-trunk Failed

 Remove the word still altogether (it's redundant) and move the build
 number to the start of the subject.

 Shai

 On Sun, May 15, 2011 at 6:08 PM, Uwe Schindler u...@thetaphi.de wrote:

 It’s possible to change the header, as the mails are already customized.
 How should it look like (I don’t use f*g Gmail)



 -

 Uwe Schindler

 H.-H.-Meier-Allee 63, D-28213 Bremen

 http://www.thetaphi.de

 eMail: u...@thetaphi.de



 From: Shai Erera [mailto:ser...@gmail.com]
 Sent: Sunday, May 15, 2011 5:02 PM
 To: dev@lucene.apache.org
 Subject: Apache Jenkins emails



 Hi

 Is it possible to change the subject format of the emails Jenkins server
 sends? I was thinking, if we put the build # in [], all failures will be
 grouped under one thread (in Gmail). Since we have so many of them, it will
 at least collapse all of them into a single thread. We can still tell the
 failure of each email as well as the build #.

 What do you think?

 Shai



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir

2011-05-17 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034985#comment-13034985
 ] 

Yonik Seeley commented on LUCENE-3092:
--

bq. I can't think of any reason why you'd want to wrap another RAMDir with 
NRTCD?

Tests?  It's nice to have a test use a RAMDirectory for speed, but still follow 
the same code path as FSDirectory for debugging + orthogonality.
AFAIK, most Solr tests use RAMDirectory by default.  There's no benefit to 
restricting it, right?

 NRTCachingDirectory, to buffer small segments in a RAMDir
 -

 Key: LUCENE-3092
 URL: https://issues.apache.org/jira/browse/LUCENE-3092
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/store
Reporter: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch, 
 LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch


 I created this simple Directory impl, whose goal is to reduce IO
 contention in a frequent-reopen NRT use case.
 The idea is, when reopening quickly, but not indexing that much
 content, you wind up with many small files created over time, that can
 possibly stress the IO system, eg if merges and searching are also
 fighting for IO.
 So, NRTCachingDirectory puts these newly created files into a RAMDir,
 and only when they are merged into a too-large segment does it then
 write-through to the real (delegate) directory.
 This lets you spend some RAM to reduce IO.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir

2011-05-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034989#comment-13034989
 ] 

Michael McCandless commented on LUCENE-3092:


That's a great point Yonik -- in fact the TestNRTCachingDirectory already
relies on this generic-ness (pulls a newDirectory() from LuceneTestCase).

 NRTCachingDirectory, to buffer small segments in a RAMDir
 -

 Key: LUCENE-3092
 URL: https://issues.apache.org/jira/browse/LUCENE-3092
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/store
Reporter: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch, 
 LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch


 I created this simple Directory impl, whose goal is to reduce IO
 contention in a frequent-reopen NRT use case.
 The idea is, when reopening quickly, but not indexing that much
 content, you wind up with many small files created over time, that can
 possibly stress the IO system, eg if merges and searching are also
 fighting for IO.
 So, NRTCachingDirectory puts these newly created files into a RAMDir,
 and only when they are merged into a too-large segment does it then
 write-through to the real (delegate) directory.
 This lets you spend some RAM to reduce IO.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: Lucene/Solr JIRA

2011-05-17 Thread Steven A Rowe
On 5/17/2011 at 3:02 PM, Chris Hostetter wrote:
 If we were starting from scratch, i'd agree with you that having a single
 Jira project makes more sense, but given where we are today, i think we
 should probably keep them distinct -- partly from a pain of migration
 standpoint on our end, but also from a user expectations standpoint -- i
 think the Solr users/community as a whole is used to the existence of the
 SOLR project in Jira, and used to the SOLR-* issue naming convention, and
 it would likely be more confusing for *them* to change now.

+1




[jira] [Commented] (SOLR-2168) Velocity facet output for facet missing

2011-05-17 Thread Peter Wolanin (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034994#comment-13034994
 ] 

Peter Wolanin commented on SOLR-2168:
-

Did this change to the templates get committed to the actual Solr repo?

 Velocity facet output for facet missing
 ---

 Key: SOLR-2168
 URL: https://issues.apache.org/jira/browse/SOLR-2168
 Project: Solr
  Issue Type: Bug
  Components: Response Writers
Affects Versions: 3.1
Reporter: Peter Wolanin
Priority: Minor
 Attachments: SOLR-2168.patch


 If I add facet.missing to the facet params for a field, the Velocity output 
 has in the facet list:
 $facet.name (9220)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Apache Jenkins emails

2011-05-17 Thread Shai Erera
 Hmm... wouldn't this help to ignore build failures, while current situation 
 encourages solving them?

I don't think the current situation encourages resolving the issues any
more than grouping all the emails together would discourage it.

And I don't believe people will ignore a Jenkins failure thread, if
they don't ignore the separate emails today.

True, for those who ignore build failures - it will help them ignore
them more easily :)

Those who don't ignore will continue to monitor. And from what I can
tell, many failures are not due to code issues, but Jenkins server
issues.

Shai

On Tuesday, May 17, 2011, Doron Cohen cdor...@gmail.com wrote:
 Hmm... wouldn't this help to ignore build failures, while current situation 
 encourages solving them? :)

 I mean, unlike threading JIRA issues which is more convenient now, for build 
 failures this would hide some info - thread title would indicate the oldest 
 failure no.

 In spite of the above, if others still like to change in this way, I'll be 
 fine with it.

 Doron

 On Sun, May 15, 2011 at 6:16 PM, Shai Erera ser...@gmail.com wrote:
 Well, Gmail ignores (for grouping) everything that is in between brackets []. 
 That's how we made all issue emails appear under the same thread; the status 
 (Commented, Created, Resolved, etc.) now appears in brackets.

 So, I think that if we put the build # in brackets, the rest of the message 
 is the same for all failures. So instead of:

 [JENKINS] Lucene-Solr-tests-only-trunk - Build # 8042 - Still Failing

 we write

 [JENKINS] Lucene-Solr-tests-only-trunk - [Build # 8042] - Still Failing

 Or

 [JENKINS] [Build # 8042] Lucene-Solr-tests-only-trunk Failed

 Remove the word still altogether (it's redundant) and move the build number 
 to the start of the subject.

 Shai
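The grouping scheme Shai proposes can be sketched in a few lines. This is a hypothetical model (the class and method names are mine, and treating every bracketed token as invisible to conversation grouping is an assumption about Gmail's behavior, not documented fact): derive a grouping key by stripping all `[...]` tokens, so subjects that differ only in the bracketed build number collapse into one thread.

```java
public class SubjectGrouping {
    // Hypothetical grouping key: drop every [...] token, as the thread
    // suggests Gmail does for conversation grouping (an assumption, not
    // documented Gmail behavior), then normalize whitespace.
    static String groupKey(String subject) {
        return subject.replaceAll("\\[[^\\]]*\\]", "")
                      .replaceAll("\\s+", " ")
                      .trim();
    }

    public static void main(String[] args) {
        String s1 = "[JENKINS] Lucene-Solr-tests-only-trunk - [Build # 8042] - Failing";
        String s2 = "[JENKINS] Lucene-Solr-tests-only-trunk - [Build # 8043] - Failing";
        // Both failure subjects map to the same key, so they would thread together:
        System.out.println(groupKey(s1).equals(groupKey(s2))); // true
    }
}
```

Under this model the build number stays visible in each subject line while every failure of the same job shares one thread key.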


 On Sun, May 15, 2011 at 6:08 PM, Uwe Schindler u...@thetaphi.de wrote:
It’s possible to change the header, as the mails are already customized. What 
should it look like? (I don’t use f*g Gmail)

  -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de

 eMail: u...@thetaphi.de


 From: Shai Erera [mailto:ser...@gmail.com]
 Sent: Sunday, May 15, 2011 5:02 PM
 To: dev@lucene.apache.org
 Subject: Apache Jenkins emails

  Hi

 Is it possible to change the subject format of the emails Jenkins server 
 sends? I was thinking, if we put the build # in [], all failures will be 
 grouped under one thread (in Gmail). Since we have so many of them, it will 
 at least collapse all of them into a single thread. We can still tell the 
 failure of each email as well as the build #.

 What do you think?

 Shai





-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.

2011-05-17 Thread Fuad Efendi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034999#comment-13034999
 ] 

Fuad Efendi commented on LUCENE-2230:
-

I believe this issue should be closed due to the significant performance 
improvements related to LUCENE-2089 and LUCENE-2258.
I don't think there is any interest from the community in continuing with this 
naive approach (BK-Tree and Strike a Match), although some people found it 
useful. Of course, we might add a few more distance implementations as a separate 
improvement.

Please close it.


Thanks

 Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
 

 Key: LUCENE-2230
 URL: https://issues.apache.org/jira/browse/LUCENE-2230
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 3.0
 Environment: Lucene currently uses a brute-force full-terms scanner and 
 calculates the distance for each term. The new BKTree structure improves performance 
 on average 20 times when the distance is 1, and 3 times when it is 3. I 
 tested with an index of several million docs and 250,000 terms. 
 The new algo uses integer distances between objects.
Reporter: Fuad Efendi
 Attachments: BKTree.java, Distance.java, DistanceImpl.java, 
 FuzzyTermEnumNEW.java, FuzzyTermEnumNEW.java

   Original Estimate: 1m
  Remaining Estimate: 1m

 W. Burkhard and R. Keller. Some approaches to best-match file searching, 
 CACM, 1973
 http://portal.acm.org/citation.cfm?doid=362003.362025
 I was inspired by 
 http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick 
 Johnson, Google).
 Additionally, the simplified algorithm at 
 http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more 
 logically correct than Levenshtein distance, and it is 3-5 times faster 
 (isolated tests).
 Big list of distance implementations:
 http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm
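The BK-tree idea behind this issue can be sketched briefly. This is a minimal illustration, not the attached BKTree.java (class and method names are mine): each node keys its children by the integer distance to its own term, so a range query can prune whole subtrees via the triangle inequality instead of scanning every term.

```java
import java.util.*;

public class BKTreeSketch {
    // Classic dynamic-programming Levenshtein distance (any integer
    // metric satisfying the triangle inequality would work here).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1], cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            int[] t = prev; prev = cur; cur = t;
        }
        return prev[b.length()];
    }

    final String term;
    final Map<Integer, BKTreeSketch> children = new HashMap<>();

    BKTreeSketch(String term) { this.term = term; }

    void add(String t) {
        int d = levenshtein(t, term);
        BKTreeSketch child = children.get(d);
        if (child == null) children.put(d, new BKTreeSketch(t));
        else child.add(t);
    }

    // Collect all terms within maxDist of the query; only children whose
    // edge distance lies in [d - maxDist, d + maxDist] can hold matches.
    void search(String q, int maxDist, List<String> out) {
        int d = levenshtein(q, term);
        if (d <= maxDist) out.add(term);
        for (Map.Entry<Integer, BKTreeSketch> e : children.entrySet())
            if (Math.abs(e.getKey() - d) <= maxDist)
                e.getValue().search(q, maxDist, out);
    }

    public static void main(String[] args) {
        BKTreeSketch root = new BKTreeSketch("book");
        for (String s : new String[]{"books", "boo", "cake", "cape"}) root.add(s);
        List<String> hits = new ArrayList<>();
        root.search("bool", 1, hits);
        System.out.println(hits); // "book" and "boo"; "cake"/"cape" are pruned
    }
}
```

The pruning is where the claimed 3-20x speedup comes from: the smaller the allowed distance, the fewer child edges survive the `[d - maxDist, d + maxDist]` test.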

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Apache Jenkins emails

2011-05-17 Thread Marvin Humphrey
On Tue, May 17, 2011 at 03:09:31PM -0400, Michael McCandless wrote:
 Yeah I agree... build failures should be as annoying as possible ;)

Congratulations -- mission accomplished!  They are certainly annoying to me,
and probably to anyone else subscribed to dev who isn't a committer.

Marvin Humphrey


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3084) MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos

2011-05-17 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-3084:
--

Attachment: LUCENE-3084-trunk-only.patch

Now I improved SegmentInfos more:

- It now uses a Map/Set to enforce that the SegmentInfos contains each segment 
only once.
- Faster contains(), because it is Set-backed.

As said before: asList() and asSet() are unmodifiable, so consistency between 
the List and the Set/Map is enforced.

The Set is itself a Map<SI,Integer>; the values contain the index of the segment 
in the infos. This speeds up indexOf() calls, needed for asserts and 
remove(SI). As the indexes are no longer correct after remove or reorder 
operations, a separate boolean marks the Map as inconsistent; it is then 
regenerated on the next indexOf() call. indexOf() is called seldom, but the 
keySet() is still consistent, so delaying this update is fine.

All tests pass. I think the cleanup of SegmentInfos is ready to commit.
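The lazy-index pattern Uwe describes can be sketched as a standalone class. This is a simplified model, not the actual Lucene SegmentInfos code (the class name `IndexedList` and its methods are mine): a List keeps the order, a Map from element to position gives O(1) contains(), and mutations that shift positions only mark the map dirty so it is rebuilt on the next indexOf().

```java
import java.util.*;

public class IndexedList<E> {
    private final List<E> list = new ArrayList<>();
    private final Map<E, Integer> index = new HashMap<>();
    private boolean indexDirty = false;

    public boolean add(E e) {
        if (index.containsKey(e)) return false;   // each element only once
        index.put(e, list.size());
        list.add(e);
        return true;
    }

    // O(1) because the lookup is Map-backed; the keySet() stays correct
    // even while the stored positions are stale.
    public boolean contains(E e) { return index.containsKey(e); }

    public void remove(E e) {
        int i = indexOf(e);
        if (i < 0) return;
        list.remove(i);
        index.remove(e);
        indexDirty = true;  // positions after i have shifted; fix lazily
    }

    public int indexOf(E e) {
        if (indexDirty) {   // regenerate positions only when actually needed
            for (int i = 0; i < list.size(); i++) index.put(list.get(i), i);
            indexDirty = false;
        }
        Integer i = index.get(e);
        return i == null ? -1 : i;
    }

    public int size() { return list.size(); }
}
```

The design choice mirrors the comment above: since indexOf() is rare but contains() is hot, paying a full O(n) rebuild only on the next indexOf() after a remove/reorder is cheaper than keeping every stored position up to date eagerly.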

 MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos
 --

 Key: LUCENE-3084
 URL: https://issues.apache.org/jira/browse/LUCENE-3084
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3084-trunk-only.patch, 
 LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, 
 LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, 
 LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084.patch


 SegmentInfos carries a bunch of fields beyond the list of SIs, but for merging 
 purposes these fields are unused.
 We should cut over to List<SI> instead.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3113) fix analyzer bugs found by MockTokenizer

2011-05-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035007#comment-13035007
 ] 

Robert Muir commented on LUCENE-3113:
-

thanks for reviewing Steven, I agree! I've made this change and will commit 
shortly.

 fix analyzer bugs found by MockTokenizer
 

 Key: LUCENE-3113
 URL: https://issues.apache.org/jira/browse/LUCENE-3113
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/analysis
Reporter: Robert Muir
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3113.patch, LUCENE-3113.patch


 In LUCENE-3064, we beefed up MockTokenizer with assertions, and I've switched 
 over the analysis tests to use MockTokenizer for better coverage.
 However, this found a few bugs (one of which is LUCENE-3106):
 * incrementToken() called after it returns false in CommonGramsQueryFilter, 
 HyphenatedWordsFilter, ShingleFilter, SynonymFilter
 * missing end() implementation in PrefixAwareTokenFilter
 * double reset() in QueryAutoStopWordAnalyzer and ReusableAnalyzerBase
 * missing correctOffset() calls in MockTokenizer itself.
 I think it would be nice to just fix all the bugs on one issue... I've fixed 
 everything except Shingle and Synonym
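The first class of bug above, advancing a token stream after it has reported exhaustion, follows from the stream's state contract. Below is a simplified plain-Java model of that contract, not the real Lucene TokenStream API (the class `CheckedTokenStream` and its states are mine), showing the kind of check MockTokenizer's assertions add: once incrementToken() returns false, further calls are illegal, and end() may only run after exhaustion.

```java
import java.util.*;

public class CheckedTokenStream {
    private enum State { RESET, INCREMENTING, EXHAUSTED, ENDED }

    private final Iterator<String> tokens;
    private State state = State.RESET;
    public String current;

    public CheckedTokenStream(List<String> tokens) {
        this.tokens = tokens.iterator();
    }

    public boolean incrementToken() {
        if (state == State.EXHAUSTED || state == State.ENDED)
            throw new IllegalStateException(
                "incrementToken() after it already returned false");
        if (tokens.hasNext()) {
            current = tokens.next();
            state = State.INCREMENTING;
            return true;
        }
        state = State.EXHAUSTED;  // consumers must stop advancing here
        return false;
    }

    public void end() {
        if (state != State.EXHAUSTED)
            throw new IllegalStateException("end() before stream was exhausted");
        state = State.ENDED;      // final bookkeeping (e.g. end offsets) goes here
    }
}
```

A buggy filter like the ones listed, one that keeps pulling from its input after false, trips the IllegalStateException immediately instead of silently producing garbage, which is exactly how the MockTokenizer-based tests surfaced these bugs.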

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Apache Jenkins emails

2011-05-17 Thread Robert Muir
On Tue, May 17, 2011 at 3:38 PM, Marvin Humphrey mar...@rectangular.com wrote:
 On Tue, May 17, 2011 at 03:09:31PM -0400, Michael McCandless wrote:
 Yeah I agree... build failures should be as annoying as possible ;)

 Congratulations -- mission accomplished!  They are certainly annoying to me,
 and probably to anyone else subscribed to dev who isn't a committer.


Marvin, I'm not sure you can really assume that. If a test fails, anyone
who wants to contribute can look at the failure and try to create a JIRA
issue/patch; I don't think they need to be a committer.

Additionally, due to the nature of our tests, anyone who wants to
contribute to the project can simply download the tests and try to
find failures, opening JIRA issues for the ones they find (selckin,
for example, does this and has found a lot of good ones lately).

If you don't care about tests at all, you can easily filter this stuff
with your email client by looking for [JENKINS].

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2168) Velocity facet output for facet missing

2011-05-17 Thread Erik Hatcher (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035015#comment-13035015
 ] 

Erik Hatcher commented on SOLR-2168:


Alas not, Peter.  Sorry.

 Velocity facet output for facet missing
 ---

 Key: SOLR-2168
 URL: https://issues.apache.org/jira/browse/SOLR-2168
 Project: Solr
  Issue Type: Bug
  Components: Response Writers
Affects Versions: 3.1
Reporter: Peter Wolanin
Priority: Minor
 Attachments: SOLR-2168.patch


 If I add facet.missing to the facet params for a field, the Velocity output 
 has in the facet list:
 $facet.name (9220)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Issue Comment Edited] (SOLR-2168) Velocity facet output for facet missing

2011-05-17 Thread Erik Hatcher (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035015#comment-13035015
 ] 

Erik Hatcher edited comment on SOLR-2168 at 5/17/11 8:05 PM:
-

Alas not yet, Peter.  Sorry.

  was (Author: ehatcher):
Alas not, Peter.  Sorry.
  
 Velocity facet output for facet missing
 ---

 Key: SOLR-2168
 URL: https://issues.apache.org/jira/browse/SOLR-2168
 Project: Solr
  Issue Type: Bug
  Components: Response Writers
Affects Versions: 3.1
Reporter: Peter Wolanin
Priority: Minor
 Attachments: SOLR-2168.patch


 If I add facet.missing to the facet params for a field, the Velocity output 
 has in the facet list:
 $facet.name (9220)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


