Re: Whither ORP?

2010-09-13 Thread Dan Cardin
Hello,

I am new to ORP and would like to contribute to the project. I do not have much
experience in the fields of IR, crowdsourcing, or AI, but if someone could take
the lead and set a forward path, I would be willing to contribute my skill set
to ORP.

How can I help? I have a lot of experience doing software development and
system administration.

Cheers,
--Dan

On Mon, Sep 13, 2010 at 1:36 PM, Omar Alonso oralo...@yahoo.com wrote:

 I think ORP is a great candidate for crowdsourcing/human computation. In
 the last year or so there has been quite a bit of research and application in
 this area. See the page for the SIGIR workshop on using crowdsourcing for IR
 evaluation:
 http://www.ischool.utexas.edu/~cse2010/

 Omar

 --- On Mon, 9/13/10, Itamar Syn-Hershko ita...@code972.com wrote:

  From: Itamar Syn-Hershko ita...@code972.com
  Subject: Re: Whither ORP?
  To: openrelevance-dev@lucene.apache.org
  Date: Monday, September 13, 2010, 9:33 AM
  With the proper two-way open-source
  development process (taking and then giving) I think it can
  become an important part of open-IR technologies, just as
  Lucene did for the search-engine world. What ORP has to
  offer is of great interest to HebMorph, an open-source
  project of mine that is trying to decide on the best way to
  index and search Hebrew texts.
 
  To this end I decided to put some of the development
  efforts of the HebMorph project into making tools for the
  ORP. I have announced this before, but unfortunately I had
  to attend to more pressing tasks before I could complete
  this (and there was no response from the community
  anyway...). Just in case you're interested in seeing what I
  came up with so far: http://github.com/synhershko/Orev.
 
  IMHO, the ORP should stand by itself, and relate to
  Lucene/Solr only as its base framework for these initial
  stages. It should perhaps also try to attract more people who
  could find an interest in what it has to offer, so it can
  really start growing.
 
  Itamar.
 
  On 12/9/2010 1:29 PM, Grant Ingersoll wrote:
   On Sep 11, 2010, at 8:51 PM, Robert Muir wrote:
  
  
    I propose we take what we have and import it into lucene-java's benchmark
    contrib. It already has integration with Wikipedia and Reuters for perf
    purposes, and the quality package is actually there anyway. Later, maybe
    more people will have time and contrib/benchmark can evolve naturally...
    e.g. to modules/benchmark, with Solr support as a first big step.
  
   Yeah, that seems reasonable. I have been thinking lately that it might be
   useful to pull our DocMaker stuff out separately from benchmark so that
   people have easy ways of generating content from things like Wikipedia, etc.

   Still, at the end of the day, I like what ORP _could_ bring to the table,
   and to some extent I think that is lost by folding it into Lucene benchmark.
  
  
    On Sep 11, 2010 7:33 PM, Grant Ingersoll gsing...@apache.org wrote:

    Seems ORP isn't really catching on with people. I know personally I don't
    have the time I had hoped to have to get it going. At the same time, I
    really think it could be a good project. We've got some tools put together,
    but we still haven't done much about the bigger goal of a self-contained
    evaluation.

    Any thoughts on how we should proceed with ORP?

    -Grant
  
   -Grant
  
  
  
  






unsubscribe

2010-09-13 Thread 朱蓝天
unsubscribe


NRT SOLR-1617 on branch_3x

2010-09-13 Thread Peter Sturge
Hi,

Starting a new thread for this question, as it's a little separate
from the 3.x cache warming thread:
http://www.lucidimagination.com/search/document/fdc19610bdf6e179/tuning_solr_caches_with_high_commit_rates_nrt#adf9ee007ce18ba6

The new fcs/per-segment faceting that's in the 4.x trunk looks like a
hugely useful feature.

Would it be possible to get/port this for branch_3x?
I'm happy to help any way I can in porting this cool new feature.

Many Thanks,
Peter

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2642) merge LuceneTestCase and LuceneTestCaseJ4

2010-09-13 Thread Robert Muir (JIRA)
merge LuceneTestCase and LuceneTestCaseJ4
-

 Key: LUCENE-2642
 URL: https://issues.apache.org/jira/browse/LUCENE-2642
 Project: Lucene - Java
  Issue Type: Test
  Components: Tests
Reporter: Robert Muir
 Fix For: 3.1, 4.0
 Attachments: LUCENE-2642.patch

We added JUnit 4 support, but as a separate test class.

So unfortunately, we have two separate base classes to maintain: LuceneTestCase 
and LuceneTestCaseJ4.
This creates a mess and is difficult to manage.

Instead, I propose a single base test class that works in both JUnit 3 and 
JUnit 4 style.

I modified our LuceneTestCaseJ4 in the following way:
* The methods to run are not limited to the ones annotated with @Test; any 
void no-arg method whose name starts with test also runs, as in JUnit 3. This 
means you don't have to sprinkle @Test everywhere.
* Of course, @Ignore works as expected everywhere.
* LuceneTestCaseJ4 extends TestCase, so you don't have to import static Assert.* 
to get all the asserts.

For most tests, no changes are required, but a few very minor things had to be 
changed:
* setUp() and tearDown() must be public, not protected.
* Useless ctors must be removed, such as TestFoo(String name) { super(name); }
* LocalizedTestCase is gone; instead of
{code}
public class TestQueryParser extends LocalizedTestCase {
{code}
it is now
{code}
@RunWith(LuceneTestCase.LocalizedTestCaseRunner.class)
public class TestQueryParser extends LuceneTestCase {
{code}
* Same with MultiCodecTestCase: (LuceneTestCase.MultiCodecTestCaseRunner.class)

I only did the core tests in the patch as a start, and I just made an empty 
LuceneTestCase extends LuceneTestCaseJ4.

I'd like to do contrib and Solr next, and rename LuceneTestCaseJ4 so there is 
only a single class: LuceneTestCase.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-2642) merge LuceneTestCase and LuceneTestCaseJ4

2010-09-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908733#action_12908733
 ] 

Uwe Schindler commented on LUCENE-2642:
---

Why not simply extend the Assert abstract class? This would remove use of the 
deprecated old JUnit 3 framework completely.





[jira] Commented: (LUCENE-2642) merge LuceneTestCase and LuceneTestCaseJ4

2010-09-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908735#action_12908735
 ] 

Robert Muir commented on LUCENE-2642:
-

bq. Why not simply extend the Assert abstract class? This would remove use of 
the deprecated old JUnit 3 framework completely.

I would like to do this under a different issue.

We cannot do it until all assertEquals(float, float) calls are changed to use 
epsilons, for example.

By extending Assert, we can catch all the places where we don't use epsilons 
and fix them, which is a great improvement, but out of scope for this issue.
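To make the epsilon point concrete, here is a small illustrative sketch (not from the patch; EPSILON is an arbitrary example value) of why float assertions need a delta once JUnit 4's Assert is in play: accumulated float arithmetic is rarely bit-exact.

```java
// Illustrative sketch: exact float equality is fragile, which is why
// JUnit 4 deprecates assertEquals(float, float) in favor of a variant
// taking an explicit delta.
public class EpsilonDemo {
    static final float EPSILON = 1e-6f;

    static boolean floatEquals(float expected, float actual) {
        return Math.abs(expected - actual) <= EPSILON;
    }

    public static void main(String[] args) {
        float sum = 0f;
        for (int i = 0; i < 10; i++) {
            sum += 0.1f;   // accumulates rounding error
        }
        System.out.println(sum == 1.0f);            // exact comparison: false
        System.out.println(floatEquals(1.0f, sum)); // epsilon comparison: true
    }
}
```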





[jira] Issue Comment Edited: (LUCENE-2642) merge LuceneTestCase and LuceneTestCaseJ4

2010-09-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908733#action_12908733
 ] 

Uwe Schindler edited comment on LUCENE-2642 at 9/13/10 8:02 AM:


Why not simply extend the org.junit.Assert class? This would remove use of the 
deprecated old JUnit 3 framework completely.

  was (Author: thetaphi):
Why not simply extend the Assert abstract class? This would remove use of 
deprecated old JUnit3 Framework completely?
  




Re: exceptions from solr/contrib/dataimporthandler and solr/contrib/extraction

2010-09-13 Thread Grant Ingersoll

On Sep 12, 2010, at 2:47 PM, Yonik Seeley wrote:

 On Sun, Sep 12, 2010 at 2:35 PM, Grant Ingersoll gsing...@apache.org wrote:
 contrib/extraction prints an 'unknown field a' exception
 
 The printing of this could be commented out, possibly.  The exception being 
 thrown is proper operation, as the field a does not exist.
 
 Yeah, but the bigger question is if  field a should have been
 generated given the specific extraction configuration.
 If it's explicitly testing for an exception, then we can ignore it
 (add it to the ignores list)... otherwise it's either a test bug or a

Yes, it should be added to the ignores list.



[jira] Commented: (LUCENE-2642) merge LuceneTestCase and LuceneTestCaseJ4

2010-09-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908739#action_12908739
 ] 

Robert Muir commented on LUCENE-2642:
-

bq. I am just afraid of extending from the old JUnit TestCase.

We already extend this! Have you looked at LuceneTestCase lately?

bq. So extend Assert and then add the missing static methods for compatibility.

Please, I would like to keep the epsilon stuff out of this issue. All tests 
pass the way it is now; there is no problem.

We can fix epsilons in a follow-up issue, and then use no JUnit 3 code at all. 
As I said, it's a great improvement, but not necessary to mix in with this change.

By doing both at once, if something goes wrong it will be more difficult to 
debug. Let's keep the scope under control.





Re: exceptions from solr/contrib/dataimporthandler and solr/contrib/extraction

2010-09-13 Thread Robert Muir
On Mon, Sep 13, 2010 at 8:05 AM, Grant Ingersoll gsing...@apache.org wrote:


 Yes, it should be added to the ignores list.


Thanks Grant!

-- 
Robert Muir
rcm...@gmail.com


[jira] Updated: (LUCENE-2642) merge LuceneTestCase and LuceneTestCaseJ4

2010-09-13 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2642:
--

Attachment: LUCENE-2642-extendAssert.patch

Here is the patch: LuceneTestCaseJ4 now only extends Assert, without importing 
extra crap. The double/float API of old JUnit 3 is emulated using static 
overrides. After applying the patch, do an ant clean, or else you will get 
linking errors.





Re: IndexReader Cache - a different angle

2010-09-13 Thread Tim Smith
 I created https://issues.apache.org/jira/browse/LUCENE-2345 some time ago, 
proposing pretty much what seems to be discussed here.



 -- Tim

On 09/12/10 10:18, Simon Willnauer wrote:

On Sun, Sep 12, 2010 at 11:46 AM, Michael McCandless
luc...@mikemccandless.com  wrote:

Having hooks to enable an app to manage its own external, private
stuff associated w/ each segment reader would be useful and it's been
asked for in the past.  However, since we've now opened up
SegmentReader, SegmentInfo/s, etc., in recent releases, can't an app
already do this w/o core API changes?

The visitor approach would simply be a little more than syntactic
sugar where only new SubReader instances are passed to the callback.
You can do the same with the already existing API like
gatherSubReaders or getSequentialSubReaders. Every API I was talking
about would just be simplification anyway and would be possible to
build without changing the core.

I know Earwin has built a whole system like this on top of Lucene --
Earwin how did you do that...?  Did you make core changes to
Lucene...?

A custom Codec should be an excellent way to handle the specific use
case (caching certain postings) -- by doing it as a Codec, any time
anything in Lucene needs to tap into that posting (query scorers,
filters, merging, applying deletes, etc), it hits this cache.  You
could model it like PulsingCodec, which wraps any other Codec but
handles the low-freq ones itself.  If you do it externally how would
core use of postings hit it?  (Or was that not the intention?)

I don't understand the filter use-case... the CachingWrapperFilter
already caches per-segment, so that reopen is efficient?  How would an
external cache (built on these hooks) be different?

Man you are right - never mind :)

simon

For faster filters we have to apply them like we do deleted docs if
the filter is random access (such as being cached), LUCENE-1536 --
flex actually makes this relatively easy now, since the postings API
no longer implicitly filters deleted docs (ie you provide your own
skipDocs) -- but these hooks won't fix that right?

Mike

On Sun, Sep 12, 2010 at 3:43 AM, Simon Willnauer
simon.willna...@googlemail.com  wrote:

Hey Shai,

On Sun, Sep 12, 2010 at 6:51 AM, Shai Ereraser...@gmail.com  wrote:

Hey Simon,

You're right that the application can develop a Caching mechanism outside
Lucene, and when reopen() is called, if it changed, iterate on the
sub-readers and init the Cache w/ the new ones.

Alright, then we are on the same track I guess!


However, by building something like that inside Lucene, the application will
get more native support, and thus better performance, in some cases. For
example, consider a field fileType with 10 possible values, and for the sake
of simplicity, let's say that the index is divided evenly across them. Your
users always add such a term constraint to the query (e.g. they want to get
results of fileType:pdf or fileType:odt, and perhaps sometimes both, but not
others). You have basically two ways of supporting this:
(1) Add such a term to the query / clause to a BooleanQuery w/ an AND
relation -- cons is that this term / posting is read for every query.

Oh, I wasn't saying that a cache framework would be obsolete and
shouldn't be part of Lucene. My intention was rather to generalize
this functionality so that we can make the API change more easily,
and at the same time bring the infrastructure you are proposing into
place.

Regarding your example above, filters are a very good example where
something like that could help improve performance, and we should
provide it with Lucene core, but I would again prefer the least
intrusive way to do so. If we can make that happen without adding any
cache-agnostic API, we should do it. We really should try to sketch out
a simple API which gives us access to the opened SegmentReaders and see
if that would be sufficient for our use cases. Specialization will
always be possible, but I doubt that it is needed.

(2) Write a Filter which works at the top IR level, that is refreshed
whenever the index is refreshed. This is better than (1), however has some
disadvantages:

(2.1) As Mike already proved (on some issue which I don't remember its
subject/number at the moment), if we could get Filter down to the lower
level components of Lucene's search, so e.g. it is used as the deleted docs
OBS, we can get better performance w/ Filters.

(2.2) The Filter is refreshed for the entire IR, and not just the changed
segments. The reason is that, outside Collector, you have no way of telling
IndexSearcher to use Filter F1 for segment S1 and F2 for segment S2.
Loading/refreshing the Filter may be expensive, and definitely won't perform
well w/ NRT, where by definition you'd like to get small changes searchable
very fast.
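The per-segment refresh being asked for here can be sketched generically (plain Java, hypothetical names, no Lucene types): entries are keyed by a segment's identity, so after a reopen only segments not seen before are recomputed, while unchanged segments reuse their cached value.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative sketch of the per-segment caching idea under discussion.
// "segmentKey" stands in for whatever identifies a segment reader's core.
public class PerSegmentCache<V> {
    private final Map<Object, V> cache = new ConcurrentHashMap<Object, V>();
    private final Function<Object, V> compute;

    public PerSegmentCache(Function<Object, V> compute) {
        this.compute = compute;
    }

    // Returns the cached value for this segment, computing it at most once.
    public V get(Object segmentKey) {
        return cache.computeIfAbsent(segmentKey, compute);
    }

    // Called when a segment goes away (e.g. after a merge), so the cache
    // does not pin memory for readers that were closed.
    public void evict(Object segmentKey) {
        cache.remove(segmentKey);
    }

    public int size() {
        return cache.size();
    }
}
```

A real hook-based design would wire evict() to reader close events rather than relying on manual calls.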

No doubt you are right about the above. A
PerSegmentCachingFilterWrapper would be something we could easily do at
the application level with the infrastructure we are talking
about in place. While I don't exactly 

[jira] Updated: (LUCENE-2642) merge LuceneTestCase and LuceneTestCaseJ4

2010-09-13 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2642:


Attachment: LUCENE-2642.patch

Updated patch, with all of Lucene/Solr and including Uwe's stuff.

All tests pass.





Re: IndexReader Cache - a different angle

2010-09-13 Thread Danil ŢORIN
I'd second that

In my use case we need to search, sometimes with sort, on a pretty big index...

So in the worst-case scenario we get an OOM while loading the FieldCache, as it
tries to create a huge array.
You can increase -Xmx or go to a bigger host, but in the end there WILL be
an index big enough to crash you.

My idea would be to use something like EhCache, with a few elements in
memory and overflow to disk, so that if there are few unique terms it
would be almost as fast as an array.
Otherwise, in Collector/Sort/SortField/FieldComparator, I would hit the
EhCache on disk (yes, it would be a huge performance hit), but I won't
get OOMs and the results will STILL be sorted.

Right now SegmentReader/FieldCacheImpl are pretty much hardcoded to
FieldCache.DEFAULT.

And yes, I'm on 3.x...
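The "few elements in memory" half of this idea can be sketched without EhCache (hypothetical class; a real setup would configure EhCache's disk store so evicted entries persist instead of being dropped): a size-bounded LRU map that evicts its least-recently-used entry once a limit is reached.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a bounded in-memory cache. A disk-backed variant
// would write the evicted entry out in removeEldestEntry instead of
// discarding it.
public class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public BoundedLruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder=true gives LRU iteration order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Hook point: an overflow-to-disk implementation would persist
        // 'eldest' here before returning true.
        return size() > maxEntries;
    }
}
```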


On Mon, Sep 13, 2010 at 16:05, Tim Smith tsm...@attivio.com wrote:
  i created https://issues.apache.org/jira/browse/LUCENE-2345 some time ago
 proposing pretty much what seems to be discussed here


  -- Tim

 On 09/12/10 10:18, Simon Willnauer wrote:

 On Sun, Sep 12, 2010 at 11:46 AM, Michael McCandless
 luc...@mikemccandless.com  wrote:

 Having hooks to enable an app to manage its own external, private
 stuff associated w/ each segment reader would be useful and it's been
 asked for in the past.  However, since we've now opened up
 SegmentReader, SegmentInfo/s, etc., in recent releases, can't an app
 already do this w/o core API changes?

 The visitor approach would simply be a little more than syntactic
 sugar where only new SubReader instances are passed to the callback.
 You can do the same with the already existing API like
 gatherSubReaders or getSequentialSubReaders. Every API I was talking
 about would just be simplification anyway and would be possible to
 build without changing the core.

 I know Earwin has built a whole system like this on top of Lucene --
 Earwin how did you do that...?  Did you make core changes to
 Lucene...?

 A custom Codec should be an excellent way to handle the specific use
 cache (caching certain postings) -- by doing it as a Codec, any time
 anything in Lucene needs to tap into that posting (query scorers,
 filters, merging, applying deletes, etc), it hits this cache.  You
 could model it like PulsingCodec, which wraps any other Codec but
 handles the low-freq ones itself.  If you do it externally how would
 core use of postings hit it?  (Or was that not the intention?)

 I don't understand the filter use-case... the CachingWrapperFilter
 already caches per-segment, so that reopen is efficient?  How would an
 external cache (built on these hooks) be different?

 Man you are right - never mind :)

 simon

 For faster filters we have to apply them like we do deleted docs if
 the filter is random access (such as being cached), LUCENE-1536 --
 flex actually makes this relatively easy now, since the postings API
 no longer implicitly filters deleted docs (ie you provide your own
 skipDocs) -- but these hooks won't fix that right?

 Mike

 On Sun, Sep 12, 2010 at 3:43 AM, Simon Willnauer
 simon.willna...@googlemail.com  wrote:

 Hey Shai,

 On Sun, Sep 12, 2010 at 6:51 AM, Shai Ereraser...@gmail.com  wrote:

 Hey Simon,

 You're right that the application can develop a Caching mechanism
 outside
 Lucene, and when reopen() is called, if it changed, iterate on the
 sub-readers and init the Cache w/ the new ones.

 Alright, then we are on the same track I guess!

 However, by building something like that inside Lucene, the application
 will
 get more native support, and thus better performance, in some cases.
 For
 example, consider a field fileType with 10 possible values, and for the
 sake
 of simplicity, let's say that the index is divided evenly across them.
 Your
 users always add such a term constraint to the query (e.g. they want to
 get
 results of fileType:pdf or fileType:odt, and perhaps sometimes both,
 but not
 others). You have basically two ways of supporting this:
 (1) Add such a term to the query / clause to a BooleanQuery w/ an AND
 relation -- cons is that this term / posting is read for every query.

 Oh I wasn't saying that a cache framework would be obsolete and
 shouldn't be part of lucene. My intention was rather to generalize
 this functionality so that we can make the API change more easily and
 at the same time bringing the infrastructure you are proposing in
 place.

 Regarding your example above, filters are a very good example where
 something like that could help to improve performance and we should
 provide it with lucene core but I would again prefer the least
 intrusive way to do so. If we can make that happen without adding any
 cache agnostic API we should do it. We really should try to sketch out
 a simple API which gives us access to the opened segReaders and see if
 that would be sufficient for our usecases. Specialization will always
 be possible but I doubt that it is needed.

 (2) Write a Filter which works at the top IR level, that is refreshed
 whenever the index is 

Re: IndexReader Cache - a different angle

2010-09-13 Thread Danil ŢORIN
And it would be nice to have hooks in lucene and avoid managing refs
to indexReader on reopen() and close() by myself.

Oh...and to complicate things, my index is near-realtime using
IndexWriter.getReader(), so it's not just IndexReader we need to
change, but also IndexWriter should provide a reader that has proper
FieldCache implementation.

And I'm a bit uncomfortable to dig that deep :)

On Mon, Sep 13, 2010 at 17:51, Danil ŢORIN torin...@gmail.com wrote:
 I'd second that

 In my usecase we need to search, sometimes with sort, on pretty big index...

 So in the worst-case scenario we get an OOM while loading the FieldCache, as it
 tries to create a huge array.
 You can increase -Xmx, go to bigger host, but in the end there WILL be
 an index big enough to crash you.

 My idea would be to use something like EhCache with a few elements in
 memory and overflow to disk, so that if there are few unique terms, it
 would be almost as fast as an array.
 Otherwise in Collector/Sort/SortField/FieldComparator I would hit the
 EhCache on disk (yes it would be a huge performance hit) but I won't
 get OOMs and the results STILL will be sorted.

 Right now SegmentReader/FieldCacheImpl are pretty hardcoded on
 FieldCache.DEFAULT

 And yes, I'm on 3.x...
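A minimal sketch of that two-tier idea in plain Java, with simple collections standing in for EhCache's memory and disk stores (all class and method names here are made up for illustration; this is not Lucene or EhCache API): a small access-ordered LRU kept in memory, with evicted entries spilling to a second, slower tier instead of being lost.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of a two-tier field-value cache: a bounded LRU kept in memory,
 * with evicted entries spilling to a slower "disk" tier (here just a
 * HashMap standing in for an on-disk store). Hypothetical names only.
 */
class OverflowCache<K, V> {
  private final Map<K, V> diskTier = new HashMap<>(); // stand-in for disk
  private final Map<K, V> memoryTier;

  OverflowCache(final int maxInMemory) {
    // access-order LinkedHashMap gives us LRU eviction for free
    this.memoryTier = new LinkedHashMap<K, V>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        if (size() > maxInMemory) {
          diskTier.put(eldest.getKey(), eldest.getValue()); // spill, don't lose
          return true;
        }
        return false;
      }
    };
  }

  public void put(K key, V value) {
    memoryTier.put(key, value);
  }

  public V get(K key) {
    V v = memoryTier.get(key);
    if (v == null) {
      v = diskTier.get(key);   // slow path: the "huge performance hit"
      if (v != null) {
        memoryTier.put(key, v); // promote back into memory
      }
    }
    return v;
  }

  public int inMemoryCount() { return memoryTier.size(); }
}
```

With few unique terms everything stays in the fast tier; with many, lookups still succeed (sorted results, no OOM), just slower.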


 On Mon, Sep 13, 2010 at 16:05, Tim Smith tsm...@attivio.com wrote:
  I created https://issues.apache.org/jira/browse/LUCENE-2345 some time ago
 proposing pretty much what seems to be discussed here


  -- Tim

 On 09/12/10 10:18, Simon Willnauer wrote:

 On Sun, Sep 12, 2010 at 11:46 AM, Michael McCandless
 luc...@mikemccandless.com  wrote:

 Having hooks to enable an app to manage its own external, private
 stuff associated w/ each segment reader would be useful and it's been
 asked for in the past.  However, since we've now opened up
 SegmentReader, SegmentInfo/s, etc., in recent releases, can't an app
 already do this w/o core API changes?

 The visitor approach would simply be little more than syntactic
 sugar where only new SubReader instances are passed to the callback.
 You can do the same with the already existing API like
 gatherSubReaders or getSequentialSubReaders. Every API I was talking
 about would just be simplification anyway and would be possible to
 build without changing the core.

 I know Earwin has built a whole system like this on top of Lucene --
 Earwin how did you do that...?  Did you make core changes to
 Lucene...?

 A custom Codec should be an excellent way to handle the specific use
 case (caching certain postings) -- by doing it as a Codec, any time
 anything in Lucene needs to tap into that posting (query scorers,
 filters, merging, applying deletes, etc), it hits this cache.  You
 could model it like PulsingCodec, which wraps any other Codec but
 handles the low-freq ones itself.  If you do it externally how would
 core use of postings hit it?  (Or was that not the intention?)

 I don't understand the filter use-case... the CachingWrapperFilter
 already caches per-segment, so that reopen is efficient?  How would an
 external cache (built on these hooks) be different?

 Man you are right - never mind :)

 simon

 For faster filters we have to apply them like we do deleted docs if
 the filter is random access (such as being cached), LUCENE-1536 --
 flex actually makes this relatively easy now, since the postings API
 no longer implicitly filters deleted docs (ie you provide your own
 skipDocs) -- but these hooks won't fix that right?

 Mike

 On Sun, Sep 12, 2010 at 3:43 AM, Simon Willnauer
 simon.willna...@googlemail.com  wrote:

 Hey Shai,

 On Sun, Sep 12, 2010 at 6:51 AM, Shai Erera ser...@gmail.com  wrote:

 Hey Simon,

 You're right that the application can develop a Caching mechanism
 outside
 Lucene, and when reopen() is called, if it changed, iterate on the
 sub-readers and init the Cache w/ the new ones.

 Alright, then we are on the same track I guess!

 However, by building something like that inside Lucene, the application
 will
 get more native support, and thus better performance, in some cases.
 For
 example, consider a field fileType with 10 possible values, and for the
 sake
 of simplicity, let's say that the index is divided evenly across them.
 Your
 users always add such a term constraint to the query (e.g. they want to
 get
 results of fileType:pdf or fileType:odt, and perhaps sometimes both,
 but not
 others). You have basically two ways of supporting this:
 (1) Add such a term to the query / clause to a BooleanQuery w/ an AND
 relation -- cons is that this term / posting is read for every query.

 Oh I wasn't saying that a cache framework would be obsolete and
 shouldn't be part of lucene. My intention was rather to generalize
 this functionality so that we can make the API change more easily and
 at the same time bringing the infrastructure you are proposing in
 place.

 Regarding your example above, filters are a very good example where
 something like that could help to improve performance and we should
 provide it 

[jira] Commented: (SOLR-2112) Solrj should support streaming response

2010-09-13 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908838#action_12908838
 ] 

Ryan McKinley commented on SOLR-2112:
-

I would like to commit this soon (just to /trunk) unless there are objections

 Solrj should support streaming response
 ---

 Key: SOLR-2112
 URL: https://issues.apache.org/jira/browse/SOLR-2112
 Project: Solr
  Issue Type: New Feature
  Components: clients - java
Reporter: Ryan McKinley
 Fix For: 4.0

 Attachments: SOLR-2112-StreamingSolrj.patch, 
 SOLR-2112-StreamingSolrj.patch


 The solrj API should optionally support streaming documents.
 Rather than putting all results into a SolrDocumentList, solrj should be able 
 to call a callback function after each document is parsed.  This would allow 
 someone to call query.setRows( Integer.MAX_VALUE ) and get each result to the 
 client without loading them all into memory.
 For starters, I think the important things to stream are SolrDocuments, but 
 down the road, this could also stream other things (consider reading all 
 terms from the index)
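The callback idea could look roughly like this (hypothetical names, not the actual SolrJ API): the parser hands each document to a callback as soon as it is decoded, instead of accumulating everything in a list, so client memory stays constant in the number of results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Sketch of streaming response handling. DocCallback and parseResponse
 * are made-up names for illustration; SolrJ's real types may differ.
 */
class StreamingSketch {

  /** Callback invoked once per parsed document. */
  public interface DocCallback {
    void streamDoc(Map<String, Object> doc);
  }

  /** Stand-in for decoding a wire response: one doc per record. */
  public static void parseResponse(List<Map<String, Object>> wireDocs,
                                   DocCallback callback) {
    for (Map<String, Object> doc : wireDocs) {
      // each doc can be processed and dropped immediately:
      // nothing accumulates on the client side
      callback.streamDoc(doc);
    }
  }

  public static void main(String[] args) {
    List<Map<String, Object>> docs = new ArrayList<>();
    docs.add(Map.of("id", "1"));
    docs.add(Map.of("id", "2"));
    List<String> seen = new ArrayList<>();
    parseResponse(docs, d -> seen.add((String) d.get("id")));
    System.out.println(seen); // [1, 2]
  }
}
```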

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: exceptions from solr/contrib/dataimporthandler and solr/contrib/extraction

2010-09-13 Thread Yonik Seeley
On Sun, Sep 12, 2010 at 8:55 PM, Lance Norskog goks...@gmail.com wrote:
 And this is why unit tests shouldn't spew: just yes or no, please. If you
 want to patch this, please comment out all the printed trash, since all it
 does is cause mail threads like this.

Also, people shouldn't put bugs in their code :-P

-Yonik




[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-13 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908849#action_12908849
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

One thing I noticed, correct me if I'm wrong, is the term doc
frequency (the one stored per term, ie, TermsEnum.docFreq)
doesn't seem to be currently recorded in the ram buffer code
tree. It will be easy to add, though if we make it accurate per
RAM index reader then we could be allocating a unique array, the
length of the number of terms, per reader. I'll implement it
this way to start and we can change it later if necessary.
Actually, to save RAM this could be another use case where a 2
dimensional copy-on-write array is practical.
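The copy-on-write idea mentioned above could be sketched like this in plain Java (hypothetical code, not the realtime-branch implementation): the outer array is cloned cheaply per view, and an inner page is only copied the first time that view writes to it, so unmodified pages stay shared across readers.

```java
/**
 * Sketch of a two-dimensional copy-on-write int array. A snapshot
 * shares every inner page with its parent until it writes to one,
 * at which point only that page is duplicated. Illustration only.
 */
class CowIntArray {
  private final int[][] pages;   // outer table, cheap to clone
  private final boolean[] owned; // which pages this view has copied

  CowIntArray(int numPages, int pageSize) {
    pages = new int[numPages][];
    owned = new boolean[numPages];
    for (int i = 0; i < numPages; i++) {
      pages[i] = new int[pageSize];
      owned[i] = true; // the creator owns all its pages
    }
  }

  private CowIntArray(int[][] shared) {
    pages = shared.clone(); // copies only the outer table, not the pages
    owned = new boolean[shared.length];
  }

  /** A new view sharing every page with this one until written. */
  public CowIntArray snapshot() { return new CowIntArray(pages); }

  public int get(int page, int slot) { return pages[page][slot]; }

  public void set(int page, int slot, int value) {
    if (!owned[page]) {              // copy-on-write: first write copies
      pages[page] = pages[page].clone();
      owned[page] = true;
    }
    pages[page][slot] = value;
  }
}
```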

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
  concept; it seems like it'd be easier to implement a seekable,
  random-access-file-like API. One would seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?
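A sketch of that seekable API over paged byte arrays (hypothetical code, not Lucene's block pool classes): a logical position splits into a block index and an offset, so seek/read/write hide the underlying byte[][] management entirely.

```java
/**
 * Sketch of a seekable, random-access byte store over paged arrays.
 * The caller sees only seek/readByte/writeByte; blocks are allocated
 * and the block table grown lazily behind the API. Illustration only.
 */
class PagedBytes {
  private static final int BLOCK_SHIFT = 4;           // 16-byte blocks, small for demo
  private static final int BLOCK_SIZE = 1 << BLOCK_SHIFT;
  private static final int BLOCK_MASK = BLOCK_SIZE - 1;

  private byte[][] blocks = new byte[4][];
  private long pos; // current logical position

  public void seek(long position) { this.pos = position; }

  public void writeByte(byte b) {
    int block = (int) (pos >>> BLOCK_SHIFT);
    if (block >= blocks.length) {                     // grow the block table
      byte[][] next = new byte[Math.max(block + 1, blocks.length * 2)][];
      System.arraycopy(blocks, 0, next, 0, blocks.length);
      blocks = next;
    }
    if (blocks[block] == null) blocks[block] = new byte[BLOCK_SIZE];
    blocks[block][(int) (pos & BLOCK_MASK)] = b;
    pos++;
  }

  public byte readByte() {
    byte b = blocks[(int) (pos >>> BLOCK_SHIFT)][(int) (pos & BLOCK_MASK)];
    pos++;
    return b;
  }
}
```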




[jira] Commented: (LUCENE-2642) merge LuceneTestCase and LuceneTestCaseJ4

2010-09-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908856#action_12908856
 ] 

Michael McCandless commented on LUCENE-2642:


This is great Robert!  Patch works for me (except for a bizarre hang in Solr's 
TestSolrCoreProperties, apparently only on my machine; that's pre-existing).

 merge LuceneTestCase and LuceneTestCaseJ4
 -

 Key: LUCENE-2642
 URL: https://issues.apache.org/jira/browse/LUCENE-2642
 Project: Lucene - Java
  Issue Type: Test
  Components: Tests
Reporter: Robert Muir
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2642-extendAssert.patch, LUCENE-2642.patch, 
 LUCENE-2642.patch


 We added Junit4 support, but as a separate test class.
 So unfortunately, we have two separate base classes to maintain: 
 LuceneTestCase and LuceneTestCaseJ4.
 This creates a mess and is difficult to manage.
 Instead, I propose a single base test class that works both junit3 and junit4 
 style.
 I modified our LuceneTestCaseJ4 in the following way:
 * the methods to run are not limited to the ones annotated with @Test, but 
 also any void no-arg methods that start with test, like junit3. this means 
 you dont have to sprinkle @Test everywhere.
 * of course, @Ignore works as expected everywhere.
 * LuceneTestCaseJ4 extends TestCase so you dont have to import static 
 Assert.* to get all the asserts.
 for most tests, no changes are required. but a few very minor things had to 
 be changed:
 * setUp() and tearDown() must be public, not protected.
 * useless ctors must be removed, such as TestFoo(String name) { super(name); }
 * LocalizedTestCase is gone, instead of
 {code}
 public class TestQueryParser extends LocalizedTestCase {
 {code}
 it is now
 {code}
 @RunWith(LuceneTestCase.LocalizedTestCaseRunner.class)
 public class TestQueryParser extends LuceneTestCase {
 {code}
 * Same with MultiCodecTestCase: 
 (LuceneTestCase.MultiCodecTestCaseRunner.class)
 I only did the core tests in the patch as a start, and i just made an empty 
 LuceneTestCase extends LuceneTestCaseJ4.
 I'd like to do contrib and solr and rename this LuceneTestCaseJ4 to only a 
 single class: LuceneTestCase.




[jira] Commented: (LUCENE-2642) merge LuceneTestCase and LuceneTestCaseJ4

2010-09-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908861#action_12908861
 ] 

Uwe Schindler commented on LUCENE-2642:
---

Looks good, this is a really good step forwards. We can write old-style tests, 
but use JUnit4 and can optionally add the @BeforeClass and so on :-)

 merge LuceneTestCase and LuceneTestCaseJ4
 -

 Key: LUCENE-2642
 URL: https://issues.apache.org/jira/browse/LUCENE-2642
 Project: Lucene - Java
  Issue Type: Test
  Components: Tests
Reporter: Robert Muir
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2642-extendAssert.patch, LUCENE-2642.patch, 
 LUCENE-2642.patch






[jira] Commented: (LUCENE-2642) merge LuceneTestCase and LuceneTestCaseJ4

2010-09-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908866#action_12908866
 ] 

Robert Muir commented on LUCENE-2642:
-

bq. We can write old-style tests, but use JUnit4 and can optionally add the 
@BeforeClass and so on 

Yeah i've never understood why Junit4 requires all these static imports and 
annotations... i just care about @BeforeClass!


 merge LuceneTestCase and LuceneTestCaseJ4
 -

 Key: LUCENE-2642
 URL: https://issues.apache.org/jira/browse/LUCENE-2642
 Project: Lucene - Java
  Issue Type: Test
  Components: Tests
Reporter: Robert Muir
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2642-extendAssert.patch, LUCENE-2642.patch, 
 LUCENE-2642.patch






ant clean required

2010-09-13 Thread Robert Muir
heads up: due to https://issues.apache.org/jira/browse/LUCENE-2642 you will
need to run 'ant clean' after you svn up.

-- 
Robert Muir
rcm...@gmail.com


Re: exceptions from solr/contrib/dataimporthandler and solr/contrib/extraction

2010-09-13 Thread Lance Norskog
What I want you to do is, I want you to find the guys who are putting
all the bugs in the code, and I want you to FIRE THEM!
- true

On Mon, Sep 13, 2010 at 9:10 AM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Sun, Sep 12, 2010 at 8:55 PM, Lance Norskog goks...@gmail.com wrote:
 And this is why unit tests shouldn't spew: just yes or no, please. If you
 want to patch this, please comment out all the printed trash, since all it
 does is cause mail threads like this.

 Also, people shouldn't put bugs in their code :-P

 -Yonik






-- 
Lance Norskog
goks...@gmail.com




[jira] Updated: (LUCENE-2642) merge LuceneTestCase and LuceneTestCaseJ4

2010-09-13 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2642:
--

Attachment: LUCENE-2642-fixes.patch

Some small fixes in reflection inspection:
- exclude static and abstract methods
- check native return type
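The discovery logic with those filters applied might look roughly like this (an illustration only, not the actual LuceneTestCase code): scan for public void no-arg methods whose name starts with "test", excluding static and abstract methods and methods with a non-void return type.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of junit3-style test discovery via reflection, with the
 * filters described above. Illustration only, not LuceneTestCase.
 */
class TestMethodScanner {
  public static List<String> findTestMethods(Class<?> clazz) {
    List<String> names = new ArrayList<>();
    for (Method m : clazz.getMethods()) {
      int mods = m.getModifiers();
      if (m.getName().startsWith("test")
          && m.getParameterCount() == 0
          && m.getReturnType() == void.class // check the return type
          && !Modifier.isStatic(mods)        // exclude static methods
          && !Modifier.isAbstract(mods)) {   // exclude abstract methods
        names.add(m.getName());
      }
    }
    return names;
  }

  /** A sample class to scan. */
  public static class SampleTests {
    public void testFoo() {}           // discovered
    public static void testBar() {}    // excluded: static
    public int testBaz() { return 0; } // excluded: non-void return
    public void helper() {}            // excluded: name
  }
}
```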

 merge LuceneTestCase and LuceneTestCaseJ4
 -

 Key: LUCENE-2642
 URL: https://issues.apache.org/jira/browse/LUCENE-2642
 Project: Lucene - Java
  Issue Type: Test
  Components: Tests
Reporter: Robert Muir
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2642-extendAssert.patch, LUCENE-2642-fixes.patch, 
 LUCENE-2642.patch, LUCENE-2642.patch






[jira] Commented: (LUCENE-2642) merge LuceneTestCase and LuceneTestCaseJ4

2010-09-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908890#action_12908890
 ] 

Robert Muir commented on LUCENE-2642:
-

thanks Uwe, i can merge this into the 3x backport too.

 merge LuceneTestCase and LuceneTestCaseJ4
 -

 Key: LUCENE-2642
 URL: https://issues.apache.org/jira/browse/LUCENE-2642
 Project: Lucene - Java
  Issue Type: Test
  Components: Tests
Reporter: Robert Muir
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2642-extendAssert.patch, LUCENE-2642-fixes.patch, 
 LUCENE-2642.patch, LUCENE-2642.patch






[jira] Commented: (SOLR-2112) Solrj should support streaming response

2010-09-13 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908894#action_12908894
 ] 

Yonik Seeley commented on SOLR-2112:


Can StreamingResponseCallback be an abstract class for easier back compat?
I imagine we could want to stream other stuff in the future (output from terms 
component, facet component, term vector component, etc).

 Solrj should support streaming response
 ---

 Key: SOLR-2112
 URL: https://issues.apache.org/jira/browse/SOLR-2112
 Project: Solr
  Issue Type: New Feature
  Components: clients - java
Reporter: Ryan McKinley
 Fix For: 4.0

 Attachments: SOLR-2112-StreamingSolrj.patch, 
 SOLR-2112-StreamingSolrj.patch






[jira] Commented: (LUCENE-2642) merge LuceneTestCase and LuceneTestCaseJ4

2010-09-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908896#action_12908896
 ] 

Robert Muir commented on LUCENE-2642:
-

OK, i didnt merge the reflection fixes yet, but i backported the patch to 3x.

Committed revision 996611, 996630 (3x).

Will mark the issue resolved when Uwe is out of reflection hell :)

 merge LuceneTestCase and LuceneTestCaseJ4
 -

 Key: LUCENE-2642
 URL: https://issues.apache.org/jira/browse/LUCENE-2642
 Project: Lucene - Java
  Issue Type: Test
  Components: Tests
Reporter: Robert Muir
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2642-extendAssert.patch, LUCENE-2642-fixes.patch, 
 LUCENE-2642.patch, LUCENE-2642.patch






[jira] Commented: (LUCENE-2638) Make HighFreqTerms.TermStats class public

2010-09-13 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908904#action_12908904
 ] 

Tom Burton-West commented on LUCENE-2638:
-

Just wondering if you could describe the use you have in mind.

Tom

 Make HighFreqTerms.TermStats class public
 -

 Key: LUCENE-2638
 URL: https://issues.apache.org/jira/browse/LUCENE-2638
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 4.0
Reporter: Andrzej Bialecki 
 Attachments: LUCENE-2638.patch


 It's not possible to use public methods in contrib/misc/... /HighFreqTerms 
 from outside the package because the return type has package visibility. I 
 propose to move TermStats class to a separate file and make it public.




[jira] Updated: (LUCENE-2504) sorting performance regression

2010-09-13 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated LUCENE-2504:
-

Attachment: LUCENE-2504.patch

OK, here's a patch for Solr's sortMissingLast.
Median response time in my tests drops from 160 to 102 ms.

 sorting performance regression
 --

 Key: LUCENE-2504
 URL: https://issues.apache.org/jira/browse/LUCENE-2504
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Yonik Seeley
 Fix For: 4.0

 Attachments: LUCENE-2504.patch, LUCENE-2504.patch, LUCENE-2504.zip


 sorting can be much slower on trunk than branch_3x




[jira] Updated: (LUCENE-1536) if a filter can support random access API, we should use it

2010-09-13 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1536:
---

Attachment: CachedFilterIndexReader.java

By subclassing FilterIndexReader, and taking advantage of how the flex
APIs now let you pass a custom skipDocs when pulling the postings, I
created a prototype class (attached, named CachedFilterIndexReader) that
up-front compiles the deleted docs for each segment with the negation
of a Filter (that you provide), and returns a reader that applies that
filter.

This is nice because it's fully external to Lucene, and it gives
awesome gains in many cases (see
http://chbits.blogspot.com/2010/09/fast-search-filters-using-flex.html).

I don't think we should commit this class -- we should instead fix
Filters correctly!  But it's a nice workaround until we do that.
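The core trick can be illustrated with plain java.util.BitSet standing in for Lucene's per-segment bit vectors (a sketch, not the actual CachedFilterIndexReader code): the effective skipDocs is the union of the real deletions with the negation of the filter, so every postings consumer skips filtered-out docs for free.

```java
import java.util.BitSet;

/**
 * Sketch of folding a random-access filter into the per-segment
 * "skip docs" set: skip a doc if it is deleted OR the filter does
 * not accept it. BitSet stands in for Lucene's bit vectors.
 */
class FilterAsDeletes {
  public static BitSet compileSkipDocs(BitSet deletedDocs,
                                       BitSet filterBits,
                                       int maxDoc) {
    BitSet notFilter = (BitSet) filterBits.clone();
    notFilter.flip(0, maxDoc);            // negate the filter
    BitSet skipDocs = (BitSet) deletedDocs.clone();
    skipDocs.or(notFilter);               // union with real deletions
    return skipDocs;
  }
}
```

Compiling this once per segment is what makes reopen-friendly, up-front filtering cheap at search time.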


 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, e.g. 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, i.e. 1 AND 2
 AND 3 AND 4.  "u s" means "united states" (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).
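As a toy contrast of the two application styles (plain java.util.BitSet and an int[] hit list here, not Lucene's actual scorer/filter APIs): random access asks the filter per candidate hit, while the iterator style leapfrogs two sorted doc streams.

{code:java}
import java.util.BitSet;

public class FilterApplication {
    // Random-access style: ask the filter per hit.
    static int countRandomAccess(int[] hits, BitSet filter) {
        int n = 0;
        for (int doc : hits) if (filter.get(doc)) n++;
        return n;
    }

    // Iterator style: leapfrog the sorted hit list against the
    // filter's doc iterator (nextSetBit plays the iterator here).
    static int countIterator(int[] hits, BitSet filter) {
        int n = 0, i = 0;
        for (int doc = filter.nextSetBit(0); doc >= 0; doc = filter.nextSetBit(doc + 1)) {
            while (i < hits.length && hits[i] < doc) i++;
            if (i < hits.length && hits[i] == doc) { n++; i++; }
        }
        return n;
    }

    public static void main(String[] args) {
        int[] hits = {0, 2, 5, 7, 9};
        BitSet filter = new BitSet();
        for (int d : new int[]{2, 3, 5, 9}) filter.set(d);
        // Both styles agree on the count; only the access pattern differs.
        System.out.println(countRandomAccess(hits, filter));
        System.out.println(countIterator(hits, filter));
    }
}
{code}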




[jira] Commented: (LUCENE-2504) sorting performance regression

2010-09-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908918#action_12908918
 ] 

Robert Muir commented on LUCENE-2504:
-

silly question, what does the bigString do?

(just wondering if it should be U+10,U+10,... now that we use utf-8 
order, depending what it does)





[jira] Commented: (LUCENE-2504) sorting performance regression

2010-09-13 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908934#action_12908934
 ] 

Yonik Seeley commented on LUCENE-2504:
--

bq. silly question, what does the bigString do? 

It's actually not currently used by Solr, but it's basically meant as a proxy 
for null if you want the Comparables returned by value() to match the sort 
order the Comparator actually used.

bq. (just wondering if it should be U+10,U+10,... now that we use utf-8 
order, depending what it does)

Maybe... if it is supposed to be just a string (I know that's the name, but 
maybe it should be called bigTerm I guess).  All of our terms are currently 
UTF8 - but I don't know if that will last?





[jira] Commented: (LUCENE-2504) sorting performance regression

2010-09-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908947#action_12908947
 ] 

Robert Muir commented on LUCENE-2504:
-

bq. Maybe... if it is supposed to be just a string (I know that's the name, but 
maybe it should be called bigTerm I guess). All of our terms are currently UTF8 
- but I don't know if that will last?

well you are right, for example collated terms for Locale-sensitive sort will 
hopefully use full byte range soon... 

we can always safely use bytes of 0xff i think?





[jira] Commented: (LUCENE-2504) sorting performance regression

2010-09-13 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908953#action_12908953
 ] 

Yonik Seeley commented on LUCENE-2504:
--

bq. we can always safely use bytes of 0xff i think?

Yep, should be fine.  Is there an upper bound on how long collated terms can be?




[jira] Commented: (LUCENE-2504) sorting performance regression

2010-09-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908956#action_12908956
 ] 

Robert Muir commented on LUCENE-2504:
-

bq. Yep, should be fine. Is there an upper bound on how long collated terms can 
be?

There isn't, but...

I can't promise (but I'll verify); I think a single 0xff byte might actually 
do, for the major encodings:

* it's invalid in UTF-8
* it's technically valid, but unused (a reset byte), in BOCU-1
* collation keys, as I understand them, are a modified BOCU, so likely unused 
there too.

So it's like a NaN sentinel: if someone is doing something very weird, maybe it 
won't work, but in general I think it will. I'll check.
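For what it's worth, a small self-contained sketch of why 0xff works as a sort-after-everything sentinel for UTF-8-ordered terms (the comparison helper is illustrative, not Lucene's): 0xff never appears in a valid UTF-8 encoding, so under unsigned byte order any run of 0xff bytes compares greater than any encoded string.

{code:java}
import java.nio.charset.StandardCharsets;

public class BigTermSentinel {
    // Unsigned lexicographic comparison of byte arrays, the order
    // UTF-8 encoded terms sort in.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int ai = a[i] & 0xff, bi = b[i] & 0xff;
            if (ai != bi) return ai - bi;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // 0xFF is never emitted by UTF-8, so a run of 0xFF bytes is an upper bound.
        byte[] sentinel = new byte[10];
        java.util.Arrays.fill(sentinel, (byte) 0xff);

        // Even the largest code point encodes with a 0xF4 lead byte, below 0xFF.
        String[] samples = { "", "abc", "\uFFFF", new String(Character.toChars(0x10FFFF)) };
        for (String s : samples) {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            System.out.println(compareUnsigned(utf8, sentinel) < 0);
        }
    }
}
{code}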





[IGNORE] Email test

2010-09-13 Thread George Aroush

Hi all,

Sorry for the spam; testing to see if my email still works!

Thanks,

-- George


[jira] Commented: (LUCENE-2504) sorting performance regression

2010-09-13 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908963#action_12908963
 ] 

Yonik Seeley commented on LUCENE-2504:
--

OK, I've changed bigString to bigTerm and used 10 0xff bytes (to account for 
possible binary encoding of 8 byte numerics + other stuff like tags that trie 
encoding uses).




[jira] Commented: (LUCENE-2504) sorting performance regression

2010-09-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908965#action_12908965
 ] 

Robert Muir commented on LUCENE-2504:
-

Cool, thanks. (Unfortunately I ran a bunch of collators and encountered what 
look like 0xff bytes.)
I think this will help.





[jira] Updated: (SOLR-2112) Solrj should support streaming response

2010-09-13 Thread Ryan McKinley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan McKinley updated SOLR-2112:


Attachment: SOLR-2112-StreamingSolrj.patch

ah yes, good point.

Here is an updated patch using:
{code:java}
public abstract class StreamingResponseCallback {
  /*
   * Called for each SolrDocument in the response
   */
  public abstract void streamSolrDocument( SolrDocument doc );

  /*
   * Called at the beginning of each DocList (and SolrDocumentList)
   */
  public abstract void streamDocListInfo( long numFound, long start, Float maxScore );
}
{code}
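For example (with minimal stub SolrDocument/StreamingResponseCallback classes standing in for the SolrJ types, so this compiles on its own), a client callback might look like:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Stand-ins for the SolrJ types, for illustration only.
class SolrDocument {
    final String id;
    SolrDocument(String id) { this.id = id; }
}

abstract class StreamingResponseCallback {
    /** Called for each SolrDocument in the response. */
    public abstract void streamSolrDocument(SolrDocument doc);
    /** Called at the beginning of each DocList (and SolrDocumentList). */
    public abstract void streamDocListInfo(long numFound, long start, Float maxScore);
}

public class StreamingCallbackDemo {
    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        StreamingResponseCallback callback = new StreamingResponseCallback() {
            @Override public void streamDocListInfo(long numFound, long start, Float maxScore) {
                log.add("header: numFound=" + numFound + " start=" + start);
            }
            @Override public void streamSolrDocument(SolrDocument doc) {
                // Each document is handled as it is parsed; nothing accumulates.
                log.add("doc: " + doc.id);
            }
        };

        // Simulate the response parser driving the callback.
        callback.streamDocListInfo(2, 0, null);
        callback.streamSolrDocument(new SolrDocument("1"));
        callback.streamSolrDocument(new SolrDocument("2"));
        log.forEach(System.out::println);
    }
}
{code}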

 Solrj should support streaming response
 ---

 Key: SOLR-2112
 URL: https://issues.apache.org/jira/browse/SOLR-2112
 Project: Solr
  Issue Type: New Feature
  Components: clients - java
Reporter: Ryan McKinley
 Fix For: 4.0

 Attachments: SOLR-2112-StreamingSolrj.patch, 
 SOLR-2112-StreamingSolrj.patch, SOLR-2112-StreamingSolrj.patch


 The solrj API should optionally support streaming documents.
 Rather than putting all results into a SolrDocumentList, solrj should be able 
 to call a callback function after each document is parsed.  This would allow 
 someone to call query.setRows( Integer.MAX_VALUE ) and get each result to the 
 client without loading them all into memory.
 For starters, I think the important things to stream are SolrDocuments, but 
 down the road, this could also stream other things (consider reading all 
 terms from the index)




[jira] Resolved: (LUCENE-2642) merge LuceneTestCase and LuceneTestCaseJ4

2010-09-13 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-2642.
-

Resolution: Fixed

OK, I merged back all of Uwe's improvements. Thanks for the help, Uwe.

I think in future issues we can now clean up and improve this test case a lot.
I felt discouraged from doing so with the previous duplication...

 merge LuceneTestCase and LuceneTestCaseJ4
 -

 Key: LUCENE-2642
 URL: https://issues.apache.org/jira/browse/LUCENE-2642
 Project: Lucene - Java
  Issue Type: Test
  Components: Tests
Reporter: Robert Muir
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2642-extendAssert.patch, LUCENE-2642-fixes.patch, 
 LUCENE-2642.patch, LUCENE-2642.patch


 We added Junit4 support, but as a separate test class.
 So unfortunately, we have two separate base classes to maintain: 
 LuceneTestCase and LuceneTestCaseJ4.
 This creates a mess and is difficult to manage.
 Instead, I propose a single base test class that works in both junit3 and 
 junit4 style.
 I modified our LuceneTestCaseJ4 in the following way:
 * the methods to run are not limited to the ones annotated with @Test, but 
 also any void no-arg methods that start with test, like junit3. This means 
 you don't have to sprinkle @Test everywhere.
 * of course, @Ignore works as expected everywhere.
 * LuceneTestCaseJ4 extends TestCase so you don't have to import static 
 Assert.* to get all the asserts.
 For most tests, no changes are required, but a few very minor things had to 
 be changed:
 * setUp() and tearDown() must be public, not protected.
 * useless ctors must be removed, such as TestFoo(String name) { super(name); }
 * LocalizedTestCase is gone, instead of
 {code}
 public class TestQueryParser extends LocalizedTestCase {
 {code}
 it is now
 {code}
 @RunWith(LuceneTestCase.LocalizedTestCaseRunner.class)
 public class TestQueryParser extends LuceneTestCase {
 {code}
 * Same with MultiCodecTestCase: 
 (LuceneTestCase.MultiCodecTestCaseRunner.class)
 I only did the core tests in the patch as a start, and i just made an empty 
 LuceneTestCase extends LuceneTestCaseJ4.
 I'd like to do contrib and solr and rename this LuceneTestCaseJ4 to only a 
 single class: LuceneTestCase.




[jira] Resolved: (SOLR-2112) Solrj should support streaming response

2010-09-13 Thread Ryan McKinley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan McKinley resolved SOLR-2112.
-

Resolution: Fixed

added in r996693

I'm not sure what the 3.x release schedule looks like... so I'm not sure if 
backporting makes sense.  I think keeping it on /trunk for a while makes sense 
until we know this is the API we want.






[jira] Updated: (SOLR-792) Tree Faceting Component

2010-09-13 Thread Ryan McKinley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan McKinley updated SOLR-792:
---

Attachment: SOLR-792-PivotFaceting.patch

updated to trunk

 Tree Faceting Component
 ---

 Key: SOLR-792
 URL: https://issues.apache.org/jira/browse/SOLR-792
 Project: Solr
  Issue Type: New Feature
Reporter: Erik Hatcher
Assignee: Erik Hatcher
Priority: Minor
 Attachments: SOLR-792-PivotFaceting.patch, 
 SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, 
 SOLR-792-PivotFaceting.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, 
 SOLR-792.patch, SOLR-792.patch, SOLR-792.patch


 A component to do multi-level faceting.




[jira] Commented: (SOLR-1297) Enable sorting by Function Query

2010-09-13 Thread Scott Kister (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909013#action_12909013
 ] 

Scott Kister commented on SOLR-1297:


Bug report on this feature: if there is a wildcard entry in schema.xml such as 
the following

<!-- Ignore any fields that don't already match an existing field name -->
<dynamicField name="*" type="ignored" multiValued="true" />

then this feature does not work and an error is returned, i.e.

GET 'http://localhost:8983/solr/select?q=*:*&sort=sum(1,2)+asc'
Error 400 can not sort on unindexed field: sum(1,2)


 Enable sorting by Function Query
 

 Key: SOLR-1297
 URL: https://issues.apache.org/jira/browse/SOLR-1297
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.5, 3.1, 4.0

 Attachments: SOLR-1297-2.patch, SOLR-1297.patch, SOLR-1297.patch


 It would be nice if one could sort by FunctionQuery.  See also SOLR-773, 
 where this was first mentioned by Yonik as part of the generic solution to 
 geo-search




[jira] Assigned: (SOLR-792) Tree Faceting Component

2010-09-13 Thread Erik Hatcher (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Hatcher reassigned SOLR-792:
-

Assignee: Ryan McKinley  (was: Erik Hatcher)

handing this one over to Ryan, as I don't have cycles to work on it anytime 
soon.   Rock on Ryan...

 Tree Faceting Component
 ---

 Key: SOLR-792
 URL: https://issues.apache.org/jira/browse/SOLR-792
 Project: Solr
  Issue Type: New Feature
Reporter: Erik Hatcher
Assignee: Ryan McKinley
Priority: Minor
 Attachments: SOLR-792-PivotFaceting.patch, 
 SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, 
 SOLR-792-PivotFaceting.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, 
 SOLR-792.patch, SOLR-792.patch, SOLR-792.patch


 A component to do multi-level faceting.




/trunk GRAVE: ConcurrentLRUCache was not destroyed prior to finalize()

2010-09-13 Thread Ryan McKinley
On windows vista with JDK 1.6 running /trunk, I see warnings like this often:

[junit]
[junit] - Standard Error -
[junit] 13-sep-2010 22:19:26
org.apache.solr.common.util.ConcurrentLRUCache finalize
[junit] GRAVE: ConcurrentLRUCache was not destroyed prior to
finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
[junit] 13-sep-2010 22:19:26
org.apache.solr.common.util.ConcurrentLRUCache finalize
[junit] GRAVE: ConcurrentLRUCache was not destroyed prior to
finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
[junit] 13-sep-2010 22:19:26
org.apache.solr.common.util.ConcurrentLRUCache finalize
[junit] GRAVE: ConcurrentLRUCache was not destroyed prior to
finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
[junit] 13-sep-2010 22:19:26
org.apache.solr.common.util.ConcurrentLRUCache finalize
[junit] GRAVE: ConcurrentLRUCache was not destroyed prior to
finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
[junit] 13-sep-2010 22:19:26
org.apache.solr.common.util.ConcurrentLRUCache finalize
[junit] GRAVE: ConcurrentLRUCache was not destroyed prior to
finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
[junit] 13-sep-2010 22:19:26
org.apache.solr.common.util.ConcurrentLRUCache finalize
[junit] GRAVE: ConcurrentLRUCache was not destroyed prior to
finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
[junit] 13-sep-2010 22:19:26
org.apache.solr.common.util.ConcurrentLRUCache finalize
[junit] GRAVE: ConcurrentLRUCache was not destroyed prior to
finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
[junit] 13-sep-2010 22:19:26
org.apache.solr.common.util.ConcurrentLRUCache finalize
[junit] GRAVE: ConcurrentLRUCache was not destroyed prior to
finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
[junit] -  ---

Do others see this also?  Is this new since the test reworking?

$ java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)




Re: /trunk GRAVE: ConcurrentLRUCache was not destroyed prior to finalize()

2010-09-13 Thread Robert Muir
On Mon, Sep 13, 2010 at 6:24 PM, Ryan McKinley ryan...@gmail.com wrote:

 On windows vista with JDK 1.6 running /trunk, I see warnings like this
 often:

[junit]
[junit] - Standard Error -
[junit] 13-sep-2010 22:19:26
 org.apache.solr.common.util.ConcurrentLRUCache finalize
[junit] GRAVE: ConcurrentLRUCache was not destroyed prior to
 finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!


I see this often as well on vista, for quite some time.

-- 
Robert Muir
rcm...@gmail.com


[jira] Commented: (SOLR-2116) TikaEntityProcessor does not find parser by default

2010-09-13 Thread Lance Norskog (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909030#action_12909030
 ] 

Lance Norskog commented on SOLR-2116:
-

It does not work if the parser= attribute is set to 

{code}
parser="org.apache.tika.parser.AutoDetectParser"
{code}

So, the AutoDetectParser does not work.

Lance

 TikaEntityProcessor does not find parser by default
 ---

 Key: SOLR-2116
 URL: https://issues.apache.org/jira/browse/SOLR-2116
 Project: Solr
  Issue Type: Bug
  Components: contrib - DataImportHandler, contrib - Solr Cell (Tika 
 extraction)
Affects Versions: 3.1, 4.0
Reporter: Lance Norskog
 Attachments: pdflist-data-config.xml, pdflist.xml


 The TikaEntityProcessor does not find the correct document parser by default.
 This is in a two-level DIH config file. I have attached 
 pdflist-data-config.xml and pdflist.xml, the supplying XML file list. To test 
 this, you will need the current 3.x branch or 4.0 trunk.
 # Set up a Tika-enabled Solr 
 # copy any PDF file to /tmp/testfile.pdf
 # copy the pdflist-data-config.xml to your solr/conf
 # and add this snippet to your solrconfig.xml
 {code:xml}
  <requestHandler name="/pdflist"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">pdflist-data-config.xml</str>
    </lst>
  </requestHandler>
 {code}
 [http://localhost:8983/solr/pdflist?command=full-import] will make one 
 document with the id and text fields populated. If you remove this line:
 {code}
  parser="org.apache.tika.parser.pdf.PDFParser"
 {code}
 from the TikaEntityProcessor entity, the parser will not be found and you 
 will get a document with the id field and nothing else.




[jira] Updated: (SOLR-1900) move Solr to flex APIs

2010-09-13 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-1900:
---

Attachment: SOLR-1900_bigTerm.txt

Attaching patch that moves bigTerm into ByteUtils, adds 
BytesRef.append(BytesRef), and uses those in the faceting code when a prefix is 
specified (instead of a String with \u chars).

If people think that the append() is more Solr specific (i.e. not likely to be 
used in lucene) I can move it to Solr's ByteUtils.

 move Solr to flex APIs
 --

 Key: SOLR-1900
 URL: https://issues.apache.org/jira/browse/SOLR-1900
 Project: Solr
  Issue Type: Improvement
Affects Versions: 4.0
Reporter: Yonik Seeley
 Fix For: 4.0

 Attachments: SOLR-1900-facet_enum.patch, SOLR-1900-facet_enum.patch, 
 SOLR-1900_bigTerm.txt, SOLR-1900_FileFloatSource.patch, 
 SOLR-1900_termsComponent.txt


 Solr should use flex APIs




[jira] Created: (SOLR-2119) IndexSchema should log warning if analyzer is declared with charfilter/tokenizer/tokenfiler out of order

2010-09-13 Thread Hoss Man (JIRA)
IndexSchema should log warning if analyzer is declared with 
charfilter/tokenizer/tokenfiler out of order
--

 Key: SOLR-2119
 URL: https://issues.apache.org/jira/browse/SOLR-2119
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Hoss Man


There seems to be a segment of the user population that has a hard time 
understanding the distinction between a charFilter, a tokenizer, and a 
tokenFilter -- while we can certainly try to improve the documentation about 
what exactly each does, and when it takes effect in the analysis chain, one 
other thing we should do is try to educate people when they construct their 
analyzer in a way that doesn't make any sense.

At the moment, some people are attempting to do things like move the Foo 
<tokenFilter/> before the <tokenizer/> to try and get certain behavior ... at 
a minimum we should log a warning in this case that doing that doesn't have the 
desired effect.

(We could easily make such a situation fail to initialize, but I'm not 
convinced that would be the best course of action, since some people may have 
schemas where they have declared a charFilter or tokenizer out of order 
relative to their tokenFilters, but are still getting correct results that 
work for them, and breaking their instance on upgrade doesn't seem like it 
would be productive.)
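For reference, the chain always runs charFilters first, then the single tokenizer, then tokenFilters; a correctly ordered declaration looks something like the following (the factory names are just common examples, not a recommendation):

{code:xml}
<fieldType name="text_example" class="solr.TextField">
  <analyzer>
    <!-- charFilters run first, over the raw character stream -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- exactly one tokenizer, which produces the token stream -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- tokenFilters run last, over the tokens -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
{code}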




[jira] Commented: (LUCENE-2504) sorting performance regression

2010-09-13 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909107#action_12909107
 ] 

Yonik Seeley commented on LUCENE-2504:
--

bq. Yonik, just curious, how do you know what HotSpot is doing? Empirically 
based on performance numbers?

Yeah - it's a best guess based on what I see when performance testing, and 
matching that up with what I've read in the past.
As far as deoptimization is concerned, it's mentioned here: 
http://java.sun.com/products/hotspot/whitepaper.html, but I haven't read much 
elsewhere.

Specific to this issue, the whole optimization/deoptimization issue is 
extremely complex.
Recall that I reported this: Median response time in my tests drops from 160 
to 102 ms.

For simplicity, there are some details I left out:
Those numbers were for randomly sorting on different fields (hopefully the most 
realistic scenario).
If you test differently, the results are far different.

The first and second test runs measured median time sorting on a single field 
100 times in a row, then moving to the next field.

Trunk before patch:
|unique terms in field|median sort time in ms (first run)|second run
|10|105|168
|1|105|169
|1000|106|164
|100|127|163
|10|165|197

Trunk after patch:
|unique terms in field|median sort time in ms (first run)|second run
|10|85|130
|1|92|129
|1000|92|126
|100|116|127
|10|117|128

branch_3x
|unique terms in field|median sort time in ms (first run)|second run
|10|102|102
|1|102|103
|1000|101|103
|100|103|103
|10|118|118

So, it seems that by running in batches (sorting on the same field over and 
over), we cause hotspot to overspecialize somehow, and then when we switch 
things up the resulting deoptimization puts us in a permanently worse 
condition.  branch_3x does not suffer from that, but trunk still does due to 
the increased amount of indirection.  I imagine the differences are also due 
to the boundaries at which the compiler tries to inline/specialize for a 
certain class.

It certainly complicates performance testing, and we need to keep a sharp eye 
on how we actually test potential improvements.




[jira] Commented: (SOLR-236) Field collapsing

2010-09-13 Thread Stephen Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909116#action_12909116
 ] 

Stephen Weiss commented on SOLR-236:


FWIW, I fixed my earlier OOM issues with some garbage collection tuning.

Now I'm noticing NPEs very similar to those people were reporting back before 
the patch from Jun 28th:

SEVERE: java.lang.NullPointerException
at 
org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$FloatValueFieldComparator.compare(NonAdjacentDocumentCollapser.java:450)
at 
org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$DocumentComparator.compare(NonAdjacentDocumentCollapser.java:262)
... it's the same backtrace ...

I'm guessing it's because I added those 5 lines back into the patch to get the 
paging working again.

It's rather infrequent; it's probably something I can deal with until the new 
patch is complete.  It doesn't happen every time at all like it seemed to 
happen to many people - just once in a while, and on queries that honestly run 
all the time, so it seems random and not related to a particular query (except 
perhaps in the size of the filter queries - these fqs match relatively large 
#'s of documents).  But if any of this code makes it to the new patch I 
thought it would be worth mentioning.

 Field collapsing
 

 Key: SOLR-236
 URL: https://issues.apache.org/jira/browse/SOLR-236
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.3
Reporter: Emmanuel Keller
Assignee: Shalin Shekhar Mangar
 Fix For: Next

 Attachments: collapsing-patch-to-1.3.0-dieter.patch, 
 collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, 
 collapsing-patch-to-1.3.0-ivan_3.patch, DocSetScoreCollector.java, 
 field-collapse-3.patch, field-collapse-4-with-solrj.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, 
 field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, 
 field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, 
 field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
 NonAdjacentDocumentCollapser.java, NonAdjacentDocumentCollapserTest.java, 
 quasidistributed.additional.patch, SOLR-236-1_4_1.patch, 
 SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
 SOLR-236-FieldCollapsing.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, 
 SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, 
 SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, 
 SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, solr-236.patch, 
 SOLR-236_collapsing.patch, SOLR-236_collapsing.patch


 This patch includes a new feature called field collapsing.
 It is used to collapse a group of results with a similar value for a given 
 field into a single entry in the result set. Site collapsing is a special case 
 of this, where all results for a given web site are collapsed into one or two 
 entries in the result set, typically with an associated "more documents from 
 this site" link. See also duplicate detection:
 http://www.fastsearch.com/glossary.aspx?m=48&amid=299
 The implementation adds 3 new query parameters (SolrParams):
 - collapse.field: chooses the field used to group results
 - collapse.type: normal (default value) or adjacent
 - collapse.max: selects how many continuous results are allowed before 
 collapsing
 TODO (in progress):
 - More documentation (on source code)
 - Test cases
 Two patches:
 - field_collapsing.patch for current development version
 - field_collapsing_1.1.0.patch for Solr-1.1.0
 P.S.: Feedback and misspelling corrections are welcome ;-)
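To show how the three collapse parameters described above combine on a request, here is a hypothetical query URL (the host, core, and field names are made up for illustration and are not from the patch):

```java
public class CollapseParamsExample {
    public static void main(String[] args) {
        // Hypothetical Solr request using the three collapse parameters.
        String url = "http://localhost:8983/solr/select"
            + "?q=ipod"
            + "&collapse.field=site"   // group results by the 'site' field
            + "&collapse.type=normal"  // 'normal' (the default) or 'adjacent'
            + "&collapse.max=1";       // results allowed before collapsing
        System.out.println(url);
    }
}
```

With these parameters, all hits sharing the same value in the site field would be collapsed into a single entry in the result set, which is the site-collapsing special case described in the issue.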

-- 
This message is automatically generated by JIRA.
-