[jira] [Commented] (LUCENE-3296) Enable passing a config into PKIndexSplitter

2011-07-11 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062862#comment-13062862
 ] 

Jason Rutherglen commented on LUCENE-3296:
--

Uwe, the first patch [1] is implemented with CURRENT.

1. https://issues.apache.org/jira/secure/attachment/12485805/LUCENE-3296.patch

 Enable passing a config into PKIndexSplitter
 

 Key: LUCENE-3296
 URL: https://issues.apache.org/jira/browse/LUCENE-3296
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/other
Affects Versions: 3.3, 4.0
Reporter: Jason Rutherglen
Assignee: Simon Willnauer
Priority: Trivial
 Attachments: LUCENE-3296.patch, LUCENE-3296.patch


 I need to be able to pass the IndexWriterConfig into the IW used by 
 PKIndexSplitter.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: New facet module

2011-07-11 Thread Jason Rutherglen
 Actually I think the faceting module is per-segment?

That would be very cool.  I reviewed the user guide and it is
ambiguous on this topic.  Eg, why does the facet taxonomy need to be
committed for every IW commit?  Mapping that to [N]RT will be
tricky.

Page 17:

In faceted search, we complicate things somewhat by adding a second index – the
taxonomy index. The taxonomy API also follows point-in-time semantics,
but this is
not quite enough. Some attention must be paid by the user to keep
those two indexes
consistently in sync:

The main index refers to category numbers defined in the taxonomy index.
Therefore, it is important that we open the TaxonomyReader after opening the
IndexReader. Moreover, every time an IndexReader is reopen()ed, the
TaxonomyReader needs to be refresh()ed as well.

But there is one extra caution: whenever the application deems it has written
enough information worthy of a commit, it must first call commit() for the
TaxonomyWriter and only after that call commit() for the IndexWriter.
Closing the
indices should also be done in this order – first close the taxonomy,
and only after
that close the index.
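
A minimal sketch of that reader-side ordering (reopen() and refresh() as named
in the excerpt above; the facet-module package and class names are assumptions
from the LUCENE-3079-era code):

import java.io.IOException;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.index.IndexReader;

public class FacetReopenOrder {
    /** Reopen the main index first, then refresh the taxonomy. */
    public static IndexReader reopenBoth(IndexReader indexReader, TaxonomyReader taxoReader) throws IOException {
        IndexReader newReader = indexReader.reopen(); // point-in-time reopen of the content index
        if (newReader != indexReader) {               // reopen() may return the same instance
            indexReader.close();
            indexReader = newReader;
        }
        taxoReader.refresh();                         // only then bring the taxonomy up to date
        return indexReader;
    }
}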


On Sat, Jul 9, 2011 at 4:13 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 Actually I think the faceting module is per-segment?

 The facets are encoded into payloads, and then it visits the payload
 of each hit right per segment, and aggregates the counts.

 Like, on reopen (NRT or not) of a reader, there are no global data
 structures that must be recomputed.  EG, this facets impl doesn't use
 FieldCache on the global reader (leading to insanity).

 Mike McCandless

 http://blog.mikemccandless.com

 On Sat, Jul 9, 2011 at 12:40 AM, Shai Erera ser...@gmail.com wrote:
 Well, the approach is entirely different, and the new module
 introduces features not available in the other impls (and I imagine
 vice versa).

 The taxonomy is managed on the side, hence why it is global to the
 'content' index. It plays very well with NRT, and we in fact have
 several apps that use the module in an NRT environment.

 The taxonomy index supports NRT by itself, by using the IR.open(IW)
 API and then it's up to the application to manage its content index
 search as NRT.

 I think you should read the high-level description I put on
 LUCENE-3079 and the userguide I put on LUCENE-3261. As I said, the
 approach is quite different than the bitset and FieldCache ones.

 Shai

 On Saturday, July 9, 2011, Jason Rutherglen jason.rutherg...@gmail.com 
 wrote:
 The taxonomy is global to the index, but I think it will be
 interesting to explore per-segment taxonomy, and how it can be used to
 improve indexing or search perf (hopefully both)

 Right so with NRT this'll be an issue.  Is there a write up on this?
 It sounds fairly radical in design.  Eg, I'm curious as to how it
 compares with the bit set and un-inverted field cache based faceting
 systems.

 On Fri, Jul 8, 2011 at 8:44 PM, Shai Erera ser...@gmail.com wrote:
 Currently it doesn't facet per segment, because the approach it uses
 is irrelevant to per segment.

 It maintains a count array in the size of the taxonomy and every
 matching document contributes to the weight of the categories it is
associated with, regardless of the segment it is found in.

 The taxonomy is global to the index, but I think it will be
 interesting to explore per-segment taxonomy, and how it can be used to
 improve indexing or search perf (hopefully both).

 Shai

 On Saturday, July 9, 2011, Jason Rutherglen jason.rutherg...@gmail.com 
 wrote:
 Is it faceting per-segment?

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import

2011-07-11 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062866#comment-13062866
 ] 

Shalin Shekhar Mangar commented on SOLR-2551:
-

I'm taking a look. Thanks Chris.

 Check dataimport.properties for write access before starting import
 ---

 Key: SOLR-2551
 URL: https://issues.apache.org/jira/browse/SOLR-2551
 Project: Solr
  Issue Type: Improvement
  Components: contrib - DataImportHandler
Affects Versions: 1.4.1, 3.1
Reporter: C S
Assignee: Shalin Shekhar Mangar
Priority: Minor
 Fix For: 3.3, 4.0

 Attachments: SOLR-2551.patch


 A common mistake is that the /conf (respectively the dataimport.properties) 
 file is not writable for solr. It would be great if that were detected on 
 starting a dataimport job. 
 Currently an import might grind away for days and fail if it can't write its 
 timestamp to the dataimport.properties file.
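
A minimal sketch of the kind of up-front check proposed here (a hypothetical
helper, not the attached SOLR-2551.patch):

import java.io.File;

public class DataImportPropertiesCheck {
    /** Fail fast if dataimport.properties cannot be written. */
    public static void assertWritable(File confDir) {
        File props = new File(confDir, "dataimport.properties");
        // if the file does not exist yet, the directory itself must be writable
        boolean writable = props.exists() ? props.canWrite() : confDir.canWrite();
        if (!writable) {
            throw new RuntimeException("dataimport.properties is not writable, aborting import: " + props);
        }
    }
}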

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: New facet module

2011-07-11 Thread Shai Erera
Hi Jason,

The reason why the taxonomy and content indexes need to be in sync is
because the taxonomy index manages the categories and their ordinals. The
ordinals are written in a special posting list in the content index (I think
we should cut over this part to use DocValues).

Now imagine that you only commit to the content index, but not to the
taxonomy index. If the system crashes, the content index will refer to
ordinals which the taxonomy index does not know about.
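
A minimal sketch of that commit ordering (TaxonomyWriter and commit() as named
in the userguide; the package name is an assumption):

import java.io.IOException;
import org.apache.lucene.facet.taxonomy.TaxonomyWriter;
import org.apache.lucene.index.IndexWriter;

public class FacetCommitOrder {
    /** Commit the taxonomy first so the content index never refers to unknown ordinals. */
    public static void commitBoth(TaxonomyWriter taxoWriter, IndexWriter indexWriter) throws IOException {
        taxoWriter.commit();   // 1) make new category ordinals durable
        indexWriter.commit();  // 2) only then commit documents that reference them
    }
}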

NRT-wise, the taxonomy index is much smaller than the content index. Imagine
what it takes to manage NRT over regular content. Every document you add
includes probably some moderate size of text that's parsed, stored fields,
term vectors and what not. Flushing that data (during getReader()), whether
to FSDir or RAMDir is much more expensive than flushing the information in
the taxonomy index, where every document contains a single term with the
category label.

So I don't think we should be worried too much about the taxonomy index's
NRT support and performance. It is orders of magnitude smaller than the
other index.

There is one thing we should improve about per-segment faceting in the new
module -- by default categories are read from the posting list's payload,
but there is a way to load all categories into RAM and fetch them from there
during search. Today that code is not per-segment, and I think it should be.

Shai

On Mon, Jul 11, 2011 at 9:27 AM, Jason Rutherglen 
jason.rutherg...@gmail.com wrote:

  Actually I think the faceting module is per-segment?

 That would be very cool.  I reviewed the user guide and it is
 ambiguous on this topic.  Eg, why does the facet taxonomy need to be
 committed for every IW commit?  Mapping that to [N]RT will be
 tricky.

 Page 17:

 In faceted search, we complicate things somewhat by adding a second index
 – the
 taxonomy index. The taxonomy API also follows point-in-time semantics,
 but this is
 not quite enough. Some attention must be paid by the user to keep
 those two indexes
 consistently in sync:

 The main index refers to category numbers defined in the taxonomy index.
 Therefore, it is important that we open the TaxonomyReader after opening
 the
 IndexReader. Moreover, every time an IndexReader is reopen()ed, the
 TaxonomyReader needs to be refresh()ed as well.

 But there is one extra caution: whenever the application deems it has
 written
 enough information worthy of a commit, it must first call commit() for the
 TaxonomyWriter and only after that call commit() for the IndexWriter.
 Closing the
 indices should also be done in this order – first close the taxonomy,
 and only after
 that close the index.


 On Sat, Jul 9, 2011 at 4:13 AM, Michael McCandless
 luc...@mikemccandless.com wrote:
  Actually I think the faceting module is per-segment?
 
  The facets are encoded into payloads, and then it visits the payload
  of each hit right per segment, and aggregates the counts.
 
  Like, on reopen (NRT or not) of a reader, there are no global data
  structures that must be recomputed.  EG, this facets impl doesn't use
  FieldCache on the global reader (leading to insanity).
 
  Mike McCandless
 
  http://blog.mikemccandless.com
 
  On Sat, Jul 9, 2011 at 12:40 AM, Shai Erera ser...@gmail.com wrote:
  Well, the approach is entirely different, and the new module
  introduces features not available in the other impls (and I imagine
  vice versa).
 
  The taxonomy is managed on the side, hence why it is global to the
  'content' index. It plays very well with NRT, and we in fact have
  several apps that use the module in an NRT environment.
 
  The taxonomy index supports NRT by itself, by using the IR.open(IW)
  API and then it's up to the application to manage its content index
  search as NRT.
 
  I think you should read the high-level description I put on
  LUCENE-3079 and the userguide I put on LUCENE-3261. As I said, the
  approach is quite different than the bitset and FieldCache ones.
 
  Shai
 
  On Saturday, July 9, 2011, Jason Rutherglen jason.rutherg...@gmail.com
 wrote:
  The taxonomy is global to the index, but I think it will be
  interesting to explore per-segment taxonomy, and how it can be used to
  improve indexing or search perf (hopefully both)
 
  Right so with NRT this'll be an issue.  Is there a write up on this?
  It sounds fairly radical in design.  Eg, I'm curious as to how it
  compares with the bit set and un-inverted field cache based faceting
  systems.
 
  On Fri, Jul 8, 2011 at 8:44 PM, Shai Erera ser...@gmail.com wrote:
  Currently it doesn't facet per segment, because the approach it uses
  is irrelevant to per segment.
 
  It maintains a count array in the size of the taxonomy and every
  matching document contributes to the weight of the categories it is
 associated with, regardless of the segment it is found in.
 
  The taxonomy is global to the index, but I think it will be
  interesting to explore per-segment taxonomy, and how it can be used to
 

Re: New facet module

2011-07-11 Thread Toke Eskildsen
On Sat, 2011-07-09 at 05:44 +0200, Shai Erera wrote:
 The taxonomy is global to the index, but I think it will be
 interesting to explore per-segment taxonomy, and how it can be used to
 improve indexing or search perf (hopefully both).

I have struggled with this for some time and still haven't found a real
solution. Distributed faceting, with the special case segment based
faceting, is hard to do without a central taxonomy.

The new faceting module is explicit about the central taxonomy. My
experiments with https://issues.apache.org/jira/browse/LUCENE-2369
computes it at index open time. None of them work very well, if at all,
for a real distributed environment.

The problem is the same for flat faceting but is magnified with
hierarchical faceting: When the sorting order of facet elements is
popularity based, computing the correct counts for a top-X might
potentially involve comparison of the whole result from each part. 

A pathological case for flat faceting is
Part 1: A1(2), A2(2)... An(2)
Part 2: B1(3), B2(2), B3(2)... Bn(2), An(1)
where the correct top 3 answer is An(3), B1(3), A2(2), which requires
the full part results to get to the An(2) and An(1) as they are the last
elements.
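
A small self-contained illustration of this case (hypothetical counts, plain
Java, not the facet module's API); note that if each part shipped only its own
top 3, the two partial counts for An would never meet:

import java.util.HashMap;
import java.util.Map;

public class TopXMergeExample {
    public static void main(String[] args) {
        int n = 100;
        Map<String, Integer> part1 = new HashMap<String, Integer>();
        Map<String, Integer> part2 = new HashMap<String, Integer>();
        for (int i = 1; i <= n; i++) part1.put("A" + i, 2);  // A1(2) ... An(2)
        part2.put("B1", 3);
        for (int i = 2; i <= n; i++) part2.put("B" + i, 2);  // B2(2) ... Bn(2)
        part2.put("A" + n, 1);                               // An(1), last element of part 2

        // Correct global counts require merging the *full* per-part results:
        Map<String, Integer> merged = new HashMap<String, Integer>(part1);
        for (Map.Entry<String, Integer> e : part2.entrySet()) {
            Integer prev = merged.get(e.getKey());
            merged.put(e.getKey(), (prev == null ? 0 : prev) + e.getValue());
        }
        System.out.println("An total = " + merged.get("A" + n)); // 3, top of the merged list
    }
}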

For real world use, we can do clever counting so that we only return
what is necessary, but it does not change the worst case. To ensure that
we don't hit any million entries merge situations, we must cheat and
make a cutoff point.

With a multi-level faceting result (state/town/street expanded to top 5
elements on all levels) we must resolve quite a lot of elements to
ensure a high chance of getting the right elements with the right
counts. We can avoid this by drilling down one level at a time, but that
is just replacing bulk transfers with multiple requests: 1*5*5 is the
unrealistically low minimum for the address case.

- Toke


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS-MAVEN] Lucene-Solr-Maven-trunk #184: POMs out of sync

2011-07-11 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-Maven-trunk/184/

1 tests failed.
REGRESSION:  org.apache.lucene.index.TestIndexWriter.testThreadInterruptDeadlock

Error Message:


Stack Trace:
java.lang.AssertionError: 
at org.junit.Assert.fail(Assert.java:91)
at org.junit.Assert.assertTrue(Assert.java:43)
at org.junit.Assert.assertFalse(Assert.java:68)
at org.junit.Assert.assertFalse(Assert.java:79)
at 
org.apache.lucene.index.TestIndexWriter.testThreadInterruptDeadlock(TestIndexWriter.java:1204)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.rules.TestWatchman$1.evaluate(TestWatchman.java:48)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
at 
org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:35)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:146)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:97)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at 
org.apache.maven.surefire.booter.ProviderFactory$ClassLoaderProxy.invoke(ProviderFactory.java:103)
at $Proxy0.invoke(Unknown Source)
at 
org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:145)
at 
org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcess(SurefireStarter.java:87)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:69)




Build Log (for compile errors):
[...truncated 20110 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS] Lucene-Solr-tests-only-trunk - Build # 9492 - Failure

2011-07-11 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9492/

1 tests failed.
REGRESSION:  
org.apache.solr.handler.dataimport.TestSqlEntityProcessorDelta2.testCompositePk_DeltaImport_delete

Error Message:
Exception during query

Stack Trace:
java.lang.RuntimeException: Exception during query
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:405)
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:372)
at 
org.apache.solr.handler.dataimport.TestSqlEntityProcessorDelta2.testCompositePk_DeltaImport_delete(TestSqlEntityProcessorDelta2.java:113)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382)
Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//*[@numFound='0']
xml response was: <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="start">0</str><str name="q">*:* OR testCompositePk_DeltaImport_delete</str><str name="qt">standard</str><str name="rows">20</str><str name="version">2.2</str></lst></lst><result name="response" numFound="1" start="0"><doc><str name="solr_id">prefix-1</str><arr name="desc"><str>hello</str></arr><date name="timestamp">2011-07-11T08:09:46.491Z</date></doc></result>
</response>

request was:start=0&q=*:*+OR+testCompositePk_DeltaImport_delete&qt=standard&rows=20&version=2.2
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:398)




Build Log (for compile errors):
[...truncated 11997 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-2647) DOMUtilTestBase should be abstract

2011-07-11 Thread Chris Male (JIRA)
DOMUtilTestBase should be abstract
--

 Key: SOLR-2647
 URL: https://issues.apache.org/jira/browse/SOLR-2647
 Project: Solr
  Issue Type: Improvement
Reporter: Chris Male
Priority: Trivial
 Attachments: SOLR-2647.patch

It serves as a base for other test classes that use the DOM.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2647) DOMUtilTestBase should be abstract

2011-07-11 Thread Chris Male (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Male updated SOLR-2647:
-

Attachment: SOLR-2647.patch

Patch.

 DOMUtilTestBase should be abstract
 --

 Key: SOLR-2647
 URL: https://issues.apache.org/jira/browse/SOLR-2647
 Project: Solr
  Issue Type: Improvement
Reporter: Chris Male
Priority: Trivial
 Attachments: SOLR-2647.patch


 It serves as a base for other test classes that use the DOM.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2564) Integrating grouping module into Solr 4.0

2011-07-11 Thread Martijn van Groningen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063231#comment-13063231
 ] 

Martijn van Groningen commented on SOLR-2564:
-

Since Lucene is now also Java 6, we can just change the code in 
AbstractFirstPassGroupingCollector, and then the TermFirstPassGroupingCollectorJava6 
in grouping.java is no longer needed, right?

 Integrating grouping module into Solr 4.0
 -

 Key: SOLR-2564
 URL: https://issues.apache.org/jira/browse/SOLR-2564
 Project: Solr
  Issue Type: Improvement
Reporter: Martijn van Groningen
Assignee: Martijn van Groningen
Priority: Blocker
 Fix For: 4.0

 Attachments: LUCENE-2564.patch, SOLR-2564.patch, SOLR-2564.patch, 
 SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, 
 SOLR-2564.patch, SOLR-2564_performance_loss_fix.patch


 Since work on grouping module is going well. I think it is time to wire this 
 up in Solr.
 Besides the current grouping features Solr provides, Solr will then also 
 support second pass caching and total count based on groups.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS] Lucene-Solr-tests-only-trunk - Build # 9493 - Still Failing

2011-07-11 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9493/

1 tests failed.
REGRESSION:  
org.apache.solr.handler.dataimport.TestSqlEntityProcessorDelta3.testCompositePk_DeltaImport_delete

Error Message:
Exception during query

Stack Trace:
java.lang.RuntimeException: Exception during query
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:405)
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:372)
at 
org.apache.solr.handler.dataimport.TestSqlEntityProcessorDelta3.testCompositePk_DeltaImport_delete(TestSqlEntityProcessorDelta3.java:111)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382)
Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//*[@numFound='0']
xml response was: <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="start">0</str><str name="q">*:* OR testCompositePk_DeltaImport_delete</str><str name="qt">standard</str><str name="rows">20</str><str name="version">2.2</str></lst></lst><result name="response" numFound="1" start="0"><doc><arr name="desc"><str>d1</str></arr><str name="id">2</str><date name="timestamp">2011-07-11T09:22:55.278Z</date></doc></result>
</response>

request was:start=0&q=*:*+OR+testCompositePk_DeltaImport_delete&qt=standard&rows=20&version=2.2
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:398)




Build Log (for compile errors):
[...truncated 12030 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2564) Integrating grouping module into Solr 4.0

2011-07-11 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063244#comment-13063244
 ] 

Simon Willnauer commented on SOLR-2564:
---

bq. Since Lucene is now also Java 6, we can just change the code in 
AbstractFirstPassGroupingCollector, and then the TermFirstPassGroupingCollectorJava6 
in grouping.java is no longer needed, right?
yes, that's right

 Integrating grouping module into Solr 4.0
 -

 Key: SOLR-2564
 URL: https://issues.apache.org/jira/browse/SOLR-2564
 Project: Solr
  Issue Type: Improvement
Reporter: Martijn van Groningen
Assignee: Martijn van Groningen
Priority: Blocker
 Fix For: 4.0

 Attachments: LUCENE-2564.patch, SOLR-2564.patch, SOLR-2564.patch, 
 SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, 
 SOLR-2564.patch, SOLR-2564_performance_loss_fix.patch


 Since work on grouping module is going well. I think it is time to wire this 
 up in Solr.
 Besides the current grouping features Solr provides, Solr will then also 
 support second pass caching and total count based on groups.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3295) BitVector never skips fully populated bytes when writing ClearedDgaps

2011-07-11 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063245#comment-13063245
 ] 

Simon Willnauer commented on LUCENE-3295:
-

thanks for resolving this mike

 BitVector never skips fully populated bytes when writing ClearedDgaps
 -

 Key: LUCENE-3295
 URL: https://issues.apache.org/jira/browse/LUCENE-3295
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/other
Affects Versions: 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 4.0

 Attachments: LUCENE-3295.patch, LUCENE-3295.patch


 When writing cleared DGaps in BitVector we compare a byte against 0xFF (255) 
 yet the byte is casted into an int (-1) and the comparison will never 
 succeed. We should mask the byte with 0xFF before comparing or compare 
 against -1
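
A standalone illustration of the sign-extension pitfall described above (a toy
example, not the BitVector source):

public class SignExtensionDemo {
    public static void main(String[] args) {
        byte b = (byte) 0xFF;                     // a fully populated byte
        System.out.println(b == 0xFF);            // false: b widens to int -1, 0xFF is 255
        System.out.println((b & 0xFF) == 0xFF);   // true: mask before comparing
        System.out.println(b == -1);              // true: or compare against -1
    }
}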

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS] Lucene-Solr-tests-only-3.x - Build # 9505 - Failure

2011-07-11 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/9505/

All tests passed

Build Log (for compile errors):
[...truncated 17572 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS] Lucene-Solr-tests-only-trunk - Build # 9494 - Still Failing

2011-07-11 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9494/

1 tests failed.
REGRESSION:  
org.apache.solr.handler.dataimport.TestSqlEntityProcessor2.testCompositePk_DeltaImport

Error Message:
Exception during query

Stack Trace:
java.lang.RuntimeException: Exception during query
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:405)
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:372)
at 
org.apache.solr.handler.dataimport.TestSqlEntityProcessor2.testCompositePk_DeltaImport(TestSqlEntityProcessor2.java:129)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382)
Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//*[@numFound='1']
xml response was: <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="start">0</str><str name="q">id:5</str><str name="qt">standard</str><str name="rows">20</str><str name="version">2.2</str></lst></lst><result name="response" numFound="0" start="0"></result>
</response>

request was:start=0&q=id:5&qt=standard&rows=20&version=2.2
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:398)




Build Log (for compile errors):
[...truncated 12034 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3296) Enable passing a config into PKIndexSplitter

2011-07-11 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-3296:


Attachment: LUCENE-3296.patch

here is a new patch. I added a second IWC since we cannot reuse IWC instances 
across IWs due to SetOnce restrictions. I also moved out the VERSION_CURRENT and 
made it a ctor argument. We should not randomly use VERSION_CURRENT but 
rather be consistent about which version we use.
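
For illustration, a sketch of how the splitter might then be invoked (the exact
constructor signature is an assumption based on this comment and the attached
patch; the rest uses the 3.x core API):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.PKIndexSplitter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.TermRangeFilter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SplitExample {
    public static void main(String[] args) throws Exception {
        Directory input = FSDirectory.open(new File("index"));
        Directory dir1 = FSDirectory.open(new File("split-1"));
        Directory dir2 = FSDirectory.open(new File("split-2"));
        // primary keys below "m" go to dir1, the rest to dir2
        Filter docsInFirstIndex = new TermRangeFilter("id", null, "m", true, false);
        // one IWC per writer: an IWC cannot be reused across IWs (SetOnce)
        IndexWriterConfig cfg1 = new IndexWriterConfig(Version.LUCENE_33, new StandardAnalyzer(Version.LUCENE_33));
        IndexWriterConfig cfg2 = new IndexWriterConfig(Version.LUCENE_33, new StandardAnalyzer(Version.LUCENE_33));
        PKIndexSplitter splitter = new PKIndexSplitter(input, dir1, dir2, docsInFirstIndex, cfg1, cfg2);
        splitter.split();
    }
}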

 Enable passing a config into PKIndexSplitter
 

 Key: LUCENE-3296
 URL: https://issues.apache.org/jira/browse/LUCENE-3296
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/other
Affects Versions: 3.3, 4.0
Reporter: Jason Rutherglen
Assignee: Simon Willnauer
Priority: Trivial
 Attachments: LUCENE-3296.patch, LUCENE-3296.patch, LUCENE-3296.patch


 I need to be able to pass the IndexWriterConfig into the IW used by 
 PKIndexSplitter.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Issue Comment Edited] (LUCENE-3296) Enable passing a config into PKIndexSplitter

2011-07-11 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063249#comment-13063249
 ] 

Simon Willnauer edited comment on LUCENE-3296 at 7/11/11 9:54 AM:
--

here is a new patch. I added a second IWC since we cannot reuse IWC instances 
across IWs due to SetOnce restrictions. I also moved out the VERSION_CURRENT and 
made it a ctor argument. We should not randomly use VERSION_CURRENT but 
rather be consistent about which version we use.

bq. Simon: The Version.LUCENE_CURRENT is not important here, for easier 
porting, the version should be LUCENE_CURRENT (and it was before Jason's 
patch). Else we will have to always upgrade it with every new release. The same 
applies to the IndexUpdater class in core, it also uses LUCENE_CURRENT when you do 
not pass in anything (as the version is completely useless for simple merge 
operations - like here).

not entirely true, we use the index splitter in 3.x and if you upgrade from 3.1 
to 3.2 you get a new merge policy by default which doesn't merge in order. I 
think it's a problem that this version is not in 3.x yet, so let's fix it properly 
and backport.

Simon

  was (Author: simonw):
here is a new patch. I added a second IWC since we can not reuse IWC 
instances across IW due to SetOnce restrictions. I also moved out the 
VERSION_CURRENT and made it a ctor argument. We should not randomly use the 
VERSION_CURRENT but rather be consistent when we use version.
  
 Enable passing a config into PKIndexSplitter
 

 Key: LUCENE-3296
 URL: https://issues.apache.org/jira/browse/LUCENE-3296
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/other
Affects Versions: 3.3, 4.0
Reporter: Jason Rutherglen
Assignee: Simon Willnauer
Priority: Trivial
 Attachments: LUCENE-3296.patch, LUCENE-3296.patch, LUCENE-3296.patch


 I need to be able to pass the IndexWriterConfig into the IW used by 
 PKIndexSplitter.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3296) Enable passing a config into PKIndexSplitter

2011-07-11 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063259#comment-13063259
 ] 

Uwe Schindler commented on LUCENE-3296:
---

bq. not entirely true, we use the index splitter in 3.x and if you upgrade from 
3.1 to 3.2 you get a new merge policy by default which doesn't merge in order. I 
think it's a problem that this version is not in 3.x yet, so let's fix it properly 
and backport.

PKIndexSplitter is new in 3.3, so you would never have used it with older versions...

 Enable passing a config into PKIndexSplitter
 

 Key: LUCENE-3296
 URL: https://issues.apache.org/jira/browse/LUCENE-3296
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/other
Affects Versions: 3.3, 4.0
Reporter: Jason Rutherglen
Assignee: Simon Willnauer
Priority: Trivial
 Attachments: LUCENE-3296.patch, LUCENE-3296.patch, LUCENE-3296.patch


 I need to be able to pass the IndexWriterConfig into the IW used by 
 PKIndexSplitter.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2564) Integrating grouping module into Solr 4.0

2011-07-11 Thread Martijn van Groningen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063272#comment-13063272
 ] 

Martijn van Groningen commented on SOLR-2564:
-

Hi Matteo, I can also confirm the bug; it only happens when group.main=true. I 
also think that this error occurs on the 3x code base. I'll provide a fix for this 
issue soon.

 Integrating grouping module into Solr 4.0
 -

 Key: SOLR-2564
 URL: https://issues.apache.org/jira/browse/SOLR-2564
 Project: Solr
  Issue Type: Improvement
Reporter: Martijn van Groningen
Assignee: Martijn van Groningen
Priority: Blocker
 Fix For: 4.0

 Attachments: LUCENE-2564.patch, SOLR-2564.patch, SOLR-2564.patch, 
 SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, 
 SOLR-2564.patch, SOLR-2564_performance_loss_fix.patch


 Since work on grouping module is going well. I think it is time to wire this 
 up in Solr.
 Besides the current grouping features Solr provides, Solr will then also 
 support second pass caching and total count based on groups.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2524) Adding grouping to Solr 3x

2011-07-11 Thread Yuriy Akopov (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063284#comment-13063284
 ] 

Yuriy Akopov commented on SOLR-2524:


I suppose I'm late with these questions, but could you please acknowledge if 
the following is correct:

1) The functionality from this patch was included in Solr 3.3, so there is no need to 
apply it to any version >= 3.3

2) This patch (as well as the collapsing functionality in 3.3) doesn't allow 
calculation of facet numbers after collapsing. Faceting is still possible for 
collapsed results but the numbers returned for facets are always calculated 
before collapsing the results.

3) In order to calculate facets after collapsing, LUCENE-3097 must be applied 
to Solr 3.3.

Thanks.

 Adding grouping to Solr 3x
 --

 Key: SOLR-2524
 URL: https://issues.apache.org/jira/browse/SOLR-2524
 Project: Solr
  Issue Type: New Feature
Reporter: Martijn van Groningen
Assignee: Martijn van Groningen
 Fix For: 3.3

 Attachments: SOLR-2524.patch, SOLR-2524.patch, SOLR-2524.patch, 
 SOLR-2524.patch, SOLR-2524.patch, SOLR-2524.patch


 Grouping was recently added to Lucene 3x. See LUCENE-1421 for more 
 information.
 I think it would be nice if we expose this functionality also to the Solr 
 users that are bound to a 3.x version.
 The grouping feature added to Lucene is currently a subset of the 
 functionality that Solr 4.0-trunk offers. Mainly it doesn't support grouping 
 by function / query.
 The work involved getting the grouping contrib to work on Solr 3x is 
 acceptable. I have it more or less running here. It supports the response 
 format and request parameters (except: group.query and group.func) described 
 in the FieldCollapse page on the Solr wiki.
 I think it would be great if this is included in the Solr 3.2 release. Many 
 people are using grouping as patch now and this would help them a lot. Any 
 thoughts?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS] Lucene-Solr-tests-only-trunk - Build # 9496 - Failure

2011-07-11 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9496/

2 tests failed.
REGRESSION:  
org.apache.solr.handler.dataimport.TestSqlEntityProcessor2.testCompositePk_FullImport

Error Message:
Exception during query

Stack Trace:
java.lang.RuntimeException: Exception during query
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:405)
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:372)
at 
org.apache.solr.handler.dataimport.TestSqlEntityProcessor2.testCompositePk_FullImport(TestSqlEntityProcessor2.java:66)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382)
Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//*[@numFound='1']
xml response was: <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">0</int><lst name="params"><str name="start">0</str><str name="q">id:1</str><str name="qt">standard</str><str name="rows">20</str><str name="version">2.2</str></lst></lst><result name="response" numFound="0" start="0"></result>
</response>

request was:start=0&q=id:1&qt=standard&rows=20&version=2.2
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:398)


REGRESSION:  
org.apache.solr.handler.dataimport.TestSqlEntityProcessorDelta3.testCompositePk_FullImport

Error Message:
Exception during query

Stack Trace:
java.lang.RuntimeException: Exception during query
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:405)
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:372)
at 
org.apache.solr.handler.dataimport.TestSqlEntityProcessorDelta3.add1document(TestSqlEntityProcessorDelta3.java:83)
at 
org.apache.solr.handler.dataimport.TestSqlEntityProcessorDelta3.testCompositePk_FullImport(TestSqlEntityProcessorDelta3.java:92)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382)
Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//*[@numFound='1']
xml response was: <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="start">0</str><str name="q">*:* OR add1document</str><str name="qt">standard</str><str name="rows">20</str><str name="version">2.2</str></lst></lst><result name="response" numFound="0" start="0"></result>
</response>

request was:start=0&q=*:*+OR+add1document&qt=standard&rows=20&version=2.2
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:398)




Build Log (for compile errors):
[...truncated 12155 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-3233) HuperDuperSynonymsFilter™

2011-07-11 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir reassigned LUCENE-3233:
---

Assignee: Robert Muir

 HuperDuperSynonymsFilter™
 -

 Key: LUCENE-3233
 URL: https://issues.apache.org/jira/browse/LUCENE-3233
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
Assignee: Robert Muir
 Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip


 The current synonymsfilter uses a lot of ram and cpu, especially at build 
 time.
 I think yesterday I heard about huge synonyms files three times.
 So, I think we should use an FST-based structure, sharing the inputs and 
 outputs.
 And we should be more efficient with the tokenStream api, e.g. using 
 save/restoreState instead of cloneAttributes()
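
A rough sketch of the save/restoreState idea (a toy filter that stacks a
duplicate of each token; captureState()/restoreState() replace
cloneAttributes(); this is not the new synonyms filter itself):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

final class DuplicateTokenFilter extends TokenFilter {
    private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
    private AttributeSource.State saved;

    DuplicateTokenFilter(TokenStream in) {
        super(in);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (saved != null) {
            restoreState(saved);                  // cheap replay of the buffered token
            posIncAtt.setPositionIncrement(0);    // stack it at the same position
            saved = null;
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        saved = captureState();                   // snapshot all attributes, no cloneAttributes()
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        saved = null;
    }
}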

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2647) DOMUtilTestBase should be abstract

2011-07-11 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063303#comment-13063303
 ] 

Steven Rowe commented on SOLR-2647:
---

+1

This is my mistake - thanks for fixing!

 DOMUtilTestBase should be abstract
 --

 Key: SOLR-2647
 URL: https://issues.apache.org/jira/browse/SOLR-2647
 Project: Solr
  Issue Type: Improvement
Reporter: Chris Male
Priority: Trivial
 Attachments: SOLR-2647.patch


 It serves as a base for other test classes that use the DOM.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3295) BitVector never skips fully populated bytes when writing ClearedDgaps

2011-07-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063312#comment-13063312
 ] 

Michael McCandless commented on LUCENE-3295:


Thank you for catching that something was amiss in the first place ;)  That's 
the hardest part.

 BitVector never skips fully populated bytes when writing ClearedDgaps
 -

 Key: LUCENE-3295
 URL: https://issues.apache.org/jira/browse/LUCENE-3295
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/other
Affects Versions: 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 4.0

 Attachments: LUCENE-3295.patch, LUCENE-3295.patch


 When writing cleared DGaps in BitVector we compare a byte against 0xFF (255) 
 yet the byte is casted into an int (-1) and the comparison will never 
 succeed. We should mask the byte with 0xFF before comparing or compare 
 against -1

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-2647) DOMUtilTestBase should be abstract

2011-07-11 Thread Chris Male (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Male resolved SOLR-2647.
--

   Resolution: Fixed
Fix Version/s: 4.0
 Assignee: Chris Male

Committed revision 1145154.

 DOMUtilTestBase should be abstract
 --

 Key: SOLR-2647
 URL: https://issues.apache.org/jira/browse/SOLR-2647
 Project: Solr
  Issue Type: Improvement
Reporter: Chris Male
Assignee: Chris Male
Priority: Trivial
 Fix For: 4.0

 Attachments: SOLR-2647.patch


 It serves as a base for other test classes that use the DOM.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2644) DIH handler - when using threads=2 the default logging is set too high

2011-07-11 Thread Shalin Shekhar Mangar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-2644:


Attachment: SOLR-2644.patch

This was probably added for debugging. Attached patch to remove the extra 
logging.

I'll commit shortly.

 DIH handler - when using threads=2 the default logging is set too high
 --

 Key: SOLR-2644
 URL: https://issues.apache.org/jira/browse/SOLR-2644
 Project: Solr
  Issue Type: Bug
  Components: contrib - DataImportHandler
Affects Versions: 3.3
Reporter: Bill Bell
 Fix For: 3.4, 4.0

 Attachments: SOLR-2644.patch


 Setting the threads parameter in the DIH handler causes every add to be logged at 
 INFO level.
 The only current solution is to set the following in log4j.properties:
 log4j.rootCategory=INFO, logfile
 log4j.logger.org.apache.solr.handler.dataimport.DocBuilder=FATAL
 log4j.logger.org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper=FATAL
 These 2 log messages need to be changed to < INFO.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2644) DIH handler - when using threads=2 the default logging is set too high

2011-07-11 Thread Shalin Shekhar Mangar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-2644:


Fix Version/s: 4.0
   3.4
 Assignee: Shalin Shekhar Mangar

 DIH handler - when using threads=2 the default logging is set too high
 --

 Key: SOLR-2644
 URL: https://issues.apache.org/jira/browse/SOLR-2644
 Project: Solr
  Issue Type: Bug
  Components: contrib - DataImportHandler
Affects Versions: 3.3
Reporter: Bill Bell
Assignee: Shalin Shekhar Mangar
 Fix For: 3.4, 4.0

 Attachments: SOLR-2644.patch


 Setting the threads parameter in the DIH handler causes every add to be logged at 
 INFO level.
 The only current solution is to set the following in log4j.properties:
 log4j.rootCategory=INFO, logfile
 log4j.logger.org.apache.solr.handler.dataimport.DocBuilder=FATAL
 log4j.logger.org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper=FATAL
 These 2 log messages need to be changed to < INFO.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import

2011-07-11 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063346#comment-13063346
 ] 

Shalin Shekhar Mangar commented on SOLR-2551:
-

bq. Doesn't every test of delta functionality write to the 
dataimport.properties file?

Yes, it does but I don't think any of our tests rely on the contents of the 
properties file.

Ironically, the fact that the tests failed is proof that this feature works :)

 Check dataimport.properties for write access before starting import
 ---

 Key: SOLR-2551
 URL: https://issues.apache.org/jira/browse/SOLR-2551
 Project: Solr
  Issue Type: Improvement
  Components: contrib - DataImportHandler
Affects Versions: 1.4.1, 3.1
Reporter: C S
Assignee: Shalin Shekhar Mangar
Priority: Minor
 Fix For: 3.3, 4.0

 Attachments: SOLR-2551.patch


 A common mistake is that the /conf (respectively the dataimport.properties) 
 file is not writable for solr. It would be great if that were detected on 
 starting a dataimport job. 
 Currently an import might grind away for days and fail if it can't write its 
 timestamp to the dataimport.properties file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-3233) HuperDuperSynonymsFilter™

2011-07-11 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-3233.
-

   Resolution: Fixed
Fix Version/s: 4.0
   3.4

 HuperDuperSynonymsFilter™
 -

 Key: LUCENE-3233
 URL: https://issues.apache.org/jira/browse/LUCENE-3233
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 3.4, 4.0

 Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip


 The current synonymsfilter uses a lot of ram and cpu, especially at build 
 time.
 I think yesterday I heard about huge synonyms files three times.
 So, I think we should use an FST-based structure, sharing the inputs and 
 outputs.
 And we should be more efficient with the tokenStream api, e.g. using 
 save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import

2011-07-11 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063352#comment-13063352
 ] 

Chris Male commented on SOLR-2551:
--

Okay, so, given that basically every test writes to this file, what are our 
options? 

To me it seems since the file is getting written to (whether we rely on the 
contents or not), this could get in the way of another test. So perhaps we need 
to pull the checkWritablePersistFile method out for a while and re-assess how to 
achieve the same functionality in a way the tests can handle? 

 Check dataimport.properties for write access before starting import
 ---

 Key: SOLR-2551
 URL: https://issues.apache.org/jira/browse/SOLR-2551
 Project: Solr
  Issue Type: Improvement
  Components: contrib - DataImportHandler
Affects Versions: 1.4.1, 3.1
Reporter: C S
Assignee: Shalin Shekhar Mangar
Priority: Minor
 Fix For: 3.3, 4.0

 Attachments: SOLR-2551.patch


 A common mistake is that the /conf (respectively the dataimport.properties) 
 file is not writable for solr. It would be great if that were detected on 
 starting a dataimport job. 
 Currently an import might grind away for days and fail if it can't write its 
 timestamp to the dataimport.properties file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import

2011-07-11 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063354#comment-13063354
 ] 

Chris Male commented on SOLR-2551:
--

or alternatively, we could make the DIH tests run sequentially, so we don't hit 
this problem.

 Check dataimport.properties for write access before starting import
 ---

 Key: SOLR-2551
 URL: https://issues.apache.org/jira/browse/SOLR-2551
 Project: Solr
  Issue Type: Improvement
  Components: contrib - DataImportHandler
Affects Versions: 1.4.1, 3.1
Reporter: C S
Assignee: Shalin Shekhar Mangar
Priority: Minor
 Fix For: 3.3, 4.0

 Attachments: SOLR-2551.patch


 A common mistake is that the /conf (respectively the dataimport.properties) 
 file is not writable for solr. It would be great if that were detected on 
 starting a dataimport job. 
 Currently an import might grind away for days and fail if it can't write its 
 timestamp to the dataimport.properties file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import

2011-07-11 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063355#comment-13063355
 ] 

Shalin Shekhar Mangar commented on SOLR-2551:
-

Yes, let's disable this test for now. I don't think it is even worth testing. I 
guess I just had too much time that day :)

Another option could be to run the DIH tests sequentially.

 Check dataimport.properties for write access before starting import
 ---

 Key: SOLR-2551
 URL: https://issues.apache.org/jira/browse/SOLR-2551
 Project: Solr
  Issue Type: Improvement
  Components: contrib - DataImportHandler
Affects Versions: 1.4.1, 3.1
Reporter: C S
Assignee: Shalin Shekhar Mangar
Priority: Minor
 Fix For: 3.3, 4.0

 Attachments: SOLR-2551.patch


 A common mistake is that the /conf (respectively the dataimport.properties) 
 file is not writable for solr. It would be great if that were detected on 
 starting a dataimport job. 
 Currently an import might grind away for days and fail if it can't write its 
 timestamp to the dataimport.properties file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import

2011-07-11 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063366#comment-13063366
 ] 

Steven Rowe commented on SOLR-2551:
---

I'll switch the DIH tests to run sequentially.  The benchmark module does this 
by setting the {{tests.threadspercpu}} property to zero.

Here's the patch:

{{
Index: solr/contrib/dataimporthandler/build.xml
===================================================================
--- solr/contrib/dataimporthandler/build.xml    (revision 1145189)
+++ solr/contrib/dataimporthandler/build.xml    (working copy)
@@ -23,6 +23,9 @@
     Data Import Handler
   </description>
 
+  <!-- the tests have some parallel problems: writability to single copy of dataimport.properties -->
+  <property name="tests.threadspercpu" value="0"/>
+
   <import file="../contrib-build.xml"/>
 
 </project>
}}

Committing shortly.

 Check dataimport.properties for write access before starting import
 ---

 Key: SOLR-2551
 URL: https://issues.apache.org/jira/browse/SOLR-2551
 Project: Solr
  Issue Type: Improvement
  Components: contrib - DataImportHandler
Affects Versions: 1.4.1, 3.1
Reporter: C S
Assignee: Shalin Shekhar Mangar
Priority: Minor
 Fix For: 3.3, 4.0

 Attachments: SOLR-2551.patch


 A common mistake is that the /conf (respectively the dataimport.properties) 
 file is not writable for solr. It would be great if that were detected on 
 starting a dataimport job. 
 Currently an import might grind away for days and fail if it can't write its 
 timestamp to the dataimport.properties file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import

2011-07-11 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063369#comment-13063369
 ] 

Steven Rowe commented on SOLR-2551:
---

Committed the patch to run DIH tests sequentially:
- r1145194: trunk
- r1145196: branch_3x

 Check dataimport.properties for write access before starting import
 ---

 Key: SOLR-2551
 URL: https://issues.apache.org/jira/browse/SOLR-2551
 Project: Solr
  Issue Type: Improvement
  Components: contrib - DataImportHandler
Affects Versions: 1.4.1, 3.1
Reporter: C S
Assignee: Shalin Shekhar Mangar
Priority: Minor
 Fix For: 3.3, 4.0

 Attachments: SOLR-2551.patch


 A common mistake is that the /conf (respectively the dataimport.properties) 
 file is not writable for solr. It would be great if that were detected on 
 starting a dataimport job. 
 Currently an import might grind away for days and fail if it can't write its 
 timestamp to the dataimport.properties file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2382) DIH Cache Improvements

2011-07-11 Thread James Dyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063372#comment-13063372
 ] 

James Dyer commented on SOLR-2382:
--

Noble,

Are you still able to work with me on this issue?  Is there anything else you 
are waiting for from me?  The patch I submitted on June 24 passes parameters 
via the Context object as you requested.  Also, I previously separated 
BerkleyBackedCache out into a separate issue (SOLR-2613) so we won't run 
into licensing issues here.  Let me know what else you think we need to do.  
Thanks.

 DIH Cache Improvements
 --

 Key: SOLR-2382
 URL: https://issues.apache.org/jira/browse/SOLR-2382
 Project: Solr
  Issue Type: New Feature
  Components: contrib - DataImportHandler
Reporter: James Dyer
Priority: Minor
 Attachments: SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
 SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
 SOLR-2382.patch


 Functionality:
  1. Provide a pluggable caching framework for DIH so that users can choose a 
 cache implementation that best suits their data and application.
  
  2. Provide a means to temporarily cache a child Entity's data without 
 needing to create a special cached implementation of the Entity Processor 
 (such as CachedSqlEntityProcessor).
  
  3. Provide a means to write the final (root entity) DIH output to a cache 
 rather than to Solr.  Then provide a way for a subsequent DIH call to use the 
 cache as an Entity input.  Also provide the ability to do delta updates on 
 such persistent caches.
  
  4. Provide the ability to partition data across multiple caches that can 
 then be fed back into DIH and indexed either to varying Solr Shards, or to 
 the same Core in parallel.
 Use Cases:
  1. We needed a flexible  scalable way to temporarily cache child-entity 
 data prior to joining to parent entities.
   - Using SqlEntityProcessor with Child Entities can cause an n+1 select 
 problem.
   - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching 
 mechanism and does not scale.
   - There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
  
  2. We needed the ability to gather data from long-running entities by a 
 process that runs separate from our main indexing process.
   
  3. We wanted the ability to do a delta import of only the entities that 
 changed.
   - Lucene/Solr requires entire documents to be re-indexed, even if only a 
 few fields changed.
   - Our data comes from 50+ complex sql queries and/or flat files.
   - We do not want to incur overhead re-gathering all of this data if only 1 
 entity's data changed.
   - Persistent DIH caches solve this problem.
   
  4. We want the ability to index several documents in parallel (using 1.4.1, 
 which did not have the threads parameter).
  
  5. In the future, we may need to use Shards, creating a need to easily 
 partition our source data into Shards.
 Implementation Details:
  1. De-couple EntityProcessorBase from caching.  
   - Created a new interface, DIHCache  two implementations:  
 - SortedMapBackedCache - An in-memory cache, used as default with 
 CachedSqlEntityProcessor (now deprecated).
 - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested 
 with je-4.1.6.jar
- NOTE: the existing Lucene Contrib db project uses je-3.3.93.jar.  
 I believe this may be incompatible due to Generic Usage.
- NOTE: I did not modify the ant script to automatically get this jar, 
 so to use or evaluate this patch, download bdb-je from 
 http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html 
  
  2. Allow Entity Processors to take a cacheImpl parameter to cause the 
 entity data to be cached (see EntityProcessorBase  DIHCacheProperties).
  
  3. Partially De-couple SolrWriter from DocBuilder
   - Created a new interface DIHWriter,  two implementations:
- SolrWriter (refactored)
- DIHCacheWriter (allows DIH to write ultimately to a Cache).

  4. Create a new Entity Processor, DIHCacheProcessor, which reads a 
 persistent Cache as DIH Entity Input.
  
  5. Support a partition parameter with both DIHCacheWriter and 
 DIHCacheProcessor to allow for easy partitioning of source entity data.
  
  6. Change the semantics of entity.destroy()
   - Previously, it was being called on each iteration of 
 DocBuilder.buildDocument().
   - Now it is does one-time cleanup tasks (like closing or deleting a 
 disk-backed cache) once the entity processor is completed.
   - The only out-of-the-box entity processor that previously implemented 
 destroy() was LineEntityProcessor, so this is not a very invasive change.
 General Notes:
 We are near completion in converting our search functionality from a legacy 
 search engine to Solr.  However, I found that DIH did 

[jira] [Resolved] (SOLR-2615) Have LogUpdateProcessor log each command (add, delete, ...) at debug/FINE level

2011-07-11 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley resolved SOLR-2615.


Resolution: Fixed

 Have LogUpdateProcessor log each command (add, delete, ...) at debug/FINE 
 level
 ---

 Key: SOLR-2615
 URL: https://issues.apache.org/jira/browse/SOLR-2615
 Project: Solr
  Issue Type: Improvement
  Components: update
Reporter: David Smiley
Assignee: Yonik Seeley
Priority: Minor
 Fix For: 3.4, 4.0

 Attachments: SOLR-2615_LogUpdateProcessor_debug_logging.patch, 
 SOLR-2615_LogUpdateProcessor_debug_logging.patch


 It would be great if the LogUpdateProcessor logged each command (add, delete, 
 ...) at debug (Fine) level. Presently it only logs a summary of 8 commands 
 and it does so at the very end.
 The attached patch implements this.
 * I moved the LogUpdateProcessor ahead of RunUpdateProcessor so that the 
 debug level log happens before Solr does anything with it. It should not 
 affect the ordering of the existing summary log which happens at finish(). 
 * I changed UpdateRequestProcessor's static log variable to be an instance 
 variable that uses the current class name. I think this makes much more sense 
 since I want to be able to alter logging levels for a specific processor 
 without doing it for all of them. This change did require me to tweak the 
 factory's detection of the log level which avoids creating the 
 LogUpdateProcessor.
 * There was an NPE bug in AddUpdateCommand.getPrintableId() in the event 
 there is no schema unique field. I fixed that.
 You may notice I use SLF4J's nifty log.debug("message blah {} blah", var) 
 syntax, which is both performant and concise, as there's no point in guarding 
 the debug message with an isDebugEnabled() since debug() will internally 
 check this anyway and there is no string concatenation if debug isn't 
 enabled.
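
 For instance (a minimal illustrative sketch; the class and message here are invented, not from the patch):
{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class LogDemo {
  // an instance (per-class) logger, so levels can be tuned for one
  // processor without changing them for all of them
  private final Logger log = LoggerFactory.getLogger(getClass());

  void process(Object cmd) {
    // no isDebugEnabled() guard and no concatenation: the message is
    // only formatted if debug logging is actually enabled
    log.debug("processing command {}", cmd);
  }
}
{code}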

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2452) rewrite solr build system

2011-07-11 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063394#comment-13063394
 ] 

Yonik Seeley commented on SOLR-2452:


bq. What's the right thing to do here in terms of a patch against the old file 
structure? Is it reasonable to check out fresh code, hack the patch file to 
reflect the new paths and apply it to the new structure or must I re-edit the 
source?

That's what I did.

bq. And is SVN merge smart enough to deal when merging from trunk to 3x when 3x 
hasn't been changed, or is it better to just wait on it all until the back-port 
is done?

Apply the changes in 3x however you can (i.e. patch, etc) and then use svn 
merge --record-only.  http://wiki.apache.org/lucene-java/SvnMerge
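
For example, an illustrative sequence (the revision number is a placeholder for the actual trunk commit):
{code}
cd branch_3x
patch -p0 < backported.patch
svn commit -m "SOLR-2452: backport from trunk"
# mark the trunk revision as merged without touching any files
svn merge --record-only -c <trunk-rev> https://svn.apache.org/repos/asf/lucene/dev/trunk .
svn commit -m "SOLR-2452: record-only merge"
{code}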

 rewrite solr build system
 -

 Key: SOLR-2452
 URL: https://issues.apache.org/jira/browse/SOLR-2452
 Project: Solr
  Issue Type: Task
  Components: Build
Reporter: Robert Muir
Assignee: Steven Rowe
 Fix For: 3.4, 4.0

 Attachments: SOLR-2452-post-reshuffling.patch, 
 SOLR-2452-post-reshuffling.patch, SOLR-2452-post-reshuffling.patch, 
 SOLR-2452.diffSource.py.patch.zip, SOLR-2452.dir.reshuffle.sh, 
 SOLR-2452.dir.reshuffle.sh


 As discussed some in SOLR-2002 (but that issue is long and hard to follow), I 
 think we should rewrite the solr build system.
 Its slow, cumbersome, and messy, and makes it hard for us to improve things.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



heads up: reindex trunk indexes

2011-07-11 Thread Robert Muir
I just committed https://issues.apache.org/jira/browse/LUCENE-3233,
which includes improvements that change the format of the terms index.

You should reindex.

-- 
lucidimagination.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2452) rewrite solr build system

2011-07-11 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063418#comment-13063418
 ] 

Yonik Seeley commented on SOLR-2452:


The script produced output like this:

{code}
Index: solr/core/src/java/org/apache/solr/core/SolrCore.java
===================================================================
--- solr/src/java/org/apache/solr/core/SolrCore.java    (revision 80231429dc9c7680375a0a21b1886e59b194)
+++ solr/src/java/org/apache/solr/core/SolrCore.java    (revision )
{code}

Notice that "core" wasn't substituted on the lines starting with --- and +++

Trying to use the resulting patch file, I got:
{code}
/opt/code/lusolr$ patch -p0 < tt.patch
can't find file to patch at input line 5
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|Index: solr/core/src/java/org/apache/solr/core/SolrCore.java
|===================================================================
|--- solr/src/java/org/apache/solr/core/SolrCore.java   (revision 80231429dc9c7680375a0a21b1886e59b194)
|+++ solr/src/java/org/apache/solr/core/SolrCore.java   (revision )
--------------------------
{code}

 rewrite solr build system
 -

 Key: SOLR-2452
 URL: https://issues.apache.org/jira/browse/SOLR-2452
 Project: Solr
  Issue Type: Task
  Components: Build
Reporter: Robert Muir
Assignee: Steven Rowe
 Fix For: 3.4, 4.0

 Attachments: SOLR-2452-post-reshuffling.patch, 
 SOLR-2452-post-reshuffling.patch, SOLR-2452-post-reshuffling.patch, 
 SOLR-2452.diffSource.py.patch.zip, SOLR-2452.dir.reshuffle.sh, 
 SOLR-2452.dir.reshuffle.sh, SOLR-2452.patch.hack.pl


 As discussed some in SOLR-2002 (but that issue is long and hard to follow), I 
 think we should rewrite the solr build system.
 Its slow, cumbersome, and messy, and makes it hard for us to improve things.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2452) rewrite solr build system

2011-07-11 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063419#comment-13063419
 ] 

Steven Rowe commented on SOLR-2452:
---

Thanks Yonik - I'll fix it


 rewrite solr build system
 -

 Key: SOLR-2452
 URL: https://issues.apache.org/jira/browse/SOLR-2452
 Project: Solr
  Issue Type: Task
  Components: Build
Reporter: Robert Muir
Assignee: Steven Rowe
 Fix For: 3.4, 4.0

 Attachments: SOLR-2452-post-reshuffling.patch, 
 SOLR-2452-post-reshuffling.patch, SOLR-2452-post-reshuffling.patch, 
 SOLR-2452.diffSource.py.patch.zip, SOLR-2452.dir.reshuffle.sh, 
 SOLR-2452.dir.reshuffle.sh, SOLR-2452.patch.hack.pl


 As discussed some in SOLR-2002 (but that issue is long and hard to follow), I 
 think we should rewrite the solr build system.
 Its slow, cumbersome, and messy, and makes it hard for us to improve things.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2452) rewrite solr build system

2011-07-11 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated SOLR-2452:
--

Attachment: SOLR-2452.patch.hack.pl

This version of the patch hacking script is fixed so that all paths are 
modified instead of just the ones on 'Index:' lines.
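
Roughly, the idea is the following (an illustrative Java sketch, not the actual Perl script; it covers only the solr/src to solr/core/src case):
{code}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class PatchPathHack {
  public static void main(String[] args) throws IOException {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    for (String line; (line = in.readLine()) != null; ) {
      // rewrite "Index:", "---" and "+++" headers alike -- missing the
      // latter two is exactly what made patch unable to find the files
      System.out.println(line.replaceAll(
          "^((?:Index: |--- |\\+\\+\\+ )?)solr/src/", "$1solr/core/src/"));
    }
  }
}
{code}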

 rewrite solr build system
 -

 Key: SOLR-2452
 URL: https://issues.apache.org/jira/browse/SOLR-2452
 Project: Solr
  Issue Type: Task
  Components: Build
Reporter: Robert Muir
Assignee: Steven Rowe
 Fix For: 3.4, 4.0

 Attachments: SOLR-2452-post-reshuffling.patch, 
 SOLR-2452-post-reshuffling.patch, SOLR-2452-post-reshuffling.patch, 
 SOLR-2452.diffSource.py.patch.zip, SOLR-2452.dir.reshuffle.sh, 
 SOLR-2452.dir.reshuffle.sh, SOLR-2452.patch.hack.pl, SOLR-2452.patch.hack.pl


 As discussed some in SOLR-2002 (but that issue is long and hard to follow), I 
 think we should rewrite the solr build system.
 Its slow, cumbersome, and messy, and makes it hard for us to improve things.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import

2011-07-11 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063424#comment-13063424
 ] 

Shalin Shekhar Mangar commented on SOLR-2551:
-

Thanks Steven!

 Check dataimport.properties for write access before starting import
 ---

 Key: SOLR-2551
 URL: https://issues.apache.org/jira/browse/SOLR-2551
 Project: Solr
  Issue Type: Improvement
  Components: contrib - DataImportHandler
Affects Versions: 1.4.1, 3.1
Reporter: C S
Assignee: Shalin Shekhar Mangar
Priority: Minor
 Fix For: 3.3, 4.0

 Attachments: SOLR-2551.patch


 A common mistake is that the /conf (respectively the dataimport.properties) 
 file is not writable for solr. It would be great if that were detected on 
 starting a dataimport job. 
 Currently and import might grind away for days and fail if it can't write its 
 timestamp to the dataimport.properties file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [Lucene.Net] Incubator Status Page

2011-07-11 Thread digy digy
On Sun, Jul 10, 2011 at 6:24 PM, Stefan Bodewig bode...@apache.org wrote:

 Hi all,

 http://incubator.apache.org/projects/lucene.net.html contains quite a
 few blanks that I think we could easily fill.  I intend to either add
 some N/A or real dates where I can during the coming week.

 On the IP issues part (copyright and distribution rights) I trust the
 Lucene PMC has been taking care of this before Lucene.NET headed back to
 the Incubator and after that all contributions have come either directly
 by people with a CLA on file or as patches via JIRA where the "ASF may 
 use this" checkbox has been checked - is this correct?


absolutely.



 For the project specific tasks I'd ask all of you to fill in whatever
 you feel like adding.  All Lucene.NET committers should be able to
 modify the status page.

 Stefan


DIGY


[jira] [Created] (SOLR-2648) improve interaction of synonymsfilterfactory with analysis chain

2011-07-11 Thread Robert Muir (JIRA)
improve interaction of synonymsfilterfactory with analysis chain


 Key: SOLR-2648
 URL: https://issues.apache.org/jira/browse/SOLR-2648
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.4, 4.0
Reporter: Robert Muir


Spinoff of LUCENE-3233 (there is a TODO here), this was also mentioned by Otis 
on the mailing list: 
http://www.lucidimagination.com/search/document/8e91f858314562e/automatic_synonyms_for_multiple_variations_of_a_word#76c3d09f95f7a58f

As of LUCENE-3233, the builder for the synonyms structure uses an Analyzer 
behind the scenes to actually tokenize the synonyms in your synonyms file.
Currently the solr factory uses a WhitespaceTokenizer, unless you supply the 
tokenizerchain parameter, which lets you specify a tokenizer.

If there was some way to instead specify a chain to this factory (e.g. 
charfilters, tokenizer, tokenfilter such as stemmers) versus just a 
tokenizerfactory, 
it would be a lot more flexible (e.g. it would stem your synonyms for you), and 
would solve this use case.

Personally I think it would be most ideal if this just worked automatically, e.g. 
if you have a chain of A, B, SynonymsFilter, C, D: then in my opinion the 
synonyms should be analyzed with an analysis chain of A, B. This way the 
injected synonyms are processed as if they were in the tokenstream to begin 
with.

Note: there are some limitations here to what the chain can do, e.g. you can't 
be putting WDF before synonyms or other things that muck with positions, and 
you can't have a synonym that analyzes to nothing at all, but the parser checks 
for all these conditions and throws a syntax error so it would be clear to the 
user that they put the synonymsfilter in the wrong place in their chain.
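
As a rough illustration (field type name and file names invented for the example): with a chain like the one below, the entries in synonyms.txt would ideally be analyzed by the two stages before the synonym filter (A, B), not by a bare whitespace tokenizer:
{code}
<fieldType name="text_syn" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>   <!-- A -->
    <filter class="solr.LowerCaseFilterFactory"/>        <!-- B -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>       <!-- C -->
  </analyzer>
</fieldType>
{code}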


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-3280) Add new bit set impl for caching filters

2011-07-11 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-3280.


Resolution: Fixed

 Add new bit set impl for caching filters
 

 Key: LUCENE-3280
 URL: https://issues.apache.org/jira/browse/LUCENE-3280
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.4, 4.0

 Attachments: LUCENE-3280.patch, LUCENE-3280.patch


 I think OpenBitSet is trying to satisfy too many audiences, and it's
 confusing/error-prone as a result.  It has int/long variants of many
 methods.  Some methods require in-bound access, others don't; of those
 others, some methods auto-grow the bits, some don't.  OpenBitSet
 doesn't always know its numBits.
 I'd like to factor out a more focused bit set impl whose primary
 target usage is a cached Lucene Filter, ie a bit set indexed by docID
 (int, not long) whose size is known and fixed up front (backed by
 final long[]) and is always accessed in-bounds.
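
 A minimal sketch of the intended shape (names invented here, not the committed class):
{code}
// fixed size known up front, backed by a final long[], always accessed
// in-bounds, int docIDs only -- no auto-grow, no int/long method variants
public final class FixedDocIdBitSet {
  private final long[] words;
  private final int numBits;

  public FixedDocIdBitSet(int numBits) {
    this.numBits = numBits;
    this.words = new long[(numBits + 63) >>> 6]; // one long per 64 docIDs
  }

  public void set(int docID) {
    assert docID >= 0 && docID < numBits; // in-bounds by contract
    words[docID >>> 6] |= 1L << (docID & 63);
  }

  public boolean get(int docID) {
    assert docID >= 0 && docID < numBits;
    return (words[docID >>> 6] & (1L << (docID & 63))) != 0;
  }

  public int length() {
    return numBits;
  }
}
{code}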

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2048) Omit positions but keep termFreq

2011-07-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063455#comment-13063455
 ] 

Robert Muir commented on LUCENE-2048:
-

I created a throwaway branch, branches/omitp, to hopefully sucker Mike into 
helping me with some random fails (always pulsing is involved!).

In general, the pulsing cutover was tricky for me.


 Omit positions but keep termFreq
 

 Key: LUCENE-2048
 URL: https://issues.apache.org/jira/browse/LUCENE-2048
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.1
Reporter: Andrzej Bialecki 
Assignee: Robert Muir
 Fix For: 4.0

 Attachments: LUCENE-2048.patch


 it would be useful to have an option to discard positional information but 
 still keep the term frequency - currently setOmitTermFreqAndPositions 
 discards both. Even though position-dependent queries wouldn't work in such 
 case, still any other queries would work fine and we would get the right 
 scoring.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3293) Use IOContext.READONCE in VarGapTermsIndexReader to load FST

2011-07-11 Thread Varun Thacker (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Thacker updated LUCENE-3293:
--

Attachment: LUCENE-3293.patch

Also edited SegmentReader#loadLiveDocs 

 Use IOContext.READONCE in VarGapTermsIndexReader to load FST
 

 Key: LUCENE-3293
 URL: https://issues.apache.org/jira/browse/LUCENE-3293
 Project: Lucene - Java
  Issue Type: Task
  Components: core/codecs
Affects Versions: 4.0
Reporter: Simon Willnauer
Assignee: Varun Thacker
Priority: Minor
 Fix For: 4.0

 Attachments: LUCENE-3293.patch


 VarGapTermsIndexReader should pass READONCE context down when it
 opens/reads the FST. Yet, it should just replace the ctx passed in, ie if we 
 are merging vs reading we want to differentiate.
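
 A hedged sketch of the idea (helper and class names invented; the real change lives inside VarGapTermsIndexReader):
{code}
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

class ReadOnceSketch {
  // the FST file is slurped once at open time, so advertise that to the
  // directory instead of forwarding the caller's (merge or read) context
  static IndexInput openFstInput(Directory dir, String name) throws IOException {
    return dir.openInput(name, IOContext.READONCE);
  }
}
{code}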

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2452) rewrite solr build system

2011-07-11 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063488#comment-13063488
 ] 

Steven Rowe commented on SOLR-2452:
---

If there are no objections, I plan on committing the patch hacking script to 
{{dev-tools/scripts/}} later today.

 rewrite solr build system
 -

 Key: SOLR-2452
 URL: https://issues.apache.org/jira/browse/SOLR-2452
 Project: Solr
  Issue Type: Task
  Components: Build
Reporter: Robert Muir
Assignee: Steven Rowe
 Fix For: 3.4, 4.0

 Attachments: SOLR-2452-post-reshuffling.patch, 
 SOLR-2452-post-reshuffling.patch, SOLR-2452-post-reshuffling.patch, 
 SOLR-2452.diffSource.py.patch.zip, SOLR-2452.dir.reshuffle.sh, 
 SOLR-2452.dir.reshuffle.sh, SOLR-2452.patch.hack.pl, SOLR-2452.patch.hack.pl


 As discussed some in SOLR-2002 (but that issue is long and hard to follow), I 
 think we should rewrite the solr build system.
 Its slow, cumbersome, and messy, and makes it hard for us to improve things.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS] Lucene-Solr-tests-only-3.x - Build # 9511 - Failure

2011-07-11 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/9511/

No tests ran.

Build Log (for compile errors):
[...truncated 3060 lines...]
[javac] found   : java.util.Collection
[javac] required: java.util.Collection<java.lang.String>
[javac] public Collection getFileNames() throws IOException {
[javac]   ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/core/IndexDeletionPolicyWrapper.java:211:
 warning: getUserData() in 
org.apache.solr.core.IndexDeletionPolicyWrapper.IndexCommitWrapper overrides 
getUserData() in org.apache.lucene.index.IndexCommit; return type requires 
unchecked conversion
[javac] found   : java.util.Map
[javac] required: java.util.Map<java.lang.String,java.lang.String>
[javac] public Map getUserData() throws IOException {
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.java:173:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("handlerStart",handlerStart);
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.java:174:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("requests", numRequests);
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.java:175:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("errors", numErrors);
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.java:176:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("timeouts", numTimeouts);
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.java:177:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("totalTime",totalTime);
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.java:178:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("avgTimePerRequest", (float) totalTime / (float) 
this.numRequests);
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.java:179:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("avgRequestsPerSecond", (float) numRequests*1000 / 
(float)(System.currentTimeMillis()-handlerStart));   
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/admin/CoreAdminHandler.java:213:
 warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.util.RefCounted[]
[javac] required: 
org.apache.solr.util.RefCounted<org.apache.solr.search.SolrIndexSearcher>[]
[javac]   searchers = new RefCounted[sourceCores.length];
[javac]   ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/component/ResponseBuilder.java:291:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac]   rsp.getResponseHeader().add( "partialResults", Boolean.TRUE );
[javac]  ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/search/FunctionQParser.java:254:
 warning: [unchecked] unchecked conversion
[javac] found   : java.util.HashMap
[javac] required: java.util.Map<java.lang.String,java.lang.String>
[javac]   int end = QueryParsing.parseLocalParams(qs, start, 
nestedLocalParams, getParams());
[javac]  ^
[javac] 

[jira] [Resolved] (LUCENE-3289) FST should allow controlling how hard builder tries to share suffixes

2011-07-11 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-3289.


Resolution: Fixed

 FST should allow controlling how hard builder tries to share suffixes
 -

 Key: LUCENE-3289
 URL: https://issues.apache.org/jira/browse/LUCENE-3289
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.4, 4.0

 Attachments: LUCENE-3289.patch, LUCENE-3289.patch


 Today we have a boolean option to the FST builder telling it whether
 it should share suffixes.
 If you turn this off, building is much faster, uses much less RAM, and
 the resulting FST is a prefix trie.  But, the FST is larger than it
 needs to be.  When it's on, the builder maintains a node hash holding
 every node seen so far in the FST -- this uses up RAM and slows things
 down.
 On a dataset that Elmer (see the java-user thread "Autocompletion on large
 index" on Jul 6 2011) provided (thank you!), which is 1.32 M titles,
 avg 67.3 chars per title, building with suffix sharing on took 22.5
 seconds, required 1.25 GB heap, and produced a 91.6 MB FST.  With suffix
 sharing off, it was 8.2 seconds, 450 MB heap and a 129 MB FST.
 I think we should allow this boolean to be a shade-of-gray instead:
 usually, how well suffixes can share is a function of how far they are
 from the end of the string, so, by adding a tunable N to only share
 when suffix length < N, we can let the caller make reasonable tradeoffs. 
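
 To illustrate the tradeoff with a toy example (a self-contained sketch; nothing here is the real Builder API): only suffixes of length <= N are offered to the dedup hash, everything farther from the end of the string is emitted directly.
{code}
import java.util.HashMap;
import java.util.Map;

public class TailSharingSketch {
  public static void main(String[] args) {
    int maxTailLength = 3; // the tunable N
    // stands in for the node hash the builder maintains
    Map<String, Integer> shared = new HashMap<String, Integer>();
    String[] words = { "walking", "talking", "walked", "talked" };
    int emitted = 0;
    for (String w : words) {
      for (int i = 0; i < w.length(); i++) {
        String suffix = w.substring(i);
        if (suffix.length() <= maxTailLength) {
          // short tails repeat often and are cheap to hash: dedup them
          if (!shared.containsKey(suffix)) {
            shared.put(suffix, emitted);
            emitted++;
          }
        } else {
          emitted++; // long tails rarely repeat: skip the hash, emit directly
        }
      }
    }
    System.out.println(emitted + " nodes emitted, "
        + shared.size() + " suffixes shared");
  }
}
{code}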

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: [JENKINS] Lucene-Solr-tests-only-3.x - Build # 9511 - Failure

2011-07-11 Thread Steven A Rowe
This compilation failure is down to @Override annotations - I've committed the 
fix (removing the annotations):

[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/analysis/FSTSynonymFilterFactory.java:57:
 method does not override a method from its superclass
[javac]   @Override
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/analysis/FSTSynonymFilterFactory.java:62:
 method does not override a method from its superclass
[javac]   @Override
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/analysis/SynonymFilterFactory.java:43:
 method does not override a method from its superclass
[javac]   @Override
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/analysis/SynonymFilterFactory.java:49:
 method does not override a method from its superclass
[javac]   @Override
[javac]^



[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Mike Sokolov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063499#comment-13063499
 ] 

Mike Sokolov commented on LUCENE-2878:
--

OK I think I brushed by some of your comments, Simon, in my hasty response, 
sorry.  Here's a little more thought, I hope:

bq. So bottom line here is that we need an api that is capable of collecting 
fine grained parts of the scorer tree. The only way I see doing this is 1. have 
a subscribe / register method and 2. do this subscription during scorer 
creation. Once we have this we can implement very simple collect methods that 
only collect positions for the current match like in a near query, while the 
current matching document is collected all contributing TermScorers have their 
positioninterval ready for collection. The collect method can then be called 
from the consumer instead of in the loop this way we only get the positions we 
need since we know the document we are collecting.

I *think* it's necessary to have both a callback from within the scoring loop, 
and a mechanism for iterating over the current state of the iterator.  For 
boolean queries, the positions will never be iterated in the scoring loop (all 
you care about is the frequencies; positions are ignored), so some new process 
has to cause the iteration to be performed: either the position collector 
(highlighter, say), or a loop in the scorer that knows positions are being 
consumed (needsPositions==true).  But for position-aware queries (like phrases), the 
scorer *will* iterate over positions, and in order to score properly, I think 
the Scorer has to drive the iteration?  I tried a few different approaches at 
this before deciding to just push the iteration into the Scorer, but none of 
them really worked properly.

Let's say, for example, that a document is collected.  Then the position 
consumer comes in to find out what positions were matched - it may already be too 
late, because during scoring, some of the positions may have been consumed (eg 
to score phrases)?  It's possible I may be suffering from some delusion, though 
:)  But if I'm right, then it means there has to be some sort of callback 
mechanism in place *during scoring*, or else we have to resign ourselves to 
scoring first, and then re-setting and iterating positions in a second pass.

I actually think that if we follow through with the 
registration-during-construction idea, we can have the tests done in an 
efficient way during scoring (with final boolean properties of the scorers), 
and it can be OK to have them in the scoring loop.
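
One possible shape for such a during-scoring callback (purely illustrative, not the patch's API):
{code}
// registered with position-aware scorers at construction time; the scorer
// pushes each matched position here *before* consuming it, so nothing is
// lost by the time the document itself is collected
interface PositionCollector {
  void collectPosition(int doc, int start, int end);
}
{code}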

 Allow Scorer to expose positions and payloads aka. nuke spans 
 --

 Key: LUCENE-2878
 URL: https://issues.apache.org/jira/browse/LUCENE-2878
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: Bulk Postings branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
 PosHighlighter.patch, PosHighlighter.patch


 Currently we have two somewhat separate types of queries, the ones which can 
 make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
 doesn't really do scoring comparable to what other queries do, and at the end 
 of the day they duplicate a lot of code all over lucene. Span*Queries are 
 also limited to other Span*Query instances, such that you can not use a 
 TermQuery or a BooleanQuery with SpanNear or anything like that. 
 Beside the Span*Query limitation, other queries lack a quite interesting 
 feature, since they can not score based on term proximity, since scores don't 
 expose any positional information. All those problems bugged me for a while 
 now, so I started working on that using the bulkpostings API. I would have done 
 that first cut on trunk, but TermScorer is working on BlockReaders that do not 
 expose positions, while the one in this branch does. I started adding a new 
 Positions class which users can pull from a scorer; to prevent unnecessary 
 positions enums I added ScorerContext#needsPositions and eventually 
 Scorer#needsPayloads to create the corresponding enum on demand. Yet, 
 currently only TermQuery / TermScorer implements this API and others simply 
 return null instead. 
 To show that the API really works, and our BulkPostings work fine too with 
 positions, I cut over TermSpanQuery to use a TermScorer under the hood and 
 nuked TermSpans entirely. A nice side effect of this was that the Position 
 BulkReading implementation got some exercise which now :) work all with 
 

[JENKINS] Lucene-Solr-tests-only-3.x - Build # 9512 - Still Failing

2011-07-11 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/9512/

No tests ran.

Build Log (for compile errors):
[...truncated 3549 lines...]
[javac] NamedList<NamedList> fieldTypes = result.get("field_types");
[javac] ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:133:
 warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: 
org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList>
[javac] NamedList<NamedList> textType = fieldTypes.get("text");
[javac]   ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:136:
 warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: 
org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>>
[javac] NamedList<List<NamedList>> indexPart = textType.get("index");
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:201:
 warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: 
org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>>
[javac] NamedList<List<NamedList>> queryPart = textType.get("query");
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:230:
 warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: 
org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList>
[javac] NamedList<NamedList> nameTextType = fieldTypes.get("nametext");
[javac]   ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:233:
 warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: 
org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>>
[javac] indexPart = nameTextType.get("index");
[javac] ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:250:
 warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: 
org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>>
[javac] queryPart = nameTextType.get("query");
[javac] ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:256:
 warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: 
org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList>
[javac] NamedList<NamedList> fieldNames = result.get("field_names");
[javac] ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:259:
 warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: 
org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList>
[javac] NamedList<NamedList> whitetok = fieldNames.get("whitetok");
[javac]   ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:262:
 warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: 
org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>>
[javac] indexPart = whitetok.get("index");
[javac] ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:279:
 warning: [unchecked] 

RE: [JENKINS] Lucene-Solr-tests-only-3.x - Build # 9512 - Still Failing

2011-07-11 Thread Steven A Rowe
More @Override annotations - I've again committed the fix (removing the 
annotations):

[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/analysis/TestSynonymMap.java:274:
 method does not override a method from its superclass
[javac]   @Override
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/analysis/TestSynonymMap.java:284:
 method does not override a method from its superclass
[javac]   @Override
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/analysis/TestSynonymMap.java:289:
 method does not override a method from its superclass
[javac]   @Override
[javac]^


[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import

2011-07-11 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063511#comment-13063511
 ] 

Steven Rowe commented on SOLR-2551:
---

The [Lucene-Solr-tests-only-trunk Jenkins 
job|https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/] has run only 
once since the DIH tests were made to run sequentially 
(https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9500/), so I'll 
delay closing this issue until it's successfully run 15 or 20 more times, which 
should take less than one day.

 Check dataimport.properties for write access before starting import
 ---

 Key: SOLR-2551
 URL: https://issues.apache.org/jira/browse/SOLR-2551
 Project: Solr
  Issue Type: Improvement
  Components: contrib - DataImportHandler
Affects Versions: 1.4.1, 3.1
Reporter: C S
Assignee: Shalin Shekhar Mangar
Priority: Minor
 Fix For: 3.3, 4.0

 Attachments: SOLR-2551.patch


 A common mistake is that the /conf (respectively the dataimport.properties) 
 file is not writable for solr. It would be great if that were detected on 
 starting a dataimport job. 
 Currently an import might grind away for days and fail if it can't write its 
 timestamp to the dataimport.properties file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-3304) Allow WeightedSpanTermExtractor to collect positions for TermQuerys

2011-07-11 Thread Jahangir Anwari (JIRA)
Allow WeightedSpanTermExtractor to collect positions for TermQuerys
---

 Key: LUCENE-3304
 URL: https://issues.apache.org/jira/browse/LUCENE-3304
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.3
Reporter: Jahangir Anwari
Priority: Trivial


Spinoff from this thread:

http://www.gossamer-threads.com/lists/lucene/java-user/129668

Currently WeightedSpanTermExtractor only collects positions for position 
sensitive queries. Allowing WeightedSpanTermExtractor to store positions for 
TermQuery would enable the WeightedSpanTermExtractor to be used outside the 
highlighter in custom plugins to get positions information.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3287) Allow ability to set maxDocCharsToAnalyze in WeightedSpanTermExtractor

2011-07-11 Thread Jahangir Anwari (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jahangir Anwari updated LUCENE-3287:


Description: 
Spinoff from this thread:

http://www.gossamer-threads.com/lists/lucene/java-user/129668

In WeightedSpanTermExtractor the default maxDocCharsToAnalyze value is 0. This 
inhibits us from getting the weighted span terms in any custom code(e.g 
attached CustomHighlighter.java) that uses WeightedSpanTermExtractor. Currently 
the setMaxDocCharsToAnalyze() method is protected, which prevents us from 
setting  maxDocCharsToAnalyze to a value greater than 0. Changing the method to 
public would give us the ability to set the maxDocCharsToAnalyze.


  was:
In WeightedSpanTermExtractor the default maxDocCharsToAnalyze value is 0. This 
inhibits us from getting the weighted span terms in any custom code(e.g 
attached CustomHighlighter.java) that uses WeightedSpanTermExtractor. Currently 
the setMaxDocCharsToAnalyze() method is protected, which prevents us from 
setting  maxDocCharsToAnalyze to a value greater than 0. Changing the method to 
public would give us the ability to set the maxDocCharsToAnalyze.



 Allow ability to set maxDocCharsToAnalyze in WeightedSpanTermExtractor
 --

 Key: LUCENE-3287
 URL: https://issues.apache.org/jira/browse/LUCENE-3287
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.3
Reporter: Jahangir Anwari
Priority: Trivial
 Attachments: CustomHighlighter.java, WeightedSpanTermExtractor.patch


 Spinoff from this thread:
 http://www.gossamer-threads.com/lists/lucene/java-user/129668
 In WeightedSpanTermExtractor the default maxDocCharsToAnalyze value is 0. 
 This inhibits us from getting the weighted span terms in any custom code(e.g 
 attached CustomHighlighter.java) that uses WeightedSpanTermExtractor. 
 Currently the setMaxDocCharsToAnalyze() method is protected, which prevents 
 us from setting  maxDocCharsToAnalyze to a value greater than 0. Changing the 
 method to public would give us the ability to set the maxDocCharsToAnalyze.
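
 For illustration, custom code could then do something like the following (assuming only the proposed visibility change; the class name here is invented):
{code}
import org.apache.lucene.search.highlight.WeightedSpanTermExtractor;

class ExtractorSetup {
  static WeightedSpanTermExtractor newExtractor() {
    WeightedSpanTermExtractor extractor = new WeightedSpanTermExtractor();
    // possible once setMaxDocCharsToAnalyze() is public: analyze the whole doc
    extractor.setMaxDocCharsToAnalyze(Integer.MAX_VALUE);
    return extractor;
  }
}
{code}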

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-2308) Separately specify a field's type

2011-07-11 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2308:
---

Attachment: LUCENE-2308-ltc.patch

Small patch to fix LTC.newField to again randomly add in term vectors when they 
are disabled.

 Separately specify a field's type
 -

 Key: LUCENE-2308
 URL: https://issues.apache.org/jira/browse/LUCENE-2308
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Reporter: Michael McCandless
Assignee: Michael McCandless
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: LUCENE-2308-2.patch, LUCENE-2308-3.patch, 
 LUCENE-2308-4.patch, LUCENE-2308-4.patch, LUCENE-2308-5.patch, 
 LUCENE-2308-6.patch, LUCENE-2308-7.patch, LUCENE-2308-8.patch, 
 LUCENE-2308-9.patch, LUCENE-2308-ltc.patch, LUCENE-2308.patch, 
 LUCENE-2308.patch


 This came up from discussions on IRC.  I'm summarizing here...
 Today when you make a Field to add to a document you can set things
 index or not, stored or not, analyzed or not, details like omitTfAP,
 omitNorms, index term vectors (separately controlling
 offsets/positions), etc.
 I think we should factor these out into a new class (FieldType?).
 Then you could re-use this FieldType instance across multiple fields.
 The Field instance would still hold the actual value.
 We could then do per-field analyzers by adding a setAnalyzer on the
 FieldType, instead of the separate PerFieldAnalzyerWrapper (likewise
 for per-field codecs (with flex), where we now have
 PerFieldCodecWrapper).
 This would NOT be a schema!  It's just refactoring what we already
 specify today.  EG it's not serialized into the index.
 This has been discussed before, and I know Michael Busch opened a more
 ambitious (I think?) issue.  I think this is a good first baby step.  We could
 consider a hierarchy of FIeldType (NumericFieldType, etc.) but maybe hold
 off on that for starters...
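
As a rough sketch of the direction (the FieldType method names below are assumed from the patches on this issue, not a committed API):

{code}
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;

// Sketch only: method names assumed from the patch direction, not final.
public class FieldTypeSketch {
  static Document makeDoc() {
    FieldType meta = new FieldType();
    meta.setIndexed(true);          // index-time flags live on the reusable type...
    meta.setStored(true);
    meta.setTokenized(false);
    meta.setStoreTermVectors(true);

    Document doc = new Document();
    doc.add(new Field("id", "42", meta));        // ...while values live on the Field
    doc.add(new Field("group", "books", meta));  // the same FieldType is reused
    return doc;
  }
}
{code}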

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-1768) NumericRange support for new query parser

2011-07-11 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063555#comment-13063555
 ] 

Uwe Schindler commented on LUCENE-1768:
---

Vinicius, do you have any plans for backporting the stuff to Lucene 3.x? It 
should not be that hard :-)

bq. I am not sure about numeric support. Vinicius changed TermRangeQueryNode 
inheritance, which breaks backwards compatibility. I am not saying the 
change is bad, I agree with the new structure; however, Vinicius will need to 
find another solution before backporting it to 3.x.

I am not sure if this is really a break when you change inheritance. If code 
still compiles, it's no break; if classes were renamed it's more serious. I am 
not sure if implementation classes (and names) should be covered by 
backwards compatibility. In my opinion, mainly the configuration and interfaces 
of the QP must be covered by the backwards policy.

As we are now at mid-time, it would be a good idea to maybe add some extra 
syntax support for numerics, like < and >? We should also add tests/support 
for half-open ranges, so syntax like [* TO 1.0] should also be supported (I 
am not sure if TermRangeQueryNode supports this, but numerics should do this 
in all cases) - the above syntax is also printed out by 
NumericRangeQuery.toString() if one of the bounds is null. The latter could be 
easily implemented by checking for * as input to the range bounds and mapping 
those special values to NULL. Adding support for < and > (also <=, >=) 
needs knowledge of the JavaCC parser language. Vinicius, have you ever worked 
with JavaCC, and do you think you will be able to extend the syntax?
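
A small sketch of the *-to-null mapping described above (the helper class is hypothetical; a null bound already makes NumericRangeQuery half-open on that side):

{code}
import org.apache.lucene.search.NumericRangeQuery;

// Hypothetical helper: "*" maps to null, which NumericRangeQuery
// treats as an open bound on that side.
public class OpenBounds {
  static Double bound(String text) {
    return "*".equals(text) ? null : Double.valueOf(text);
  }

  public static NumericRangeQuery<Double> range(String field, String lower,
      String upper, boolean incLower, boolean incUpper) {
    return NumericRangeQuery.newDoubleRange(
        field, bound(lower), bound(upper), incLower, incUpper);
  }
}
{code}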


 NumericRange support for new query parser
 -

 Key: LUCENE-1768
 URL: https://issues.apache.org/jira/browse/LUCENE-1768
 Project: Lucene - Java
  Issue Type: New Feature
  Components: core/queryparser
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
  Labels: contrib, gsoc, gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: week-7.patch, week1.patch, week2.patch, week3.patch, 
 week4.patch, week5-6.patch


 It would be good to specify some type of schema for the query parser in the 
 future, to automatically create NumericRangeQuery for different numeric 
 types. It would then be possible to index a numeric value 
 (double, float, long, int) using NumericField, and then the query parser knows 
 which type of field this is, so it correctly creates a NumericRangeQuery 
 for strings like [1.567..*] or (1.787..19.5].
 There is currently no way to extract from the index whether a field is numeric, so 
 the user will have to configure the FieldConfig objects in the ConfigHandler. 
 But if this is done, it will not be that difficult to implement the rest.
 The only difference from the current handling of RangeQuery is then the 
 instantiation of the correct Query type and the conversion of the entered numeric 
 values (a simple Number.valueOf(...) conversion of the user-entered numbers). 
 Everything else is identical; NumericRangeQuery also supports the MTQ 
 rewrite modes (as it is an MTQ).
 Another thing is a change in Date semantics. There are some strange flags in 
 the current parser that tell it how to handle dates.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-1768) NumericRange support for new query parser

2011-07-11 Thread Adriano Crestani (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063570#comment-13063570
 ] 

Adriano Crestani commented on LUCENE-1768:
--

{quote}
I am not sure if this is really a break when you change inheritance. If code 
still compiles, it's no break; if classes were renamed it's more serious. I am 
not sure if implementation classes (and names) should be covered by 
backwards compatibility. In my opinion, mainly the configuration and interfaces 
of the QP must be covered by the backwards policy.
{quote}

I didn't see any class renaming; I need to double check Vinicius's patches. But 
he did change the query node inheritance, which may affect how processors and 
builders (especially QueryNodeTreeBuilder) work. I am not saying it is not 
possible to implement his approach on 3.x, but he will need to deal differently 
with the query node classes he created. As I said before, what he did is good and 
clean, I like the way it is, but it will break someone's code if pushed to 3.x. 
So if you ask me whether to push it to 3.x, I say YES, just make sure not to 
break the query node structure that people may be relying on.

{quote}
As we are now at mid-time, it would be a good idea to maybe add some extra 
syntax support for numerics, like < and >? We should also add tests/support 
for half-open ranges, so syntax like [* TO 1.0] should also be supported (I 
am not sure if TermRangeQueryNode supports this, but numerics should do this 
in all cases) - the above syntax is also printed out by 
NumericRangeQuery.toString() if one of the bounds is null. The latter could be 
easily implemented by checking for * as input to the range bounds and mapping 
those special values to NULL. Adding support for < and > (also <=, >=) 
needs knowledge of the JavaCC parser language. Vinicius, have you ever worked 
with JavaCC, and do you think you will be able to extend the syntax?
{quote}

I still need to investigate the bugs Vinicius reported (a JIRA should have been 
created for that already); I never really tried open ranges in the contrib 
QP. And if Vinicius thinks he will have the time and skills to do the JavaCC change 
to support those new operators, go for it! And remember Vinicius, you don't 
need to do everything during GSoC, you are always welcome to contribute code 
whenever you want :)

 NumericRange support for new query parser
 -

 Key: LUCENE-1768
 URL: https://issues.apache.org/jira/browse/LUCENE-1768
 Project: Lucene - Java
  Issue Type: New Feature
  Components: core/queryparser
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
  Labels: contrib, gsoc, gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: week-7.patch, week1.patch, week2.patch, week3.patch, 
 week4.patch, week5-6.patch


 It would be good to specify some type of schema for the query parser in the 
 future, to automatically create NumericRangeQuery for different numeric 
 types. It would then be possible to index a numeric value 
 (double, float, long, int) using NumericField, and then the query parser knows 
 which type of field this is, so it correctly creates a NumericRangeQuery 
 for strings like [1.567..*] or (1.787..19.5].
 There is currently no way to extract from the index whether a field is numeric, so 
 the user will have to configure the FieldConfig objects in the ConfigHandler. 
 But if this is done, it will not be that difficult to implement the rest.
 The only difference from the current handling of RangeQuery is then the 
 instantiation of the correct Query type and the conversion of the entered numeric 
 values (a simple Number.valueOf(...) conversion of the user-entered numbers). 
 Everything else is identical; NumericRangeQuery also supports the MTQ 
 rewrite modes (as it is an MTQ).
 Another thing is a change in Date semantics. There are some strange flags in 
 the current parser that tell it how to handle dates.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-2048) Omit positions but keep termFreq

2011-07-11 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2048:


Attachment: LUCENE-2048.patch

OK, here's an updated patch. I think it's ready to commit!

 Omit positions but keep termFreq
 

 Key: LUCENE-2048
 URL: https://issues.apache.org/jira/browse/LUCENE-2048
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.1
Reporter: Andrzej Bialecki 
Assignee: Robert Muir
 Fix For: 4.0

 Attachments: LUCENE-2048.patch, LUCENE-2048.patch


 it would be useful to have an option to discard positional information but 
 still keep the term frequency - currently setOmitTermFreqAndPositions 
 discards both. Even though position-dependent queries wouldn't work in such a 
 case, any other queries would still work fine and we would get the right 
 scoring.
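
For illustration, the per-field switch might end up looking like this (the enum and setter names are assumed from the patch's direction, not quoted from it):

{code}
import org.apache.lucene.document.Field;
import org.apache.lucene.index.FieldInfo.IndexOptions;

// Sketch only: enum/setter names are assumed, not necessarily the final API.
public class FreqsOnlyField {
  public static Field make(String text) {
    Field body = new Field("body", text, Field.Store.NO, Field.Index.ANALYZED);
    body.setIndexOptions(IndexOptions.DOCS_AND_FREQS); // keep freqs, drop positions
    return body;
  }
}
{code}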

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063582#comment-13063582
 ] 

Robert Muir commented on LUCENE-2878:
-

{quote}
But if I'm right, then it means there has to be some sort of callback mechanism 
in place during scoring, or else we have to resign ourselves to scoring first, 
and then re-setting and iterating positions in a second pass.
{quote}

But I think this is what we want? If there are 10 million documents 
that match a query, but our priority queue size is 20 (1 page), we only want to 
do the expensive highlighting on those 20 documents. 

It's the same for positional scoring: it's too expensive to look at positions 
for all documents, so you re-order maybe the top 100 or so.

Or maybe I'm totally confused by the comments!
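
In code, the two-pass flow being argued for looks roughly like this (highlight() is a hypothetical hook; the positions API on this branch is still in flux):

{code}
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// Rough two-pass sketch: cheap scoring over all hits, expensive position
// work only for the page-1 documents.
public class TwoPass {
  static void searchAndHighlight(IndexSearcher searcher, Query query)
      throws IOException {
    TopDocs top = searcher.search(query, 20); // pass 1: score everything, keep 20
    for (ScoreDoc sd : top.scoreDocs) {
      highlight(sd.doc, query);               // pass 2: positions for 20 docs only
    }
  }
  static void highlight(int doc, Query query) { /* hypothetical */ }
}
{code}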

 Allow Scorer to expose positions and payloads aka. nuke spans 
 --

 Key: LUCENE-2878
 URL: https://issues.apache.org/jira/browse/LUCENE-2878
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: Bulk Postings branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
 PosHighlighter.patch, PosHighlighter.patch


 Currently we have two somewhat separate types of queries: the ones which can 
 make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
 doesn't really do scoring comparable to what other queries do, and at the end 
 of the day they duplicate a lot of code all over Lucene. Span*Queries are 
 also limited to other Span*Query instances, such that you cannot use a 
 TermQuery or a BooleanQuery with SpanNear or anything like that. 
 Besides the Span*Query limitation, other queries lack a quite interesting 
 feature, since they cannot score based on term proximity: scorers don't 
 expose any positional information. All those problems bugged me for a while 
 now, so I started working on that using the bulkpostings API. I would have done 
 that first cut on trunk, but TermScorer there works on a BlockReader that does 
 not expose positions, while the one in this branch does. I started adding a new 
 Positions class which users can pull from a scorer; to prevent unnecessary 
 positions enums I added ScorerContext#needsPositions and eventually 
 Scorer#needsPayloads to create the corresponding enum on demand. Yet, 
 currently only TermQuery / TermScorer implements this API and others simply 
 return null instead. 
 To show that the API really works, and that our BulkPostings work fine with 
 positions too, I cut over TermSpanQuery to use a TermScorer under the hood and 
 nuked TermSpans entirely. A nice side effect of this was that the Position 
 BulkReading implementation got some exercise, which now :) works fully with 
 positions, while Payloads for bulk reading are kind of experimental in the 
 patch and only work with the Standard codec. 
 So all spans now work on top of TermScorer (I truly hate spans since today), 
 including the ones that need Payloads (StandardCodec ONLY)!!  I didn't bother 
 to implement the other codecs yet, since I want to get feedback on the API and 
 on this first cut before I go on with it. I will upload the corresponding 
 patch in a minute. 
 I also had to cut over SpanQuery.getSpans(IR) to 
 SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk 
 first, but after that pain today I need a break first :).
 The patch passes all core tests 
 (org.apache.lucene.search.highlight.HighlighterTest still fails, but I didn't 
 look into the MemoryIndex BulkPostings API yet).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Mike Sokolov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063595#comment-13063595
 ] 

Mike Sokolov commented on LUCENE-2878:
--

bq. But I think this is what we want? If there are 10 million documents 
that match a query, but our priority queue size is 20 (1 page), we only want to 
do the expensive highlighting on those 20 documents. 

Yes - the comments may be getting lost in the weeds a bit here; sorry.  I've 
been assuming you'd search once to collect documents and then search again with 
the same query plus a constraint to limit by the gathered docids, with an 
indication that positions are required - this pushes you towards some sort of 
collector-style callback API. Maybe life would be simpler if instead you could 
just say getPositionIterator(docid, query).  But that would actually force you 
into a third pass (I think), if you wanted positional scoring too, wouldn't it?

 Allow Scorer to expose positions and payloads aka. nuke spans 
 --

 Key: LUCENE-2878
 URL: https://issues.apache.org/jira/browse/LUCENE-2878
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: Bulk Postings branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
 PosHighlighter.patch, PosHighlighter.patch


 Currently we have two somewhat separate types of queries: the ones which can 
 make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
 doesn't really do scoring comparable to what other queries do, and at the end 
 of the day they duplicate a lot of code all over Lucene. Span*Queries are 
 also limited to other Span*Query instances, such that you cannot use a 
 TermQuery or a BooleanQuery with SpanNear or anything like that. 
 Besides the Span*Query limitation, other queries lack a quite interesting 
 feature, since they cannot score based on term proximity: scorers don't 
 expose any positional information. All those problems bugged me for a while 
 now, so I started working on that using the bulkpostings API. I would have done 
 that first cut on trunk, but TermScorer there works on a BlockReader that does 
 not expose positions, while the one in this branch does. I started adding a new 
 Positions class which users can pull from a scorer; to prevent unnecessary 
 positions enums I added ScorerContext#needsPositions and eventually 
 Scorer#needsPayloads to create the corresponding enum on demand. Yet, 
 currently only TermQuery / TermScorer implements this API and others simply 
 return null instead. 
 To show that the API really works, and that our BulkPostings work fine with 
 positions too, I cut over TermSpanQuery to use a TermScorer under the hood and 
 nuked TermSpans entirely. A nice side effect of this was that the Position 
 BulkReading implementation got some exercise, which now :) works fully with 
 positions, while Payloads for bulk reading are kind of experimental in the 
 patch and only work with the Standard codec. 
 So all spans now work on top of TermScorer (I truly hate spans since today), 
 including the ones that need Payloads (StandardCodec ONLY)!!  I didn't bother 
 to implement the other codecs yet, since I want to get feedback on the API and 
 on this first cut before I go on with it. I will upload the corresponding 
 patch in a minute. 
 I also had to cut over SpanQuery.getSpans(IR) to 
 SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk 
 first, but after that pain today I need a break first :).
 The patch passes all core tests 
 (org.apache.lucene.search.highlight.HighlighterTest still fails, but I didn't 
 look into the MemoryIndex BulkPostings API yet).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-2308) Separately specify a field's type

2011-07-11 Thread Nikola Tankovic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Tankovic updated LUCENE-2308:


Attachment: LUCENE-2308-10.patch

Solr cutover to FieldType. Seeing repeated, similar errors in the tests. Trying to 
debug. Help is appreciated :)

 Separately specify a field's type
 -

 Key: LUCENE-2308
 URL: https://issues.apache.org/jira/browse/LUCENE-2308
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Reporter: Michael McCandless
Assignee: Michael McCandless
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: LUCENE-2308-10.patch, LUCENE-2308-2.patch, 
 LUCENE-2308-3.patch, LUCENE-2308-4.patch, LUCENE-2308-4.patch, 
 LUCENE-2308-5.patch, LUCENE-2308-6.patch, LUCENE-2308-7.patch, 
 LUCENE-2308-8.patch, LUCENE-2308-9.patch, LUCENE-2308-ltc.patch, 
 LUCENE-2308.patch, LUCENE-2308.patch


 This came up from discussions on IRC.  I'm summarizing here...
 Today when you make a Field to add to a document you can set things like
 indexed or not, stored or not, analyzed or not, details like omitTfAP,
 omitNorms, index term vectors (separately controlling
 offsets/positions), etc.
 I think we should factor these out into a new class (FieldType?).
 Then you could re-use this FieldType instance across multiple fields.
 The Field instance would still hold the actual value.
 We could then do per-field analyzers by adding a setAnalyzer on the
 FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise
 for per-field codecs (with flex), where we now have
 PerFieldCodecWrapper).
 This would NOT be a schema!  It's just refactoring what we already
 specify today.  E.g. it's not serialized into the index.
 This has been discussed before, and I know Michael Busch opened a more
 ambitious (I think?) issue.  I think this is a good first baby step.  We could
 consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold
 off on that for starters...

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-1768) NumericRange support for new query parser

2011-07-11 Thread Vinicius Barros (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063607#comment-13063607
 ] 

Vinicius Barros commented on LUCENE-1768:
-

Thanks for committing the patch, Uwe!

I will review the code again looking for switches without a default case and fix them.

I never did anything with JavaCC. I just quickly looked at the code and it does not 
seem complicated; however, I have no idea how complex it is to run JavaCC and 
regenerate the Java files. Does the Lucene ant script do that automatically?

I can try to fix open range queries in the contrib query parser, add <=-like 
operators, or backport numeric support to 3.x. Just let me know the priorities 
and I will work on it. My suggestion is that the bug in open range queries is 
the most critical now, so I could start working on that. Your call, Uwe.

 NumericRange support for new query parser
 -

 Key: LUCENE-1768
 URL: https://issues.apache.org/jira/browse/LUCENE-1768
 Project: Lucene - Java
  Issue Type: New Feature
  Components: core/queryparser
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
  Labels: contrib, gsoc, gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: week-7.patch, week1.patch, week2.patch, week3.patch, 
 week4.patch, week5-6.patch


 It would be good to specify some type of schema for the query parser in the 
 future, to automatically create NumericRangeQuery for different numeric 
 types. It would then be possible to index a numeric value 
 (double, float, long, int) using NumericField, and then the query parser knows 
 which type of field this is, so it correctly creates a NumericRangeQuery 
 for strings like [1.567..*] or (1.787..19.5].
 There is currently no way to extract from the index whether a field is numeric, so 
 the user will have to configure the FieldConfig objects in the ConfigHandler. 
 But if this is done, it will not be that difficult to implement the rest.
 The only difference from the current handling of RangeQuery is then the 
 instantiation of the correct Query type and the conversion of the entered numeric 
 values (a simple Number.valueOf(...) conversion of the user-entered numbers). 
 Everything else is identical; NumericRangeQuery also supports the MTQ 
 rewrite modes (as it is an MTQ).
 Another thing is a change in Date semantics. There are some strange flags in 
 the current parser that tell it how to handle dates.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063610#comment-13063610
 ] 

Robert Muir commented on LUCENE-2878:
-

{quote}
But that would actually force you into a third pass (I think), if you wanted 
positional scoring too, wouldn't it?
{quote}

I think that's OK, because the two things are different: 
* in general I think you want to rerank more than just page 1 with positional 
scoring, e.g. maybe 100 or even 1000 documents versus the 20 that highlighting 
needs (see the sketch below).
* for scoring, we need to adjust our PQ, resulting in a (possibly) different 
set of page-1 documents for the highlighting process, so if we are doing both 
algorithms, we still don't yet know what to highlight anyway.
* if we assume we are going to add offsets (optionally) to our postings lists 
in parallel to the positions, that's another difference: scoring doesn't care 
about offsets, but highlighting needs them.
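
A minimal sketch of that rerank step (positionalScore() is a hypothetical proximity rescorer, not an API from this patch):

{code}
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;

// Rescore only the top N docs with a positional score, then re-sort;
// the long tail of hits is never touched.
public class Rerank {
  static ScoreDoc[] rerank(IndexSearcher searcher, Query query, int n)
      throws IOException {
    ScoreDoc[] top = searcher.search(query, n).scoreDocs;
    for (ScoreDoc sd : top) {
      sd.score = positionalScore(sd.doc, query); // hypothetical rescorer
    }
    Arrays.sort(top, new Comparator<ScoreDoc>() {
      public int compare(ScoreDoc a, ScoreDoc b) {
        return Float.compare(b.score, a.score); // descending by new score
      }
    });
    return top;
  }
  static float positionalScore(int doc, Query query) { return 0f; /* hypothetical */ }
}
{code}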


 Allow Scorer to expose positions and payloads aka. nuke spans 
 --

 Key: LUCENE-2878
 URL: https://issues.apache.org/jira/browse/LUCENE-2878
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: Bulk Postings branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
 PosHighlighter.patch, PosHighlighter.patch


 Currently we have two somewhat separate types of queries: the ones which can 
 make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
 doesn't really do scoring comparable to what other queries do, and at the end 
 of the day they duplicate a lot of code all over Lucene. Span*Queries are 
 also limited to other Span*Query instances, such that you cannot use a 
 TermQuery or a BooleanQuery with SpanNear or anything like that. 
 Besides the Span*Query limitation, other queries lack a quite interesting 
 feature, since they cannot score based on term proximity: scorers don't 
 expose any positional information. All those problems bugged me for a while 
 now, so I started working on that using the bulkpostings API. I would have done 
 that first cut on trunk, but TermScorer there works on a BlockReader that does 
 not expose positions, while the one in this branch does. I started adding a new 
 Positions class which users can pull from a scorer; to prevent unnecessary 
 positions enums I added ScorerContext#needsPositions and eventually 
 Scorer#needsPayloads to create the corresponding enum on demand. Yet, 
 currently only TermQuery / TermScorer implements this API and others simply 
 return null instead. 
 To show that the API really works, and that our BulkPostings work fine with 
 positions too, I cut over TermSpanQuery to use a TermScorer under the hood and 
 nuked TermSpans entirely. A nice side effect of this was that the Position 
 BulkReading implementation got some exercise, which now :) works fully with 
 positions, while Payloads for bulk reading are kind of experimental in the 
 patch and only work with the Standard codec. 
 So all spans now work on top of TermScorer (I truly hate spans since today), 
 including the ones that need Payloads (StandardCodec ONLY)!!  I didn't bother 
 to implement the other codecs yet, since I want to get feedback on the API and 
 on this first cut before I go on with it. I will upload the corresponding 
 patch in a minute. 
 I also had to cut over SpanQuery.getSpans(IR) to 
 SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk 
 first, but after that pain today I need a break first :).
 The patch passes all core tests 
 (org.apache.lucene.search.highlight.HighlighterTest still fails, but I didn't 
 look into the MemoryIndex BulkPostings API yet).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2048) Omit positions but keep termFreq

2011-07-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063612#comment-13063612
 ] 

Michael McCandless commented on LUCENE-2048:


Looks great!  +1 to commit.

 Omit positions but keep termFreq
 

 Key: LUCENE-2048
 URL: https://issues.apache.org/jira/browse/LUCENE-2048
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.1
Reporter: Andrzej Bialecki 
Assignee: Robert Muir
 Fix For: 4.0

 Attachments: LUCENE-2048.patch, LUCENE-2048.patch


 it would be useful to have an option to discard positional information but 
 still keep the term frequency - currently setOmitTermFreqAndPositions 
 discards both. Even though position-dependent queries wouldn't work in such a 
 case, any other queries would still work fine and we would get the right 
 scoring.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063614#comment-13063614
 ] 

Grant Ingersoll commented on LUCENE-2878:
-

FWIW, I do think there are use cases where one wants positions over all hits 
(or most, such that you might as well do all), so if it doesn't cause problems 
for the main use case, it would be nice to support it.  In fact, in these 
scenarios, you usually care less about the PQ and more about the positions. 

 Allow Scorer to expose positions and payloads aka. nuke spans 
 --

 Key: LUCENE-2878
 URL: https://issues.apache.org/jira/browse/LUCENE-2878
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: Bulk Postings branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
 PosHighlighter.patch, PosHighlighter.patch


 Currently we have two somewhat separate types of queries: the ones which can 
 make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
 doesn't really do scoring comparable to what other queries do, and at the end 
 of the day they duplicate a lot of code all over Lucene. Span*Queries are 
 also limited to other Span*Query instances, such that you cannot use a 
 TermQuery or a BooleanQuery with SpanNear or anything like that. 
 Besides the Span*Query limitation, other queries lack a quite interesting 
 feature, since they cannot score based on term proximity: scorers don't 
 expose any positional information. All those problems bugged me for a while 
 now, so I started working on that using the bulkpostings API. I would have done 
 that first cut on trunk, but TermScorer there works on a BlockReader that does 
 not expose positions, while the one in this branch does. I started adding a new 
 Positions class which users can pull from a scorer; to prevent unnecessary 
 positions enums I added ScorerContext#needsPositions and eventually 
 Scorer#needsPayloads to create the corresponding enum on demand. Yet, 
 currently only TermQuery / TermScorer implements this API and others simply 
 return null instead. 
 To show that the API really works, and that our BulkPostings work fine with 
 positions too, I cut over TermSpanQuery to use a TermScorer under the hood and 
 nuked TermSpans entirely. A nice side effect of this was that the Position 
 BulkReading implementation got some exercise, which now :) works fully with 
 positions, while Payloads for bulk reading are kind of experimental in the 
 patch and only work with the Standard codec. 
 So all spans now work on top of TermScorer (I truly hate spans since today), 
 including the ones that need Payloads (StandardCodec ONLY)!!  I didn't bother 
 to implement the other codecs yet, since I want to get feedback on the API and 
 on this first cut before I go on with it. I will upload the corresponding 
 patch in a minute. 
 I also had to cut over SpanQuery.getSpans(IR) to 
 SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk 
 first, but after that pain today I need a break first :).
 The patch passes all core tests 
 (org.apache.lucene.search.highlight.HighlighterTest still fails, but I didn't 
 look into the MemoryIndex BulkPostings API yet).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2308) Separately specify a field's type

2011-07-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063618#comment-13063618
 ] 

Michael McCandless commented on LUCENE-2308:


Nikola tracked this down -- it's because we're not reading numeric fields back 
properly from stored fields.

 Separately specify a field's type
 -

 Key: LUCENE-2308
 URL: https://issues.apache.org/jira/browse/LUCENE-2308
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Reporter: Michael McCandless
Assignee: Michael McCandless
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: LUCENE-2308-10.patch, LUCENE-2308-2.patch, 
 LUCENE-2308-3.patch, LUCENE-2308-4.patch, LUCENE-2308-4.patch, 
 LUCENE-2308-5.patch, LUCENE-2308-6.patch, LUCENE-2308-7.patch, 
 LUCENE-2308-8.patch, LUCENE-2308-9.patch, LUCENE-2308-ltc.patch, 
 LUCENE-2308.patch, LUCENE-2308.patch


 This came up from discussions on IRC.  I'm summarizing here...
 Today when you make a Field to add to a document you can set things like
 indexed or not, stored or not, analyzed or not, details like omitTfAP,
 omitNorms, index term vectors (separately controlling
 offsets/positions), etc.
 I think we should factor these out into a new class (FieldType?).
 Then you could re-use this FieldType instance across multiple fields.
 The Field instance would still hold the actual value.
 We could then do per-field analyzers by adding a setAnalyzer on the
 FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise
 for per-field codecs (with flex), where we now have
 PerFieldCodecWrapper).
 This would NOT be a schema!  It's just refactoring what we already
 specify today.  E.g. it's not serialized into the index.
 This has been discussed before, and I know Michael Busch opened a more
 ambitious (I think?) issue.  I think this is a good first baby step.  We could
 consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold
 off on that for starters...

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3282) BlockJoinQuery: Allow to add a custom child collector, and customize the parent bitset extraction

2011-07-11 Thread Shay Banon (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063619#comment-13063619
 ] 

Shay Banon commented on LUCENE-3282:


Heya,

   In my app, I have a wrapper around OBS that has a common interface allowing 
access to bits by index (similar to Bits in trunk), so I need to extract 
the OBS from it.

   Regarding the Collector, I will work on the CollectorProvider interface. I liked 
the NoOpCollector option, since then you don't have to check for nulls each 
time...
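
A minimal sketch of the NoOpCollector idea against the 3.x-style Collector API (the class itself comes from this discussion, not a committed API):

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Do-nothing default child collector, so callers never have to null-check.
final class NoOpCollector extends Collector {
  @Override public void setScorer(Scorer scorer) {}
  @Override public void collect(int doc) {}
  @Override public void setNextReader(IndexReader reader, int docBase)
      throws IOException {}
  @Override public boolean acceptsDocsOutOfOrder() { return true; }
}
{code}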

 BlockJoinQuery: Allow to add a custom child collector, and customize the 
 parent bitset extraction
 -

 Key: LUCENE-3282
 URL: https://issues.apache.org/jira/browse/LUCENE-3282
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 3.4, 4.0
Reporter: Shay Banon
 Attachments: LUCENE-3282.patch


 It would be nice to allow adding a custom child collector to the 
 BlockJoinQuery, to be called on every matching doc (so we can do things with 
 it, like counts and such). Also, allow extending BlockJoinQuery with 
 custom code that converts the filter bitset to an OpenBitSet.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3282) BlockJoinQuery: Allow to add a custom child collector, and customize the parent bitset extraction

2011-07-11 Thread Shay Banon (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shay Banon updated LUCENE-3282:
---

Attachment: LUCENE-3282.patch

New version, with CollectorProvider.

 BlockJoinQuery: Allow to add a custom child collector, and customize the 
 parent bitset extraction
 -

 Key: LUCENE-3282
 URL: https://issues.apache.org/jira/browse/LUCENE-3282
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 3.4, 4.0
Reporter: Shay Banon
 Attachments: LUCENE-3282.patch, LUCENE-3282.patch


 It would be nice to allow adding a custom child collector to the 
 BlockJoinQuery, to be called on every matching doc (so we can do things with 
 it, like counts and such). Also, allow extending BlockJoinQuery with 
 custom code that converts the filter bitset to an OpenBitSet.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063622#comment-13063622
 ] 

Robert Muir commented on LUCENE-2878:
-

{quote}
FWIW, I do think there are use cases where one wants positions over all hits 
(or most, such that you might as well do all), so if it doesn't cause problems 
for the main use case, it would be nice to support it. In fact, in these 
scenarios, you usually care less about the PQ and more about the positions. 
{quote}

I don't think this issue should try to solve that problem: if you are doing 
that, it sounds like you are using the wrong Query!


 Allow Scorer to expose positions and payloads aka. nuke spans 
 --

 Key: LUCENE-2878
 URL: https://issues.apache.org/jira/browse/LUCENE-2878
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: Bulk Postings branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
 PosHighlighter.patch, PosHighlighter.patch


 Currently we have two somewhat separate types of queries: the ones which can 
 make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
 doesn't really do scoring comparable to what other queries do, and at the end 
 of the day they duplicate a lot of code all over Lucene. Span*Queries are 
 also limited to other Span*Query instances, such that you cannot use a 
 TermQuery or a BooleanQuery with SpanNear or anything like that. 
 Besides the Span*Query limitation, other queries lack a quite interesting 
 feature, since they cannot score based on term proximity: scorers don't 
 expose any positional information. All those problems bugged me for a while 
 now, so I started working on that using the bulkpostings API. I would have done 
 that first cut on trunk, but TermScorer there works on a BlockReader that does 
 not expose positions, while the one in this branch does. I started adding a new 
 Positions class which users can pull from a scorer; to prevent unnecessary 
 positions enums I added ScorerContext#needsPositions and eventually 
 Scorer#needsPayloads to create the corresponding enum on demand. Yet, 
 currently only TermQuery / TermScorer implements this API and others simply 
 return null instead. 
 To show that the API really works, and that our BulkPostings work fine with 
 positions too, I cut over TermSpanQuery to use a TermScorer under the hood and 
 nuked TermSpans entirely. A nice side effect of this was that the Position 
 BulkReading implementation got some exercise, which now :) works fully with 
 positions, while Payloads for bulk reading are kind of experimental in the 
 patch and only work with the Standard codec. 
 So all spans now work on top of TermScorer (I truly hate spans since today), 
 including the ones that need Payloads (StandardCodec ONLY)!!  I didn't bother 
 to implement the other codecs yet, since I want to get feedback on the API and 
 on this first cut before I go on with it. I will upload the corresponding 
 patch in a minute. 
 I also had to cut over SpanQuery.getSpans(IR) to 
 SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk 
 first, but after that pain today I need a break first :).
 The patch passes all core tests 
 (org.apache.lucene.search.highlight.HighlighterTest still fails, but I didn't 
 look into the MemoryIndex BulkPostings API yet).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063625#comment-13063625
 ] 

Grant Ingersoll commented on LUCENE-2878:
-

bq. I don't think this issue should try to solve that problem: if you are doing 
that, it sounds like you are using the wrong Query!

It's basically a boolean match on any arbitrary Query where you care about the 
positions.  Pretty common in e-discovery and other areas.  You have a query 
that tells you all the matches and you want to operate over the positions.  
Right now, it's a pain, as you have to execute the query twice: once to get the 
scores and once to get the positions/spans.  If you have a callback mechanism, 
one can do both at once.
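
A hedged sketch of what such a callback shape could look like (every name here is hypothetical; nothing below comes from the patch):

{code}
// Hypothetical one-pass callback: the scorer reports each matching doc's
// score together with its match positions.
interface PositionCollector {
  void collect(int doc, float score, int[] positions);
}

final class WindowGatherer implements PositionCollector {
  @Override
  public void collect(int doc, float score, int[] positions) {
    for (int pos : positions) {
      // hand (doc, pos) to the downstream window analysis
    }
  }
}
{code}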

 Allow Scorer to expose positions and payloads aka. nuke spans 
 --

 Key: LUCENE-2878
 URL: https://issues.apache.org/jira/browse/LUCENE-2878
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: Bulk Postings branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
 PosHighlighter.patch, PosHighlighter.patch


 Currently we have two somewhat separate types of queries: the ones which can 
 make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
 doesn't really do scoring comparable to what other queries do, and at the end 
 of the day they duplicate a lot of code all over Lucene. Span*Queries are 
 also limited to other Span*Query instances, such that you cannot use a 
 TermQuery or a BooleanQuery with SpanNear or anything like that. 
 Besides the Span*Query limitation, other queries lack a quite interesting 
 feature, since they cannot score based on term proximity: scorers don't 
 expose any positional information. All those problems bugged me for a while 
 now, so I started working on that using the bulkpostings API. I would have done 
 that first cut on trunk, but TermScorer there works on a BlockReader that does 
 not expose positions, while the one in this branch does. I started adding a new 
 Positions class which users can pull from a scorer; to prevent unnecessary 
 positions enums I added ScorerContext#needsPositions and eventually 
 Scorer#needsPayloads to create the corresponding enum on demand. Yet, 
 currently only TermQuery / TermScorer implements this API and others simply 
 return null instead. 
 To show that the API really works, and that our BulkPostings work fine with 
 positions too, I cut over TermSpanQuery to use a TermScorer under the hood and 
 nuked TermSpans entirely. A nice side effect of this was that the Position 
 BulkReading implementation got some exercise, which now :) works fully with 
 positions, while Payloads for bulk reading are kind of experimental in the 
 patch and only work with the Standard codec. 
 So all spans now work on top of TermScorer (I truly hate spans since today), 
 including the ones that need Payloads (StandardCodec ONLY)!!  I didn't bother 
 to implement the other codecs yet, since I want to get feedback on the API and 
 on this first cut before I go on with it. I will upload the corresponding 
 patch in a minute. 
 I also had to cut over SpanQuery.getSpans(IR) to 
 SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk 
 first, but after that pain today I need a break first :).
 The patch passes all core tests 
 (org.apache.lucene.search.highlight.HighlighterTest still fails, but I didn't 
 look into the MemoryIndex BulkPostings API yet).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-2048) Omit positions but keep termFreq

2011-07-11 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2048:


Fix Version/s: 3.4

 Omit positions but keep termFreq
 

 Key: LUCENE-2048
 URL: https://issues.apache.org/jira/browse/LUCENE-2048
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.1
Reporter: Andrzej Bialecki 
Assignee: Robert Muir
 Fix For: 3.4, 4.0

 Attachments: LUCENE-2048.patch, LUCENE-2048.patch


 it would be useful to have an option to discard positional information but 
 still keep the term frequency - currently setOmitTermFreqAndPositions 
 discards both. Even though position-dependent queries wouldn't work in such a 
 case, any other queries would still work fine and we would get the right 
 scoring.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063626#comment-13063626
 ] 

Robert Muir commented on LUCENE-2878:
-

I don't understand the exact use case... it still sounds like the wrong query. 
What operations over the positions do you need to do?

 Allow Scorer to expose positions and payloads aka. nuke spans 
 --

 Key: LUCENE-2878
 URL: https://issues.apache.org/jira/browse/LUCENE-2878
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: Bulk Postings branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
 PosHighlighter.patch, PosHighlighter.patch


 Currently we have two somewhat separate types of queries: the ones which can 
 make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
 doesn't really do scoring comparable to what other queries do, and at the end 
 of the day they duplicate a lot of code all over Lucene. Span*Queries are 
 also limited to other Span*Query instances, such that you cannot use a 
 TermQuery or a BooleanQuery with SpanNear or anything like that. 
 Besides the Span*Query limitation, other queries lack a quite interesting 
 feature, since they cannot score based on term proximity: scorers don't 
 expose any positional information. All those problems bugged me for a while 
 now, so I started working on that using the bulkpostings API. I would have done 
 that first cut on trunk, but TermScorer there works on a BlockReader that does 
 not expose positions, while the one in this branch does. I started adding a new 
 Positions class which users can pull from a scorer; to prevent unnecessary 
 positions enums I added ScorerContext#needsPositions and eventually 
 Scorer#needsPayloads to create the corresponding enum on demand. Yet, 
 currently only TermQuery / TermScorer implements this API and others simply 
 return null instead. 
 To show that the API really works, and that our BulkPostings work fine with 
 positions too, I cut over TermSpanQuery to use a TermScorer under the hood and 
 nuked TermSpans entirely. A nice side effect of this was that the Position 
 BulkReading implementation got some exercise, which now :) works fully with 
 positions, while Payloads for bulk reading are kind of experimental in the 
 patch and only work with the Standard codec. 
 So all spans now work on top of TermScorer (I truly hate spans since today), 
 including the ones that need Payloads (StandardCodec ONLY)!!  I didn't bother 
 to implement the other codecs yet, since I want to get feedback on the API and 
 on this first cut before I go on with it. I will upload the corresponding 
 patch in a minute. 
 I also had to cut over SpanQuery.getSpans(IR) to 
 SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk 
 first, but after that pain today I need a break first :).
 The patch passes all core tests 
 (org.apache.lucene.search.highlight.HighlighterTest still fails, but I didn't 
 look into the MemoryIndex BulkPostings API yet).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063630#comment-13063630
 ] 

Grant Ingersoll commented on LUCENE-2878:
-

In the cases where I've both done this and seen it done, you often have an 
arbitrary query that matches X docs.  You then want to know where exactly the 
matches occur, and then you often want to do something in a window around those 
matches.  Right now, w/ Spans, you have to run the query once to get the scores 
and then run it a second time to get the windows.  The times I've seen it, the 
result is most often given to some downstream process that does deeper analysis 
of the window, so in these cases X can be quite large (1000s if not more).  In 
those cases, some people care about the score, some do not.  For instance, if 
one is analyzing all the words around the name of a company, your search term 
would be the company name and you want to iterate over all the positions where 
it matched, looking for other words near it (perhaps sentiment words or other 
things)
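
For reference, this is roughly what that second pass looks like today with Spans on the 3.x API (the window analysis is a hypothetical downstream hook):

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

// Iterate every match position of the company name, doc by doc.
public class CompanyWindows {
  static void scan(IndexReader reader) throws IOException {
    SpanTermQuery company = new SpanTermQuery(new Term("body", "acme"));
    Spans spans = company.getSpans(reader);
    while (spans.next()) {
      analyzeWindow(spans.doc(), spans.start(), spans.end()); // hypothetical
    }
  }
  static void analyzeWindow(int doc, int start, int end) { /* hypothetical */ }
}
{code}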

 Allow Scorer to expose positions and payloads aka. nuke spans 
 --

 Key: LUCENE-2878
 URL: https://issues.apache.org/jira/browse/LUCENE-2878
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: Bulk Postings branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
 PosHighlighter.patch, PosHighlighter.patch


 Currently we have two somewhat separate types of queries: the ones which can 
 make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
 doesn't really do scoring comparable to what other queries do, and at the end 
 of the day they duplicate a lot of code all over Lucene. Span*Queries are 
 also limited to other Span*Query instances, such that you cannot use a 
 TermQuery or a BooleanQuery with SpanNear or anything like that. 
 Besides the Span*Query limitation, other queries lack a quite interesting 
 feature, since they cannot score based on term proximity: scorers don't 
 expose any positional information. All those problems bugged me for a while 
 now, so I started working on that using the bulkpostings API. I would have done 
 that first cut on trunk, but TermScorer there works on a BlockReader that does 
 not expose positions, while the one in this branch does. I started adding a new 
 Positions class which users can pull from a scorer; to prevent unnecessary 
 positions enums I added ScorerContext#needsPositions and eventually 
 Scorer#needsPayloads to create the corresponding enum on demand. Yet, 
 currently only TermQuery / TermScorer implements this API and others simply 
 return null instead. 
 To show that the API really works, and that our BulkPostings work fine with 
 positions too, I cut over TermSpanQuery to use a TermScorer under the hood and 
 nuked TermSpans entirely. A nice side effect of this was that the Position 
 BulkReading implementation got some exercise, which now :) works fully with 
 positions, while Payloads for bulk reading are kind of experimental in the 
 patch and only work with the Standard codec. 
 So all spans now work on top of TermScorer (I truly hate spans since today), 
 including the ones that need Payloads (StandardCodec ONLY)!!  I didn't bother 
 to implement the other codecs yet, since I want to get feedback on the API and 
 on this first cut before I go on with it. I will upload the corresponding 
 patch in a minute. 
 I also had to cut over SpanQuery.getSpans(IR) to 
 SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk 
 first, but after that pain today I need a break first :).
 The patch passes all core tests 
 (org.apache.lucene.search.highlight.HighlighterTest still fails, but I didn't 
 look into the MemoryIndex BulkPostings API yet).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063635#comment-13063635
 ] 

Robert Muir commented on LUCENE-2878:
-

{quote}
In those cases, some people care about the score, some do not. For instance, if 
one is analyzing all the words around the name of a company, your search term 
would be the company name and you want to iterate over all the positions where 
it matched, looking for other words near it 
{quote}

Grant, I'm not sure an inverted index is even the best data 
structure for what you describe.

I just don't want us to confuse the issue with the nuking of spans / speeding up 
highlighting / enabling positional scoring use cases, which are core to search.

 Allow Scorer to expose positions and payloads aka. nuke spans 
 --

 Key: LUCENE-2878
 URL: https://issues.apache.org/jira/browse/LUCENE-2878
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: Bulk Postings branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
 PosHighlighter.patch, PosHighlighter.patch



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063644#comment-13063644
 ] 

Grant Ingersoll commented on LUCENE-2878:
-

bq. I'm not sure this sounds like an inverted index is even the best data 
structure for what you describe

The key is you usually have a fairly complex Query to begin with, so I do think 
it is legitimate and it is the right data structure.  It is always driven by 
the search results.  I've seen this use case multiple times, where multiple is 
more than 10, so I am pretty convinced it is beyond just me.  I think if you 
are taking away the ability to create windows around a match (if you read my 
early comments on this issue, I brought it up from the beginning), that is a 
pretty big loss.  I don't think the two things are mutually exclusive.  As long 
as I have a way to get at the positions for all matches, I don't care how it 
is done.  A collector-type callback interface, or a way for one to iterate all 
positions for a given match, should be sufficient.

That being said, if Mike's comments about a collector-like API are how it is 
implemented, I think it should work.  In reality, I think one would just need a 
way to, for whatever number of results, be told about positions as they happen. 
 Naturally, the default should be to only do this after the top X are 
retrieved, when X is small, but I could see implementing it in the scoring loop 
on certain occasions (and I'm not saying Lucene need have first-order support 
for that).  As long as you don't preclude me from doing that, it should be fine.

I'll try to find time to review the patch in more depth in the coming day or so.
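
For concreteness, a collector-type callback for positions like the one mentioned above could look roughly like the following; this is purely a hypothetical sketch, not an API from the patch.

{code}
// Hypothetical sketch of a collector-style positions callback; none of
// these names exist in Lucene, they just make the idea concrete.
interface PositionCollector {
  /** Called once per matching document, before its positions are reported. */
  void collectDoc(int docId, float score);

  /** Called for each position of the match within the current document.
   *  payload may be null when none is stored. */
  void collectPosition(int docId, int position, byte[] payload);
}
{code}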

 Allow Scorer to expose positions and payloads aka. nuke spans 
 --

 Key: LUCENE-2878
 URL: https://issues.apache.org/jira/browse/LUCENE-2878
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: Bulk Postings branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
 PosHighlighter.patch, PosHighlighter.patch



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063657#comment-13063657
 ] 

Robert Muir commented on LUCENE-2878:
-

{quote}
The key is you usually have a fairly complex Query to begin with, so I do think 
it is legitimate and it is the right data structure.
{quote}

Really, just because it's complicated? Accessing other terms 'around the 
position' seems like accessing the document in a non-inverted way.

{quote}
I've seen this use case multiple times, where multiple is more than 10, so I am 
pretty convinced it is beyond just me.
{quote}

Really? If this is so common, why do the spans get so little attention? If the 
queries are so complex, how is this even possible now, given that spans have so 
many problems, even basic ones (e.g. discarding boosts)?

If performance here is so important towards looking at these 'windows around a 
match' (which is gonna be slow as shit via term vectors),
why don't I see codecs that e.g. deduplicate terms and store pointers to the 
term windows around themselves in payloads, and things like that
for this use case?

I don't think we need to lock ourselves into a particular solution (such as a 
per-position callback API) for something that sounds like it's really slow 
already.
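
For reference, the term-vector route referred to above looks roughly like this with the 3.x term vector API (a sketch; it assumes the field was indexed with term vectors including positions):

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermPositionVector;

// Sketch of the slow two-pass approach: after the search, re-read each
// hit's term vector and collect the terms whose positions fall within a
// window around the match position.
class WindowAroundMatch {
  static List<String> termsAround(IndexReader reader, int docId, String field,
                                  int matchPos, int window) throws IOException {
    List<String> out = new ArrayList<String>();
    TermPositionVector tpv =
        (TermPositionVector) reader.getTermFreqVector(docId, field);
    if (tpv == null) {
      return out; // no term vector stored for this doc/field
    }
    String[] terms = tpv.getTerms();
    for (int i = 0; i < terms.length; i++) {
      int[] positions = tpv.getTermPositions(i);
      if (positions == null) {
        continue; // positions were not stored for this term
      }
      for (int pos : positions) {
        if (pos != matchPos && Math.abs(pos - matchPos) <= window) {
          out.add(terms[i]);
        }
      }
    }
    return out;
  }
}
{code}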


 Allow Scorer to expose positions and payloads aka. nuke spans 
 --

 Key: LUCENE-2878
 URL: https://issues.apache.org/jira/browse/LUCENE-2878
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: Bulk Postings branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
 PosHighlighter.patch, PosHighlighter.patch



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS] Lucene-Solr-tests-only-trunk - Build # 9507 - Failure

2011-07-11 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9507/

2 tests failed.
REGRESSION:  
org.apache.lucene.queryParser.standard.TestNumericQueryParser.testInclusiveNumericRange

Error Message:
null

Stack Trace:
java.lang.NullPointerException
at 
org.apache.lucene.queryParser.standard.config.NumberDateFormat.parse(NumberDateFormat.java:50)
at java.text.NumberFormat.parse(NumberFormat.java:348)
at 
org.apache.lucene.queryParser.standard.processors.NumericRangeQueryNodeProcessor.postProcessNode(NumericRangeQueryNodeProcessor.java:72)
at 
org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:98)
at 
org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processChildren(QueryNodeProcessorImpl.java:124)
at 
org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:96)
at 
org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processChildren(QueryNodeProcessorImpl.java:124)
at 
org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:96)
at 
org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.process(QueryNodeProcessorImpl.java:89)
at 
org.apache.lucene.queryParser.core.processors.QueryNodeProcessorPipeline.process(QueryNodeProcessorPipeline.java:88)
at 
org.apache.lucene.queryParser.core.QueryParserHelper.parse(QueryParserHelper.java:254)
at 
org.apache.lucene.queryParser.standard.StandardQueryParser.parse(StandardQueryParser.java:166)
at 
org.apache.lucene.queryParser.standard.TestNumericQueryParser.testQuery(TestNumericQueryParser.java:385)
at 
org.apache.lucene.queryParser.standard.TestNumericQueryParser.assertRangeQuery(TestNumericQueryParser.java:356)
at 
org.apache.lucene.queryParser.standard.TestNumericQueryParser.testInclusiveNumericRange(TestNumericQueryParser.java:282)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382)


REGRESSION:  
org.apache.lucene.queryParser.standard.TestNumericQueryParser.testExclusiveNumericRange

Error Message:
null

Stack Trace:
java.lang.NullPointerException
at 
org.apache.lucene.queryParser.standard.config.NumberDateFormat.parse(NumberDateFormat.java:50)
at java.text.NumberFormat.parse(NumberFormat.java:348)
at 
org.apache.lucene.queryParser.standard.processors.NumericRangeQueryNodeProcessor.postProcessNode(NumericRangeQueryNodeProcessor.java:72)
at 
org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:98)
at 
org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processChildren(QueryNodeProcessorImpl.java:124)
at 
org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:96)
at 
org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processChildren(QueryNodeProcessorImpl.java:124)
at 
org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:96)
at 
org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.process(QueryNodeProcessorImpl.java:89)
at 
org.apache.lucene.queryParser.core.processors.QueryNodeProcessorPipeline.process(QueryNodeProcessorPipeline.java:88)
at 
org.apache.lucene.queryParser.core.QueryParserHelper.parse(QueryParserHelper.java:254)
at 
org.apache.lucene.queryParser.standard.StandardQueryParser.parse(StandardQueryParser.java:166)
at 
org.apache.lucene.queryParser.standard.TestNumericQueryParser.testQuery(TestNumericQueryParser.java:385)
at 
org.apache.lucene.queryParser.standard.TestNumericQueryParser.assertRangeQuery(TestNumericQueryParser.java:356)
at 
org.apache.lucene.queryParser.standard.TestNumericQueryParser.testExclusiveNumericRange(TestNumericQueryParser.java:311)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382)




Build Log (for compile errors):
[...truncated 3344 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2644) DIH handler - when using threads=2 the default logging is set too high

2011-07-11 Thread Bill Bell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Bell updated SOLR-2644:


Description: 
Setting the threads parameter in the DIH handler causes every add to be logged 
at INFO level.
The only current workaround is to set the following in log4j.properties:

log4j.rootCategory=INFO, logfile
log4j.logger.org.apache.solr.handler.dataimport.DocBuilder=FATAL
log4j.logger.org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper=FATAL

These 2 log messages need to be changed to DEBUG.


  was:
Setting threads parameter in DIH handler, every add outputs to the log in INFO 
level.
The only current solution is to set the following in log4j.properties:

log4j.rootCategory=INFO, logfile
log4j.logger.org.apache.solr.handler.dataimport.DocBuilder=FATAL
log4j.logger.org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper=FATAL

These 2 log messages need to be changed to  INFO.



 DIH handler - when using threads=2 the default logging is set too high
 --

 Key: SOLR-2644
 URL: https://issues.apache.org/jira/browse/SOLR-2644
 Project: Solr
  Issue Type: Bug
  Components: contrib - DataImportHandler
Affects Versions: 3.3
Reporter: Bill Bell
Assignee: Shalin Shekhar Mangar
 Fix For: 3.4, 4.0

 Attachments: SOLR-2644.patch



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063662#comment-13063662
 ] 

Grant Ingersoll commented on LUCENE-2878:
-

bq. Really, just because it's complicated? Accessing other terms 'around the 
position' seems like accessing the document in a non-inverted way.

Isn't that what highlighting does?  This is just highlighting on a much bigger 
set of documents.  I don't see why we should prevent users from doing it just 
b/c you don't see the use case.  

bq. Really? If this is so common, why do the spans get so little attention? If 
the queries are so complex, how is this even possible now, given that spans 
have so many problems, even basic ones (e.g. discarding boosts)?

Isn't that the point of this whole patch?  To bring spans into the fold and 
treat them as first-class citizens? I didn't say it happened all the time.  I 
just said it happened enough that I think it warrants being covered before one 
nukes spans.

bq. If performance here is so important towards looking at these 'windows 
around a match' (which is gonna be slow as shit via term vectors),
why don't I see codecs that e.g. deduplicate terms and store pointers to the 
term windows around themselves in payloads, and things like that
for this use case?

Um, b/c it's open source and not everything gets implemented the minute you 
think of it?

bq. I don't think we need to lock ourselves into a particular solution (such as 
a per-position callback API) for something that sounds like it's really slow 
already.

Never said we did.



 Allow Scorer to expose positions and payloads aka. nuke spans 
 --

 Key: LUCENE-2878
 URL: https://issues.apache.org/jira/browse/LUCENE-2878
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: Bulk Postings branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
 PosHighlighter.patch, PosHighlighter.patch



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-2644) DIH handler - when using threads=2 the default logging is set too high

2011-07-11 Thread Bill Bell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Bell updated SOLR-2644:


Attachment: SOLR-2644-2.patch

 DIH handler - when using threads=2 the default logging is set too high
 --

 Key: SOLR-2644
 URL: https://issues.apache.org/jira/browse/SOLR-2644
 Project: Solr
  Issue Type: Bug
  Components: contrib - DataImportHandler
Affects Versions: 3.3
Reporter: Bill Bell
Assignee: Shalin Shekhar Mangar
 Fix For: 3.4, 4.0

 Attachments: SOLR-2644-2.patch, SOLR-2644.patch



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2644) DIH handler - when using threads=2 the default logging is set too high

2011-07-11 Thread Bill Bell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Bell updated SOLR-2644:


Attachment: (was: SOLR-2644-2.patch)

 DIH handler - when using threads=2 the default logging is set too high
 --

 Key: SOLR-2644
 URL: https://issues.apache.org/jira/browse/SOLR-2644
 Project: Solr
  Issue Type: Bug
  Components: contrib - DataImportHandler
Affects Versions: 3.3
Reporter: Bill Bell
Assignee: Shalin Shekhar Mangar
 Fix For: 3.4, 4.0

 Attachments: SOLR-2644-2.patch, SOLR-2644.patch



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2644) DIH handler - when using threads=2 the default logging is set too high

2011-07-11 Thread Bill Bell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Bell updated SOLR-2644:


Attachment: SOLR-2644-2.patch

 DIH handler - when using threads=2 the default logging is set too high
 --

 Key: SOLR-2644
 URL: https://issues.apache.org/jira/browse/SOLR-2644
 Project: Solr
  Issue Type: Bug
  Components: contrib - DataImportHandler
Affects Versions: 3.3
Reporter: Bill Bell
Assignee: Shalin Shekhar Mangar
 Fix For: 3.4, 4.0

 Attachments: SOLR-2644-2.patch, SOLR-2644.patch



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2644) DIH handler - when using threads=2 the default logging is set too high

2011-07-11 Thread Bill Bell (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063664#comment-13063664
 ] 

Bill Bell commented on SOLR-2644:
-

New patch; you forgot 
solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DocBuilder.java.

Also, I would rather change it to DEBUG and leave it.
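
For illustration, the change being asked for is simply demoting the per-document message; a hedged sketch (the actual logger and message text in DocBuilder may differ):

{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch of the requested change: move the per-add message from INFO to
// DEBUG so threads=2 imports don't flood the log. Illustrative only; the
// real statements live in DocBuilder / ThreadedEntityProcessorWrapper.
class DocBuilderLoggingSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(DocBuilderLoggingSketch.class);

  void onDocAdded(String id) {
    // before: LOG.info("Adding doc " + id);
    if (LOG.isDebugEnabled()) {
      LOG.debug("Adding doc {}", id);
    }
  }
}
{code}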

 DIH handler - when using threads=2 the default logging is set too high
 --

 Key: SOLR-2644
 URL: https://issues.apache.org/jira/browse/SOLR-2644
 Project: Solr
  Issue Type: Bug
  Components: contrib - DataImportHandler
Affects Versions: 3.3
Reporter: Bill Bell
Assignee: Shalin Shekhar Mangar
 Fix For: 3.4, 4.0

 Attachments: SOLR-2644-2.patch, SOLR-2644.patch



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063667#comment-13063667
 ] 

Robert Muir commented on LUCENE-2878:
-

{quote}
Isn't that what highlighting does? This is just highlighting on a much bigger 
set of documents. I don't see why we should prevent users from doing it just 
b/c you don't see the use case. 
{quote}

Well, it is different: I'm not saying we should prevent users from doing it, but 
we shouldn't slow down normal use cases either. I think it's fine for this to be 
a 2-pass operation, because any performance differences from it being 2-pass 
across many documents are going to be completely dwarfed by the term vector 
access!


 Allow Scorer to expose positions and payloads aka. nuke spans 
 --

 Key: LUCENE-2878
 URL: https://issues.apache.org/jira/browse/LUCENE-2878
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: Bulk Postings branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
 PosHighlighter.patch, PosHighlighter.patch



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [JENKINS] Lucene-Solr-tests-only-trunk - Build # 9507 - Failure

2011-07-11 Thread Chris Male
I'm seeing this locally as well.

On Tue, Jul 12, 2011 at 1:55 PM, Apache Jenkins Server 
jenk...@builds.apache.org wrote:

 Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9507/


-- 
Chris Male | Software Developer | JTeam BV.| www.jteam.nl


[jira] [Commented] (LUCENE-3285) Move QueryParsers from contrib/queryparser to queryparser module

2011-07-11 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063671#comment-13063671
 ] 

Chris Male commented on LUCENE-3285:


Committed revision 1145430.

Now moving on to the flexible QP.

 Move QueryParsers from contrib/queryparser to queryparser module
 

 Key: LUCENE-3285
 URL: https://issues.apache.org/jira/browse/LUCENE-3285
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: modules/queryparser
Reporter: Chris Male
 Attachments: LUCENE-3285.patch


 Each of the QueryParsers will be ported across.
 Those which use the flexible parsing framework will be placed under the 
 package flexible.  The StandardQueryParser will be renamed to 
 FlexibleQueryParser and surround.QueryParser will be renamed to 
 SurroundQueryParser.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063672#comment-13063672
 ] 

Grant Ingersoll commented on LUCENE-2878:
-

Yeah, I agree.  I don't want to block the primary use case; I'm just really 
hoping we can have a solution for the second one that elegantly falls out of 
the primary one and doesn't require a two-pass solution.  You are correct on 
the term vector access, but for large enough sets the second search isn't 
trivial, even if it is dwarfed.  Although, I think it may be possible to at 
least access them in document order.
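
A trivial sketch of that last point: once the top hits are collected, re-sorting them by docID lets the second pass read term vectors in index order (assumes plain ScoreDoc hits):

{code}
import java.util.Arrays;
import java.util.Comparator;
import org.apache.lucene.search.ScoreDoc;

// Sketch: sort collected hits by docID so a second, term-vector-reading
// pass proceeds in document order rather than score order.
class DocOrderPass {
  static ScoreDoc[] inDocOrder(ScoreDoc[] hits) {
    ScoreDoc[] copy = hits.clone();
    Arrays.sort(copy, new Comparator<ScoreDoc>() {
      public int compare(ScoreDoc a, ScoreDoc b) {
        return a.doc - b.doc; // doc IDs are non-negative, no overflow risk
      }
    });
    return copy;
  }
}
{code}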

 Allow Scorer to expose positions and payloads aka. nuke spans 
 --

 Key: LUCENE-2878
 URL: https://issues.apache.org/jira/browse/LUCENE-2878
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: Bulk Postings branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
 PosHighlighter.patch, PosHighlighter.patch



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [JENKINS] Lucene-Solr-tests-only-trunk - Build # 9507 - Failure

2011-07-11 Thread Robert Muir
I think this test has incorrect randomization, because it initializes
its random locale and timezone statically (not in @BeforeClass).

You can see this by running the test: it has the same timezone every time.
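
In other words, the anti-pattern is roughly the following (a self-contained sketch, not the actual test code):

import java.util.Locale;
import java.util.Random;

// Sketch of the bug: a static field is initialized at class-load time,
// before the test framework installs the per-run seed, so the "random"
// locale is effectively fixed and not reproducible from the printed seed.
class StaticInitRandomization {
  static Random random = new Random();               // reseeded later by the framework
  static final Locale LOCALE = randomLocale(random); // captured too early

  static Locale randomLocale(Random r) {
    Locale[] all = Locale.getAvailableLocales();
    return all[r.nextInt(all.length)];
  }
}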

On Mon, Jul 11, 2011 at 10:18 PM, Chris Male gento...@gmail.com wrote:
 I'm seeing this locally as well.

 On Tue, Jul 12, 2011 at 1:55 PM, Apache Jenkins Server
 jenk...@builds.apache.org wrote:

 Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9507/


Re: [JENKINS] Lucene-Solr-tests-only-trunk - Build # 9507 - Failure

2011-07-11 Thread Robert Muir
Here's a change that makes the test reproducible (run it a few times
and eventually you get a problematic locale/tz; then the seed will
reproduce the problem):

Index: lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/standard/TestNumericQueryParser.java
===================================================================
--- lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/standard/TestNumericQueryParser.java  (revision 1145431)
+++ lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/standard/TestNumericQueryParser.java  (working copy)
@@ -63,26 +63,30 @@

   final private static int PRECISION_STEP = 8;
   final private static String FIELD_NAME = field;
-  final private static Locale LOCALE = randomLocale(random);
-  final private static TimeZone TIMEZONE = randomTimeZone(random);
-  final private static MapString,Number RANDOM_NUMBER_MAP;
+  private static Locale LOCALE;
+  private static TimeZone TIMEZONE;
+  private static MapString,Number RANDOM_NUMBER_MAP;
   final private static EscapeQuerySyntax ESCAPER = new EscapeQuerySyntaxImpl();
   final private static String DATE_FIELD_NAME = date;
-  final private static int DATE_STYLE = randomDateStyle(random);
-  final private static int TIME_STYLE = randomDateStyle(random);
+  private static int DATE_STYLE;
+  private static int TIME_STYLE;
+  private static Analyzer ANALYZER;

-  final private static Analyzer ANALYZER = new MockAnalyzer(random);
+  private static NumberFormat NUMBER_FORMAT;

-  final private static NumberFormat NUMBER_FORMAT = NumberFormat
-  .getNumberInstance(LOCALE);
+  private static StandardQueryParser qp;

-  final private static StandardQueryParser qp = new StandardQueryParser(
-  ANALYZER);
+  private static NumberDateFormat DATE_FORMAT;

-  final private static NumberDateFormat DATE_FORMAT;
-
-  static {
+  static void initFormats() {
 try {
+  LOCALE = randomLocale(random);
+  TIMEZONE = randomTimeZone(random);
+  DATE_STYLE = randomDateStyle(random);
+  TIME_STYLE = randomDateStyle(random);
+  ANALYZER = new MockAnalyzer(random);
+  NUMBER_FORMAT = NumberFormat.getNumberInstance(LOCALE);
+  qp = new StandardQueryParser(ANALYZER);
   NUMBER_FORMAT.setMaximumFractionDigits((random.nextInt() & 20) + 1);
   NUMBER_FORMAT.setMinimumFractionDigits((random.nextInt() & 20) + 1);
   NUMBER_FORMAT.setMaximumIntegerDigits((random.nextInt() & 20) + 1);
@@ -145,6 +149,7 @@

   @BeforeClass
   public static void beforeClass() throws Exception {
+    initFormats();
 directory = newDirectory();
 RandomIndexWriter writer = new RandomIndexWriter(random, directory,
 newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random))


On Mon, Jul 11, 2011 at 10:30 PM, Robert Muir rcm...@gmail.com wrote:
 I think this test has incorrect randomization, because it initializes
 its random locale and timezone statically (not in @beforeclass).

 You can see this by running the test, it has the same timezone every time.

 On Mon, Jul 11, 2011 at 10:18 PM, Chris Male gento...@gmail.com wrote:
 I'm seeing this locally as well.

 On Tue, Jul 12, 2011 at 1:55 PM, Apache Jenkins Server
 jenk...@builds.apache.org wrote:

 Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9507/


[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Mike Sokolov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063689#comment-13063689
 ] 

Mike Sokolov commented on LUCENE-2878:
--

I hope you all will review the patch and see what you think.  My gut at the 
moment tells me we can have it both ways with a bit more tinkering.  I think 
that as it stands now, if you ask for positions you get them in more or less 
the most efficient way we know how.  At the moment there is some performance 
hit when you don't want positions, but I think we can deal with that.  Simon 
had the idea that we could rely on the JIT compiler to optimize away the test 
we have if we set it up as a final false boolean (totally doable if we set up 
the state during Scorer construction), which would be great and convenient.  
I'm no compiler expert, so I'm not sure how reliable that is - is it?  But we 
could also totally separate the two cases (say with a wrapping Scorer - no 
need for compiler tricks) while still allowing us to retrieve positions while 
querying, collecting docs, and scoring.
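
A minimal sketch of the final-boolean idea (hypothetical names; whether the JIT reliably removes the dead branch is exactly the open question):

{code}
// Sketch: the flag is final and fixed at construction, so after inlining
// the JIT can often prove the branch dead for a given instance and drop
// it from the hot scoring loop. Names are illustrative only.
abstract class PositionsOptionalScorer {
  private final boolean needsPositions;

  PositionsOptionalScorer(boolean needsPositions) {
    this.needsPositions = needsPositions; // decided once, up front
  }

  final float scoreDoc(int doc) {
    float score = computeScore(doc);
    if (needsPositions) { // constant for this instance
      recordPositions(doc);
    }
    return score;
  }

  abstract float computeScore(int doc);

  abstract void recordPositions(int doc);
}
{code}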

 Allow Scorer to expose positions and payloads aka. nuke spans 
 --

 Key: LUCENE-2878
 URL: https://issues.apache.org/jira/browse/LUCENE-2878
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: Bulk Postings branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
 LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
 PosHighlighter.patch, PosHighlighter.patch


 Currently we have two somewhat separate types of queries: the ones that can 
 make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
 doesn't really do scoring comparable to what other queries do, and at the end 
 of the day they duplicate a lot of code all over Lucene. Span*Queries are 
 also limited to other Span*Query instances, such that you cannot use a 
 TermQuery or a BooleanQuery with SpanNear or anything like that. 
 Besides the Span*Query limitation, other queries lack a quite interesting 
 feature: they cannot score based on term proximity, since scorers don't 
 expose any positional information. All those problems bugged me for a while, 
 so I started working on this using the bulkpostings API. I would have done 
 the first cut on trunk, but TermScorer there works on a BlockReader that does 
 not expose positions, while the one in this branch does. I started adding a 
 new Positions class which users can pull from a scorer; to prevent 
 unnecessary positions enums I added ScorerContext#needsPositions and 
 eventually Scorer#needsPayloads to create the corresponding enum on demand. 
 Yet currently only TermQuery / TermScorer implements this API; others simply 
 return null instead. 
 To show that the API really works, and that our BulkPostings work fine with 
 positions too, I cut TermSpanQuery over to use a TermScorer under the hood 
 and nuked TermSpans entirely. A nice side effect of this was that the 
 Position BulkReading implementation got some exercise, which now all works 
 with positions :), while payloads for bulk reading are kind of experimental 
 in the patch and only work with the Standard codec. 
 So all spans now work on top of TermScorer (I truly hate spans since today), 
 including the ones that need payloads (StandardCodec ONLY)!! I didn't bother 
 to implement the other codecs yet, since I want to get feedback on the API 
 and on this first cut before I go on with it. I will upload the 
 corresponding patch in a minute. 
 I also had to cut SpanQuery.getSpans(IR) over to 
 SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk 
 first, but after today's pain I need a break :).
 The patch passes all core tests 
 (org.apache.lucene.search.highlight.HighlighterTest still fails, but I 
 didn't look into the MemoryIndex BulkPostings API yet).
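
 As a rough illustration of how a caller might consume this API - the names 
 Positions, ScorerContext#needsPositions, and a positions() accessor come 
 from the description above, but the exact signatures here are assumptions, 
 not the patch:

 // Sketch only: signatures are assumed, not taken from the patch.
 ScorerContext ctx = ScorerContext.def().needsPositions(true); // ask up front
 Scorer scorer = weight.scorer(readerContext, ctx);  // enum created on demand
 int doc;
 while ((doc = scorer.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
   Positions positions = scorer.positions(); // null for queries that don't
                                             // implement the API yet
   // ... consume positions for proximity scoring or highlighting ...
 }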

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2641) Auto Facet Selection component

2011-07-11 Thread Toke Eskildsen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063722#comment-13063722
 ] 

Toke Eskildsen commented on SOLR-2641:
--

This looks like a variant of hierarchical faceting. With popularity count as 
the selector, paths like color/green and memory_size/4GB would produce the 
desired result.

 Auto Facet Selection component
 --

 Key: SOLR-2641
 URL: https://issues.apache.org/jira/browse/SOLR-2641
 Project: Solr
  Issue Type: Improvement
  Components: SearchComponents - other
Reporter: Erik Hatcher
Assignee: Erik Hatcher
Priority: Minor
 Attachments: SOLR_2641.patch


 It sure would be nice if you could have Solr automatically select field(s) 
 for faceting dynamically, based on the profile of the results.  For example, 
 you're indexing disparate types of products, all with varying attributes 
 (color and size for apparel, memory_size for electronics, subject for 
 books, etc.), and a user searches for ipod, where most matching products 
 have color and memory_size attributes... let's automatically facet 
 on those fields.
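
 A minimal sketch of the selection step this describes, under the assumption 
 that we profile the stored fields of the top hits and facet on the fields 
 with the highest coverage (all names here are illustrative and not taken 
 from the attached SOLR_2641.patch):

 import java.util.*;

 // Illustrative sketch only: given the stored fields of the top hits,
 // count how many hits carry each field and pick the most common fields
 // as the automatic facet fields.
 public class AutoFacetSelector {

   static List<String> selectFacetFields(List<Map<String, String>> hits,
                                         int limit) {
     final Map<String, Integer> coverage = new HashMap<String, Integer>();
     for (Map<String, String> hit : hits) {
       for (String field : hit.keySet()) {
         Integer count = coverage.get(field);
         coverage.put(field, count == null ? 1 : count + 1);
       }
     }
     List<String> fields = new ArrayList<String>(coverage.keySet());
     Collections.sort(fields, new Comparator<String>() {
       public int compare(String a, String b) {
         return coverage.get(b) - coverage.get(a); // highest coverage first
       }
     });
     return fields.subList(0, Math.min(limit, fields.size()));
   }
 }

 For an ipod query this would typically surface color and memory_size, which 
 the component could then hand to faceting as ordinary facet.field parameters.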

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2382) DIH Cache Improvements

2011-07-11 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13063727#comment-13063727
 ] 

Noble Paul commented on SOLR-2382:
--

My apologies for the delay.

The problem w/ the patch is its size/scope. You may not need to open other 
issues, but stuff like abstracting DIHWriter, DIHPropertiesWriter, etc. can be 
given as separate patches in the same issue, and I can commit them straight 
away. Though the issue is about cache improvements, it goes far beyond that 
scope. Committing it as a whole is difficult. 





 DIH Cache Improvements
 --

 Key: SOLR-2382
 URL: https://issues.apache.org/jira/browse/SOLR-2382
 Project: Solr
  Issue Type: New Feature
  Components: contrib - DataImportHandler
Reporter: James Dyer
Priority: Minor
 Attachments: SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
 SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
 SOLR-2382.patch


 Functionality:
  1. Provide a pluggable caching framework for DIH so that users can choose a 
 cache implementation that best suits their data and application.
  
  2. Provide a means to temporarily cache a child Entity's data without 
 needing to create a special cached implementation of the Entity Processor 
 (such as CachedSqlEntityProcessor).
  
  3. Provide a means to write the final (root entity) DIH output to a cache 
 rather than to Solr.  Then provide a way for a subsequent DIH call to use the 
 cache as an Entity input.  Also provide the ability to do delta updates on 
 such persistent caches.
  
  4. Provide the ability to partition data across multiple caches that can 
 then be fed back into DIH and indexed either to varying Solr Shards, or to 
 the same Core in parallel.
 Use Cases:
  1. We needed a flexible & scalable way to temporarily cache child-entity 
 data prior to joining to parent entities.
   - Using SqlEntityProcessor with Child Entities can cause an n+1 select 
 problem.
   - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching 
 mechanism and does not scale.
   - There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
  
  2. We needed the ability to gather data from long-running entities by a 
 process that runs separately from our main indexing process.
   
  3. We wanted the ability to do a delta import of only the entities that 
 changed.
   - Lucene/Solr requires entire documents to be re-indexed, even if only a 
 few fields changed.
   - Our data comes from 50+ complex sql queries and/or flat files.
   - We do not want to incur overhead re-gathering all of this data if only 1 
 entity's data changed.
   - Persistent DIH caches solve this problem.
   
  4. We want the ability to index several documents in parallel (using 1.4.1, 
 which did not have the threads parameter).
  
  5. In the future, we may need to use Shards, creating a need to easily 
 partition our source data into Shards.
 Implementation Details:
  1. De-couple EntityProcessorBase from caching (a minimal interface sketch 
 appears after this list).  
   - Created a new interface, DIHCache, & two implementations:  
 - SortedMapBackedCache - an in-memory cache, used as the default with 
 CachedSqlEntityProcessor (now deprecated).
 - BerkleyBackedCache - a disk-backed cache, dependent on bdb-je, tested 
 with je-4.1.6.jar
- NOTE: the existing Lucene Contrib db project uses je-3.3.93.jar.  
 I believe this may be incompatible due to generics usage.
- NOTE: I did not modify the ant script to automatically get this jar, 
 so to use or evaluate this patch, download bdb-je from 
 http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html 
  
  2. Allow Entity Processors to take a cacheImpl parameter to cause the 
 entity data to be cached (see EntityProcessorBase  DIHCacheProperties).
  
  3. Partially De-couple SolrWriter from DocBuilder
   - Created a new interface DIHWriter, & two implementations:
- SolrWriter (refactored)
- DIHCacheWriter (allows DIH to write ultimately to a Cache).

  4. Create a new Entity Processor, DIHCacheProcessor, which reads a 
 persistent Cache as DIH Entity Input.
  
  5. Support a partition parameter with both DIHCacheWriter and 
 DIHCacheProcessor to allow for easy partitioning of source entity data.
  
  6. Change the semantics of entity.destroy()
   - Previously, it was being called on each iteration of 
 DocBuilder.buildDocument().
   - Now it does one-time cleanup tasks (like closing or deleting a 
 disk-backed cache) once the entity processor is completed.
   - The only out-of-the-box entity processor that previously implemented 
 destroy() was LineEntityProcessor, so this is not a very invasive change.
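
 A minimal sketch of what the pluggable DIHCache interface could look like, 
 inferred from points 1-6 above; the real interface in SOLR-2382.patch may 
 use different method names and signatures:

 import java.util.Iterator;
 import java.util.Map;

 // Hypothetical sketch only; see SOLR-2382.patch for the actual interface.
 public interface DIHCache extends Iterable<Map<String, Object>> {

   /** Open the cache, creating the backing store (in-memory map, bdb-je, etc.). */
   void open(Map<String, Object> initProps);

   /** Add one row of entity data, keyed by the configured primary-key field. */
   void add(Map<String, Object> row);

   /** Return all cached rows matching the key, e.g. for child-entity joins. */
   Iterator<Map<String, Object>> lookup(Object key);

   /** Flush pending writes so a later DIH run can read a persistent cache. */
   void flush();

   /** One-time cleanup per entity.destroy(): close or delete the backing store. */
   void close();
 }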
 General Notes:
 We are near completion in converting our search functionality from a legacy 
 search engine to Solr.  However, I found that DIH did not