Re: [Lucene.Net] Contribution

2011-10-17 Thread Danijel Kecman
thank you stefan,

will do so.

best regards,
danijel

On Fri, Sep 23, 2011 at 6:03 AM, Stefan Bodewig bode...@apache.org wrote:

 On 2011-09-22, Danijel Kecman wrote:

  i would like to contribute.

 welcome Danijel.

 The best way to start contributing is by looking at the issues in JIRA:
 pick one and start providing patches there, as well as engaging in
 discussion on this list.

 Cheers

Stefan



[JENKINS] Solr-trunk - Build # 1647 - Failure

2011-10-17 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Solr-trunk/1647/

1 tests failed.
REGRESSION:  org.apache.solr.search.TestRealTimeGet.testStressGetRealtime

Error Message:
java.lang.AssertionError: Some threads threw uncaught exceptions!

Stack Trace:
java.lang.RuntimeException: java.lang.AssertionError: Some threads threw 
uncaught exceptions!
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:739)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:89)
at 
org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:149)
at 
org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:51)
at 
org.apache.lucene.util.LuceneTestCase.checkUncaughtExceptionsAfter(LuceneTestCase.java:767)
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:711)




Build Log (for compile errors):
[...truncated 28831 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-17 Thread Chris Male (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128744#comment-13128744
 ] 

Chris Male commented on LUCENE-1536:


Is the only question mark remaining around the BooleanWeight work? If so, I 
think it's definitely worth examining that in a wider, separate issue after this 
is committed.

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536_hack.patch, 
 changes-yonik-uwe.patch, luceneutil.patch


 I ran some performance tests, comparing applying a filter via the
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to an iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  "1-X" means an OR query, eg "1-4"
 means "1 OR 2 OR 3 OR 4", whereas "+1-4" is an AND query, ie "1 AND 2
 AND 3 AND 4".  "u s" means "united states" (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method "high" means I use the random-access filter API in
 IndexSearcher's main loop.  Method "low" means I use the random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where the filter is applied as an
 iterator up high (ie in IndexSearcher's search loop).
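
 To make the comparison concrete, here is a minimal sketch of the two
 consumption styles (assuming a random-access accessor on DocIdSet like the
 bits() method this issue proposes; names are illustrative only):

 {code}
 // (a) iterator API: walk a DocIdSetIterator alongside the scorer.
 DocIdSetIterator iterator = docIdSet.iterator();
 int filteredDoc = iterator.advance(scorerDoc); // next accepted doc >= scorerDoc

 // (b) random-access API: probe a Bits instance per candidate doc.
 Bits bits = docIdSet.bits();                   // may be null if unsupported
 boolean accepted = bits != null && bits.get(scorerDoc);
 {code}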

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-17 Thread Uwe Schindler (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128749#comment-13128749
 ] 

Uwe Schindler commented on LUCENE-1536:
---

bq. Is the only question mark remaining around the BooleanWeight work? If so, I 
think it's definitely worth examining that in a wider, separate issue after this 
is committed.

For now the patch always requests scorers in order, so BooleanWeight does not 
get mixed up across different segments. This is no different from current 
trunk, where Scorers are always requested in order if filters are used. A 
future optimization would be to use out-of-order scoring when random-access 
bits are available.
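
To illustrate, a minimal sketch of the call site this implies (assuming the
patch's Weight.scorer(context, scoreDocsInOrder, topScorer, acceptDocs) shape;
the variable names are illustrative):

{code}
// With a filter present, always request an in-order scorer for now, and
// hand the filter's random-access bits (if any) to the scorer as acceptDocs.
Bits acceptDocs = (filterDocIdSet == null) ? null : filterDocIdSet.bits();
boolean scoreDocsInOrder = true; // always in order when a filter is used
Scorer scorer = weight.scorer(context, scoreDocsInOrder, topScorer, acceptDocs);
{code}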

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536_hack.patch, 
 changes-yonik-uwe.patch, luceneutil.patch


 I ran some performance tests, comparing applying a filter via the
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to an iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  "1-X" means an OR query, eg "1-4"
 means "1 OR 2 OR 3 OR 4", whereas "+1-4" is an AND query, ie "1 AND 2
 AND 3 AND 4".  "u s" means "united states" (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method "high" means I use the random-access filter API in
 IndexSearcher's main loop.  Method "low" means I use the random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where the filter is applied as an
 iterator up high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-17 Thread Chris Male (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128754#comment-13128754
 ] 

Chris Male commented on LUCENE-1536:


I was more referring to Robert's comment: 

bq. It seems to me these parameters (topLevel/scoresInOrder) really shouldn't 
be parameters to weight.scorer()!



 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536_hack.patch, 
 changes-yonik-uwe.patch, luceneutil.patch


 I ran some performance tests, comparing applying a filter via the
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to an iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  "1-X" means an OR query, eg "1-4"
 means "1 OR 2 OR 3 OR 4", whereas "+1-4" is an AND query, ie "1 AND 2
 AND 3 AND 4".  "u s" means "united states" (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method "high" means I use the random-access filter API in
 IndexSearcher's main loop.  Method "low" means I use the random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where the filter is applied as an
 iterator up high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-10-17 Thread sebastian L. (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128762#comment-13128762
 ] 

sebastian L. commented on LUCENE-3440:
--

Hi Koji, the patch doesn't work because of 
https://issues.apache.org/jira/browse/LUCENE-3513.

bq. And I found a lot of test errors...

Frankly, I didn't run the tests because I thought the changes provided with the 
last patch shouldn't affect the original behavior. 
I'll have a look into it, but this may take some time, since I'm not familiar 
with the test framework.  

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-8.patch, LUCENE-3440.patch, 
 LUCENE-4.0-SNAPSHOT-3440-9.patch, weight-vs-boost_table01.html, 
 weight-vs-boost_table02.html


 The FastVectorHighlighter assigns an equal weight to every term found in a 
 fragment, which ranks fragments with a high number of words or, in the worst 
 case, a high number of very common words higher than fragments that contain 
 *all* of the terms used in the original query. 
 This patch provides ordered fragments with IDF-weighted terms: 
 total weight = total weight + (IDF of unique term per fragment * boost of 
 query); 
 The ranking formula should be the same, or at least similar, to the one used 
 in org.apache.lucene.search.highlight.QueryTermScorer.
 The patch is simple, but it works for us. 
 Some ideas:
 - A better approach would be moving the whole fragment scoring into a 
 separate class.
 - Switch scoring via a parameter 
 - Exact phrases should be given an even better score, regardless of whether a 
 phrase query was executed or not 
 - edismax/dismax parameters pf, ps and pf^boost should be observed and 
 corresponding fragments should be ranked higher 
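
 A minimal sketch of the weighting rule above (a hypothetical helper, not the
 patch itself; the IDF formula is the one from DefaultSimilarity):

 {code}
 // Each *unique* term in a fragment contributes its IDF once, scaled by the
 // query boost, instead of every term carrying an equal weight.
 float fragmentWeight(Set<String> uniqueTerms, IndexReader reader,
                      String field, float queryBoost) throws IOException {
   float totalWeight = 0f;
   int numDocs = reader.numDocs();
   for (String term : uniqueTerms) {
     int docFreq = reader.docFreq(new Term(field, term));
     float idf = (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
     totalWeight += idf * queryBoost;
   }
   return totalWeight;
 }
 {code}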

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-17 Thread Uwe Schindler (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128767#comment-13128767
 ] 

Uwe Schindler commented on LUCENE-1536:
---

Yes, that should be sorted out in another issue. We have a working fix; the 
rest is optimization and unrelated API changes.

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536_hack.patch, 
 changes-yonik-uwe.patch, luceneutil.patch


 I ran some performance tests, comparing applying a filter via the
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to an iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  "1-X" means an OR query, eg "1-4"
 means "1 OR 2 OR 3 OR 4", whereas "+1-4" is an AND query, ie "1 AND 2
 AND 3 AND 4".  "u s" means "united states" (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method "high" means I use the random-access filter API in
 IndexSearcher's main loop.  Method "low" means I use the random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where the filter is applied as an
 iterator up high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2842) Re-factor UpdateChain and UpdateProcessor interfaces

2011-10-17 Thread Commented

[ 
https://issues.apache.org/jira/browse/SOLR-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128793#comment-13128793
 ] 

Jan Høydahl commented on SOLR-2842:
---

Some valid points there. I thought I saw a possibility for generalization that 
would help solve SOLR-1526, but wanted to flesh out the feasibility here.

So far I do not see any example other than Tika extraction which could really 
benefit from being done client-side. There may be others, but perhaps not 
enough to justify this change.

Another option for SOLR-1526 could be to provide a 
ClientExtractingUpdateProcessorFactory class which instantiates the 
ExtractingUpdateProcessor for client-side use. Then if other processors are 
useful on the client side as well, people could simply write a client-side 
factory for them?
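
A minimal sketch of that option (purely hypothetical: neither class exists yet,
and the ExtractingUpdateProcessor constructor is assumed):

{code}
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

// Reuses the server-side Tika processor from SOLR-1763, but is meant to be
// instantiated inside the SolrJ client so extraction happens before the
// documents cross the wire.
public class ClientExtractingUpdateProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new ExtractingUpdateProcessor(req, rsp, next); // assumed signature
  }
}
{code}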

 Re-factor UpdateChain and UpdateProcessor interfaces
 

 Key: SOLR-2842
 URL: https://issues.apache.org/jira/browse/SOLR-2842
 Project: Solr
  Issue Type: Improvement
  Components: update
Reporter: Jan Høydahl

 The UpdateChain's main task is to send SolrInputDocuments through a chain of 
 UpdateRequestProcessors in order to transform them in some way and then 
 (typically) indexing them.
 This generic pipeline concept would also be useful on the client side 
 (SolrJ), so that we could choose to do parts or all of the processing on the 
 client. The most prominent use case is extracting text (Tika) from large 
 binary documents residing on local storage on the client(s). Streaming 
 hundreds of MB over to Solr for processing is not efficient. See SOLR-1526.
 We're already implementing Tika as an UpdateProcessor in SOLR-1763, and what 
 would be more natural than reusing this - and any other processor - on the 
 client side?
 However, for this to be possible, some interfaces need to change slightly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-10-17 Thread Koji Sekiguchi (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128795#comment-13128795
 ] 

Koji Sekiguchi commented on LUCENE-3440:


Hi sebastian,

{quote}
Frankly, I didn't run the tests because I thought the changes provided with the 
last patch shouldn't affect the original behavior.
I'll have a look into it, but this may take some time, since I'm not familiar 
with the test framework. 
{quote}

Ok, no problem. I'll look at the test case (hopefully next week or so). But can 
you take care of the following so we can go forward?

{quote}
Ah, sebastian, I think you need to check "Grant license to ASF for inclusion 
in ASF works" when you attach your patch. Can you remove the latest patches and 
reattach them with that flag? Thanks!
{quote}


 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-8.patch, LUCENE-3440.patch, 
 LUCENE-4.0-SNAPSHOT-3440-9.patch, weight-vs-boost_table01.html, 
 weight-vs-boost_table02.html


 The FastVectorHighlighter assigns an equal weight to every term found in a 
 fragment, which ranks fragments with a high number of words or, in the worst 
 case, a high number of very common words higher than fragments that contain 
 *all* of the terms used in the original query. 
 This patch provides ordered fragments with IDF-weighted terms: 
 total weight = total weight + (IDF of unique term per fragment * boost of 
 query); 
 The ranking formula should be the same, or at least similar, to the one used 
 in org.apache.lucene.search.highlight.QueryTermScorer.
 The patch is simple, but it works for us. 
 Some ideas:
 - A better approach would be moving the whole fragment scoring into a 
 separate class.
 - Switch scoring via a parameter 
 - Exact phrases should be given an even better score, regardless of whether a 
 phrase query was executed or not 
 - edismax/dismax parameters pf, ps and pf^boost should be observed and 
 corresponding fragments should be ranked higher 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-10-17 Thread Koji Sekiguchi (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-3440:
---

Attachment: (was: LUCENE-3440.patch)

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-8.patch, 
 LUCENE-4.0-SNAPSHOT-3440-9.patch, weight-vs-boost_table01.html, 
 weight-vs-boost_table02.html


 The FastVectorHighlighter assigns an equal weight to every term found in a 
 fragment, which ranks fragments with a high number of words or, in the worst 
 case, a high number of very common words higher than fragments that contain 
 *all* of the terms used in the original query. 
 This patch provides ordered fragments with IDF-weighted terms: 
 total weight = total weight + (IDF of unique term per fragment * boost of 
 query); 
 The ranking formula should be the same, or at least similar, to the one used 
 in org.apache.lucene.search.highlight.QueryTermScorer.
 The patch is simple, but it works for us. 
 Some ideas:
 - A better approach would be moving the whole fragment scoring into a 
 separate class.
 - Switch scoring via a parameter 
 - Exact phrases should be given an even better score, regardless of whether a 
 phrase query was executed or not 
 - edismax/dismax parameters pf, ps and pf^boost should be observed and 
 corresponding fragments should be ranked higher 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-10-17 Thread Koji Sekiguchi (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128799#comment-13128799
 ] 

Koji Sekiguchi commented on LUCENE-3440:


I've removed my latest patch, because it had the "ASF granted license" flag set, 
which was not right: the patch was totally based on sebastian's patch, which 
was not granted to ASF.

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-8.patch, 
 LUCENE-4.0-SNAPSHOT-3440-9.patch, weight-vs-boost_table01.html, 
 weight-vs-boost_table02.html


 The FastVectorHighlighter assigns an equal weight to every term found in a 
 fragment, which ranks fragments with a high number of words or, in the worst 
 case, a high number of very common words higher than fragments that contain 
 *all* of the terms used in the original query. 
 This patch provides ordered fragments with IDF-weighted terms: 
 total weight = total weight + (IDF of unique term per fragment * boost of 
 query); 
 The ranking formula should be the same, or at least similar, to the one used 
 in org.apache.lucene.search.highlight.QueryTermScorer.
 The patch is simple, but it works for us. 
 Some ideas:
 - A better approach would be moving the whole fragment scoring into a 
 separate class.
 - Switch scoring via a parameter 
 - Exact phrases should be given an even better score, regardless of whether a 
 phrase query was executed or not 
 - edismax/dismax parameters pf, ps and pf^boost should be observed and 
 corresponding fragments should be ranked higher 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-10-17 Thread sebastian L. (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sebastian L. updated LUCENE-3440:
-

Attachment: LUCENE-4.0-SNAPSHOT-3440-9.patch

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-8.patch, 
 LUCENE-4.0-SNAPSHOT-3440-9.patch, LUCENE-4.0-SNAPSHOT-3440-9.patch, 
 weight-vs-boost_table01.html, weight-vs-boost_table02.html


 The FastVectorHighlighter assigns an equal weight to every term found in a 
 fragment, which ranks fragments with a high number of words or, in the worst 
 case, a high number of very common words higher than fragments that contain 
 *all* of the terms used in the original query. 
 This patch provides ordered fragments with IDF-weighted terms: 
 total weight = total weight + (IDF of unique term per fragment * boost of 
 query); 
 The ranking formula should be the same, or at least similar, to the one used 
 in org.apache.lucene.search.highlight.QueryTermScorer.
 The patch is simple, but it works for us. 
 Some ideas:
 - A better approach would be moving the whole fragment scoring into a 
 separate class.
 - Switch scoring via a parameter 
 - Exact phrases should be given an even better score, regardless of whether a 
 phrase query was executed or not 
 - edismax/dismax parameters pf, ps and pf^boost should be observed and 
 corresponding fragments should be ranked higher 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-10-17 Thread sebastian L. (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128802#comment-13128802
 ] 

sebastian L. commented on LUCENE-3440:
--

bq. Ah, sebastian, I think you need to check "Grant license to ASF for 
inclusion in ASF works" when you attach your patch. Can you remove the latest 
patches and reattach them with that flag? Thanks!

Sorry, I forgot that. Done.

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-8.patch, 
 LUCENE-4.0-SNAPSHOT-3440-9.patch, LUCENE-4.0-SNAPSHOT-3440-9.patch, 
 weight-vs-boost_table01.html, weight-vs-boost_table02.html


 The FastVectorHighlighter assigns an equal weight to every term found in a 
 fragment, which ranks fragments with a high number of words or, in the worst 
 case, a high number of very common words higher than fragments that contain 
 *all* of the terms used in the original query. 
 This patch provides ordered fragments with IDF-weighted terms: 
 total weight = total weight + (IDF of unique term per fragment * boost of 
 query); 
 The ranking formula should be the same, or at least similar, to the one used 
 in org.apache.lucene.search.highlight.QueryTermScorer.
 The patch is simple, but it works for us. 
 Some ideas:
 - A better approach would be moving the whole fragment scoring into a 
 separate class.
 - Switch scoring via a parameter 
 - Exact phrases should be given an even better score, regardless of whether a 
 phrase query was executed or not 
 - edismax/dismax parameters pf, ps and pf^boost should be observed and 
 corresponding fragments should be ranked higher 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-3522) TermsFilter.getDocIdSet(context) NPE on missing field

2011-10-17 Thread Michael McCandless (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-3522:
--

Assignee: Michael McCandless

 TermsFilter.getDocIdSet(context) NPE on missing field
 -

 Key: LUCENE-3522
 URL: https://issues.apache.org/jira/browse/LUCENE-3522
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/other
Affects Versions: 4.0
Reporter: Dan Climan
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-3522.patch


 If the context does not contain the field for a term when calling 
 TermsFilter.getDocIdSet(AtomicReaderContext context), then a 
 NullPointerException is thrown because the Terms instance is not checked for 
 null before getting the iterator.
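
 A minimal sketch of the kind of guard needed (hypothetical; the actual patch
 may differ in detail):

 {code}
 // Inside getDocIdSet: the requested field may be entirely absent from this
 // reader, in which case Terms is null and no docs can match.
 Fields fields = reader.fields();
 Terms terms = (fields == null) ? null : fields.terms(field);
 if (terms == null) {
   return result; // leave the bit set empty for this segment
 }
 TermsEnum termsEnum = terms.iterator();
 {code}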

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3522) TermsFilter.getDocIdSet(context) NPE on missing field

2011-10-17 Thread Shai Erera (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128812#comment-13128812
 ] 

Shai Erera commented on LUCENE-3522:


Good catch, Dan!

Patch looks good, but I have some comments about the test:
# You don't close the directories at the end of it, and the test fails because 
of that.
# I think it can be simplified to create just one Directory, with f1:content 
and request f2:content. I actually tried it and the test still fails (NPE 
reproduced without your fix).

Here is the modified, more compact, test:
{code}
  public void testMissingField() throws Exception {
    // LUCENE-3522: if requested field does not exist in the index, TermsFilter
    // threw NPE.
    Directory dir = newDirectory();
    RandomIndexWriter writer = new RandomIndexWriter(random, dir);
    Document doc = new Document();
    doc.add(newField("f1", "content", StringField.TYPE_STORED));
    writer.addDocument(doc);
    IndexReader reader = writer.getReader();
    writer.close();

    TermsFilter tf = new TermsFilter();
    tf.addTerm(new Term("f2", "content"));

    FixedBitSet bits = (FixedBitSet) tf.getDocIdSet(
        reader.getTopReaderContext().leaves()[0]);
    assertTrue("Must be <= 0", bits.cardinality() <= 0);
    reader.close();
    dir.close();
  }
{code}

Would you mind changing the test case to this compact one? Or did you want to 
demonstrate something else with the two readers?

 TermsFilter.getDocIdSet(context) NPE on missing field
 -

 Key: LUCENE-3522
 URL: https://issues.apache.org/jira/browse/LUCENE-3522
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/other
Affects Versions: 4.0
Reporter: Dan Climan
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-3522.patch


 If the context does not contain the field for a term when calling 
 TermsFilter.getDocIdSet(AtomicReaderContext context), then a 
 NullPointerException is thrown because the Terms instance is not checked for 
 null before getting the iterator.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-3522) TermsFilter.getDocIdSet(context) NPE on missing field

2011-10-17 Thread Michael McCandless (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-3522.


   Resolution: Fixed
Fix Version/s: 4.0

Thanks Dan!

I committed to trunk and backported the test case to 3.x.  I had to add missing 
rd1/2.close() at the end of the test case.

 TermsFilter.getDocIdSet(context) NPE on missing field
 -

 Key: LUCENE-3522
 URL: https://issues.apache.org/jira/browse/LUCENE-3522
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/other
Affects Versions: 4.0
Reporter: Dan Climan
Assignee: Michael McCandless
Priority: Minor
 Fix For: 4.0

 Attachments: LUCENE-3522.patch


 If the context does not contain the field for a term when calling 
 TermsFilter.getDocIdSet(AtomicReaderContext context), then a 
 NullPointerException is thrown because the Terms instance is not checked for 
 null before getting the iterator.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3522) TermsFilter.getDocIdSet(context) NPE on missing field

2011-10-17 Thread Michael McCandless (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128816#comment-13128816
 ] 

Michael McCandless commented on LUCENE-3522:


Dan/Shai, feel free to fix the test case if you want... I didn't see your 
comments here until after I committed!

 TermsFilter.getDocIdSet(context) NPE on missing field
 -

 Key: LUCENE-3522
 URL: https://issues.apache.org/jira/browse/LUCENE-3522
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/other
Affects Versions: 4.0
Reporter: Dan Climan
Assignee: Michael McCandless
Priority: Minor
 Fix For: 4.0

 Attachments: LUCENE-3522.patch


 If the context does not contain the field for a term when calling 
 TermsFilter.getDocIdSet(AtomicReaderContext context), then a 
 NullPointerException is thrown because the Terms instance is not checked for 
 null before getting the iterator.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3522) TermsFilter.getDocIdSet(context) NPE on missing field

2011-10-17 Thread Shai Erera (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-3522:
---

Fix Version/s: 3.5

Added 3.5 as a fix version as well

 TermsFilter.getDocIdSet(context) NPE on missing field
 -

 Key: LUCENE-3522
 URL: https://issues.apache.org/jira/browse/LUCENE-3522
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/other
Affects Versions: 4.0
Reporter: Dan Climan
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3522.patch


 If the context does not contain the field for a term when calling 
 TermsFilter.getDocIdSet(AtomicReaderContext context), then a 
 NullPointerException is thrown because the Terms instance is not checked for 
 null before getting the iterator.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3522) TermsFilter.getDocIdSet(context) NPE on missing field

2011-10-17 Thread Michael McCandless (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128820#comment-13128820
 ] 

Michael McCandless commented on LUCENE-3522:


OK, that's fine, but technically this bug didn't exist in 3.5... I only 
backported the test case.

 TermsFilter.getDocIdSet(context) NPE on missing field
 -

 Key: LUCENE-3522
 URL: https://issues.apache.org/jira/browse/LUCENE-3522
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/other
Affects Versions: 4.0
Reporter: Dan Climan
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3522.patch


 If the context does not contain the field for a term when calling 
 TermsFilter.getDocIdSet(AtomicReaderContext context), then a 
 NullPointerException is thrown because the Terms instance is not checked for 
 null before getting the iterator.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS-MAVEN] Lucene-Solr-Maven-3.x #271: POMs out of sync

2011-10-17 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-Maven-3.x/271/

No tests ran.

Build Log (for compile errors):
[...truncated 19697 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-434) Lucene database bindings

2011-10-17 Thread Mark Harwood (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128904#comment-13128904
 ] 

Mark Harwood commented on LUCENE-434:
-

Note: Lucene 4.0's DocValueFields (aka column-stride fields) will be of benefit 
here for efficient KeyMap implementations - e.g. updating the load routines in 
CachedKeyMapImpl. An overhaul is also needed, e.g. to support segment-level 
mappings.

 Lucene database bindings
 

 Key: LUCENE-434
 URL: https://issues.apache.org/jira/browse/LUCENE-434
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/other
Reporter: Mark Harwood
Priority: Minor
 Attachments: LuceneDb.zip


 Code and examples for embedding Lucene in HSQLDB and Derby relational 
 databases.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS-MAVEN] Lucene-Solr-Maven-trunk #268: POMs out of sync

2011-10-17 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-Maven-trunk/268/

No tests ran.

Build Log (for compile errors):
[...truncated 18924 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS] Lucene-Solr-tests-only-trunk - Build # 10867 - Failure

2011-10-17 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/10867/

1 tests failed.
REGRESSION:  org.apache.solr.cloud.ZkSolrClientTest.testReconnect

Error Message:
Node does not exist, but it should

Stack Trace:
junit.framework.AssertionFailedError: Node does not exist, but it should
at 
org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:149)
at 
org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:51)
at 
org.apache.solr.cloud.ZkSolrClientTest.testReconnect(ZkSolrClientTest.java:149)
at 
org.apache.lucene.util.LuceneTestCase$2$1.evaluate(LuceneTestCase.java:610)




Build Log (for compile errors):
[...truncated 7783 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2841) Scriptable UpdateRequestChain

2011-10-17 Thread Commented

[ 
https://issues.apache.org/jira/browse/SOLR-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128966#comment-13128966
 ] 

Jan Høydahl commented on SOLR-2841:
---

The DSL could be based on Groovy, JRuby, Jython or JS. Here's my quasi-sketch 
of a Groovy example from SOLR-2823:

...This approach also solves another wish of mine, namely being able to define 
chains outside of solrconfig.xml. Logically, configuring schema and document 
processing is done by a "content guy", but configuring solrconfig.xml is done 
by the hardware/operations guys. Imagine a solr/conf/pipeline.groovy defined 
in solrconfig.xml:

{code:xml}
<updateProcessorChain class="solr.ScriptedUpdateProcessorChainFactory"
                      file="updateprocessing.groovy" />
{code}

updateprocessing.groovy:
{code}
chain simple {
  process("langid")
  process("copyfield")
  chain("logAndRun")
}

chain moreComplex {
  process("langid")
  if (doc.getFieldValue("employees") > 10)
    process("copyfield")
  else
    chain("myOtherProcesses")
  doc.deleteField("title")
  chain("logAndRun")
}

chain logAndRun {
  process("log")
  process("run")
}

processor langid {
  class = "solr.LanguageIdentifierUpdateProcessorFactory"
  config("langid.fl", "title,body")
  config("langid.langField", "language")
  config("map", true)
}

processor copyfield {
  script = "copyfield.groovy"
  config("from", "title")
  config("to", "title_en")
}
{code}

I don't know what it takes to code such a thing, but if we had it, I'd never go 
back to defining pipelines in XML :)
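
As a rough feasibility sketch (ScriptedUpdateProcessorChainFactory does not 
exist yet, and the "doc"/solrInputDocument bindings are assumed), the factory 
could evaluate the DSL file through the standard JSR-223 scripting API:

{code}
import java.io.FileReader;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

// Load the Groovy engine (groovy-all on the classpath) and expose the
// document to the script before evaluating the chain definitions.
ScriptEngine groovy = new ScriptEngineManager().getEngineByName("groovy");
groovy.put("doc", solrInputDocument);
groovy.eval(new FileReader("updateprocessing.groovy"));
{code}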

 Scriptable UpdateRequestChain
 -

 Key: SOLR-2841
 URL: https://issues.apache.org/jira/browse/SOLR-2841
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Jan Høydahl

 UpdateProcessorChains must currently be defined with XML in solrconfig.xml. 
 We should explore a scriptable chain implementation with a DSL that allows 
 for full flexibility. The first step would be to make UpdateChain 
 implementations pluggable in solrconfig.xml, for backward compat support.
 Benefits and possibilities with a Scriptable UpdateChain:
 * A compact DSL for defining Processors and Chains (Workflows would be a 
 better, less limited term here)
 * Keeping update processor config separate from solrconfig.xml gives better 
 separations of roles
 * Use this as an opportunity to natively support scripting language 
 Processors (ideas from SOLR-1725)
 This issue is spun off from SOLR-2823.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2155) Geospatial search using geohash prefixes

2011-10-17 Thread Olivier Jacquet (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128980#comment-13128980
 ] 

Olivier Jacquet commented on SOLR-2155:
---

I just wanted to mention another use case for multivalued point fields since 
everyone is always talking about this in a location context.

The PointType can also be used to categorize other stuff. In my case 
we're storing qualifications of persons as a tuple of experience, function and 
skill (e.g. senior, developer, java), which are internally represented by 
numerical ids. Now with Solr I would like to be able to do the query "return 
everything that is a java developer", which would be the same as asking for all 
points on a certain line.

 Geospatial search using geohash prefixes
 

 Key: SOLR-2155
 URL: https://issues.apache.org/jira/browse/SOLR-2155
 Project: Solr
  Issue Type: Improvement
Reporter: David Smiley
Assignee: Grant Ingersoll
 Attachments: GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, 
 GeoHashPrefixFilter.patch, 
 SOLR-2155_GeoHashPrefixFilter_with_sorting_no_poly.patch, SOLR.2155.p3.patch, 
 SOLR.2155.p3tests.patch, Solr2155-1.0.2-project.zip, 
 Solr2155-1.0.3-project.zip, Solr2155-for-1.0.2-3.x-port.patch


 There currently isn't a solution in Solr for doing geospatial filtering on 
 documents that have a variable number of points.  This scenario occurs when 
 there is location extraction (i.e. via a gazetteer) occurring on free text.  
 None, one, or many geospatial locations might be extracted from any given 
 document and users want to limit their search results to those occurring in a 
 user-specified area.
 I've implemented this by furthering the GeoHash based work in Lucene/Solr 
 with a geohash prefix based filter.  A geohash refers to a lat-lon box on the 
 earth.  Each successive character added further subdivides the box into a 4x8 
 (or 8x4 depending on the even/odd length of the geohash) grid.  The first 
 step in this scheme is figuring out which geohash grid squares cover the 
 user's search query.  I've added various extra methods to GeoHashUtils (and 
 added tests) to assist in this purpose.  The next step is an actual Lucene 
 Filter, GeoHashPrefixFilter, that uses these geohash prefixes in 
 TermsEnum.seek() to skip to relevant grid squares in the index.  Once a 
 matching geohash grid is found, the points therein are compared against the 
 user's query to see if it matches.  I created an abstraction GeoShape 
 extended by subclasses named PointDistance... and CartesianBox to support 
 different queried shapes so that the filter need not care about these details.
 This work was presented at LuceneRevolution in Boston on October 8th.
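
 A minimal sketch of the prefix idea itself (illustrative values, not code
 from the patch):

 {code}
 // A geohash is a base-32 string; every prefix of a point's geohash names a
 // grid cell containing that point. A coarse cell covering part of the query
 // shape therefore matches any indexed hash that it prefixes.
 String pointHash = "u4pruydqqvj"; // a specific indexed point (~11 chars)
 String queryCell = "u4pru";       // a coarser cell covering the query area
 boolean cellContainsPoint = pointHash.startsWith(queryCell);
 {code}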

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Issue Comment Edited] (SOLR-2155) Geospatial search using geohash prefixes

2011-10-17 Thread Olivier Jacquet (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128980#comment-13128980
 ] 

Olivier Jacquet edited comment on SOLR-2155 at 10/17/11 4:30 PM:
-

I just wanted to mention another use case for multivalued point fields since 
everyone is always talking about this in a location context.

The PointType can also be used to categorize other stuff. In my case 
we're storing qualifications of persons as a tuple of experience, function and 
skill (e.g. senior, developer, java), which are internally represented by 
numerical ids. Now with Solr I would like to be able to do the query "return 
everything that is a java developer", which would be the same as asking for all 
points on a certain plane.

  was (Author: ojacquet):
I just wanted to mention another use case for multivalued point fields 
since everyone is always talking about this in a location context.

The PointType can also be used to categorize other stuff. In my case 
we're storing qualifications of persons as a tuple of experience, function and 
skill (e.g. senior, developer, java), which are internally represented by 
numerical ids. Now with Solr I would like to be able to do the query "return 
everything that is a java developer", which would be the same as asking for all 
points on a certain line.
  
 Geospatial search using geohash prefixes
 

 Key: SOLR-2155
 URL: https://issues.apache.org/jira/browse/SOLR-2155
 Project: Solr
  Issue Type: Improvement
Reporter: David Smiley
Assignee: Grant Ingersoll
 Attachments: GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, 
 GeoHashPrefixFilter.patch, 
 SOLR-2155_GeoHashPrefixFilter_with_sorting_no_poly.patch, SOLR.2155.p3.patch, 
 SOLR.2155.p3tests.patch, Solr2155-1.0.2-project.zip, 
 Solr2155-1.0.3-project.zip, Solr2155-for-1.0.2-3.x-port.patch


 There currently isn't a solution in Solr for doing geospatial filtering on 
 documents that have a variable number of points.  This scenario occurs when 
 there is location extraction (i.e. via a gazetteer) occurring on free text.  
 None, one, or many geospatial locations might be extracted from any given 
 document and users want to limit their search results to those occurring in a 
 user-specified area.
 I've implemented this by furthering the GeoHash based work in Lucene/Solr 
 with a geohash prefix based filter.  A geohash refers to a lat-lon box on the 
 earth.  Each successive character added further subdivides the box into a 4x8 
 (or 8x4 depending on the even/odd length of the geohash) grid.  The first 
 step in this scheme is figuring out which geohash grid squares cover the 
 user's search query.  I've added various extra methods to GeoHashUtils (and 
 added tests) to assist in this purpose.  The next step is an actual Lucene 
 Filter, GeoHashPrefixFilter, that uses these geohash prefixes in 
 TermsEnum.seek() to skip to relevant grid squares in the index.  Once a 
 matching geohash grid is found, the points therein are compared against the 
 user's query to see if it matches.  I created an abstraction GeoShape 
 extended by subclasses named PointDistance... and CartesianBox to support 
 different queried shapes so that the filter need not care about these details.
 This work was presented at LuceneRevolution in Boston on October 8th.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Issue Comment Edited] (SOLR-2155) Geospatial search using geohash prefixes

2011-10-17 Thread Olivier Jacquet (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128980#comment-13128980
 ] 

Olivier Jacquet edited comment on SOLR-2155 at 10/17/11 4:32 PM:
-

I just wanted to mention another use case for multivalued point fields since 
everyone is always talking about this in a location context.

The PointType can also be used to categorize other stuff. In my case 
we're storing qualifications of persons as a tuple of experience, function and 
skill (e.g. senior, developer, java), which are internally represented by 
numerical ids. Now with Solr I would like to be able to do the query "return 
everything that is a java developer", which would be the same as asking for all 
points on a certain line.

  was (Author: ojacquet):
I just wanted to mention another use case for multivalued point fields 
since everyone is always talking about this in a location context.

The PointType can also be used to categorize other stuff. In my case 
we're storing qualifications of persons as a tuple of experience, function and 
skill (e.g. senior, developer, java), which are internally represented by 
numerical ids. Now with Solr I would like to be able to do the query "return 
everything that is a java developer", which would be the same as asking for all 
points on a certain plane.
  
 Geospatial search using geohash prefixes
 

 Key: SOLR-2155
 URL: https://issues.apache.org/jira/browse/SOLR-2155
 Project: Solr
  Issue Type: Improvement
Reporter: David Smiley
Assignee: Grant Ingersoll
 Attachments: GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, 
 GeoHashPrefixFilter.patch, 
 SOLR-2155_GeoHashPrefixFilter_with_sorting_no_poly.patch, SOLR.2155.p3.patch, 
 SOLR.2155.p3tests.patch, Solr2155-1.0.2-project.zip, 
 Solr2155-1.0.3-project.zip, Solr2155-for-1.0.2-3.x-port.patch


 There currently isn't a solution in Solr for doing geospatial filtering on 
 documents that have a variable number of points.  This scenario occurs when 
 there is location extraction (i.e. via a gazetteer) occurring on free text.  
 None, one, or many geospatial locations might be extracted from any given 
 document and users want to limit their search results to those occurring in a 
 user-specified area.
 I've implemented this by furthering the GeoHash based work in Lucene/Solr 
 with a geohash prefix based filter.  A geohash refers to a lat-lon box on the 
 earth.  Each successive character added further subdivides the box into a 4x8 
 (or 8x4 depending on the even/odd length of the geohash) grid.  The first 
 step in this scheme is figuring out which geohash grid squares cover the 
 user's search query.  I've added various extra methods to GeoHashUtils (and 
 added tests) to assist in this purpose.  The next step is an actual Lucene 
 Filter, GeoHashPrefixFilter, that uses these geohash prefixes in 
 TermsEnum.seek() to skip to relevant grid squares in the index.  Once a 
 matching geohash grid is found, the points therein are compared against the 
 user's query to see if it matches.  I created an abstraction GeoShape 
 extended by subclasses named PointDistance... and CartesianBox to support 
 different queried shapes so that the filter need not care about these details.
 This work was presented at LuceneRevolution in Boston on October 8th.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2155) Geospatial search using geohash prefixes

2011-10-17 Thread David Smiley (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13129002#comment-13129002
 ] 

David Smiley commented on SOLR-2155:


Olivier: Your scenario is interesting but I wouldn't recommend spatial for that. 
A key part of spatial is the use of a numerical range. In your case there are 
discrete values.  Instead, I recommend you experiment with phrase queries, and 
if you are expert in Lucene then span queries.  As a toy hack example, imagine 
indexing each of these values in the form "senior developer java" (3 words, one 
for each part). We assume each value tokenizes as one token.  Then search for 
"the developer java", in which "the" was substituted as a kind of wildcard for 
the first position, to find java developers at all levels of experience. "The" 
is a stopword and in effect creates a wildcard placeholder.  If you search the 
solr-user list you will see information on this topic.  I've solved this 
problem in a different, more difficult way because my values were not single 
tokens, but based on the example you present, the solution I present here isn't 
bad.  If you want to discuss this further I recommend the solr-user list.
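
To make the toy hack concrete, here is a sketch under the assumptions above: 
each tuple part indexes as a single token, in a field I'll call 
"qualification" (the field name and class are illustrative, not tested 
against your schema):
{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class QualificationQueries {
  // "java developer at any experience level": each tuple is indexed as three
  // consecutive tokens, so this is just the two-token suffix phrase.
  public static PhraseQuery anyLevelJavaDev() {
    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term("qualification", "developer"));
    pq.add(new Term("qualification", "java"));
    return pq;
  }

  // The stopword trick matters for a hole in the middle ("senior ? java"):
  // the user's phrase "senior the java" with "the" stopped out (and position
  // increments enabled) parses to a phrase with a gap at position 1.
  public static PhraseQuery seniorAnythingJava() {
    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term("qualification", "senior"), 0);
    pq.add(new Term("qualification", "java"), 2); // position 1 unconstrained
    return pq;
  }
}
{code}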

 Geospatial search using geohash prefixes
 

 Key: SOLR-2155
 URL: https://issues.apache.org/jira/browse/SOLR-2155
 Project: Solr
  Issue Type: Improvement
Reporter: David Smiley
Assignee: Grant Ingersoll
 Attachments: GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, 
 GeoHashPrefixFilter.patch, 
 SOLR-2155_GeoHashPrefixFilter_with_sorting_no_poly.patch, SOLR.2155.p3.patch, 
 SOLR.2155.p3tests.patch, Solr2155-1.0.2-project.zip, 
 Solr2155-1.0.3-project.zip, Solr2155-for-1.0.2-3.x-port.patch


 There currently isn't a solution in Solr for doing geospatial filtering on 
 documents that have a variable number of points.  This scenario occurs when 
 there is location extraction (i.e. via a gazetteer) occurring on free text.  
 None, one, or many geospatial locations might be extracted from any given 
 document and users want to limit their search results to those occurring in a 
 user-specified area.
 I've implemented this by furthering the GeoHash based work in Lucene/Solr 
 with a geohash prefix based filter.  A geohash refers to a lat-lon box on the 
 earth.  Each successive character added further subdivides the box into a 4x8 
 (or 8x4 depending on the even/odd length of the geohash) grid.  The first 
 step in this scheme is figuring out which geohash grid squares cover the 
 user's search query.  I've added various extra methods to GeoHashUtils (and 
 added tests) to assist in this purpose.  The next step is an actual Lucene 
 Filter, GeoHashPrefixFilter, that uses these geohash prefixes in 
 TermsEnum.seek() to skip to relevant grid squares in the index.  Once a 
 matching geohash grid is found, the points therein are compared against the 
 user's query to see if it matches.  I created an abstraction GeoShape 
 extended by subclasses named PointDistance... and CartesianBox to support 
 different queried shapes so that the filter need not care about these details.
 This work was presented at LuceneRevolution in Boston on October 8th.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2842) Re-factor UpdateChain and UpdateProcessor interfaces

2011-10-17 Thread Ryan McKinley (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13129030#comment-13129030
 ] 

Ryan McKinley commented on SOLR-2842:
-

Rather than using UpdateProcessor directly, what about adding a simple interface 
like:
{code:java}
 SolrInputDocument transform(SolrInputDocument)
{code}
and using simple bean getters/setters -- perhaps also respecting the 'aware' 
interfaces (SolrCoreAware, SchemaAware, ResourceLoaderAware)?

It seems like most of the custom things we would want to do only care about 
'add' and don't care about commit, delete, merge, or rollback.  Starting with a 
simple interface like this would give us lots of flexibility to integrate 
wherever it feels most appropriate -- client/server or any other pipeline 
framework (I've been using commons pipeline with pretty reasonable success).
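
To sketch the shape of this (names here are illustrative only, not a proposal 
for concrete class names):
{code:java}
import org.apache.solr.common.SolrInputDocument;

/** Hypothetical single-method transformer, as suggested above. */
public interface DocTransformer {
  SolrInputDocument transform(SolrInputDocument doc);
}

/** Example impl -- trivially reusable on either the client or the server. */
class LowercaseIdTransformer implements DocTransformer {
  public SolrInputDocument transform(SolrInputDocument doc) {
    Object id = doc.getFieldValue("id");
    if (id instanceof String) {
      doc.setField("id", ((String) id).toLowerCase()); // normalize the id
    }
    return doc;
  }
}
{code}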




 Re-factor UpdateChain and UpdateProcessor interfaces
 

 Key: SOLR-2842
 URL: https://issues.apache.org/jira/browse/SOLR-2842
 Project: Solr
  Issue Type: Improvement
  Components: update
Reporter: Jan Høydahl

 The UpdateChain's main task is to send SolrInputDocuments through a chain of 
 UpdateRequestProcessors in order to transform them in some way and then 
 (typically) indexing them.
 This generic pipeline concept would also be useful on the client side 
 (SolrJ), so that we could choose to do parts or all of the processing on the 
 client. The most prominent use case is extracting text (Tika) from large 
 binary documents, residing on local storage on the client(s). Streaming 
 hundreds of MB over to Solr for processing is not efficient. See SOLR-1526.
 We're already implementing Tika as an UpdateProcessor in SOLR-1763, and what 
 would be more natural than reusing this - and any other processor - on the 
 client side?
 However, for this to be possible, some interfaces need to change slightly..

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS] Lucene-Solr-tests-only-trunk-java7 - Build # 647 - Failure

2011-10-17 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk-java7/647/

1 tests failed.
REGRESSION:  
org.apache.solr.client.solrj.embedded.LargeVolumeJettyTest.testMultiThreaded

Error Message:
java.lang.AssertionError: Some threads threw uncaught exceptions!

Stack Trace:
java.lang.RuntimeException: java.lang.AssertionError: Some threads threw 
uncaught exceptions!
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:739)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:89)
at 
org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:149)
at 
org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:51)
at 
org.apache.lucene.util.LuceneTestCase.checkUncaughtExceptionsAfter(LuceneTestCase.java:767)
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:711)




Build Log (for compile errors):
[...truncated 12202 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3522) TermsFilter.getDocIdSet(context) NPE on missing field

2011-10-17 Thread Shai Erera (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-3522:
---

Fix Version/s: (was: 3.5)

Ah. I thought we needed the Fix Version to properly track which issues are 
part of a release. But you're right - if this bug didn't exist in 3.x, then 
we'd better not mark it as fixed there.

 TermsFilter.getDocIdSet(context) NPE on missing field
 -

 Key: LUCENE-3522
 URL: https://issues.apache.org/jira/browse/LUCENE-3522
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/other
Affects Versions: 4.0
Reporter: Dan Climan
Assignee: Michael McCandless
Priority: Minor
 Fix For: 4.0

 Attachments: LUCENE-3522.patch


 If the context does not contain the field for a term when calling 
 TermsFilter.getDocIdSet(AtomicReaderContext context) then a 
 NullPointerException is thrown due to not checking for null Terms before 
 getting iterator.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS] Lucene-Solr-tests-only-trunk - Build # 10870 - Failure

2011-10-17 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/10870/

1 tests failed.
REGRESSION:  org.apache.solr.update.AutoCommitTest.testMaxDocs

Error Message:
should find one query failed XPath: //result[@numFound=1]  xml response was: 
<?xml version="1.0" encoding="UTF-8"?> <response> <lst 
name="responseHeader"><int name="status">0</int><int 
name="QTime">2</int></lst><result name="response" numFound="0" 
start="0"></result> </response>   request was: 
start=0&q=id:14&qt=standard&rows=20&version=2.2

Stack Trace:
junit.framework.AssertionFailedError: should find one query failed XPath: 
//result[@numFound=1]
 xml response was: <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int 
name="QTime">2</int></lst><result name="response" numFound="0" 
start="0"></result>
</response>

 request was: start=0&q=id:14&qt=standard&rows=20&version=2.2
        at 
org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:149)
        at 
org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:51)
 xml response was: <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int 
name="QTime">2</int></lst><result name="response" numFound="0" 
start="0"></result>
</response>

 request was: start=0&q=id:14&qt=standard&rows=20&version=2.2
        at 
org.apache.solr.util.AbstractSolrTestCase.assertQ(AbstractSolrTestCase.java:262)
        at 
org.apache.solr.update.AutoCommitTest.testMaxDocs(AutoCommitTest.java:182)
        at 
org.apache.lucene.util.LuceneTestCase$2$1.evaluate(LuceneTestCase.java:610)




Build Log (for compile errors):
[...truncated 7776 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-2843) WordBreakSpellChecker

2011-10-17 Thread James Dyer (Created) (JIRA)
WordBreakSpellChecker
-

 Key: SOLR-2843
 URL: https://issues.apache.org/jira/browse/SOLR-2843
 Project: Solr
  Issue Type: Improvement
  Components: spellchecker
Affects Versions: 3.5, 4.0
Reporter: James Dyer
Priority: Minor
 Fix For: 3.5, 4.0


A spellchecker that generates suggestions by combining two or more terms and/or 
breaking terms into multiple words.  This would typically be used in addition 
to one of the existing spell checkers to get both traditional and word-break 
suggestions for the end user.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Closed] (SOLR-2843) WordBreakSpellChecker

2011-10-17 Thread James Dyer (Closed) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer closed SOLR-2843.


Resolution: Won't Fix

this is a LUCENE- issue, not a SOLR- issue.

 WordBreakSpellChecker
 -

 Key: SOLR-2843
 URL: https://issues.apache.org/jira/browse/SOLR-2843
 Project: Solr
  Issue Type: Improvement
  Components: spellchecker
Affects Versions: 3.5, 4.0
Reporter: James Dyer
Priority: Minor
 Fix For: 3.5, 4.0


 A spellchecker that generates suggestions by combining two or more terms 
 and/or breaking terms into multiple words.  This would typically be used in 
 addition to one of the existing spell checkers to get both traditional and 
 word-break suggestions for the end user.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-3523) WordBreakSpellChecker

2011-10-17 Thread James Dyer (Created) (JIRA)
WordBreakSpellChecker
-

 Key: LUCENE-3523
 URL: https://issues.apache.org/jira/browse/LUCENE-3523
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/spellchecker
Affects Versions: 3.5, 4.0
Reporter: James Dyer
Priority: Minor
 Fix For: 3.5, 4.0
 Attachments: LUCENE-3523.patch

A spellchecker that generates suggestions by combining two or more terms and/or 
breaking terms into multiple words. This would typically be used in addition to 
one of the existing spell checkers to get both traditional and word-break 
suggestions for the end user.
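
As a rough illustration of the "break" half of the idea (a toy sketch, not 
the attached patch's API; the "combine" half would walk pairs of consecutive 
query terms the same way):
{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class WordBreakToy {
  /** Try every split point of a misspelled term and keep the splits whose
      halves both occur in the index (e.g. "helloworld" - "hello world"). */
  public static List<String[]> suggestBreaks(IndexReader reader, String field,
                                             String term) throws IOException {
    List<String[]> suggestions = new ArrayList<String[]>();
    for (int i = 1; i < term.length(); i++) {
      String left = term.substring(0, i), right = term.substring(i);
      if (reader.docFreq(new Term(field, left)) > 0
          && reader.docFreq(new Term(field, right)) > 0) {
        suggestions.add(new String[] {left, right});
      }
    }
    return suggestions;
  }
}
{code}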

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3523) WordBreakSpellChecker

2011-10-17 Thread James Dyer (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated LUCENE-3523:
---

Attachment: LUCENE-3523.patch

 WordBreakSpellChecker
 -

 Key: LUCENE-3523
 URL: https://issues.apache.org/jira/browse/LUCENE-3523
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/spellchecker
Affects Versions: 3.5, 4.0
Reporter: James Dyer
Priority: Minor
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3523.patch


 A spellchecker that generates suggestions by combining two or more terms 
 and/or breaking terms into multiple words. This would typically be used in 
 addition to one of the existing spell checkers to get both traditional and 
 word-break suggestions for the end user.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2842) Re-factor UpdateChain and UpdateProcessor interfaces

2011-10-17 Thread Jan Høydahl (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13129112#comment-13129112
 ] 

Jan Høydahl commented on SOLR-2842:
---

Interesting thought, Ryan. I already commonly isolate processing in a similar 
method to simplify unit testing, so this is useful in several ways. Do you 
suggest letting the UpdateProcessor base class implement this interface? But you 
still need to construct and initialize the processors even if they are wrapped 
in the interface, thus my suggestion for a client-side version of the factory.

 Re-factor UpdateChain and UpdateProcessor interfaces
 

 Key: SOLR-2842
 URL: https://issues.apache.org/jira/browse/SOLR-2842
 Project: Solr
  Issue Type: Improvement
  Components: update
Reporter: Jan Høydahl

 The UpdateChain's main task is to send SolrInputDocuments through a chain of 
 UpdateRequestProcessors in order to transform them in some way and then 
 (typically) indexing them.
 This generic pipeline concept would also be useful on the client side 
 (SolrJ), so that we could choose to do parts or all of the processing on the 
 client. The most prominent use case is extracting text (Tika) from large 
 binary documents, residing on local storage on the client(s). Streaming 
 hundreds of MB over to Solr for processing is not efficient. See SOLR-1526.
 We're already implementing Tika as an UpdateProcessor in SOLR-1763, and what 
 would be more natural than reusing this - and any other processor - on the 
 client side?
 However, for this to be possible, some interfaces need to change slightly..

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS] Lucene-Solr-tests-only-trunk - Build # 10871 - Still Failing

2011-10-17 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/10871/

1 tests failed.
REGRESSION:  org.apache.solr.search.TestRealTimeGet.testStressGetRealtime

Error Message:
java.lang.AssertionError: Some threads threw uncaught exceptions!

Stack Trace:
java.lang.RuntimeException: java.lang.AssertionError: Some threads threw 
uncaught exceptions!
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:739)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:89)
at 
org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:149)
at 
org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:51)
at 
org.apache.lucene.util.LuceneTestCase.checkUncaughtExceptionsAfter(LuceneTestCase.java:767)
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:711)




Build Log (for compile errors):
[...truncated 7942 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS] Lucene-Solr-tests-only-3.x - Build # 10891 - Failure

2011-10-17 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/10891/

1 tests failed.
REGRESSION:  org.apache.solr.core.TestJmxIntegration.testJmxOnCoreReload

Error Message:
Number of registered MBeans is not the same as info registry size 
expected:<58> but was:<55>

Stack Trace:
junit.framework.AssertionFailedError: Number of registered MBeans is not the 
same as info registry size expected:<58> but was:<55>
at 
org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:147)
at 
org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:50)
at 
org.apache.solr.core.TestJmxIntegration.testJmxOnCoreReload(TestJmxIntegration.java:137)
at 
org.apache.lucene.util.LuceneTestCase$2$1.evaluate(LuceneTestCase.java:435)




Build Log (for compile errors):
[...truncated 14016 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2829) Filter queries have false-positive matches. Exposed by user's list titled Regarding geodist and multiple location fields

2011-10-17 Thread Emmanuel Espina (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emmanuel Espina updated SOLR-2829:
--

Attachment: SOLR-2829.patch

I modified the tests to reproduce the issue in the mailing list.
The suggestion Erick made about adding this.origField.equals(other.origField) 
solves the problem. That line is included in the patch.
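
For readers following along, the shape of the fix is roughly this (a sketch 
only, not the exact patch; the other fields shown are illustrative):
{code:java}
// In SpatialDistanceQuery (LatLonField.java): two filters over different
// source fields must not compare equal, or the filterCache/queryResultCache
// will hand back hits that were computed for the wrong field.
public boolean equals(Object o) {
  if (o == null || o.getClass() != this.getClass()) return false;
  SpatialDistanceQuery other = (SpatialDistanceQuery) o;
  return this.origField.equals(other.origField)  // the missing check
      && this.latCenter == other.latCenter       // illustrative: whatever
      && this.lonCenter == other.lonCenter       // else the query already
      && this.dist == other.dist;                // compared stays as-is
}
// hashCode() should incorporate origField as well, so cache keys that
// hash before they compare cannot collide across fields.
{code}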

 Filter queries have false-positive matches. Exposed by user's list titled 
 Regarding geodist and multiple location fields
 --

 Key: SOLR-2829
 URL: https://issues.apache.org/jira/browse/SOLR-2829
 Project: Solr
  Issue Type: Bug
  Components: search
Affects Versions: 3.4, 4.0
 Environment: N/A
Reporter: Erick Erickson
 Attachments: SOLR-2829.patch


 I don't know how generic this is, whether it's just a
 problem with fqs when combined with spatial or whether
 it has wider applicability, but here's what I know so far.
 Marc Tinnemeyer in a post titled
 "Regarding geodist and multiple location fields"
 outlines this. I checked this on 3.4 and trunk and it's
 weird in both cases.
 HOLD THE PRESSES:
 After looking at this a bit more, it looks like a caching
 issue, NOT a geodist issue. When I bounce Solr
 between changing the sfield from "home" to "work",
 it seems to work as expected.
 Hmmm, very strange. If I comment out BOTH
 the filterCache and queryResultCache then it works
 fine. Switching from "home" to "work" in the query
 finds/fails to find the document.
 But commenting out only one of those caches
 doesn't fix the problem.
 On trunk I used this query, just flipping "home" to "work" and back:
 http://localhost:8983/solr/select?q=id:1&fq={!geofilt sfield=home pt=52.67,7.30 d=5}
 The info below is what I used to test.
 From Marc's posts:
 <field name="home" type="location" indexed="true" stored="true"/>
 <field name="work" type="location" indexed="true" stored="true"/>
 <field name="elsewhere" type="location" indexed="true" stored="true"/>
 At first I thought so too. Here is a simple document.
 <add>
   <doc>
     <field name="id">1</field>
     <field name="name">first</field>
     <field name="work">48.60,11.61</field>
     <field name="home">52.67,7.30</field>
   </doc>
 </add>
 and here is the result that shouldn't be:
 <response>
 ...
 <str name="q">*:*</str>
 <str name="fq">{!geofilt sfield=work pt=52.67,7.30 d=5}</str>
 ...
 </lst>
 </lst>
 <result name="response" numFound="1" start="0">
 <doc>
 <str name="home">52.67,7.30</str>
 <str name="id">1</str>
 <str name="name">first</str>
 <str name="work">48.60,11.61</str>
 </doc>
 </result>
 </response>
 Yonik's comment:
 It's going to be a bug in an equals() implementation somewhere in the query.
 The top-level equals will be SpatialDistanceQuery.equals() (from
 LatLonField.java).
 On trunk, I already see a bug introduced when the new lucene field
 cache stuff was done.
 DoubleValueSource now just inherits its equals method from
 NumericFieldCacheSource... and that equals() method only tests if the
 CachedArrayCreator.getClass() is the same!  That's definitely wrong.
 I don't know why 3x would also have this behavior (unless there's more
 than one bug!)
 Anyway, first step is to modify the spatial tests to catch the bug...
 from there it should be pretty easy to debug.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-10-17 Thread James Dyer (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-entities.patch

Noble,

Here is a version of the entities patch using .iterator() methods as you 
suggest.  Let me know if this is what you had in mind and also if there is 
anything else you'd like to address.
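
For anyone skimming, here is a rough sketch of the kind of iterator-based 
contract being discussed (illustrative only; the attached 
SOLR-2382-entities.patch is authoritative):
{code:java}
import java.util.Iterator;
import java.util.Map;

// A cache exposes its rows through the standard Iterable/Iterator contract
// rather than custom next()-style methods (sketch; names may differ).
public interface DIHCache extends Iterable<Map<String, Object>> {
  void add(Map<String, Object> rec);                   // cache one row
  Iterator<Map<String, Object>> iterator();            // all rows, in key order
  Iterator<Map<String, Object>> iterator(Object key);  // rows for one key
  void flush();                                        // persist pending rows
  void close();                                        // release resources
}
{code}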

 DIH Cache Improvements
 --

 Key: SOLR-2382
 URL: https://issues.apache.org/jira/browse/SOLR-2382
 Project: Solr
  Issue Type: New Feature
  Components: contrib - DataImportHandler
Reporter: James Dyer
Priority: Minor
 Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, 
 SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
 SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
 SOLR-2382-properties.patch, SOLR-2382-properties.patch, 
 SOLR-2382-solrwriter-verbose-fix.patch, SOLR-2382-solrwriter.patch, 
 SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382.patch, 
 SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
 SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch


 Functionality:
  1. Provide a pluggable caching framework for DIH so that users can choose a 
 cache implementation that best suits their data and application.
  
  2. Provide a means to temporarily cache a child Entity's data without 
 needing to create a special cached implementation of the Entity Processor 
 (such as CachedSqlEntityProcessor).
  
  3. Provide a means to write the final (root entity) DIH output to a cache 
 rather than to Solr.  Then provide a way for a subsequent DIH call to use the 
 cache as an Entity input.  Also provide the ability to do delta updates on 
 such persistent caches.
  
  4. Provide the ability to partition data across multiple caches that can 
 then be fed back into DIH and indexed either to varying Solr Shards, or to 
 the same Core in parallel.
 Use Cases:
 1. We needed a flexible & scalable way to temporarily cache child-entity 
 data prior to joining to parent entities.
   - Using SqlEntityProcessor with Child Entities can cause an n+1 select 
 problem.
   - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching 
 mechanism and does not scale.
   - There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
  
  2. We needed the ability to gather data from long-running entities by a 
 process that runs separate from our main indexing process.
   
  3. We wanted the ability to do a delta import of only the entities that 
 changed.
   - Lucene/Solr requires entire documents to be re-indexed, even if only a 
 few fields changed.
   - Our data comes from 50+ complex sql queries and/or flat files.
   - We do not want to incur overhead re-gathering all of this data if only 1 
 entity's data changed.
   - Persistent DIH caches solve this problem.
   
  4. We want the ability to index several documents in parallel (using 1.4.1, 
 which did not have the threads parameter).
  
  5. In the future, we may need to use Shards, creating a need to easily 
 partition our source data into Shards.
 Implementation Details:
  1. De-couple EntityProcessorBase from caching.  
   - Created a new interface, DIHCache & two implementations:  
 - SortedMapBackedCache - An in-memory cache, used as default with 
 CachedSqlEntityProcessor (now deprecated).
 - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested 
 with je-4.1.6.jar
- NOTE: the existing Lucene Contrib db project uses je-3.3.93.jar.  
 I believe this may be incompatible due to generics usage.
- NOTE: I did not modify the ant script to automatically get this jar, 
 so to use or evaluate this patch, download bdb-je from 
 http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html 
  
 2. Allow Entity Processors to take a "cacheImpl" parameter to cause the 
 entity data to be cached (see EntityProcessorBase & DIHCacheProperties).
  
  3. Partially De-couple SolrWriter from DocBuilder
   - Created a new interface, DIHWriter, & two implementations:
- SolrWriter (refactored)
- DIHCacheWriter (allows DIH to write ultimately to a Cache).

  4. Create a new Entity Processor, DIHCacheProcessor, which reads a 
 persistent Cache as DIH Entity Input.
  
  5. Support a partition parameter with both DIHCacheWriter and 
 DIHCacheProcessor to allow for easy partitioning of source entity data.
  
  6. Change the semantics of entity.destroy()
   - Previously, it was being called on each iteration of 
 DocBuilder.buildDocument().
   - Now it does one-time cleanup tasks (like closing or deleting a 
 disk-backed cache) once the entity processor is completed.
   - The only out-of-the-box entity processor that previously implemented 
 destroy() was LineEntityProcessor, so this is not a 

[jira] [Commented] (SOLR-2842) Re-factor UpdateChain and UpdateProcessor interfaces

2011-10-17 Thread Ryan McKinley (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13129229#comment-13129229
 ] 

Ryan McKinley commented on SOLR-2842:
-

I don't have a real proposal... just thinking about generally reusable pipeline 
code.

bq. Do you suggest letting the UpdateProcessor base class implement this interface?

No. Since most domain-specific UpdateProcessors can be boiled down to this 
(tika, langid, geonames, etc.), I don't think they need to have access to the 
whole UpdateProcessor -- only sometimes do they need access to 
SolrCore/Schema/ResourceLoader etc.  With minimal dependencies, moving them 
around would be easy.

I was thinking we could have a general TransformingUpdateProcessor that could 
take a list of transformers (or something), rather than having all the 
dependencies.

bq. But you still need to construct and initialize the processors even if they 
are wrapped in the interface, thus my suggestion for a client-side version of 
the factory.

I'm not convinced that a client-side framework is necessary if the interfaces 
were easy enough to deal with directly.  I can see where a DSL would be cool, 
but having a client-side NamedListInitializedPlugin seems like a can of worms.
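
Something like this is the shape I have in mind (illustrative only; 
DocTransformer is the hypothetical single-method interface from my earlier 
comment):
{code:java}
import java.io.IOException;
import java.util.List;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

// One processor runs a list of single-method transformers on the 'add' path;
// commit/delete/merge/rollback simply pass through to the next processor.
public class TransformingUpdateProcessor extends UpdateRequestProcessor {
  private final List<DocTransformer> transformers;

  public TransformingUpdateProcessor(List<DocTransformer> transformers,
                                     UpdateRequestProcessor next) {
    super(next);
    this.transformers = transformers;
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    for (DocTransformer t : transformers) {
      cmd.solrDoc = t.transform(cmd.solrDoc);  // chain the transformations
    }
    super.processAdd(cmd);  // hand off to the rest of the update chain
  }
}
{code}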



 Re-factor UpdateChain and UpdateProcessor interfaces
 

 Key: SOLR-2842
 URL: https://issues.apache.org/jira/browse/SOLR-2842
 Project: Solr
  Issue Type: Improvement
  Components: update
Reporter: Jan Høydahl

 The UpdateChain's main task is to send SolrInputDocuments through a chain of 
 UpdateRequestProcessors in order to transform them in some way and then 
 (typically) indexing them.
 This generic pipeline concept would also be useful on the client side 
 (SolrJ), so that we could choose to do parts or all of the processing on the 
 client. The most prominent use case is extracting text (Tika) from large 
 binary documents, residing on local storage on the client(s). Streaming 
 hundreds of MB over to Solr for processing is not efficient. See SOLR-1526.
 We're already implementing Tika as an UpdateProcessor in SOLR-1763, and what 
 would be more natural than reusing this - and any other processor - on the 
 client side?
 However, for this to be possible, some interfaces need to change slightly..

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS] Lucene-Solr-tests-only-trunk-java7 - Build # 649 - Failure

2011-10-17 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk-java7/649/

1 tests failed.
REGRESSION:  org.apache.solr.core.TestJmxIntegration.testJmxOnCoreReload

Error Message:
Number of registered MBeans is not the same as info registry size 
expected:<56> but was:<51>

Stack Trace:
junit.framework.AssertionFailedError: Number of registered MBeans is not the 
same as info registry size expected:<56> but was:<51>
at 
org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:149)
at 
org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:51)
at 
org.apache.solr.core.TestJmxIntegration.testJmxOnCoreReload(TestJmxIntegration.java:134)
at 
org.apache.lucene.util.LuceneTestCase$2$1.evaluate(LuceneTestCase.java:610)




Build Log (for compile errors):
[...truncated 11102 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS] Lucene-Solr-tests-only-trunk - Build # 10876 - Failure

2011-10-17 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/10876/

1 tests failed.
REGRESSION:  org.apache.lucene.analysis.th.TestThaiAnalyzer.testRandomStrings

Error Message:
255

Stack Trace:
java.lang.ArrayIndexOutOfBoundsException: 255
at 
java.text.DictionaryBasedBreakIterator.lookupCategory(DictionaryBasedBreakIterator.java:319)
at 
java.text.RuleBasedBreakIterator.handleNext(RuleBasedBreakIterator.java:903)
at 
java.text.DictionaryBasedBreakIterator.handleNext(DictionaryBasedBreakIterator.java:281)
at 
java.text.RuleBasedBreakIterator.next(RuleBasedBreakIterator.java:621)
at 
org.apache.lucene.analysis.th.ThaiWordFilter.incrementToken(ThaiWordFilter.java:85)
at 
org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:49)
at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:280)
at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:247)
at 
org.apache.lucene.analysis.th.TestThaiAnalyzer.testRandomStrings(TestThaiAnalyzer.java:151)
at 
org.apache.lucene.util.LuceneTestCase$2$1.evaluate(LuceneTestCase.java:610)
at 
org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:149)
at 
org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:51)




Build Log (for compile errors):
[...truncated 3569 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2842) Re-factor UpdateChain and UpdateProcessor interfaces

2011-10-17 Thread Chris Male (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13129448#comment-13129448
 ] 

Chris Male commented on SOLR-2842:
--

Ryan is on the right track here.  Really, many of these changes relate only to 
the 'add' update logic.  UpdateProcessor provides a lot of integration with 
Solr's CRUD features and I think we should leave it that way.  If we want to 
provide a cleaner system for doing Document manipulations as part of adding a 
Document, then let's keep the focus on that.

Having a simple interface like Ryan suggests seems the best way to go forward.  
Then all the actual meat of the Document manipulation logic can go there.  
Integration with UpdateProcessor seems pretty straightforward.  It then frees 
us up to tackle configuration / DSLs / reuse / whatever other buzzword, without 
changing the already powerful and functional UpdateProcessor framework.

 Re-factor UpdateChain and UpdateProcessor interfaces
 

 Key: SOLR-2842
 URL: https://issues.apache.org/jira/browse/SOLR-2842
 Project: Solr
  Issue Type: Improvement
  Components: update
Reporter: Jan Høydahl

 The UpdateChain's main task is to send SolrInputDocuments through a chain of 
 UpdateRequestProcessors in order to transform them in some way and then 
 (typically) indexing them.
 This generic pipeline concept would also be useful on the client side 
 (SolrJ), so that we could choose to do parts or all of the processing on the 
 client. The most prominent use case is extracting text (Tika) from large 
 binary documents, residing on local storage on the client(s). Streaming 
 hundreds of MB over to Solr for processing is not efficient. See SOLR-1526.
 We're already implementing Tika as an UpdateProcessor in SOLR-1763, and what 
 would be more natural than reusing this - and any other processor - on the 
 client side?
 However, for this to be possible, some interfaces need to change slightly..

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2382) DIH Cache Improvements

2011-10-17 Thread Noble Paul (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13129493#comment-13129493
 ] 

Noble Paul commented on SOLR-2382:
--

A few more points:

Let us avoid methods which are not used, for example 
SolrWriter.setDeltaKeys() -- it is not implemented and I am not even clear why 
it is there.

Now that we have a concept of DIHCache, move all cache-related logic from 
EntityProcessorBase to another class, probably a BaseDIHCache.

Let us put in the minimum amount of changes.

Remove the DIHCacheProperties class and inline the constants. That is the way 
it is done everywhere else.

I don't understand the need for DocBuilder.resetEntity(). As I see it, the 
DataConfig state must not be changed between runs.


 DIH Cache Improvements
 --

 Key: SOLR-2382
 URL: https://issues.apache.org/jira/browse/SOLR-2382
 Project: Solr
  Issue Type: New Feature
  Components: contrib - DataImportHandler
Reporter: James Dyer
Priority: Minor
 Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, 
 SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
 SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
 SOLR-2382-properties.patch, SOLR-2382-properties.patch, 
 SOLR-2382-solrwriter-verbose-fix.patch, SOLR-2382-solrwriter.patch, 
 SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382.patch, 
 SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
 SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch


 Functionality:
  1. Provide a pluggable caching framework for DIH so that users can choose a 
 cache implementation that best suits their data and application.
  
  2. Provide a means to temporarily cache a child Entity's data without 
 needing to create a special cached implementation of the Entity Processor 
 (such as CachedSqlEntityProcessor).
  
  3. Provide a means to write the final (root entity) DIH output to a cache 
 rather than to Solr.  Then provide a way for a subsequent DIH call to use the 
 cache as an Entity input.  Also provide the ability to do delta updates on 
 such persistent caches.
  
  4. Provide the ability to partition data across multiple caches that can 
 then be fed back into DIH and indexed either to varying Solr Shards, or to 
 the same Core in parallel.
 Use Cases:
 1. We needed a flexible & scalable way to temporarily cache child-entity 
 data prior to joining to parent entities.
   - Using SqlEntityProcessor with Child Entities can cause an n+1 select 
 problem.
   - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching 
 mechanism and does not scale.
   - There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
  
  2. We needed the ability to gather data from long-running entities by a 
 process that runs separate from our main indexing process.
   
  3. We wanted the ability to do a delta import of only the entities that 
 changed.
   - Lucene/Solr requires entire documents to be re-indexed, even if only a 
 few fields changed.
   - Our data comes from 50+ complex sql queries and/or flat files.
   - We do not want to incur overhead re-gathering all of this data if only 1 
 entity's data changed.
   - Persistent DIH caches solve this problem.
   
  4. We want the ability to index several documents in parallel (using 1.4.1, 
 which did not have the threads parameter).
  
  5. In the future, we may need to use Shards, creating a need to easily 
 partition our source data into Shards.
 Implementation Details:
  1. De-couple EntityProcessorBase from caching.  
   - Created a new interface, DIHCache & two implementations:  
 - SortedMapBackedCache - An in-memory cache, used as default with 
 CachedSqlEntityProcessor (now deprecated).
 - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested 
 with je-4.1.6.jar
- NOTE: the existing Lucene Contrib db project uses je-3.3.93.jar.  
 I believe this may be incompatible due to generics usage.
- NOTE: I did not modify the ant script to automatically get this jar, 
 so to use or evaluate this patch, download bdb-je from 
 http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html 
  
 2. Allow Entity Processors to take a "cacheImpl" parameter to cause the 
 entity data to be cached (see EntityProcessorBase & DIHCacheProperties).
  
  3. Partially De-couple SolrWriter from DocBuilder
   - Created a new interface, DIHWriter, & two implementations:
- SolrWriter (refactored)
- DIHCacheWriter (allows DIH to write ultimately to a Cache).

  4. Create a new Entity Processor, DIHCacheProcessor, which reads a 
 persistent Cache as DIH Entity Input.
  
  5. Support a partition parameter with both DIHCacheWriter and 
 DIHCacheProcessor to allow for easy partitioning of source entity