[jira] Commented: (SOLR-221) faceting memory and performance improvement

2007-04-28 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492543
 ] 

Yonik Seeley commented on SOLR-221:
---

The results are slightly surprising.

I made up an index in which each document contains 4 random numbers between 1 
and 500,000.
This is not the distribution one would expect to see in a real index, but we 
can still learn much from it.

The synthetic index:
 maxDoc=500,000
 numDocs=393,566
 number of segments = 5
 number of unique facet terms = 490903
 filterCache max size = 1,000,000 entries (more than enough)
 JVM=1.5.0_09 -server -Xmx200M
 System=WinXP, 3GHz P4, hyperthreaded, 1GB dual channel RAM
 facet type = facet.field, facet.sort=true, facet.limit=10
 maximum df of any term = 15
 warming times were not included... queries were run many times and the lowest 
time was recorded.

Number of documents that match test "base" queries (for example, base query #1 
matches 175K docs):
1) 175000
2) 43000
3) 8682
4) 2179
5) 422
6) 1

WITHOUT PATCH (milliseconds to facet each base query):
1578, 1578, 1547, 1485, 1484,1422

WITH PATCH (min df comparison w/ term df,  minDfFilterCache=0) (all field cache)
 984,  1203, 1391, 1437, 1484, 1420

WITH PATCH (min df comp, minDfFilterCache=30)  (no fieldCache at all)
1406, 2344, 3125, 3015, 3172, 3172

CONCLUSION1: min df comparison increases faceting speed 60% when the base query 
matches many documents.  With a real term distribution, this could be even 
greater.

CONCLUSION2: opting to not use the fieldCache for smaller df terms can save a 
lot of memory, but it hurts performance up to 200% for our non-optimized index.

CONCLUSION3: using the field cache less can significantly speed up warming time 
(times not shown, but a full warming of the fieldCache took 33 sec)

=== now the same index, but optimized ===
WITH PATCH (optimized, min df comparison w/ term df,  minDfFilterCache=0) (all 
field cache)
 172,  312,  485,  578,  610,  656

WITH PATCH (optimized, min df comp, minDfFilterCache=30)  (no fieldCache at all)
 265,  344,  422,  468,  500,  484  

CONCLUSION4: An optimized index increased performance 200-500%

CONCLUSION5: The all-fieldcache option was significantly faster on the 
optimized index, and that probably cannot be explained entirely by more 
accurate dfs (no deleted documents to inflate the term df values). It suggests 
that just iterating over the terms is *much* faster in an optimized index (a 
potential Lucene area to look into)
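
For anyone who wants to see the idea behind the min df comparison without 
reading the patch, here is a minimal sketch. It is not the code in 
facet.patch. It assumes each term's documents are held in a java.util.BitSet, 
and FacetTerm, topCounts and the intersections counter are hypothetical names 
made up for illustration. The point is only that an intersection count can 
never exceed the term's df, so once the top-N queue is full, any term whose df 
cannot beat the smallest kept count can be skipped without intersecting.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class DfPruningSketch {
    /** Hypothetical stand-in for a facet term: its text plus the docs containing it. */
    static final class FacetTerm {
        final String text;
        final BitSet docs;
        FacetTerm(String text, BitSet docs) { this.text = text; this.docs = docs; }
        int df() { return docs.cardinality(); }
    }

    /** Number of intersections computed by the most recent topCounts call. */
    static int intersections;

    /** Returns the 'limit' highest intersection counts, largest first. */
    static List<Integer> topCounts(List<FacetTerm> terms, BitSet base, int limit) {
        intersections = 0;
        if (limit <= 0) return new ArrayList<>();
        PriorityQueue<Integer> best = new PriorityQueue<>(); // min-heap of kept counts
        for (FacetTerm t : terms) {
            // Queue is full and df cannot beat the smallest kept count: skip.
            // (Ties are dropped here for simplicity.)
            if (best.size() == limit && t.df() <= best.peek()) continue;
            BitSet both = (BitSet) t.docs.clone();
            both.and(base);
            intersections++;
            int count = both.cardinality();
            if (best.size() < limit) {
                best.add(count);
            } else if (count > best.peek()) {
                best.poll();
                best.add(count);
            }
        }
        List<Integer> result = new ArrayList<>(best);
        result.sort(Collections.reverseOrder());
        return result;
    }

    /** Convenience helper to build a BitSet from doc ids. */
    static BitSet bits(int... ids) {
        BitSet b = new BitSet();
        for (int id : ids) b.set(id);
        return b;
    }
}
```

With a base query that matches many documents, the queue fills with large 
counts early and most low-df terms are skipped, which would be consistent 
with the biggest gain showing up on base query #1.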


> faceting memory and performance improvement
> ---
>
> Key: SOLR-221
> URL: https://issues.apache.org/jira/browse/SOLR-221
> Project: Solr
>  Issue Type: Improvement
>Reporter: Yonik Seeley
> Assigned To: Yonik Seeley
> Attachments: facet.patch
>
>
> 1) compare minimum count currently needed to the term df and avoid 
> unnecessary intersection count
> 2) set a minimum term df in order to use the filterCache, otherwise iterate 
> over TermDocs

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-221) faceting memory and performance improvement

2007-04-28 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-221:
--

Attachment: facet.patch




[jira] Created: (SOLR-221) faceting memory and performance improvement

2007-04-28 Thread Yonik Seeley (JIRA)
faceting memory and performance improvement
---

 Key: SOLR-221
 URL: https://issues.apache.org/jira/browse/SOLR-221
 Project: Solr
  Issue Type: Improvement
Reporter: Yonik Seeley
 Assigned To: Yonik Seeley


1) compare minimum count currently needed to the term df and avoid unnecessary 
intersection count
2) set a minimum term df in order to use the filterCache, otherwise iterate 
over TermDocs
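
As a rough illustration of item 2, the sketch below picks between a cached 
bit set and a plain walk over the postings based on a minimum df. It is an 
assumption-laden toy, not the patch: the HashMap stands in for Solr's 
filterCache, the int[] stands in for what TermDocs would enumerate, and 
MinDfFilterCacheSketch and count are hypothetical names.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

class MinDfFilterCacheSketch {
    /** Stand-in for the filterCache: term text -> docs containing the term. */
    final Map<String, BitSet> filterCache = new HashMap<>();
    /** Terms with df below this threshold never enter the cache. */
    final int minDfFilterCache;

    MinDfFilterCacheSketch(int minDfFilterCache) {
        this.minDfFilterCache = minDfFilterCache;
    }

    /** Counts the docs in 'base' that contain the term with postings 'docIds'. */
    int count(String term, int[] docIds, BitSet base) {
        if (docIds.length >= minDfFilterCache) {
            // High-df term: build (or reuse) a cached bit set and intersect.
            BitSet cached = filterCache.computeIfAbsent(term, k -> {
                BitSet b = new BitSet();
                for (int id : docIds) b.set(id);
                return b;
            });
            BitSet both = (BitSet) cached.clone();
            both.and(base);
            return both.cardinality();
        }
        // Low-df term: walk the postings directly and keep nothing in memory.
        int n = 0;
        for (int id : docIds) {
            if (base.get(id)) n++;
        }
        return n;
    }
}
```

The trade-off is memory against repeated iteration: low-df terms cost 
nothing to keep, but their postings are re-read on every request.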




[jira] Commented: (SOLR-212) Embeddable class to call solr directly

2007-04-28 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492534
 ] 

Otis Gospodnetic commented on SOLR-212:
---

Brian: interested!


> Embeddable class to call solr directly
> --
>
> Key: SOLR-212
> URL: https://issues.apache.org/jira/browse/SOLR-212
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ryan McKinley
> Assigned To: Ryan McKinley
>Priority: Minor
> Attachments: SOLR-212-DirectSolrConnection.patch, 
> SOLR-212-DirectSolrConnection.patch, SOLR-212-DirectSolrConnection.patch, 
> SOLR-212-DirectSolrConnection.patch
>
>
> For some embedded applications, it is useful to call solr without running an 
> HTTP server.  This class mimics the behavior you would get if you sent the 
> request through an HTTP connection.  It is designed to work nicely (ie 
> simple) with JNI
> the main function is:
> public class DirectSolrConnection 
> {
>   String request( String pathAndParams, String body ) throws Exception
>   {
> ...
>   }
> }




[jira] Commented: (SOLR-181) Support for "Required" field Property

2007-04-28 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492532
 ] 

Yonik Seeley commented on SOLR-181:
---

Haven't looked at the code,  but the description looks fine.

+1

> Support for "Required" field Property
> -
>
> Key: SOLR-181
> URL: https://issues.apache.org/jira/browse/SOLR-181
> Project: Solr
>  Issue Type: Improvement
>  Components: update
>Reporter: Greg Ludington
> Assigned To: Ryan McKinley
>Priority: Minor
> Attachments: solr-181-required-fields.patch, 
> solr-181-required-fields.patch
>
>
> In certain situations, it can be helpful to require every document in your 
> index has a value for a given field.  While ideally the indexing client(s) 
> should be responsible enough to add all necessary fields, this patch allows 
> it to be enforced in the Solr schema, by adding a required property to a 
> field entry.  For example, with this in the schema:
> required="true"/>
> A request to index a document without a name field will result in this 
> response:
> org.apache.solr.core.SolrException: missing required 
> fields: name 
> (and then, of course, the stack trace)
> 
> The meat of this patch is that DocumentBuilder.getDoc() throws a 
> SolrException if not all required fields have values; this may not work well 
> as is with SOLR-139, Support updateable/modifiable documents, and may have to 
> be changed depending on that issue's final disposition.




[jira] Commented: (SOLR-220) Solr returns "HTTP status code=1" in some case

2007-04-28 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492531
 ] 

Ryan McKinley commented on SOLR-220:


I just checked in a much smaller patch that at least won't throw a status 
code="1"
http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/schema/IndexSchema.java?r1=533449&r2=533448&pathrev=533449

We should probably use your patch so that it has a nice context-specific error, 
rather than the general "undefined field"

As an aside, SOLR-204 will make the request dispatcher the default /select 
handler.  This catches invalid error codes and returns a 500.

thanks



> Solr returns "HTTP status code=1" in some case
> --
>
> Key: SOLR-220
> URL: https://issues.apache.org/jira/browse/SOLR-220
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Reporter: Koji Sekiguchi
> Attachments: QueryParsing.patch
>
>
> If I request the following on solr example:
> http://localhost:8080/solr/select?q=ipod%3Bzzz+asc&version=2.2&start=0&rows=10&indent=on
> I got an exception, as I expected, because zzz is undefined, but the HTTP 
> status code is 1. I expected 400 in this case.
> The reason is that the IndexSchema.getField() method throws 
> SolrException(1,"") and QueryParsing.parseSort() doesn't catch it:
>   // getField could throw an exception if the name isn't found
>   SchemaField f = schema.getField(part);  // <=== makes HTTP status code=1
>   if (f == null || !f.indexed()){
>     throw new SolrException( 400, "can not sort on unindexed field: "+part );
>   }
> There seems to be a couple of ways to solve this problem:
> 1. IndexSchema.getField() method throws SolrException(400,"")
> 2. IndexSchema.getField() method doesn't throw the exception but returns null
> 3. The caller catches the exception and re-throws SolrException(400,"")
> 4. The caller catches the exception and re-throws SolrException(400,"",cause) 
> that wraps the cause exception
> I think either #3 or #4 will be acceptable. The attached patch is #3 for sort 
> on undefined field.
> Other than QueryParsing.parseSort(), IndexSchema.getField() is called by the 
> following classes/methods:
> - CSVLoader.prepareFields()
> - JSONWriter.writeDoc()
> - SimpleFacets.getTermCounts()
> - QueryParsing.parseValSource()
> I'm not sure whether these methods require the same patch. Any thoughts?
> regards,




[jira] Updated: (SOLR-220) Solr returns "HTTP status code=1" in some case

2007-04-28 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-220:


Attachment: QueryParsing.patch

the patch for "sort on undefined field"




[jira] Created: (SOLR-220) Solr returns "HTTP status code=1" in some case

2007-04-28 Thread Koji Sekiguchi (JIRA)
Solr returns "HTTP status code=1" in some case
--

 Key: SOLR-220
 URL: https://issues.apache.org/jira/browse/SOLR-220
 Project: Solr
  Issue Type: Bug
  Components: search
Reporter: Koji Sekiguchi


If I request the following on solr example:

http://localhost:8080/solr/select?q=ipod%3Bzzz+asc&version=2.2&start=0&rows=10&indent=on

I got an exception, as I expected, because zzz is undefined, but the HTTP 
status code is 1. I expected 400 in this case.
The reason is that the IndexSchema.getField() method throws 
SolrException(1,"") and QueryParsing.parseSort() doesn't catch it:

// getField could throw an exception if the name isn't found
SchemaField f = schema.getField(part);  // <=== makes HTTP status code=1
if (f == null || !f.indexed()){
  throw new SolrException( 400, "can not sort on unindexed field: "+part );
}

There seems to be a couple of ways to solve this problem:

1. IndexSchema.getField() method throws SolrException(400,"")
2. IndexSchema.getField() method doesn't throw the exception but returns null
3. The caller catches the exception and re-throws SolrException(400,"")
4. The caller catches the exception and re-throws SolrException(400,"",cause) 
that wraps the cause exception

I think either #3 or #4 will be acceptable. The attached patch is #3 for sort 
on undefined field.
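
To make option #4 concrete, here is a self-contained sketch of catching the 
exception and re-throwing it as a 400 that wraps the cause. It does not use 
the real Solr classes: StatusException is a hypothetical stand-in for 
SolrException, getField is a stand-in for IndexSchema.getField(), and 
resolveSortField is an invented name, so treat it as an illustration of the 
pattern rather than the attached patch.

```java
import java.util.Set;

class SortFieldErrorSketch {
    /** Hypothetical stand-in for SolrException: carries an HTTP status code. */
    static class StatusException extends RuntimeException {
        final int code;
        StatusException(int code, String msg) {
            super(msg);
            this.code = code;
        }
        StatusException(int code, String msg, Throwable cause) {
            super(msg, cause);
            this.code = code;
        }
    }

    /** Stand-in for IndexSchema.getField(): throws with code 1 for unknown names. */
    static String getField(Set<String> schemaFields, String name) {
        if (!schemaFields.contains(name)) {
            throw new StatusException(1, "undefined field " + name);
        }
        return name;
    }

    /** Option #4: the caller catches and re-throws a 400 that wraps the cause. */
    static String resolveSortField(Set<String> schemaFields, String part) {
        try {
            return getField(schemaFields, part);
        } catch (StatusException e) {
            throw new StatusException(400, "can not sort on undefined field: " + part, e);
        }
    }
}
```

Wrapping the cause keeps the original message available in the stack trace 
while the client still sees a 400.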

Other than QueryParsing.parseSort(), IndexSchema.getField() is called by the 
following classes/methods:

- CSVLoader.prepareFields()
- JSONWriter.writeDoc()
- SimpleFacets.getTermCounts()
- QueryParsing.parseValSource()

I'm not sure whether these methods require the same patch. Any thoughts?

regards,





[jira] Assigned: (SOLR-212) Embeddable class to call solr directly

2007-04-28 Thread Ryan McKinley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan McKinley reassigned SOLR-212:
--

Assignee: Ryan McKinley




[jira] Updated: (SOLR-212) Embeddable class to call solr directly

2007-04-28 Thread Ryan McKinley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan McKinley updated SOLR-212:
---

Attachment: SOLR-212-DirectSolrConnection.patch

Updated to take an (optional) logging path




[jira] Assigned: (SOLR-181) Support for "Required" field Property

2007-04-28 Thread Ryan McKinley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan McKinley reassigned SOLR-181:
--

Assignee: Ryan McKinley




[jira] Updated: (SOLR-181) Support for "Required" field Property

2007-04-28 Thread Ryan McKinley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan McKinley updated SOLR-181:
---

Attachment: solr-181-required-fields.patch

Finally got a chance to look at this.  It looks good.  I made a few 
modifications:

1. Changed tabs to spaces
2. Added javadoc comments to make it clear that RequiredFields must contain all 
fieldsWithDefaultValues
3. The error now contains the document's uniqueKey
4. Moved the test to o.a.s.schema
5. Added a non-final flag to SchemaField to say if the field is required
6. Modified IndexSchema.java to set the uniqueKey as required *unless* it is 
specified as required="false" in the schema
7. Added required="true" to the example schema.xml
8. Added required="false" to the test schema.xml (one test does not include it)

As a note to anyone else looking at the change log, Greg's patch also modifies 
AbstractSolrTestCase and TestHarness to be able to check what status is 
expected from checkUpdateU


I think this offers a good solution to the (mis)feature that you could have a 
null uniqueKey.  This patch lets you have a null uniqueKey, but you have to 
configure it.
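
A small sketch of the kind of check described above, for readers who have not 
opened the patch. It is not DocumentBuilder.getDoc(): the document is modelled 
as a plain Map, IllegalArgumentException stands in for SolrException, and 
RequiredFieldsSketch and checkRequired are hypothetical names. It only shows 
the shape of the logic, including putting the uniqueKey value in the message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class RequiredFieldsSketch {
    /**
     * Throws if any required field has no value. The message names the
     * document's uniqueKey value so the offending document can be found.
     */
    static void checkRequired(Map<String, String> doc,
                              List<String> required,
                              String uniqueKeyField) {
        List<String> missing = new ArrayList<>();
        for (String field : required) {
            if (doc.get(field) == null) missing.add(field);
        }
        if (!missing.isEmpty()) {
            String id = doc.get(uniqueKeyField);
            throw new IllegalArgumentException(
                "Document [" + id + "] missing required fields: "
                    + String.join(", ", missing));
        }
    }
}
```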






[jira] Commented: (SOLR-212) Embeddable class to call solr directly

2007-04-28 Thread Brian Whitman (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492522
 ] 

Brian Whitman commented on SOLR-212:


Since the main use case of SOLR-212 is to embed it in client applications, we 
should be careful about logging. As of now SOLR-212 will spit stuff all over 
stderr.

I suggest putting this

System.setProperty("java.util.logging.config.file", 
instanceDir+"/conf/logging.properties");

near line 79 of DirectSolrConnection.java. That way, if a developer/user 
chooses, they can put a logging.properties file in conf and direct the logging 
of Solr requests either to their own application logs or to a file. If the 
conf/logging.properties file does not exist, I believe the default 
logging.properties will be used (which is what happens now.)
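
A hedged sketch of how that suggestion could look as a helper. The class and 
method names (EmbeddedLoggingSketch, configureLogging) are invented, and this 
is not what DirectSolrConnection does today. One assumption worth stating: if 
java.util.logging has already been initialized, setting the system property 
alone may have no effect, so the sketch also asks the LogManager to re-read 
the configuration.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.logging.LogManager;

class EmbeddedLoggingSketch {
    /**
     * Applies instanceDir/conf/logging.properties if it exists.
     * Returns false, leaving the JVM defaults alone, when it does not.
     */
    static boolean configureLogging(String instanceDir) {
        File conf = new File(instanceDir, "conf/logging.properties");
        if (!conf.isFile()) {
            return false; // keep whatever logging setup is already in place
        }
        System.setProperty("java.util.logging.config.file", conf.getPath());
        try (InputStream in = new FileInputStream(conf)) {
            // Re-read explicitly in case logging was initialized earlier.
            LogManager.getLogManager().readConfiguration(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }
}
```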






Re: Luke handler help

2007-04-28 Thread Yonik Seeley

>> In an inverted index, terms point to documents.   So you have to
>> traverse *all* of the terms of a field across all documents, and keep
>> track of when you run across the document you are interested in.  When
>> you do, then get the positions that the term appeared at, and keep
>> track of them.  After you have covered all the terms, you can put
>> everything in order.  There could be gaps (positionIncrement, stop
>> word removal, etc) and it's also possible for multiple tokens to
>> appear at the same position.
>>
>> For a full-text field with many terms, and a large index, this could
>> take a *long* time.
>> It's probably very useful for debugging though.


I just realized that it's worse... if you specified a field, then you
only have to iterate the terms for that field.  If you want *all* of
the indexed, non-stored fields for a particular document, but don't
know what they are, there is no info to help you.  You need to iterate
over *all* terms in the index.

Luckily, there is a patch in the works in Lucene that will make
skipTo(myDoc) in TermDocs faster.  That should speed things up a
little.


>> Remember that df is not updated when a document is marked for deletion
>> in Lucene.
>> So you can have a df of 2, do a search, and only come up with one document.
>
> that would explain why I'm seeing df > 1 for the uniqueKey!


Yep, that's not likely to ever be fixed in Lucene.  Again, it's the
nature of the inverted index... given a particular docid, you really
have no clue what terms in the index point to that docid.
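
A toy model of why df and the search result can disagree, under the 
assumption that deletions are just a set of marked doc ids and postings are 
left untouched. The class and method names are made up for illustration and 
are not Lucene API.

```java
import java.util.BitSet;

class DeletedDocsDfSketch {
    /** df as recorded for the term: every posting counts, deleted or not. */
    static int docFreq(int[] postings) {
        return postings.length;
    }

    /** What a search would return: postings whose doc is not marked deleted. */
    static int liveCount(int[] postings, BitSet deleted) {
        int n = 0;
        for (int doc : postings) {
            if (!deleted.get(doc)) n++;
        }
        return n;
    }
}
```

Re-adding a document with the same uniqueKey leaves the old posting in place 
until segments are merged, which is how a uniqueKey term can show df > 1.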

-Yonik


[jira] Commented: (SOLR-212) Embeddable class to call solr directly

2007-04-28 Thread Brian Whitman (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492518
 ] 

Brian Whitman commented on SOLR-212:


Much love from user land on this one. I just successfully put solr in a C app 
without any webserver running using JNI.

After I clean up my JNI calling code I can post an example app here to show how 
it's done on the client side if anyone is interested?












Re: Luke handler help

2007-04-28 Thread Ryan McKinley

Yonik Seeley wrote:
> On 4/28/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
>> I have a few things I'd like to check with the Luke handler, if you all
>> could check some of the assumptions, that would be great.
>>
>> * I want to print out the document frequency for a term in a given
>> document.  Since that term shows up in the given document, I would think
>> the document frequency must be >= 1.  I am using: reader.docFreq( t ) [line
>> 236] The results seem reasonable, but *sometimes* it returns zero... is
>> that possible?
>
> Is the field indexed?
> Did you run the field through the analyzer to get the terms (to match
> what's in the index)?
> If both of those are true, it seems like the docFreq should always be
> greater than 0.

aah, that makes sense - now that you mention it, I only see df=0 for 
non-indexed, stored fields.

> In an inverted index, terms point to documents.   So you have to
> traverse *all* of the terms of a field across all documents, and keep
> track of when you run across the document you are interested in.  When
> you do, then get the positions that the term appeared at, and keep
> track of them.  After you have covered all the terms, you can put
> everything in order.  There could be gaps (positionIncrement, stop
> word removal, etc) and it's also possible for multiple tokens to
> appear at the same position.
>
> For a full-text field with many terms, and a large index, this could
> take a *long* time.
> It's probably very useful for debugging though.

That must be why Luke starts a new thread for 'reconstruct and edit'. 
For now, I will leave this out of the handler and leave it open to 
someone with the need/time in the future.

>> * Each field gets a boolean attribute "cacheableFaceting" -- this is true
>> if the number of distinct terms is smaller than the filterCacheSize.  I
>> get the filterCacheSize from: solrconfig.xml:"query/filterCache/@size"
>> and get the distinctTerm count from counting up the termEnum.  Is this
>> logic solid?  I know the cacheability changes if you are faceting
>> multiple fields at once, but it's still nice to have a ballpark estimate
>> without needing to know the internals.
>
> It could get trickier... I'm about to hack up a quick patch now that
> will reduce memory usage by only using the filterCache above a
> certain df threshold.  It may increase or
> decrease the faceting speed - TBD.
>
> Also, other alternate faceting schemes are in the works (a month or two
> out).
>
> I'd leave this attribute out and just report on the number of unique terms.

ok, that seems reasonable.

> Some kind of histogram might be really nice though (how many terms
> under varying df values):
>  1=>412  (412 terms have a df of 1)
>  2=>516  (516 terms have a df of 2)
>  4=>600
>  8=>650
> 16=>670
> 32=>680
> 64=>683
> 128=>685
> 256=>686
> 11325=>690  (the maxDf found)

I'll take a look at that

> Remember that df is not updated when a document is marked for deletion
> in Lucene.
> So you can have a df of 2, do a search, and only come up with one document.

that would explain why I'm seeing df > 1 for the uniqueKey!



Re: Luke handler help

2007-04-28 Thread Yonik Seeley

On 4/28/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:

> I have a few things I'd like to check with the Luke handler, if you all
> could check some of the assumptions, that would be great.
>
> * I want to print out the document frequency for a term in a given
> document.  Since that term shows up in the given document, I would think
> the document frequency must be >= 1.  I am using: reader.docFreq( t ) [line
> 236] The results seem reasonable, but *sometimes* it returns zero... is
> that possible?


Is the field indexed?
Did you run the field through the analyzer to get the terms (to match
what's in the index)?
If both of those are true, it seems like the docFreq should always be
greater than 0.


> * I want to return the lucene field flags for each field.  I run through
> all the field names with:
> reader.getFieldNames(IndexReader.FieldOption.ALL).  Is there a way to
> get any Fieldable for a given name?  IIUC, all terms with the same name
> will have the same flags.  I tried searching for a document with that
> field, it works, but only for stored fields.
>
> * I just realized that I am only returning stored fields for
> getDocumentFieldsInfo() (it uses Document.getFields())  How can I
> find *all* Fieldables for a given document?  I have tried following the
> luke source, but get a bit lost ;)


LOL... if it's an inverted index, it's difficult and time consuming to
try and reconstruct what a non-stored field value was.

In an inverted index, terms point to documents.   So you have to
traverse *all* of the terms of a field across all documents, and keep
track of when you run across the document you are interested in.  When
you do, then get the positions that the term appeared at, and keep
track of them.  After you have covered all the terms, you can put
everything in order.  There could be gaps (positionIncrement, stop
word removal, etc) and it's also possible for multiple tokens to
appear at the same position.

For a full-text field with many terms, and a large index, this could
take a *long* time.
It's probably very useful for debugging though.
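
To illustrate the traversal described above, here is a toy inverted index as 
nested maps (term -> docId -> positions) and a method that rebuilds one 
document's token stream by visiting every term. This is not Lucene API and 
not how Luke is written; ReconstructFieldSketch and reconstruct are 
hypothetical names, and a real index would be walked with TermEnum and 
TermPositions rather than maps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReconstructFieldSketch {
    /**
     * index: term -> (docId -> positions of that term in the doc).
     * Returns position -> terms at that position for one document.
     * Every term must be visited, because nothing points from a doc to its terms.
     */
    static TreeMap<Integer, List<String>> reconstruct(
            Map<String, Map<Integer, int[]>> index, int docId) {
        TreeMap<Integer, List<String>> byPosition = new TreeMap<>();
        for (Map.Entry<String, Map<Integer, int[]>> e : index.entrySet()) {
            int[] positions = e.getValue().get(docId);
            if (positions == null) continue; // term does not occur in this doc
            for (int p : positions) {
                // Several tokens may share a position (e.g. synonyms).
                byPosition.computeIfAbsent(p, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        return byPosition; // sorted by position; missing keys are gaps
    }
}
```

The cost is proportional to the number of terms in the field, not to the 
size of the document, which is why it gets slow on a large full-text field.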


> * Each field gets a boolean attribute "cacheableFaceting" -- this is true
> if the number of distinct terms is smaller than the filterCacheSize.  I
> get the filterCacheSize from: solrconfig.xml:"query/filterCache/@size"
> and get the distinctTerm count from counting up the termEnum.  Is this
> logic solid?  I know the cacheability changes if you are faceting
> multiple fields at once, but it's still nice to have a ballpark estimate
> without needing to know the internals.


It could get trickier... I'm about to hack up a quick patch now that
will reduce memory usage by only using the filterCache  above a
certain df threshold.  It may increase or
decrease the faceting speed - TBD.

Also, other alternate faceting schemes are in the works (a month or two out).
I'd leave this attribute out and just report on the number of unique terms.
Some kind of histogram might be really nice though (how many terms
under varying df values):
 1=>412  (412 terms have a df of 1)
 2=>516  (516 terms have a df of 2)
 4=>600
 8=>650
16=>670
32=>680
64=>683
128=>685
256=>686
11325=>690  (the maxDf found)
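
One way such a histogram could be computed is sketched below. It assumes
the buckets are cumulative (terms with df at or below each power-of-two
threshold, with a final bucket at the max df), which is how the sample
numbers above read. The class name DfHistogram and the input array of
per-term df values are hypothetical; in practice the df values would come
from enumerating the field's terms.

```java
import java.util.*;

public class DfHistogram {

    // Given the document frequency of every term in a field, count how many
    // terms have df <= each power-of-two threshold, ending with the max df.
    static SortedMap<Integer, Integer> cumulative(int[] termDfs) {
        SortedMap<Integer, Integer> histogram = new TreeMap<>();
        if (termDfs.length == 0) return histogram;

        int[] sorted = termDfs.clone();
        Arrays.sort(sorted);
        int maxDf = sorted[sorted.length - 1];

        int seen = 0; // number of terms with df <= current threshold
        for (long threshold = 1; threshold < maxDf; threshold *= 2) {
            while (seen < sorted.length && sorted[seen] <= threshold) seen++;
            histogram.put((int) threshold, seen);
        }
        histogram.put(maxDf, sorted.length); // last bucket: the max df found
        return histogram;
    }

    public static void main(String[] args) {
        int[] dfs = {1, 1, 2, 3, 4, 9}; // made-up df values for six terms
        System.out.println(cumulative(dfs)); // {1=2, 2=3, 4=5, 8=5, 9=6}
    }
}
```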

Remember that df is not updated when a document is marked for deletion
in Lucene.
So you can have a df of 2, do a search, and only come up with one document.

-Yonik


[jira] Updated: (SOLR-212) Embeddable class to call solr directly

2007-04-28 Thread Ryan McKinley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan McKinley updated SOLR-212:
---

Attachment: SOLR-212-DirectSolrConnection.patch

Adding dataDir to an optional constructor.

> Embeddable class to call solr directly
> --
>
> Key: SOLR-212
> URL: https://issues.apache.org/jira/browse/SOLR-212
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ryan McKinley
>Priority: Minor
> Attachments: SOLR-212-DirectSolrConnection.patch, 
> SOLR-212-DirectSolrConnection.patch, SOLR-212-DirectSolrConnection.patch
>
>
> For some embedded applications, it is useful to call solr without running an 
> HTTP server.  This class mimics the behavior you would get if you sent the 
> request through an HTTP connection.  It is designed to work nicely (i.e. 
> simply) with JNI.
> The main function is:
> public class DirectSolrConnection 
> {
>   String request( String pathAndParams, String body ) throws Exception
>   {
> ...
>   }
> }

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Luke handler help

2007-04-28 Thread Ryan McKinley
I have a few things I'd like to check with the Luke handler; if you all 
could check some of the assumptions, that would be great.


* I want to print out the document frequency for a term in a given 
document.  Since that term shows up in the given document, I would think 
the document frequency must be >= 1.  I am using: reader.docFreq( t ) [line 
236].  The results seem reasonable, but *sometimes* it returns zero... is 
that possible?


* I want to return the lucene field flags for each field.  I run through 
all the field names with: 
reader.getFieldNames(IndexReader.FieldOption.ALL).  Is there a way to 
get any Fieldable for a given name?  IIUC, all fields with the same name 
will have the same flags.  I tried searching for a document with that 
field; it works, but only for stored fields.


* I just realized that I am only returning stored fields for 
getDocumentFieldsInfo() (it uses Document.getFields()).  How can I 
find *all* Fieldables for a given document?  I have tried following the 
luke source, but get a bit lost ;)


* Each field gets a boolean attribute "cacheableFaceting" -- this is true 
if the number of distinct terms is smaller than the filterCacheSize.  I 
get the filterCacheSize from: solrconfig.xml:"query/filterCache/@size" 
and get the distinctTerm count from counting up the termEnum.  Is this 
logic solid?  I know the cacheability changes if you are faceting 
multiple fields at once, but it's still nice to have a ballpark estimate 
without needing to know the internals.
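
For concreteness, the check described in the last item might look like the
sketch below. It follows the message's own assumption that each distinct
term needs one filterCache entry. The class and method names are
hypothetical and the numbers in main are made up. The second method covers
the multi-field caveat by summing the distinct terms of all fields that
share the cache.

```java
import java.util.*;

public class FacetCacheEstimate {

    // Single field: cacheable if its distinct terms fit in the filterCache.
    static boolean cacheableFaceting(long distinctTerms, long filterCacheSize) {
        return distinctTerms < filterCacheSize;
    }

    // Several fields faceted at once compete for the same cache, so the
    // estimate has to use the sum of their distinct term counts.
    static boolean cacheableTogether(Map<String, Long> distinctTermsByField,
                                     long filterCacheSize) {
        long total = 0;
        for (long n : distinctTermsByField.values()) total += n;
        return total < filterCacheSize;
    }

    public static void main(String[] args) {
        long filterCacheSize = 1000; // would be read from query/filterCache/@size
        Map<String, Long> fields = new LinkedHashMap<>();
        fields.put("category", 600L);
        fields.put("author", 700L);

        System.out.println(cacheableFaceting(600, filterCacheSize));      // true
        System.out.println(cacheableFaceting(700, filterCacheSize));      // true
        System.out.println(cacheableTogether(fields, filterCacheSize));   // false
    }
}
```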



thanks for any pointers
ryan


[jira] Commented: (SOLR-204) Let solrconfig.xml configure the SolrDispatchFilter to handle /select

2007-04-28 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492512
 ] 

Ryan McKinley commented on SOLR-204:



> 
> should probably be something more like:
>   throw new SolrException(400,"Query parsing error: " + e.getMessage() ,e);
> 

Yes.  The other change is that RequestDispatcher only prints the stack 
trace for errors with a status code >= 500.  A 400 (bad request) assumes 
the message itself is useful to the user.
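
A minimal sketch of that rule, assuming a hypothetical helper named
errorMessage (this is not the code in the patch, and it leaves out the
servlet API so it can run standalone):

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class ErrorBody {

    // Hypothetical helper: build the text sent back for an error.
    // Codes below 500 are treated as client errors whose message should be
    // enough on its own; 500 and above also get the stack trace.
    static String errorMessage(int code, Throwable ex) {
        String msg = String.valueOf(ex.getMessage());
        if (code < 500) {
            return msg;
        }
        StringWriter trace = new StringWriter();
        PrintWriter out = new PrintWriter(trace);
        ex.printStackTrace(out);
        out.flush();
        return msg + "\n\n" + trace;
    }

    public static void main(String[] args) {
        System.out.println(errorMessage(400,
                new IllegalArgumentException("undefined field catdsfgsdg")));
        System.out.println(errorMessage(500, new RuntimeException("boom")));
    }
}
```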



> Let solrconfig.xml configure the SolrDispatchFilter to handle /select
> -
>
> Key: SOLR-204
> URL: https://issues.apache.org/jira/browse/SOLR-204
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ryan McKinley
> Assigned To: Ryan McKinley
> Attachments: SOLR-204-HandleSelect.patch, 
> SOLR-204-HandleSelect.patch, SOLR-204-HandleSelect.patch
>
>
> The major reason to make everything use the SolrDispatchFilter is that we 
> would have consistent error handling.  Currently, 
> SolrServlet spits back errors using:
>  PrintWriter writer = response.getWriter();
>  writer.write(msg);
> and the SolrDispatchFilter spits them back using:
>  res.sendError( code, ex.getMessage() );
> Using "sendError" lets the servlet container format the code so it shows up 
> ok in a browser.  Without it, you may have to view source to see the error.
> Additionally, SolrDispatchFilter is more discerning about including a stack trace. 
>  It only includes a stack trace for a 500 or an unknown response code.
> Eventually, the error should probably be formatted in the requested format - 
> SOLR-141.




[jira] Commented: (SOLR-204) Let solrconfig.xml configure the SolrDispatchFilter to handle /select

2007-04-28 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492511
 ] 

Yonik Seeley commented on SOLR-204:
---

OK cool, for something like an undefined field, it looks fine:
"undefined field catdsfgsdg"

But for something like a query parsing error, the only pointer to *what* the 
error is is in the stack trace, and you don't get that back.  You just get: 
"Error parsing Lucene query"

The logs show:
SEVERE: org.apache.lucene.queryParser.ParseException: Cannot parse 'foo:*': '*' or '?' not allowed as first character in WildcardQuery
    at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:149)
    at org.apache.solr.search.QueryParsing.parseQuery(QueryParsing.java:94)
    at org.apache.solr.request.StandardRequestHandler.handleRequestBody(StandardRequestHandler.java:85)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)

Hmmm, but I think this is an exception issue:

In QueryParsing.java:
} catch (ParseException e) {
  SolrCore.log(e);
  throw new SolrException(400,"Error parsing Lucene query",e);
}

should probably be something more like:
  throw new SolrException(400,"Query parsing error: " + e.getMessage() ,e);
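
To show the effect of that change in isolation, here is a self-contained
sketch. StatusException is a stand-in for SolrException(code, msg, cause),
and parse() is a toy replacement for the Lucene query parser that only
rejects a leading wildcard; neither is the real class.

```java
import java.text.ParseException;

public class QueryErrorDemo {

    // Stand-in for SolrException(int code, String msg, Throwable cause).
    static class StatusException extends RuntimeException {
        final int code;

        StatusException(int code, String msg, Throwable cause) {
            super(msg, cause);
            this.code = code;
        }
    }

    // Toy parser: rejects a term that starts with '*' or '?'.
    static String parse(String query) throws ParseException {
        int colon = query.indexOf(':');
        String term = query.substring(colon + 1);
        if (term.startsWith("*") || term.startsWith("?")) {
            throw new ParseException("Cannot parse '" + query
                    + "': '*' or '?' not allowed as first character in WildcardQuery",
                    colon + 1);
        }
        return query;
    }

    // Carry the parser's own message in the 400 error, so the client can see
    // *what* was wrong without needing the stack trace.
    static String parseOrBadRequest(String query) {
        try {
            return parse(query);
        } catch (ParseException e) {
            throw new StatusException(400, "Query parsing error: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        try {
            parseOrBadRequest("foo:*");
        } catch (StatusException e) {
            System.out.println(e.code + " " + e.getMessage());
        }
    }
}
```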





[jira] Commented: (SOLR-204) Let solrconfig.xml configure the SolrDispatchFilter to handle /select

2007-04-28 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492508
 ] 

Ryan McKinley commented on SOLR-204:


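sendError lets the servlet container decide how to format the response body.  
Typically it returns an HTML page with the status code and a footer naming the 
container, such as "Jetty" or "Resin".

This is what you get to configure in web.xml with:

  <error-page>
    <exception-type>java.lang.Exception</exception-type>
    <location>/error</location>
  </error-page>
  <error-page>
    <error-code>404</error-code>
    <location>/error</location>
  </error-page>
etc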




[jira] Updated: (SOLR-204) Let solrconfig.xml configure the SolrDispatchFilter to handle /select

2007-04-28 Thread Ryan McKinley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan McKinley updated SOLR-204:
---

Attachment: SOLR-204-HandleSelect.patch

applies cleanly with trunk




Admin interface configuration changes?

2007-04-28 Thread Ryan McKinley


As we move to arbitrary path based configuration, the JSP admin pages 
don't really know where things are and what to link to.


In looking into how to replace get-file.jsp and how to have an upload 
page for /update and /update/csv, I stumbled on the idea that we could 
have the list of options for what is displayed in the admin interface 
configured in solrconfig.xml.


Perhaps something like:


solr
   ...

Thoughts?




Re: move UpdateParams

2007-04-28 Thread Yonik Seeley

On 4/28/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:


I'd like to move UpdateParams from o.a.s.handler to o.a.s.util

The other classes like it are in .util

objections?


No objections... this class and update plugins in general are very new.

-Yonik


Re: solr release planning for 1.2

2007-04-28 Thread Yonik Seeley

On 4/28/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:

If the configuration is in solrconfig.xml, we can set the example to use
the dispatcher but still leave the option of the 'old' style servlet if
that is desired.  The only real difference between them is how errors
are returned.  The dispatcher calls res.sendError( code, msg ) while the
servlet writes them out directly (causing them to be hidden by IE/FF)


I think only the body of the response changes since the HTTP error
codes were already being used for /select

Since the body of the response was never really specified, and it
wasn't in a parseable format, I think using sendError() could be
considered backward compatible.

-Yonik


[jira] Commented: (SOLR-204) Let solrconfig.xml configure the SolrDispatchFilter to handle /select

2007-04-28 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492505
 ] 

Yonik Seeley commented on SOLR-204:
---

I wanted to try this out to see what sendError() output looks like, but the 
patch isn't applying cleanly.

$ patch -p0 < c:/dl/SOLR-204*
(Stripping trailing CRs from patch.)
patching file src/test/test-files/solr/conf/solrconfig.xml
(Stripping trailing CRs from patch.)
patching file src/webapp/WEB-INF/web.xml
(Stripping trailing CRs from patch.)
patching file src/webapp/src/org/apache/solr/servlet/SolrDispatchFilter.java
Hunk #1 FAILED at 56.
1 out of 1 hunk FAILED -- saving rejects to file src/webapp/src/org/apache/solr/servlet/SolrDispatchFilter.java.rej
(Stripping trailing CRs from patch.)
patching file src/webapp/src/org/apache/solr/servlet/SolrRequestParsers.java
(Stripping trailing CRs from patch.)
patching file example/solr/conf/solrconfig.xml
Hunk #1 succeeded at 231 (offset 8 lines).

> Let solrconfig.xml configure the SolrDispatchFilter to handle /select
> -
>
> Key: SOLR-204
> URL: https://issues.apache.org/jira/browse/SOLR-204
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ryan McKinley
> Assigned To: Ryan McKinley
> Attachments: SOLR-204-HandleSelect.patch, SOLR-204-HandleSelect.patch



move UpdateParams

2007-04-28 Thread Ryan McKinley


I'd like to move UpdateParams from o.a.s.handler to o.a.s.util

The other classes like it are in .util

objections?


Re: solr release planning for 1.2

2007-04-28 Thread Ryan McKinley

Yonik Seeley wrote:

On 4/5/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> I'm certainly on board with adding a requestHandler mapping for "/update",
> but i'm not sure how i feel about changing it under the covers ...

I'm suggesting we keep /update mapped to SolrUpdateServlet in web.xml, 
but map:


  


+1

I am not sure what we should do with the DispatchFilter handle-select 
parameter:


  <init-param>
    <param-name>handle-select</param-name>
    <param-value>true</param-value>
  </init-param>



Why do we need this parameter?  I thought that /select through
DispatchFilter would be backward compatible with the servlet's current
handling?  If that's the case, just have dispatch handle it and be
done with it.



Since writing this, I added SOLR-204 - this lets you configure whether the 
DispatchFilter will handle select in solrconfig.xml rather than web.xml


If the configuration is in solrconfig.xml, we can set the example to use 
the dispatcher but still leave the option of the 'old' style servlet if 
that is desired.  The only real difference between them is how errors 
are returned.  The dispatcher calls res.sendError( code, msg ) while the 
servlet writes them out directly (causing them to be hidden by IE/FF)


SOLR-204 removes the 


Re: Do we agree on our RTC way of working? (was: Welcome Ryan McKinley!)

2007-04-28 Thread Erik Hatcher


On Apr 27, 2007, at 4:45 PM, Yonik Seeley wrote:

Off on a tangent: for contributors, we want to be careful about
implying that patches should always be complete, include unit tests,
and be documented.  While it's nice, we'd still rather have a patch
than no patch at all.

Of course if someone is looking to become a committer, then we would
be looking at patch completeness, quality, tests, etc.


+1 on Yonik's comments!

And this is where we committers can really lend a hand in mentoring  
contributors, by creating or beefing up unit tests for contributed  
patches and showing others how to run and craft the tests.


Here's a tip: contributors can start with a clean checkout, ensure  
unit tests pass, then apply their changes and ensure the existing  
tests still work.


Erik



Re: solr release planning for 1.2

2007-04-28 Thread Yonik Seeley

On 4/5/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:

> I'm certainly on board with adding a requestHandler mapping for "/update",
> but i'm not sure how i feel about changing it under the covers ...

I'm suggesting we keep /update mapped to SolrUpdateServlet in web.xml, but map:

  


+1


I am not sure what we should do with the DispatchFilter handle-select parameter:

  <init-param>
    <param-name>handle-select</param-name>
    <param-value>true</param-value>
  </init-param>



Why do we need this parameter?  I thought that /select through
DispatchFilter would be backward compatible with the servlet's current
handling?  If that's the case, just have dispatch handle it and be
done with it.

-Yonik


Re: Open to updates to solr.py? (Python client library)

2007-04-28 Thread Ed Summers

On 4/26/07, Jason Cater <[EMAIL PROTECTED]> wrote:

I've recently implemented a SOLR solution internally.  We typically use
python as our language of choice, so I needed a python library to
connect to SOLR.


Nice work. This looks like a really nice improvement on the python
client that is currently available. I wonder, have you considered
bundling this up and making it available via PyPI so that it can be
installed w/ easy_install?

//Ed


Re: Do we agree on our RTC way of working? (was: Welcome Ryan McKinley!)

2007-04-28 Thread Bertrand Delacretaz

On 4/27/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:




...My *personal* philosophy is probably more permissive than most:..


Thanks for sharing this, you're totally right that a half-baked patch
is better than no patch at all, and that there are different stages
which make sense in contributions.

Hard rules wouldn't work, but I'm glad we've had this discussion (and
I'll go back to my corner now ;-)

Also, thanks Hoss for creating
http://wiki.apache.org/solr/CommitPolicy, I think it's really good to
have this.

-Bertrand