Filtering results based on score
Hi, as part of the Solr results I am able to get the max score. I want to filter the results based on that max score: say the max score is 10, and I need only the results scoring between the max score and 50% of the max score. The max score is going to change dynamically. How can we implement this? Do we need to customize Solr? Any suggestions are welcome. Regards, JS
Solr Relevancy Calculation
Hi, I have 25 indexed fields in my document. But by default, if I give q=laptops, the search runs against five fields, and I get a score as part of the search results. How does Solr calculate the score? Does it calculate it only on the five fields, or on all 25 indexed fields? In what order does it calculate the score? Any documents on this topic would be helpful. Regards, JS
Boosting the score based on certain field
Hi, in my document I have a field called category. It contains electronics, games, etc. For some of the category values I need to boost the document score. Let us say, for the electronics category, I will set a boosting parameter greater than the one for the games category. Does anybody have an idea how to achieve this functionality? Regards, Siva
Re: Filtering results based on score
frange is advised in a similar discussion: http://search-lucene.com/m/4AHNF17wIJW1/
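A sketch of how frange can express the cutoff, assuming the client first reads maxScore from an ordinary query and computes the 50% threshold itself (the 5.0 below corresponds to a max score of 10; the query() function evaluates the main query and returns its score):

    # step 1: run the query normally; the response includes maxScore
    q=laptops
    # step 2: repeat it, keeping only docs scoring at least half of that max
    q=laptops&fq={!frange l=5.0}query($q)

No Solr customization should be needed; frange and query() are standard function-query features.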
Multiple Keyword Search
Hi, there is a situation where I search for more than one keyword; my two main fields are ad_title and ad_description. I want the results which match all of the keywords in both fields to come on top. Then, sequentially, one keyword at a time can be dropped in further results. E.g., in a search of 3 keywords, say there are 100 results. If 35 contain all the keywords combined in ad_title and ad_description, then they should come first. If 50 results contain a combination of any 2 keywords, they should come next. Finally, results with a single keyword should come last. Please suggest. -- Thanks, Pawan Darira
Re: Re: problem of solr replication's speed
I hacked SnapPuller to log the cost, and the log looks like this:

[2010-11-01 17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 979
[2010-11-01 17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 4
[2010-11-01 17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 4
[2010-11-01 17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 980
[2010-11-01 17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 4
[2010-11-01 17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 5
[2010-11-01 17:21:21][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 979

It says that every couple of packets, transferring 1MB of data costs about 1000ms. I use Jetty as the server and embed Solr in my app. I'm so confused. What have I done wrong?

At 2010-11-01 10:12:38, Lance Norskog goks...@gmail.com wrote:
If you are copying from an indexer while you are indexing new content, this would cause contention for the disk head. Does indexing slow down during this period? Lance

2010/10/31 Peter Karich peat...@yahoo.de:
> we have an identical-sized index and it takes ~5 minutes
>> It takes about one hour to replicate a 6G index for Solr in my env. But my network can transfer files at about 10-20M/s using scp. So Solr's HTTP replication is too slow; is that normal, or am I doing something wrong?

-- Lance Norskog goks...@gmail.com
Re: Re: Re: problem of solr replication's speed
I suspected my app had some sleeping op every 1s, so I changed ReplicationHandler.PACKET_SZ to 1024 * 1024 * 10; // 10MB. The resulting log looks like this:

[2010-11-01 17:49:29][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 3184
[2010-11-01 17:49:32][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 3426
[2010-11-01 17:49:36][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 3359
[2010-11-01 17:49:39][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 3166
[2010-11-01 17:49:42][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 3513
[2010-11-01 17:49:46][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 3140
[2010-11-01 17:49:50][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 3471

That means it's still as slow as before. What's wrong with my environment?
Re: Design and Usage Questions
Hm, I do not have a webserver set up, for security reasons. I use SVNKit to connect to SVN via the file:// protocol, and what I get back is a ByteArrayOutputStream. What would the buffer solution or the dual-thread writer/reader pair look like?

-----Original Message-----
From: Lance Norskog goks...@gmail.com
Sent: Nov 1, 2010 3:23:55 AM
To: solr-user@lucene.apache.org
Subject: Re: Design and Usage Questions

2. The SolrJ library handling of content streams is pull, not push. That is, you give it a reader and it pulls content when it feels like it. If the software feeding the connection wants to write the data, you have to either buffer the whole thing or use a dual-thread writer/reader pair. The easiest way to pull stuff from SVN is to use one of the web server apps. Solr takes a stream.url parameter. (Also stream.file.) Note that there is no outbound authentication supported; your web server has to be open (at least to the Solr instance).

On Sun, Oct 31, 2010 at 4:06 PM, getagrip getag...@web.de wrote:
Hi, I've got some basic usage / design questions.
1. The SolrJ wiki proposes to use the same CommonsHttpSolrServer instance for all requests to avoid connection leaks. So if I create a singleton instance at application startup, can I safely use this instance for ALL queries/updates throughout my application without running into performance issues?
2. My system's documents are stored in a Subversion repository. For fast search results I want to periodically index new documents from the repository. What I get from the repository is a ByteArrayOutputStream. How can I pass this stream to Solr? I only see possibilities to pass files, but in my case it makes no sense to write the ByteArrayOutputStream to disk again, as that would cause performance issues, apart from making no sense anyway.
3. Are there any disadvantages to using SolrJ over some other HTTP-based solution, e.g. creating and sending my own HTTP requests? Do I even have to use HTTP? I see the EmbeddedSolrServer exists. Any drawbacks to using that?
Any hints are welcome. Thanks!
Re: Design and Usage Questions
Ok, so if I did NOT use SolrJ, could I PUSH a stream to Solr somehow? I do not depend on SolrJ; any connection method would suffice.
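Since the SVN contents are already fully buffered in memory, the pull-vs-push distinction mostly disappears: the bytes can simply be wrapped as a ContentStream for SolrJ to pull from. A minimal sketch under that assumption; the /update/extract handler path, the document id, and the push() wrapper are illustrative, not from the thread:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
    import org.apache.solr.common.util.ContentStreamBase;

    public class SvnIndexer {
      // baos: the ByteArrayOutputStream obtained from SVNKit
      public static void push(SolrServer server, final ByteArrayOutputStream baos)
          throws Exception {
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        // Wrap the in-memory bytes; SolrJ "pulls" from this stream on demand.
        req.addContentStream(new ContentStreamBase() {
          @Override
          public InputStream getStream() {
            return new ByteArrayInputStream(baos.toByteArray());
          }
        });
        req.setParam("literal.id", "doc-1"); // hypothetical unique key
        server.request(req);
      }
    }

No writing to disk is needed, and no dual-thread pair either, because the whole document already sits in the ByteArrayOutputStream.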
Re: Custom Sorting in Solr
Ok, I imagined that the doubly linked list would be far too complicated for Solr. Now, how can I get Solr to connect to a webservice and do the import? I'm sorry if I'm not clear; sometimes my English gets fuzzy :P

On Fri, Oct 29, 2010 at 4:51 PM, Yonik Seeley yo...@lucidimagination.com wrote:
On Fri, Oct 29, 2010 at 3:39 PM, Ezequiel Calderara ezech...@gmail.com wrote:
Hi all guys! I'm in a weird situation here. We have indexed a set of documents which are ordered using a linked list (each document has a reference to the previous and the next). Is there a way, when sorting in the Solr search, to use the linked list to sort?

It seems like you should be able to encode this linked list as an integer instead, and sort by that? If there are multiple linked lists in the index, it seems like you could even use the high bits of the int to designate which list the doc belongs to, and the low-order bits as the order within that list.
-Yonik
http://www.lucidimagination.com
--
Ezequiel.
Http://www.ironicnet.com
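A small illustration of Yonik's encoding idea; the 8/24 bit split is an arbitrary assumption and should be sized to the actual number of lists and their lengths:

    // Hypothetical encoding: high 8 bits identify the linked list, low 24 bits
    // give the position within it. Index the result in a sortable int field;
    // sorting on that field then reproduces each list's order.
    public static int encodeListOrder(int listId, int position) {
      return (listId << 24) | (position & 0xFFFFFF);
    }

At index time you walk each list once, assigning increasing positions, instead of chasing prev/next references at query time.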
Re: solr stuck in writing to inexisting sockets
Hi, yes, sometimes it takes 5 minutes for a query. I agree this is not desirable. However, if the application has no control over the input queries other than closing the socket after a while, Solr should not continue writing the response, but terminate the thread. In general, is there a way to quantify the complexity of a given query on a certain index? Some general guidelines which can be used by non-technical people? Thanks a lot, roxana

--- On Sun, 10/31/10, Erick Erickson erickerick...@gmail.com wrote:
From: Erick Erickson erickerick...@gmail.com
Subject: Re: solr stuck in writing to inexisting sockets
To: solr-user@lucene.apache.org
Date: Sunday, October 31, 2010, 2:29 AM

Are you saying that your Solr server is at times taking 5 minutes to complete? If so, I'd get to the bottom of that first off. My first guess would be that you're either hitting memory issues and swapping horribly, or... well, that would be my first guess.
Best, Erick

On Thu, Oct 28, 2010 at 5:23 AM, Roxana Angheluta anghelu...@yahoo.com wrote:
Hi all, we are using Solr over Jetty with a large index, sharded and distributed over multiple machines. Our queries are quite long, involving boolean and proximity operators. We cut the connection at the client side after 5 minutes. Also, we are using the timeAllowed parameter to stop executing a query on the server after a while. We quite often run into situations where Solr blocks. The load on the server increases, and a thread dump on the Solr process shows many threads like the one below:

btpool0-49 prio=10 tid=0x7f73afe1d000 nid=0x3581 runnable [0x451a]
java.lang.Thread.State: RUNNABLE
at java.io.PrintWriter.write(PrintWriter.java:362)
at org.apache.solr.common.util.XML.escape(XML.java:206)
at org.apache.solr.common.util.XML.escapeCharData(XML.java:79)
at org.apache.solr.request.XMLWriter.writePrim(XMLWriter.java:832)
at org.apache.solr.request.XMLWriter.writeStr(XMLWriter.java:684)
at org.apache.solr.request.XMLWriter.writeVal(XMLWriter.java:564)
at org.apache.solr.request.XMLWriter.writeDoc(XMLWriter.java:435)
at org.apache.solr.request.XMLWriter$2.writeDocs(XMLWriter.java:514)
at org.apache.solr.request.XMLWriter.writeDocuments(XMLWriter.java:485)
at org.apache.solr.request.XMLWriter.writeSolrDocumentList(XMLWriter.java:494)
at org.apache.solr.request.XMLWriter.writeVal(XMLWriter.java:588)
at org.apache.solr.request.XMLWriter.writeResponse(XMLWriter.java:130)
at org.apache.solr.request.XMLResponseWriter.write(XMLResponseWriter.java:34)
at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:325)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
..

A netstat on the machine shows sockets in state CLOSE_WAIT. However, there are fewer of them than the number of RUNNABLE threads like the above. Why is this happening? Is there anything we can do to avoid getting into these situations? Thanks, roxana
big terms in UnInvertedField
Hello, with the Solr example, using facet.field=text creates an UnInvertedField for the text field in the fieldValueCache. After that, I looked at the stats page and was surprised that the counters in the *filterCache* were up:

lookups : 213
hits : 106
hitratio : 0.49
inserts : 107
evictions : 0
size : 107
warmupTime : 0
cumulative_lookups : 213
cumulative_hits : 106
cumulative_hitratio : 0.49
cumulative_inserts : 107
cumulative_evictions : 0

Is this caused by big terms in the UnInvertedField? If so, when using both faceting on a multiValued field and faceting on a single-valued field (or facet queries), it is difficult to estimate the size of the filterCache. Koji -- http://www.rondhuit.com/en/
Re: big terms in UnInvertedField
2010/11/1 Koji Sekiguchi k...@r.email.ne.jp:
> With the solr example, using facet.field=text creates UnInvertedField for the text field in fieldValueCache. After that, I saw the stats page and I was surprised that counters in *filterCache* were up.
> Is this caused by big terms in UnInvertedField?

Yes. Big terms (defined as matching more than 5% of the index) are not uninverted, since it's more efficient (both CPU and memory) to use the filterCache and calculate intersections.

> If so, when using both facet for a multiValued field and facet for a single-valued field / facet query, it is difficult to estimate the size of filterCache.

Yep. At least fieldValueCache (for UnInvertedField) tells you the number of big terms in each field you are faceting on, though.
-Yonik
http://www.lucidimagination.com
Re: big terms in UnInvertedField
Yonik, Thank you for your reply. I just wanted to share my surprise. :) Koji -- http://www.rondhuit.com/en/
Re: Solr Relevancy Calculation
Here's a good place to start: http://search.lucidimagination.com/search/out?u=http://lucene.apache.org/java/2_4_0/scoring.html

But what do you mean by "this is going to search on five fields"? It sounds like you're using DisMax, in which case it throws out all but the top-scoring clause when it calculates the score for the document.
HTH, Erick
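For reference, the set of fields a DisMax search runs against is controlled by the qf parameter, either on the request or in the handler defaults in solrconfig.xml; a hedged example request (the five field names and boosts are placeholders):

    q=laptops&defType=dismax&qf=name^2 brand category features description

Adding debugQuery=true to the request shows the per-field score contributions for each document, which answers the "how was this score calculated" question concretely.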
Re: Boosting the score based on certain field
Would simple boosting work? As in category:electronics^2? If not, perhaps you can explain a bit more about what you're trying to accomplish...
Best, Erick
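If the queries go through DisMax, one hedged way to express per-category boosts is boost queries; the boost factors below are placeholders to be tuned:

    q=laptops&defType=dismax&qf=name description&bq=category:electronics^2.0&bq=category:games^1.2

Documents in boosted categories get their scores raised without the category terms having to appear in the user's query.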
Re: Multiple Keyword Search
I'm not sure this exactly fits your use case, but it may come close enough. Have you looked at DisMax and the mm parameter (minimum should match)?
Best, Erick
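A sketch of that kind of request; with DisMax, documents matching more of the query terms naturally score higher, and mm sets the floor on how many must match (the mm value here is an assumption):

    q=one two three&defType=dismax&qf=ad_title ad_description&mm=1

With mm=1, three-keyword matches tend to rank above two-keyword matches, which rank above single-keyword matches, roughly the layering described above.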
Re: solr stuck in writing to inexisting sockets
I'm going to nudge you in the direction of understanding why the queries take so long in the first place, rather than going toward the blunt approach of cutting them off after some time. The fact that you don't control the queries submitted doesn't prevent you from trying to understand what is taking so long. The first thing I'd look for is whether the system is memory starved. What JVM are you using, and what memory parameters are you giving it? What version of Solr are you using? Have you tried any performance monitoring to determine what is happening?

The reason I'm pushing in this direction is that 5-minute searches are pathological. Once you're up in that range, virtually any fix you come up with will simply mask the underlying problems, and you'll be forever chasing the next manifestation of the underlying problem. Besides, I don't know how you'd stop Solr processing a query midway through; I don't know of any way to make that happen.
Best, Erick
Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr
We are trying to solve some multilingual issues with our Solr analysis filter chain and would like to use the new Lucene 3.x filters that are Unicode compliant. Is it possible to use the Lucene ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr? Is it just a matter of writing the appropriate Solr filter factories? Are there any tricky gotchas in writing such a factory? If so, should I open one JIRA issue or two, so the filter factories can be contributed to the Solr code base? Tom
Re: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr
On Mon, Nov 1, 2010 at 12:24 PM, Burton-West, Tom tburt...@umich.edu wrote:
> Is it possible to use the Lucene ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr?

Right now, you can use the StandardTokenizerFactory (which is UAX#29 plus URL and IP address recognition) from Solr. Just make sure you set the version to 3.1 in your solrconfig.xml with branch_3x; otherwise it will use the old StandardTokenizer for backwards compatibility:

<!-- Controls what version of Lucene various components of Solr adhere to.
     Generally, you want to use the latest version to get all bug fixes and
     improvements. It is highly recommended that you fully re-index after
     changing this setting, as it can affect both how text is indexed and
     queried. -->
<luceneMatchVersion>LUCENE_31</luceneMatchVersion>

But if you want the pure UAX#29 tokenizer without this, there isn't a factory. Also, if you want customization/supplementary character support, there is no factory for ICUTokenizer at the moment.

> If so, should I open a JIRA issue or two JIRA issues so the filter factories can be contributed to the Solr code base?

Please open issues for a factory for the pure UAX#29 tokenizer, and for the ICU factories (maybe we can just put these into a contrib for now?).
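For completeness, a minimal schema.xml sketch that puts the version-gated StandardTokenizerFactory to use; the fieldType name and the lowercase filter are illustrative additions, not part of Robert's answer:

    <fieldType name="text_uax" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- with luceneMatchVersion LUCENE_31, this tokenizer follows UAX#29
             (plus URL/IP recognition) rather than the old StandardTokenizer -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>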
Re: Solr in virtual host as opposed to /lib
I think you guys are talking about two different kinds of 'virtual hosts'. Lance is talking about CPU virtualization. Eric appears to be talking about Apache virtual web hosts, although Eric hasn't told us how Apache is involved in his setup in the first place, so it's unclear. Assuming you are using Apache to reverse proxy to Solr, there is no reason I can think of that your front-end Apache setup would affect CPU utilization by Solr, let alone by Nutch.

Eric Martin wrote:
Oh. So I should take out the installations and move them to /some_dir as opposed to inside my virtual host of /home/my solr nutch is here/www

-----Original Message-----
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Sunday, October 31, 2010 7:26 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr in virtual host as opposed to /lib

With virtual hosting you can give CPU and memory quotas to your different VMs. This allows you to control the Nutch-vs.-the-world problem. Unfortunately, you cannot allocate disk channel; with two I/O-bound apps, this is a problem.

On Sun, Oct 31, 2010 at 4:38 PM, Eric Martin e...@makethembite.com wrote:
Excellent information, thank you. Solr is acting just fine then. I can connect to it with no issues, it indexes fine, and there didn't seem to be any complication with it. Now I can rule it out and go about solving what you pointed out, and I agree, it is a Java/Nutch issue. Nutch is a crawler I use to feed URLs into Solr for indexing. Nutch is open source and found on apache.org. Thanks for your time.

-----Original Message-----
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Sunday, October 31, 2010 4:33 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr in virtual host as opposed to /lib

What servlet container are you putting your Solr in? Jetty? Tomcat? Something else? Are you fronting it with Apache on top of that? (I think maybe you are; otherwise I'm not sure how the phrase 'virtual host' applies.)

In general, Solr of course doesn't care what directory it's in on disk, as long as the process running Solr has the necessary read/write permissions to the necessary directories (and if it doesn't, you'd usually find out right away with an error message). And clients of Solr don't care what directory it's in on disk either; they only care that they can get to it by connecting to a certain port at a certain hostname. In general, if they can't get to it on a certain port at a certain hostname, that's something you'd discover right away, not something that would be intermittent. But I'm not familiar with Nutch; you may want to try connecting to the port you have Solr running on (the hostname/port you have told Nutch to find Solr on?) yourself manually, and just make sure it is connectable.

I can't think of any reason that the directory you have Solr in could cause CPU utilization issues. I think it's got nothing to do with that. I am not familiar with Nutch; if it's Nutch that's taking 100% of your CPU, you might want to find some Nutch experts to ask. Perhaps there's a Nutch listserv? I am also not familiar with Hadoop; you mention in passing that you're using Hadoop too, so maybe that's an added complication, I don't know. One obvious reason Nutch could be taking 100% CPU would be simply that you've asked it to do a lot of work quickly, and it's trying to.

One reason I have seen Solr take 100% of CPU and become unresponsive is when the Solr process gets caught up in terrible Java garbage collection. If that's what's happening, then giving the Solr JVM a higher maximum heap size can sometimes help (although, confusingly, I've seen people suggest that if you give the Solr JVM too MUCH heap it can also result in long GC pauses), and if you have a multi-core/multi-CPU machine, I've found the JVM argument -XX:+UseConcMarkSweepGC to be very helpful. Other than that, it sounds to me like you've got a Nutch/Hadoop issue, not a Solr issue.

From: Eric Martin [e...@makethembite.com]
Sent: Sunday, October 31, 2010 7:16 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr in virtual host as opposed to /lib

Hi, thank you. This is more than idle curiosity. I am trying to debug an issue I am having with my installation, and this is one step in verifying that I have a setup that does not consume resources. I am trying to debunk my internal myth that having Solr and Nutch in a virtual host could be causing these issues. Here is the main layout that involves Nutch/Solr and Drupal:

/home/mootlaw/lib/solr
/home/mootlaw/lib/nutch
/home/mootlaw/www/ (Drupal site)

I'm running a 1333 FSB dual-socket Xeon 5500 series @ 2.4 GHz, Enterprise Linux x86_64 OS, 12 GB RAM. My Solr and Nutch are running. I am using Jetty for my Solr. My server is not rooted. Nutch is using 100% of my CPUs. I see this in my CPU utilization in my WHM:

/usr/bin/java -Xmx1000m -Dhadoop.log.dir=/home/mootlaw/lib/nutch/logs -Dhadoop.log.file=hadoop.log
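A hedged illustration of the JVM settings Jonathan mentions, using the Jetty example launcher; the heap sizes are placeholders to be tuned against the actual index and available RAM:

    java -Xms1g -Xmx2g -XX:+UseConcMarkSweepGC -jar start.jar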
Facet count of zero
I'm trying to exclude certain facet results from a facet query. It seems to work, but rather than being excluded from the facet list, the excluded value is returned with a count of zero. Ex:

    q=(-foo:bar)&facet=true&facet.field=foo&facet.sort=idx&wt=json&indent=true

This returns bar with a count of zero. All the other foo values show up with valid counts. Can I do this? Is my syntax incorrect?
Thanks - Tod
Problem with phrase matches in Solr
Hey guys, I have a Solr index where I store information about experts from various fields. The thing is, when I search for "channel marketing" I get people that have the word channel or the word marketing in their data. I only want people who have that entire phrase in their bio. (I copy the contents of bio to the default search field, which is text.) How can I make sure that exact phrase matching works while the search stays loose enough that partial matches work too (like uni matching university, etc.; this works now, but phrase matching does not)? I hope I was able to explain my problem properly. If not, please let me know. Thanks in advance, Moazzam
Re: Facet count of zero
On Mon, Nov 1, 2010 at 12:55 PM, Tod listac...@gmail.com wrote:
> I'm trying to exclude certain facet results from a facet query. It seems to work, but rather than being excluded from the facet list, the excluded value is returned with a count of zero.

If you don't want to see 0 counts, use facet.mincount=1
http://wiki.apache.org/solr/SimpleFacetParameters
-Yonik
http://www.lucidimagination.com
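Applied to the example above, the request becomes:

    q=(-foo:bar)&facet=true&facet.field=foo&facet.sort=idx&facet.mincount=1&wt=json&indent=true

The parameter can also be set per field, e.g. f.foo.facet.mincount=1, if other facet fields should keep their zero counts.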
RE: Solr in virtual host as opposed to /lib
I was speaking about Apache virtual hosts. I was concerned that there was an increase in processing time due to the Solr and Nutch instances being housed inside a virtual host, as opposed to being dropped in the root of my distro. Thank you for the astute clarification.
Re: Problem with phrase matches in Solr
Take a look at term proximity and phrase query. http://wiki.apache.org/solr/SolrRelevancyCookbook
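A hedged sketch of getting both behaviors at once with DisMax: qf keeps the loose per-term matching, while pf adds a large boost when the whole phrase occurs in a field (the field name and boost are assumptions):

    q=channel marketing&defType=dismax&qf=text&pf=text^10

If exact-phrase matches should be required rather than just boosted, quoting the input turns it into a phrase query: q="channel marketing".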
RE: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr
Thanks Robert, I'll use the workaround for now (using StandardTokenizerFactory and specifying version 3.1), but I suspect that I don't want the added URL/IP address recognition, due to my use case. I've also talked to a couple of people who recommended using the ICUTokenizer with some rule modifications, but I haven't had a chance to investigate that yet.

I opened two JIRA issues: https://issues.apache.org/jira/browse/SOLR-2210 and https://issues.apache.org/jira/browse/SOLR-2211. Sometime later this week I'll try writing the filter factories and upload patches. (Unless someone beats me to it :)

Tom
RE: How does DIH multithreading work?
Mark, I have the same question, so I did a little research on this. Not a complete answer, but here is what I've found:

- threads was added with SOLR-1352 (https://issues.apache.org/jira/browse/SOLR-1352).
- Also see http://www.lucidimagination.com/search/document/a9b26ade46466ee/queries_regarding_a_paralleldataimporthandler for background info.
- Only available in 3.x and trunk. Committed on 1/12/2010 by Noble Paul (who surely can tell you more accurate info than I can).
- It seems that, when used, each thread will call nextRow on your root entity's datasource in parallel.
- I'm not sure this will help with child entities (i.e., I had hoped I could get it to build child caches in parallel, but I don't think this is the case).
- A doc comment on ThreadedEntityProcessorWrapper indicates this will help speed up running transformers, because they'd run in parallel. This would make sense if your database can only pull rows back so fast but you then have an intensive transformer; adding threads might make your processing no slower than the db...

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: markwaddle [mailto:m...@markwaddle.com]
Sent: Tuesday, October 26, 2010 2:25 PM
To: solr-user@lucene.apache.org
Subject: How does DIH multithreading work?

I understand that the thread count is specified on root entities only. Does it spawn multiple threads per root entity? Or multiple threads per descendant entity? Can someone give an example of how you would make a database query in an entity with 4 threads that would select 1 row per thread?
Thanks, Mark
RE: indexing '-
Guys, the string type did the trick :) Thanks
Re: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr
On Mon, Nov 1, 2010 at 1:34 PM, Burton-West, Tom tburt...@umich.edu wrote:
> I've also talked to a couple people who recommended using the ICUTokenizer with some rule modifications, but haven't had a chance to investigate that yet.

Yes, as far as doing rule modifications goes, we can think about how to hook this in. At the end of the day, if we allow someone to specify the class name of their ICUTokenizerConfig (default: DefaultICUTokenizerConfig), that would at least allow this customization. Separately, I'd be interested in hearing about whatever rule modifications might be useful for different purposes.

> I opened two JIRA issues (https://issues.apache.org/jira/browse/SOLR-2210 and https://issues.apache.org/jira/browse/SOLR-2211). Sometime later this week I'll try writing the FilterFactories and upload patches. (Unless someone beats me to it :)

Thanks Tom. There are actually a lot of analysis factories (even in just ICU itself) not exposed to Solr, so it's a good deal of work. I know I have a few of them, but they aren't the best. I suggested on SOLR-2210 that we could make a contrib like 'extraAnalyzers' and put all the analyzers that have large dependencies/dictionaries (e.g. SmartChinese too) in there. So there's a lot to be done... including tests; any help is appreciated!
Testing/packaging question
Hi, I'm pretty much a Solr newbie, currently packaging solrpy for Debian; see http://svn.debian.org/viewsvn/python-modules/packages/python-solrpy/trunk/. In order to run solrpy's supplied tests at build time, I'd need Solr to know about the schema.xml that comes with the tests. Can anyone tell me how to do that properly? I'd basically need Solr to temporarily recognize that schema.xml without permanently installing it. Is there any way to do this, e.g. via environment variables? TIA, Bernhard Reiter
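One approach that may fit (an assumption on my part, not an answer from the thread): point a throwaway Solr home at the test configuration via the solr.solr.home system property when launching the example Jetty, e.g.:

    # hypothetical scratch layout: /tmp/solr-test/conf/{schema.xml,solrconfig.xml}
    java -Dsolr.solr.home=/tmp/solr-test -jar start.jar

Nothing is installed permanently; removing /tmp/solr-test after the test run leaves no trace.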
Re: Facet count of zero
On 11/1/2010 1:03 PM, Yonik Seeley wrote:
> If you don't want to see 0 counts, use facet.mincount=1
> http://wiki.apache.org/solr/SimpleFacetParameters

Excellent, I completely missed it - thanks!
Re: Solr in virtual host as opposed to /lib
: References: aanlktimvv5foc2b=gxo+xs1zwgps9o5t5jorwv3id...@mail.gmail.com
: aanlktim30aat8s0nxq_8utxcokv8myyabz8wtxeyl...@mail.gmail.com
: aanlktimpo9v_krgaxomd4hocqabibgzdhc+jhhgsq...@mail.gmail.com
: aanlktimdvaawj7=b7=pgu+rzm+nobvzdfh4o39nkp...@mail.gmail.com
: aanlktindzuwyjxwqqmtr5-rrp4gekvmj5vzzc_f0n...@mail.gmail.com
: In-Reply-To: aanlktindzuwyjxwqqmtr5-rrp4gekvmj5vzzc_f0n...@mail.gmail.com
: Subject: Solr in virtual host as opposed to /lib

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to an existing message; instead, start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to, and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult.

See Also: http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking

-Hoss
Re: Reverse range search
Hi, I think I have seen a comment on the list from someone with the same need a few months ago. He planned to make a new fieldType to support this, e.g. MinMaxRangeFieldType, which would be a polyField type holding both a min and a max value; then you could query it with q=myminmaxfield:123. I did not find it as a Jira issue, however, but I can see how it would be useful for a lot of use cases. Perhaps you can create a Jira issue for it and supply a patch? :)
-- Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 28. okt. 2010, at 23.24, kenf_nc wrote:
Doing a range search is straightforward: I have a fixed value in a document field, I search on [x TO y], and if the fixed value is in the requested range it gets a hit. But what if I have data in a document with a min value and a max value, my query is a fixed value, and I want a hit if the query value is in that range? For example:

Solr Doc1: min_price=100, max_price=500
Solr Doc2: min_price=300, max_price=500

and my query is price:250. I could create a query of (min_price:[* TO 250] AND max_price:[250 TO *]) and that should work; it would find only doc 1. However, if I have several fields like this and complex queries that include most of those fields, it becomes a very ugly query. Ideally I'd like to do something similar to what the spatial contrib guys do, where they make lat/long a single point. If I had a min/max field, I could call it Price(100,500) or Price(300,500), do a query of Price:250, and Solr would see whether 250 was in the appropriate range. Long question short: is there something out there already that does this? Does anyone else do something like this and have suggestions? Thanks, Ken
RE: Solr in virtual host as opposed to /lib
I don't think you read the entire thread. I'm assuming you made a mistake.
Re: Solr in virtual host as opposed to /lib
No, he didn't make a mistake, but you did. Next time, please start a new thread, not by conveniently replying to an existing thread and just changing the subject. Now we have two threads in one thread. :)

> I don't think you read the entire thread. I'm assuming you made a mistake.
RE: Solr in virtual host as opposed to /lib
: I don't think you read the entire thread. I'm assuming you made a mistake.

No mistake. When you sent your first message with the subject "Solr in virtual host as opposed to /lib", you did so in response to a completely unrelated thread ("Searching with wrong keyboard layout or using translit"). Please note the headers I quoted below documenting this, or consult any mailing list archive that displays full threads: http://markmail.org/thread/bjl23qcigp6w3kyl

-Hoss
Is my search fast?! Date search - I need some feedback :D
My index is 13M documents big, and I have not indexed all of my documents yet; the index in the production system should be about 30M documents. So with my 13M test index, I try a search over all documents with a first query: q=[2008-10-27 12:23:00:00 TO 2009-04-29 23:59:00:00]. Then I run the next query for statistics, grouped by currency_id, to get the amounts in these currencies. That's my result:

- EUR Sum: 437.259.518,28 €, found: 3712331
- CHF Sum: 2.048.147,62 SFr., found: 10473
- GBP Sum: 1.221,41 £, found: 181

To get this result Solr needs 9 seconds... I don't think that's really fast =( What do you think? For a faster search I want to try changing precisionStep=6 to -- to drop the milliseconds. What's the value for dropping the seconds as well? We only need HH:MM, not HH:MM:SS plus milliseconds. And I'll change the date search from q to fq... thx
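Two of the intended changes, sketched with an assumed field name of bookingdate (note that Solr date range queries expect ISO 8601 values like 2008-10-27T12:23:00Z): moving the range into fq lets the filter be cached and reused across the statistics queries, and a Trie date field's precisionStep trades index size against range-query speed (smaller steps index more terms and make range queries faster):

    <!-- schema.xml sketch -->
    <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" omitNorms="true"/>
    <field name="bookingdate" type="tdate" indexed="true" stored="true"/>

    q=*:*&fq=bookingdate:[2008-10-27T12:23:00Z TO 2009-04-29T23:59:00Z]

One caveat: as far as I know, precisionStep controls the trie encoding's bit granularity, not time units, so no precisionStep value rounds away seconds or milliseconds; truncating timestamps to the minute would have to happen at index time.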
Re: Use SolrCloud (SOLR-1873) on trunk, or with 1.4.1?
I took a swag at applying SOLR-1873 to branch_3x. It applied mostly cleanly; most of the remaining conflicts were in the ZooKeeper integration, and those applied cleanly by hand. There were also a few constants and such that needed to be pulled in from trunk. At the moment it passes all the tests. I have not actually used it yet, and probably won't for a few weeks, but if someone else wants to try it out: http://github.com/collectiveintellect/lucene-solr/tree/branch_3x-cloud

Have at it. Enjoy,
-jeremy

On Thu, Oct 28, 2010 at 11:21:12PM +0200, Jan Høydahl / Cominvent wrote:
Hi, I would aim for reindexing on branch_3x, which will be the 3.1 release soon. I don't know if SOLR-1873 applies cleanly to 3_x now, but it would surely be less effort to have it apply to 3_x than to 1.4. Perhaps you can help backport the patch to 3_x?
-- Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 28. okt. 2010, at 03.04, Jeremy Hinegardner wrote:
Hi all, I see that as of r1022188 Solr Cloud has been committed to trunk. I was wondering about the stability of Solr Cloud on trunk. We are planning to do a major reindexing soon (within 30 days), several billion docs, and would like to switch to a Solr Cloud based infrastructure. We are wondering whether we should use trunk as it is, now that SOLR-1873 is applied, or whether we should take SOLR-1873 and apply it to Solr 1.4.1. Has anyone used 1.4.1 + SOLR-1873? In production?
Thanks, -jeremy
-- Jeremy Hinegardner jer...@hinegardner.org
Re: How does DIH multithreading work?
It is useful for parsing PDFs on a multi-processor machine. Also, if a sub-entity does an outbound I/O call to a database, a file, or another Solr (SOLR-1499): anything where the pipeline time outweighs disk I/O time. Threading happens at the per-document level; there is no concurrent access inside a document pipeline.

There is a bug which causes EntityProcessors that look up attributes to throw an exception. This makes Tika unusable inside a thread. Two other EPs also won't work, but I did not test them. https://issues.apache.org/jira/browse/SOLR-2186
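A minimal data-config.xml sketch of enabling this; the JDBC details, entity name, and columns are hypothetical, and per Mark's reading the threads attribute belongs on the root entity only:

    <dataConfig>
      <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/mydb"/>
      <document>
        <!-- threads on the root entity (3.x/trunk only, added by SOLR-1352) -->
        <entity name="doc" threads="4" query="SELECT id, title FROM docs">
          <field column="id" name="id"/>
          <field column="title" name="title"/>
        </entity>
      </document>
    </dataConfig>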
Re: solr stuck in writing to inexisting sockets
Besides, I don't know how you'd stop Solr processing a query mid-way through, I don't know of any way to make that happen. The timeAllowed parameter causes a timeout in the Solr server that kills the searching thread. They use that now. But, yes, Erick is right: there is a fundamental problem you should solve. Since the threads are all stuck returning XML results, there is something wrong in reading back the results. It is possible that there is a bug in timeAllowed, where the kill-this-thread signal hits while the results are being returned and the handler for this does not work correctly at that point. It would be great if someone wrote a unit test for this (not me) and posted it. On Mon, Nov 1, 2010 at 8:44 AM, Erick Erickson erickerick...@gmail.com wrote: I'm going to nudge you in the direction of understanding why the queries take so long in the first place, rather than going toward the blunt approach of cutting them off after some time. The fact that you don't control the queries submitted doesn't prevent you from trying to understand what is taking so long. The first thing I'd look for is whether the system is memory starved. What JVM are you using and what memory parameters are you giving it? What version of Solr are you using? Have you tried any performance monitoring to determine what is happening? The reason I'm pushing in this direction is that 5-minute searches are pathological. Once you're up in that range, virtually any fix you come up with will simply mask the underlying problems, and you'll be forever chasing the next manifestation of the underlying problem. Besides, I don't know how you'd stop Solr processing a query mid-way through; I don't know of any way to make that happen. Best Erick On Mon, Nov 1, 2010 at 9:30 AM, Roxana Angheluta anghelu...@yahoo.com wrote: Hi, Yes, sometimes it takes 5 minutes for a query. I agree this is not desirable. However, if the application has no control over the input queries other than closing the socket after a while, solr should not continue writing the response, but terminate the thread. In general, is there a way to quantify the complexity of a given query on a certain index? Some general guidelines which can be used by non-technical people? Thanks a lot, roxana --- On Sun, 10/31/10, Erick Erickson erickerick...@gmail.com wrote: From: Erick Erickson erickerick...@gmail.com Subject: Re: solr stuck in writing to inexisting sockets To: solr-user@lucene.apache.org Date: Sunday, October 31, 2010, 2:29 AM Are you saying that your Solr server is at times taking 5 minutes to complete a query? If so, I'd get to the bottom of that first off. My first guess would be that you're hitting memory issues and swapping horribly or... well, that would be my first guess. Best Erick On Thu, Oct 28, 2010 at 5:23 AM, Roxana Angheluta anghelu...@yahoo.com wrote: Hi all, We are using Solr over Jetty with a large index, sharded and distributed over multiple machines. Our queries are quite long, involving boolean and proximity operators. We cut the connection at the client side after 5 minutes. Also, we are using the timeAllowed parameter to stop executing on the server after a while. We quite often run into situations where solr blocks.
The load on the server increases and a thread dump on the solr process shows many threads like below:

btpool0-49 prio=10 tid=0x7f73afe1d000 nid=0x3581 runnable [0x451a]
   java.lang.Thread.State: RUNNABLE
        at java.io.PrintWriter.write(PrintWriter.java:362)
        at org.apache.solr.common.util.XML.escape(XML.java:206)
        at org.apache.solr.common.util.XML.escapeCharData(XML.java:79)
        at org.apache.solr.request.XMLWriter.writePrim(XMLWriter.java:832)
        at org.apache.solr.request.XMLWriter.writeStr(XMLWriter.java:684)
        at org.apache.solr.request.XMLWriter.writeVal(XMLWriter.java:564)
        at org.apache.solr.request.XMLWriter.writeDoc(XMLWriter.java:435)
        at org.apache.solr.request.XMLWriter$2.writeDocs(XMLWriter.java:514)
        at org.apache.solr.request.XMLWriter.writeDocuments(XMLWriter.java:485)
        at org.apache.solr.request.XMLWriter.writeSolrDocumentList(XMLWriter.java:494)
        at org.apache.solr.request.XMLWriter.writeVal(XMLWriter.java:588)
        at org.apache.solr.request.XMLWriter.writeResponse(XMLWriter.java:130)
        at org.apache.solr.request.XMLResponseWriter.write(XMLResponseWriter.java:34)
        at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:325)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
        at
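For reference, timeAllowed as discussed in this thread is just a per-request parameter in milliseconds, along the lines of (host and query invented):

http://localhost:8983/solr/select?q=some+long+boolean+query&timeAllowed=300000

My understanding is that when the limit trips, Solr stops collecting hits and flags partialResults=true in the response header rather than aborting the response write, which would be consistent with the symptom above of threads still busy serializing XML.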
Re: Design and Usage Questions
Yes, you can write your own app to read the file with SVNKit and post it to the ExtractingRequestHandler. This would be easiest. On Mon, Nov 1, 2010 at 5:49 AM, getagrip getag...@web.de wrote: Ok, so if I did NOT use SolrJ I could PUSH a stream to Solr somehow? I do not depend on SolrJ; any connection method would suffice. On 11/01/2010 03:23 AM, Lance Norskog wrote: 2. The SolrJ library's handling of content streams is pull, not push. That is, you give it a reader and it pulls content when it feels like it. If the software feeding the connection wants to write the data, you have to either buffer the whole thing or do a dual-thread writer/reader pair. The easiest way to pull stuff from SVN is to use one of the web server apps. Solr takes a stream.url parameter. (Also stream.file.) Note that there is no outbound authentication supported; your web server has to be open (at least to the Solr instance). On Sun, Oct 31, 2010 at 4:06 PM, getagrip getag...@web.de wrote: Hi, I've got some basic usage / design questions. 1. The SolrJ wiki proposes to use the same CommonsHttpSolrServer instance for all requests to avoid connection leaks. So if I create a singleton instance upon application startup, can I securely use this instance for ALL queries/updates throughout my application without running into performance issues? 2. My system's documents are stored in a Subversion repository. For fast search results I want to periodically index new documents from the repository. What I get from the repository is a ByteArrayOutputStream. How can I pass this stream to Solr? I only see possibilities to pass files, but in my case it does not make sense to write the ByteArrayOutputStream to disk again, as this would cause performance issues apart from making no sense anyway. 3. Are there any disadvantages to using SolrJ over some other HTTP-based solution, e.g. creating and sending my own HTTP requests? Do I even have to use HTTP? I see the EmbeddedSolrServer exists. Any drawbacks using that? Any hints are welcome, Thanks! -- Lance Norskog goks...@gmail.com
Re: Design and Usage Questions
If you just want a quick way to query a Solr server, the Perl module WebService::Solr is pretty good. On Mon, Nov 1, 2010 at 4:56 PM, Lance Norskog goks...@gmail.com wrote: Yes, you can write your own app to read the file with SVNKit and post it to the ExtractingRequestHandler. This would be easiest.
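Since the push question comes up a lot: here is a rough SolrJ sketch of posting an in-memory byte array (e.g. what SVNKit hands back) to the ExtractingRequestHandler without writing a temp file. The URL, id, and content type are placeholders, and this assumes the 1.4-era SolrJ API:

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.ContentStreamBase;

public class SvnPush {
    // Post one in-memory document to /update/extract.
    public static void push(SolrServer server, final byte[] bytes, String id)
            throws Exception {
        ContentStreamUpdateRequest req =
                new ContentStreamUpdateRequest("/update/extract");
        ContentStreamBase stream = new ContentStreamBase() {
            @Override
            public InputStream getStream() {
                // Wrap the bytes directly; no round-trip to disk.
                return new ByteArrayInputStream(bytes);
            }
        };
        stream.setName(id);
        stream.setContentType("application/octet-stream");
        req.addContentStream(stream);
        req.setParam("literal.id", id); // supply our own unique key
        server.request(req);
        server.commit();
    }

    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        push(server, "hello from svn".getBytes("UTF-8"), "doc-1");
    }
}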
Re: is my search fast ?! date search i need some feedback :D
Careful here. First searches are known to be slow; various caches are filled up the first time they are used, etc. So even though you're measuring the second query, it's still perhaps filling caches. And what are you measuring? The raw search time or the entire response time? These can be quite different. Try running with debugQuery=on, and one of the things you'll get back is the search time (not including assembling the response). You're right, though, 9 seconds is far too long. If you have a relatively small number of currency_ids, think about the enum method (see: http://wiki.apache.org/solr/SimpleFacetParameters#facet.method). Also, think about autowarming and firstSearcher queries to prepare your Solr instance for faster responses. If none of that helps, please post the relevant parts of your schema.xml and the results of running your query with debugQuery=on; that'll give us a lot more info to go on. Best Erick On Mon, Nov 1, 2010 at 5:37 AM, stockiii stock.jo...@gmail.com wrote: my index is 13M documents big and i have not indexed all of my documents. the index in the production system should be about 30M documents big. so with my test 13M index i try a search over all documents, with a first query: q:[2008-10-27 12:23:00:00 TO 2009-04-29 23:59:00:00] then i run the next query, for statistics, grouped by currency_id, and get the amounts of these currencies. thats my result:
- EUR Sum: 437.259.518,28 € Found: 3712331
- CHF Sum: 2.048.147,62 SFr. Found: 10473
- GBP Sum: 1.221,41 £ Found: 181
for getting the result solr needs 9 seconds ... i dont think thats really fast =( what do you think? for faster search i want to try changing precisionStep=6 to -- for dropping the milliseconds. whats the value for dropping the seconds too? we only need HH:MM and not HH:MM:SS:MSMS. and i'll change the date search from q to fq ... thx -- View this message in context: http://lucene.472066.n3.nabble.com/is-my-search-fast-date-search-i-need-some-feedback-D-tp1820821p1820821.html Sent from the Solr - User mailing list archive at Nabble.com.
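Putting Erick's suggestions together, the follow-up request might look like the following (the host and URL shape are illustrative; the dates are the thread's, rewritten into Solr's ISO-8601 index format, which the field would have to use):

http://localhost:8983/solr/select?q=*:*
    &fq=date:[2008-10-27T12:23:00Z TO 2009-04-29T23:59:00Z]
    &facet=true&facet.field=currency_id&facet.method=enum
    &debugQuery=on

Moving the date range into an fq lets the filter cache do the work on repeat queries, and facet.method=enum suits a field with only a handful of distinct currency_id values.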
Which is faster -- delete or update?
My documents have a down_vote field. Every time a user votes down a document, I increment the down_vote field in my database and also re-index the document in Solr to reflect the new down_vote value. During searches, I want to restrict the results to only documents with, say, fewer than 3 down_votes. There are 2 ways to implement that:
1) When a user down-votes a document, check to see if the total down votes have reached 3. If they have, delete the document from the Solr index.
2) When a user down-votes a document, update the document in the Solr index to reflect the new down_vote value, even if the total down votes might be more than 3. During the query, add an fq to restrict results to documents with fewer than 3 down votes.
Which approach is better? Is it faster to delete a document from the index or to update the document to reflect the new down_vote value? Thanks. Andy
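For option 2, the filter itself is a one-liner, assuming down_vote is indexed as an integer field:

fq=down_vote:[* TO 2]

i.e. at most 2 down votes, which is "fewer than 3".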
Re: Which is faster -- delete or update?
From the user perspective I wouldn't delete it, because the down-voting could be a mistake, or spam, or something similar, and up-voting could resurrect the document. It could also be wise to keep the docs, to see which content (from which users?) gets down-voted, to catch spam accounts. From the dev perspective you should benchmark it, if really necessary. (I guess updating is more expensive, because I think it is a delete plus a completely new add.) Regards, Peter.
Re: Which is faster -- delete or update?
Just deleting a document is faster, because all that really happens is that the document is marked as deleted. An update is really a delete followed by an add of the same document, so by definition an update will be slower... But... does it really make a difference? How often do you expect this to happen? Peter Karich added a note while I was typing this, and he makes some cogent points. I'm starting to think that I don't care about better unless and until my users notice (or I have a reasonable expectation that they #will# notice). I'm far more interested in simpler code that I can maintain than I am in shaving another 4 milliseconds off the response time. That gives me more chance to put in cool new features that the user will notice... Best Erick On Mon, Nov 1, 2010 at 5:04 PM, Andy angelf...@yahoo.com wrote: My documents have a down_vote field. Every time a user votes down a document, I increment the down_vote field in my database and also re-index the document in Solr to reflect the new down_vote value.
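In SolrJ terms, Andy's two options come down to something like the following sketch (server setup and the down_vote field are assumed from the thread; note there is no partial update in this era of Solr, so option 2 must re-send the whole document):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DownVote {
    static final int LIMIT = 3;

    // Option 1: drop the document once it hits the limit.
    static void deleteWhenOver(SolrServer server, String id, int votes)
            throws Exception {
        if (votes >= LIMIT) {
            server.deleteById(id);
        }
    }

    // Option 2: always re-add (internally a delete + add) and filter at query time.
    static void reindex(SolrServer server, String id, int votes)
            throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("down_vote", votes);
        // ...plus every other field of the document must be re-sent here.
        server.add(doc);
    }
}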
Re: Which is faster -- delete or update?
The actual time it takes to delete or update the document is unlikely to make a difference to you. What might make a difference is the time it takes to actually finalize the commit, the time it takes to re-warm your searcher after a commit, and especially the time it takes to run any warming queries you have set in newSearcher. Most of these probably won't differ between delete and update, but could be a problem either way; one way to find out: try it and measure it. Whether you do a delete or an update, if you're planning on making changes to your index more often than, oh, every 10 or 20 minutes, you may run into trouble. Solr isn't so good at frequent changes to the index like that. I haven't looked at it myself, but the Solr patches that get called "near real-time" seem to be intended to deal with this, among other things, and to allow frequent commits without killing performance or RAM usage. I am not sure how/if other people are effectively dealing with user-generated content that needs to be included in the index for filtering and searching against. I would be very curious if anyone has any successful strategies to share. Another example would be user-generated tagging. Erick Erickson wrote: Just deleting a document is faster because all that really happens is the document is marked as deleted. An update is really a delete followed by an add of the same document, so by definition an update will be slower...
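For readers who haven't seen it, the newSearcher hook mentioned above lives in solrconfig.xml and looks roughly like this (the queries themselves are placeholders; every commit replays them before the new searcher goes live, which is exactly where frequent commits get expensive):

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">popular query here</str>
      <str name="fq">down_vote:[* TO 2]</str>
    </lst>
  </arr>
</listener>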
Field boosting in DataImportHandler transformer
It's not looking very promising, but is there something I'm missing to be able to apply a field boost from within a transformer in the DataImportHandler? Not a boost defined within the schema, but a boost applied to the field from the transformer itself. I know you can do a document boost, but I can't see anything for a field boost. ~bck
Possible memory leaks with frequent replication
We've been trying to get a setup in which a slave replicates from a master every few seconds (ideally every second, but currently we have it set at every 5s). Everything seems to work fine until, periodically, the slave just stops responding from what looks like running out of memory: org.apache.catalina.core.StandardWrapperValve invoke SEVERE: Servlet.service() for servlet jsp threw exception java.lang.OutOfMemoryError: Java heap space (our monitoring seems to confirm this). Looking around, my suspicion is that new readers take longer to warm than the gap between replications, and thus they just build up until all memory is consumed (which, I suppose, isn't really memory 'leaking' per se, more just resource consumption). That said, we've tried turning off caching on the slave and that didn't help, so it's possible I'm wrong. Is there anything we can do about this? I'm reluctant to increase the heap space, since I suspect that will just mean a longer period between failures. Might Zoie help here? Or should we just query against the master? Thanks, Simon
Re: Re:Re: problem of solr replcation's speed
This is the time to replicate and open the new index, right? Opening a new index can take a lot of time. How many autowarmed entries and queries are there in the caches? Opening a new index re-runs all of the queries in all of the caches. 2010/11/1 kafka0102 kafka0...@163.com: I suspected my app had some sleeping op every 1s, so I changed ReplicationHandler.PACKET_SZ to 1024 * 1024 * 10 (10MB), and the log result is like this:
[2010-11-01 17:49:29][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 3184
[2010-11-01 17:49:32][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 3426
[2010-11-01 17:49:36][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 3359
[2010-11-01 17:49:39][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 3166
[2010-11-01 17:49:42][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 3513
[2010-11-01 17:49:46][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 3140
[2010-11-01 17:49:50][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 3471
That means it's still as slow as before. What's wrong with my env? -- Lance Norskog goks...@gmail.com
Re: Possible memory leaks with frequent replication
You should query against the indexer. I'm impressed that you got 5s replication to work reliably. On Mon, Nov 1, 2010 at 4:27 PM, Simon Wistow si...@thegestalt.org wrote: We've been trying to get a setup in which a slave replicates from a master every few seconds (ideally every second, but currently we have it set at every 5s). -- Lance Norskog goks...@gmail.com
Phrase Query Problem?
I have a number of fields I need to do an exact match on. I've defined them as 'string' in my schema.xml. I've noticed that I get back query results that don't have all of the words I'm using to search with. For example: q=(((mykeywords:Compliance+With+Conduct+Standards)OR(mykeywords:All)OR(mykeywords:ALL)))&start=0&indent=true&wt=json This should, with an exact match, return only one entry, but it returns five, some of which don't have any of the fields I've specified. I've tried this both with and without quotes. What could I be doing wrong? Thanks - Tod
Re: Phrase Query Problem?
On Mon, Nov 1, 2010 at 10:26 PM, Tod listac...@gmail.com wrote: I have a number of fields I need to do an exact match on. I've defined them as 'string' in my schema.xml. Tod, Without knowing your exact field definition, my first guess would be your first boolean query: because it is not quoted, what Solr typically does is transform that type of query into something like (assuming your default search field is id): (mykeywords:Compliance id:With id:Conduct id:Standards). If you do (mykeywords:"Compliance+With+Conduct+Standards") you might see different (better?) results. Otherwise, append debugQuery=on to your URL and you can see exactly how Solr is parsing your query. If none of that helps, what is your field definition in your schema.xml? - Ken
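Spelled out as a full request, the quoted version Ken suggests would look something like this (URL-encoded; host and handler path are illustrative):

http://localhost:8983/solr/select?q=mykeywords:%22Compliance+With+Conduct+Standards%22+OR+mykeywords:All+OR+mykeywords:ALL&start=0&indent=true&wt=json&debugQuery=on

The %22 quotes keep the four words together as a single term against the string field instead of scattering the bare words onto the default field.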
RE: Ensuring stable timestamp ordering
How about a timestamp with a GUID appended on the end of it? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others' mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. --- On Sun, 10/31/10, Toke Eskildsen t...@statsbiblioteket.dk wrote: From: Toke Eskildsen t...@statsbiblioteket.dk Subject: RE: Ensuring stable timestamp ordering To: solr-user@lucene.apache.org Date: Sunday, October 31, 2010, 12:18 PM Dennis Gearon [gear...@sbcglobal.net] wrote: Even microseconds may not be enough on some really good, fast machine. True, especially since the timer might not provide microsecond granularity although the returned value is in microseconds. However, a unique timestamp generator should keep track of the previous timestamp to guard against duplicates. Uniqueness can thus be guaranteed by waiting a bit or by cheating on the decimals. With microseconds one can produce 1 million timestamps / second. While I agree that duplicates within microseconds can occur on a fast machine, guaranteeing uniqueness by waiting should only be a performance problem when the number of duplicates is high. That's still a few years off, I think. As Michael pointed out, using normal timestamps as unique IDs might not be such a great idea, as it effectively locks index-building to a single JVM. By going the ugly route and expressing the time in nanos with only microsecond granularity, using the last 3 decimals for a builder ID, this could be fixed. Not very clean though, as the contract is not expressed in the data themselves but must nevertheless be obeyed by all builders to avoid collisions. It also raises the question of who should assign the builder IDs. Not trivial in an anarchistic setup where new builders can be added by different controllers. Pragmatists might use PID % 1000 or similar for the builder ID, as it does not require coordination, but this is where the Birthday Paradox hits us again: the chance of two processes on different machines having the same PID is 10% if just 15 machines are used (1% for 5 machines, 50% for 37 machines). I don't like those odds, and that's assuming that the PIDs will be randomly distributed, which they won't be. The collision chance could be lowered by reserving more decimals for the salt, but then we would decrease the maximum number of timestamps / second, still without guaranteed uniqueness. Guys a lot smarter than me have spent time on the unique ID problem, and it's clearly not easy: Java's UUID takes up 128 bits. - Toke
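Toke's scheme is easy to sketch in Java. This is a minimal per-JVM version, assuming millisecond clock granularity padded out to microseconds; the builder-ID salt for multiple JVMs would still have to be folded into the low decimals as he describes:

public final class UniqueTimestamp {
    private static long last = 0;

    // Returns strictly increasing "microsecond" timestamps within this JVM.
    public static synchronized long next() {
        long now = System.currentTimeMillis() * 1000L; // millis padded to micros
        if (now <= last) {
            now = last + 1; // cheat on the decimals instead of waiting
        }
        last = now;
        return now;
    }
}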
Default file locking on trunk
Scenario: Git update to current trunk (Nov 1, 2010). Build all. Run Solr in trunk/solr/example with 'java -jar start.jar'. Hit ^C; Jetty reports running its shutdown hook. There is now a data/index with a write lock file in it. I have not attempted to read the index, let alone add something to it. I start Solr again, and it cannot open the index because of the write lock. Why is there a write lock file when I have not tried to index anything? -- Lance Norskog goks...@gmail.com