solr-cloud performance decrease day by day
Hello, I am using Solr 4.1.0 and I have used SolrCloud in my product. I have found that at first everything seems good: the search time is fast and latency is low, but it becomes very slow after a few days. Does anyone know if there are some parameters or optimizations for using SolrCloud?
Re: solr-cloud performance decrease day by day
Could you give more info about your index size and the technical details of your machine? Maybe you are indexing more data day by day and your RAM capacity is not enough anymore? 2013/4/19 qibaoyuan qibaoy...@gmail.com Hello, I am using Solr 4.1.0 and I have used SolrCloud in my product. I have found that at first everything seems good: the search time is fast and latency is low, but it becomes very slow after a few days. Does anyone know if there are some parameters or optimizations for using SolrCloud?
Re: solr-cloud performance decrease day by day
There are 6 shards and they are all on one machine. The JVM heap is very large, the physical memory is 16GB, the total number of docs is about 150k, and the index size of each shard is about 1GB. There is indexing while searching; I use auto commit every 10 minutes, and the data arrives at about 100 docs per minute. On 2013-4-19, at 3:17 PM, Furkan KAMACI furkankam...@gmail.com wrote: Could you give more info about your index size and the technical details of your machine? Maybe you are indexing more data day by day and your RAM capacity is not enough anymore? 2013/4/19 qibaoyuan qibaoy...@gmail.com Hello, I am using Solr 4.1.0 and I have used SolrCloud in my product. I have found that at first everything seems good: the search time is fast and latency is low, but it becomes very slow after a few days. Does anyone know if there are some parameters or optimizations for using SolrCloud?
Re: SolrCloud loadbalancing, replication, and failover
Well, to consume 120GB of RAM with a 120GB index, you would have to query over every single GB of data. If you only actually query over, say, 500MB of the 120GB data in your dev environment, you would only use 500MB worth of RAM for caching. Not 120GB On Fri, Apr 19, 2013 at 7:55 AM, David Parks davidpark...@yahoo.com wrote: Wow! That was the most pointed, concise discussion of hardware requirements I've seen to date, and it's fabulously helpful, thank you Shawn! We currently have 2 servers that I can dedicate about 12GB of ram to Solr on (we're moving to these 2 servers now). I can upgrade further if it's needed justified, and your discussion helps me justify that such an upgrade is the right thing to do. So... If I move to 3 servers with 50GB of RAM each, using 3 shards, I should be in the free and clear then right? This seems reasonable and doable. In this more extreme example the failover properties of solr cloud become more clear. I couldn't possibly run a replica shard without doubling the memory, so really replication isn't reasonable until I have double the hardware, then the load balancing scheme makes perfect sense. With 3 servers, 50GB of RAM and 120GB index I should just backup the index directory I think. My previous though to run replication just for failover would have actually resulted in LOWER performance because I would have halved the memory available to the master replica. So the previous question is answered as well now. Question: if I had 1 server with 60GB of memory and 120GB index, would solr make full use of the 60GB of memory? Thus trimming disk access in half. Or is it an all-or-nothing thing? In a dev environment, I didn't notice SOLR consuming the full 5GB of RAM assigned to it with a 120GB index. Dave -Original Message- From: Shawn Heisey [mailto:s...@elyograg.org] Sent: Friday, April 19, 2013 11:51 AM To: solr-user@lucene.apache.org Subject: Re: SolrCloud loadbalancing, replication, and failover On 4/18/2013 8:12 PM, David Parks wrote: I think I still don't understand something here. My concern right now is that query times are very slow for 120GB index (14s on avg), I've seen a lot of disk activity when running queries. I'm hoping that distributing that query across 2 servers is going to improve the query time, specifically I'm hoping that we can distribute that disk activity because we don't have great disks on there (yet). So, with disk IO being a factor in mind, running the query on one box, vs. across 2 *should* be a concern right? Admittedly, this is the first step in what will probably be many to try to work our query times down from 14s to what I want to be around 1s. I went through my mailing list archive to see what all you've said about your setup. One thing that I can't seem to find is a mention of how much total RAM is in each of your servers. I apologize if it was actually there and I overlooked it. In one email thread, you wanted to know whether Solr is CPU-bound or IO-bound. Solr is heavily reliant on the index on disk, and disk I/O is the slowest piece of the puzzle. The way to get good performance out of Solr is to have enough memory that you can take the disk mostly out of the equation by having the operating system cache the index in RAM. If you don't have enough RAM for that, then Solr becomes IO-bound, and your CPUs will be busy in iowait, unable to do much real work. If you DO have enough RAM to cache all (or most) of your index, then Solr will be CPU-bound. 
With 120GB of total index data on each server, you would want at least 128GB of RAM per server, assuming you are only giving 8-16GB of RAM to Solr, and that Solr is the only thing running on the machine. If you have more servers and shards, you can reduce the per-server memory requirement because the amount of index data on each server would go down. I am aware of the cost associated with this kind of requirement - each of my Solr servers has 64GB. If you are sharing the server with another program, then you want to have enough RAM available for Solr's heap, Solr's data, the other program's heap, and the other program's data. Some programs (like MySQL) completely skip the OS disk cache and instead do that caching themselves with heap memory that's actually allocated to the program. If you're using a program like that, then you wouldn't need to count its data. Using SSDs for storage can speed things up dramatically and may reduce the total memory requirement to some degree, but even an SSD is slower than RAM. The transfer speed of RAM is faster, and from what I understand, the latency is at least an order of magnitude quicker - nanoseconds vs microseconds. In another thread, you asked about how Google gets such good response times. Although Google's software probably works differently than Solr/Lucene, when it comes right down to it, all search engines do similar
RE: SolrCloud loadbalancing, replication, and failover
Interesting. I'm trying to correlate this new understanding to what I see on my servers. I've got one server with 5GB dedicated to solr, solr dashboard reports a 167GB index actually. When I do many typical queries I see between 3MB and 9MB of disk reads (watching iostat). But solr's dashboard only shows 710MB of memory in use (this box has had many hundreds of queries put through it, and has been up for 1 week). That doesn't quite correlate with my understanding that Solr would cache the index as much as possible. Should I be thinking that things aren't configured correctly here? Dave -Original Message- From: John Nielsen [mailto:j...@mcb.dk] Sent: Friday, April 19, 2013 2:35 PM To: solr-user@lucene.apache.org Subject: Re: SolrCloud loadbalancing, replication, and failover Well, to consume 120GB of RAM with a 120GB index, you would have to query over every single GB of data. If you only actually query over, say, 500MB of the 120GB data in your dev environment, you would only use 500MB worth of RAM for caching. Not 120GB On Fri, Apr 19, 2013 at 7:55 AM, David Parks davidpark...@yahoo.com wrote: Wow! That was the most pointed, concise discussion of hardware requirements I've seen to date, and it's fabulously helpful, thank you Shawn! We currently have 2 servers that I can dedicate about 12GB of ram to Solr on (we're moving to these 2 servers now). I can upgrade further if it's needed justified, and your discussion helps me justify that such an upgrade is the right thing to do. So... If I move to 3 servers with 50GB of RAM each, using 3 shards, I should be in the free and clear then right? This seems reasonable and doable. In this more extreme example the failover properties of solr cloud become more clear. I couldn't possibly run a replica shard without doubling the memory, so really replication isn't reasonable until I have double the hardware, then the load balancing scheme makes perfect sense. With 3 servers, 50GB of RAM and 120GB index I should just backup the index directory I think. My previous though to run replication just for failover would have actually resulted in LOWER performance because I would have halved the memory available to the master replica. So the previous question is answered as well now. Question: if I had 1 server with 60GB of memory and 120GB index, would solr make full use of the 60GB of memory? Thus trimming disk access in half. Or is it an all-or-nothing thing? In a dev environment, I didn't notice SOLR consuming the full 5GB of RAM assigned to it with a 120GB index. Dave -Original Message- From: Shawn Heisey [mailto:s...@elyograg.org] Sent: Friday, April 19, 2013 11:51 AM To: solr-user@lucene.apache.org Subject: Re: SolrCloud loadbalancing, replication, and failover On 4/18/2013 8:12 PM, David Parks wrote: I think I still don't understand something here. My concern right now is that query times are very slow for 120GB index (14s on avg), I've seen a lot of disk activity when running queries. I'm hoping that distributing that query across 2 servers is going to improve the query time, specifically I'm hoping that we can distribute that disk activity because we don't have great disks on there (yet). So, with disk IO being a factor in mind, running the query on one box, vs. across 2 *should* be a concern right? Admittedly, this is the first step in what will probably be many to try to work our query times down from 14s to what I want to be around 1s. I went through my mailing list archive to see what all you've said about your setup. 
One thing that I can't seem to find is a mention of how much total RAM is in each of your servers. I apologize if it was actually there and I overlooked it. In one email thread, you wanted to know whether Solr is CPU-bound or IO-bound. Solr is heavily reliant on the index on disk, and disk I/O is the slowest piece of the puzzle. The way to get good performance out of Solr is to have enough memory that you can take the disk mostly out of the equation by having the operating system cache the index in RAM. If you don't have enough RAM for that, then Solr becomes IO-bound, and your CPUs will be busy in iowait, unable to do much real work. If you DO have enough RAM to cache all (or most) of your index, then Solr will be CPU-bound. With 120GB of total index data on each server, you would want at least 128GB of RAM per server, assuming you are only giving 8-16GB of RAM to Solr, and that Solr is the only thing running on the machine. If you have more servers and shards, you can reduce the per-server memory requirement because the amount of index data on each server would go down. I am aware of the cost associated with this kind of requirement - each of my Solr servers has 64GB. If you are sharing the server with another program, then you want to have enough RAM available for Solr's
Re: solr-cloud performance decrease day by day
Can happen for various reasons. Can you recreate the situation, meaning that after restarting the servlet or server it starts with a good qTime and degrades from that point? How fast does this happen? Start by monitoring the JVM process, with Oracle VisualVM for example. Monitor for frequent garbage collections, unreasonable memory peaks, or growing numbers of threads. Then monitor your system to see whether disk I/O latency or disk usage increases over time, the disk write queue explodes, CPU load becomes heavier, or network usage exceeds its limit. If you can recreate the decrease and monitor well, one of the above params should pop up. Fixing it after defining the problem will be easier. Good day, Manu On Apr 19, 2013 10:26 AM, qibaoyuan qibaoy...@gmail.com wrote:
Re: solr-cloud performance decrease day by day
Thanks Manu, I will check it. On 2013-4-19, at 4:26 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Can happen for various reasons. Can you recreate the situation, meaning that after restarting the servlet or server it starts with a good qTime and degrades from that point? How fast does this happen? Start by monitoring the JVM process, with Oracle VisualVM for example. Monitor for frequent garbage collections, unreasonable memory peaks, or growing numbers of threads. Then monitor your system to see whether disk I/O latency or disk usage increases over time, the disk write queue explodes, CPU load becomes heavier, or network usage exceeds its limit. If you can recreate the decrease and monitor well, one of the above params should pop up. Fixing it after defining the problem will be easier. Good day, Manu On Apr 19, 2013 10:26 AM, qibaoyuan qibaoy...@gmail.com wrote:
Re: shard query return 500 on large data set
Can you instead use a paging mechanism? On Thu, Apr 18, 2013 at 8:03 PM, Jie Sun jsun5...@yahoo.com wrote: Hi - when I execute a shard query like: [myhost]:8080/solr/mycore/select?q=type:message&rows=14...&qt=standard&wt=standard&explainOther=&hl.fl=&shards=solrserver1:8080/solr/mycore,solrserver2:8080/solr/mycore,solrserver3:8080/solr/mycore everything works fine until I query against a large set of data (over 100k documents), when the number of rows returned exceeds about 50k. By the way, I am using the HttpClient GET method to send the Solr shard query over. In the above scenario, the query fails with a 500 server error as the returned status code. I am using Solr 3.5. I encountered a 404 before, when one of the shard servers does not have the core (404), the whole shard query returns 404 to me; so I expect that if one of the servers encounters a timeout (408?), the shard query should return a timeout status code? I guess I am not sure what the shard query results will be in the various error scenarios... I guess I could look into the Solr code, but if you have any input, it will be appreciated. Thanks Renee -- View this message in context: http://lucene.472066.n3.nabble.com/shard-query-return-500-on-large-data-set-tp4057038.html Sent from the Solr - User mailing list archive at Nabble.com.
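For reference, a deep-paging request with the standard parameters might look something like the following (the page size and other values are illustrative); each request advances start by the page size instead of asking for tens of thousands of rows at once:

[myhost]:8080/solr/mycore/select?q=type:message&start=0&rows=1000&shards=solrserver1:8080/solr/mycore,solrserver2:8080/solr/mycore,solrserver3:8080/solr/mycore

The next pages repeat the same query with start=1000, start=2000, and so on. Note that very deep pages are still expensive on sharded queries, since each shard has to return start+rows documents to the coordinating node.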
Re: SolrCloud loadbalancing, replication, and failover
On 4/19/2013 1:34 AM, John Nielsen wrote: Well, to consume 120GB of RAM with a 120GB index, you would have to query over every single GB of data. If you only actually query over, say, 500MB of the 120GB data in your dev environment, you would only use 500MB worth of RAM for caching. Not 120GB What you are saying is essentially true, although I would not be surprised to learn that even a single query would read a few gigabytes from a 120GB index, assuming that you start after a server reboot. The next query would re-use a lot of the data cached by the first query and return much faster. On Fri, Apr 19, 2013 at 7:55 AM, David Parks davidpark...@yahoo.com wrote: Question: if I had 1 server with 60GB of memory and 120GB index, would solr make full use of the 60GB of memory? Thus trimming disk access in half. Or is it an all-or-nothing thing? In a dev environment, I didn't notice SOLR consuming the full 5GB of RAM assigned to it with a 120GB index. Solr would likely cause the OS to use most or all of that memory. It's not an all or nothing thing. The first few queries will load a big chunk, then each additional query will load a little more. 60GB of RAM will be significantly better than 12GB. With only 12GB, it is extremely likely that a given query will read a section of the index that will push the data required for the next query out of the disk cache, so it will have to re-read it from the disk on the next query, and so on in a never-ending cycle. That is far less likely if you have enough RAM for half your index rather than a tenth. Operating system disk caches are pretty good at figuring out which data is needed frequently. If the cache is big enough, that data can be kept in the cache easily. An ideal setup would have enough RAM to cache the entire index. Depending on your schema, you may find that the disk cache in production only ends up caching somewhere between half and two thirds of your index. The 60GB figure you have quoted above *MIGHT* be enough to make things work really well with a 120GB index, but I always tell people that if they want top performance, they will buy enough RAM to cache the whole thing. You might have a combination of query pattern and data that results in more of the index needing cache than I have seen on my setup. You are likely to add documents continuously. You may learn that your schema doesn't cover your needs, so you have to modify it to tokenize more aggressively, or you may need to copy fields so you can analyze the same data more than one way. These things will make your index bigger. If your query volume grows or gets more varied, more of your index is likely to end up in the disk cache. I would not recommend going into production with an index that has no redundancy. If you buy quality hardware with redundancy in storage, dual power supplies, and ECC memory, catastrophic failures are rare, but they DO happen. The motherboard or an entire RAM chip could suddenly die. Someone might accidentally hit the power switch on the server and cause it to shut down. They might be working in the rack, fall down, and pull out both power cords in an attempt to catch themselves. The latter scenarios are a temporary problem, but your users will probably notice. Thanks, Shawn
Re: SolrCloud loadbalancing, replication, and failover
On 4/19/2013 2:15 AM, David Parks wrote: Interesting. I'm trying to correlate this new understanding to what I see on my servers. I've got one server with 5GB dedicated to solr, solr dashboard reports a 167GB index actually. When I do many typical queries I see between 3MB and 9MB of disk reads (watching iostat). But solr's dashboard only shows 710MB of memory in use (this box has had many hundreds of queries put through it, and has been up for 1 week). That doesn't quite correlate with my understanding that Solr would cache the index as much as possible. There are two memory sections on the dashboard. The one at the top shows the operating system view of physical memory. That is probably showing virtually all of it in use. Most UNIX platforms will show you the same info with 'top' or 'free'. Some of them, like Solaris, require different tools. I assume you're not using Windows, because you mention iostat. The other memory section is for the JVM, and that only covers the memory used by Solr. The dark grey section is the amount of Java heap memory currently utilized by Solr and its servlet container. The light grey section represents the memory that the JVM has allocated from system memory. If any part of that bar is white, then Java has not yet requested the maximum configured heap. Typically a long-running Solr install will have only dark and light grey, no white. The operating system is what caches your index, not Solr. The bulk of your RAM should be unallocated. With your index size, the OS will use all unallocated RAM for the disk cache. If a program requests some of that RAM, the OS will instantly give it up. Thanks, Shawn
RE: SolrCloud loadbalancing, replication, and failover
Ok, I understand better now. The Physical Memory is 90% utilized (21.18GB of 23.54GB). Solr has dark grey allocation of 602MB, and light grey of an additional 108MB, for a JVM total of 710MB allocated. If I understand correctly, Solr memory utilization is *not* for caching (unless I configured document caches or some of the other cache options in Solr, which don't seem to apply in this case, and I haven't altered from their defaults). So assuming this box was dedicated to 1 solr instance/shard. What JVM heap should I set? Does that matter? 24GB JVM heap? Or keep it lower and ensure the OS cache has plenty of room to operate? (this is an Ubuntu 12.10 server instance). Would I be wise to just put the index on a RAM disk and guarantee performance? Assuming I installed sufficient RAM? Dave -Original Message- From: Shawn Heisey [mailto:s...@elyograg.org] Sent: Friday, April 19, 2013 4:19 PM To: solr-user@lucene.apache.org Subject: Re: SolrCloud loadbalancing, replication, and failover On 4/19/2013 2:15 AM, David Parks wrote: Interesting. I'm trying to correlate this new understanding to what I see on my servers. I've got one server with 5GB dedicated to solr, solr dashboard reports a 167GB index actually. When I do many typical queries I see between 3MB and 9MB of disk reads (watching iostat). But solr's dashboard only shows 710MB of memory in use (this box has had many hundreds of queries put through it, and has been up for 1 week). That doesn't quite correlate with my understanding that Solr would cache the index as much as possible. There are two memory sections on the dashboard. The one at the top shows the operating system view of physical memory. That is probably showing virtually all of it in use. Most UNIX platforms will show you the same info with 'top' or 'free'. Some of them, like Solaris, require different tools. I assume you're not using Windows, because you mention iostat. The other memory section is for the JVM, and that only covers the memory used by Solr. The dark grey section is the amount of Java heap memory currently utilized by Solr and its servlet container. The light grey section represents the memory that the JVM has allocated from system memory. If any part of that bar is white, then Java has not yet requested the maximum configured heap. Typically a long-running Solr install will have only dark and light grey, no white. The operating system is what caches your index, not Solr. The bulk of your RAM should be unallocated. With your index size, the OS will use all unallocated RAM for the disk cache. If a program requests some of that RAM, the OS will instantly give it up. Thanks, Shawn
in solrcoud, how to assign a schemaConf to a collection ?
Hi all, help~~~ How do I specify a schema for a collection in SolrCloud? I have a SolrCloud with 3 collections, and the config files are uploaded to ZK like this: args=-Xmn3000m -Xms5000m -Xmx5000m -XX:MaxPermSize=384m -Dbootstrap_confdir=/workspace/solr/solrhome/doc/conf -Dcollection.configName=docconf -DzkHost=zk1:2181,zk2:2181,zk3:2181 -DnumShards=3 -Dname=docCollection The solr.xml is like this: <cores ...> <core name="doc" instanceDir="doc/" loadOnStartup="true" transient="false" collection="docCollection"/> <core name="video" instanceDir="video/" loadOnStartup="true" transient="false" collection="videoCollection"/> <core name="pic" instanceDir="pic/" loadOnStartup="true" transient="false" collection="picCollection"/> </cores> Then, when all nodes start up, I find that the schemas of 2 collections (doc and video) are the same, and the schema of pic is wrong too. Are there some properties in <core> which can specify its own schema??? Thanks for any help... -- View this message in context: http://lucene.472066.n3.nabble.com/in-solrcoud-how-to-assign-a-schemaConf-to-a-collection-tp4057238.html Sent from the Solr - User mailing list archive at Nabble.com.
solr-cloud problem about user-specified tags
I have plenty of docs and each doc may be connected to many user-defined tags. I have used SolrCloud and use join to do this kind of job, and recently I learned that join does not support distributed search in SolrCloud. So this is a big problem so far. And decomposition is quite impossible, because the docs and user-defined tags are so huge, and many searches always involve these two fields. Any good idea how to deal with this problem??
Re: SolrCloud loadbalancing, replication, and failover
On Fri, 2013-04-19 at 06:51 +0200, Shawn Heisey wrote: Using SSDs for storage can speed things up dramatically and may reduce the total memory requirement to some degree, We have been using SSDs for several years in our servers. It is our clear experience that "to some degree" should be replaced with "very much" in the above. Our current SSD-equipped servers each hold a total of 127GB of index data spread over 3 instances. The machines each have 16GB of RAM, of which about 7GB is left for disk cache. We are the State and University Library, Denmark, and our search engine is the primary (and arguably only) way to locate resources for our users. The average raw search time is 32ms for non-faceted queries and 616ms for heavily faceted ones (which is much too slow. Dang! I thought I fixed that). but even an SSD is slower than RAM. The transfer speed of RAM is faster, and from what I understand, the latency is at least an order of magnitude quicker - nanoseconds vs microseconds. True, but you might as well argue that everyone should go for the fastest CPU possible, as it will be, well, faster than the slower ones. The question is almost never how to get the fastest possible, but how to get a good price/performance tradeoff. I would argue that SSDs fit that bill very well for a great deal of the "my search is too slow" threads that are spun on this mailing list. Especially for larger indexes. Regards, Toke Eskildsen
RE: SolrCloud loadbalancing, replication, and failover
Wow, thank you for those benchmarks Toke, that really gives me some firm footing to stand on in knowing what to expect and thinking out which path to venture down. It's tremendously appreciated! Dave -Original Message- From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk] Sent: Friday, April 19, 2013 5:17 PM To: solr-user@lucene.apache.org Subject: Re: SolrCloud loadbalancing, replication, and failover On Fri, 2013-04-19 at 06:51 +0200, Shawn Heisey wrote: Using SSDs for storage can speed things up dramatically and may reduce the total memory requirement to some degree, We have been using SSDs for several years in our servers. It is our clear experience that to some degree should be replaced with very much in the above. Our current SSD-equipped servers each holds a total of 127GB of index data spread ever 3 instances. The machines each have 16GB of RAM, of which about 7GB are left for disk cache. We are the State and University Library, Denmark and our search engine is the primary (and arguably only) way to locate resources for our users. The average raw search time is 32ms for non-faceted queries and 616ms for heavy faceted (which is much too slow. Dang! I thought I fixed that). but even an SSD is slower than RAM. The transfer speed of RAM is faster, and from what I understand, the latency is at least an order of magnitude quicker - nanoseconds vs microseconds. True, but you might as well argue that everyone should go for the fastest CPU possible, as it will be, well, faster than the slower ones. The question is almost never to get the fastest possible, but to get a good price/performance tradeoff. I would argue that SSDs fit that bill very well for a great deal of the My search is too slow-threads that are spun on this mailing list. Especially for larger indexes. Regards, Toke Eskildsen
Re: WordDelimiterFactory
Ashok: You really, _really_ need to dive into the admin/analysis page. That'll show you exactly what WDFF (and all the other elements of your chain) does to input tokens. Understanding the index and query-time implications of all the settings in WDFF takes a while. But from what you're describing, WDFF may not be what you're looking for anyway; some of the regex filters could split, for instance, on all non-alphanumeric characters. Best Erick On Wed, Apr 17, 2013 at 12:25 AM, Shawn Heisey s...@elyograg.org wrote: On 4/16/2013 8:12 PM, Ashok wrote: It looks like any 'word' that starts with a digit is treated as a numeric string. Setting generateNumberParts=1 instead of 0 seems to generate the right tokens in this case, but I need to see if it has any other impact on the finalized token list... I have a fieldType that is using WDF with the following settings on the index side. Both index and query analysis show it behaving correctly with terms that start with numbers, on versions 4.2.1 and 3.5.0: <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1"/> It has different settings on the query side, but generateNumberParts is 1 for both: <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" preserveOriginal="0"/> I haven't tried it with generateNumberParts set to 0. Thanks, Shawn
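A minimal sketch of the regex-based alternative Erick mentions, splitting on every run of non-alphanumeric characters; the field type name is made up for illustration:

<fieldType name="text_alnum" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- the pattern is the delimiter: a token is emitted for each run of letters/digits -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9]+"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Whether this or WordDelimiterFilterFactory fits better is easiest to judge on the admin/analysis page, as suggested above.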
Re: in solrcoud, how to assign a schemaConf to a collection ?
when i add a schema property to core core name=pic instanceDir=pic/ loadOnStartup=true transient=false collection=picCollection config=solrconfig.xml schema=../picconf/schema.xml/ it seems there a default path to schema ,that is /configs/docconf/ the exception is: [18:59:09.211] java.lang.IllegalArgumentException: Invalid path string /configs/docconf/../picconf/schema.xml caused by relative paths not allowed @18 [18:59:09.211] at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:99) [18:59:09.211] at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1133) [18:59:09.211] at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:253) [18:59:09.211] at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:250) [18:59:09.211] at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65) [18:59:09.211] at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:250) [18:59:09.211] at org.apache.solr.cloud.ZkController.getConfigFileData(ZkController.java:388) [18:59:09.211] at org.apache.solr.core.CoreContainer.getSchemaFromZk(CoreContainer.java:1659) [18:59:09.211] at org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:948) [18:59:09.211] at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1031) [18:59:09.211] at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629) [18:59:09.211] at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624) [18:59:09.211] at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) [18:59:09.211] at java.util.concurrent.FutureTask.run(FutureTask.java:138) [18:59:09.211] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) [18:59:09.211] at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) [18:59:09.211] at java.util.concurrent.FutureTask.run(FutureTask.java:138) [18:59:09.211] at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [18:59:09.211] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [18:59:09.211] at java.lang.Thread.run(Thread.java:619) -- View this message in context: http://lucene.472066.n3.nabble.com/in-solrcoud-how-to-assign-a-schemaConf-to-a-collection-tp4057238p4057250.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Indexing problems
Hello, Thank you for your answer. We have solved our problem now. I'll describe it for anyone who encounters a similar problem. Some of our fields are dynamic, and the name of one of these fields was not correct: it was sent to Solr as a Java object, e.g. solrInputDocument.addField(myObject, stringValue); A string representation of this object was displayed in the Solr admin page, and that alerted us. We have replaced this wrong field name with the string we expected and no more OOMEs occur. At least we could test various Solr configurations. Regards Joel Gaspard -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Thursday, January 31, 2013 14:00 To: solr-user@lucene.apache.org Subject: Re: Indexing problems I'm really surprised you're hitting OOM errors; I suspect you have something else pathological in your system. So, I'd start checking things like - how many concurrent warming searchers you allow - how big your indexing RAM is set to (we find very little gain over 128M BTW) - other load on your Solr server. Are you, for instance, searching on it too? - what your autocommit characteristics are (think about autocommitting fairly often with openSearcher=false) - have you defined huge caches? - how big are these documents anyway? With 12G of RAM, they'd have to be absolutely _huge_ to matter much. Multiple collections should work fine in ZK. I really think you have some innocent-looking configuration setting that's bollixing you up; this is not expected behavior. If at all possible, I'd also go with 4.1. I don't really think it's relevant to your situation, but there have been a lot of improvements in the code. Best Erick
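For anyone hitting a similar symptom with dynamic fields: a field name is only accepted as a dynamic field if it matches a declared pattern, so a name built from an object's toString() usually ends up either rejected as an unknown field or landing in a catch-all rule. A hedged sketch (the pattern and type are illustrative, not the poster's actual schema):

<!-- only names ending in _s match this rule; anything else must match another field or dynamicField -->
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>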
Re: in solrcoud, how to assign a schemaConf to a collection ?
I copied the 3 schema.xml files and solrconfig.xml into $solrhome/conf/, and uploaded this directory to ZK like this: args=-Xmn1000m -Xms2000m -Xmx2000m -XX:MaxPermSize=384m -Dbootstrap_confdir=/home/app/workspace/solrcloud/solr/solrhome/conf -Dcollection.configName=conf -DzkHost=zk1:2181,zk2:2181,zk3:2181 -DnumShards=2 -Dname=docCollection Then in solr.xml it changes to: <core name="doc" instanceDir="doc/" loadOnStartup="true" transient="false" collection="docCollection" schema="s1.xml" config="sc1.xml"/> In this way, the schema.xml files are separated. It seems the schema and config properties are resolved against the relative path /configs/conf, and this is what I uploaded from local; $solrhome/conf is equal to /configs/conf. -- View this message in context: http://lucene.472066.n3.nabble.com/in-solrcoud-how-to-assign-a-schemaConf-to-a-collection-tp4057238p4057254.html Sent from the Solr - User mailing list archive at Nabble.com.
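Put together, a sketch of a working layout (the file names are illustrative): all schema and config files live in the single uploaded config set, and each core points at its own pair via the schema and config attributes. Paths are resolved inside the linked config set in ZooKeeper (/configs/conf here), which is why the earlier ../picconf attempt was rejected as a relative path:

<cores adminPath="/admin/cores">
  <core name="doc"   instanceDir="doc/"   collection="docCollection"   schema="doc-schema.xml"   config="doc-solrconfig.xml"/>
  <core name="video" instanceDir="video/" collection="videoCollection" schema="video-schema.xml" config="video-solrconfig.xml"/>
  <core name="pic"   instanceDir="pic/"   collection="picCollection"   schema="pic-schema.xml"   config="pic-solrconfig.xml"/>
</cores>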
Re: Solr using a ridiculous amount of memory
Hmmm. There has been quite a bit of work lately to support a couple of things that might be of interest (4.3, which Simon cut today, probably available to all mid next week at the latest). Basically, you can choose to pre-define all the cores in solr.xml (so-called old style) _or_ use the new-style solr.xml which uses auto-discover mode to walk the indicated directory and find all the cores (indicated by the presence of a 'core.properties' file). Don't know if this would make your particular case easier, and I should warn you that this is relatively new code (although there are some reasonable unit tests). You also have the option to only load the cores when they are referenced, and only keep N cores open at a time (loadOnStartup and transient properties). See: http://wiki.apache.org/solr/CoreAdmin#Configuration and http://wiki.apache.org/solr/Solr.xml%204.3%20and%20beyond Note, the docs are somewhat sketchy, so if you try to go down this route let us know anything that should be improved (or you can be added to the list of wiki page contributors and help out!) Best Erick On Thu, Apr 18, 2013 at 8:31 AM, John Nielsen j...@mcb.dk wrote: You are missing an essential part: Both the facet and the sort structures needs to hold one reference for each document _in_the_full_index_, even when the document does not have any values in the fields. Wow, thank you for this awesome explanation! This is where the penny dropped for me. I will definetely move to a multi-core setup. It will take some time and a lot of re-coding. As soon as I know the result, I will let you know! -- Med venlig hilsen / Best regards *John Nielsen* Programmer *MCB A/S* Enghaven 15 DK-7500 Holstebro Kundeservice: +45 9610 2824 p...@mcb.dk www.mcb.dk
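For reference, a hedged sketch of the lazy-loading option mentioned above, in old-style solr.xml (the attribute values are illustrative): loadOnStartup=false delays opening a core until it is first used, transient=true lets it be closed again, and transientCacheSize caps how many transient cores stay open at once:

<cores adminPath="/admin/cores" transientCacheSize="8">
  <core name="doc" instanceDir="doc/" loadOnStartup="false" transient="true"/>
  <!-- further cores declared the same way, or discovered via core.properties files in new-style solr.xml -->
</cores>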
Re: stats.facet not working for timestamp field
I'm guessing that your timestamp is a tdate, which stores extra information in the index for fast range searches. What happens if you try to facet on just a date field? Best Erick On Thu, Apr 18, 2013 at 8:37 AM, J Mohamed Zahoor zah...@indix.com wrote: Hi, I am using Solr 4.1 with 6 shards. I want to find out some price stats for all the days in my index. I ended up using the stats component like stats=true&stats.field=price&stats.facet=timestamp, but it throws an error like <str name="msg">Invalid Date String:' #1;#0;#0;#0;'[my(#0;'</str> My question is: is timestamp supported as stats.facet? ./zahoor
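If the extra trie terms are indeed the problem, one hedged workaround is to copy the timestamp into a plain date field (precisionStep 0, so only one indexed term per value) and point stats.facet at the copy; the names below are illustrative:

<fieldType name="pdate" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
<field name="timestamp_plain" type="pdate" indexed="true" stored="false"/>
<copyField source="timestamp" dest="timestamp_plain"/>

The query would then use stats=true&stats.field=price&stats.facet=timestamp_plain.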
Re: solr4 : disable updateLog
updateLog is _required_ if you're in SolrCloud mode. Assuming that you're not using SolrCloud, then you can freely disable it. Why do you want to? It's not a bad idea necessarily, but this might be an XY problem. Best Erick On Thu, Apr 18, 2013 at 10:47 AM, Jamel ESSOUSSI jamel.essou...@gmail.com wrote: Hi, If I disable (comment out) the updateLog block, will this affect indexing results? -- View this message in context: http://lucene.472066.n3.nabble.com/solr4-disable-updateLog-tp4056998.html Sent from the Solr - User mailing list archive at Nabble.com.
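For reference, the block in question lives inside <updateHandler> in solrconfig.xml; commenting it out is what disables the transaction log (only safe outside SolrCloud, as noted above):

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- remove or comment out this element to disable the transaction log -->
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
</updateHandler>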
Update Request Processor Chains
I am trying to understand update request processor chains. Do they run one by one when indexing a document? Can I define multiple update request processor chains? Also, what are LogUpdateProcessorFactory and RunUpdateProcessorFactory?
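A minimal sketch of a chain, with a made-up name for illustration: the processors run in order for each document, LogUpdateProcessorFactory logs the update, and RunUpdateProcessorFactory is the one that actually applies it to the index, so it normally comes last. You can define several chains and pick one per handler (or per request) via the update.chain parameter:

<updateRequestProcessorChain name="mychain">
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">mychain</str>
  </lst>
</requestHandler>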
Re: solr-cloud performance decrease day by day
How are you committing data? With 4.0, commitWithin is now a soft commit, which means that the transaction log will grow until you do a hard commit. You need to periodically do a hard commit if you are continually updating the index. How much updating are you doing? Also, check how much heap is available after you first start the server and have done a few queries, and then monitor the heap available over time. Maybe you are hitting garbage collections. Maybe you have too much heap allocated, so that even a normal Java GC takes a very long time because so much garbage accumulates - which is why you want only a modest amount of heap available above what the data needs after a few queries have loaded the caches. -- Jack Krupansky -Original Message- From: qibaoyuan Sent: Friday, April 19, 2013 3:15 AM To: solr-user@lucene.apache.org Subject: solr-cloud performance decrease day by day Hello, I am using Solr 4.1.0 and I have used SolrCloud in my product. I have found that at first everything seems good: the search time is fast and latency is low, but it becomes very slow after a few days. Does anyone know if there are some parameters or optimizations for using SolrCloud?
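A hedged sketch of the periodic hard commit described above, in the <updateHandler> section of solrconfig.xml (the one-minute interval is illustrative); openSearcher=false keeps the commit cheap by truncating the transaction log without opening a new searcher, leaving visibility to soft commits or commitWithin:

<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>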
fuzzy search issue with PatternTokenizer Factory
I am using Solr 4.2. I have changed my text field definition to use solr.PatternTokenizerFactory instead of solr.StandardTokenizerFactory, and changed my schema definition as below: <fieldType name="text_token" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9&amp;\-']|\d{0,4}s:" /> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" /> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9&amp;\-']|\d{0,4}s:" /> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_extra_query.txt" enablePositionIncrements="false" /> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType> After doing so, fuzzy search does not seem to work properly as it did before. I am searching with the search term worde~1; before it returned around 300 records, but now it returns only 5 records. Not sure what the issue can be. Can anybody help me make it work!! -- View this message in context: http://lucene.472066.n3.nabble.com/fuzzy-search-issue-with-PatternTokenizer-Factory-tp4057275.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: facet.method enum vs fc
Faceting on a high cardinality string field, like url, on a 120 million record index is going to be very memory intensive. You will very likely need to shard the index to get the performance that you need. In Solr 4.2, you can make the url field a Disk based DocValue and shift the memory from Solr to the file system cache. But to run efficiently this is still going to take a lot of memory in the OS file cache. On Thu, Apr 18, 2013 at 12:00 PM, Mingfeng Yang mfy...@wisewindow.comwrote: 20G is allocated to Solr already. Ming On Wed, Apr 17, 2013 at 11:56 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote: On Wed, 2013-04-17 at 20:06 +0200, Mingfeng Yang wrote: I am doing faceting on an index of 120M documents, on the field of url[...] I would guess that you would need 3-4GB for that. How much memory do you allocate to Solr? - Toke Eskildsen -- Joel Bernstein Professional Services LucidWorks
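A sketch of the Solr 4.2 on-disk DocValues option mentioned above (the type name and codec wiring are illustrative, and a full reindex is required):

<!-- in solrconfig.xml: lets the schema choose per-field docValues formats -->
<codecFactory class="solr.SchemaCodecFactory"/>

<!-- in schema.xml -->
<fieldType name="string_dv_disk" class="solr.StrField" docValuesFormat="Disk"/>
<field name="url" type="string_dv_disk" indexed="true" stored="true" docValues="true"/>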
Import in Solr
I want to update (delta-import) one specific item. Is there any query to do that? For example, I can delete a specific item with the following query: localhost:8080/solr/devices/update?stream.body=<delete><query>id:46</query></delete>&commit=true Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Import-in-Solr-tp4057301.html Sent from the Solr - User mailing list archive at Nabble.com.
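Assuming the DataImportHandler is in use, a hedged sketch: the delta queries in data-config.xml decide which rows get re-imported, so a single item can be targeted by making deltaQuery select just that row (the entity, table, and column names are illustrative):

<entity name="device" pk="id"
        query="SELECT * FROM devices"
        deltaQuery="SELECT id FROM devices WHERE last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT * FROM devices WHERE id = '${dataimporter.delta.id}'"/>

The import itself is then triggered with something like localhost:8080/solr/devices/dataimport?command=delta-import&commit=true.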
Returning similarity values for more like this search
Hi, I'm executing a search including a search for similar documents (mlt=true&mlt.fl=...) which works fine so far. I would like to get the similarity value for each document. I expected this to be quite common and simple, but I could not find a hint how to do it. Any hint would be very appreciated. kind regards, Achim
RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.
I guess the first thing I'd do is to set maxCollationTries to zero. This means it will only run your main query once and not re-run it to check the collations. Now see if your queries have consistent qtime. One easy explanation is that with maxCollationTries=10, it may be running your query up to 11 times to check up to 10 possible collations. If the query takes 50ms by itself, then you've got 550ms total to not find spelling corrections. Unfortunately, the worst case here is the one that gives the user nothing back. Another thing to look at, with maxCollationTries at zero, set maxCollations to 10. This will give you a list of the 10 collations it would have tried. You can figure if the one that gets hits is far enough down the list to explain the high total qtime when maxCollationTries=10. If this explains it, then the obvious solution is to set maxCollationTries to something lower than 10. (you'll need tio weigh how long you're willing to make your users wait to possibly get spelling suggestions) Or possibly, use spellcheck.q to give it an easier query to evalutate than the main query (but that can still give valid collations). Also, see https://issues.apache.org/jira/browse/SOLR-3240 which is an optimization for this feature. James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: SandeepM [mailto:skmi...@hotmail.com] Sent: Thursday, April 18, 2013 11:33 PM To: solr-user@lucene.apache.org Subject: DirectSolrSpellChecker : vastly varying spellcheck QTime times. Hi! I am using SOLR 4.2.1. My solrconfig.xml contains the following: searchComponent name=MySpellcheck class=solr.SpellCheckComponent str name=queryAnalyzerFieldTypetext_spell/str lst name=spellchecker str name=nameMySpellchecker/str str name=fieldspell/str str name=classnamesolr.DirectSolrSpellChecker/str str name=distanceMeasureinternal/str float name=accuracy0.5/float int name=maxEdits2/int int name=minPrefix1/int int name=maxInspections5/int int name=minQueryLength3/int float name=maxQueryFrequency0.01/float /lst /searchComponent requestHandler name=/select class=solr.SearchHandler startup=lazy lst name=defaults int name=rows10/int str name=dfid/str str name=spellcheck.dictionaryMySpellchecker/str str name=spellcheckon/str str name=spellcheck.extendedResultsfalse/str str name=spellcheck.count10/str str name=spellcheck.alternativeTermCount10/str str name=spellcheck.maxResultsForSuggest35/str str name=spellcheck.onlyMorePopulartrue/str str name=spellcheck.collatetrue/str str name=spellcheck.collateExtendedResultsfalse/str str name=spellcheck.maxCollationTries10/str str name=spellcheck.maxCollations1/str str name=spellcheck.collateParam.q.opAND/str /lst arr name=last-components strMySpellcheck/str /arr /requestHandler schema.xml with the spell field looks like: fieldType name=text_spell class=solr.TextField positionIncrementGap=100 sortMissingLast=true analyzer type=index tokenizer class=solr.StandardTokenizerFactory / filter class=solr.LowerCaseFilterFactory / filter class=solr.StopFilterFactory ignoreCase=true words=lang/stopwords_en.txt enablePositionIncrements=true / /analyzer analyzer type=query tokenizer class=solr.StandardTokenizerFactory / filter class=solr.LowerCaseFilterFactory / filter class=solr.StopFilterFactory ignoreCase=true words=lang/stopwords_en.txt enablePositionIncrements=true / /analyzer /fieldType field name=spell type=text_spell indexed=true stored=false multiValued=true / copyField source=title dest=spell / copyField source=artist dest=spell / My query: 
http://host/solr/select?q=spellcheck.q=chocolat%20factryspellcheck=truedf=spellfl=indent=onwt=xmlrows=10version=2.2echoParams=explicit In this case, the intent is to correct chocolat factry with chocolate factory which exists in my spell field index. I see a QTime from the above query as somewhere between 350-400ms I run a similar query replacing the spellcheck terms to pursut hapyness whereas pursuit happyness actually exists in my spell field and I see QTime of 15-17ms . Both query produce collations correctly but there is order of magnitude difference in QTime. There is one edit per term in both cases or 2 edits in each query. The length of words in both these queries seem identical. I'd like to understand why there is this vast difference in QTime. I would appreciate
Re: SolrCloud loadbalancing, replication, and failover
On 4/19/2013 3:48 AM, David Parks wrote: The Physical Memory is 90% utilized (21.18GB of 23.54GB). Solr has dark grey allocation of 602MB, and light grey of an additional 108MB, for a JVM total of 710MB allocated. If I understand correctly, Solr memory utilization is *not* for caching (unless I configured document caches or some of the other cache options in Solr, which don't seem to apply in this case, and I haven't altered from their defaults). Right. Solr does have caches, but they serve specific purposes. The OS is much better at general large-scale caching than Solr is. Solr caches get cleared (and possibly re-warmed) whenever you issue a commit on your index that makes new documents visible. So assuming this box was dedicated to 1 solr instance/shard. What JVM heap should I set? Does that matter? 24GB JVM heap? Or keep it lower and ensure the OS cache has plenty of room to operate? (this is an Ubuntu 12.10 server instance). The JVM heap to use is highly dependent on the nature of your queries, the number of documents, the number of unique terms, etc. The best thing to do is try it out with a relatively large heap, see how much memory actually gets used inside the JVM. The jvisualvm and jconsole tools will give you nice graphs of JVM memory usage. The jstat program will give you raw numbers on the commandline that you'll need to add to get the full picture. Due to the garbage collection model that Java uses, what you'll see is a sawtooth pattern - memory usage goes up to max heap, then garbage collection reduces it to the actual memory used. Generally speaking, you want to have more heap available than the low point of that sawtooth pattern. If that low point is around 3GB when you are hitting your index hard with queries and updates, then you would want to give Solr a heap of 4 to 6 GB. Would I be wise to just put the index on a RAM disk and guarantee performance? Assuming I installed sufficient RAM? A RAM disk is a very good way to guarantee performance - but RAM disks are ephemeral. Reboot or have an OS crash and it's gone, you'll have to reindex. Also remember that you actually need at *least* twice the size of your index so that Solr (Lucene) has enough room to do merges, and the worst-case scenario is *three* times the index size. Merging happens during normal indexing, not just when you optimize. If you have enough RAM for three times your index size and it takes less than an hour or two to rebuild the index, then a RAM disk might be a viable way to go. I suspect that this won't work for you. Thanks, Shawn
Re: Returning similarity values for more like this search
(13/04/19 23:24), Achim Domma wrote: Hi, I'm executing a search including a search for similar documents (mlt=true&mlt.fl=...) which works fine so far. I would like to get the similarity value for each document. I expected this to be quite common and simple, but I could not find a hint how to do it. Any hint would be very appreciated. kind regards, Achim Using debugQuery=true, you can find explanations in the debug section of the response. See: https://issues.apache.org/jira/browse/SOLR-860 koji -- http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html
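For example, a request along the lines of .../select?q=id:123&mlt=true&mlt.fl=text&debugQuery=true (the field names are illustrative) should return the MoreLikeThis explanations in the debug section of the response (per SOLR-860), which is where the per-document similarity scores show up.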
is phrase search possible in solr
I want to do a phrase search in Solr without analyzers being applied to it. E.g. if I search for "DelhiDareDevil" (i.e. with inverted commas) it should search the exact text and not apply any analyzers or tokenizers on this field. However, if I search for DelhiDareDevil without inverted commas, it should use tokenizers and analyzers and split it into something like delhi dare devil. My schema definition for this is as follows: <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false"> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/> <filter class="solr.LowerCaseFilter``Factory"/> </analyzer> </fieldType> <field name="cContent" type="text" indexed="true" stored="true" multiValued="false"/> Any help would be appreciated. -- View this message in context: http://lucene.472066.n3.nabble.com/is-phrase-search-possible-in-solr-tp4057312.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SEVERE: shard update error StdNode on SolrCloud 4.2.1
On 16 April 2013 11:35, Steve Woodcock steve.woodc...@gmail.com wrote: We have a simple SolrCloud setup (4.2.1) running with a single shard and two nodes, and it's working fine except whenever we send an update request, the leader logs this error: SEVERE: shard update error StdNode: http://10.20.10.42:8080/solr/ts/:org.apache.solr.common.SolrException: Server at http://10.20.10.42:8080/solr/ts returned non ok status:500, message:Internal Server Error at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373) Turns out I think this was caused by having the wrong type for the _version_ field in the schema. We had type="string", but it should be type="long", i.e. <field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/> Which, to be fair, is well documented at http://wiki.apache.org/solr/SolrCloud Certainly seems to be working a lot better so far ... Cheers, Steve
Re: is phrase search possible in solr
On Apr 19, 2013, at 16:59 , vicky desai vicky.de...@germinait.com wrote: I want to do a phrase search in solr without analyzers being applied to it eg - If I search for *DelhiDareDevil* (i.e - with inverted commas)it should search the exact text and not apply any analyzers or tokenizers on this field However if i search for *DelhiDareDevil* it should use tokenizers and analyzers and split it to something like this *delhi dare devil* My schema definition for this is as follows fieldType name=text class=solr.TextField positionIncrementGap=100 autoGeneratePhraseQueries=false analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1 preserveOriginal=1/ filter class=solr.LowerCaseFilterFactory / /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1 preserveOriginal=1/ filter class=solr.LowerCaseFilter``Factory / /analyzer /fieldType field name=cContent type=text indexed=true stored=true multiValued=false/ any help would be appreciated First of all, it appears that you have a typo in the definition for the LowerCaseFilter for the query analyzer. Secondly, as the two analyzers appear to be identical (except forn the probable typo), I think you could just specify it once, without specifying the type.
Re: is phrase search possible in solr
By definition, phrase search is one of two things: 1) match on a string field literally, or 2) analyze as a sequence of tokens as per the field type index analyzer. You could use the keyword tokenizer to store the whole field as one string, with filtering for the whole string. Or, just make it a string field and do literal and wildcard matches. You can use copyField to make copies of the same input data in multiple fields, each with different analyzers. You would then need to specify which field you want to search, whether literal or keyword. -- Jack Krupansky -Original Message- From: vicky desai Sent: Friday, April 19, 2013 10:59 AM To: solr-user@lucene.apache.org Subject: is phrase search possible in solr I want to do a phrase search in solr without analyzers being applied to it eg - If I search for *DelhiDareDevil* (i.e - with inverted commas)it should search the exact text and not apply any analyzers or tokenizers on this field However if i search for *DelhiDareDevil* it should use tokenizers and analyzers and split it to something like this *delhi dare devil* My schema definition for this is as follows fieldType name=text class=solr.TextField positionIncrementGap=100 autoGeneratePhraseQueries=false analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1 preserveOriginal=1/ filter class=solr.LowerCaseFilterFactory / /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1 preserveOriginal=1/ filter class=solr.LowerCaseFilter``Factory / /analyzer /fieldType field name=cContent type=text indexed=true stored=true multiValued=false/ any help would be appreciated -- View this message in context: http://lucene.472066.n3.nabble.com/is-phrase-search-possible-in-solr-tp4057312.html Sent from the Solr - User mailing list archive at Nabble.com.
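A sketch of the copyField approach (the *_exact field name is made up): the tokenized field serves normal analyzed queries and the string copy serves literal matches, with the query choosing the field explicitly:

<field name="cContent"       type="text"   indexed="true" stored="true"/>
<field name="cContent_exact" type="string" indexed="true" stored="false"/>
<copyField source="cContent" dest="cContent_exact"/>

A query like cContent_exact:"DelhiDareDevil" then matches only the literal text, while cContent:DelhiDareDevil goes through the tokenizers and filters as before.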
Pros and cons of using RAID or different RAIDS?
Is there any documentation that explains the pros and cons of using RAID, or of the different RAID levels?
Re: is phrase search possible in solr
Oops... that's query analyzer, not index analyzer, so it's: By definition, phrase search is one of two things: 1) match on a string field literally, or 2) analyze as a sequence of tokens as per the field type query analyzer. -- Jack Krupansky -Original Message- From: Jack Krupansky Sent: Friday, April 19, 2013 11:14 AM To: solr-user@lucene.apache.org Subject: Re: is phrase search possible in solr By definition, phrase search is one of two things: 1) match on a string field literally, or 2) analyze as a sequence of tokens as per the field type index analyzer. You could use the keyword tokenizer to store the whole field as one string, with filtering for the whole string. Or, just make it a string field and do literal and wildcard matches. You can use copyField to make copies of the same input data in multiple fields, each with different analyzers. You would then need to specify which field you want to search, whether literal or keyword. -- Jack Krupansky -Original Message- From: vicky desai Sent: Friday, April 19, 2013 10:59 AM To: solr-user@lucene.apache.org Subject: is phrase search possible in solr I want to do a phrase search in solr without analyzers being applied to it eg - If I search for *DelhiDareDevil* (i.e - with inverted commas)it should search the exact text and not apply any analyzers or tokenizers on this field However if i search for *DelhiDareDevil* it should use tokenizers and analyzers and split it to something like this *delhi dare devil* My schema definition for this is as follows fieldType name=text class=solr.TextField positionIncrementGap=100 autoGeneratePhraseQueries=false analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1 preserveOriginal=1/ filter class=solr.LowerCaseFilterFactory / /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1 preserveOriginal=1/ filter class=solr.LowerCaseFilter``Factory / /analyzer /fieldType field name=cContent type=text indexed=true stored=true multiValued=false/ any help would be appreciated -- View this message in context: http://lucene.472066.n3.nabble.com/is-phrase-search-possible-in-solr-tp4057312.html Sent from the Solr - User mailing list archive at Nabble.com.
Searching
I want to search so that: - if i write an alphabet it returns all the items that start with that alphabet(a returns apple, aspire etc). - if i ask for a whole string, it returns me just the results with exact string. (like search for Samsung S3 then only result is samsung s3) -if i ask for something it returns me anything that is similar to what i m asking.(like if i only write 'sam' it should return 'samsung') right now i m using text_en_splitting for my field type, it looks like this: fieldType name=text_en_splitting class=solr.TextField positionIncrementGap=100 autoGeneratePhraseQueries=true analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=lang/stopwords_en.txt enablePositionIncrements=true/ filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.PorterStemFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=lang/stopwords_en.txt enablePositionIncrements=true / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=0 catenateNumbers=0 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.PorterStemFilterFactory/ filter class=solr.PositionFilterFactory / /analyzer /fieldType -- View this message in context: http://lucene.472066.n3.nabble.com/Searching-tp4057328.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: facet.method enum vs fc
Joel, Thanks for your kind reply. The problem is solved with sharding and using facet.method=enum. I am curious about what the difference is between enum and fc, so that enum works but fc does not. Do you know something about this? Thank you! Regards, Ming On Fri, Apr 19, 2013 at 6:18 AM, Joel Bernstein joels...@gmail.com wrote: Faceting on a high cardinality string field, like url, on a 120 million record index is going to be very memory intensive. You will very likely need to shard the index to get the performance that you need. In Solr 4.2, you can make the url field a Disk based DocValue and shift the memory from Solr to the file system cache. But to run efficiently this is still going to take a lot of memory in the OS file cache. On Thu, Apr 18, 2013 at 12:00 PM, Mingfeng Yang mfy...@wisewindow.com wrote: 20G is allocated to Solr already. Ming On Wed, Apr 17, 2013 at 11:56 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote: On Wed, 2013-04-17 at 20:06 +0200, Mingfeng Yang wrote: I am doing faceting on an index of 120M documents, on the field of url[...] I would guess that you would need 3-4GB for that. How much memory do you allocate to Solr? - Toke Eskildsen -- Joel Bernstein Professional Services LucidWorks
Re: Searching
Yes, you can do all of that... but it would be a non-trivial amount of effort - the kind of thing consultants get paid real money to do. You should also consider doing it in a middleware application layer, using possibly multiple queries of separate Solr collections. Otherwise, your index might become too large and unwieldy (and risk giving bad or misleading results), unless the number of products is rather small. -- Jack Krupansky -Original Message- From: hassancrowdc Sent: Friday, April 19, 2013 11:48 AM To: solr-user@lucene.apache.org Subject: Searching I want to search so that: - if i write an alphabet it returns all the items that start with that alphabet(a returns apple, aspire etc). - if i ask for a whole string, it returns me just the results with exact string. (like search for Samsung S3 then only result is samsung s3) -if i ask for something it returns me anything that is similar to what i m asking.(like if i only write 'sam' it should return 'samsung') right now i m using text_en_splitting for my field type, it looks like this: fieldType name=text_en_splitting class=solr.TextField positionIncrementGap=100 autoGeneratePhraseQueries=true analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=lang/stopwords_en.txt enablePositionIncrements=true/ filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.PorterStemFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=lang/stopwords_en.txt enablePositionIncrements=true / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=0 catenateNumbers=0 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.PorterStemFilterFactory/ filter class=solr.PositionFilterFactory / /analyzer /fieldType -- View this message in context: http://lucene.472066.n3.nabble.com/Searching-tp4057328.html Sent from the Solr - User mailing list archive at Nabble.com.
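[Editor's note] As a rough starting point for the prefix-match part of this (a sketch under assumed field names, not a complete answer to all three requirements), a common pattern is to copyField the product name into an edge n-gram field for prefixes and a string field for exact matches:

<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="name_prefix" type="text_prefix" indexed="true" stored="false"/>
<field name="name_exact" type="string" indexed="true" stored="false"/>
<copyField source="name" dest="name_prefix"/>
<copyField source="name" dest="name_exact"/>

Queries like name_prefix:a or name_prefix:sam would then match "apple", "aspire", "samsung", while name_exact:"Samsung S3" matches only that exact string (case-sensitive as written). Combining and boosting these fields is the part that takes the real design work Jack mentions.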
Re: Update Request Processor Chains
You can have multiple update chains defined and use only one of them per update request. LogUpdateProcessor logs the update request and the RunUpdateProcessor is where the actual index is updated. Erik On Apr 19, 2013, at 07:49 , Furkan KAMACI wrote: I am trying to understand update request processor chains. Do they run one by one when indexing a document? Can I identify multiple update request processor chains? Also what are LogUpdateProcessorFactory and RunUpdateProcessorFactory?
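[Editor's note] For reference, a minimal sketch of such a chain in solrconfig.xml (the chain name is just an example):

<updateRequestProcessorChain name="mychain">
  <!-- custom processors, if any, go before the log/run pair -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

An update request then selects it with the update.chain parameter, e.g. /update?update.chain=mychain; requests that do not specify it fall back to the default chain.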
Re: WordDelimiterFactory
Yes, thank you Erick. The analysis/document handlers hold the key to deciding the type and order of the filters to employ, given one's document set and the subject matter at hand. The finalized terms they produce for Solr search, MLT, etc. are crucial to the quality of the results. - ashok -- View this message in context: http://lucene.472066.n3.nabble.com/WordDelimiterFactory-tp4056529p4057349.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: fuzzy search issue with PatternTokenizer Factory
Give us some examples of tokens that you are expecting that pattern to tokenize. And express the pattern in simple English as well. Show us some actual input data. I suspect that Solr is working fine - but you may not have precisely specified your pattern. But we don't know what your pattern is supposed to recognize. Maybe some of your previous hits had punctuation adjacent to the terms that your pattern doesn't recognize. And use the Solr Admin UI Analysis page to see how your sample input data is analyzed. One other thing... without a group, the pattern specifies what delimiter sequence will split the rest of the input into tokens. I suspect you didn't mean this. -- Jack Krupansky
-Original Message- From: meghana Sent: Friday, April 19, 2013 9:01 AM To: solr-user@lucene.apache.org Subject: fuzzy search issue with PatternTokenizer Factory
I am using Solr 4.2. I have changed my text field definition to use solr.PatternTokenizerFactory instead of solr.StandardTokenizerFactory, and changed my schema definition as below:
<fieldType name="text_token" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9&amp;\-']|\d{0,4}s:"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9&amp;\-']|\d{0,4}s:"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_extra_query.txt" enablePositionIncrements="false"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
After doing so, fuzzy search does not seem to work properly as it was working before. I am searching with the search term worde~1. Before it was returning around 300 records, but now it is returning only 5 records. Not sure what the issue can be. Can anybody help me to make it work!! -- View this message in context: http://lucene.472066.n3.nabble.com/fuzzy-search-issue-with-PatternTokenizer-Factory-tp4057275.html Sent from the Solr - User mailing list archive at Nabble.com.
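[Editor's note] To illustrate Jack's last point about the group attribute (the patterns below are illustrative only, not the poster's): with the default group of -1, PatternTokenizerFactory treats the pattern as a delimiter and splits on it; with group="0" (or a capture-group number), the pattern itself defines the tokens that are kept.

<!-- pattern is the delimiter: split on runs of non-alphanumerics -->
<tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9]+"/>
<!-- pattern is the token: keep runs of alphanumerics, discard everything else -->
<tokenizer class="solr.PatternTokenizerFactory" pattern="[a-zA-Z0-9]+" group="0"/>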
RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.
James, Thanks for the reply. I see your point and sure enough, reducing maxCollationTries does reduce time, but it may not produce results. It seems like the time is taken by the collation re-runs. Is there any way we can activate caching for collations? The same query repeatedly takes the same amount of time. My query caches are activated, but I don't believe they get used for spellcheck. Thanks. -- Sandeep -- View this message in context: http://lucene.472066.n3.nabble.com/DirectSolrSpellChecker-vastly-varying-spellcheck-QTime-times-tp4057176p4057389.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Updating clusterstate from the zookeeper
I would like to know the answer to this as well. Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Thu, Apr 18, 2013 at 8:15 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hello, After creating a distributed collection on several different servers I sometimes get to deal with failing servers (cores appear not available = grey) or failing cores (Down / unable to recover = brown / red). In case i wish to delete this errorneous collection (through collection API) only the green nodes get erased, leaving a meaningless unavailable collection in the clusterstate.json. Is there any way to edit explicitly the clusterstate.json? If not, how do i update it so the collection as above gets deleted? Cheers, Manu
Re: Updating clusterstate from the zookeeper
you can use the eclipse plugin for zookeeper. http://www.massedynamic.org/mediawiki/index.php?title=Eclipse_Plug-in_for_ZooKeeper -Msj. On Fri, Apr 19, 2013 at 1:53 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: I would like to know the answer to this as well. Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Thu, Apr 18, 2013 at 8:15 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hello, After creating a distributed collection on several different servers I sometimes get to deal with failing servers (cores appear not available = grey) or failing cores (Down / unable to recover = brown / red). In case i wish to delete this errorneous collection (through collection API) only the green nodes get erased, leaving a meaningless unavailable collection in the clusterstate.json. Is there any way to edit explicitly the clusterstate.json? If not, how do i update it so the collection as above gets deleted? Cheers, Manu
RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.
I do not know what it would take to have the collation tests make better use of the QueryResultCache. However, outside of a test scenario, I do not know if this would help a lot. Hopefully you wouldn't have a lot of users issuing the exact same query with the exact same misspelled words over and over. In the real world, if you find that a collation is a better query than the one the user initially issued, then when that user pages through results, etc, your application should use the corrected query and not re-run the incorrect query over and over again. In the case of maxResultsForSuggest, if a user does the first query then rejects any did-you-mean suggestions, you can just turn spellcheck off if they page, facet, etc, so that you don't have to generate these suggestions over and over again. You do have to weigh when setting maxCollationTries whether or not it is acceptable to make a user with a misspelled query wait 1/2 second or so to (hopefully) get a correction, or if you want to simply reduce the maximum time someone will have to wait. If you find that it usually needs 10 tries to find a good collation, then you probably need to try a different distance algorithm, or play with the various accuracy settings to see if you can get better corrections to be nearer the top of the individual-word lists. Also, try setting alternativeTermCount lower than count (maybe set atc to 1/2 of what you have count). This will reduce the number of terms it has to try combinations of. If you set maxResultsForSuggest to a lower value (like 2-3, maybe), then it won't try to return did-you-mean suggestions for queries returning (was it 35?!) hits. As I mentioned, SOLR-3240 does have promise of speeding this feature up so maybe we won't have to talk about these kinds of trade-offs so much in the future. James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: SandeepM [mailto:skmi...@hotmail.com] Sent: Friday, April 19, 2013 12:48 PM To: solr-user@lucene.apache.org Subject: RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times. James, Thanks for the reply. I see your point and sure enough, reducing maxCollationTries does reduce time, but it may not produce results. It seems like the time is taken by the collation re-runs. Is there any way we can activate caching for collations? The same query repeatedly takes the same amount of time. My query caches are activated, but I don't believe they get used for spellcheck. Thanks. -- Sandeep -- View this message in context: http://lucene.472066.n3.nabble.com/DirectSolrSpellChecker-vastly-varying-spellcheck-QTime-times-tp4057176p4057389.html Sent from the Solr - User mailing list archive at Nabble.com.
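[Editor's note] For reference, the knobs James mentions are ordinary spellcheck request parameters; they can be passed per query or set as handler defaults. A sketch of handler defaults with example values only (the numbers are not recommendations from this thread):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.alternativeTermCount">5</str>
    <str name="spellcheck.maxResultsForSuggest">3</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.maxCollationTries">5</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>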
Re: Updating clusterstate from the zookeeper
Right. I am wondering if/how we can download a specific file from the zookeeper, modify it and then upload to rewrite it. Anyone ? Thanks, Ming On Fri, Apr 19, 2013 at 10:53 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote: I would like to know the answer to this as well. Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Thu, Apr 18, 2013 at 8:15 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hello, After creating a distributed collection on several different servers I sometimes get to deal with failing servers (cores appear not available = grey) or failing cores (Down / unable to recover = brown / red). In case i wish to delete this errorneous collection (through collection API) only the green nodes get erased, leaving a meaningless unavailable collection in the clusterstate.json. Is there any way to edit explicitly the clusterstate.json? If not, how do i update it so the collection as above gets deleted? Cheers, Manu
Re: Updating clusterstate from the zookeeper
I've used zookeeper's cli to do this. I doubt it's the right way and I have no idea if it'll work for clusterstate.json, but it seems to work for certain things.
cd /opt/zookeeper/bin
./zkCli.sh -server 127.0.0.1:2183 set /configs/collection1/schema.xml `cat /tmp/newschema.xml`
sleep 10 # give a lil time to get pushed out
curl http://localhost:8080/solr/admin/cores?wt=json&action=RELOAD&core=collection1
This is on zk 3.4.5 -- Nate Fox Sr Systems Engineer o: 310.658.5775 m: 714.248.5350 Follow us @NEOGOV http://twitter.com/NEOGOV and on Facebook http://www.facebook.com/neogov NEOGOV http://www.neogov.com/ is among the top fastest growing software companies in the USA, recognized by Inc 500|5000, Deloitte Fast 500, and the LA Business Journal. We are hiring! http://www.neogov.com/#/company/careers On Fri, Apr 19, 2013 at 11:30 AM, Mingfeng Yang mfy...@wisewindow.com wrote: Right. I am wondering if/how we can download a specific file from the zookeeper, modify it and then upload to rewrite it. Anyone ? Thanks, Ming On Fri, Apr 19, 2013 at 10:53 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote: I would like to know the answer to this as well. Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Thu, Apr 18, 2013 at 8:15 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hello, After creating a distributed collection on several different servers I sometimes get to deal with failing servers (cores appear not available = grey) or failing cores (Down / unable to recover = brown / red). In case i wish to delete this errorneous collection (through collection API) only the green nodes get erased, leaving a meaningless unavailable collection in the clusterstate.json. Is there any way to edit explicitly the clusterstate.json? If not, how do i update it so the collection as above gets deleted? Cheers, Manu
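[Editor's note] Along the same lines, for the original question of pulling down clusterstate.json, editing it, and pushing it back, the plain get/set commands would look roughly like this. This is an untested sketch, and hand-editing clusterstate.json is easy to get wrong, so treat it as a last resort:

./zkCli.sh -server 127.0.0.1:2183 get /clusterstate.json
# edit the JSON locally, then write it back
./zkCli.sh -server 127.0.0.1:2183 set /clusterstate.json "$(cat /tmp/clusterstate.json)"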
Weird query issues
Hello, We are using Solr 3.6.2 single core (both index and query on same machine) and randomly the server fails to query correctly. If we query from the admin console the query is not even applied and it returns a numFound count equal to the total docs in the index as if no query is made, and if we use SolrJ to query it throws a javabin error Invalid version (expected 2, but 60) or the data is not in 'javabin' format. Once we restart the container everything is back to normal. In the process of debugging the solr logs I found empty queries like the one below. Can anybody tell me what can cause empty queries in the log as given below? I am trying to see if it may be related to the Solr issues. [#|2013-04-19T14:10:20.308-0400|INFO|sun-appserver2.1.1|org.apache.solr.core.SolrCore|_ThreadID=19;_ThreadName=httpSSLWorkerThread-9001-0;|[core1] webapp=/solr path=/select params={} hits=21727 status=0 QTime=24 |#] Would appreciate any pointers. Thanks Ravi Kiran Bhaskar
Could not find an instance of QueryComponent. Disabling collation verification against the index.
Hi Team, I am trying to configure the Auto-suggest feature for the businessProvince field in my schema. I followed the instructions here: http://wiki.apache.org/solr/Suggester But then I got the following error: INFO: Could not find an instance of QueryComponent. Disabling collation verification against the index. Based on this forum (http://stackoverflow.com/questions/10547438/solr-returns-only-one-collation-for-suggester-component), I added a query component. So now all these queries work:
http://localhost:8983/solr/collection1/cityProvinceSuggest?q=AZ - Searches the default field
http://localhost:8983/solr/collection1/cityProvinceSuggest?q=businessProvince:AZ - Searches the businessProvince field
http://localhost:8983/solr/collection1/cityProvinceSuggest?q=businessCity:Phoenix - Searches the businessCity field
http://localhost:8983/solr/collection1/cityProvinceSuggest?q=name:Balaji - Searches the name field
So my question now is whether the field element is honored? Because holding all the data in the lookup data structure may cause memory issues. Any help will be appreciated.
<searchComponent class="solr.SpellCheckComponent" name="suggest">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">businessProvince</str>
    <float name="threshold">0.005</float>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
Thanks, Balaji -- View this message in context: http://lucene.472066.n3.nabble.com/Could-not-find-an-instance-of-QueryComponent-Disabling-collation-verification-against-the-index-tp4057417.html Sent from the Solr - User mailing list archive at Nabble.com.
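[Editor's note] For context, the /cityProvinceSuggest handler implied by the URLs above would typically be wired to the component roughly like this. This is only a sketch following the Suggester wiki pattern; the defaults shown are assumptions, not taken from the original post:

<requestHandler name="/cityProvinceSuggest" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
    <str>query</str>
  </arr>
</requestHandler>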
Re: solr-cloud performance decrease day by day
How many segments does each shard have, and what is the reason for running multiple shards on one machine? Alex. -Original Message- From: qibaoyuan qibaoy...@gmail.com To: solr-user solr-user@lucene.apache.org Sent: Fri, Apr 19, 2013 12:26 am Subject: Re: solr-cloud performance decrease day by day there are 6 shards and they are in one machine,and the jvm param is very big,the physical memory is 16GB,the total #docs is about 150k,the index size of each shard is about 1GB.AND there is indexing while searching,I USE auto commit each 10min.and the data comes about 100 per minutes. On 2013-4-19, at 3:17 PM, Furkan KAMACI furkankam...@gmail.com wrote: Could you give more info about your index size and technical details of your machine? Maybe you are indexing more data day by day and your RAM capability is not enough anymore? 2013/4/19 qibaoyuan qibaoy...@gmail.com Hello, i am using sold 4.1.0 and ihave used sold cloud in my product.I have found at first everything seems good,the search time is fast and delay is slow,but it becomes very slow after days.does any one knows if there maybe some params or optimization to use sold cloud?
Re: facet.method enum vs fc
: Thanks for your kind reply. The problem is solved with sharding and using : facet.method=enum. I am curious about what the difference is between enum : and fc, so that enum works but fc does not. Do you know something about : this? method=fc/fcs uses the field caches (or uninverted fields if they are multivalued) to build a large data structure that is reusable across many requests and allows faceting to happen very quickly even when the number of terms is large. enum causes solr to walk the term enum for the field and generate a DocSet for each term which is then intersected with the main results -- basically doing facet.field just like facet.query with simple term queries. these DocSets from using facet.method=enum will be cached in the filterCache, so there is some performance savings there if/when people filter on these facet constraints, but the regular rules about cache evictions apply. So in a situation where the heap size is big enough not to matter, method=fc should be faster and take up less ram than if you size your filterCache big enough to hold all of the DocSets involved if you use method=enum to not have cache evictions. In most cases, the only motivation for using method=enum is if you know the cardinality of your set of constraints is relatively small and fixed (ie: there are only 50 states in the US, so you might find that faceting on a state field with method=enum is just as fast as using method=fc and takes less ram -- this is why boolean fields default to method=enum, the cardinality is guaranteed to be 2). But in some less common cases, you might care more about saving ram than speed, or you might be trying to facet on a huge index with fields containing lots of terms (ie: full text) so that method=fc just won't work with any conceivable amount of ram, so it could make sense to use method=enum with the filterCache disabled. -Hoss
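[Editor's note] For reference, the method is chosen per request and can also be set per field; for example, with the field name assumed from this thread:

...&facet=true&facet.field=url&facet.method=enum
...&facet=true&facet.field=url&f.url.facet.method=enum&facet.enum.cache.minDf=100

The second form shows the per-field override syntax; facet.enum.cache.minDf is the usual way to keep enum faceting from pushing every term's DocSet into the filterCache (terms with a document frequency below the threshold skip the cache).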
Re: Update Request Processor Chains
: I am trying to understand update request processor chains. Do they run one : by one when indexing a document? Can I identify multiple update request : processor chains? Also what are LogUpdateProcessorFactory and : RunUpdateProcessorFactory? http://wiki.apache.org/solr/UpdateRequestProcessor solrconfig.xml files can contain any number of UpdateRequestProcessorChains... Once one or more update chains are defined, you may select one on the update request through the parameter update.chain https://lucene.apache.org/solr/4_2_1/solr-core/org/apache/solr/update/processor/LogUpdateProcessorFactory.html This keeps track of all commands that have passed through the chain and prints them on finish(). At the Debug (FINE) level, a message will be logged for each command prior to the next stage in the chain. https://lucene.apache.org/solr/4_2_1/solr-core/org/apache/solr/update/processor/RunUpdateProcessorFactory.html Executes the update commands using the underlying UpdateHandler. Almost all processor chains should end with an instance of RunUpdateProcessorFactory unless the user is explicitly executing the update commands in an alternative custom UpdateRequestProcessorFactory -Hoss
Re: Searching
Thanks. I was expecting an answer that could help me choose analyzers or tokenizers. Any help for any of the scenarios? -- View this message in context: http://lucene.472066.n3.nabble.com/Searching-tp4057328p4057465.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Update Request Processor Chains
Thanks for detailed answers. 2013/4/19 Chris Hostetter hossman_luc...@fucit.org : I am trying to understand update request processor chains. Do they runs one : by one when indexing a ducument? Can I identify multiple update request : processor chains? Also what are that LogUpdateProcessorFactory and : RunUpdateProcessorFactory? http://wiki.apache.org/solr/UpdateRequestProcessor solrconfig.xml files can contain any number of UpdateRequestProcessorChains... Once one or more update chains are defined, you may select one on the update request through the parameter update.chain https://lucene.apache.org/solr/4_2_1/solr-core/org/apache/solr/update/processor/LogUpdateProcessorFactory.html This keeps track of all commands that have passed through the chain and prints them on finish(). At the Debug (FINE) level, a message will be logged for each command prior to the next stage in the chain. https://lucene.apache.org/solr/4_2_1/solr-core/org/apache/solr/update/processor/RunUpdateProcessorFactory.html Executes the update commands using the underlying UpdateHandler. Allmost all processor chains should end with an instance of RunUpdateProcessorFactory unless the user is explicitly executing the update commands in an alternative custom UpdateRequestProcessorFactory -Hoss
Re: Weird query issues
On 4/19/2013 12:55 PM, Ravi Solr wrote: We are using Solr 3.6.2 single core ( both index and query on same machine) and randomly the server fails to query correctly. If we query from the admin console the query is not even applied and it returns numFound count equal to total docs in the index as if no query is made, and if use SOLRJ to query it throws javabin error Invalid version (expected 2, but 60) or the data in not in 'javabin' format The UI problem is likely a browser issue, but I could be wrong. Some browsers, IE in particular, but not limited to that one, have problems with the admin UI. Using a different browser or clearing the browser cache can sometimes fix those problems. As for SolrJ, are you using a really old (1.x) SolrJ with Solr 3.6.2? Have you ever had Solr 1.x running on the same machine that's now running 3.6.2? Because the javabin version changed between 1.4.1 and 3.1.0, SolrJ 1.x is not compatible with Solr 3.1 and later unless you set the response parser on the server object to XML before you try to use it. If you have upgraded Solr from an old version, your servlet container (sun-appserver) may have some of the old jars remaining from the 1.x install. They must be removed. To change your SolrJ to use the XML response parser, use code like the following: server.setParser(new XMLResponseParser()); When SolrJ and Solr are both version 3.x or 4.x, you can remove this line. Another way that you can get the javabin error is when Solr is returning an error response, or returning a response that is not an error but is an HTML response reporting an unusual circumstance rather than the usual javabin. These HTML responses should no longer exist in the newest versions of Solr. Do you see any errors or warnings in your server log? The server log line you included in your email is not an error. Thanks, Shawn
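[Editor's note] A minimal SolrJ sketch of the XML-parser workaround Shawn describes; the URL and core name are placeholders, and the class names are from SolrJ 3.6/4.x (an old 1.x client would use CommonsHttpSolrServer instead). It only makes sense while client and server javabin versions differ:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.impl.XMLResponseParser;
import org.apache.solr.client.solrj.response.QueryResponse;

public class XmlParserWorkaround {
    public static void main(String[] args) throws Exception {
        // point at the core that fails with "Invalid version (expected 2, but 60)"
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/core1");
        // request XML responses instead of javabin so mismatched versions can still talk
        server.setParser(new XMLResponseParser());
        QueryResponse rsp = server.query(new SolrQuery("*:*"));
        System.out.println("numFound: " + rsp.getResults().getNumFound());
    }
}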
external values source
I need some explanation on how ValuesSource and related classes work. There is already an implemented ExternalFileField, and an example of how to load data from a database (http://sujitpal.blogspot.com/2011/05/custom-sorting-in-solr-using-external.html). But they all fetch ALL data into memory, which may consume large amounts of this resource. Also, documents are referenced by the 'doc' integer value. My questions:
1) Is the 'doc' value pointing to a document in the whole index? If so, how do I get the value of such a document's field (for example, a field named 'id')?
2) Is there a possibility to create a ValuesSource, FieldType (or similar interface which will provide external data to sort on and return in query results) which will work only on some subset of documents and use external source capabilities to fetch document-related data?
3) How does it all work (memory consumption, hashtable access speed, etc.) when there are a lot of documents in the index (tens of millions, for example)?
4) Are there any other examples of loading external data from a database (I want to have a numerical 'rate' from a simple table having two columns: 'document unique key' string, 'rate' integer/float) which are not just proofs of concept but real-life examples?
Any help and hints appreciated TIA -- Maciek
Rogue query killed several replicas with OOM, after recovering - match all docs query problem
We had a rogue query take out several replicas in a large 4.2.0 cluster today, due to OOM's (we use the JVM args to kill the process on OOM). After recovering, when I execute the match all docs query (*:*), I get a different count each time. In other words, if I execute q=*:* several times in a row, then I get a different count back for numDocs. This was not the case prior to the failure as that is one thing we monitor for. I think I should be worried ... any ideas on how to troubleshoot this? One thing to mention is that several of my replicas had to do full recoveries from the leader when they came back online. Indexing was happening when the replicas failed. Thanks. Tim
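[Editor's note] One low-cost way to narrow this down (a suggestion, not something from the original report) is to compare per-replica counts directly, bypassing the distributed query path with distrib=false, e.g.:

http://host:8983/solr/collection1/select?q=*:*&rows=0&distrib=false

Running that against each core of a shard shows whether replicas of the same shard report different numDocs; if they do, the fluctuating totals on q=*:* come from the load balancer hitting divergent replicas, which would point back at the recoveries that happened after the OOMs.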
Re: external values source
Hi Maciek, I think a custom ValueSource is definitely what you want because you need to compute some derived value based on an indexed field and some external value. The trick is figuring how to make the lookup to the external data very, very fast. Here's a rough sketch of what we do: We have a table in a database that contains a numeric value for a user and an organization, such as query: select num from table where userId='bob' and orgId=123 (similar to what you stated in question #4) On the Solr side, documents are indexed with user_id_s field, which is half of what I need to do my lookup. The orgId is determined by the Solr client at query construction time, so is passed to my custom ValueSource (aka function) in the query. In our app, users can be associated with many different orgIds and changes frequently so we can't index the association. To do the lookup to the database, we have a custom ValueSource, something like: dbLookup(user_id_s, 123) (note: user_id_s is the name of the field holding my userID values in the index and 123 is the orgId) Behind the scenes, the ValueSource will have access to the user_id_s field values using FieldCache, something like: final BinaryDocValues dv = FieldCache.DEFAULT.getTerms(reader.reader(), user_id_s); This gives us fast access to the user_id_s value for any given doc (question #1 above) So now we can return an IntDocValues instance by doing: @Override public FunctionValues getValues(Map context, AtomicReaderContext reader) throws IOException { final BytesRef br = new BytesRef(); final BinaryDocValues dv = FieldCache.DEFAULT.getTerms(reader.reader(), fieldName); return new IntDocValues(this) { @Override public int intVal(int doc) { dv.get(doc,br); if (br.length == 0) return 0; final String user_id_s = br.utf8ToString(); // the indexed userID for doc int val = 0; // todo: do custom lookup with orgID and user_id_s to compute int value for doc return val; } } ... } In this code, fieldName is set in the constructor (not shown) by parsing it out of the parameters, something like: this.fieldName = ((org.apache.solr.schema.StrFieldSource)source).getField(); The user_id_s field comes into your ValueSource as a StrFieldSource (or whatever type you use) ... here is how the ValueSource gets constructed at query time: public class MyValueSourceParser extends ValueSourceParser { public void init(NamedList namedList) {} public ValueSource parse(FunctionQParser fqp) throws SyntaxError { return new MyValueSource(fqp.parseValueSource(), fqp.parseArg()); } } There is one instance of your ValueSourceParser created per core. The parse method gets called for every query that uses the ValueSource. At query time, I might use the ValueSource to return this computed value in my fl list, such as: fl=id,looked_up:dbLookup(user_id_l,123),... Or to sort by: sort=dbLookup(user_id_s,123) desc The data in our table doesn't change that frequently, so we export it to a flat file in S3 and our custom ValueSource downloads from S3, transforms it into an in-memory HashMap for fast lookups. We thought about just issuing a query to load the data from the db directly but we have many nodes and the query is expensive and result set is large so we didn't want to hammer our database with N Solr nodes querying for the same data at roughly the same time. So we do it once and post the compressed results to a shared location. The data in the table is sparse as compared to the number of documents and userIds we have. We simply poll S3 for changes every few minutes, which is good enough for us. 
This happens from many nodes in a large Solr Cloud cluster running in EC2 so S3 works well for us as a distribution mechanism. Admittedly polling kind of sucks so we tried using Zookeeper to notify our custom watchers when a znode changes but a ValueSource doesn't get notified when a core is reloaded so we ended up having many weird issues with Zookeeper watchers in our custom ValueSource. For example, new ValueSourceParsers get created when a core is reloaded but the previous instance doesn't get notified that it's going out of service. So this gives you an idea of how we load external data into a fast lookup data structure in Solr (~question #2) When filtering, we use PostFilter to tell Solr that our filter is expensive so should be applied last (after all other criteria have run), something like: fq={!frange l=2 u=8 cost=200 cache=false}dbLookup(user_id_s,123) This computes a function range query using our custom ValueSource but tells Solr that it is expensive (any cost of 100 or more, with cache=false, marks it as a post filter) so apply it after all other filters have been applied. http://yonik.wordpress.com/tag/post-filter/ Lastly, as for speed, the user_id_s field gets loaded into FieldCache and the lookup
RE: SolrCloud loadbalancing, replication, and failover
Again, thank you for this incredible information, I feel on much firmer footing now. I'm going to test distributing this across 10 servers, borrowing a Hadoop cluster temporarily, and see how it does with enough memory to have the whole index cached. But I'm thinking that we'll try the SSD route as our index will probably rest in the 1/2 terabyte range eventually, there's still a lot of active development. I guess the RAM disk would work in our case also, as we only index in batches, and eventually I'd like to do that off of Solr and just update the index (I'm presuming this is doable in solr cloud, but I haven't put it to task yet). If I could purpose Hadoop to index the shards, that would be ideal, though I haven't quite figured out how to go about it yet. David -Original Message- From: Shawn Heisey [mailto:s...@elyograg.org] Sent: Friday, April 19, 2013 9:42 PM To: solr-user@lucene.apache.org Subject: Re: SolrCloud loadbalancing, replication, and failover On 4/19/2013 3:48 AM, David Parks wrote: The Physical Memory is 90% utilized (21.18GB of 23.54GB). Solr has dark grey allocation of 602MB, and light grey of an additional 108MB, for a JVM total of 710MB allocated. If I understand correctly, Solr memory utilization is *not* for caching (unless I configured document caches or some of the other cache options in Solr, which don't seem to apply in this case, and I haven't altered from their defaults). Right. Solr does have caches, but they serve specific purposes. The OS is much better at general large-scale caching than Solr is. Solr caches get cleared (and possibly re-warmed) whenever you issue a commit on your index that makes new documents visible. So assuming this box was dedicated to 1 solr instance/shard. What JVM heap should I set? Does that matter? 24GB JVM heap? Or keep it lower and ensure the OS cache has plenty of room to operate? (this is an Ubuntu 12.10 server instance). The JVM heap to use is highly dependent on the nature of your queries, the number of documents, the number of unique terms, etc. The best thing to do is try it out with a relatively large heap, see how much memory actually gets used inside the JVM. The jvisualvm and jconsole tools will give you nice graphs of JVM memory usage. The jstat program will give you raw numbers on the commandline that you'll need to add to get the full picture. Due to the garbage collection model that Java uses, what you'll see is a sawtooth pattern - memory usage goes up to max heap, then garbage collection reduces it to the actual memory used. Generally speaking, you want to have more heap available than the low point of that sawtooth pattern. If that low point is around 3GB when you are hitting your index hard with queries and updates, then you would want to give Solr a heap of 4 to 6 GB. Would I be wise to just put the index on a RAM disk and guarantee performance? Assuming I installed sufficient RAM? A RAM disk is a very good way to guarantee performance - but RAM disks are ephemeral. Reboot or have an OS crash and it's gone, you'll have to reindex. Also remember that you actually need at *least* twice the size of your index so that Solr (Lucene) has enough room to do merges, and the worst-case scenario is *three* times the index size. Merging happens during normal indexing, not just when you optimize. If you have enough RAM for three times your index size and it takes less than an hour or two to rebuild the index, then a RAM disk might be a viable way to go. I suspect that this won't work for you. Thanks, Shawn
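[Editor's note] For reference, a typical way to watch the sawtooth Shawn describes from the command line (assuming the JDK tools are installed and <pid> is the Solr JVM's process id):

jstat -gcutil <pid> 5000

This prints heap-generation utilization every 5 seconds; the old-generation low points right after full collections approximate the "real" heap in use, which is the number to size the heap headroom against.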
Re: Pros and cons of using RAID or different RAIDS?
Yeah, but as far as I know, there is nothing Solr-specific about that. See http://www.acnc.com/raid Otis -- Solr ElasticSearch Support http://sematext.com/ On Fri, Apr 19, 2013 at 11:19 AM, Furkan KAMACI furkankam...@gmail.com wrote: Is there any documentation that explains pros and cons of using RAID or different RAIDS?
Re: Import in Solr
On 19 April 2013 19:50, hassancrowdc hassancrowdc...@gmail.com wrote: I want to update(delta-import) one specific item. Is there any query to do that? No. Regards, Gora