Re: Solr 3.6 issue - DataImportHandler with CachedSqlEntityProcessor not importing all multi-valued fields
It's hard to troubleshoot without debug logs. Please note that the regular configuration for CachedSqlEntityProcessor is slightly different; see the where="xid=x.id" attribute at http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor

On Wed, Jun 27, 2012 at 2:29 AM, ps_sra praveens1...@yahoo.com wrote:

Not sure if this is the right forum to post this question. If not, please excuse me. I'm trying to use the DataImportHandler with processor="CachedSqlEntityProcessor" to speed up import from an RDBMS. While processor="CachedSqlEntityProcessor" is much faster than processor="SqlEntityProcessor", the resulting Solr index does not contain the multi-valued fields on sub-entities. So, for example, my db-data-config.xml has the following structure:

  <document>
    ...
    <entity name="foo" pk="id" processor="SqlEntityProcessor"
            query="SELECT f.id AS foo_id, f.name AS foo_name FROM foo f">
      <field column="foo_id" name="foo_id" />
      <field column="foo_name" name="foo_name" />
      <entity name="bar" processor="CachedSqlEntityProcessor"
              query="SELECT b.name AS bar_name FROM bar b WHERE b.id = '${foo.id}'">
        <field column="bar_name" name="bar_name" />
      </entity>
    </entity>
    ...
  </document>

where the database relationship foo:bar is 1:m. The issue is that when I import with processor="SqlEntityProcessor", everything works fine and the multi-valued field bar_name has multiple values, while importing with processor="CachedSqlEntityProcessor" does not even create the bar_name field in the index. I've deployed Solr 3.6 on Weblogic 11g, with the patch https://issues.apache.org/jira/browse/SOLR-3360 applied. Any help on this issue is appreciated. Thanks, ps

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-3-6-issue-DataImportHandler-with-CachedSqlEntityProcessor-not-importing-all-multi-valued-fields-tp3991449.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
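For reference, a minimal sketch of the cached variant in the wiki's where= style: the sub-entity selects all bar rows in one query, and the where= attribute does the in-memory lookup against the parent row. The foreign-key alias bar_fk and the assumption that b.id holds the parent foo id are taken from the config above and should be adjusted to the real schema:

  <entity name="bar" processor="CachedSqlEntityProcessor"
          query="SELECT b.id AS bar_fk, b.name AS bar_name FROM bar b"
          where="bar_fk=foo.foo_id">
    <field column="bar_name" name="bar_name" />
  </entity>

Note the lookup references foo.foo_id because the parent query aliases f.id to foo_id; if the resolved variable in your setup is ${foo.id} instead, use that.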
Re: Multi-thread UpdateProcessor
Okay, why do you think this idea is not worth looking at?

On Fri, Jul 6, 2012 at 12:53 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Hello, Most of the time when single-thread streaming http://wiki.apache.org/solr/Solrj#Streaming_documents_for_an_update is used, I see a lack of CPU utilization on the Solr server. A reasonable move is to utilize more threads to index faster, but that requires a more complicated client side. I propose to employ a special update processor which can fork the stream processing onto many threads. If you like it, please vote for https://issues.apache.org/jira/browse/SOLR-3585 . Regards

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: Multi-thread UpdateProcessor
Some benchmarks have been added; please check the JIRA.

On Fri, Jul 6, 2012 at 11:13 PM, Dmitry Kan dmitry@gmail.com wrote:

Mikhail, you have my +1 and a jira comment :) // Dmitry

On Fri, Jul 6, 2012 at 7:41 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Okay, why do you think this idea is not worth looking at?

On Fri, Jul 6, 2012 at 12:53 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Hello, Most of the time when single-thread streaming http://wiki.apache.org/solr/Solrj#Streaming_documents_for_an_update is used, I see a lack of CPU utilization on the Solr server. A reasonable move is to utilize more threads to index faster, but that requires a more complicated client side. I propose to employ a special update processor which can fork the stream processing onto many threads. If you like it, please vote for https://issues.apache.org/jira/browse/SOLR-3585 . Regards

--
Regards, Dmitry Kan

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: Searching for sentences containing a list of words with a configurable number of words not in the list in between?
Welcome! A few points:
- Did you choose the right mailing list? (Let me reply to the other one.)
- Have you checked http://lucene.apache.org/core/3_6_0/queryparsersyntax.html#Proximity%20Searches ?
- The same in the Lucene queries API is http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/PhraseQuery.html and http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/spans/SpanNearQuery.html
- It seems to me you should familiarize yourself with explain soon: http://wiki.apache.org/solr/SolrRelevancyFAQ#Why_does_id:archangel_come_before_id:hawkgirl_when_querying_for_.22wings.22

Regards

On Mon, Jul 9, 2012 at 10:28 PM, Svetlana mailingli...@dswp.co.uk wrote:

Hi, I am just about to work through the demo and get to know Lucene, now that I actually got it to build :) I was wondering if someone could point me in the right direction for my project. I want to query using a list of words, but the order they appear in and how common they are is not relevant (i.e. no 'stop words', if I got that terminology correct). The only relevant things are how closely grouped they are and how many of the words in the list occur, and I want to be able to configure from 0 (no other non-queried words in between) up to 'n' non-queried words in between. So for example, if I query for 'a and in house I go together or' (a stupid example, I guess) and specify 0 words in between, then I would only want to get hits with those query words in any order, sorted by relevance based on how many of those words occurred. For example: 'In a house together' may be the most relevant result. If I specify 1 other non-query word allowed, results may look like: 1. 'In a house together.' 2. 'In a house sleeping together.' ('sleeping' being the one extra word allowed). These should also be complete sentences or clauses, i.e. not 'fragments'; I guess I need to use a grammar analyser to determine that. Any help very much appreciated. I realise that this is probably deceptively difficult, but if anyone can give some pointers that would be amazing. Svetlana

--
View this message in context: http://lucene.472066.n3.nabble.com/Searching-for-sentences-containing-a-list-of-words-with-a-configurable-number-of-words-not-in-the-li-tp3993981.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
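A minimal Lucene 3.6 sketch of the SpanNearQuery approach mentioned above (field name and terms are made up for illustration): slop is the number of intervening positions allowed, and inOrder=false ignores word order.

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.spans.SpanNearQuery;
  import org.apache.lucene.search.spans.SpanQuery;
  import org.apache.lucene.search.spans.SpanTermQuery;

  public class NearQueryExample {
      public static void main(String[] args) {
          // Match "house" and "together" in any order with at most
          // one non-query word between them (slop = 1).
          SpanQuery[] clauses = new SpanQuery[] {
              new SpanTermQuery(new Term("body", "house")),
              new SpanTermQuery(new Term("body", "together"))
          };
          SpanNearQuery query = new SpanNearQuery(clauses, 1, false);
          System.out.println(query);
      }
  }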
Re: Query for records that have more than N values in a multi-valued field
Hello Alexandre, Some time ago I wanted to contribute this: http://mail-archives.apache.org/mod_mbox/lucene-dev/201203.mbox/%3ccangii8dukawp7mt1xqrjb5axdqptm5r4z+yzplfc7ptywsq...@mail.gmail.com%3E

On Mon, Jul 23, 2012 at 7:05 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

Hello, I have a multivalued field and I want to find records that have (for example) at least 3 values in that list. Is there an easy way to do it? Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: Bulk indexing data into solr
Right on time, guys: https://issues.apache.org/jira/browse/SOLR-3585 is a server-side update-processing fork. It does its best to halt processing when an exception occurs. Plug in this UpdateProcessor and specify the number of threads, then submit a lazy iterator to StreamingUpdateSolrServer on the client side. PS: Don't do either of the following: send many, many docs one by one, or instantiate a huge ArrayList of SolrInputDocument on the client side.

On Thu, Jul 26, 2012 at 7:46 PM, Shawn Heisey s...@elyograg.org wrote:

On 7/26/2012 7:34 AM, Rafał Kuć wrote: If you use Java (and I think you do, because you mention Lucene) you should take a look at StreamingUpdateSolrServer. It not only allows you to send data in batches, but also to index using multiple threads.

A caveat to what Rafał said: The streaming object has no error detection out of the box. It queues everything up internally and returns immediately. Behind the scenes, it uses multiple threads to send documents to Solr, but any errors encountered are simply sent to the logging mechanism, then ignored. When you use HttpSolrServer, all errors encountered will throw exceptions, but you have to wait for completion. If you need both concurrent capability and error detection, you would have to manage multiple indexing threads yourself. Apparently there is a method in the concurrent class that you can override to handle errors differently, though I have not seen how to write code so your program would know that an error occurred. I filed an issue with a patch to solve this, but some of the developers have come up with an idea that might be better. None of the ideas have been committed to the project. https://issues.apache.org/jira/browse/SOLR-3284 Just an FYI, the streaming class was renamed to ConcurrentUpdateSolrServer in Solr 4.0 Alpha. Both are available in 3.6.x. Thanks, Shawn

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
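A sketch of the client side in SolrJ 3.6 (the URL, document fields and count are placeholders): SolrServer.add(Iterator<SolrInputDocument>) streams documents as the iterator is consumed, so the client never buffers the whole collection in memory.

  import java.util.Iterator;
  import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BulkIndexer {
      public static void main(String[] args) throws Exception {
          // queueSize=100, threadCount=2 are illustrative values
          StreamingUpdateSolrServer server =
              new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 2);

          // A lazy, pulling iterator: documents are produced on demand.
          Iterator<SolrInputDocument> docs = new Iterator<SolrInputDocument>() {
              int i = 0;
              public boolean hasNext() { return i < 1000000; }
              public SolrInputDocument next() {
                  SolrInputDocument doc = new SolrInputDocument();
                  doc.addField("id", i++);
                  doc.addField("name", "doc-" + i);
                  return doc;
              }
              public void remove() { throw new UnsupportedOperationException(); }
          };

          server.add(docs);   // streams docs instead of one request per doc
          server.commit();
      }
  }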
Re: Bulk indexing data into solr
Coming back to your original question: I'm a little puzzled. It's not clear where you want to call the Lucene API from. If you mean that you have a standalone indexer which writes index files, then stops, and these files become available to the Solr process, it will work. Sharing an index between processes, or using EmbeddedSolrServer, is asking for trouble (despite Lucene having a lock mechanism, which I'm not completely aware of). I conclude that your data for indexing is co-located with the Solr server; in this case, consider http://wiki.apache.org/solr/ContentStream#RemoteStreaming Please give more details about your design.

On Thu, Jul 26, 2012 at 1:22 PM, Zhang, Lisheng lisheng.zh...@broadvision.com wrote:

Hi, I am starting to use Solr. Now I need to index a rather large amount of data, and it seems that calling Solr to pass data through HTTP is rather inefficient. I am thinking of still calling the Lucene API directly for bulk indexing but using Solr for search. Is this design OK? Thanks very much for the help, Lisheng

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: Bulk indexing data into solr
IIRC, a problem with such a scheme was discussed here about two months ago, but I can't remember the exact details. The scheme is generally correct, but you didn't tell how you let Solr know that it needs to reread the new index generation after the indexer fsyncs the segments file. Btw, there might be a possible issue: https://lucene.apache.org/core/old_versioned_docs//versions/3_0_1/api/all/org/apache/lucene/index/IndexWriter.html#commit() "Note that this operation calls Directory.sync on the index files. That call should not return until the file contents and metadata are on stable storage. For FSDirectory, this calls the OS's fsync. But, beware: some hardware devices may in fact cache writes even during fsync, and return before the bits are actually on stable storage, to give the appearance of faster performance." You should ensure that after the segments file is fsync'ed, all other index files are fsynced for other processes too. Could you tell more about your data: what's the format? Are the data located close to the indexer? And why can't you use remote streaming via Solr's update handler, or an indexer client app with StreamingUpdateSolrServer?

On Thu, Jul 26, 2012 at 10:47 PM, Zhang, Lisheng lisheng.zh...@broadvision.com wrote:

Hi, I think that at least before Lucene 4.0 we can only allow one process/thread to write to a Lucene folder. Based on this fact, my initial plan is: 1) There is one set of Lucene index folders. 2) The Solr server only performs queries on those folders. 3) A separate process (multi-threaded) indexes those Lucene folders (each folder is a separate app). Only one thread will index one given Lucene folder. Thanks very much for the help, Lisheng

-----Original Message-----
From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com] Sent: Thursday, July 26, 2012 10:15 AM To: solr-user@lucene.apache.org Subject: Re: Bulk indexing data into solr

Coming back to your original question: I'm a little puzzled. It's not clear where you want to call the Lucene API from. If you mean that you have a standalone indexer which writes index files, then stops, and these files become available to the Solr process, it will work. Sharing an index between processes, or using EmbeddedSolrServer, is asking for trouble (despite Lucene having a lock mechanism, which I'm not completely aware of). I conclude that your data for indexing is co-located with the Solr server; in this case, consider http://wiki.apache.org/solr/ContentStream#RemoteStreaming Please give more details about your design.

On Thu, Jul 26, 2012 at 1:22 PM, Zhang, Lisheng lisheng.zh...@broadvision.com wrote:

Hi, I am starting to use Solr. Now I need to index a rather large amount of data, and it seems that calling Solr to pass data through HTTP is rather inefficient. I am thinking of still calling the Lucene API directly for bulk indexing but using Solr for search. Is this design OK? Thanks very much for the help, Lisheng

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
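As a sketch of the hand-off (paths are placeholders, and whether an empty commit is enough for Solr to pick up externally written segments is an assumption worth verifying in your setup): the standalone indexer commits and closes its IndexWriter, then asks Solr to reopen a searcher.

  import java.io.File;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  public class StandaloneIndexer {
      public static void main(String[] args) throws Exception {
          FSDirectory dir = FSDirectory.open(new File("/path/to/solr/data/index"));
          IndexWriterConfig cfg = new IndexWriterConfig(
                  Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
          IndexWriter writer = new IndexWriter(dir, cfg);
          // ... writer.addDocument(...) for each record ...
          writer.commit();  // calls Directory.sync: segment files and segments_N hit disk
          writer.close();
          // After this, an empty commit against Solr, e.g.
          //   curl "http://localhost:8983/solr/update?commit=true"
          // should make it reopen a searcher on the new index generation.
      }
  }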
Re: Bulk Indexing
Lan, I assume that some particular server can freeze on such a bulk, but the overall message doesn't seem absolutely correct to me. Solr has a lot of mechanisms to survive such cases. Bulk indexing is absolutely right (if you submit a single request with a long iterator of SolrInputDocs). This indexing thread can occupy a single CPU core, keeping the others ready for searches. Such indexing occupies ramBufferSizeMB of heap; after the limit is exceeded, a new segment is flushed to disk, which requires some IO and can impact searchers (a misconfigured merge can ruin everything, of course). Commits should be executed for business reasons, not performance ones. A commit leads to creating a new searcher and warming it; these actions can be memory- and CPU-expensive (an almost single-threaded activity). I did some experiments on a 40M-doc index on a desktop box: constantly adding 1K docs/sec with autocommit at most once per minute doesn't have a significant impact on search latency. Generally, yes, a master-slave scheme gives more performance, for sure.

On Sat, Jul 28, 2012 at 4:01 AM, Lan dung@gmail.com wrote:

I assume you're indexing on the same server that is used to execute search queries. Adding 20K documents in bulk could cause the Solr server to 'stop the world', where the server would stop responding to queries. My suggestions are: - Set up master/slave to insulate your clients from 'stop the world' events during indexing. - Update in batches, with a commit at the end of each batch.

--
View this message in context: http://lucene.472066.n3.nabble.com/Bulk-Indexing-tp3997745p3997815.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
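The two knobs mentioned above live in solrconfig.xml. A sketch for Solr 3.x (the values are illustrative, not recommendations; in 4.x the buffer setting moves under <indexConfig>):

  <indexDefaults>
    <!-- flush a new segment once the in-memory buffer reaches this size -->
    <ramBufferSizeMB>128</ramBufferSizeMB>
  </indexDefaults>

  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- hard-commit at most once per minute -->
    <autoCommit>
      <maxTime>60000</maxTime>
    </autoCommit>
  </updateHandler>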
Re: Expression Sort in Solr
Hello, have you tried http://wiki.apache.org/solr/FunctionQuery/#if ?

On Mon, Jul 30, 2012 at 3:05 PM, lavesh lavesh.ra...@gmail.com wrote:

I am working on Solr for search. I need to perform an expression sort such that: say str = (IF(AVAILABLE IN (1,2,3), 100, IF(AVAILABLE IN (4,5,6), 80, 100)) + IF(PRICE > 1000, 70, 40)) and I need to order by (if(str > 100, 40 + str/40, 33 + str/33) + SOMEOTHERCOLUMN) DESC

--
View this message in context: http://lucene.472066.n3.nabble.com/Expression-Sort-in-Solr-tp3998050.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
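For what it's worth, a sketch of how part of such an expression might be written as a Solr function sort. This assumes single-valued numeric fields named available, price and someothercolumn, and that map()'s default argument may itself be a function (true in recent Solr versions); it approximates the pseudo-SQL rather than translating it exactly, and the parameter would need URL-encoding in a real request:

  sort=sum(map(available,1,3,100,map(available,4,6,80,100)),
           map(price,0,1000,40,70),
           someothercolumn) desc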
Re: Bulk Indexing
Usually, collecting the whole array hurts the client's JVM, while sending doc-by-doc bloats the server with a huge number of small requests. You just need to rewrite your code from the eager loop to a pulling iterator, so you can submit all docs via a single HTTP request: http://wiki.apache.org/solr/Solrj#Streaming_documents_for_an_update Then, if you aren't happy with low utilization due to using a single thread, post your problem and numbers here again. For master/slave configuration, see http://wiki.apache.org/solr/SolrReplication and http://lucidworks.lucidimagination.com/display/solr/Index+Replication

On Sat, Jul 28, 2012 at 11:21 PM, Sohail Aboobaker sabooba...@gmail.com wrote:

We have auto commit on and basically send documents in a loop: after validating each record, we send it to the search service, and keep doing that in a loop. Mikhail / Lan, are you suggesting that instead of sending them in a loop, we should collect them in an array and do a commit at the end? Is this better than doing it in a loop with auto commit? Also, where can I find some reference on master/slave configuration? Thanks.

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: Expression Sort in Solr
How exactly?

On Tue, Jul 31, 2012 at 1:19 PM, lavesh lavesh.ra...@gmail.com wrote:

Yes I have; it's not working as per my need.

--
View this message in context: http://lucene.472066.n3.nabble.com/Expression-Sort-in-Solr-tp3998050p3998310.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: Map Complex Datastructure with Solr
There is the possibility to create one's own FieldTypes, but I don't know if this is the answer to my issues...

2012/8/1 Jack Krupansky j...@basetechnology.com:

The general rule is to flatten the structures. You have a choice between sharing common fields between tables, such as title, or adding a prefix/suffix to qualify them, such as document_title vs. product_title. You also have the choice of storing different tables in separate Solr cores/collections, but then you have the burden of querying them separately and coordinating the separate results on your own. It all depends on your application. A lot hinges on: 1. How do you want to search the data? 2. How do you want to access the fields once the Solr documents have been identified by a query, such as fields to retrieve, join, etc. So, once the data is indexed, what are your requirements for accessing the data? E.g., some sample pseudo-queries and the fields you want to access.

-- Jack Krupansky

-----Original Message----- From: Thomas Gravel Sent: Wednesday, August 01, 2012 9:52 AM To: solr-user@lucene.apache.org Subject: Map Complex Datastructure with Solr

Hi, how can I map this complex data structure in Solr?

Document
- Groups
  - Group_ID
  - Group_Name
  - ...
- Title
- Chapter
  - Chapter_Title
  - Chapter_Content

Or

Product
- Groups
  - Group_ID
  - Group_Name
  - ...
- Title
- Articles
  - Article_ID
  - Article_Color
  - Article_Size

Thanks for the ideas

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
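To make the flattening advice concrete, here is one possible shape for the second structure as a single Solr document (the field names are invented, and the article_* fields would need multiValued="true" in schema.xml):

  <add>
    <doc>
      <field name="id">product-42</field>
      <field name="product_title">Example product</field>
      <field name="group_id">g1</field>
      <field name="group_name">Sofas</field>
      <!-- one value per article -->
      <field name="article_id">a1</field>
      <field name="article_color">red</field>
      <field name="article_size">L</field>
      <field name="article_id">a2</field>
      <field name="article_color">blue</field>
      <field name="article_size">M</field>
    </doc>
  </add>

Note that parallel multi-valued fields lose the pairing between an article's color and size; if that pairing matters at query time, one document per article (or a join) fits better.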
Re: Solr 4.0 - Join performance
Hello, You can check my record: https://issues.apache.org/jira/browse/SOLR-3076?focusedCommentId=13415644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13415644 I'm still working on precise performance measurements.

On Thu, Aug 2, 2012 at 6:45 PM, Eric Khoury ekhour...@hotmail.com wrote:

Hello all, I'm testing out the new join feature and hitting some perf issues, as described in Erick's article (http://architects.dzone.com/articles/solr-experimenting-join). Basically, I'm using 2 objects in Solr (this is a simplified view):

Item
- Id
- Name

Grant
- ItemId
- AvailabilityStartTime
- AvailabilityEndTime

Each item can have multiple grants attached to it. The query I'm using is the following, to find items by name, filtered by the grants' availability window:

solr/select?fq=Name:XXX&q={!join from=ItemId to=Id}AvailabilityStartTime:[* TO NOW] AND -AvailabilityEndTime:[* TO NOW]

With a hundred thousand items, this query can take multiple seconds to perform, due to the large number of ItemIds returned from the join query. Has anyone come up with a better way to use joins for these types of queries? Are there improvements planned in 4.0 RTM in this area? Btw, I've explored simply adding start-end times to items, but the flat data model makes it hard to maintain start-end pairs. Thanks for the help! Eric.

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: Solr 4.0 - Join performance
Eric, you can take the last patch from SOLR-3076: https://issues.apache.org/jira/secure/attachment/12536717/SOLR-3076.patch (16/Jul/12 21:16). You can also take it already applied from https://github.com/m-khl/solr-patches/tree/6611 , but the origin source code might be a little bit old. Regarding a nightly build, it's not so optimistic: I can't attract a committer to review it.

On Thu, Aug 2, 2012 at 11:51 PM, Eric Khoury ekhour...@hotmail.com wrote:

Wow, great work Mikhail, that's impressive. I don't currently build the dev tree; you wouldn't have a patch for the alpha build handy? If not, when do you think this'll be available in a nightly build? Thanks again, Eric.

From: mkhlud...@griddynamics.com Date: Thu, 2 Aug 2012 22:38:13 +0400 Subject: Re: Solr 4.0 - Join performance To: solr-user@lucene.apache.org

Hello, You can check my record: https://issues.apache.org/jira/browse/SOLR-3076?focusedCommentId=13415644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13415644 I'm still working on precise performance measurements.

On Thu, Aug 2, 2012 at 6:45 PM, Eric Khoury ekhour...@hotmail.com wrote:

Hello all, I'm testing out the new join feature and hitting some perf issues, as described in Erick's article (http://architects.dzone.com/articles/solr-experimenting-join). Basically, I'm using 2 objects in Solr (this is a simplified view): Item (Id, Name) and Grant (ItemId, AvailabilityStartTime, AvailabilityEndTime). Each item can have multiple grants attached to it. The query I'm using is the following, to find items by name, filtered by the grants' availability window: solr/select?fq=Name:XXX&q={!join from=ItemId to=Id}AvailabilityStartTime:[* TO NOW] AND -AvailabilityEndTime:[* TO NOW] With a hundred thousand items, this query can take multiple seconds to perform, due to the large number of ItemIds returned from the join query. Has anyone come up with a better way to use joins for these types of queries? Are there improvements planned in 4.0 RTM in this area? Btw, I've explored simply adding start-end times to items, but the flat data model makes it hard to maintain start-end pairs. Thanks for the help! Eric.

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: search hit on multivalued fields
Mark, It's not clear what you want to do. Let's say you requested rows=100 and found 1000 docs. What do you need to show in addition to the search results?
- the matched field for every one of the 100 snippets,
- or "400 matched F1 and 600 matched F2",
- or something else?

On Fri, Aug 3, 2012 at 6:41 PM, Jack Krupansky j...@basetechnology.com wrote:

You can include the fields in your fl list and then check those field values explicitly in the client, or you could add debugQuery=true to your request and check which field the term matched in. The latter requires that you have the analyzed term (or check for the closest matching term).

-- Jack Krupansky

-----Original Message----- From: Mark , N Sent: Friday, August 03, 2012 5:51 AM To: solr-user@lucene.apache.org Subject: search hit on multivalued fields

I have a multivalued field Text which is indexed. For example: F1: some value F2: some value Text = (content of F1, F2) When users search, I am checking only the Text field, but I would also need to display to users which field (F1 or F2) produced the search hit. Is this possible in Solr?

--
Thanks, Nipen Mark

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: Thread Blocking - Apache Solr 3.6.1
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

Locked ownable synchronizers:
- None

"1854553074@qtp-924653460-15" - Thread t@69
java.lang.Thread.State: BLOCKED
at java.util.logging.StreamHandler.publish(Unknown Source)
- waiting to lock <23efc88b> (a java.util.logging.ConsoleHandler) owned by "1462043760@qtp-924653460-20" t@77
at java.util.logging.ConsoleHandler.publish(Unknown Source)
at java.util.logging.Logger.log(Unknown Source)
at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:588)
at org.slf4j.impl.JDK14LoggerAdapter.info(JDK14LoggerAdapter.java:285)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1378)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

Locked ownable synchronizers:
- None

"440688079@qtp-924653460-8 - Acceptor1 SocketConnector@0.0.0.0:8983" - Thread t@26
java.lang.Thread.State: BLOCKED
at java.net.PlainSocketImpl.accept(Unknown Source)
- waiting to lock <5b5bd00c> (a java.net.SocksSocketImpl) owned by "370915326@qtp-924653460-9 - Acceptor0 SocketConnector@0.0.0.0:8983" t@27
at java.net.ServerSocket.implAccept(Unknown Source)
at java.net.ServerSocket.accept(Unknown Source)
at org.mortbay.jetty.bio.SocketConnector.accept(SocketConnector.java:99)
at org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:708)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

Locked ownable synchronizers:
- None

"1422284074@qtp-924653460-7 - Acceptor2 SocketConnector@0.0.0.0:8983" - Thread t@25
java.lang.Thread.State: BLOCKED
at java.net.PlainSocketImpl.accept(Unknown Source)
- waiting to lock <5b5bd00c> (a java.net.SocksSocketImpl) owned by "370915326@qtp-924653460-9 - Acceptor0 SocketConnector@0.0.0.0:8983" t@27
at java.net.ServerSocket.implAccept(Unknown Source)
at java.net.ServerSocket.accept(Unknown Source)
at org.mortbay.jetty.bio.SocketConnector.accept(SocketConnector.java:99)
at org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:708)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

Locked ownable synchronizers:
- None

--
View this message in context: http://lucene.472066.n3.nabble.com/Thread-Blocking-Apache-Solr-3-6-1-tp3999191.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: Is this too much time for full Data Import?
Hello, Does your indexer utilize CPU/IO? Check it with iostat/vmstat. If it doesn't, take several thread dumps with the jvisualvm sampler or jstack, and try to understand what blocks your threads from progressing. It might happen that you need to speed up your SQL data consumption; to do this, you can enable threads in DIH (only in 3.6.1) and move from N+1 SQL queries to a select-all/cache approach: http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor and https://issues.apache.org/jira/browse/SOLR-2382 Good luck

On Wed, Aug 8, 2012 at 9:16 AM, Pranav Prakash pra...@gmail.com wrote:

Folks, My full data import takes ~80 hrs. It has around ~9M documents and ~15 SQL queries for each document. The database servers are different from the Solr servers. Each document has an update processor chain which (a) calculates the signature of the document using SignatureUpdateProcessorFactory and (b) finds terms with term frequency > 2, using a custom processor. The index size is ~480GiB. I want to know if the amount of time taken is too large compared to the document count. How do I benchmark the stats, and what are some of the ways I can improve this? I believe there are some optimizations I could do at the UpdateProcessorFactory level as well. What would be a good way to get dirty on this?

Pranav Prakash

"temet nosce"

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
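A sketch of both suggestions in data-config.xml (table and column names are invented; the threads attribute exists in Solr 3.6.x only, and the where= cache syntax follows the wiki page above):

  <entity name="doc" threads="4" processor="SqlEntityProcessor"
          query="SELECT id, title FROM docs">
    <!-- one select-all query, cached and joined in memory,
         instead of one sub-query per parent row -->
    <entity name="tags" processor="CachedSqlEntityProcessor"
            query="SELECT doc_id, tag FROM tags"
            where="doc_id=doc.id">
      <field column="tag" name="tag"/>
    </entity>
  </entity>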
Re: Does Solr support 'Value Search'?
Hello, Have you checked http://lucidworks.lucidimagination.com/display/lweug/Wildcard+Queries ?

On Wed, Aug 8, 2012 at 12:56 AM, Bing Hua bh...@cornell.edu wrote:

Hi folks, Just wondering if there is a query handler that simply takes a query string and searches all/part of the fields for field values? e.g. q=*admin* The response may look like:

author: [admin, system_admin, sub_admin]
last_modifier: [admin, system_admin, sub_admin]
doctitle: [AdminGuide, AdminManual]

--
View this message in context: http://lucene.472066.n3.nabble.com/Does-Solr-support-Value-Search-tp3999654.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: Does Solr support 'Value Search'?
Ok. It seems to me you can configure http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory at index time to produce the term admin from all your docs above; after that you'll be able to match them with a simple term query. Is that what you are looking for?

On Wed, Aug 8, 2012 at 6:43 PM, Bing Hua bh...@cornell.edu wrote:

Thanks for the response, but wait... is it related to my question about searching for field values? I was not asking how to use wildcards, though.

--
View this message in context: http://lucene.472066.n3.nabble.com/Does-Solr-support-Value-Search-tp3999654p3999817.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: Does Solr support 'Value Search'?
Ok, this explanation is much clearer. Have you tried invoking http://wiki.apache.org/solr/TermsComponent/ against all the fields you need?

On Wed, Aug 8, 2012 at 10:56 PM, Bing Hua bh...@cornell.edu wrote:

I don't quite understand, but I'd explain the problem I have. The response would contain only fields and a list of field values that match the query. Essentially it's querying for field values rather than documents. The underlying use case would be: when typing in a quick search box, the drill-down menu may contain matches on authors, on doctitles, and potentially on other fields. Still, thanks for your response, and hopefully I'm making it clearer. Bing

--
View this message in context: http://lucene.472066.n3.nabble.com/Does-Solr-support-Value-Search-tp3999654p327.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
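For example, a TermsComponent request over the two fields from the thread (this assumes the /terms handler registration from the example solrconfig.xml; terms.fl may be repeated per field):

  http://localhost:8983/solr/terms?terms.fl=author&terms.fl=doctitle&terms.prefix=admin&terms.limit=10

The response is grouped per field, which gives a drill-down menu both the matching values and the field they came from.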
Re: Does Solr support 'Value Search'?
Sure. Lucene is a kind of column-oriented DB: if the same text occurs in two different fields, there is no relation between such terms, i.e. BRAND:RED vs COLOR:RED. The only thing I can suggest is to build a separate index (in a Solr core) with docs like

{token:RED; fields:{COLOR, BRAND, ...}}

or, taking your initial sample:

{token:admin; field:author; original_text:system_admin}
{token:admin; field:author; original_text:admin}
{token:admin; field:doctitle; original_text:AdminGuide}
...

Then you can search by token:admin and find the documents for such occurrences.

On Thu, Aug 9, 2012 at 10:50 PM, Bing Hua bh...@cornell.edu wrote:

Thanks Kuli and Mikhail, Using either the TermsComponent or the suggester I could get some suggested terms, but it still confuses me how to get the respective field names. To get that with the TermsComponent, I'll need to do a terms query against every possible field; similar with the SpellCheckComponent. copyField won't help since I want the original field name. Any suggestions? Bing

--
View this message in context: http://lucene.472066.n3.nabble.com/Does-Solr-support-Value-Search-tp3999654p4000267.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
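Concretely, the auxiliary core's documents could be posted like this (the schema fields token, field and original_text are the invented names from the sketch above):

  <add>
    <doc>
      <field name="id">author-admin-1</field>
      <field name="token">admin</field>
      <field name="field">author</field>
      <field name="original_text">system_admin</field>
    </doc>
    <doc>
      <field name="id">doctitle-admin-1</field>
      <field name="token">admin</field>
      <field name="field">doctitle</field>
      <field name="original_text">AdminGuide</field>
    </doc>
  </add>

A query of q=token:admin against this core then returns both the matching value and the field it came from.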
Re: Solr 4.0 - Join performance
Eric, Unfortunately the Solr guys ignore it.

On Tue, Aug 14, 2012 at 7:48 PM, Eric Khoury ekhour...@hotmail.com wrote:

Hi Mikhail, I was trying to figure out if SOLR-3076 made it into the beta, but since the issue is still marked as open, I take it it didn't yet? Thanks, Eric.

From: mkhlud...@griddynamics.com Date: Fri, 3 Aug 2012 00:06:36 +0400 Subject: Re: Solr 4.0 - Join performance To: ekhour...@hotmail.com; solr-user@lucene.apache.org

Eric, you can take the last patch from SOLR-3076: https://issues.apache.org/jira/secure/attachment/12536717/SOLR-3076.patch (16/Jul/12 21:16). You can also take it already applied from https://github.com/m-khl/solr-patches/tree/6611 , but the origin source code might be a little bit old. Regarding a nightly build, it's not so optimistic: I can't attract a committer to review it.

On Thu, Aug 2, 2012 at 11:51 PM, Eric Khoury ekhour...@hotmail.com wrote:

Wow, great work Mikhail, that's impressive. I don't currently build the dev tree; you wouldn't have a patch for the alpha build handy? If not, when do you think this'll be available in a nightly build? Thanks again, Eric.

From: mkhlud...@griddynamics.com Date: Thu, 2 Aug 2012 22:38:13 +0400 Subject: Re: Solr 4.0 - Join performance To: solr-user@lucene.apache.org

Hello, You can check my record: https://issues.apache.org/jira/browse/SOLR-3076?focusedCommentId=13415644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13415644 I'm still working on precise performance measurements.

On Thu, Aug 2, 2012 at 6:45 PM, Eric Khoury ekhour...@hotmail.com wrote:

Hello all, I'm testing out the new join feature and hitting some perf issues, as described in Erick's article (http://architects.dzone.com/articles/solr-experimenting-join). Basically, I'm using 2 objects in Solr (this is a simplified view): Item (Id, Name) and Grant (ItemId, AvailabilityStartTime, AvailabilityEndTime). Each item can have multiple grants attached to it. The query I'm using is the following, to find items by name, filtered by the grants' availability window: solr/select?fq=Name:XXX&q={!join from=ItemId to=Id}AvailabilityStartTime:[* TO NOW] AND -AvailabilityEndTime:[* TO NOW] With a hundred thousand items, this query can take multiple seconds to perform, due to the large number of ItemIds returned from the join query. Has anyone come up with a better way to use joins for these types of queries? Are there improvements planned in 4.0 RTM in this area? Btw, I've explored simply adding start-end times to items, but the flat data model makes it hard to maintain start-end pairs. Thanks for the help! Eric.

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: Diversifying Search Results - Custom Collector
Hello, I've got the problem description below. Can you explain the expected user experience and/or the solution approach before diving into the algorithm design? Thanks

On Sat, Aug 18, 2012 at 2:50 AM, Karthick Duraisamy Soundararaj karthick.soundara...@gmail.com wrote:

My problem is that when there are a lot of documents representing products, products from the same manufacturer appear in close proximity in the results, and therefore the results don't provide brand diversity. When you search for sofas, you get sofas from manufacturer A dominating the first page while sofas from manufacturer B dominate the second page, etc. The issue here is that a manufacturer tends to describe the different sofas he produces the same way, and therefore there is very little difference between the documents representing two sofas.

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: Diversifying Search Results - Custom Collector
Hello, I don't believe your task can be solved by playing with scoring/collectors or shuffling. For me it's absolutely a Grouping use case (although I don't really know this feature well).

"Grouping cannot solve the problem because I don't want to limit the number of results shown based on the grouping field."

I'm not really getting it: why can't you set the limit to 11 and just show labels like "[+] show 6 more results" or, if you get 11, "[+] show more than 10"? If your problem is constructing the search result page, I can suggest submitting a search request with rows=0&facet.field=BRAND; then your algorithm can choose the number of necessary items per brand and submit rows=X&fq=BRAND:Y. This gives you arbitrary sizes for the groups. Will this work for you?

On Mon, Aug 20, 2012 at 8:28 PM, Karthick Duraisamy Soundararaj d.s.karth...@gmail.com wrote:

Tanguy, Your idea is perfect for cases with too many documents where 80-90% of documents have the same value for a particular field. As an example, your idea is ideal if, let's say, we have 10 documents in total like this:

doc1: <merchantName>Kellog's</merchantName>
doc2: <merchantName>Kellog's</merchantName>
doc3: <merchantName>Kellog's</merchantName>
doc4: <merchantName>Kellog's</merchantName>
doc5: <merchantName>Kellog's</merchantName>
doc6: <merchantName>Kellog's</merchantName>
doc7: <merchantName>Kellog's</merchantName>
doc8: <merchantName>Nestle</merchantName>
doc9: <merchantName>Kellog's</merchantName>
doc10: <merchantName>Kellog's</merchantName>

But I have:

doc1: <merchantName>Maggi</merchantName>
doc2: <merchantName>Maggi</merchantName>
doc3: <merchantName>M&M's</merchantName>
doc4: <merchantName>M&M's</merchantName>
doc5: <merchantName>Hershey's</merchantName>
doc6: <merchantName>Hershey's</merchantName>
doc7: <merchantName>Nestle</merchantName>
doc8: <merchantName>Nestle</merchantName>
doc9: <merchantName>Kellog's</merchantName>
doc10: <merchantName>Kellog's</merchantName>

Thanks, Karthick

On Mon, Aug 20, 2012 at 12:01 PM, Tanguy Moal tanguy.m...@gmail.com wrote:

Hello, I don't know if that could help, but if I understood your issue, you have a lot of documents with the same or very close scores. Moreover, I think you get your matches in merchant order (more or less) because they must be indexed in that very same order, so Solr returns documents with the same score in insertion order (although there is no contract specifying this). You could work around that issue by: 1/ Turning off tf/idf, because you're searching in documents with little text where only the match counts and frequencies obviously aren't helping. 2/ Adding a random number to each document at index time and boosting on that random value at query time; this will shuffle your results and is probably the simplest thing to do. Hope this helps, Tanguy

2012/8/20 Karthick Duraisamy Soundararaj d.s.karth...@gmail.com

Hello Mikhail, Thank you for the reply. In terms of user experience, I want to spread the products from the same brand farther from each other, at least in the first 50-100 results we display. I am thinking about two different approaches as solutions: 1. For the first few results, display one top-scoring product per manufacturer (for a given field, display the top-scoring results of the unique field values for the first N matches). This N could be either a percentage relative to the total matches or a configurable absolute value. 2. Enforce a penalty on the score for results that have duplicate field values. The penalty can be enforced in such a way that results with higher scores will not be affected, as opposed to the ones with lower scores. Both solutions can be implemented while sorting the documents with TopFieldCollector / TopScoreDocCollector. Does this answer your question? Please let me know if you have any more questions. Thanks, Karthick

On Mon, Aug 20, 2012 at 3:26 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Hello, I've got the problem description below. Can you explain the expected user experience and/or the solution approach before diving into the algorithm design? Thanks

On Sat, Aug 18, 2012 at 2:50 AM, Karthick Duraisamy Soundararaj karthick.soundara...@gmail.com wrote:

My problem is that when there are a lot of documents representing products, products from the same manufacturer appear in close proximity in the results, and therefore the results don't provide brand diversity. When you search for sofas, you get sofas from manufacturer A dominating the first page while sofas from manufacturer B dominate the second page, etc. The issue here is that a manufacturer tends to describe the different sofas he produces the same way, and therefore there is very little difference between the documents representing two sofas.

--
Sincerely yours
Mikhail Khludnev
Tech Lead
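A sketch of the two-pass approach suggested above (core path, field name and brand values are placeholders):

  # pass 1: per-brand counts only, no documents
  /solr/select?q=sofa&rows=0&facet=true&facet.field=brand

  # pass 2: the page-building code picks how many docs it wants per brand
  /solr/select?q=sofa&rows=3&fq=brand:"Manufacturer A"
  /solr/select?q=sofa&rows=2&fq=brand:"Manufacturer B"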
Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?
Tom, Feel free to look at my benchmark results for two alternative joining approaches: http://blog.griddynamics.com/2012/08/block-join-query-performs.html Regards

On Thu, Aug 23, 2012 at 4:40 PM, Erick Erickson erickerick...@gmail.com wrote:

Tom: I think my comments were that grouping on a field where there is a unique value _per document_ chewed up a lot of resources. Conceptually, there's a bucket for each unique group value. And grouping on a file path is just asking for trouble. But the memory used for grouping should max out as a function of the unique values in the grouped field. Best, Erick

On Wed, Aug 22, 2012 at 11:32 PM, Lance Norskog goks...@gmail.com wrote:

Yes, distributed grouping works, but grouping takes a lot of resources. If you can avoid it in distributed mode, so much the better.

On Wed, Aug 22, 2012 at 3:35 PM, Tom Burton-West tburt...@umich.edu wrote:

Thanks Tirthankar, So the issue is memory use for sorting. I'm not sure I understand how sorting of grouping fields is involved with the defaults and field collapsing, since the default sorts by relevance, not by grouping field. On the other hand, I don't know much about how field collapsing is implemented. So far the few tests I've made haven't revealed any memory problems. We are using very small string fields for grouping, and I think we probably have only a couple of cases where we are grouping more than a few thousand docs. I will try to find a query with a lot of docs per group and take a look at the memory use using JConsole. Tom

On Wed, Aug 22, 2012 at 4:02 PM, Tirthankar Chatterjee tchatter...@commvault.com wrote:

Hi Tom, We had an issue where we keep millions of docs on a single node and we were trying to group them on a string field which is nothing but a full file path... that caused Solr to go out of memory. Erick has explained nicely in the thread why it won't work, and I had to find another way of architecting it. How do you think this is different in your case? If you want to group by a string field with thousands of similar entries, I am guessing you will face the same issue. Thanks, Tirthankar

***Legal Disclaimer*** This communication may contain confidential and privileged material for the sole use of the intended recipient. Any unauthorized review, use or distribution by others is strictly prohibited. If you have received the message in error, please advise the sender by reply email and delete the message. Thank you.

--
Lance Norskog goks...@gmail.com

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
RE: Solr Not releasing memory
Rohit, Which collector do you use? Releasing physical RAM is possible with compacting collectors like serial, parallel and maybe G1, and not possible with CMS. More importantly, the requirement to release memory is really suspicious and even odd. Please provide more details about your JVM and your overall challenge.

On 03.09.2012 15:03, Rohit ro...@simplify360.com wrote:

I am currently using StandardDirectoryFactory; would switching the directory factory have any impact on the indexes? Regards, Rohit

-----Original Message----- From: Claudio Ranieri [mailto:claudio.rani...@estadao.com] Sent: 03 September 2012 10:03 To: solr-user@lucene.apache.org Subject: RES: Solr Not releasing memory

Are you using MMapDirectoryFactory? I had a swap problem on Linux with a big index when I used MMapDirectoryFactory. You can try solr.NIOFSDirectoryFactory.

-----Original Message----- From: Lance Norskog [mailto:goks...@gmail.com] Sent: Sunday, September 2, 2012 22:00 To: solr-user@lucene.apache.org Subject: Re: Solr Not releasing memory

1) I believe Java 1.7 releases memory back to the OS. 2) All of the Javas I've used on Windows do this. Is the physical memory use a problem? Does it push out all other programs? Or is it just that the Java process appears larger? This explains the latter: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

----- Original Message ----- From: Rohit ro...@simplify360.com To: solr-user@lucene.apache.org Sent: Sunday, September 2, 2012 1:22:14 AM Subject: Solr Not releasing memory

Hi, We are running Solr 3.5 using Tomcat 6.26 on a Windows Enterprise RC2 server; our index size is pretty large. We have noticed that once Tomcat starts using/reserving RAM it never releases it, even when there is not a single user on the system. I have tried forced garbage collection, but that doesn't seem to help either. Regards, Rohit
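For reference, the collector is selected with JVM flags like these (heap sizes are illustrative), and printing the effective flags is an easy way to check which collector actually runs:

  # serial / parallel are compacting collectors
  java -Xms2g -Xmx8g -XX:+UseParallelGC -jar start.jar
  # CMS and G1, for comparison:
  java -Xms2g -Xmx8g -XX:+UseConcMarkSweepGC -jar start.jar
  java -Xms2g -Xmx8g -XX:+UseG1GC -jar start.jar

  # show the effective GC settings
  java -XX:+PrintFlagsFinal -version | grep -iE "use.+gc"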
Re: Solr New Version causes NIO Closed Channel Exception
Hi, Does MMapDirectory work for you?

On 03.09.2012 19:20, Pavitar Singh psi...@sprinklr.com wrote:

Hi, We are facing this problem repeatedly and it goes away on restarts.

[#|2012-09-01T12:07:06.947+|SEVERE|glassfish3.1|org.apache.solr.core.SolrCore|_ThreadID=712;_ThreadName=Thread-2;|java.nio.channels.ClosedChannelException
at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:88)
at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:613)
at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:161)
at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:160)
at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
at org.apache.lucene.store.DataInput.readVInt(DataInput.java:86)
at org.apache.lucene.index.codecs.standard.StandardPostingsReader$SegmentDocsEnum.read(StandardPostingsReader.java:300)
at org.apache.lucene.search.TermScorer.refillBuffer(TermScorer.java:74)
at org.apache.lucene.search.TermScorer.nextDoc(TermScorer.java:121)
at org.apache.lucene.search.TermScorer.score(TermScorer.java:70)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:210)
at org.apache.lucene.search.Searcher.search(Searcher.java:101)
at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1289)
at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1099)
at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:358)
at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:423)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:231)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1359)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:256)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:215)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:279)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:655)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:595)
at com.sun.enterprise.web.WebPipeline.invoke(WebPipeline.java:98)
at com.sun.enterprise.web.PESessionLockingStandardPipeline.invoke(PESessionLockingStandardPipeline.java:91)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:162)
at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:655)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:595)
at org.apache.catalina.connector.CoyoteAdapter.doService(CoyoteAdapter.java:323)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:227)
at com.sun.enterprise.v3.services.impl.ContainerMapper.service(ContainerMapper.java:170)
at com.sun.grizzly.http.ProcessorTask.invokeAdapter(ProcessorTask.java:822)
at com.sun.grizzly.http.ProcessorTask.doProcess(ProcessorTask.java:719)
at com.sun.grizzly.http.ProcessorTask.process(ProcessorTask.java:1013)
at com.sun.grizzly.http.DefaultProtocolFilter.execute(DefaultProtocolFilter.java:225)
at com.sun.grizzly.DefaultProtocolChain.executeProtocolFilter(DefaultProtocolChain.java:137)
at com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:104)
at com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:90)
at com.sun.grizzly.http.HttpProtocolChain.execute(HttpProtocolChain.java:79)
at com.sun.grizzly.ProtocolChainContextTask.doCall(ProtocolChainContextTask.java:54)
at com.sun.grizzly.SelectionKeyContextTask.call(SelectionKeyContextTask.java:59)
at com.sun.grizzly.ContextTask.run(ContextTask.java:71)
at com.sun.grizzly.util.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:532)
at com.sun.grizzly.util.AbstractThreadPool$Worker.run(AbstractThreadPool.java:513)
at java.lang.Thread.run(Thread.java:619)
|#]
RE: Solr Not releasing memory
Rohit, Why do you think it should free the memory during idle time? Let us know which numbers you are actually watching. Check this, it can be interesting: blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

On 04.09.2012 0:45, Markus Jelsma markus.jel...@openindex.io wrote:

You've got more than 45GB of physical RAM in your machine? I assume it's actually virtual memory you're seeing, which is not a problem, even on Windows. It's not uncommon for resident memory to be higher than the allocated heap space, and it's normal to have a high virtual memory address space if you have a large index.

-----Original message----- From: Rohit ro...@simplify360.com Sent: Tue 04-Sep-2012 00:33 To: solr-user@lucene.apache.org Subject: RE: Solr Not releasing memory

I am talking about physical memory here. We start with an -Xms of 2GB, but very soon it goes as high as 45GB. The memory never comes down, even when not a single user is using the system. Regards, Rohit

-----Original Message----- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: 03 September 2012 14:58 To: solr-user@lucene.apache.org Subject: RE: Solr Not releasing memory

It would be helpful to know which memory isn't being released. Is it virtual, physical or shared memory? Is it the heap space?

-----Original message----- From: Mikhail Khludnev mkhlud...@griddynamics.com Sent: Mon 03-Sep-2012 16:52 To: solr-user@lucene.apache.org Subject: RE: Solr Not releasing memory

Rohit, Which collector do you use? Releasing physical RAM is possible with compacting collectors like serial, parallel and maybe G1, and not possible with CMS. More importantly, the requirement to release memory is really suspicious and even odd. Please provide more details about your JVM and your overall challenge.

On 03.09.2012 15:03, Rohit ro...@simplify360.com wrote:

I am currently using StandardDirectoryFactory; would switching the directory factory have any impact on the indexes? Regards, Rohit

-----Original Message----- From: Claudio Ranieri [mailto:claudio.rani...@estadao.com] Sent: 03 September 2012 10:03 To: solr-user@lucene.apache.org Subject: RES: Solr Not releasing memory

Are you using MMapDirectoryFactory? I had a swap problem on Linux with a big index when I used MMapDirectoryFactory. You can try solr.NIOFSDirectoryFactory.

-----Original Message----- From: Lance Norskog [mailto:goks...@gmail.com] Sent: Sunday, September 2, 2012 22:00 To: solr-user@lucene.apache.org Subject: Re: Solr Not releasing memory

1) I believe Java 1.7 releases memory back to the OS. 2) All of the Javas I've used on Windows do this. Is the physical memory use a problem? Does it push out all other programs? Or is it just that the Java process appears larger? This explains the latter: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

----- Original Message ----- From: Rohit ro...@simplify360.com To: solr-user@lucene.apache.org Sent: Sunday, September 2, 2012 1:22:14 AM Subject: Solr Not releasing memory

Hi, We are running Solr 3.5 using Tomcat 6.26 on a Windows Enterprise RC2 server; our index size is pretty large. We have noticed that once Tomcat starts using/reserving RAM it never releases it, even when there is not a single user on the system. I have tried forced garbage collection, but that doesn't seem to help either. Regards, Rohit
Re: Re: Get parent when the child is a search hit
Hello, One more approach is BlockJoin; see SOLR-3076 and blog.griddynamics.com/2012/08/block-join-query-performs.html

On 11.09.2012 5:40, 李�S liyun2...@corp.netease.com wrote:

I think denormalizing the data is the best way. 2012-09-11 李�S

From: jimtronic
Sent: 2012-09-11 01:38
Subject: Re: Get parent when the child is a search hit
To: solr-user solr-user@lucene.apache.org
Cc:

You could create a type field with folder or file as values and then have the parentid present in the folder docs.

--
View this message in context: http://lucene.472066.n3.nabble.com/Get-parent-when-the-child-is-a-search-hit-tp4006623p4006687.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Can solr return matched fields?
Dan, If you have a "foo bar" search phrase against fields NAME and BRAND, and you have 10K docs matched with the first 100 displayed, what do you actually want to see as "the fields the query matched", and for which docs? Looking forward to additional details.

On Thu, Sep 13, 2012 at 2:40 AM, Jack Krupansky j...@basetechnology.com wrote:

But presumably "matched fields" relates to indexed fields, which might not have stored values.

-- Jack Krupansky

-----Original Message----- From: Casey Callendrello Sent: Wednesday, September 12, 2012 6:15 PM To: solr-user@lucene.apache.org Subject: Re: Can solr return matched fields?

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: MultiSearchHandler - Boosting results of a Query
1. Please explain how exactly you boost the value of a field in the 2nd query based on the results of the 1st. Please provide sample queries, docs and results. 2. Introducing such a chaining concern, aka *ResponseAware*, seems quite doubtful to me. 3. Are you sure you are aware of tricks like http://wiki.apache.org/solr/SolrRelevancyFAQ#How_do_I_give_a_negative_.28or_very_low.29_boost_to_documents_that_match_a_query.3F and http://wiki.apache.org/solr/DisMaxQParserPlugin#bq_.28Boost_Query.29 and http://wiki.apache.org/solr/QueryElevationComponent ?

On Thu, Sep 13, 2012 at 11:09 PM, Karthick Duraisamy Soundararaj karthick.soundara...@gmail.com wrote: Clarification: Once the parser is response-aware, it's easy for the components to grab the response and use it. In the context of function queries, by components I mean the various Functions that extend ValueSource.

On Thu, Sep 13, 2012 at 3:02 PM, Karthick Duraisamy Soundararaj karthick.soundara...@gmail.com wrote: Hello all, I am making multiple queries in a single URL and trying to boost the value of a field in the 2nd query based on the results of the 1st. To achieve this, my function query should be able to have access to the response of the first query. However, QParser and QParserPlugin only accept the req parameter and have no idea about the response. In a nutshell, all I am trying to do is that, during a serial execution of a chain of queries represented by a single URL (https://issues.apache.org/jira/browse/SOLR-1093), I am trying to influence the results of the second query with the results of the first query. To make the function queries ResponseAware, there are two options:

*Option 1: Make all the QueryParsers ResponseAware* For this, the following changes seem inevitable: 1. Change/overload the createParser definition of QParserPlugin to include SolrQueryResponse: createParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) - createParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req, SolrQueryResponse rsp) 2. Make similar changes to the getParser function in QParser.

*Option 2: Make the function query parser alone ResponseAware* For this, the following changes need to be made: 1. Overload the FunctionQParserPlugin's createParser method with the following signature: createParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req, SolrQueryResponse rsp) 2. Overload the getParser method in QParser to permit the extra SolrQueryResponse parameter and invoke this call wherever necessary.

Once the parser is response-aware, it's easy for the components to grab the response and use it. This change to the interface would mandate changes across the various components of Solr that use all the different kinds of parsers, but I think this would be a useful feature, as it has been requested by different people at various times. I would appreciate any kind of suggestions/feedback. Also, I would be more than happy to discuss if there are any other ways of doing the same. -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: MMapDirectory
My limited understanding, confirmed by a profiler though, is that mmap IO costs you copying bytes from mmapped virtual memory into the heap. Just look at java.nio.DirectByteBuffer.get(byte[], int, int). It has happened to me several times: we saw a hotspot in the profiler on mmapped IO (yep, just on copying bytes!!), cached the data on the heap, and the hotspot moved elsewhere after that. A good sample of a heap cache for mmapped data is the TermInfos cache with its configurable interval. Overall, the question is absolutely worth thinking about; a small sketch of the copy in question follows below.

On Thu, Sep 20, 2012 at 9:39 PM, Erick Erickson erickerick...@gmail.com wrote: So I just had a curiosity question pop up and wanted to check it out. Solr has the documentCache, designed to hold stored fields while various parts of a requestHandler do their tricks, keeping the stored content from having to be re-fetched from disk. When using MMapDirectory, is this even something to worry about? It seems like the documentCache wouldn't be all that useful, but then I don't have a deep understanding here. I can imagine scenarios where it would be more efficient, i.e. it's targeted at the documents actually being accessed rather than random places on disk in the fdt/fdx files. Thanks, Erick -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
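To illustrate the copy in question (the file name is invented; any bulk get() from a mapped buffer behaves this way), bytes always move from the OS page cache into a heap-resident array:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapCopySketch {
  public static void main(String[] args) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile("_0.fdt", "r");
         FileChannel ch = raf.getChannel()) {
      // map the file read-only; reads go straight against the OS page cache
      MappedByteBuffer mmap = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
      byte[] onHeap = new byte[4096]; // heap destination
      // this bulk copy is exactly the hotspot seen in profilers
      mmap.get(onHeap, 0, onHeap.length);
    }
  }
}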
Re: DIH problem
Gian, The only way to handle it is to provide a test case and attach it to a JIRA issue. Thanks

On Fri, Sep 21, 2012 at 6:03 PM, Gian Marco Tagliani gm.tagli...@gmail.com wrote: Hi, I'm updating my Solr from version 3.4 to version 3.6.1 and I'm facing a little problem with the DIH. In the delta-import I'm using the parentDeltaQuery feature of the DIH to update the parent entity. I don't think this is working properly. I realized that it's just executing the parentDeltaQuery with the first record of the deltaQuery result. Comparing the code with the previous versions, I noticed that the rowIterator was never set to null. To solve this I wrote a simple patch:

Index: solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/EntityProcessorBase.java
===================================================================
--- solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/EntityProcessorBase.java (revision 31454)
+++ solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/EntityProcessorBase.java (working copy)
@@ -121,6 +121,7 @@
       if (rowIterator.hasNext()) return rowIterator.next();
       query = null;
+      rowIterator = null;
       return null;
     } catch (Exception e) {
       SolrException.log(log, "getNext() failed for query '" + query + "'", e);

Do you think this is correct? Thanks for your help -- Gian Marco Tagliani -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Return only matched multiValued field
Hi, It sounds like the highlighting feature; a sketch of such a request follows below.

On 24.09.2012 at 0:51, Dotan Cohen dotanco...@gmail.com wrote: Assume a multivalued, stored and indexed field with the name comment. When performing a search, I would like to return only the values of comment which contain the match. For example, when searching for gold, instead of getting this result:

<doc>
  <arr name="comment">
    <str>There's a lady who's sure</str>
    <str>all that glitters is gold</str>
    <str>and she's buying a stairway to heaven</str>
  </arr>
</doc>

I would prefer to get this result:

<doc>
  <arr name="comment">
    <str>all that glitters is gold</str>
  </arr>
</doc>

(pseudo-XML from memory, may not be accurate, but it illustrates the point.) Is there any way to do this with a Solr 4 index? The client accessing Solr is on a dial-up connection (no provision for DSL or other high-speed internet), so I'd like to move as little data over the wire as possible. In reality the array will have tens of fields, so returning only the relevant fields may reduce the data transferred by an order of magnitude. Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
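A minimal highlighting request along those lines (the field name is taken from the example above; the snippet count is a guess) might be:

http://localhost:8983/solr/select?q=comment:gold&fl=id&hl=true&hl.fl=comment&hl.snippets=5

The hl.fl parameter limits highlighting to the comment field, and keeping fl small avoids shipping the whole stored array over the slow link.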
Re: Getting the distribution information of scores from query
I suggest creating a component and putting it after QueryComponent. In prepare() it should add its own PostFilter into the list of request filters; that post filter will be able to inject its own DelegatingCollector, and then you can just add the collected histogram into the result named list. See http://searchhub.org/dev/2012/02/10/advanced-filter-caching-in-solr/ and the rough sketch below.

On Tue, Sep 25, 2012 at 10:03 PM, Amit Nithian anith...@gmail.com wrote: We have a federated search product that issues multiple parallel queries to Solr cores, fetches the results and blends them. The approach we were investigating was taking the scores, normalizing them based on some distribution (a normal distribution seems reasonable) and using that z-score to blend the results (else you'll be blending scores on different scales). To accomplish this, I was looking to get the distribution of the scores for the query as an analog to the stats component, but the only way to accomplish this seems to be to create a custom collector that would accumulate and store this information (mean, std-dev etc.), since the stats component only operates on indexed fields. Is there an easy way to tell Solr to use a custom collector without having to modify the SolrIndexSearcher class? Maybe there is an alternative way to get this information? Thanks Amit -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
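Untested, and the class name and bucketing scheme are invented, but a skeleton against the Solr 4.x API would look roughly like this:

import java.io.IOException;
import java.util.concurrent.atomic.AtomicLongArray;
import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

public class ScoreHistogramFilter extends ExtendedQueryBase implements PostFilter {
  // 20 coarse buckets; a component reads them out in process() afterwards
  private final AtomicLongArray buckets = new AtomicLongArray(20);

  public ScoreHistogramFilter() {
    setCache(false); // post filters must not be cached
    setCost(100);    // cost >= 100 makes it run after the main query scores docs
  }

  @Override
  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
      @Override
      public void collect(int doc) throws IOException {
        // 'scorer' is injected via setScorer() before collection starts
        int b = Math.min(buckets.length() - 1, (int) (scorer.score() * 2));
        buckets.incrementAndGet(b);
        super.collect(doc); // pass the doc on to the real collector
      }
    };
  }

  public AtomicLongArray getBuckets() { return buckets; }
}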
Re: need best solution for indexing and searching multiple, related database tables
FYI, a block join query doesn't require denormalization and is performant, but it has its own limitations, of course. Many-to-many is the most painful point; I deal with it, but I'm quite far from contributing a generally applicable approach.

On 29.09.2012 at 5:21, Biff Baxter tom.bren...@acmedata.net wrote: Hi Walter, I have bought into the denormalize approach. My remaining questions are around how to construct the denormalized view, and around any Solr functions that would support issues related to a) minimizing the denormalization explosion for 3 or more tables and b) handling many-to-many relationships. One issue I am concerned with is: if I search for IBM and Steve Jones in my example above, no records should be returned. How do I manage that with the equivalent of the one-denormalized-record approach? I appreciate your help. Biff -- View this message in context: http://lucene.472066.n3.nabble.com/need-best-solution-for-indexing-and-searching-multiple-related-database-tables-tp4009857p4011010.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Lifecycle of a TokenFilter from TokenFilterFactory
It's not clear what you want to achieve. I don't always create custom TokenStreams, but when I do, I use Lucene's as prototypes to start from.

On Mon, Oct 1, 2012 at 6:07 PM, Em mailformailingli...@yahoo.de wrote: Hi Mikhail, thanks for your feedback. If so, how can I write unit tests which respect the reuse strategy? What's the recommended way when creating custom Tokenizers and TokenFilters? Kind regards, Em

On 01.10.2012 at 10:54, Mikhail Khludnev wrote: Hello, Analyzers are reused. An Analyzer is a Tokenizer plus several TokenFilters. Check the source of org.apache.lucene.analysis.Analyzer and pay attention to the reuse strategy. Best regards

On Sun, Sep 30, 2012 at 5:37 PM, Em mailformailingli...@yahoo.de wrote: Hello list, I saw a bug in a TokenFilter that only works if there is a fresh instance created by the TokenFilterFactory, and it seems as if TokenFilters are somehow reused for more than one request. So, if your TokenFilterFactory has a logging statement in its create() method, you see that log only now and again - but not on every request. Is this a bug in Solr 4.0-BETA or is this expected behaviour? If it is expected, what could be wrong with the TokenFilter? Kind regards, Em -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Lifecycle of a TokenFilter from TokenFilterFactory
Ok. I might get what you are looking for. Extend SolrTestCaseJ4 (see plenty of samples in the codebase). Obtain a request via req(), obtain the schema from it by getSchema(), then getAnalyzer() or getQueryAnalyzer(), and ask for analysis via org.apache.lucene.analysis.Analyzer.tokenStream(String, Reader). You'll find your filters cached in the IndexSchema analyzers. Let me know if it helps; a sketch follows below.

On Mon, Oct 1, 2012 at 10:54 PM, Em mailformailingli...@yahoo.de wrote: That's exactly the way I do it when I have to write some custom stuff. My problem is that I do not know how to integrate an Analyzer's reusability feature into a unit test, to see what happens if - i.e. - a TokenFilter instance is going to be reused. Some TokenFilter prototypes I've seen are stateful and do not reset their state as necessary in order to be reused. This problem only occurs when I deploy those filters to Solr and index or search some documents (which does not always call create() on the TokenFilterFactory). However, I have to be able - at least somehow - to tackle those problems in unit tests instead of noticing them after a deployment to Solr. So my question is: How can I (unit-)test a TokenFilter with an Analyzer which reuses the same TokenFilter instance for more than one input TokenStream? Kind regards, Em -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
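For instance, a test sketch along those lines (the field name and text are invented; it assumes the usual test solrconfig.xml/schema.xml):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.solr.SolrTestCaseJ4;
import org.apache.solr.request.SolrQueryRequest;
import org.junit.BeforeClass;
import org.junit.Test;

public class ReusedFilterTest extends SolrTestCaseJ4 {
  @BeforeClass
  public static void beforeClass() throws Exception {
    initCore("solrconfig.xml", "schema.xml");
  }

  @Test
  public void testFilterSurvivesReuse() throws Exception {
    SolrQueryRequest req = req("q", "*:*");
    try {
      Analyzer analyzer = req.getSchema().getAnalyzer();
      // two passes: the second one reuses the cached Tokenizer/TokenFilter chain
      for (int pass = 0; pass < 2; pass++) {
        TokenStream ts = analyzer.tokenStream("comment", new StringReader("some text"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          // assert on term.toString() here
        }
        ts.end();
        ts.close();
      }
    } finally {
      req.close();
    }
  }
}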
Re: Synonyms Phrase not working
Gustav, AFAIK, multi-word synonyms are one of the weak points of Lucene/Solr. I'm going to propose a solution approach at the forthcoming Eurocon: http://www.apachecon.eu/schedule/presentation/18/ . You are welcome! -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Can I rely on correct handling of interrupted status of threads?
I remember a bug in EmbeddedSolrServer in 1.4.1 where an exception bypassed request closing, which led to a searcher leak and an OOM. It was fixed about two years ago.

On Tue, Oct 2, 2012 at 1:48 PM, Robert Krüger krue...@lesspain.de wrote: Hi, I'm using Solr 3.6.1 in an application embedded directly, i.e. via EmbeddedSolrServer, not over an HTTP connection, which works perfectly. Our application uses Thread.interrupt() for canceling long-running tasks (e.g. through Future.cancel). A while (and a few Solr versions) back, a colleague of mine implemented a workaround because he said that Solr didn't handle the thread's interrupted status correctly, i.e. not setting the interrupted status after having caught an InterruptedException, nor rethrowing it, thus killing the information that an interrupt has been requested, which breaks libraries relying on that. However, I did not find anything up-to-date in the mailing list or forum archives on the web. Is that still, or was it ever, the case? What does one have to watch out for when interrupting a thread that is doing anything within Solr/Lucene? Any advice would be appreciated. Regards, Robert -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Follow links in xml doc
Billy, Have you tried http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer ? A rough sketch follows below.

On Wed, Oct 3, 2012 at 7:11 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi Billy, There is nothing in Solr that will do XML parsing and link extraction, so you'll need to do that part. Once you do that, have a look at Solr join for parent-child querying: http://search-lucene.com/?q=solr+join Otis -- Search Analytics - http://sematext.com/search-analytics/index.html Performance Monitoring - http://sematext.com/spm/index.html

On Tue, Oct 2, 2012 at 9:51 PM, Billy Newman newman...@gmail.com wrote: Hello again all. I have a URLDataSource to index XML data. Is there any way to follow links within the XML doc and index the items behind them under the same document? I.e., if I search for a word or term and that term lives behind a link of the doc with ID 12345, I would like to return that doc when searched. Thanks, Billy -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
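A rough DIH sketch of the idea (all URLs, XPaths and the script function are invented): the script builds an absolute URL from the link found in the feed, and a nested entity fetches and indexes that page into the same document:

<dataConfig>
  <dataSource type="URLDataSource"/>
  <script><![CDATA[
    function toUrl(row) {
      // turn the raw link text into an absolute URL for the nested entity
      row.put('linkUrl', 'http://example.com/' + row.get('link'));
      return row;
    }
  ]]></script>
  <document>
    <entity name="page" processor="XPathEntityProcessor"
            url="http://example.com/feed.xml" forEach="/docs/doc"
            transformer="script:toUrl">
      <field column="id" xpath="/docs/doc/id"/>
      <field column="link" xpath="/docs/doc/link"/>
      <entity name="linked" processor="XPathEntityProcessor"
              url="${page.linkUrl}" forEach="/page">
        <field column="linked_text" xpath="/page/text"/>
      </entity>
    </entity>
  </document>
</dataConfig>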
Re: Multi-Select Faceting with delimited field values
The only way to do that is to split your attributes, which are concatenations of attr and val: you should have a color attr with the vals red, green, blue; hdmi: yes/no; speakers: yes/no. A tag/exclude sketch over such split fields follows below.

On 04.10.2012 at 5:19, Aaron Bains aaronba...@gmail.com wrote: I am trying to set up my query for multi-select faceting; here is my attempt at the query:

q=category:monitors&fq=attribute:(color-black)&facet.field=attribute&facet=true

The undesired response from the query:

<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="productid">1019141675</str>
    </doc>
  </result>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="attribute">
        <int name="color-black">1</int>
        <int name="vga-yes">1</int>
        <int name="hdmi-yes">1</int>
        <int name="speakers-yes">1</int>
      </lst>
    </lst>
    <lst name="facet_dates"/>
    <lst name="facet_ranges"/>
  </lst>
</response>

The desired response:

<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="productid">1019141675</str>
    </doc>
  </result>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="attribute">
        <int name="color-black">120</int>
        <int name="color-silver">58</int>
        <int name="color-white">13</int>
        <int name="vga-yes">1</int>
        <int name="hdmi-yes">1</int>
        <int name="speakers-yes">1</int>
      </lst>
    </lst>
    <lst name="facet_dates"/>
    <lst name="facet_ranges"/>
  </lst>
</response>

The way I have the attribute and value delimited by a dash has me stumped on how to perform the tagging and excluding. If we exclude the entire attribute field with facet.field={!ex=dt}attribute, it brings an undesired result. What I need to do is exclude (attribute:color). Thanks for the help!!
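With split fields, standard multi-select faceting via tag/exclude (field names follow the suggestion above) would look like:

q=category:monitors&fq={!tag=col}color:black&facet=true&facet.field={!ex=col}color&facet.field=hdmi&facet.field=speakers

The filter on color is tagged, and the color facet excludes that tag, so the counts for the other colors survive while the result list stays filtered to black.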
Re: Can I rely on correct handling of interrupted status of threads?
It was another exception class.

On Thu, Oct 4, 2012 at 5:19 PM, Robert Krüger krue...@lesspain.de wrote: On Tue, Oct 2, 2012 at 8:50 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: I remember a bug in EmbeddedSolrServer in 1.4.1 where an exception bypassed request closing, which led to a searcher leak and an OOM. It was fixed about two years ago. You mean InterruptedException? -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Getting list of operators and terms for a query
You've got the ResponseBuilder as the process() or prepare() argument; check its query field. But your component should be registered after QueryComponent in your requestHandler config. A sketch follows below.

On Thu, Oct 4, 2012 at 6:03 PM, Davide Lorenzo Marino davide.mar...@gmail.com wrote: Hi All, I'm working on a new SearchComponent that analyzes the search queries. I need to know whether, given a query string, it is possible to get the list of operators and terms (better in Polish notation). I mean, if the default field is country and the query is the string england OR (name:paul AND city:rome), I'd like to get the list [Operator OR, Term country:england, Operator AND, Term name:paul, Term city:rome]. Thanks in advance Davide Marino -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
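A minimal sketch (class name and response key are mine) that walks the parsed query tree after QueryComponent has run:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class QueryInspectorComponent extends SearchComponent {
  @Override
  public void prepare(ResponseBuilder rb) throws IOException {}

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    List<String> ops = new ArrayList<String>();
    walk(rb.getQuery(), ops); // the query parsed by QueryComponent.prepare()
    rb.rsp.add("parsedOps", ops);
  }

  private void walk(Query q, List<String> out) {
    if (q instanceof BooleanQuery) {
      for (BooleanClause c : ((BooleanQuery) q).clauses()) {
        out.add("Occur " + c.getOccur().name()); // MUST / SHOULD / MUST_NOT
        walk(c.getQuery(), out);
      }
    } else if (q instanceof TermQuery) {
      out.add("Term " + ((TermQuery) q).getTerm()); // field:text
    }
  }

  @Override public String getDescription() { return "query inspector"; }
  @Override public String getSource() { return ""; }
}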
Re: Problem with relating values in two multi value fields
It's a typical nested-document problem; there are several approaches. The out-of-the-box solution, as far as you need facets, is http://wiki.apache.org/solr/FieldCollapsing .

On Thu, Oct 4, 2012 at 7:19 PM, Torben Honigbaum torben.honigb...@neuland-bfi.de wrote: Hi Jack, thank you for your answer. The problem is that I don't know the value for option A, that the values are numbers, and that I have to use the values as a facet. So I need something like this. Docs:

<doc>
  <str name="id">3</str>
  <arr name="options"><str>A</str><str>B</str> ... </arr>
  <arr name="value"><str>200</str><str>400</str> ... </arr>
</doc>
<doc>
  <str name="id">4</str>
  <arr name="options"><str>A</str><str>E</str> ... </arr>
  <arr name="value"><str>300</str><str>400</str> ... </arr>
</doc>
<doc>
  <str name="id">6</str>
  <arr name="options"><str>A</str><str>C</str> ... </arr>
  <arr name="value"><str>200</str><str>400</str> ... </arr>
</doc>

Query: …?q=options:A Facet: 200 (2), 300 (1) Thank you Torben

On 04.10.2012 at 17:10, Jack Krupansky wrote: Use a field called option_value_pairs with values like "A 200", and then query with the quoted phrase "A 200". You could use a special character like an equals sign instead of the space, A=200, and then you don't have to quote it in the query. -- Jack Krupansky

-Original Message- From: Torben Honigbaum Sent: Thursday, October 04, 2012 11:03 AM To: solr-user@lucene.apache.org Subject: Problem with relating values in two multi value fields Hello, I have a problem with relating values in two multi-valued fields. My documents look like this:

<doc>
  <str name="id">3</str>
  <arr name="options"><str>A</str><str>B</str><str>C</str><str>D</str></arr>
  <arr name="value"><str>200</str><str>400</str><str>240</str><str>310</str></arr>
</doc>

My problem is that I have to search for a set of documents, display only the value for option A, for example, and use the value field as a facet field. I need a result like this:

<doc>
  <str name="id">3</str>
  <str name="options">A</str>
  <str name="value">200</str>
</doc>

facet … I think that this is a use case which isn't possible, right? So can someone show me an alternative way to solve this problem? The documents each have 500 options with 500 related values. Thank you Torben -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Identify exact search in edismax
The overall task is not clear to me, but if you want to check that all of a field's terms matched the user query, I'd suggest introducing your own Similarity: write the number of terms as the norm value (which is by default a byte per doc per field); then you'll be able to retrieve this number at search time and use it to evaluate your own mm criterion. WDYT? A sketch follows below.

On Thu, Oct 4, 2012 at 9:28 PM, rhl4tr rhl4...@gmail.com wrote: I am using edismax for guessing the category from the user query. If the user says I want to buy BMW and Audi car, this query will be fed to edismax, which will give me results based on phrase matches. The field contains the following values: BMW = Cars category; Audi = Cars; 2 BHK = Real Estate; need job = Jobs category; Buy 1Bhk = Apartments. I get results with phrase matches on top. Generally the top result will be a phrase match (if there are any). How can I know that all of a field's terms have matched the user query? E.g., mm = the percentage of user query terms that should match; I want the opposite: the percentage of field values that should match the user query, which in my case is 100% = phrase match. -- View this message in context: http://lucene.472066.n3.nabble.com/Identify-exact-search-in-edismax-tp4011859.html Sent from the Solr - User mailing list archive at Nabble.com. -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
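A sketch against the Lucene 3.x Similarity API (not tested; note norms are quantized to a single byte, so the stored count is approximate):

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.DefaultSimilarity;

public class TermCountSimilarity extends DefaultSimilarity {
  @Override
  public float computeNorm(String field, FieldInvertState state) {
    // store the raw term count instead of the usual 1/sqrt(length) norm;
    // decode it back at search time to compare against the number of matched terms
    return state.getLength();
  }
}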
Re: Identify exact search in edismax
Absolutely, that's what I didn't get in your initial question. Okay, it seems you are talking about a typical eCommerce search problem. I will speak about it at http://www.apachecon.eu/schedule/presentation/18/ . See you.

On Fri, Oct 5, 2012 at 9:47 AM, rhl4tr rhl4...@gmail.com wrote: But a user query can contain any number of terms. I cannot know how many field terms it has to match.

{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "mm": "0",
      "sort": "score desc",
      "indent": "true",
      "qf": "exact_keywords",
      "wt": "json",
      "rows": "1",
      "defType": "dismax",
      "pf": "exact_keywords",
      "debugQuery": "false",
      "fl": "data_id,data_name,exact_keywords",
      "start": "0",
      "q": "i want to buy honda suzuki",
      "fq": "+data_type:pwords"
    }
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {
        "data_name": "Cars",
        "data_id": 71,
        "exact_keywords": "honda suzuki",
        "term_mm": "100%"
      },
      {
        "data_name": "bikes",
        "data_id": 72,
        "exact_keywords": "suzuki",
        "term_mm": "50%"
      }
    ]
  }
}

A hypothetical solution would look like the JSON response above: the term_mm value tells what percentage of the field's terms matched the user query. -- View this message in context: http://lucene.472066.n3.nabble.com/Identify-exact-search-in-edismax-tp4011859p4011976.html Sent from the Solr - User mailing list archive at Nabble.com. -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: PriorityQueue:initialize consistently showing up as hot spot while profiling
what's the value of the rows param http://wiki.apache.org/solr/CommonQueryParameters#rows ?

On Fri, Oct 5, 2012 at 6:56 AM, Aaron Daubman daub...@gmail.com wrote: Greetings, I've been seeing this call chain come up fairly frequently when debugging longer-QTime queries under Solr 3.6.1, but have not been able to understand from the code what is really going on - the call graph and code follow below. Would somebody please explain to me: 1) Why this would show up frequently as a hotspot 2) If it is expected to do so 3) If there is anything I should look into that may help performance where this frequently shows up as the long pole in the QTime tent 4) What the code is doing and why heap is being allocated as an apparently giant object (which also is apparently not unheard of, due to the MAX_VALUE wrapping check)

---call-graph---
Filter - SolrDispatchFilter:doFilter (method time = 12 ms, total time = 487 ms)
  Filter - SolrDispatchFilter:execute:365 (method time = 0 ms, total time = 109 ms)
    org.apache.solr.core.SolrCore:execute:1376 (method time = 0 ms, total time = 109 ms)
      org.apache.solr.handler.RequestHandlerBase:handleRequest:129 (method time = 0 ms, total time = 109 ms)
        org.apache.solr.handler.component.SearchHandler:handleRequestBody:186 (method time = 0 ms, total time = 109 ms)
          com.echonest.solr.component.EchoArtistGroupingComponent:process:188 (method time = 0 ms, total time = 109 ms)
            org.apache.solr.search.SolrIndexSearcher:search:375 (method time = 0 ms, total time = 96 ms)
              org.apache.solr.search.SolrIndexSearcher:getDocListC:1176 (method time = 0 ms, total time = 96 ms)
                org.apache.solr.search.SolrIndexSearcher:getDocListNC:1209 (method time = 0 ms, total time = 96 ms)
                  org.apache.solr.search.SolrIndexSearcher:getProcessedFilter:796 (method time = 0 ms, total time = 26 ms)
                    org.apache.solr.search.BitDocSet:andNot:185 (method time = 0 ms, total time = 13 ms)
                      org.apache.lucene.util.OpenBitSet:clone:732 (method time = 13 ms, total time = 13 ms)
                    org.apache.solr.search.BitDocSet:intersection:31 (method time = 0 ms, total time = 13 ms)
                      org.apache.solr.search.DocSetBase:intersection:90 (method time = 0 ms, total time = 13 ms)
                        org.apache.lucene.util.OpenBitSet:and:808 (method time = 13 ms, total time = 13 ms)
                  org.apache.lucene.search.TopFieldCollector:create:916 (method time = 0 ms, total time = 46 ms)
                    org.apache.lucene.search.FieldValueHitQueue:create:175 (method time = 0 ms, total time = 46 ms)
                      org.apache.lucene.search.FieldValueHitQueue$MultiComparatorsFieldValueHitQueue:init:111 (method time = 0 ms, total time = 46 ms)
                        org.apache.lucene.search.SortField:getComparator:409 (method time = 0 ms, total time = 13 ms)
                          org.apache.lucene.search.FieldComparator$FloatComparator:init:400 (method time = 13 ms, total time = 13 ms)
                        org.apache.lucene.util.PriorityQueue:initialize:108 (method time = 33 ms, total time = 33 ms)
---snip---

org.apache.lucene.util.PriorityQueue:initialize - the hotspot is line 108: heap = (T[]) new Object[heapSize]; // T is unbounded type, so this unchecked cast works always

---PriorityQueue.java---
  /** Subclass constructors must call this. */
  @SuppressWarnings("unchecked")
  protected final void initialize(int maxSize) {
    size = 0;
    int heapSize;
    if (0 == maxSize)
      // We allocate 1 extra to avoid if statement in top()
      heapSize = 2;
    else {
      if (maxSize == Integer.MAX_VALUE) {
        // Don't wrap heapSize to -1, in this case, which
        // causes a confusing NegativeArraySizeException.
        // Note that very likely this will simply then hit
        // an OOME, but at least that's more indicative to
        // caller that this values is too big. We don't +1
        // in this case, but it's very unlikely in practice
        // one will actually insert this many objects into
        // the PQ:
        heapSize = Integer.MAX_VALUE;
      } else {
        // NOTE: we add +1 because all access to heap is
        // 1-based not 0-based. heap[0] is unused.
        heapSize = maxSize + 1;
      }
    }
    heap = (T[]) new Object[heapSize]; // T is unbounded type, so this unchecked cast works always
    this.maxSize = maxSize;

    // If sentinel objects are supported, populate the queue with them
    T sentinel = getSentinelObject();
    if (sentinel != null) {
      heap[1] = sentinel;
      for (int i = 2; i < heap.length; i++) {
        heap[i] = getSentinelObject();
      }
      size = maxSize;
    }
  }
---snip---

Thanks, as always! Aaron -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Identify exact search in edismax
I have only pencil scratches yet and can't share them. I can say that I've found it quite close to the approach described at http://www.ulakha.com/publications.html - it's called Concept Search there, but as far as I understand, I have a rather different implementation approach.

On Fri, Oct 5, 2012 at 2:31 PM, rhl4tr rhl4...@gmail.com wrote: Can you please get me started? I cannot wait till the presentation. -- View this message in context: http://lucene.472066.n3.nabble.com/Identify-exact-search-in-edismax-tp4011859p4012006.html Sent from the Solr - User mailing list archive at Nabble.com. -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: PriorityQueue:initialize consistently showing up as hot spot while profiling
Okay. A huge rows value is the no. 1 way to kill Lucene; it's not possible, absolutely. You need to rethink the logic of your component. Check Solr's FieldCollapsing code; IIRC it makes a second search to achieve a similar goal. Also check the PostFilter and DelegatingCollector classes; their approach can also be handy for your task.

On Fri, Oct 5, 2012 at 2:38 PM, Aaron Daubman daub...@gmail.com wrote: On Fri, Oct 5, 2012 at 4:33 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: what's the value of the rows param http://wiki.apache.org/solr/CommonQueryParameters#rows ? Very interesting question - so, for historic reasons lost to me, we pass in a huge (1000?) number for rows, and this hits our custom component, which has its own internal maximum for real rows returned. (This is a custom grouping component, so I am guessing the large number of rows had to do with trying not to limit what got grouped?) Is the value of rows what is used for that heap allocation? absolutely. it's the classic priority-queue algorithm upon a binary heap. Thanks, Aaron -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Problem with relating values in two multi value fields
Denormalize your docs into option x value tuples, identifying them by a duplicated set id:

<doc>
  <str name="setid">3</str>
  <str name="options">A</str>
  <str name="value">200</str>
</doc>
<doc>
  <str name="setid">3</str>
  <str name="options">B</str>
  <str name="value">400</str>
</doc>
<doc>
  <str name="setid">3</str>
  <str name="options">C</str>
  <str name="value">240</str>
</doc>
<doc>
  <str name="setid">3</str>
  <str name="options">D</str>
  <str name="value">310</str>
</doc>

then collapse them by the setid field (it cannot be the uniqueKey).

On Fri, Oct 5, 2012 at 6:26 PM, Torben Honigbaum torben.honigb...@neuland-bfi.de wrote: Hi Mikhail, I read the article and can't see how to solve my problem with FieldCollapsing. Any other suggestions? Torben -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Need to update a field without re-indexing in solr 3.6
Could you please tell me more: what field do you need to update, how does it influence the search results, how often, and why can you not afford a commit?

On Fri, Oct 5, 2012 at 11:14 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, This is not doable in Solr 3.*. There are Lucene-level patches in JIRA, but I'm not sure if they are in Solr 4.*. Otis -- Search Analytics - http://sematext.com/search-analytics/index.html Performance Monitoring - http://sematext.com/spm/index.html

On Fri, Oct 5, 2012 at 3:02 PM, Thakur, Pramila pramila_tha...@ontla.ola.org wrote: Hi Everyone, I am using Solr 3.6. I want to update a single field value in the index without re-indexing. Is this possible? I have googled and came across partial update in Solr 4.0 BETA. Can I do this with Solr 3.6? Thanks, -- Pramila Thakur -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Problem with relating values in two multi value fields
Torben, Denormalization implies copying the attrs which are common for a group into the smaller docs:

<doc>
  <str name="setid">3</str>
  <str name="attribute_A">value</str>
  <str name="attribute_B">value</str>
  <str name="options">A</str>
  <str name="value">200</str>
</doc>
<doc>
  <str name="setid">3</str>
  <str name="attribute_A">value</str>
  <str name="attribute_B">value</str>
  <str name="options">B</str>
  <str name="value">400</str>
</doc>
<doc>
  <str name="setid">3</str>
  <str name="attribute_A">value</str>
  <str name="attribute_B">value</str>
  <str name="options">C</str>
  <str name="value">240</str>
</doc>
<doc>
  <str name="setid">3</str>
  <str name="attribute_A">value</str>
  <str name="attribute_B">value</str>
  <str name="options">D</str>
  <str name="value">310</str>
</doc>

and use group.facet=true; see the request sketch after this message.

On Sat, Oct 6, 2012 at 2:24 AM, Torben Honigbaum torben.honigb...@neuland-bfi.de wrote: Hi Mikhail, thank you for your answer. Maybe my sample data was not so good. The documents always have additional data which I need to use as facets, like this:

<doc>
  <str name="id">3</str>
  <str name="attribute_A">value</str>
  <str name="attribute_B">value</str>
  <arr name="options"><str>A</str><str>B</str> ... </arr>
  <arr name="value"><str>200</str><str>400</str> ... </arr>
</doc>

Torben -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
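A grouped-facet request over such denormalized docs (field names from the example above) might look like:

q=options:A&group=true&group.field=setid&group.facet=true&facet=true&facet.field=value

With group.facet=true the value counts are computed per setid group rather than per denormalized doc, so each original document contributes once to the facet.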
Re: Get report of keywords searched.
Rajani, IIRC solrmeter can grab search phrases from a log; there is a special command for doing it there: Tools / Extract Queries. Regards

On Sun, Oct 7, 2012 at 10:02 AM, Rajani Maski rajinima...@gmail.com wrote: Hi Davide, Yes right, this can be done. Just one question - I am not sure if I had to create a new thread for this question - I just wanted to know whether solrmeter or JMeter can help me get the list of keywords searched. I am a novice with solrmeter; I just know that it's used for stress testing. I'm interested to know if I can use the same tools for this case of getting the list of keywords searched. Thanks Rajani

On Fri, Oct 5, 2012 at 7:23 PM, Davide Lorenzo Marino davide.mar...@gmail.com wrote: If you think this could be a problem for your performance, you can try two different solutions: 1 - Make the call to update the db in a different thread. 2 - Make an asynchronous HTTP call to a web application that updates the db (in this case the web app can reside on a different machine, so the RAM, CPU time and disk operations don't slow your Solr engine).

2012/10/5 Rajani Maski rajinima...@gmail.com: Hi, Thank you for the reply, Davide. By writing to the db, you mean inserting the search queries into the db? I was thinking that this might affect search performance. Yes, you are right, getting stats for a particular keyword is tough. It would suffice if I can get the q param and fq param values (when we search using the standard request handler). Any open source Solr log analysis tools? Can we achieve this with solrmeter? Has anyone tried this? Thank You

On Thu, Oct 4, 2012 at 2:07 PM, Davide Lorenzo Marino davide.mar...@gmail.com wrote: If you need to analyze the search queries, it is not very difficult: just create a search plugin and put them in a db. If you need to search the single keywords, it is more difficult and you need to take some decisions before starting. In particular, take the following queries and try to answer how you would like to treat them for the keywords: 1) apple OR orange 2) apple AND orange 3) title:apple AND subject:orange 4) apple -orange 5) apple OR (orange AND banana) 6) title:apple OR subject:orange Ciao Davide Marino

2012/10/3 Rajani Maski rajinima...@gmail.com: Hi All, I am using SolrJ. When there is a search query hit, I am logging the URL in a location, and it also gets logged into the Tomcat catalina logs. Now I want to implement functionality to periodically (per week) analyze the Solr search logs and find out the keywords searched. Is there a way to do it using any of the existing functionality of Solr? If not, has anybody tried this implementation with any open source tools? Suggestions welcome. Awaiting a reply. Thank you. -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Problem with relating values in two multi value fields
Toke, You are absolutely right: a concatenated term is a possible solution. I found faceting quite complicated in this case, but it was a hot fix which we delivered to production. Torben, This problem arises quite often. Besides the two approaches discussed here, it's also possible to approach it with SpanQueries and TermPositions - you can check our experience here: http://blog.griddynamics.com/2011/06/solr-experience-search-parent-child.html http://vimeo.com/album/2012142/video/33817062 Our current way is BlockJoin, which is really performant in the case of batched updates: http://blog.griddynamics.com/2012/08/block-join-query-performs.html. The bad thing is that there is no open facet component for block join. We have the code, but we are not ready to share it yet.

On Mon, Oct 8, 2012 at 12:44 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote: On Mon, 2012-10-08 at 08:42 +0200, Torben Honigbaum wrote: sorry, my fault. This was one of my first ideas. My problem is that I have 1,000,000 documents, each with about 20 attributes. Additionally, each document has between 200 and 500 option-value pairs. So if I denormalize the data, it means I get 1,000,000 x 350 ((200 + 500) / 2) = 350,000,000 documents, each with 20 attributes.

If you have a few hundred or fewer distinct primary attributes (the A, B, C's in your example), you could create a new field for each of them:

<doc>
  <str name="id">3</str>
  <str name="options">A B C D</str>
  <str name="option_A">200</str>
  <str name="option_B">400</str>
  <str name="option_C">240</str>
  <str name="option_D">310</str>
  ...
</doc>

Query for options:A and facet on the field option_A to get facets for that specific field. This normalization does increase the index size due to duplicated secondary values between the option fields, but since our assumption is a relatively small number of primary values, it should not be too much. Alternatively, if you have many distinct primary attributes, index the pairs as Jack suggests:

<doc>
  <str name="id">3</str>
  <str name="options">A B C D</str>
  <str name="option">A=200</str>
  <str name="option">B=400</str>
  <str name="option">C=240</str>
  <str name="option">D=310</str>
  ...
</doc>

Query for options:A and facet on the field option with facet.prefix=A=. Your result will be A=200 (2), A=450 (1)..., so you'll have to strip the A= prefix before display. This normalization is potentially a lot heavier than the previous one, as we have distinct_primaries * distinct_secondaries distinct values. Worst case, where every document only contains distinct combinations of primary/secondary, we have 350M distinct option values, which is quite heavy for a single box to facet on. Whether that is better or worse than 350M documents, I don't know.

Is denormalization the only way to handle this problem? What you are trying to do does look quite a lot like hierarchical faceting, which Solr does not support directly. But even if you apply one of the experimental patches, it does not mitigate the potential combinatorial explosion of your primary and secondary values. So that leaves the question: How many distinct combinations of primary and secondary values do you have? Regards, Toke Eskildsen -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Reloading ExternalFileField blocks Solr
Martin, Can you tell me what the content of that field is, and how it should affect search results?

On Mon, Oct 8, 2012 at 12:55 PM, Martin Koch m...@issuu.com wrote: Hi List, We're using Solr-4.0.0-Beta with a 7M document index running on a single host with 16 shards. We'd like to use an ExternalFileField to hold a value that changes often. However, we've discovered that the file is apparently re-read by every shard/core on *every commit*; the index is unresponsive in this period (around 20s on the host we're running on). This is unacceptable for our needs. In the future, we'd like to add other values as ExternalFileFields, and this will make the problem worse. It would be better if the external file were instead read in the background, updating previously read relevant values for each shard as they are read in. I guess a change in the ExternalFileField code would be required to achieve this, but I have no experience here, so suggestions are very welcome. Thanks, /Martin Koch - Issuu - Senior Systems Architect. -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Reloading ExternalFileField blocks Solr
Martin, I have a kind of hack approach in mind for hiding documents from search, so it's a little bit easier than your task. I'm going to deliver a talk about it: http://www.apachecon.eu/schedule/presentation/89/ . Frankly speaking, there is no reliable out-of-the-box solution for it. I saw that DocValues have been integrated with FunctionQueries already, but DocValues updates, which sound like a doable thing, have not been delivered yet. Regards

On Mon, Oct 8, 2012 at 11:54 PM, Martin Koch m...@issuu.com wrote: Sure: We're boosting search results based on user actions, which could be e.g. the number of times a particular document has been read. In future, we'd also like to boost by e.g. impressions (the number of times a document has been displayed) and other values. /Martin

On Mon, Oct 8, 2012 at 7:02 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Martin, Can you tell me what the content of that field is, and how it should affect search results? -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Reloading ExternalFileField blocks Solr
Martin, I found a slide deck quite relevant to what you are asking about: http://www.slideshare.net/lucenerevolution/potter-timothy-boosting-documents-in-solr

On Tue, Oct 9, 2012 at 7:57 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi Martin, Perhaps you could make a small change in Solr to add a "don't reload the EFF if it hasn't been modified since it was last opened" behaviour. I assume you commit pretty often but don't modify the EFF files that often, so this could save you some needless loading. That said, I'd be surprised if EFF doesn't already do this... I didn't check. Otis -- Search Analytics - http://sematext.com/search-analytics/index.html Performance Monitoring - http://sematext.com/spm/index.html -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Understanding Filter Queries
Amit, Sure. This method https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L796 , beside some other stuff, calculates the fq docset intersection, which is supplied into the filtered search call https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L1474 . You are welcome.

On Sun, Oct 21, 2012 at 12:00 AM, Amit Nithian anith...@gmail.com wrote: Hi all, Quick question: I've been reading up on the filter query and how it's implemented, and the multiple articles I see keep referring to this notion of leap-frogging and filter query execution in parallel with the main query. Question: Can someone point me to the code that does this so I can better understand? Thanks! Amit -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Bitwise operation
Christopher, Would you mind if I ask you for a sample?

On 19.03.2013 at 19:31, Christopher ARZUR christopher.ar...@cognix-systems.com wrote: Hi, Does Solr (4.1.0) support a bitwise AND or bitwise OR operator, so that we can specify a field to be compared against an index using bitwise AND or OR? Thanks, -- Christopher
Re: customize solr search/scoring for performance
Robert, I also wonder why it always requests to collect the doclist in order: https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L1469 Do you think it makes sense to raise a JIRA to allow out-of-order collecting?

On Tue, Nov 13, 2012 at 6:34 AM, Robert Muir rcm...@gmail.com wrote: Whenever I look at Solr users' stack traces for disjunctions, I always notice they get BooleanScorer2. Is there some reason for this, or is it not intentional (e.g. maybe an in-order collector is always being used when it's possible, at least in simple cases, to allow for out-of-order hits)? When I examine test contributions from Clover reports (e.g. https://builds.apache.org/job/Lucene-Solr-Clover-4.x/49/clover-report/), I notice that only Lucene tests, and Solr spellchecking tests, actually hit BooleanScorer's collect. All other Solr tests hit BooleanScorer2. If it's possible to allow for an out-of-order collector in some common cases (e.g. large disjunctions w/ minShouldMatch generated by Solr query parsers), it could be a nice performance improvement.

On Mon, Nov 12, 2012 at 3:48 PM, jchen2000 jchen...@yahoo.com wrote: The following was generated from jvisualvm. It seems like the perf is related to scoring a lot. Any idea/pointer on how to customize that part? http://lucene.472066.n3.nabble.com/file/n4019850/profilingResult.png -- View this message in context: http://lucene.472066.n3.nabble.com/customize-solr-search-scoring-for-performance-tp4019444p4019850.html Sent from the Solr - User mailing list archive at Nabble.com. -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Very slow query when boosting involves ExternalFileField
Floyd, I think you need to provide a stack trace or a rough profiler sampling.

On Fri, Mar 22, 2013 at 6:23 AM, Floyd Wu floyd...@gmail.com wrote: Can anybody point me in a direction? Many thanks.

2013/3/20 Floyd Wu floyd...@gmail.com: Hi everyone, I have a problem and have had no luck figuring it out. When I issue a query:

Query 1: http://localhost:8983/solr/select?q={!boost b=recip(ms(NOW/HOUR,last_modified_datetime),3.16e-11,1,1)}all:java&start=0&rows=10&fl=score,author&sort=score+desc

Query 2: http://localhost:8983/solr/select?q={!boost b=sum(ranking,recip(ms(NOW/HOUR,last_modified_datetime)),3.16e-11,1,1)}all:java&start=0&rows=10&fl=score,author&sort=score+desc

The difference between the two queries is the boost. The boost function of Query 2 uses a field named ranking, and this field is an ExternalFileField. The external file contains key=value pairs, about 1 lines. Execution time: Query 1 -- 100ms; Query 2 -- 2300ms. I tried to issue Query 3, changing ranking to a constant 1:

Query 3: http://localhost:8983/solr/select?q={!boost b=sum(1,recip(ms(NOW/HOUR,last_modified_datetime)),3.16e-11,1,1)}all:java&start=0&rows=10&fl=score,author&sort=score+desc

Execution time: Query 3 -- 110ms. One thing I can be sure of is that involving an ExternalFileField slows down query execution time significantly. But I have no idea how to solve this problem, as my boost function must calculate the value of the ranking field. Please help with this. PS: I'm using Solr 4.1. Floyd -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Is there any way to return the number of indexed tokens in a field?
Alex, It's not clear what you need to count: pre-analyzed values, or tokens as an analysis result. If the former, I suggest looking into something like FieldLengthUpdateProcessorFactory; in the case of the latter you need to override Similarity.computeNorm(String, FieldInvertState) / encode/decodeNorm. On Sun, Apr 14, 2013 at 8:29 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: Hello, We seem to have all sorts of functions around tokenized field content, but I am looking for a simple count/length that can be returned as a pseudo-field. Does anyone know of one out of the box? The specific situation is that I am indexing a field for specific regular expressions that become tokens (in a copyField). Not every field has the same number of those. I now want to find the documents that have the maximum number of tokens in that field (for testing and review). But I can't figure out how. Any help would be appreciated. Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
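To illustrate the former option: a minimal solrconfig.xml sketch, assuming Solr 4.x update processors (the chain and field names are illustrative, and note that FieldLengthUpdateProcessorFactory counts characters of each value, not tokens):

  <updateRequestProcessorChain name="count-values">
    <!-- clone the raw value so the original field stays intact -->
    <processor class="solr.CloneFieldUpdateProcessorFactory">
      <str name="source">text</str>
      <str name="dest">text_length</str>
    </processor>
    <!-- replaces each cloned value with its character length -->
    <processor class="solr.FieldLengthUpdateProcessorFactory">
      <str name="fieldName">text_length</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>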
Re: Questions about the performance of Solr
Hello, start from http://wiki.apache.org/solr/CommonQueryParameters#fq On Mon, May 6, 2013 at 11:42 AM, joo jamodr...@nate.com wrote: Search speed has degraded as the amount of loaded data grew past seventy million documents; a query now takes about 50 seconds, and I cannot tell whether that is expected. I would like to know whether there is a problem with the query I use, and how to optimize it for Solr. The query I use is, for example: time:[time TO time] AND category:(1 2) AND (message1:message OR message2:message). If there is no problem with the query itself, please advise which part I should look at. -- View this message in context: http://lucene.472066.n3.nabble.com/Questions-about-the-performance-of-Solr-tp4060988.html Sent from the Solr - User mailing list archive at Nabble.com. -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
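For instance, the query above could be rewritten so that the structural clauses become independently cached filters (a sketch; the time bounds are placeholders):

  q=message1:message OR message2:message
  &fq=time:[START TO END]
  &fq=category:(1 2)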
Re: maximum number of simultaneous threads
Venkata, Solr is a plain webapp. It doesn't spawn threads itself (almost); it runs in the servlet container's threads. You need to configure tomcat/jetty. On Tue, May 14, 2013 at 4:17 PM, Dmitry Kan solrexp...@gmail.com wrote: venkata, If you are after search scaling, then the webapp server (like tomcat, jetty etc) handles allocation of threads per client connection (maxThreads for jetty for instance). Inside one client request SOLR uses threads for various tasks, but I don't have any exact figures (not sure if wiki has them either). Dmitry On Mon, May 13, 2013 at 7:22 PM, venkata vmarr...@yahoo.com wrote: I see a configuration point for indexing threads. However I am not finding anything for search. How many simultaneous threads can SOLR spin up during search time? -- View this message in context: http://lucene.472066.n3.nabble.com/maximum-number-of-simultaneous-threads-tp4062903p4062982.html Sent from the Solr - User mailing list archive at Nabble.com. -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
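For example, on Tomcat the pool is sized on the HTTP connector in server.xml (a sketch; the numbers are illustrative):

  <Connector port="8080" protocol="HTTP/1.1"
             maxThreads="200" acceptCount="100"
             connectionTimeout="20000"/>

Jetty has the analogous maxThreads setting on its thread pool.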
Re: Sorting facets by relevance
There is the Lucene faceting module, which doesn't share anything with Solr, but it looks like it has something you are looking for. http://shaierera.blogspot.com/2012/11/lucene-facets-part-1.html On Thu, May 16, 2013 at 1:33 AM, Jan Morlock jan.morl...@googlemail.com wrote: Hi, we are using faceted search for our queries. However neither sorting by count nor sorting by index as described in [1] is suitable for our business case. Instead, we would like to have the facets (or at least the beginning of them) sorted by the score of the top document possessing the corresponding facet. The expected behaviour can be compared to what the result grouping feature does (see [2]). I am currently thinking about the following strategy: (1) Create a new search component (2) Perform a sub-query using grouping (3) Use the result of this sub-query in order to sort the facets of the actual query. Currently step no. 2 seems to be pretty difficult. Can anybody point me to an example where a sub-query is performed in order to retrieve the groups? Or does anybody have a better/easier strategy for achieving this? Any help is appreciated. Thank you very much in advance. Best regards Jan [1]: http://wiki.apache.org/solr/SimpleFacetParameters#facet.sort [2]: http://wiki.apache.org/solr/FieldCollapsing -- View this message in context: http://lucene.472066.n3.nabble.com/Sorting-facets-by-relevance-tp4063649.html Sent from the Solr - User mailing list archive at Nabble.com. -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Facet pivot 50.000.000 different values
On Fri, May 17, 2013 at 12:47 PM, Carlos Bonilla carlosbonill...@gmail.com wrote: We only need to calculate how many different B values have more than 1 document but it takes ages Carlos, It's not clear whether you need to take the results of a query into account or just gather statistics from the index. If the latter, you can just enumerate the terms and check TermsEnum.docFreq(). Am I getting it right? -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
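A rough sketch of the latter with Lucene 4.x APIs (the field name "b" is illustrative; note that docFreq() may also count deleted documents):

  import java.io.IOException;
  import org.apache.lucene.index.*;
  import org.apache.lucene.util.BytesRef;

  // counts how many distinct values of field "b" occur in more than one document
  static long countValuesWithDocFreqAboveOne(IndexReader reader) throws IOException {
    Terms terms = MultiFields.getTerms(reader, "b");
    if (terms == null) return 0;
    long count = 0;
    TermsEnum te = terms.iterator(null); // Lucene 4.x signature
    for (BytesRef term = te.next(); term != null; term = te.next()) {
      if (te.docFreq() > 1) count++;
    }
    return count;
  }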
Re: intersection of filter queries with raw query parser
Hello Sascha, I propose calling the raw parser from the standard one via the nested query syntax http://searchhub.org/2009/03/31/nested-queries-in-solr/ Regards. On Fri, May 31, 2013 at 3:35 PM, Sascha Szott sz...@zib.de wrote: Hi folks, is it possible to use the raw query parser with a disjunctive filter query? Say, I have a field 'foo' and two values 'v1' and 'v2' (the field values are free text and can contain any character). What I want is to retrieve all documents satisfying fq=foo:(v1 OR v2). In case only one value (v1) is given, the query fq={!raw f=foo}v1 works as expected. But how can I formulate the filter query (with the raw query parser) in case two values are provided? The same question was posted on Stackoverflow (http://stackoverflow.com/questions/5637675/solr-query-with-raw-data-and-union-multiple-facet-values) two years ago. But there was only the advice to give up using the raw query parser, which is not what I want to do. Thanks in advance, Sascha -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
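A sketch of what that could look like with the field from the question (the parameter names val1/val2 are illustrative):

  fq=_query_:"{!raw f=foo v=$val1}" OR _query_:"{!raw f=foo v=$val2}"&val1=v1&val2=v2

The v=$param indirection keeps the raw values free of escaping concerns.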
Re: Configuring lucene to suggest the indexed string for all the searches of the substring of the indexed string
Please excuse my misunderstanding, but I always wonder why this index-time processing is usually suggested. From my POV this is a case for query-time processing, i.e. PrefixQuery aka wildcard query Jason*. Ultra-fast term retrieval is also provided by TermsComponent. On Wed, Jun 5, 2013 at 8:09 PM, Jack Krupansky j...@basetechnology.com wrote: ngrams? See: http://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramFilterFactory.html -- Jack Krupansky -Original Message- From: Prathik Puthran Sent: Wednesday, June 05, 2013 11:59 AM To: solr-user@lucene.apache.org Subject: Configuring lucene to suggest the indexed string for all the searches of the substring of the indexed string Hi, Is it possible to configure solr to suggest the indexed string for all the searches of the substring of the string? Thanks, Prathik -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Configuring lucene to suggest the indexed string for all the searches of the substring of the indexed string
Got it. It's actually the opposite of the usual prefix suggestions. So, out of the box it's provided by http://wiki.apache.org/solr/TermsComponent terms.regex= (also see the last example there); it should work by loading terms into memory and linearly scanning them with the regexp. There is nothing more efficient out of the box. http://wiki.apache.org/solr/Suggester says: Support for infix-suggestions _is planned_ for FSTLookup (which would be the only structure to support these). On Thu, Jun 6, 2013 at 10:25 AM, Prathik Puthran prathik.puthra...@gmail.com wrote: My use case is I want to search for any substring of the indexed string and the Suggester should suggest the indexed string. What can I do to make this work? Thanks, Prathik On Thu, Jun 6, 2013 at 2:05 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Please excuse my misunderstanding, but I always wonder why this index-time processing is usually suggested. From my POV this is a case for query-time processing, i.e. PrefixQuery aka wildcard query Jason*. Ultra-fast term retrieval is also provided by TermsComponent. On Wed, Jun 5, 2013 at 8:09 PM, Jack Krupansky j...@basetechnology.com wrote: ngrams? See: http://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramFilterFactory.html -- Jack Krupansky -Original Message- From: Prathik Puthran Sent: Wednesday, June 05, 2013 11:59 AM To: solr-user@lucene.apache.org Subject: Configuring lucene to suggest the indexed string for all the searches of the substring of the indexed string Hi, Is it possible to configure solr to suggest the indexed string for all the searches of the substring of the string? Thanks, Prathik -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
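For example (a sketch, assuming the /terms handler from the example solrconfig; the field and substring are illustrative):

  http://localhost:8983/solr/terms?terms.fl=name&terms.regex=.*substring.*&terms.regex.flag=case_insensitive&terms.limit=20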
Re: The 'threads' parameter in DIH - SOLR 4.3.0
Hello, Most times users end up coding a multithreaded SolrJ indexer, which I consider a sad thing. As the 3.x fix contributor I want to share my vision of the problem. While I did that work I realized that the join operation itself is too hard and even impossible to make concurrent. I propose to add concurrency into the outbound and inbound streams. My plan is: 1. add threads to the outbound flow https://issues.apache.org/jira/browse/SOLR-3585 it allows DIH not to wait for Solr. I mostly like that code, but recently I realized that this code implements the ConcurrentUpdateSolrServer algorithm; looking forward, I'd prefer to unify some core concurrent code between them, or it's a kind of using CUSS inside of DIH's SolrWriter 2. The next problem which we've faced is SqlEntityProcessor. It has two modes; one of them gets miserable performance due to the N+1 problem, and the cached version is not production capable with the default heap cache. Our proposal for it is https://issues.apache.org/jira/browse/SOLR-4799 unfortunately I have no time to polish the patch. 3. After that, the only thing which DIH waits for is jdbc. It can easily be boosted by implementing a DataSource wrapper with a producer thread and a bounded queue as a buffer (a sketch follows below). If we complete this plan, we will never need to code SolrJ indexers. The particular question to you is: what do you need to speed up? On Thu, Jun 13, 2013 at 11:01 PM, Shawn Heisey s...@elyograg.org wrote: On 6/13/2013 12:08 PM, bbarani wrote: I see that the threads parameter has been removed from DIH in all versions starting with SOLR 4.x. Can someone let me know the best way to initiate indexing in multi-threaded mode when using DIH now? Is there a way to do that? That parameter was removed because it didn't work right, and there was no apparent way to fix it. The change that went into a later 3.6 version was a bandaid, not a fix. I don't know all the details. There's no way to get multithreading with DIH directly, but you can do it indirectly: Create multiple request handlers with different names, such as /dataimport1, /dataimport2, etc. Configure each handler with settings that will pull part of your data source. Start them so they run concurrently. Depending on your environment, it may be easier to just write a multi-threaded indexing application using the Solr API for your language of choice. Thanks, Shawn -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics mkhlud...@griddynamics.com
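A minimal sketch of item 3 in plain Java, not tied to the exact DIH DataSource API: a daemon producer thread pre-fetches rows from the wrapped (slow) source iterator into a bounded queue.

  import java.util.Iterator;
  import java.util.NoSuchElementException;
  import java.util.concurrent.ArrayBlockingQueue;
  import java.util.concurrent.BlockingQueue;

  // buffers a slow source (e.g. a jdbc row iterator) behind a bounded queue
  public class BufferedIterator<T> implements Iterator<T> {
    private static final Object EOF = new Object();
    private final BlockingQueue<Object> queue = new ArrayBlockingQueue<Object>(1024);
    private Object next;

    public BufferedIterator(final Iterator<T> source) {
      Thread producer = new Thread(new Runnable() {
        public void run() {
          try {
            while (source.hasNext()) {
              queue.put(source.next());
            }
            queue.put(EOF); // poison pill signals the end of the stream
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }
      });
      producer.setDaemon(true);
      producer.start();
    }

    public boolean hasNext() {
      if (next == null) {
        try {
          next = queue.take(); // blocks until the producer delivers
        } catch (InterruptedException e) {
          throw new RuntimeException(e);
        }
      }
      return next != EOF;
    }

    @SuppressWarnings("unchecked")
    public T next() {
      if (!hasNext()) throw new NoSuchElementException();
      T result = (T) next;
      next = null;
      return result;
    }

    public void remove() { throw new UnsupportedOperationException(); }
  }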
Re: New operator.
Hello Yanis, Two options. 1. Create your own SearchComponent, which adds a filterQuery into the request, and add it into the SearchHandler. http://wiki.apache.org/solr/SearchComponent 2. Create a QParserPlugin and call it via a request param ...fq={!yanisqp}applyvector... (a sketch follows below) http://wiki.apache.org/solr/SolrPlugins#QParserPlugin On Sun, Jun 16, 2013 at 10:01 AM, Yanis Kakamaikis yanis.kakamai...@gmail.com wrote: Hi all, I want to add a new operator to my solr. I need that operator to call my proprietary engine and build an answer vector for solr, in a way that this vector will be part of the boolean query at the next step. How do I do that? Thanks -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
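A rough sketch of option 2, assuming Solr 4.x plugin APIs (the class name and the engine hook are hypothetical placeholders):

  import org.apache.lucene.search.Query;
  import org.apache.solr.common.params.SolrParams;
  import org.apache.solr.common.util.NamedList;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.search.QParser;
  import org.apache.solr.search.QParserPlugin;
  import org.apache.solr.search.SyntaxError;

  public class VectorQParserPlugin extends QParserPlugin {
    @Override
    public void init(NamedList args) {}

    @Override
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
      return new QParser(qstr, localParams, params, req) {
        @Override
        public Query parse() throws SyntaxError {
          // placeholder for the proprietary engine call; it should return
          // a Lucene Query (e.g. one wrapping the computed answer vector)
          return callProprietaryEngine(qstr, req);
        }
      };
    }

    private Query callProprietaryEngine(String qstr, SolrQueryRequest req) {
      throw new UnsupportedOperationException("plug the engine in here");
    }
  }

It would be registered in solrconfig.xml as <queryParser name="yanisqp" class="...VectorQParserPlugin"/>.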
Re: Solr large boolean filter
Right. FieldCacheTermsFilter is an option. You need to create your own QParserPlugin which yields a FieldCacheTermsFilter, and hook it in as ..fq={!idsqp cache=false}.. Mind disabling caching! Mind term encoding due to the field type! I also suggest checking how much time it spends on tokenization. Once I got some profit by using an efficient encoding for this param (try fixed length or vint). There is one more gain when the core query is highly selective and the id filter is weakly selective; in this case using explicit PostFiltering (what a hack, btw) is desirable, see http://yonik.com/posts/advanced-filter-caching-in-solr/ From my experience the proper solution for such problems is moving to one of the joins or ExternalFileField. On Sun, Jun 16, 2013 at 2:49 AM, Igor Kustov ivkus...@gmail.com wrote: I know I'm not the first one with this problem. I'm currently using solr 4.2.1 with approximately 10 mln documents in the index. The index is updated frequently. The filter query is just one big boolean OR query by id: fq=id:(1 2 3 4 ... 50950) The ids list is always different and not sequential. The problem is that query performance is not so good, as you can imagine. In some particular cases I'm able to do filtering based on different fields, but in some cases (like 30-40% of all queries) I still end up with this large id filter. I'm looking for ways to improve this query performance. It doesn't seem like solr join could be applied here. Another option that I found is to somehow use Lucene FieldCacheTermsFilter. Is it worth a try? Maybe I've missed some other options? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-large-boolean-filter-tp4070747.html Sent from the Solr - User mailing list archive at Nabble.com. -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Solr large boolean filter
No-no-no, mate! I warned you before: 'Mind term encoding due to the field type!' You need to obtain the schema from the request, then access the field type and convert the external string representation into the (possibly tricky) encoded bytes via readableToIndexed(); see FieldType.getFieldQuery(). Btw, it's a really frequent pain on this list, feel free to contribute when you're done! An empty BooleanQuery matches nothing. There is a MatchAllDocsQuery(). On Mon, Jun 17, 2013 at 8:35 PM, Igor Kustov ivkus...@gmail.com wrote: Meanwhile I'm currently trying to write a custom QParser which will use FieldCacheTermsFilter. So I'm using a query like http://127.0.0.1:8080/solr/select?q=*:*&fq={!mqparser}id:%281%202%203%29 And I couldn't make it work - I just couldn't find a proper constructor and I'm also not sure that I'm filtering appropriately. private class MyQParser extends QParser { List<String> idsList; MyQParser(String queryString, SolrParams localParams, SolrParams solrParams, SolrQueryRequest solrQueryRequest) throws SyntaxError { super(queryString, localParams, solrParams, solrQueryRequest); idsList = // extract ids from params } @Override public Query parse() throws SyntaxError { Filter filter = new FieldCacheTermsFilter("id", idsList.toArray()); // first problem: id is just an int in my case, but this seems like the only normal constructor return new FilteredQuery(new BooleanQuery(), filter); // my goal here is to get only filtered data, but does BooleanQuery() equal *:*? } -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-large-boolean-filter-tp4070747p4071049.html Sent from the Solr - User mailing list archive at Nabble.com. -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
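Along those lines, parse() inside the custom QParser could look roughly like this (an untested sketch; the whitespace split is illustrative; needed imports include org.apache.lucene.search.* and org.apache.solr.schema.FieldType):

  @Override
  public Query parse() throws SyntaxError {
    FieldType type = req.getSchema().getField("id").getType();
    String[] externalIds = qstr.split("\\s+"); // illustrative tokenization
    String[] indexedIds = new String[externalIds.length];
    for (int i = 0; i < externalIds.length; i++) {
      // convert the external representation into the indexed bytes form
      indexedIds[i] = type.readableToIndexed(externalIds[i]);
    }
    Filter filter = new FieldCacheTermsFilter("id", indexedIds);
    // an empty BooleanQuery matches nothing; MatchAllDocsQuery is '*:*'
    return new FilteredQuery(new MatchAllDocsQuery(), filter);
  }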
Re: Converting nested data model to solr schema
On Mon, Jul 1, 2013 at 5:56 PM, adfel70 adfe...@gmail.com wrote: This requires me to override the solr document distribution mechanism. I fear that with this solution I may lose some of solr cloud's capabilities. It's not clear whether you are aware of http://searchhub.org/2013/06/13/solr-cloud-document-routing/ , but what you did doesn't sound scary to me. If it works, it should be fine. I'm not aware of any capabilities that you are going to lose. Obviously SOLR-3076 provides astonishing query-time performance by offloading the actual join work to index time. Check it out if your current approach turns slow. -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Schema design for parent child field
From my experience, deeply nested scopes are a case for SOLR-3076, almost exclusively. On Sat, Jun 29, 2013 at 1:08 PM, Sperrink kevin.sperr...@lexisnexis.co.za wrote: Good day, I'm seeking some guidance on how best to represent the following data within a solr schema. I have a list of subjects which are detailed to n levels. Each document can contain many of these subject entities. As I see it, if this had been just 1 subject per document, dynamic fields would have been a good solution. Any suggestions on how best to create this structure in a denormalised fashion while maintaining the data integrity? For example a document could have: Subject level 1: contract Subject level 2: claims Subject level 1: patent Subject level 2: counter claims If I were to search for level 1 contract, I would only want the facet count for level 2 to contain claims and not counter claims. Any assistance in this would be much appreciated. -- View this message in context: http://lucene.472066.n3.nabble.com/Schema-design-for-parent-child-field-tp4074084.html Sent from the Solr - User mailing list archive at Nabble.com. -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Solr large boolean filter
Rafalovitch arafa...@gmail.com wrote: On Mon, Jun 17, 2013 at 12:35 PM, Igor Kustov ivkus...@gmail.com wrote: So I'm using query like http://127.0.0.1:8080/solr/select?q=*:*fq={!mqparser}id:%281%202%203%29 http://127.0.0.1:8080/solr/select?q=*:*fq=%7B!mqparser%7Did:%281%202%203%29 If the IDs are purely numeric, I wonder if the better way is to send a bitset. So, bit 1 is on if ID:1 is included, bit 2000 is on if ID:2000 is included. Even using URL-encoding rules, you can fit at least 65 sequential ID flags per character and I am sure there are more efficient encoding schemes for long empty sequences. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: set-based and other less common approaches to search
Try the dismax query parser, specifying the mm and qf parameters (a sketch follows below). On Tue, Jul 2, 2013 at 9:31 PM, gilawem mewa...@gmail.com wrote: Thanks. So following up on a) below, could I set up and query Solr, without any customization of code, to match 10 of my given 20 terms, but only if it finds those 10 terms in an xls document under a column that is named MyID or My ID or My I.D.? If so, what would that query look like? On Jul 2, 2013, at 12:38 PM, Otis Gospodnetic wrote: Hi, Solr can do all of these. There are phrase queries, queries where you specify a field, the mm param for min should match, etc. Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Tue, Jul 2, 2013 at 12:36 PM, gilawem mewa...@gmail.com wrote: Let's say I wanted to ask solr to find me any document that contains at least 100 out of some 300 search terms I give it. Can Solr do this out of the box? If not, what kind of customization would it require? Now let's say I want to further have the option to request that those terms a) must show up within the same column of an excel spreadsheet, or b) are exact matches (i.e. match on search, but not searched), or c) occur in the exact order that I specified, or d) occur contiguously and without any words in between, or e) are made up of non-word elements such as 92228345 or SJA12334. Can solr do any of these out of the box? If not, which of these tasks is relatively easy to do with some custom code, and which is not? -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
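For the "match 10 of 20 terms" case above, a sketch could be (the field name is illustrative):

  q=t1 t2 t3 ... t20&defType=dismax&qf=content&mm=10

where qf lists the fields to search and mm=10 requires at least ten of the optional clauses to match.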
Re: Converting nested data model to solr schema
During indexing the whole block (doc and its attachments) goes into a particular shard; then it can be queried on every shard and the results are merged. Btw, do you feel any problem with your current approach - query-time joins and out-of-the-box shard routing? On Tue, Jul 2, 2013 at 5:19 PM, adfel70 adfe...@gmail.com wrote: I'm not familiar with block join in lucene. I've read a bit, and I just want to make sure - do you think that when this ticket is released, it will solve the current problem of solr cloud joins? Also, can you elaborate a bit about your solution? Jack Krupansky-2 wrote: It sounds like 4.4 will have an RC next week, so the prospects for block join in 4.4 are kind of dim. I mean, such a significant feature should have more than a few days to bake before getting released. But... who knows what Yonik has planned! -- Jack Krupansky -Original Message- From: adfel70 Sent: Tuesday, July 02, 2013 7:41 AM To: solr-user@.apache Subject: Re: Converting nested data model to solr schema As you see it, does SOLR-3076 fix my problem? Is the SOLR-3076 fix getting into solr 4.4? Mikhail Khludnev wrote: On Mon, Jul 1, 2013 at 5:56 PM, adfel70 wrote: This requires me to override the solr document distribution mechanism. I fear that with this solution I may lose some of solr cloud's capabilities. It's not clear whether you are aware of http://searchhub.org/2013/06/13/solr-cloud-document-routing/ , but what you did doesn't sound scary to me. If it works, it should be fine. I'm not aware of any capabilities that you are going to lose. Obviously SOLR-3076 provides astonishing query-time performance by offloading the actual join work to index time. Check it out if your current approach turns slow. -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com -- View this message in context: http://lucene.472066.n3.nabble.com/Converting-nested-data-model-to-solr-schema-tp4074351p4074668.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://lucene.472066.n3.nabble.com/Converting-nested-data-model-to-solr-schema-tp4074351p4074696.html Sent from the Solr - User mailing list archive at Nabble.com. -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Solr large boolean filter
Roman, It's covered in http://wiki.apache.org/solr/ContentStream | For POST requests where the content-type is not application/x-www-form-urlencoded, the raw POST body is passed as a stream. So, there is no need for encoding of binary data inside the body. Regarding encoding, I have a positive experience of passing such ids encoded by vInt, but they need to be presorted. On Tue, Jul 2, 2013 at 10:46 PM, Roman Chyla roman.ch...@gmail.com wrote: Hello Mikhail, Yes, GET is limited, but POST is not - so I just wanted it to work the same way in both. But I am not sure if I am understanding your question completely. Could you elaborate on the parameters/body part? Is there no need for encoding of binary data inside the body? Or do you mean it is treated as a string? Or is it just a bytestream and other parameters are seen as strings? On a general note: my main concern was to send many ids fast. If we use ints (32bit), one can fit ~250K in one MB; with a bitset, 33 times more (sb check the numbers please :)). But certainly, if the bitset is sparse or the collection of ids is just 'a few thousands', a stream of ints/longs will be smaller and better to use. roman On Tue, Jul 2, 2013 at 2:00 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Hello Roman, Have you considered passing the long id sequence as the body and accessing it internally in solr as a content stream? It makes base64 compression unnecessary. AFAIK url length is limited somehow, anyway. On Tue, Jul 2, 2013 at 9:32 PM, Roman Chyla roman.ch...@gmail.com wrote: Wrong link to the parser, should be: https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/BitSetQParserPlugin.java On Tue, Jul 2, 2013 at 1:25 PM, Roman Chyla roman.ch...@gmail.com wrote: Hello @, This thread 'kicked' me into finishing some long-past task of sending/receiving a large boolean (bitset) filter. We have been using bitsets with solr before, but now I sat down and wrote it as a qparser. The use cases, as you have discussed, are: - the necessity to send a long list of ids as a query (where it is not possible to do it the 'normal' way) - or filtering ACLs It works in the following way: - the external application constructs a bitset and sends it as a query to solr (q or fq, depends on your needs) - solr unpacks the bitset (translating bits into lucene ids, if necessary), and wraps this into a query which then has the easy job of 'filtering' wanted/unwanted items Therefore it is good only if you can search against something that is indexed as an integer (id's often are). A simple benchmark shows acceptable performance: to send the bitset (randomly populated, 10M, with 4M bits set), it takes 110ms (25+64+20). To decode this string (resulting byte size 1.5Mb!) it takes ~90ms (5+14+68ms). I haven't tested the latency of sending it over the network or the query performance, but since the query is very similar to MatchAllDocs, it is probably very fast (and I know that sending many Mbs to Solr is fast as well). I know this is not exactly a 'standard' solution, and it is probably not something you want to see with hundreds of millions of docs, but people seem to be doing 'not the right thing' all the time;) So if you think this is something useful for the community, please let me know. If somebody would be willing to test it, I can file a JIRA ticket. Thanks!
Roman The code, if no JIRA is needed, can be found here: https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/AdsQParserPlugin.java https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java
839ms. run
154ms. Building random bitset indexSize=1000 fill=0.5 -- Size=15054208, cardinality=3934477 highestBit=999
25ms. Converting bitset to byte array -- resulting array length=125
20ms. Encoding byte array into base64 -- resulting array length=168 ratio=1.344
62ms. Compressing byte array with GZIP -- resulting array length=1218602 ratio=0.9748816
20ms. Encoding gzipped byte array into base64 -- resulting string length=1624804 ratio=1.2998432
5ms. Decoding gzipped byte array from base64
14ms. Uncompressing decoded byte array
68ms. Converting from byte array to bitset
743ms. running
On Tue, Jun 18, 2013 at 3:51 PM, Erick Erickson erickerick...@gmail.com wrote: Not necessarily. If the auth tokens are available on some other system (DB, LDAP, whatever), one could get them in the PostFilter and cache them somewhere since, presumably, they wouldn't be changing all that often. Or use
Re: What are the options for obtaining IDF at interactive speeds?
Katie, This case is actually really hard to grasp. Let me provide a counter-sample, to let you explain the problem better by spotting the gap. What if I say that debugQuery=true provides tf and idf for the terms and documents from the requested page of results? Why can't you use explain to solve the problem? On Wed, Jul 3, 2013 at 1:06 AM, Kathryn Mazaitis kathryn.riv...@gmail.com wrote: Hi, I'm using SOLRJ to run a query, with the goal of obtaining: (1) the retrieved documents, (2) the TF of each term in each document, (3) the IDF of each term in the set of retrieved documents (TF/IDF would be fine too) ...all at interactive speeds, or 10s per query. This is a demo, so if all else fails I can adjust the corpus, but I'd rather, y'know, actually do it. (1) and (2) are working; I completed the patch posted in the following issue: https://issues.apache.org/jira/browse/SOLR-949 and am just setting tv=true&tv.tf=true for my query. This way I get the documents and the tf information all in one go. With (3) I'm running into trouble. I have found 2 ways to do it so far: Option A: set tv.df=true or tv.tf_idf=true for my query, and get the idf information along with the documents and tf information. Since each term may appear in multiple documents, this means retrieving idf information for each term about 20 times, and takes over a minute to do. Option B: After I've gathered the tf information, run through the list of terms used across the set of retrieved documents, and for each term, run a query like: q={!func}idf(text,'the_term')&defType=func&fl=score&rows=1 ...while this retrieves idf information only once for each term, the added latency of doing that many queries piles up to almost two minutes on my current corpus. Is there anything I didn't think of -- a way to construct a query to get idf information for a set of terms all in one go, outside the bounds of what terms happen to be in a document? Failing that, does anyone have a sense for how far I'd have to scale down a corpus to approach interactive speeds, if I want this sort of data? Katie -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
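For example (a sketch): an ordinary query with debugQuery=true returns an explain section listing tf(...) and idf(...) per matched term for each returned document:

  http://localhost:8983/solr/select?q=text:(term1 term2)&rows=10&debugQuery=true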
Re: Performance of cross join vs block join
Mihaela, For me it's reasonable that a single-core join takes the same time as a cross-core one; I just can't see what gain could be obtained in the former case. I'm hardly able to comment on the join code; I looked into it, and it's not trivial, at least. With block join it doesn't need to obtain parentId term values/numbers and look up parents by them - both of these actions are expensive. Also, blockjoin works as an iterator, but join needs to allocate memory for the parents bitset and populate it out of order, which impacts scalability. Also, in None scoring mode BJQ doesn't need to walk through all children, but only hits the first one. Also, a nice feature is the 'both-side leapfrog': if you have a highly restrictive filter/query intersecting with BJQ, it allows skipping many parents and children as well; that's not possible in Join, which has a fairly 'full-scan' nature. The main performance factor for Join is the number of child docs. I'm not sure I got all your questions; please specify them in more detail if something is still unclear. Have you seen my benchmark http://blog.griddynamics.com/2012/08/block-join-query-performs.html ? On Thu, Jul 11, 2013 at 1:52 PM, mihaela olteanu mihaela...@yahoo.com wrote: Hello, Does anyone know about some measurements in terms of performance for cross joins compared to joins inside a single index? Is the join inside a single index that stores all documents of various types (from the parent table or from children tables) with a discriminator field faster compared to the cross join (basically in this case each document type resides in its own index)? I have performed some tests but to me it seems that having a join in a single index (bigger index) does not add too much speed improvement compared to cross joins. Why would a block join be faster than a cross join if this is the case? What are the variables that count when trying to improve the query execution time? Thanks! Mihaela -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: How to make 'fq' optional?
https://lucene.apache.org/solr/4_2_0/solr-core/org/apache/solr/search/SwitchQParserPlugin.html Hoss cares about you! (a sketch follows below) On Wed, Jul 10, 2013 at 10:40 PM, Learner bbar...@gmail.com wrote: I am trying to make a variable in fq optional, Ex: /select?first_name=peter&fq=$first_name&q=*:* I don't want the above query to throw an error or die whenever the variable first_name is not passed to the query; instead it should return the value corresponding to the rest of the query. I can use switch but it's difficult to handle each and every case using switch (as I need to handle switch for so many variables)... Is there a way to resolve this some other way? -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-make-fq-optional-tp4077042.html Sent from the Solr - User mailing list archive at Nabble.com. -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
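A sketch along the lines of the SwitchQParserPlugin javadoc, using appends in the request handler (the fq_first_name indirection is illustrative): an absent or empty first_name falls into the bare case and becomes a match-all filter.

  <lst name="appends">
    <str name="fq">{!switch case='*:*' default=$fq_first_name v=$first_name}</str>
    <str name="fq_first_name">{!field f=first_name v=$first_name}</str>
  </lst>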
Re: Performance of cross join vs block join
On Fri, Jul 12, 2013 at 12:19 PM, mihaela olteanu mihaela...@yahoo.com wrote: Hi Mikhail, I used the term block join wrongly. When I said block join I was referring to a join performed on a single core, versus a cross join which was performed on multiple cores. But I saw your benchmark (from cache) and it seems that block join has better performance. Is this functionality available in Solr 4.3.1? Nope. SOLR-3076 has been waiting for ages. I did not find such examples on Solr's wiki page. Does this functionality require a special schema, or special indexing? Special indexing - yes. How would I need to index the data from my tables? In my case all the indices have a common schema anyway, since I am using dynamic fields; thus I can easily add all documents from all tables into one Solr core, adding a discriminator field to each document. Correct, but the notion of 'discriminator field' is a little bit different for blockjoin. Could you point me to some more documentation? I can recommend only these: http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html http://www.youtube.com/watch?v=-OiIlIijWH0 Thanks in advance, Mihaela From: Mikhail Khludnev mkhlud...@griddynamics.com To: solr-user solr-user@lucene.apache.org; mihaela olteanu mihaela...@yahoo.com Sent: Thursday, July 11, 2013 2:25 PM Subject: Re: Performance of cross join vs block join Mihaela, For me it's reasonable that a single-core join takes the same time as a cross-core one; I just can't see what gain could be obtained in the former case. I'm hardly able to comment on the join code; I looked into it, and it's not trivial, at least. With block join it doesn't need to obtain parentId term values/numbers and look up parents by them - both of these actions are expensive. Also, blockjoin works as an iterator, but join needs to allocate memory for the parents bitset and populate it out of order, which impacts scalability. Also, in None scoring mode BJQ doesn't need to walk through all children, but only hits the first one. Also, a nice feature is the 'both-side leapfrog': if you have a highly restrictive filter/query intersecting with BJQ, it allows skipping many parents and children as well; that's not possible in Join, which has a fairly 'full-scan' nature. The main performance factor for Join is the number of child docs. I'm not sure I got all your questions; please specify them in more detail if something is still unclear. Have you seen my benchmark http://blog.griddynamics.com/2012/08/block-join-query-performs.html ? On Thu, Jul 11, 2013 at 1:52 PM, mihaela olteanu mihaela...@yahoo.com wrote: Hello, Does anyone know about some measurements in terms of performance for cross joins compared to joins inside a single index? Is the join inside a single index that stores all documents of various types (from the parent table or from children tables) with a discriminator field faster compared to the cross join (basically in this case each document type resides in its own index)? I have performed some tests but to me it seems that having a join in a single index (bigger index) does not add too much speed improvement compared to cross joins. Why would a block join be faster than a cross join if this is the case? What are the variables that count when trying to improve the query execution time? Thanks! Mihaela -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Performance of cross join vs block join
Hello Roman, Thanks for your interest. I briefly looked at your approach, and I'm really interested in your numbers. Here is the trivial code; I'd rather rely on your testing framework, and I can provide you a version of Solr 4.2 with SOLR-3076 applied. Do you need it? https://github.com/m-khl/join-tester What you are saying about benchmark representativeness definitely makes sense. I didn't try to establish a completely, absolutely representative benchmark. I just wanted to have rough numbers related to my use case, certainly. I'm from eCommerce; that volume was enough for me. What I didn't get is 'not the block joins, because these cannot be used for citation data - we cannot reasonably index them into one segment'. Usually there is no problem with blocks in a multi-segment index; a block definitely can't span across segments. Anyway, please elaborate. One of the block join benefits is the ability to hit only the first matched child in a group and jump over the following ones. It isn't applicable in general, but sometimes gets a huge gain. On Fri, Jul 12, 2013 at 8:29 PM, Roman Chyla roman.ch...@gmail.com wrote: Hi Mikhail, I have commented on your blog, but it seems I have done something wrong, as the comment is not there. Would it be possible to share the test setup (script)? I have found out that the crucial thing with joins is the number of 'joins' [hits returned] and it seems that the experiments I have seen so far were geared towards small collections - even if Erick's index was 26M, the number of hits was probably small - you can see a very different story if you face some [other] real data. Here is a citation network where I was comparing lucene joins [ie not the block joins, because these cannot be used for citation data - we cannot reasonably index them into one segment]) https://github.com/romanchyla/r-ranking-fun/blob/master/plots/raw/comparison-join-2nd.png Notice, the y axis is sqrt, so the running time for lucene join is growing and growing very fast! It takes lucene 30s to do the search that selects 1M hits. The comparison is against our own implementation of a similar search - but the main point I am making is that join benchmarks should be showing the number of hits selected by the join operation. Otherwise, a very important detail is hidden. Best, roman On Fri, Jul 12, 2013 at 4:57 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: On Fri, Jul 12, 2013 at 12:19 PM, mihaela olteanu mihaela...@yahoo.com wrote: Hi Mikhail, I used the term block join wrongly. When I said block join I was referring to a join performed on a single core, versus a cross join which was performed on multiple cores. But I saw your benchmark (from cache) and it seems that block join has better performance. Is this functionality available in Solr 4.3.1? Nope. SOLR-3076 has been waiting for ages. I did not find such examples on Solr's wiki page. Does this functionality require a special schema, or special indexing? Special indexing - yes. How would I need to index the data from my tables? In my case all the indices have a common schema anyway, since I am using dynamic fields; thus I can easily add all documents from all tables into one Solr core, adding a discriminator field to each document. Correct, but the notion of 'discriminator field' is a little bit different for blockjoin. Could you point me to some more documentation?
I can recommend only those http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html http://www.youtube.com/watch?v=-OiIlIijWH0 Thanks in advance, Mihaela From: Mikhail Khludnev mkhlud...@griddynamics.com To: solr-user solr-user@lucene.apache.org; mihaela olteanu mihaela...@yahoo.com Sent: Thursday, July 11, 2013 2:25 PM Subject: Re: Performance of cross join vs block join Mihaela, For me it's reasonable that single core join takes the same time as cross core one. I just can't see which gain can be obtained from in the former case. I hardly able to comment join code, I looked into, it's not trivial, at least. With block join it doesn't need to obtain parentId term values/numbers and lookup parents by them. Both of these actions are expensive. Also blockjoin works as an iterator, but join need to allocate memory for parents bitset and populate it out of order that impacts scalability. Also in None scoring mode BJQ don't need to walk through all children, but only hits first. Also, nice feature is 'both side leapfrog' if you have a highly restrictive filter/query intersects with BJQ, it allows to skip many parents and children as well, that's not possible in Join, which has fairly 'full-scan' nature. Main performance factor for Join is number of child docs. I'm not sure I got all your questions, please specify them in more
Re: Nested query in SOLR filter query (fq)
Hello, it sounds like FieldCollapsing or Join scenarios, but given the only information which you provided, it can be solved by indexing statuses as a multivalued field: -ID- -STATUS- id1 (1 2 3 4) id2 (1 2) id3 (1) and then q=*:*&fq=STATUS:1&fq=-STATUS:3 On Mon, Jul 15, 2013 at 3:19 PM, EquilibriumCST valeri_ho...@abv.bg wrote: Hi all, I have the following case. Solr documents have the fields -- id and status. Id is not unique. Unique is the combination of these two elements. Documents with the same id have different statuses. List of documents (-ID- -STATUS- pairs): id1 1, id1 2, id1 3, id1 4, id2 1, id2 2, id3 1. I need to make a query that takes all documents with a specific status and excludes documents whose id also carries another specific status. As an example I need to get all documents with status 2 that don't have status 3. The expected result should be the document: id2 2. Another example: all documents with status 1 that don't have status 3. Then the result should be: id2 1, id3 1. Here is my query that doesn't work: http://192.168.130.14:13080/solr/select/?q=status:1&version=2.2&start=0&rows=10&indent=on&fl=id,status&fq=-id:(*:*%20AND%20status:2) The problem is in the filter query (fq) part. In fq there must be the ids of the documents with status 2, and if the current document id is in this list it should be excluded. I guess some subquery must be used in the fq part or something else. Just for information, we are using APACHE SOLR 3.6 and the document count is around 100k. Thanks in advance! -- View this message in context: http://lucene.472066.n3.nabble.com/Nested-query-in-SOLR-filter-query-fq-tp4078020.html Sent from the Solr - User mailing list archive at Nabble.com. -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: short-circuit OR operator in lucene/solr
Short answer: no - it makes zero sense. But after some thinking, it can make some sense, potentially. DisjunctionSumScorer holds child scorers semi-ordered in a binary heap. Hypothetically, such an inequality could be enforced on that heap, but the heap might not work anymore for such an alignment. Hence, instead of the heap, a TreeSet could be used for an experiment. fwiw, it's a dev-list question. On Mon, Jul 22, 2013 at 4:48 AM, Deepak Konidena deepakk...@gmail.com wrote: I understand that lucene's AND (&&), OR (||) and NOT (!) operators are shorthands for REQUIRED, OPTIONAL and EXCLUDE respectively, which is why one can't treat them as boolean operators (adhering to boolean algebra). I have been trying to construct a simple OR expression, as follows q = +(field1:value1 OR field2:value2) with a match on either field1 or field2. But since the OR is merely optional, for documents where both field1:value1 and field2:value2 match, the query returns a score resulting from a match on both clauses. How do I enforce short-circuiting in this context? In other words, how do I implement short-circuiting as in boolean algebra, where an expression A || B || C returns true if A is true without even looking into whether B or C could be true? -Deepak -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Appending *-wildcard suffix on all terms for querying: move logic from client to server side
It can be done by extending LuceneQParser/SolrQueryParser, see http://wiki.apache.org/solr/SolrPlugins#QParserPlugin There is newTermQuery(Term); it should be overridden to delegate to the newPrefixQuery() method (a sketch follows below). Overall, I suggest you consider using EdgeNGramTokenFilter at index time, and then searching with plain term queries. On Tue, Jul 23, 2013 at 2:05 PM, Paul Blanchaert p...@amosis.eu wrote: My client has an installation with 3 different clients using the same Solr index. These clients all append a * wildcard suffix in the query: the user enters abc def while the search is performed against (abc* def*). In order to move away from this way of searching, we'd like to move the clients away from this wildcard search at the moment we implement a new index. However, at that time, the client apps will still need to use this wildcard suffix search. So the goal is to have the wildcard search option (appending the * suffix when not yet set) configurable on the server side. I thought a tokenizer would do the work, but as wildcard searches are detected before analyzers do their work, this is not an option. Can I enable this without coding? Or should I use a (custom) function query or a custom search handler? Any thought is appreciated. - Kind regards, Paul Blanchaert -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
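A rough sketch of the parser route, assuming Solr 4.x APIs (untested; it constructs PrefixQuery directly, instead of delegating to the parser's newPrefixQuery(), to keep the sketch short):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.PrefixQuery;
  import org.apache.lucene.search.Query;
  import org.apache.solr.common.params.SolrParams;
  import org.apache.solr.common.util.NamedList;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.search.QParser;
  import org.apache.solr.search.QParserPlugin;
  import org.apache.solr.search.SolrQueryParser;
  import org.apache.solr.search.SyntaxError;

  public class SuffixWildcardQParserPlugin extends QParserPlugin {
    @Override
    public void init(NamedList args) {}

    @Override
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
      return new QParser(qstr, localParams, params, req) {
        @Override
        public Query parse() throws SyntaxError {
          SolrQueryParser parser =
              new SolrQueryParser(this, req.getSchema().getDefaultSearchFieldName()) {
            @Override
            protected Query newTermQuery(Term term) {
              // treat every bare term as term* (a prefix query)
              return new PrefixQuery(term);
            }
          };
          return parser.parse(qstr);
        }
      };
    }
  }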
Re: How to make soft commit more reliable?
Hello, First of all, I don't think it can commit (even soft) every second; afaik that's too frequent for a typical deployment. Hence, if you really need such (N)RT I suggest you experiment with it right now, to face the bummer sooner. Also, one-second durability sounds like an over-expectation for Solr; it sounds like OLTP requirements. Then, Solr now has a sort of pre-indexing record storage called the UpdateLog. Try to experiment with syncLevel = FSYNC vs FLUSH (a config sketch follows below). That's how it works: when a document arrives for indexing it's written into the update log, which is a plain binary file. Indexing works as-is, relying on the RAM buffer. When the node dies, the RAM buffer dies, but the updateLog is persistent; during startup Solr recovers uncommitted updates from the updateLog. Caveat! UpdateLog has a HashMap internally which easily hits OOM on rare commits. On Wed, Jul 24, 2013 at 2:56 AM, SolrLover bbar...@gmail.com wrote: Currently I am using SOLR 3.5.X and I push updates to SOLR via a queue (ActiveMQ) and perform a hard commit every 30 minutes (since my index is relatively big, around 30 million documents). I am thinking of using soft commit to implement NRT search but I am worried about the reliability. For ex: If I have the hard autocommit set to 10 minutes and a softcommit every second, new documents will show up every second, but in case of a JVM crash or the power going out I will lose all the documents after the last hard commit. I was thinking of using a backup database or another SOLR index that I can use as a backup and writing the document from the queue in both places (one with soft commit, another index with just the push updates with normal hard commits (or) writing simultaneously to a db and deleting the rows once the hard commit is successful after making sure that we didn't lose any records). Does someone have any other idea to improve the reliability of the push updates when using soft commit? -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-make-soft-commit-more-reliable-tp4079892.html Sent from the Solr - User mailing list archive at Nabble.com. -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
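A sketch of the relevant solrconfig.xml pieces (the values are illustrative; the syncLevel init arg is an assumption to verify against your Solr version):

  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
      <str name="syncLevel">FSYNC</str> <!-- vs FLUSH or NONE -->
    </updateLog>
    <autoCommit>
      <maxTime>600000</maxTime> <!-- hard commit every 10 minutes -->
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>1000</maxTime> <!-- soft commit every second -->
    </autoSoftCommit>
  </updateHandler>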
Re: Processing a lot of results in Solr
Roman, Can you disclose how that streaming writer works? What does it stream, a docList or a docSet? Thanks On Wed, Jul 24, 2013 at 5:57 AM, Roman Chyla roman.ch...@gmail.com wrote: Hello Matt, You can consider writing a batch processing handler, which receives a query and, instead of sending results back, writes them into a file which is then available for streaming (it has its own UUID). I am dumping many GBs of data from solr in a few minutes - your query + streaming writer can go a very long way :) roman On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber mlie...@impetus.com wrote: Hello Solr users, Question regarding processing a lot of docs returned from a query; I potentially have millions of documents returned back from a query. What is the common design to deal with this? 2 ideas I have are: - create a client service that is multithreaded to handle this - Use the Solr pagination to retrieve a batch of rows at a time (start, rows in Solr Admin console) Any other ideas that I may be missing? Thanks, Matt -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Processing a lot of results in Solr
fwiw, I did a prototype with the following differences: - it streams straight to the socket output stream - it streams on the go during collecting, without the necessity to store a bitset. It might have some limited, extreme usage. Is anyone interested? On Wed, Jul 24, 2013 at 7:19 PM, Roman Chyla roman.ch...@gmail.com wrote: On Tue, Jul 23, 2013 at 10:05 PM, Matt Lieber mlie...@impetus.com wrote: That sounds like a satisfactory solution for the time being - I am assuming you dump the data from Solr in a csv format? JSON How did you implement the streaming processor? (what tool did you use for this? Not familiar with that) this is what dumps the docs: https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/response/JSONDumper.java it is called by one of our batch processors, which can pass it a bitset of recs https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchProviderDumpIndex.java as far as streaming is concerned, we were all very nicely surprised; a few GB file (on a local network) took a ridiculously short time - in fact, a colleague of mine was assuming it was not working, until we looked into the downloaded file ;-), you may want to look at line 463 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchHandler.java roman You say it takes only a few minutes to dump the data - how long does it take to stream it back in, is the performance acceptable (~ within minutes)? Thanks, Matt On 7/23/13 6:57 PM, Roman Chyla roman.ch...@gmail.com wrote: Hello Matt, You can consider writing a batch processing handler, which receives a query and, instead of sending results back, writes them into a file which is then available for streaming (it has its own UUID). I am dumping many GBs of data from solr in a few minutes - your query + streaming writer can go a very long way :) roman On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber mlie...@impetus.com wrote: Hello Solr users, Question regarding processing a lot of docs returned from a query; I potentially have millions of documents returned back from a query. What is the common design to deal with this? 2 ideas I have are: - create a client service that is multithreaded to handle this - Use the Solr pagination to retrieve a batch of rows at a time (start, rows in Solr Admin console) Any other ideas that I may be missing? Thanks, Matt
-- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Synonym Phrase
Hello, As far as I know http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ has some usage in the industry. On Fri, Jul 26, 2013 at 8:28 PM, Jack Krupansky j...@basetechnology.com wrote: Hmmm... Actually, I think there was also a solution where you could specify an alternate tokenizer for the synonym file which would not tokenize on space, so that the full phrase would be passed to the query parser/generator as a single term so that it would generate a phrase (if you have the autoGeneratePhraseQueries attribute of the field type set to true.) But I don't recall the details... and it's not the default, which maybe it should be. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Friday, July 26, 2013 12:18 PM To: solr-user@lucene.apache.org Subject: Re: Synonym Phrase Why does Solr not split the terms by ';'? I think that it splits both by ';' and the whitespace character? 2013/7/26 Jack Krupansky j...@basetechnology.com Well, that's one of the areas where Solr synonym support breaks down. The LucidWorks Search query parser has a proprietary solution for that problem, but it won't help you with bare Solr. Some people have used shingles. In short, for query-time synonym phrases your best bet is to parse the query at the application level and generate a Solr query that has the synonyms pre-expanded. Application preprocessing could be as simple as scanning for the synonym phrases and then adding OR terms for the synonym phrases. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Friday, July 26, 2013 10:53 AM To: solr-user@lucene.apache.org Subject: Synonym Phrase I have a synonyms file like this: cart; shopping cart; market trolley When I analyze my query I see that when I search cart these become synonyms: cart, shopping, market, trolley - so cart is a synonym of shopping. How should I define my synonyms.txt file so that it will understand that cart is a synonym of shopping cart? -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: paging vs streaming. spawn from (Processing a lot of results in Solr)
Otis, You gave links to 'deep paging' when I asked about response streaming. Let me understand. From my POV, deep paging is a special case of regular search scenarios. We definitely need it in Solr. However, if we are talking about data-analytics-like problems, where we need to select an endless stream of responses (or store them in a file, as Roman did), 'deep paging' is a suboptimal hack. What's your vision on this?
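To make the trade-off concrete, this is what the deep-paging approach looks like from a SolrJ client (a sketch; exact SolrQuery method names vary slightly between SolrJ versions):

    import org.apache.solr.client.solrj.SolrQuery;

    // Page 10,000 of a 100-rows-per-page walk over the result set.
    SolrQuery q = new SolrQuery("*:*");
    q.setSort("id", SolrQuery.ORDER.asc); // a stable sort is required to page reliably
    q.setStart(1_000_000);
    q.setRows(100);
    // Each such request makes the server rank and keep a heap of
    // start+rows = 1,000,100 entries just to return the last 100.

Repeating that request for every page of a multi-million-doc result set is the "suboptimal hack" being criticized: the per-request cost grows with how deep you have paged, whereas a streamed response pays a flat cost per document.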
Re: paging vs streaming. spawn from (Processing a lot of results in Solr)
Roman, Let me briefly explain the design. A special RequestParser stores the servlet output stream into the context https://github.com/m-khl/solr-patches/compare/streaming#L7R22 then a special component injects a special PostFilter/DelegatingCollector which writes right into the output https://github.com/m-khl/solr-patches/compare/streaming#L2R146 here is how it streams the doc; you can see it's lazy enough https://github.com/m-khl/solr-patches/compare/streaming#L2R181 Note that it disables later collectors https://github.com/m-khl/solr-patches/compare/streaming#L2R57 hence no facets with streaming yet, but also no memory consumption. This test shows how it works https://github.com/m-khl/solr-patches/compare/streaming#L15R115 all other code is purposed for distributed search. On Sat, Jul 27, 2013 at 4:44 PM, Roman Chyla roman.ch...@gmail.com wrote: Mikhail, If your solution gives lazy loading of solr docs /and thus streaming of huge result lists/ it should be a big YES! Roman On 27 Jul 2013 07:55, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Otis, You gave links to 'deep paging' when I asked about response streaming. Let me understand. From my POV, deep paging is a special case of regular search scenarios. We definitely need it in Solr. However, if we are talking about data-analytics-like problems, where we need to select an endless stream of responses (or store them in a file, as Roman did), 'deep paging' is a suboptimal hack. What's your vision on this? -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
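For orientation, the core of that design is roughly the following (a simplified sketch, not the actual patch; Solr's DelegatingCollector API has changed across versions, and the writer wiring is hypothetical):

    import java.io.IOException;
    import java.io.Writer;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.solr.search.DelegatingCollector;

    // Writes each matching (global) doc id straight to the response output
    // as it is collected, and never delegates to the downstream collectors -
    // hence no facets, but also no per-document memory accumulation.
    public class StreamingDocCollector extends DelegatingCollector {
      private final Writer out; // stored in the request context by the RequestParser
      private int docBase;

      public StreamingDocCollector(Writer out) { this.out = out; }

      @Override
      protected void doSetNextReader(LeafReaderContext context) throws IOException {
        docBase = context.docBase;
        // intentionally not calling super: later collectors stay disabled
      }

      @Override
      public void collect(int doc) throws IOException {
        out.write(Integer.toString(docBase + doc));
        out.write('\n');
      }
    }

The essential property is in collect(): the document leaves the server the moment it matches, so the response size is unbounded by heap. The price, as stated above, is that nothing downstream (faceting, ranking) ever sees the hits.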
Re: paging vs streaming. spawn from (Processing a lot of results in Solr)
Hello, Please find my reply below. Let me just explain better what I found when I dug inside Solr: documents (the results of the query) are loaded before they are passed into a writer - so the writers expect to encounter the Solr documents, but these documents were loaded by one of the components before rendering - so it is kinda 'hard-coded'. Here is the code https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java#L445 which pulls documents into the document cache. To achieve your goal you can try removing the document cache, or disabling lazy field loading. But if Solr were NOT loading these docs before passing them to a writer, the writer could load them instead (hence lazy loading; the difference is in the numbers - it could deal with hundreds of thousands of docs, instead of a few thousand now). Anyway, even if the writer pulls docs one by one, it doesn't allow streaming a billion of them. Solr writes out a DocList, which is really problematic even in deep-paging scenarios. roman On Sat, Jul 27, 2013 at 3:52 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Roman, Let me briefly explain the design. A special RequestParser stores the servlet output stream into the context https://github.com/m-khl/solr-patches/compare/streaming#L7R22 then a special component injects a special PostFilter/DelegatingCollector which writes right into the output https://github.com/m-khl/solr-patches/compare/streaming#L2R146 here is how it streams the doc; you can see it's lazy enough https://github.com/m-khl/solr-patches/compare/streaming#L2R181 Note that it disables later collectors https://github.com/m-khl/solr-patches/compare/streaming#L2R57 hence no facets with streaming yet, but also no memory consumption. This test shows how it works https://github.com/m-khl/solr-patches/compare/streaming#L15R115 all other code is purposed for distributed search. On Sat, Jul 27, 2013 at 4:44 PM, Roman Chyla roman.ch...@gmail.com wrote: Mikhail, If your solution gives lazy loading of solr docs /and thus streaming of huge result lists/ it should be a big YES! Roman On 27 Jul 2013 07:55, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Otis, You gave links to 'deep paging' when I asked about response streaming. Let me understand. From my POV, deep paging is a special case of regular search scenarios. We definitely need it in Solr. However, if we are talking about data-analytics-like problems, where we need to select an endless stream of responses (or store them in a file, as Roman did), 'deep paging' is a suboptimal hack. What's your vision on this? -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
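The two knobs Roman mentions both live in the query section of solrconfig.xml. A sketch of the relevant elements (the cache sizes are illustrative):

    <query>
      <!-- to experiment, shrink the documentCache or comment it out entirely -->
      <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
      <!-- when true, stored fields that were not requested are loaded lazily;
           set to false to load each document in full on first access -->
      <enableLazyFieldLoading>false</enableLazyFieldLoading>
    </query>

Whether removing the cache actually helps depends on the workload: it avoids holding loaded documents in heap during a huge dump, but normal search requests lose the cache-hit benefit, so this is a per-deployment trade-off rather than a general fix.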
Re: paging vs streaming. spawn from (Processing a lot of results in Solr)
On Sun, Jul 28, 2013 at 1:25 AM, Yonik Seeley yo...@lucidworks.com wrote: Which part is problematic... the creation of the DocList (the search), Literally, DocList is a copy of TopDocs. Creating TopDocs is not search but ranking, and ranking costs log(rows+start) per hit, on top of the numFound-proportional cost which the search itself takes. Interestingly, we still pay that log() even if we ask for collecting docs as-is with _docid_. or its memory requirements (an int per doc)? TopXxxCollectors as well as XxxComparators allocate the same [rows+start]. It's clear that once we have deep paging, we only need to handle heaps of size rows (without start). That's fairly OK if we use Solr as a site-navigation engine, but it's 'sub-optimal' for data-analytics use cases, where we need something like SELECT * FROM ... in an RDBMS. In that case any per-row memory allocation on a billion-doc index is a bummer. That's why I'm asking about removing the heap-based collector/comparator. -Yonik http://lucidworks.com -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
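To put a number on the allocation being discussed, here is the Lucene-side shape of a deep-paged request (a sketch; TopScoreDocCollector.create has taken different parameters across Lucene versions):

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.TopScoreDocCollector;

    // start=1,000,000, rows=100: the collector's priority queue is sized
    // start+rows up front, and every competitive hit pays an
    // O(log(start+rows)) heap update - even though only the last 100
    // entries are ever returned to the client.
    static TopDocs deepPage(IndexSearcher searcher, Query query, int start, int rows)
        throws IOException {
      TopScoreDocCollector collector = TopScoreDocCollector.create(start + rows);
      searcher.search(query, collector);
      return collector.topDocs(start, rows);
    }

This is the heap-based collector Mikhail wants to be able to remove: for a SELECT-*-style scan it buys nothing, since no top-N ranking is wanted in the first place.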
Re: DIH to index the data - 250 millions - Need a best architecture
Mishra, What if you set up DIH with a single SqlEntityProcessor without caching - does that work for you? On Mon, Jul 29, 2013 at 4:00 PM, Santanu8939967892 mishra.sant...@gmail.com wrote: Hi, I have a huge volume of DB records, close to 250 million. I am going to use DIH to index the data into Solr. I need the best architecture to index and query the data in an efficient manner. I am using Windows Server 2008 with 16 GB RAM, a Xeon processor and Solr 4.4. With Regards, Santanu
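For reference, the single-entity setup Mikhail suggests is a flat db-data-config.xml with one SELECT and no sub-entities or caches. A minimal sketch, assuming a MySQL source; the driver, URL, credentials and column names are placeholders:

    <dataConfig>
      <dataSource driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://dbhost/mydb" user="solr" password="***"
                  batchSize="-1"/>  <!-- with the MySQL driver, batchSize="-1"
                                         streams rows instead of buffering the
                                         whole result set in memory -->
      <document>
        <entity name="rec" processor="SqlEntityProcessor"
                query="SELECT id, title, body FROM records">
          <field column="id" name="id"/>
          <field column="title" name="title"/>
          <field column="body" name="body"/>
        </entity>
      </document>
    </dataConfig>

At 250 million rows, a single streamed SELECT with no per-row lookups or entity caches is usually the first thing to try, since each sub-entity would otherwise multiply the number of JDBC round-trips or the cache memory.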
Pentaho Kettle vs DIH
Hello, Does anyone have experience with using Pentaho Kettle for processing RDBMS records and pouring them into Solr? Isn't it some sort of replacement for the DIH? -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: How might one search for dupe IDs other than faceting on the ID field?
Dotan, Could you please provide more lines of the stack trace? I have no idea why it got worse in 4.3. I know that 4.3 can use facets backed by DocValues, which are modest with the heap, but from what I saw (I can be wrong) that's disabled for numeric facets. Hence, I can suggest reindexing id as string docvalues and relying on those. However, it's doubtful you'd reindex everything without strong guarantees. Also, I checked the source code of http://wiki.apache.org/solr/TermsComponent and found that it can be really memory-modest (i.e. without sort or limit). Be aware that the df-s returned by that component are unaware of deleted documents, hence run expungeDeletes beforehand. On Tue, Jul 30, 2013 at 10:16 PM, Dotan Cohen dotanco...@gmail.com wrote: To search for duplicate IDs, I am running the following query: select?q=*:*&facet=true&facet.field=id&rows=0 However, since upgrading from Solr 4.1 to Solr 4.3 I am receiving OutOfMemoryError errors instead of the desired facet: <response><lst name="error"><str name="msg">java.lang.OutOfMemoryError: Java heap space</str><str name="trace">java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155) at ... Might there be a less resource-intensive way to get this information? This is Solr 4.3 running on Ubuntu Server 12.04 in Jetty. The index has over 100,000,000 small records, for a total of about 95 GiB of disk space, with Solr running on its own disk. Actually, the 'disk' is an Amazon Web Services EBS volume. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
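A sketch of the TermsComponent approach from SolrJ, assuming a /terms request handler is configured as in the default solrconfig.xml (the method names are standard SolrQuery setters, but check your SolrJ version):

    import org.apache.solr.client.solrj.SolrQuery;

    public class DupeIdQuery {
      // Ask the TermsComponent for ids that occur in 2+ documents, in index
      // order and without a limit, so no big sort buffer is kept in memory.
      public static SolrQuery build() {
        SolrQuery q = new SolrQuery();
        q.setRequestHandler("/terms");
        q.setTerms(true);
        q.addTermsField("id");
        q.setTermsMinCount(2);          // a term with df >= 2 is a duplicate id
        q.setTermsLimit(-1);            // no top-N heap
        q.setTermsSortString("index");  // index order, no sorting
        return q;
      }
    }

As noted above, the returned document frequencies still count deleted documents, so on an index with many updates an expungeDeletes (or optimize) should be run first, or a df of 2 may simply mean one live doc plus its deleted predecessor.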