Re: How to return score without using _val_
I know that _val_ is the only thing influencing the score. The fq is just to limit also by those queries. What I am asking is whether it is possible to influence the score using _val_ but not in the q parameter? Something like:

bq=_val_:"{!type=dismax qf=$qqf v=$qspec}" _val_:"{!type=dismax qt=dismaxname v=$qname}"

Is there something like that?

On 4/21/11 2:45 AM, "Em" wrote:
>Hi,
>
>I agree with Yonik here - I do not understand what you would like to do as
>well.
>But some additional note from my side:
>Your FQs never influence the score! Of course you can specify the same
>query twice, once as a filter query and once as a regular query, but I do
>not see the reason to do so. It sounds like unnecessary effort without a
>win.
>
>Regards,
>Em
>
>
>Bill Bell wrote:
>>
>> I would like to influence the score but I would rather not mess with the
>> q= field since I want the query to be dismax for Q.
>>
>> Something like:
>>
>> fq={!type=dismax qf=$qqf v=$qspec}&
>> fq={!type=dismax qt=dismaxname v=$qname}&
>> q=_val_:"{!type=dismax qf=$qqf v=$qspec}" _val_:"{!type=dismax
>> qt=dismaxname v=$qname}"
>>
>> Is there a way to do a filter and add the FQ to the score by doing it
>> another way?
>>
>> Also does this do multiple queries? Is this the right way to do it?
>>
>
>
>--
>View this message in context:
>http://lucene.472066.n3.nabble.com/How-to-return-score-without-using-val-tp2841443p2846317.html
>Sent from the Solr - User mailing list archive at Nabble.com.
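A sketch of what the bq route could look like, reusing the $qqf/$qspec/$qname parameter names from the example above. This is hedged: $qname_qf is a hypothetical qf parameter standing in for the qt=dismaxname reference, and whether your handler version parses nested {!...} queries inside bq should be verified before relying on it:

```text
q  = {!type=dismax qf=$qqf v=$qspec}
bq = {!type=dismax qf=$qqf v=$qspec}
bq = {!type=dismax qf=$qname_qf v=$qname}
fq = {!type=dismax qf=$qqf v=$qspec}
fq = {!type=dismax qf=$qname_qf v=$qname}
```

The idea is that fq keeps restricting the result set while each bq adds its nested query's score contribution, leaving q itself free for the main dismax query.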
Re: term position question from analyzer stack for WordDelimiterFilterFactory
On Thu, Apr 21, 2011 at 8:06 PM, Robert Petersen wrote: > So if I don't put preserveOriginal=1 in my WordDelimiterFilterFactory > settings I cannot get a match between AppleTV on the indexing side and > appletv on the search side. Hmmm, that shouldn't be the case. The "text" field in the solr example config doesn't use preserveOriginal, and AppleTV is indexed as appl, tv/appletv And a search for appletv does match fine. Perhaps on the search side there is actually a phrase query like "big appletv"? One workaround for that is to add a little slop... "big appletv"~1 -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
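The slop workaround above, spelled out as a query (a sketch; "text" stands in for whatever field the phrase query actually targets):

```text
q = text:"big appletv"~1
```

The ~1 allows the phrase to match even though WordDelimiterFilter's catenated token shifted "appletv" by one position at index time.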
Re: Indexing 20M documents from MySQL with DIH
Can you post the data-config.xml? Probably you didn't use batchSize.

Sent from my iPhone

On Apr 21, 2011, at 5:09 PM, Scott Bigelow wrote:
> Thanks for the e-mail. I probably should have provided more details,
> but I was more interested in making sure I was approaching the problem
> correctly (using DIH, with one big SELECT statement for millions of
> rows) instead of solving this specific problem. Here's a partial
> stacktrace from this specific problem:
>
> ...
> Caused by: java.io.EOFException: Can not read response from server.
> Expected to read 4 bytes, read 0 bytes before connection was
> unexpectedly lost.
>   at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
>   at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
> ... 22 more
> Apr 21, 2011 3:53:28 AM
> org.apache.solr.handler.dataimport.EntityProcessorBase getNext
> SEVERE: getNext() failed for query 'REDACTED'
> org.apache.solr.handler.dataimport.DataImportHandlerException:
> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
> Communications link failure
>
> The last packet successfully received from the server was 128
> milliseconds ago. The last packet sent successfully to the server was
> 25,273,484 milliseconds ago.
> ...
>
> A custom indexer, so that's a fairly common practice? So when you are
> dealing with these large indexes, do you try not to fully rebuild them
> when you can? It's not a nightly thing, but something to do in case of
> a disaster? Is there a difference in the performance of an index that
> was built all at once vs. one that has had delta inserts and updates
> applied over a period of months?
>
> Thank you for your insight.
>
> On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter wrote:
>>
>> : For a new project, I need to index about 20M records (30 fields) and I
>> : have been running into issues with MySQL disconnects, right around
>> : 15M. I've tried several remedies I've found on blogs, changing
>>
>> if you can provide some concrete error/log messages and the details of how
>> you are configuring your datasource that might help folks provide better
>> suggestions -- you've said you run into a problem but you haven't provided
>> any details for people to go on in giving you feedback.
>>
>> : resolved the issue. It got me wondering: Is this the way everyone does
>> : it? What about 100M records up to 1B; are those all pulled using DIH
>> : and a single query?
>>
>> I've only recently started using DIH, and while it definitely has a lot
>> of quirks/annoyances, it seems like a pretty good 80/20 solution for
>> indexing with Solr -- but that doesn't mean it's perfect for all
>> situations.
>>
>> Writing custom indexer code can certainly make sense in a lot of cases --
>> particularly where you already have a data publishing system that you want
>> to tie into directly -- the trick is to ensure you have a decent strategy
>> for rebuilding the entire index should the need arise (but this is really
>> only an issue if your primary indexing solution is incremental -- many use
>> cases can be satisfied just fine with a brute force "full rebuild
>> periodically" implementation).
>>
>> -Hoss
Re: Indexing 20M documents from MySQL with DIH
Thanks for the e-mail. I probably should have provided more details, but I was more interested in making sure I was approaching the problem correctly (using DIH, with one big SELECT statement for millions of rows) instead of solving this specific problem. Here's a partial stacktrace from this specific problem: ... Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost. at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539) at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989) ... 22 more Apr 21, 2011 3:53:28 AM org.apache.solr.handler.dataimport.EntityProcessorBase getNext SEVERE: getNext() failed for query 'REDACTED' org.apache.solr.handler.dataimport.DataImportHandlerException: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure The last packet successfully received from the server was 128 milliseconds ago. The last packet sent successfully to the server was 25,273,484 milliseconds ago. ... A custom indexer, so that's a fairly common practice? So when you are dealing with these large indexes, do you try not to fully rebuild them when you can? It's not a nightly thing, but something to do in case of a disaster? Is there a difference in the performance of an index that was built all at once vs. one that has had delta inserts and updates applied over a period of months? Thank you for your insight. On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter wrote: > > : For a new project, I need to index about 20M records (30 fields) and I > : have been running into issues with MySQL disconnects, right around > : 15M. 
I've tried several remedies I've found on blogs, changing > > if you can provide some concrete error/log messages and the details of how > you are configuring your datasource that might help folks provide better > suggestions -- you've said you run into a problem but you haven't provided > any details for people to go on in giving you feedback. > > : resolved the issue. It got me wondering: Is this the way everyone does > : it? What about 100M records up to 1B; are those all pulled using DIH > : and a single query? > > I've only recently started using DIH, and while it definitely has a lot > of quirks/annoyances, it seems like a pretty good 80/20 solution for > indexing with Solr -- but that doesn't mean it's perfect for all > situations. > > Writing custom indexer code can certainly make sense in a lot of cases -- > particularly where you already have a data publishing system that you want > to tie into directly -- the trick is to ensure you have a decent strategy > for rebuilding the entire index should the need arise (but this is really > only an issue if your primary indexing solution is incremental -- many use > cases can be satisfied just fine with a brute force "full rebuild > periodically" implementation). > > > -Hoss >
term position question from analyzer stack for WordDelimiterFilterFactory
So if I don't put preserveOriginal=1 in my WordDelimiterFilterFactory settings I cannot get a match between AppleTV on the indexing side and appletv on the search side. Without that setting, the all-lowercase version of AppleTV is in term position two due to the catenateWords=1 or the catenateAll=1 settings. I am surprised. How does term position affect searching?

Here is my analysis with preserveOriginal=1 to make the lowercase form occur in both term positions 1 and 2 (offsets are shown as start,end):

Index Analyzer

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
  position 1: AppleTV (0,7)

org.apache.solr.analysis.SynonymFilterFactory {synonyms=index_synonyms.txt, expand=true, ignoreCase=true}
  position 1: AppleTV (0,7)

org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
  position 1: AppleTV (0,7)

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1, generateNumberParts=1, catenateWords=1, generateWordParts=1, catenateAll=1, catenateNumbers=1}
  positions 1-2: AppleTV (0,7), TV (5,7), Apple (0,5), AppleTV (0,7)

org.apache.solr.analysis.LowerCaseFilterFactory {}
  positions 1-2: appletv (0,7), tv (5,7), apple (0,5), appletv (0,7)

com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory {protected=protwords.txt}
  positions 1-2: appletv (0,7), tv (5,7), apple (0,5), appletv (0,7)

org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
  positions 1-2: appletv (0,7), tv (5,7), apple (0,5), appletv (0,7)

Query Analyzer

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
  position 1: appletv (0,7)

org.apache.solr.analysis.SynonymFilterFactory {synonyms=query_synonyms.txt, expand=true, ignoreCase=true}
  position 1: appletv (0,7)

org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
  position 1: appletv (0,7)

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1, generateNumberParts=1, catenateWords=1, generateWordParts=1, catenateAll=1, catenateNumbers=1}
  position 1: appletv (0,7)

org.apache.solr.analysis.LowerCaseFilterFactory {}
  position 1: appletv (0,7)

com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory {protected=protwords.txt}
  position 1: appletv (0,7)

org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
  position 1: appletv (0,7)
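For reference, the analyzer chain described in the listing above corresponds to a schema.xml fieldType roughly like the following. This is a hedged reconstruction from the analysis output: only the fieldType name and positionIncrementGap value are made up, everything else is taken from the filter listing.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
            expand="true" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- identical chain, but with synonyms="query_synonyms.txt" -->
  </analyzer>
</fieldType>
```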
Re: Indexing 20M documents from MySQL with DIH
: For a new project, I need to index about 20M records (30 fields) and I
: have been running into issues with MySQL disconnects, right around
: 15M. I've tried several remedies I've found on blogs, changing

if you can provide some concrete error/log messages and the details of how you are configuring your datasource that might help folks provide better suggestions -- you've said you run into a problem but you haven't provided any details for people to go on in giving you feedback.

: resolved the issue. It got me wondering: Is this the way everyone does
: it? What about 100M records up to 1B; are those all pulled using DIH
: and a single query?

I've only recently started using DIH, and while it definitely has a lot of quirks/annoyances, it seems like a pretty good 80/20 solution for indexing with Solr -- but that doesn't mean it's perfect for all situations.

Writing custom indexer code can certainly make sense in a lot of cases -- particularly where you already have a data publishing system that you want to tie into directly -- the trick is to ensure you have a decent strategy for rebuilding the entire index should the need arise (but this is really only an issue if your primary indexing solution is incremental -- many use cases can be satisfied just fine with a brute force "full rebuild periodically" implementation).

-Hoss
Re: Highest frequency terms for a subset of documents
Ok, thanks On Friday, April 22, 2011, Yonik Seeley wrote: > On Thu, Apr 21, 2011 at 6:50 PM, Ofer Fort wrote: >> Ok, I'll give it a try, as this is a server I am willing to risk. >> How is the competability between solrj of bulkpostings, trunk, 3.1 and 1.4.1? > > bulkpostings, trunk, and 3.1 should all be relatively solrj > compatible. But the SolrJ javabin format (used by default for > queries) changed for strings between 1.4.1 and 3.1 (SOLR-2034). > > -Yonik > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May > 25-26, San Francisco > > >> On Friday, April 22, 2011, Yonik Seeley wrote: >>> On Thu, Apr 21, 2011 at 6:34 PM, Ofer Fort wrote: So I'm guessing my best approach now would be to test trunk, and hope that as 3.1 cut the performance in half, trunk will do the same >>> >>> Trunk prob won't be much better... but the bulkpostings branch >>> possibly could be. >>> >>> -Yonik >>> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May >>> 25-26, San Francisco >>> Thanks for the info Ofer On Friday, April 22, 2011, Yonik Seeley wrote: > On Thu, Apr 21, 2011 at 6:25 PM, Ofer Fort wrote: >> Well, it was worth the try;-) >> But will using the facet.method=fc, will reducing the subset size >> reduce the time and memory? Meaning is it an O( ndocs of the set)? > > facet.method=fc builds a multi-valued fieldcache like structure > (UnInvertedField) the first time, that > is used for counting facets for all subsequent requests. So the > faceting time (after the first time) is O(ndocs of the set), > but the UnInvertedField singleton uses a large amout of memory > unrelated to any particular base docset. > > -Yonik > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May > 25-26, San Francisco > > >> Thanks >> On Thursday, April 21, 2011, Yonik Seeley >> wrote: >>> On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort wrote: So if i want to use the facet.method=fc, is there a way to speed it up? and remove the bucket size limitation? 
>>> >>> Not really - else we would have done it already ;-) >>> We don't really have great methods for faceting on full-text fields >>> (as opposed to shorter meta-data fields) today. >>> >>> -Yonik >>> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May >>> 25-26, San Francisco >>> >> > >>> >> >
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 6:50 PM, Ofer Fort wrote: > Ok, I'll give it a try, as this is a server I am willing to risk. > How is the competability between solrj of bulkpostings, trunk, 3.1 and 1.4.1? bulkpostings, trunk, and 3.1 should all be relatively solrj compatible. But the SolrJ javabin format (used by default for queries) changed for strings between 1.4.1 and 3.1 (SOLR-2034). -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco > On Friday, April 22, 2011, Yonik Seeley wrote: >> On Thu, Apr 21, 2011 at 6:34 PM, Ofer Fort wrote: >>> So I'm guessing my best approach now would be to test trunk, and hope >>> that as 3.1 cut the performance in half, trunk will do the same >> >> Trunk prob won't be much better... but the bulkpostings branch >> possibly could be. >> >> -Yonik >> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May >> 25-26, San Francisco >> >>> Thanks for the info >>> Ofer >>> >>> On Friday, April 22, 2011, Yonik Seeley wrote: On Thu, Apr 21, 2011 at 6:25 PM, Ofer Fort wrote: > Well, it was worth the try;-) > But will using the facet.method=fc, will reducing the subset size > reduce the time and memory? Meaning is it an O( ndocs of the set)? facet.method=fc builds a multi-valued fieldcache like structure (UnInvertedField) the first time, that is used for counting facets for all subsequent requests. So the faceting time (after the first time) is O(ndocs of the set), but the UnInvertedField singleton uses a large amout of memory unrelated to any particular base docset. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco > Thanks > On Thursday, April 21, 2011, Yonik Seeley > wrote: >> On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort wrote: >>> So if i want to use the facet.method=fc, is there a way to speed it up? >>> and >>> remove the bucket size limitation? 
>> >> Not really - else we would have done it already ;-) >> We don't really have great methods for faceting on full-text fields >> (as opposed to shorter meta-data fields) today. >> >> -Yonik >> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May >> 25-26, San Francisco >> > >>> >> >
Re: Highest frequency terms for a subset of documents
Ok, I'll give it a try, as this is a server I am willing to risk. How is the compatibility between the solrj of bulkpostings, trunk, 3.1 and 1.4.1? On Friday, April 22, 2011, Yonik Seeley wrote: > On Thu, Apr 21, 2011 at 6:34 PM, Ofer Fort wrote: >> So I'm guessing my best approach now would be to test trunk, and hope >> that as 3.1 cut the performance in half, trunk will do the same > > Trunk prob won't be much better... but the bulkpostings branch > possibly could be. > > -Yonik > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May > 25-26, San Francisco > >> Thanks for the info >> Ofer >> >> On Friday, April 22, 2011, Yonik Seeley wrote: >>> On Thu, Apr 21, 2011 at 6:25 PM, Ofer Fort wrote: Well, it was worth the try;-) But will using the facet.method=fc, will reducing the subset size reduce the time and memory? Meaning is it an O( ndocs of the set)? >>> >>> facet.method=fc builds a multi-valued fieldcache like structure >>> (UnInvertedField) the first time, that >>> is used for counting facets for all subsequent requests. So the >>> faceting time (after the first time) is O(ndocs of the set), >>> but the UnInvertedField singleton uses a large amount of memory >>> unrelated to any particular base docset. >>> >>> -Yonik >>> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May >>> 25-26, San Francisco >>> Thanks On Thursday, April 21, 2011, Yonik Seeley wrote: > On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort wrote: >> So if i want to use the facet.method=fc, is there a way to speed it up? >> and >> remove the bucket size limitation? > > Not really - else we would have done it already ;-) > We don't really have great methods for faceting on full-text fields > (as opposed to shorter meta-data fields) today. > > -Yonik > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May > 25-26, San Francisco > >>> >> >
Re: Multiple Tags and Facets
Thank you Hoss. I will try the comma-separated thing out. It seems to be what I searched for. :) Regards, Em Chris Hostetter-3 wrote: > > : I watched an online video with Chris Hostsetter from Lucidimagination. > He > : showed the possibility of having some Facets that exclude *all* filter > while > : also having some Facets that take care of some of the set filters while > : ignoring other filters. > > FWIW: That webinar is nearly identical to the apachecon talk i gave on the > same topic, slides of which can be found here... > > http://people.apache.org/~hossman/apachecon2010/facets/ > > This is the example i used on Slide #29... > > Same Facet, Different Exclusions > > * A key can be specified for a facet to change the name used to > identify it in the response. > * This allows you to have multiple instances of a facet, with >differnet exclusions. > > q = Hot Rod >fq = {!df=colors tag=cx}purple green > facet.field = {!key=all_colors ex=cx}colors > facet.field = {!key=overlap_colors}colors > > ...the point in that example is to treat a field (color) as two > differnt facets: one with exclusions and one without. > > it sounds like what you want is differnet -- i *think* what you > are asking for is multiple exclusions for a single facet. I didn't > mention that in my slides, but you can do that using a comma seperated > list of exclusions... > > q = Hot Rod >fq = {!df=body tag=bc}purple >fq = {!df=interior tag=ic}green > facet.field = {!ex=bc,ic}model > > -Hoss > -- View this message in context: http://lucene.472066.n3.nabble.com/Multiple-Tags-and-Facets-tp2843130p2849115.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 6:34 PM, Ofer Fort wrote: > So I'm guessing my best approach now would be to test trunk, and hope > that as 3.1 cut the performance in half, trunk will do the same Trunk prob won't be much better... but the bulkpostings branch possibly could be. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco > Thanks for the info > Ofer > > On Friday, April 22, 2011, Yonik Seeley wrote: >> On Thu, Apr 21, 2011 at 6:25 PM, Ofer Fort wrote: >>> Well, it was worth the try;-) >>> But will using the facet.method=fc, will reducing the subset size >>> reduce the time and memory? Meaning is it an O( ndocs of the set)? >> >> facet.method=fc builds a multi-valued fieldcache like structure >> (UnInvertedField) the first time, that >> is used for counting facets for all subsequent requests. So the >> faceting time (after the first time) is O(ndocs of the set), >> but the UnInvertedField singleton uses a large amout of memory >> unrelated to any particular base docset. >> >> -Yonik >> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May >> 25-26, San Francisco >> >> >>> Thanks >>> On Thursday, April 21, 2011, Yonik Seeley >>> wrote: On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort wrote: > So if i want to use the facet.method=fc, is there a way to speed it up? > and > remove the bucket size limitation? Not really - else we would have done it already ;-) We don't really have great methods for faceting on full-text fields (as opposed to shorter meta-data fields) today. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco >>> >> >
Re: Highest frequency terms for a subset of documents
So I'm guessing my best approach now would be to test trunk, and hope that as 3.1 cut the performance in half, trunk will do the same Thanks for the info Ofer On Friday, April 22, 2011, Yonik Seeley wrote: > On Thu, Apr 21, 2011 at 6:25 PM, Ofer Fort wrote: >> Well, it was worth the try;-) >> But will using the facet.method=fc, will reducing the subset size >> reduce the time and memory? Meaning is it an O( ndocs of the set)? > > facet.method=fc builds a multi-valued fieldcache like structure > (UnInvertedField) the first time, that > is used for counting facets for all subsequent requests. So the > faceting time (after the first time) is O(ndocs of the set), > but the UnInvertedField singleton uses a large amout of memory > unrelated to any particular base docset. > > -Yonik > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May > 25-26, San Francisco > > >> Thanks >> On Thursday, April 21, 2011, Yonik Seeley wrote: >>> On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort wrote: So if i want to use the facet.method=fc, is there a way to speed it up? and remove the bucket size limitation? >>> >>> Not really - else we would have done it already ;-) >>> We don't really have great methods for faceting on full-text fields >>> (as opposed to shorter meta-data fields) today. >>> >>> -Yonik >>> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May >>> 25-26, San Francisco >>> >> >
Re: Indexing 20M documents from MySQL with DIH
Thanks for your response! I think the issue is that the records are being returned TOO fast from MySQL. I can dump them to CSV in about 30 minutes, but building the solr index takes hours on the system I'm using. I may just need to use a more powerful Solr instance so it doesn't leave MySQL hanging for too long? What about autoCommit, does that factor in to your import strategy? 2011/4/21 Robert Gründler : > we're indexing around 10M records from a mysql database into > a single solr core. > > The DataImportHandler needs to join 3 sub-entities to denormalize > the data. > > We've run into some troubles for the first 2 attempts, but setting > batchSize="-1" for the dataSource resolved the issues. > > Do you need a lot of complex joins to import the data from mysql? > > > > -robert > > > > > On 4/21/11 8:08 PM, Scott Bigelow wrote: >> >> I've been using Solr for a while now, indexing 2-4 million records >> using the DIH to pull data from MySQL, which has been working great. >> For a new project, I need to index about 20M records (30 fields) and I >> have been running into issues with MySQL disconnects, right around >> 15M. I've tried several remedies I've found on blogs, changing >> autoCommit, batchSize etc., and none of them have seem to majorly >> resolved the issue. It got me wondering: Is this the way everyone does >> it? What about 100M records up to 1B; are those all pulled using DIH >> and a single query? >> >> I've used sphinx in the past, which uses multiple queries to pull out >> a subset of records ranged based on PrimaryKey, does Solr offer >> functionality similar to this? It seems that once a Solr index gets to >> a certain size, the indexing of a batch takes longer than MySQL's >> net_write_timeout, so it kills the connection. >> >> Thanks for your help, I really enjoy using Solr and I look forward to >> indexing even more data! > >
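On the autoCommit question raised above: it lives in solrconfig.xml, and for bulk imports a common approach is to keep automatic commits infrequent and issue one explicit commit at the end. A sketch with illustrative thresholds -- both values are assumptions to tune against your hardware, not recommendations:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- commit automatically after this many docs or this many ms,
       whichever comes first -->
  <autoCommit>
    <maxDocs>100000</maxDocs>
    <maxTime>600000</maxTime>
  </autoCommit>
</updateHandler>
```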
Index upgrade from 1.4.1 to 3.1 and 4.0
Hi all,

While doing some tests, I realized that an index that was created with Solr 1.4.1 is readable by Solr 3.1, but not readable by Solr 4.0. If I plan to migrate my index to 4.0, and I prefer not to reindex it all, what would be my best course of action? Will it be possible to continue to write to the index with 3.1? Will that make it readable from 4.0, or only the newly created segments? If I optimize it using 3.1, will that make it readable also from 4.0?

Thanks,
Ofer
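If the optimize-with-3.1 route is attempted (an optimize rewrites every segment, so the whole index ends up in the 3.x segment format; whether a given 4.0 build can then open it is exactly the open question above and should be tested on a copy first), the request itself is just an update call -- host and port are placeholders:

```text
http://localhost:8983/solr/update?optimize=true
```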
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 6:25 PM, Ofer Fort wrote:
> Well, it was worth the try;-)
> But will using the facet.method=fc, will reducing the subset size
> reduce the time and memory? Meaning is it an O( ndocs of the set)?

facet.method=fc builds a multi-valued FieldCache-like structure (UnInvertedField) the first time; that structure is used for counting facets for all subsequent requests. So the faceting time (after the first time) is O(ndocs of the set), but the UnInvertedField singleton uses a large amount of memory unrelated to any particular base docset.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco

> Thanks
> On Thursday, April 21, 2011, Yonik Seeley wrote:
>> On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort wrote:
>>> So if i want to use the facet.method=fc, is there a way to speed it up? and
>>> remove the bucket size limitation?
>>
>> Not really - else we would have done it already ;-)
>> We don't really have great methods for faceting on full-text fields
>> (as opposed to shorter meta-data fields) today.
>>
>> -Yonik
>> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
>
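Pulled together, the "highest frequency terms for a subset" request discussed in this thread looks roughly like the following. Field and filter names are placeholders, and the parenthetical notes are annotations, not part of the request:

```text
q=*:*
&fq=category:electronics     (defines the document subset)
&rows=0
&facet=true
&facet.field=body_text       (the full-text field whose top terms you want)
&facet.method=fc
&facet.limit=50
```

The fq defines the docset, and faceting on the text field with facet.method=fc returns its most frequent terms within that subset -- at the memory cost of the UnInvertedField structure described above.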
Re: Highest frequency terms for a subset of documents
Well, it was worth the try ;-) But when using facet.method=fc, will reducing the subset size reduce the time and memory? Meaning, is it O(ndocs of the set)? Thanks On Thursday, April 21, 2011, Yonik Seeley wrote: > On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort wrote: >> So if i want to use the facet.method=fc, is there a way to speed it up? and >> remove the bucket size limitation? > > Not really - else we would have done it already ;-) > We don't really have great methods for faceting on full-text fields > (as opposed to shorter meta-data fields) today. > > -Yonik > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May > 25-26, San Francisco >
Re: HTMLStripCharFilterFactory, highlighting and InvalidTokenOffsetsException
Perhaps a better place to start is here: http://wiki.apache.org/solr/HowToContribute#Contributing_Code_.28Features.2C_Big_Fixes.2C_Tests.2C_etc29 That page also has information about setting up Eclipse or IntelliJ environments. But the place to start is to get the source and get to the point where you can issue "ant clean test" from the command line. That should compile all the source and run the junit tests. "ant example" will build you a full deployment in the example directory that you can run the usual way "java -jar start.jar". The IDEs also have a wizardly way to apply patches if you don't want to apply them the command-line way. Best Erick 2011/4/21 Robert Gründler : > On 20.04.11 18:51, Robert Muir wrote: >> >> Hi, there is a proposed patch uploaded to the issue. Maybe you can >> help by reviewing/testing it? > > if i succeed in compiling solr, i can test the patch. Is this the right > starting point > for such an endeavour ? http://wiki.apache.org/solr/HackingSolr > > > > -robert > >> 2011/4/20 Robert Gründler: >>> >>> Hi all, >>> >>> i'm getting the following exception when using highlighting for a field >>> containing HTMLStripCharFilterFactory: >>> >>> org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token >>> ... >>> exceeds length of provided text sized 21 >>> >>> It seems this is a known issue: >>> >>> https://issues.apache.org/jira/browse/LUCENE-2208 >>> >>> Does anyone know if there's a fix implemented yet in solr? >>> >>> >>> thanks! >>> >>> >>> -robert >>> >>> >>> >>> > >
Re: Indexing 20M documents from MySQL with DIH
we're indexing around 10M records from a mysql database into a single solr core. The DataImportHandler needs to join 3 sub-entities to denormalize the data. We've run into some troubles for the first 2 attempts, but setting batchSize="-1" for the dataSource resolved the issues. Do you need a lot of complex joins to import the data from mysql? -robert On 4/21/11 8:08 PM, Scott Bigelow wrote: I've been using Solr for a while now, indexing 2-4 million records using the DIH to pull data from MySQL, which has been working great. For a new project, I need to index about 20M records (30 fields) and I have been running into issues with MySQL disconnects, right around 15M. I've tried several remedies I've found on blogs, changing autoCommit, batchSize etc., and none of them seems to have resolved the issue. It got me wondering: Is this the way everyone does it? What about 100M records up to 1B; are those all pulled using DIH and a single query? I've used sphinx in the past, which uses multiple queries to pull out a subset of records ranged based on PrimaryKey, does Solr offer functionality similar to this? It seems that once a Solr index gets to a certain size, the indexing of a batch takes longer than MySQL's net_write_timeout, so it kills the connection. Thanks for your help, I really enjoy using Solr and I look forward to indexing even more data!
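The batchSize="-1" fix mentioned above goes on the dataSource element in data-config.xml; with the MySQL Connector/J driver it makes DIH stream rows one at a time instead of buffering the entire result set in memory. A sketch -- the JDBC URL, credentials, and the query are placeholders:

```xml
<dataConfig>
  <!-- batchSize="-1" enables row-by-row streaming with the MySQL driver -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/mydb"
              user="user" password="pass"
              batchSize="-1"/>
  <document>
    <entity name="record" query="SELECT id, title, body FROM records">
      <!-- field mappings / sub-entities go here -->
    </entity>
  </document>
</dataConfig>
```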
Re: Multiple Tags and Facets
: I watched an online video with Chris Hostetter from Lucidimagination. He
: showed the possibility of having some Facets that exclude *all* filters while
: also having some Facets that take care of some of the set filters while
: ignoring other filters.

FWIW: That webinar is nearly identical to the apachecon talk i gave on the same topic, slides of which can be found here...

http://people.apache.org/~hossman/apachecon2010/facets/

This is the example i used on Slide #29...

Same Facet, Different Exclusions

* A key can be specified for a facet to change the name used to identify it in the response.
* This allows you to have multiple instances of a facet, with different exclusions.

q = Hot Rod
fq = {!df=colors tag=cx}purple green
facet.field = {!key=all_colors ex=cx}colors
facet.field = {!key=overlap_colors}colors

...the point in that example is to treat a field (colors) as two different facets: one with exclusions and one without.

it sounds like what you want is different -- i *think* what you are asking for is multiple exclusions for a single facet. I didn't mention that in my slides, but you can do that using a comma separated list of exclusions...

q = Hot Rod
fq = {!df=body tag=bc}purple
fq = {!df=interior tag=ic}green
facet.field = {!ex=bc,ic}model

-Hoss
MoreLikeThis
Hi all, I have an mlt search set up on my site with over 2 million records in the index. Normally, my results look like: 0 204 Some result. A similar result ... And there are 100 results under response. However, in some cases, there are no results under "response". Why is this the case and is there anything I can do about it? Here is my mlt configuration: title,score 1 100 *,score And here is the URL I use to get results: http://localhost:8983/solr/mlt/?q=title:Some random title Any help on this matter would be greatly appreciated. Thanks! Brian Lamb
Re: Multiple Tags and Facets
Hi Jay,

thank you for your reply. We must extend your example to reproduce what I mean. You have the following facets:

project:
- Solr
- Lucene
- Nutch
- Mahout

source:
- Documentation
- Mailinglist
- Wiki
- Commercial Websites

What I want now is: when I click on Solr + Documentation
(fq={!tag=p}project:Solr&fq={!tag=s}source:Documentation), I want to get back a result set where, on the one hand, I can see that there are no matches for Mahout given the filter queries. On the other hand, I also want to see that there are results available for my search that are merely excluded by the current filters.

This information is useful for creating a powerful UI: you can show the user that there is possibly valuable information available on commercial websites, but that it is excluded from the current search. Another point is that you can "fix" your UI: you always show all facets relevant to the current search, no matter which of them are active. Those that no longer apply to the given result set (like Mahout in our example) still remain in the list of available projects but are marked as unusable (displayed in a soft gray or something like that to show that they are inactive).

My problem is that I do not know how to create such a user experience, because things get complicated as soon as I add another dimension (like the source facet). Since Hoss showed in the Mastering Facets webinar that such cross-taggings are possible, I thought this was an already built-in option in Solr.

Regards,
Em

--
View this message in context: http://lucene.472066.n3.nabble.com/Multiple-Tags-and-Facets-tp2843130p2848085.html Sent from the Solr - User mailing list archive at Nabble.com.
Indexing 20M documents from MySQL with DIH
I've been using Solr for a while now, indexing 2-4 million records using the DIH to pull data from MySQL, which has been working great. For a new project, I need to index about 20M records (30 fields) and I have been running into issues with MySQL disconnects right around 15M. I've tried several remedies I've found on blogs (changing autoCommit, batchSize, etc.), and none of them seems to have resolved the issue. It got me wondering: is this the way everyone does it? What about 100M records, up to 1B; are those all pulled using DIH and a single query? I've used Sphinx in the past, which uses multiple queries to pull out subsets of records ranged by primary key; does Solr offer functionality similar to this? It seems that once a Solr index gets to a certain size, the indexing of a batch takes longer than MySQL's net_write_timeout, so it kills the connection. Thanks for your help, I really enjoy using Solr and I look forward to indexing even more data!
Re: Multiple Tags and Facets
I don't think I understand what you're trying to do. Are you trying to preserve all facets after a user clicks on a facet and thereby triggers a filter query, which would otherwise exclude the other facets? If that's the case, you can use local parameters to tag the filter queries so they are not used for the facets.

Let's say I have the following facets:
- Solr
- Lucene
- Nutch
- Mahout

And I do a search for "solr". All of these links will have a filter query:
- Solr [ ?q=solr&fq=project:solr ]
- Lucene [ ?q=solr&fq=project:lucene ]
- Nutch [ ?q=solr&fq=project:nutch ]
- Mahout [ ?q=solr&fq=project:mahout ]

But if a user clicks on the "Solr" facet, the resulting query will exclude the other facets, so you only see this facet:
- Solr

By using local parameters like this:

?q=solr&fq={!tag=myTag}project:solr
   &facet=on&facet.field={!ex=myTag}project

I can preserve all my facets, so that my query is filtered but all facets still remain:
- Solr
- Lucene
- Nutch
- Mahout

Hope this helps, but I'm not sure that's what you were after.

-Jay

On Wed, Apr 20, 2011 at 8:03 AM, Em wrote:
> Hello,
>
> I watched an online video with Chris Hostsetter from Lucidimagination. He
> showed the possibility of having some Facets that exclude *all* filter
> while
> also having some Facets that take care of some of the set filters while
> ignoring other filters.
>
> Unfortunately the Webinar did not explain how they made this and I wasn't
> able to give a filter/facet more than one tag. 
> > Here is an example: > > Facets and Filters: DocType, Author > > Facet: > - Author > -- George (10) > -- Brian (12) > -- Christian (78) > -- Julia (2) > > -Doctype > -- PDF (70) > -- ODT (10) > -- Word (20) > -- JPEG (1) > -- PNG (1) > > When clicking on "Julia" I would like to achieve the following: > Facet: > - Author > -- George (10) > -- Brian (12) > -- Christian (78) > -- Julia (2) > Julia's Doctypes: > -- JPEG (1) > -- PNG (1) > > -Doctype > -- PDF (70) > -- ODT (10) > -- Word (20) > -- JPEG (1) > -- PNG (1) > > Another example which adds special options to your GUI could be as > following: > Imagine a fashion store. > If you search for "shirt" you get a color-facet: > > colors: > - red (19) > - green (12) > - blue (4) > - black (2) > > As well as a brand-facet: > > brands: > - puma (18) > - nike (19) > > When I click on the red color-facet, I would like to get the following > back: > colors: > - red (19) > - green (12)* > - blue (4)* > - black (2)* > > brands: > - puma (18)* > - nike (19) > > All those filters marked by an "*" could be displayed half-transparent or > so > - they just show the user that those filter-options exist for his/her > search > but aren't included in the result-set, since he/she excluded them by > clicking the "red" filter. > > This case is more interesting, if not all red shirts were from nike. > This way you can show the user that i.e. 8 of 19 red - shirts are from the > brand you selected/you see 8 of 19 red shirts. > > I hope I explained what I want to achive. > > Thank you! > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Multiple-Tags-and-Facets-tp2843130p2843130.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Solr search based on list of terms. Order by max(score) for each term.
Hello,

I am trying to query a Solr server in order to obtain the most relevant results for a list of terms. For example, I have the list of words "nokia", "iphone", "charger". My schema contains the following data:

- nokia
- iphone
- nokia iphone otherwords
- nokia white
- iphone white

If I run a simple query like q=nokia OR iphone OR charger, I get "nokia iphone otherwords" as the most relevant result (because it contains more query terms). I would like to get "nokia", "iphone", or "iphone white" as the first results, because each of them would be the most relevant for an individual term. To obtain the correct list, I could do a query for each term, then aggregate the results and order them based on the maximum score. Can I make this query in one request?

This question has also been asked on http://stackoverflow.com/questions/5743264/solr-search-based-on-list-of-terms-order-by-maxscore-for-each-term

Thank you.
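The client-side fallback described above (one query per term, then merge and rank documents by the best single-term score) can be sketched like this; the document IDs and scores are made up for illustration, not real Solr output:

```python
def merge_by_max_score(per_term_results):
    """per_term_results: dict mapping term -> list of (doc_id, score).

    Keeps, for each document, the best score any single term gave it,
    then ranks documents by that maximum score."""
    best = {}
    for results in per_term_results.values():
        for doc_id, score in results:
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

# Pretend we ran one query per term and got these (doc, score) lists back:
per_term = {
    "nokia":  [("nokia", 2.0), ("nokia iphone otherwords", 0.8)],
    "iphone": [("iphone", 1.9), ("iphone white", 1.5),
               ("nokia iphone otherwords", 0.7)],
}
print(merge_by_max_score(per_term)[0])  # best single-term match first
```

In Lucene terms this max-over-clauses combination is essentially what a DisjunctionMaxQuery over the individual term queries does with a tie-break of 0, though whether that is reachable from plain Solr query syntax here is a separate question.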
RE: stemming filter analyzers, any favorites?
Nice! Thanks! -Original Message- From: Em [mailto:mailformailingli...@yahoo.de] Sent: Thursday, April 21, 2011 9:23 AM To: solr-user@lucene.apache.org Subject: RE: stemming filter analyzers, any favorites? As far as I know Lucene does not store an inverted index per field, so no, it would not double the size of the index. However, it could influence the score a little bit. For example: If both stemmers reduce "schools" to "school" and you are searching for "all schools in america" the term "school" has more weight to the resulting score, since it definitly occurs in two fields which consist of nearly the same value. To reduce this effect you could write your own queryParser which creates a disjunctionMaxQuery consisting of two boolean queries and a tie-break of 0 - so only the better scoring stemmed-field contributes to the total score of your document. Regards, Em Robert Petersen-3 wrote: > > Adding another field with another stemmer and searching both??? Wow never > thought of doing that. I guess that doesn't really double the size of > your index tho because all the terms are almost the same right? Let me > look into that. I'll raise the other issue in a separate thread and > thanks. > > -Original Message- > From: Em [mailto:mailformailingli...@yahoo.de] > Sent: Thursday, April 21, 2011 1:55 AM > To: solr-user@lucene.apache.org > Subject: RE: stemming filter analyzers, any favorites? > > Hi Robert, > > we often ran into the same issue with stemmers. This is why we created > more > than one field, each field with different stemmers. It adds some overhead > but worked quite well. > > Regarding your off-topic-question: > Look at the debugging-output of your searches. Sometimes you configured > your > tools, especially the WDF, wrong and the queryParser creates an unexpected > result which leads to unmatched but still relevant documents. > > Please, show us your debugging-output and the field-definition so that we > can provide you some help! 
> > Regards, > Em > > > Robert Petersen-3 wrote: >> >> I have been doing that, and for Bags example the trailing 's' is not >> being >> removed by the Kstemmer so if indexing the word bags and searching on bag >> you get no matches. Why wouldn't the trailing 's' get stemmed off? >> Kstemmer is dictionary based so bags isn't in the dictionary? That >> trailing 's' should always be dropped no? That seems like it would be >> better, we don't want to make synonyms for basic use cases like this. I >> fear I will have to return to the Porter stemmer. Are there other better >> ones is my main question. >> >> Off topic secondary question: sometimes I am puzzled by the output of the >> analysis page. It seems like there should be a match, but I don't get >> the >> results during a search that I'd expect... >> >> Like in the case if the WordDelimiterFilterFactory splits up a term into >> a >> bunch of terms before the K-stemmer is applied, sometimes if the matching >> term is in position two of the final analysis but the searcher had the >> partial term just alone and so thereby in position 1 in the analysis >> stack >> then when searching there wasn't a match. Am I reading this correctly? >> Is that right or should that match and I am misreading my analysis >> output? >> >> Thanks! >> >> Robi >> >> PS I have a category named Bags and am catching flack for it not coming >> up in a search for bag. hah >> PPS the term is not in protwords.txt >> >> >> com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory >> {protected=protwords.txt} >> term position1 >> term textbags >> term typeword >> source start,end 0,4 >> payload >> >> >> -Original Message- >> From: Erick Erickson [mailto:erickerick...@gmail.com] >> Sent: Wednesday, April 20, 2011 10:55 AM >> To: solr-user@lucene.apache.org >> Subject: Re: stemming filter analyzers, any favorites? 
>> >> You can get a better sense of exactly what tranformations occur when >> if you look at the analysis page (be sure to check the "verbose" >> checkbox). >> >> I'm surprised that "bags" doesn't match "bag", what does the analysis >> page say? >> >> Best >> Erick >> >> On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen>> wrote: >>> Stemming filter analyzers... anyone have any favorites for particular >>> search domains? Just wondering what people are using. I'm using Lucid >>> K Stemmer and having issues. Seems like it misses a lot of common >>> stems. We went to that because of excessively loose matches on the >>> solr.PorterStemFilterFactory >>> >>> >>> I understand K Stemmer is a dictionary based stemmer. Seems to me like >>> it is missing a lot of common stem reductions. Ie Bags does not match >>> Bag in our searches. >>> >>> Here is my analyzer stack: >>> >>> >> positionIncrementGap="100"> >>> >>> >> class="solr.WhitespaceTokenizerFact
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort wrote: > So if i want to use the facet.method=fc, is there a way to speed it up? and > remove the bucket size limitation? Not really - else we would have done it already ;-) We don't really have great methods for faceting on full-text fields (as opposed to shorter meta-data fields) today. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
RE: stemming filter analyzers, any favorites?
As far as I know Lucene does not store an inverted index per field, so no, it would not double the size of the index. However, it could influence the score a little bit. For example: If both stemmers reduce "schools" to "school" and you are searching for "all schools in america" the term "school" has more weight to the resulting score, since it definitly occurs in two fields which consist of nearly the same value. To reduce this effect you could write your own queryParser which creates a disjunctionMaxQuery consisting of two boolean queries and a tie-break of 0 - so only the better scoring stemmed-field contributes to the total score of your document. Regards, Em Robert Petersen-3 wrote: > > Adding another field with another stemmer and searching both??? Wow never > thought of doing that. I guess that doesn't really double the size of > your index tho because all the terms are almost the same right? Let me > look into that. I'll raise the other issue in a separate thread and > thanks. > > -Original Message- > From: Em [mailto:mailformailingli...@yahoo.de] > Sent: Thursday, April 21, 2011 1:55 AM > To: solr-user@lucene.apache.org > Subject: RE: stemming filter analyzers, any favorites? > > Hi Robert, > > we often ran into the same issue with stemmers. This is why we created > more > than one field, each field with different stemmers. It adds some overhead > but worked quite well. > > Regarding your off-topic-question: > Look at the debugging-output of your searches. Sometimes you configured > your > tools, especially the WDF, wrong and the queryParser creates an unexpected > result which leads to unmatched but still relevant documents. > > Please, show us your debugging-output and the field-definition so that we > can provide you some help! 
> > Regards, > Em > > > Robert Petersen-3 wrote: >> >> I have been doing that, and for Bags example the trailing 's' is not >> being >> removed by the Kstemmer so if indexing the word bags and searching on bag >> you get no matches. Why wouldn't the trailing 's' get stemmed off? >> Kstemmer is dictionary based so bags isn't in the dictionary? That >> trailing 's' should always be dropped no? That seems like it would be >> better, we don't want to make synonyms for basic use cases like this. I >> fear I will have to return to the Porter stemmer. Are there other better >> ones is my main question. >> >> Off topic secondary question: sometimes I am puzzled by the output of the >> analysis page. It seems like there should be a match, but I don't get >> the >> results during a search that I'd expect... >> >> Like in the case if the WordDelimiterFilterFactory splits up a term into >> a >> bunch of terms before the K-stemmer is applied, sometimes if the matching >> term is in position two of the final analysis but the searcher had the >> partial term just alone and so thereby in position 1 in the analysis >> stack >> then when searching there wasn't a match. Am I reading this correctly? >> Is that right or should that match and I am misreading my analysis >> output? >> >> Thanks! >> >> Robi >> >> PS I have a category named Bags and am catching flack for it not coming >> up in a search for bag. hah >> PPS the term is not in protwords.txt >> >> >> com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory >> {protected=protwords.txt} >> term position1 >> term textbags >> term typeword >> source start,end 0,4 >> payload >> >> >> -Original Message- >> From: Erick Erickson [mailto:erickerick...@gmail.com] >> Sent: Wednesday, April 20, 2011 10:55 AM >> To: solr-user@lucene.apache.org >> Subject: Re: stemming filter analyzers, any favorites? 
>> >> You can get a better sense of exactly what tranformations occur when >> if you look at the analysis page (be sure to check the "verbose" >> checkbox). >> >> I'm surprised that "bags" doesn't match "bag", what does the analysis >> page say? >> >> Best >> Erick >> >> On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen>> wrote: >>> Stemming filter analyzers... anyone have any favorites for particular >>> search domains? Just wondering what people are using. I'm using Lucid >>> K Stemmer and having issues. Seems like it misses a lot of common >>> stems. We went to that because of excessively loose matches on the >>> solr.PorterStemFilterFactory >>> >>> >>> I understand K Stemmer is a dictionary based stemmer. Seems to me like >>> it is missing a lot of common stem reductions. Ie Bags does not match >>> Bag in our searches. >>> >>> Here is my analyzer stack: >>> >>> >> positionIncrementGap="100"> >>> >>> >> class="solr.WhitespaceTokenizerFactory"/> >>> >> class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" >>> ignoreCase="true" expand="true"/> >>> >> ignoreCase="true" words="stopword
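The tie-break behaviour described at the top of this message (a DisjunctionMaxQuery with a tie-break of 0, so only the better-scoring stemmed field contributes) works out as: score = best clause score + tie * (sum of the other clause scores). A toy sketch with made-up numbers, not Lucene's actual similarity math:

```python
def dismax_score(clause_scores, tie=0.0):
    """Disjunction-max combination: the best-scoring clause counts fully,
    the remaining clauses are scaled by the tie-break factor."""
    best = max(clause_scores)
    return best + tie * (sum(clause_scores) - best)

# Two stemmed copies of the same field score nearly the same:
scores = [1.5, 1.0]
print(dismax_score(scores, tie=0.0))  # 1.5 -- only the better field counts
print(dismax_score(scores, tie=1.0))  # 2.5 -- both fields count fully
```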
Re: Multiple Tags and Facets
Are there any ideas on how to use multiple tags per filter, or how to combine tags to exclude more than one filter per facet? Regards, Em -- View this message in context: http://lucene.472066.n3.nabble.com/Multiple-Tags-and-Facets-tp2843130p2847569.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: stemming filter analyzers, any favorites?
Adding another field with another stemmer and searching both??? Wow never thought of doing that. I guess that doesn't really double the size of your index tho because all the terms are almost the same right? Let me look into that. I'll raise the other issue in a separate thread and thanks. -Original Message- From: Em [mailto:mailformailingli...@yahoo.de] Sent: Thursday, April 21, 2011 1:55 AM To: solr-user@lucene.apache.org Subject: RE: stemming filter analyzers, any favorites? Hi Robert, we often ran into the same issue with stemmers. This is why we created more than one field, each field with different stemmers. It adds some overhead but worked quite well. Regarding your off-topic-question: Look at the debugging-output of your searches. Sometimes you configured your tools, especially the WDF, wrong and the queryParser creates an unexpected result which leads to unmatched but still relevant documents. Please, show us your debugging-output and the field-definition so that we can provide you some help! Regards, Em Robert Petersen-3 wrote: > > I have been doing that, and for Bags example the trailing 's' is not being > removed by the Kstemmer so if indexing the word bags and searching on bag > you get no matches. Why wouldn't the trailing 's' get stemmed off? > Kstemmer is dictionary based so bags isn't in the dictionary? That > trailing 's' should always be dropped no? That seems like it would be > better, we don't want to make synonyms for basic use cases like this. I > fear I will have to return to the Porter stemmer. Are there other better > ones is my main question. > > Off topic secondary question: sometimes I am puzzled by the output of the > analysis page. It seems like there should be a match, but I don't get the > results during a search that I'd expect... 
> > Like in the case if the WordDelimiterFilterFactory splits up a term into a > bunch of terms before the K-stemmer is applied, sometimes if the matching > term is in position two of the final analysis but the searcher had the > partial term just alone and so thereby in position 1 in the analysis stack > then when searching there wasn't a match. Am I reading this correctly? > Is that right or should that match and I am misreading my analysis output? > > Thanks! > > Robi > > PS I have a category named Bags and am catching flack for it not coming > up in a search for bag. hah > PPS the term is not in protwords.txt > > > com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory > {protected=protwords.txt} > term position 1 > term text bags > term type word > source start,end 0,4 > payload > > > -Original Message- > From: Erick Erickson [mailto:erickerick...@gmail.com] > Sent: Wednesday, April 20, 2011 10:55 AM > To: solr-user@lucene.apache.org > Subject: Re: stemming filter analyzers, any favorites? > > You can get a better sense of exactly what tranformations occur when > if you look at the analysis page (be sure to check the "verbose" > checkbox). > > I'm surprised that "bags" doesn't match "bag", what does the analysis > page say? > > Best > Erick > > On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen> wrote: >> Stemming filter analyzers... anyone have any favorites for particular >> search domains? Just wondering what people are using. I'm using Lucid >> K Stemmer and having issues. Seems like it misses a lot of common >> stems. We went to that because of excessively loose matches on the >> solr.PorterStemFilterFactory >> >> >> I understand K Stemmer is a dictionary based stemmer. Seems to me like >> it is missing a lot of common stem reductions. Ie Bags does not match >> Bag in our searches. 
>> >> Here is my analyzer stack: >> >> > positionIncrementGap="100"> >> >> > class="solr.WhitespaceTokenizerFactory"/> >> > class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" >> ignoreCase="true" expand="true"/> >> > ignoreCase="true" words="stopwords.txt"/> >> > generateWordParts="1" >> generateNumberParts="1" >> catenateWords="1" >> catenateNumbers="1" >> catenateAll="1" >> preserveOriginal="1" >> /> > class="solr.LowerCaseFilterFactory"/> >> >> > class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory" >> protected="protwords.txt"/> >> > class="solr.RemoveDuplicatesTokenFilterFactory"/> >> >> >> > class="solr.WhitespaceTokenizerFactory"/> >> > class="solr.SynonymFilterFactory" synonyms="query_synonyms.txt" >> ignoreCase="true" expand=
Re: old searchers not closing after optimize or replication
Hey Bernd,

Check out https://issues.apache.org/jira/browse/SOLR-2469. There is a pretty bad bug in Solr 3.1 which occurs if you have startup set in your replication configuration in solrconfig.xml. See the thread between Yonik and myself from a few days ago titled "Solr 3.1: Old Index Files Not Removed on Optimize". You can disable startup replication and perform an optimize to see if this fixes your problem of old index files being left behind (though you may have some old index files left behind from before this change that you still need to clean up). Yonik has already pushed a patch into the 3x branch and trunk for this issue. I can confirm that applying the patch (or just removing startup replication) resolved the issue for us. Do you think this is your issue?

Thanks,

-Trey

On Thu, Apr 21, 2011 at 2:27 AM, Bernd Fehling wrote:
> Hi Erik,
>
> <str name="maxCommitsToKeep">1</str>
> <str name="maxOptimizedCommitsToKeep">0</str>
>
> Due to 44 minutes optimization time we do an optimization once a day
> during the night.
>
> I will try with an smaler index on my development system.
>
> Best regards,
> Bernd
>
>
> Am 20.04.2011 17:50, schrieb Erick Erickson:
>>
>> It looks OK, but still doesn't explain keeping the old files around. What
>> does your <deletionPolicy> in your solrconfig.xml look like? It's
>> possible that you're seeing Solr attempt to keep around several
>> optimized copies of the index, but that still doesn't explain why
>> restarting Solr removes them unless the deletionPolicy gets invoked
>> on sometime and you're index files are aging out (I don't know the
>> internals of deletion well enough to say).
>>
>> About optimization. It's become less important with recent code. Once
>> upon a time, it made a substantial difference in search speed. More
>> recently, it has very little impact on search speed, and is used
>> much more sparingly. Its greatest benefit is reclaiming unused resources
>> left over from deleted documents. So you might want to avoid the pain
>> of optimizing (44 minutes!)
and only optimize rarely of if you have >> deleted a lot of documents. >> >> It might be worthwhile to try (with a smaller index !) a bunch of optimize >> cycles and see if the idea has any merit. I'd expect >> your index to reach a maximum and stay there after the saved >> copies of the index was reached... >> >> But otherwise I'm puzzled... >> >> Erick >> >> On Wed, Apr 20, 2011 at 10:30 AM, Bernd Fehling >> wrote: >>> >>> Hi Erik, >>> >>> Am 20.04.2011 15:42, schrieb Erick Erickson: H, this isn't right. You've pretty much eliminated the obvious things. What does lsof show? I'm assuming it shows the files are being held open by your Solr instance, but it's worth checking. >>> >>> Just commited new content 3 times and finally optimized. >>> Again having old index files left. >>> >>> Then checked on my master, only the newest version of index files are >>> listed with lsof. No file handles to the old index files but the >>> old index files remain in data/index/. >>> Thats strange. >>> >>> This time replication worked fine and cleaned up old index on slaves. >>> I'm not getting the same behavior, admittedly on a Windows box. The only other thing I can think of is that you have a query that's somehow never ending, but that's grasping at straws. Do your log files show anything interesting? 
>>> >>> Lets see: >>> - it has the old generation (generation=12) and its files >>> - and recognizes that there have been several commits (generation=18) >>> >>> 20.04.2011 14:05:26 org.apache.solr.update.DirectUpdateHandler2 commit >>> INFO: start >>> >>> commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false) >>> 20.04.2011 14:05:26 org.apache.solr.core.SolrDeletionPolicy onInit >>> INFO: SolrDeletionPolicy.onInit: commits:num=2 >>> >>> >>> commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_c,version=1302159868435,generation=12,filenames=[_3xm.nrm, >>> _3xm.fdx, segment >>> s_c, _3xm.fnm, _3xm.fdt, _3xm.tis, _3xm.tii, _3xm.prx, _3xm.frq] >>> >>> >>> commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_i,version=1302159868447,generation=18,filenames=[_3xm.nrm, >>> _3xo.tis, _3xp.pr >>> x, _3xo.fnm, _3xp.fdx, _3xs.frq, _3xo.tii, _3xp.fdt, _3xn.tii, _3xm.fdx, >>> _3xn.nrm, _3xm.fdt, _3xs.prx, _3xn.tis, _3xn.fdx, _3xr.nrm, _3xm.prx, >>> _3xn.fdt, _3x >>> p.tii, _3xs.nrm, _3xp.tis, _3xo.prx, segments_i, _3xm.tii, _3xq.tii, >>> _3xs.fdx, _3xs.fdt, _3xo.frq, _3xn.prx, _3xm.tis, _3xr.prx, _3xq.tis, >>> _3xo.fdt, _3xp.fr >>> q, _3xq.fnm, _3xo.fdx, _3xp.fnm, _3xr.tis, _3xr.fnm, _3xq.frq, _3xr.tii, >>> _3xr.frq, _3xo.nrm, _3xs.tii, _3xq.fdx, _3xq.fdt, _3xp.nrm, _3xq.prx, >>> _3xs.tis, _3x >>> m.frq, _3xr.fdx, _3xm.fnm, _3xn.frq, _3xq.nrm, _3xs.fnm, _3xn.fnm, >>> _3xr.fdt] >>> 20.04.2011 14:05:26 org.apache.solr.core.SolrDeletionPolicy updateCommits >>> INFO: newest commit = 1302159868447 >>> >>> >>> - after 44 minutes of optimizing (over 140GB and 27.8 mio docs) it gets >>> the Sol
Re: Apache Spam Filter Blocking Messages
Good to know; I'll go change those settings, then. Thanks for the feedback. -Trey On Thu, Apr 21, 2011 at 4:42 AM, Em wrote: > > This really helps at the mailinglists. > If you send your mails with Thunderbird, be sure to check that you enforce > plain-text-emails. If not, it will often send HTML-mails. > > Regards, > Em > > > Marvin Humphrey wrote: > > > > On Thu, Apr 21, 2011 at 12:30:29AM -0400, Trey Grainger wrote: > >> (FREEMAIL_FROM,FS_REPLICA,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,RFC_ABUSE_POST,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL > > > > Note the "HTML_MESSAGE" in the list of things SpamAssassin didn't like. > > > >> Apparently I sound like spam when I write perfectly good English and > >> include > >> some xml and a link to a jira ticket in my e-mail (I tried a couple > >> different variations). Anyone know a way around this filter, or should I > >> just respond to those involved in the e-mail chain directly and avoid the > >> mailing list? > > > > Send plain text email instead of HTML. That solves the problem 99% of the > > time. > > > > Marvin Humphrey > > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Apache-Spam-Filter-Blocking-Messages-tp2845854p2846304.html > Sent from the Solr - User mailing list archive at Nabble.com.
Re: Highest frequency terms for a subset of documents
So if i want to use the facet.method=fc, is there a way to speed it up? and remove the bucket size limitation? On Thu, Apr 21, 2011 at 5:58 PM, Yonik Seeley wrote: > On Thu, Apr 21, 2011 at 10:41 AM, Ofer Fort wrote: > > I see, thanks. > > So if I would want to implement something that would fit my needs, would > > going through the subset of documents and counting all the terms in each > > one, would be faster? and easier to implement? > > That's not just your needs, that's everyone's needs (it's the > definition of field faceting). > There's no way to do what you're asking with a term enumerator (i.e. > facet.method=enum). > > Going through documents and counting all the terms in each is what > facet.method=fc does. > But it's also not great when the number of unique terms per document is > high. > If you can think of a better way, go for it! > > > -Yonik > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May > 25-26, San Francisco >
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 10:41 AM, Ofer Fort wrote: > I see, thanks. > So if I would want to implement something that would fit my needs, would > going through the subset of documents and counting all the terms in each > one, would be faster? and easier to implement? That's not just your needs, that's everyone's needs (it's the definition of field faceting). There's no way to do what you're asking with a term enumerator (i.e. facet.method=enum). Going through documents and counting all the terms in each is what facet.method=fc does. But it's also not great when the number of unique terms per document is high. If you can think of a better way, go for it! -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
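The two counting strategies contrasted above can be sketched over a toy inverted index (term -> doc ids); this only illustrates the order of iteration, not Solr's actual data structures:

```python
index = {            # toy inverted index: term -> set of doc ids
    "nokia":  {1, 2, 3},
    "iphone": {2, 4},
    "white":  {4, 5},
}
base_set = {2, 4}    # docs matching the base query

# facet.method=enum: step over EVERY term in the field and intersect
# its doc set with the base set -- cost grows with the term count.
enum_counts = {t: len(docs & base_set) for t, docs in index.items()}

# facet.method=fc: step over the docs in the base set and count the
# terms each one contains (found here by scanning the toy index;
# Solr uses an uninverted per-document view instead).
fc_counts = {}
for doc in base_set:
    for term, docs in index.items():
        if doc in docs:
            fc_counts[term] = fc_counts.get(term, 0) + 1

print(enum_counts)  # both strategies produce the same counts
```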
Re: Highest frequency terms for a subset of documents
I see, thanks. So if I would want to implement something that would fit my needs, would going through the subset of documents and counting all the terms in each one, would be faster? and easier to implement? On Thu, Apr 21, 2011 at 5:36 PM, Yonik Seeley wrote: > On Thu, Apr 21, 2011 at 9:44 AM, Ofer Fort wrote: > > Not sure i fully understand, > > If "facet.method=enum steps over all terms in the index for that field", > > than what does setting the q=field:subset do? if i set the q=*:*, than > how > > do i get the frequency only on my subset? > > It's an implementation detail. Faceting *does* just give you counts > that just match > q=field:subset. How it does it is a different matter (i.e. for > facet.method=enum, it > must step over all terms in the field), so it's closer to O(nterms in > field) rather than O(ndocs in base set) > > -Yonik > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May > 25-26, San Francisco > > > > Ofer > > > > On Thu, Apr 21, 2011 at 4:40 PM, Yonik Seeley < > yo...@lucidimagination.com> > > wrote: > >> > >> On Thu, Apr 21, 2011 at 9:24 AM, Ofer Fort wrote: > >> > Another strange behavior is that the Qtime seems pretty stable, no > >> > matter > >> > how many object match my query. 200K and 20K both take about 17s. > >> > I would have guessed that since the time is going over all the terms > of > >> > all > >> > the subset documents, would mean that the more documents, the more > time. > >> > >> facet.method=enum steps over all terms in the index for that field... > >> that takes time regardless of how many documents are in the base set. > >> > >> There are also short-circuit methods that avoid looking at the docs > >> for a term if it's docfreq is low enough that it couldn't possibly > >> make it into the priority queue. Because if this, it can actually be > >> faster to facet on a larger base set (try *:* as the base query). 
> >> > >> Actually, it might be interesting to see the query time if you set > >> facet.mincount equal to the number of docs in the base set - that will > >> test pretty much just the time to enumerate over the terms without > >> doing any set intersections at all. Be careful not to set mincount > >> greater than the number of docs in the base set though - solr will > >> short-circuit that too and skip enumeration altogether. > >> > >> The work on the bulkpostings branch should definitely speed up your > >> case even more - but I have no idea when it will "land" on trunk. > >> > >> > >> -Yonik > >> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May > >> 25-26, San Francisco > > > > >
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 9:44 AM, Ofer Fort wrote: > Not sure i fully understand, > If "facet.method=enum steps over all terms in the index for that field", > than what does setting the q=field:subset do? if i set the q=*:*, than how > do i get the frequency only on my subset? It's an implementation detail. Faceting *does* just give you counts that just match q=field:subset. How it does it is a different matter (i.e. for facet.method=enum, it must step over all terms in the field), so it's closer to O(nterms in field) rather than O(ndocs in base set) -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco > Ofer > > On Thu, Apr 21, 2011 at 4:40 PM, Yonik Seeley > wrote: >> >> On Thu, Apr 21, 2011 at 9:24 AM, Ofer Fort wrote: >> > Another strange behavior is that the Qtime seems pretty stable, no >> > matter >> > how many object match my query. 200K and 20K both take about 17s. >> > I would have guessed that since the time is going over all the terms of >> > all >> > the subset documents, would mean that the more documents, the more time. >> >> facet.method=enum steps over all terms in the index for that field... >> that takes time regardless of how many documents are in the base set. >> >> There are also short-circuit methods that avoid looking at the docs >> for a term if it's docfreq is low enough that it couldn't possibly >> make it into the priority queue. Because if this, it can actually be >> faster to facet on a larger base set (try *:* as the base query). >> >> Actually, it might be interesting to see the query time if you set >> facet.mincount equal to the number of docs in the base set - that will >> test pretty much just the time to enumerate over the terms without >> doing any set intersections at all. Be careful not to set mincount >> greater than the number of docs in the base set though - solr will >> short-circuit that too and skip enumeration altogether. 
>> >> The work on the bulkpostings branch should definitely speed up your >> case even more - but I have no idea when it will "land" on trunk. >> >> >> -Yonik >> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May >> 25-26, San Francisco > >
Re: PECL SOLR PHP extension, JSON output
On 21.04.2011 13:58, roySolr wrote: I have tried that but it seems like JSON is not supported Parameters responseWriter One of the following : - xml - phpnative -- View this message in context: http://lucene.472066.n3.nabble.com/PECL-SOLR-PHP-extension-JSON-output-tp2846092p2846728.html Sent from the Solr - User mailing list archive at Nabble.com. And I can't get phpnative working with SOLR 3.1 :-( -- Greets, Ralf Kraus
Re: Highest frequency terms for a subset of documents
Not sure I fully understand. If "facet.method=enum steps over all terms in the index for that field", then what does setting q=field:subset do? If I set q=*:*, then how do I get the frequency only on my subset? Ofer On Thu, Apr 21, 2011 at 4:40 PM, Yonik Seeley wrote: > On Thu, Apr 21, 2011 at 9:24 AM, Ofer Fort wrote: > > Another strange behavior is that the Qtime seems pretty stable, no matter > > how many object match my query. 200K and 20K both take about 17s. > > I would have guessed that since the time is going over all the terms of > all > > the subset documents, would mean that the more documents, the more time. > > facet.method=enum steps over all terms in the index for that field... > that takes time regardless of how many documents are in the base set. > > There are also short-circuit methods that avoid looking at the docs > for a term if it's docfreq is low enough that it couldn't possibly > make it into the priority queue. Because if this, it can actually be > faster to facet on a larger base set (try *:* as the base query). > > Actually, it might be interesting to see the query time if you set > facet.mincount equal to the number of docs in the base set - that will > test pretty much just the time to enumerate over the terms without > doing any set intersections at all. Be careful not to set mincount > greater than the number of docs in the base set though - solr will > short-circuit that too and skip enumeration altogether. > > The work on the bulkpostings branch should definitely speed up your > case even more - but I have no idea when it will "land" on trunk. > > > -Yonik > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May > 25-26, San Francisco >
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 9:24 AM, Ofer Fort wrote: > Another strange behavior is that the Qtime seems pretty stable, no matter > how many object match my query. 200K and 20K both take about 17s. > I would have guessed that since the time is going over all the terms of all > the subset documents, would mean that the more documents, the more time. facet.method=enum steps over all terms in the index for that field... that takes time regardless of how many documents are in the base set. There are also short-circuit methods that avoid looking at the docs for a term if its docfreq is low enough that it couldn't possibly make it into the priority queue. Because of this, it can actually be faster to facet on a larger base set (try *:* as the base query). Actually, it might be interesting to see the query time if you set facet.mincount equal to the number of docs in the base set - that will test pretty much just the time to enumerate over the terms without doing any set intersections at all. Be careful not to set mincount greater than the number of docs in the base set though - solr will short-circuit that too and skip enumeration altogether. The work on the bulkpostings branch should definitely speed up your case even more - but I have no idea when it will "land" on trunk. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
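Yonik's facet.mincount experiment translates into request parameters like these (a sketch; the core URL, field name, and the 200K base-set size are placeholders for the poster's setup):

```python
from urllib.parse import urlencode

# With facet.mincount equal to the base-set size, every term whose count
# could not reach the cutoff is skipped before any set intersection, so
# the measured time is mostly raw term enumeration.
params = {
    "q": "field:subset",
    "rows": 0,
    "facet": "true",
    "facet.field": "text",      # placeholder field name
    "facet.method": "enum",
    "facet.mincount": 200000,   # == docs in base set; never set it higher
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```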
Re: Highest frequency terms for a subset of documents
OK, so I copied my index and ran Solr 3.1 against it. Qtime dropped from about 40s to 17s! This is good news, but still longer than I hoped for. I tried to do the same test with 4.0, but I'm getting IndexFormatTooOldException since my index was created using 1.4.1. Is my only option to reindex using 3.1 or 4.0? Another strange behavior is that the Qtime seems pretty stable, no matter how many objects match my query. 200K and 20K both take about 17s. I would have guessed that, since the time goes to iterating over all the terms of all the subset documents, more documents would mean more time. Thanks for any insights ofer On Thu, Apr 21, 2011 at 3:07 AM, Ofer Fort wrote: > my documents are user entries, so i'm guessing they vary a lot. > Tomorrow i'll try 3.1 and also 4.0, and see if they have an improvement. > thanks guys! > > > On Thu, Apr 21, 2011 at 3:02 AM, Yonik Seeley > wrote: > >> On Wed, Apr 20, 2011 at 7:45 PM, Ofer Fort wrote: >> > Thanks >> > but i've disabled the cache already, since my concern is speed and i'm >> > willing to pay the price (memory) >> >> Then you should not disable the cache. >> >> >, and my subset are not fixed. >> > Does the facet search do any extra work that i don't need, that i might >> be >> > able to disable (either by a flag or by a code change), >> > Somehow i feel, or rather hope, that counting the terms of 200K >> documents >> > and finding the top 500 should take less than 30 seconds. >> >> Using facet.enum.cache.minDf should be a little faster than just >> disabling the cache - it's a different code path. >> Using the cache selectively will speed things up, so try setting that >> minDf to 1000 or so for example. >> >> How many unique terms do you have in the index? >> Is this Solr 3.1 - there were some optimizations when there were many >> terms to iterate over? >> You could also try trunk, which has even more optimizations, or the >> bulkpostings branch if you really want to experiment.
>> >> -Yonik >> > >
Re: Can't determine Sort Order error when using sort by function
On Thu, Apr 21, 2011 at 8:30 AM, Otis Gospodnetic wrote: > Hello, > > I'm trying out sorting by function with the new function queries and > invariably > getting this error: > > Can't determine Sort Order: 'termfreq(name,samsung)', pos=22 > > Here's an example call: > http://localhost:8983/solr/select/?q=*:*&sort=termfreq%28name,samsung%29 > > What am I doing wrong? Try adding the sort order "asc" or "desc" after the function. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco > Thanks, > Otis >
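Yonik's fix, spelled out: the sort parameter needs an explicit direction after the function. A sketch of the corrected request (same field and term as in Otis's example; URL encoding handled by the standard library):

```python
from urllib.parse import urlencode

# "termfreq(name,samsung)" alone triggers "Can't determine Sort Order";
# appending "desc" (or "asc") resolves it.
params = {"q": "*:*", "sort": "termfreq(name,samsung) desc"}
url = "http://localhost:8983/solr/select/?" + urlencode(params)
print(url)
```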
Re: entity name issue
Hi Em, Thanks a lot! But it still does not work. Actually, the "where" clause in my query was '${dataimporter.request.clean}' != 'false' and myschema.table_a.aid=${dataimporter.request.aid}" which I used to pass a value to the full-import process. It worked without the prefix "myschema." on the Sybase database, but did not work on Oracle either with or without the prefix. (Without the prefix it would complain that the table does not exist.) TJ -- View this message in context: http://lucene.472066.n3.nabble.com/entity-name-issue-tp2843812p2846816.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: PECL SOLR PHP extension, JSON output
Hm, yes, correct: there is an explicit validation of response writers in place. If you want to modify it yourself, check the current trunk (http://svn.php.net/repository/pecl/solr/trunk/), modify solr_constants.h to define another response_writer, add another check in solr_functions_helpers.c in the function solr_is_supported_response_writer, compile the module, and go ahead :) Regards Stefan On Thu, Apr 21, 2011 at 1:58 PM, roySolr wrote: > I have tried that but it seems like JSON is not supported > > Parameters > > responseWriter > > One of the following : > > - xml > - phpnative > > > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/PECL-SOLR-PHP-extension-JSON-output-tp2846092p2846728.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Can't determine Sort Order error when using sort by function
Hello, I'm trying out sorting by function with the new function queries and invariably getting this error: Can't determine Sort Order: 'termfreq(name,samsung)', pos=22 Here's an example call: http://localhost:8983/solr/select/?q=*:*&sort=termfreq%28name,samsung%29 What am I doing wrong? Thanks, Otis
Re: PECL SOLR PHP extension, JSON output
I have tried that, but it seems like JSON is not supported. Parameters responseWriter One of the following : - xml - phpnative -- View this message in context: http://lucene.472066.n3.nabble.com/PECL-SOLR-PHP-extension-JSON-output-tp2846092p2846728.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Unable to load EntityProcessor implementation for entity:16865747177753
Can I see your tikaconfig.xml? Meanwhile, have a look at this bug: https://issues.apache.org/jira/browse/SOLR-2116 A similar thread also exists: http://lucene.472066.n3.nabble.com/TikaEntityProcessor-td2839188.html -- View this message in context: http://lucene.472066.n3.nabble.com/Unable-to-load-EntityProcessor-implementation-for-entity-16865747177753-tp2846513p2846574.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: HTMLStripCharFilterFactory, highlighting and InvalidTokenOffsetsException
On 20.04.11 18:51, Robert Muir wrote: > Hi, there is a proposed patch uploaded to the issue. Maybe you can help by reviewing/testing it? If I succeed in compiling Solr, I can test the patch. Is this the right starting point for such an endeavour? http://wiki.apache.org/solr/HackingSolr -robert > 2011/4/20 Robert Gründler: >> Hi all, i'm getting the following exception when using highlighting for a field containing HTMLStripCharFilterFactory: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token ... exceeds length of provided text sized 21 It seems this is a known issue: https://issues.apache.org/jira/browse/LUCENE-2208 Does anyone know if there's a fix implemented yet in solr? thanks! -robert
Re: PECL SOLR PHP extension, JSON output
give it a try: http://php.net/manual/en/solrclient.setresponsewriter.php On Thu, Apr 21, 2011 at 9:03 AM, roySolr wrote: > Hello, > > I use the PECL php extension for SOLR. I want my output in JSON. > > This is not working: > > $query->set('wt', 'json'); > > How do i solve this problem? > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/PECL-SOLR-PHP-extension-JSON-output-tp2846092p2846092.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Unable to load EntityProcessor implementation for entity:16865747177753
Hello, I have one datasource which is a SQL Server DB, and a second datasource which is a file, but dynamic: based on a record from the first datasource's DB I want to fetch one file. That's why I tried to use the TikaEntityProcessor, but got the following error:

org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to load EntityProcessor implementation for entity:16865747177753 Processing Document # 1
	at org.apache.solr.handler.dataimport.DocBuilder.getEntityProcessor(DocBuilder.java:576)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:314)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
Caused by: java.lang.ClassNotFoundException: Unable to load TikaEntityProcessor or org.apache.solr.handler.dataimport.TikaEntityProcessor
	at org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:738)
	at org.apache.solr.handler.dataimport.DocBuilder.getEntityProcessor(DocBuilder.java:573)
	... 7 more
Caused by: org.apache.solr.common.SolrException: Error loading class 'TikaEntityProcessor'
	at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:375)
	at org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:728)
	... 8 more
Caused by: java.lang.ClassNotFoundException: TikaEntityProcessor
	at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:307)...

data config file: Please help me to solve this problem. Thanks, Vishal -- View this message in context: http://lucene.472066.n3.nabble.com/Unable-to-load-EntityProcessor-implementation-for-entity-16865747177753-tp2846513p2846513.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: The issue of import data from database using Solr DIH
Yes, it is like a left outer join. In my example, the table may be a table, view, or stored procedure; I cannot change it in the database. If for every id in table1 we need to search the fields by id from table2 in the database, it will hit performance issues, especially when the tables are very big. -Original Message- From: lboutros [mailto:boutr...@gmail.com] Sent: Thursday, April 21, 2011 5:25 PM To: solr-user@lucene.apache.org Subject: RE: The issue of import data from database using Solr DIH What you want to do is something like a left outer join, isn't it ? something like : select table2.OS06Y, f1,f2,f3,f4,f5 from table2 left outer join table1 on table2.OS06Y = table1.OS06Y where ... could you prepare a view in your RDBMS ? That could be another solution ? Ludovic. - Jouve France. -- View this message in context: http://lucene.472066.n3.nabble.com/The-issue-of-import-data-from-database-using-Solr-DIH-tp2845318p2846403.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Need to create dynamic indexes based on different document workspaces
Additionally, there is an already set-up example for a multicore setup in the example directory of your Solr distribution. Regards, Em -- View this message in context: http://lucene.472066.n3.nabble.com/Need-to-create-dyanamic-indexies-base-on-different-document-workspaces-tp2845919p2846417.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: The issue of import data from database using Solr DIH
As lboutros mentioned, if you can summarize it in a query, then yes, Solr can handle it. Make a step backward: do not think of Solr. Write a query (one! query) that shows exactly the output you expect. Afterwards, implement this query as a source for DIH. Regards, Em -- View this message in context: http://lucene.472066.n3.nabble.com/The-issue-of-import-data-from-database-using-Solr-DIH-tp2845318p2846414.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: The issue of import data from database using Solr DIH
What you want to do is something like a left outer join, isn't it? Something like: select table2.OS06Y, f1,f2,f3,f4,f5 from table2 left outer join table1 on table2.OS06Y = table1.OS06Y where ... Could you prepare a view in your RDBMS? That could be another solution? Ludovic. - Jouve France. -- View this message in context: http://lucene.472066.n3.nabble.com/The-issue-of-import-data-from-database-using-Solr-DIH-tp2845318p2846403.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Need to create dynamic indexes based on different document workspaces
Actually, you need to put a file named *solr.xml* in the solr.home directory to create the Solr core. You can do that programmatically if you want to make it dynamic based on your logic; please check the Solr CoreAdmin documentation. On Thu, Apr 21, 2011 at 2:52 PM, Gaurav Shingala < gaurav.shing...@hotmail.com> wrote: > > Is it possible to create solr core dyanamically? > > In our case we want each workspace to have its own solr index. > > > > Thanks > > > From: chandan.tamra...@nepasoft.com > > Date: Thu, 21 Apr 2011 11:57:53 +0545 > > Subject: Re: Need to create dyanamic indexies base on different document > workspaces > > To: solr-user@lucene.apache.org > > > > It depends on your application design how you want your index > > > > > > There is a feature called solr core . > http://wiki.apache.org/solr/CoreAdmin > > You could still have a single index but a field to differentiate the > items > > in index > > > > thanks > > > > > > On Thu, Apr 21, 2011 at 10:55 AM, Gaurav Shingala < > > gaurav.shing...@hotmail.com> wrote: > > > > > > > > > > > > > > > > > Hi, > > > > > > Is there a way to create different solr indexes for different > categories? > > > We have different document workspaces and ideally want each workspace > to > > > have its own solr index. > > > > > > Thanks, > > > Gaurav > > > > > > > > > > > > > -- > > Chandan Tamrakar > > * > > * > -- Chandan Tamrakar * *
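Following Chandan's pointer, a per-workspace core could be created through the CoreAdmin handler; a sketch (the core name, instanceDir, and host are hypothetical, and the instanceDir with its conf/ directory must already exist on disk):

```python
from urllib.parse import urlencode

def create_core_url(name, instance_dir, host="http://localhost:8983/solr"):
    """Build a CoreAdmin CREATE request for a per-workspace core."""
    qs = urlencode({"action": "CREATE", "name": name, "instanceDir": instance_dir})
    return "%s/admin/cores?%s" % (host, qs)

url = create_core_url("workspace_42", "workspace_42")
print(url)
# urllib.request.urlopen(url) would then issue the request to a live Solr
```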
RE: The issue of import data from database using Solr DIH
I tried "remove the OS06Y-field from your second entity"; importing the second entity failed. Here is an example:

Table1:
OS06Y=123, f1=100, f2=200, f3=300
OS06Y=456, f1=100, f2=200, f3=300

Table2:
OS06Y=123, f4=100, f5=200
OS06Y=456, f4=100
OS06Y=789, f4=100

I want this result:
OS06Y=123, f1=100, f2=200, f3=300, f4=100, f5=200
OS06Y=456, f1=100, f2=200, f3=300, f4=100
OS06Y=789, f4=100

Can Solr implement this? If yes, how should dataconfig.xml be configured? -Original Message- From: Em [mailto:mailformailingli...@yahoo.de] Sent: Thursday, April 21, 2011 4:59 PM To: solr-user@lucene.apache.org Subject: RE: The issue of import data from database using Solr DIH Not sure I understood you correct: You expect that OS06Y stores *two* different performanceIds? One from table1 and the other from table2? I think this may be a problem. If both OS06Y-keys are equal, than you can use the syntax as mentioned in the wiki without any problems. You just have to rewrite your config to make the second entity a sub-entity and to add a WHERE-clause. If this is really not possible for you, just a guess, what happens if you remove the OS06Y-field from your second entity? Regards, Em -- View this message in context: http://lucene.472066.n3.nabble.com/The-issue-of-import-data-from-database-using-Solr-DIH-tp2845318p2846347.html Sent from the Solr - User mailing list archive at Nabble.com.
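The merge Kevin describes is exactly the left outer join Ludovic suggested, driven from table2. A self-contained sketch of the query on the example data, with SQLite standing in for the real RDBMS:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE table1 (OS06Y, f1, f2, f3)")
cur.execute("CREATE TABLE table2 (OS06Y, f4, f5)")
cur.executemany("INSERT INTO table1 VALUES (?,?,?,?)",
                [(123, 100, 200, 300), (456, 100, 200, 300)])
cur.executemany("INSERT INTO table2 VALUES (?,?,?)",
                [(123, 100, 200), (456, 100, None), (789, 100, None)])

# table2 drives the join, so OS06Y=789 survives with NULL table1 fields.
rows = cur.execute("""
    SELECT table2.OS06Y, f1, f2, f3, f4, f5
    FROM table2 LEFT OUTER JOIN table1 ON table2.OS06Y = table1.OS06Y
    ORDER BY table2.OS06Y""").fetchall()
for r in rows:
    print(r)
# (123, 100, 200, 300, 100, 200)
# (456, 100, 200, 300, 100, None)
# (789, None, None, None, 100, None)
```

This single query could then feed one DIH entity, instead of a per-id sub-entity lookup against table2.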
RE: Need to create dynamic indexes based on different document workspaces
Yes, have a look at the wiki-page. It explains some configurations and REST-API-methods to create cores dynamically and if/how they are persisted. Regards, Em Gaurav Shingala wrote: > > Is it possible to create solr core dyanamically? > > In our case we want each workspace to have its own solr index. > > > > Thanks > >> From: chandan.tamra...@nepasoft.com >> Date: Thu, 21 Apr 2011 11:57:53 +0545 >> Subject: Re: Need to create dyanamic indexies base on different document >> workspaces >> To: solr-user@lucene.apache.org >> >> It depends on your application design how you want your index >> >> >> There is a feature called solr core . >> http://wiki.apache.org/solr/CoreAdmin >> You could still have a single index but a field to differentiate the >> items >> in index >> >> thanks >> >> >> On Thu, Apr 21, 2011 at 10:55 AM, Gaurav Shingala < >> gaurav.shing...@hotmail.com> wrote: >> >> > >> > >> > >> > >> > Hi, >> > >> > Is there a way to create different solr indexes for different >> categories? >> > We have different document workspaces and ideally want each workspace >> to >> > have its own solr index. >> > >> > Thanks, >> > Gaurav >> > >> >> >> >> >> -- >> Chandan Tamrakar >> * >> * > -- View this message in context: http://lucene.472066.n3.nabble.com/Need-to-create-dyanamic-indexies-base-on-different-document-workspaces-tp2845919p2846371.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: how to abort a running optimize
Hi Stockii, how did you configure your segments number in solrconfig.xml? Decrease the number to speed things up. Regards, Em -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-abort-a-running-optimize-tp2838721p2846369.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Need to create dynamic indexes based on different document workspaces
Is it possible to create a solr core dynamically? In our case we want each workspace to have its own solr index. Thanks > From: chandan.tamra...@nepasoft.com > Date: Thu, 21 Apr 2011 11:57:53 +0545 > Subject: Re: Need to create dyanamic indexies base on different document > workspaces > To: solr-user@lucene.apache.org > > It depends on your application design how you want your index > > > There is a feature called solr core . http://wiki.apache.org/solr/CoreAdmin > You could still have a single index but a field to differentiate the items > in index > > thanks > > > On Thu, Apr 21, 2011 at 10:55 AM, Gaurav Shingala < > gaurav.shing...@hotmail.com> wrote: > > > > > > > > > > > Hi, > > > > Is there a way to create different solr indexes for different categories? > > We have different document workspaces and ideally want each workspace to > > have its own solr index. > > > > Thanks, > > Gaurav > > > > > > > -- > Chandan Tamrakar > * > *
RE: The issue of import data from database using Solr DIH
Not sure I understood you correctly: You expect that OS06Y stores *two* different performanceIds? One from table1 and the other from table2? I think this may be a problem. If both OS06Y-keys are equal, then you can use the syntax as mentioned in the wiki without any problems. You just have to rewrite your config to make the second entity a sub-entity and to add a WHERE-clause. If this is really not possible for you, just a guess: what happens if you remove the OS06Y-field from your second entity? Regards, Em -- View this message in context: http://lucene.472066.n3.nabble.com/The-issue-of-import-data-from-database-using-Solr-DIH-tp2845318p2846347.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: stemming filter analyzers, any favorites?
Hi Robert, we often ran into the same issue with stemmers. This is why we created more than one field, each field with different stemmers. It adds some overhead but worked quite well. Regarding your off-topic-question: Look at the debugging-output of your searches. Sometimes you configured your tools, especially the WDF, wrong and the queryParser creates an unexpected result which leads to unmatched but still relevant documents. Please, show us your debugging-output and the field-definition so that we can provide you some help! Regards, Em Robert Petersen-3 wrote: > > I have been doing that, and for Bags example the trailing 's' is not being > removed by the Kstemmer so if indexing the word bags and searching on bag > you get no matches. Why wouldn't the trailing 's' get stemmed off? > Kstemmer is dictionary based so bags isn't in the dictionary? That > trailing 's' should always be dropped no? That seems like it would be > better, we don't want to make synonyms for basic use cases like this. I > fear I will have to return to the Porter stemmer. Are there other better > ones is my main question. > > Off topic secondary question: sometimes I am puzzled by the output of the > analysis page. It seems like there should be a match, but I don't get the > results during a search that I'd expect... > > Like in the case if the WordDelimiterFilterFactory splits up a term into a > bunch of terms before the K-stemmer is applied, sometimes if the matching > term is in position two of the final analysis but the searcher had the > partial term just alone and so thereby in position 1 in the analysis stack > then when searching there wasn't a match. Am I reading this correctly? > Is that right or should that match and I am misreading my analysis output? > > Thanks! > > Robi > > PS I have a category named Bags and am catching flack for it not coming > up in a search for bag. 
hah > PPS the term is not in protwords.txt > > > com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory > {protected=protwords.txt} > term position 1 > term text bags > term type word > source start,end 0,4 > payload > > > -Original Message- > From: Erick Erickson [mailto:erickerick...@gmail.com] > Sent: Wednesday, April 20, 2011 10:55 AM > To: solr-user@lucene.apache.org > Subject: Re: stemming filter analyzers, any favorites? > > You can get a better sense of exactly what tranformations occur when > if you look at the analysis page (be sure to check the "verbose" > checkbox). > > I'm surprised that "bags" doesn't match "bag", what does the analysis > page say? > > Best > Erick > > On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen> wrote: >> Stemming filter analyzers... anyone have any favorites for particular >> search domains? Just wondering what people are using. I'm using Lucid >> K Stemmer and having issues. Seems like it misses a lot of common >> stems. We went to that because of excessively loose matches on the >> solr.PorterStemFilterFactory >> >> >> I understand K Stemmer is a dictionary based stemmer. Seems to me like >> it is missing a lot of common stem reductions. Ie Bags does not match >> Bag in our searches. 
>>
>> Here is my analyzer stack:
>>
>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory" protected="protwords.txt"/>
>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.SynonymFilterFactory" synonyms="query_synonyms.txt" ignoreCase="true" expand="true"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="com.lu
RE: The issue of import data from database using Solr DIH
Thanks Em. Yes, OS06Y is the uniqueKey. Table1 and Table2 are parallel in my example. In the URL http://wiki.apache.org/solr/DIHQuickStart#Index_data_from_multiple_tables_into_Solr the tables don't have a parallel relation in the example. I want to know whether Solr can implement this case: 1. Get data from database table1; 2. Get data from database table2; 3. Merge the fields of table1 and table2. The configuration of db-data-config.xml is the following: Because I don't want to get one id and its data from table1 and then get the data by id from table2; that may hit performance issues. -Original Message- From: Em [mailto:mailformailingli...@yahoo.de] Sent: Thursday, April 21, 2011 4:38 PM To: solr-user@lucene.apache.org Subject: Re: The issue of import data from database using Solr DIH Hi Kevin, I think you made OS06Y the uniqueKey, right? So, in entity 1 you specify values for it, but in entity 2 you do so as well. I am not absolutely sure about this, but: It seems like your two entities create two documents and the second will overwrite the first. Have a look at this page: http://wiki.apache.org/solr/DIHQuickStart#Index_data_from_multiple_tables_into_Solr I think it will help you in rewriting your queries to fit your usecase. Regards, Em -- View this message in context: http://lucene.472066.n3.nabble.com/The-issue-of-import-data-from-database-using-Solr-DIH-tp2845318p2846296.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: entity name issue
Hi Tjong, seems like your XML was invalid. Try the following and compare it to your original config: Regards, Em -- View this message in context: http://lucene.472066.n3.nabble.com/entity-name-issue-tp2843812p2846326.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to return score without using _val_
Hi, I agree with Yonik here - I do not understand what you would like to do either. But some additional note from my side: Your FQs never influence the score! Of course you can specify the same query twice, once as a filter query and once as a regular query, but I do not see the reason to do so. It sounds like unnecessary effort without a win. Regards, Em Bill Bell wrote: > > I would like to influence the score but I would rather not mess with the > q= > field since I want the query to dismax for Q. > > Something like: > > fq={!type=dismax qf=$qqf v=$qspec}& > fq={!type=dismax qt=dismaxname v=$qname}& > q=_val_:"{!type=dismax qf=$qqf v=$qspec}" _val_:"{!type=dismax > qt=dismaxname v=$qname}" > > Is there a way to do a filter and add the FQ to the score by doing it > another way? > > Also does this do multiple queries? Is this the right way to do it? > -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-return-score-without-using-val-tp2841443p2846317.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Apache Spam Filter Blocking Messages
This really helps on the mailing lists. If you send your mails with Thunderbird, be sure to enforce plain-text emails; if not, it will often send HTML mails. Regards, Em Marvin Humphrey wrote: > > On Thu, Apr 21, 2011 at 12:30:29AM -0400, Trey Grainger wrote: >> (FREEMAIL_FROM,FS_REPLICA,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,RFC_ABUSE_POST,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL > > Note the "HTML_MESSAGE" in the list of things SpamAssassin didn't like. > >> Apparently I sound like spam when I write perfectly good English and >> include >> some xml and a link to a jira ticket in my e-mail (I tried a couple >> different variations). Anyone know a way around this filter, or should I >> just respond to those involved in the e-mail chain directly and avoid the >> mailing list? > > Send plain text email instead of HTML. That solves the problem 99% of the > time. > > Marvin Humphrey > -- View this message in context: http://lucene.472066.n3.nabble.com/Apache-Spam-Filter-Blocking-Messages-tp2845854p2846304.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: The issue of import data from database using Solr DIH
Hi Kevin, I think you made OS06Y the uniqueKey, right? So, in entity 1 you specify values for it, but in entity 2 you do so as well. I am not absolutely sure about this, but: It seems like your two entities create two documents and the second will overwrite the first. Have a look at this page: http://wiki.apache.org/solr/DIHQuickStart#Index_data_from_multiple_tables_into_Solr I think it will help you in rewriting your queries to fit your usecase. Regards, Em -- View this message in context: http://lucene.472066.n3.nabble.com/The-issue-of-import-data-from-database-using-Solr-DIH-tp2845318p2846296.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr - upgrade from 1.4.1 to 3.1 - finding AbstractSolrTestCase binaries - help please?
There is a jar for the tests in Solr. I added this dependency in my pom.xml:

<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-core</artifactId>
  <version>3.1-SNAPSHOT</version>
  <classifier>tests</classifier>
  <scope>test</scope>
  <type>jar</type>
</dependency>

Ludovic. - Jouve France. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-upgrade-from-1-4-1-to-3-1-finding-AbstractSolrTestCase-binaries-help-please-tp2845011p2846223.html Sent from the Solr - User mailing list archive at Nabble.com.
PECL SOLR PHP extension, JSON output
Hello, I use the PECL PHP extension for SOLR. I want my output in JSON. This is not working: $query->set('wt', 'json'); How do I solve this problem? -- View this message in context: http://lucene.472066.n3.nabble.com/PECL-SOLR-PHP-extension-JSON-output-tp2846092p2846092.html Sent from the Solr - User mailing list archive at Nabble.com.