Re: How to return score without using _val_
I know that _val_ is the only thing influencing the score. The fq is just to limit also by those queries. What I am asking is whether it is possible to influence the score using _val_ but not in the q parameter? Something like:

bq=_val_:"{!type=dismax qf=$qqf v=$qspec}" _val_:"{!type=dismax qt=dismaxname v=$qname}"

Is there something like that?

On 4/21/11 2:45 AM, "Em" wrote:
>Hi,
>
>I agree with Yonik here - I do not understand what you would like to do as
>well.
>But some additional note from my side:
>Your FQs never influence the score! Of course you can specify the same
>query twice, once as a filter query and once as a regular query, but I do
>not see the reason to do so. It sounds like unnecessary effort without a
>win.
>
>Regards,
>Em
>
>
>Bill Bell wrote:
>>
>> I would like to influence the score but I would rather not mess with the
>> q= field since I want the query to be dismax for Q.
>>
>> Something like:
>>
>> fq={!type=dismax qf=$qqf v=$qspec}&
>> fq={!type=dismax qt=dismaxname v=$qname}&
>> q=_val_:"{!type=dismax qf=$qqf v=$qspec}" _val_:"{!type=dismax
>> qt=dismaxname v=$qname}"
>>
>> Is there a way to do a filter and add the FQ to the score by doing it
>> another way?
>>
>> Also does this do multiple queries? Is this the right way to do it?
>>
>
>
>--
>View this message in context:
>http://lucene.472066.n3.nabble.com/How-to-return-score-without-using-val-tp2841443p2846317.html
>Sent from the Solr - User mailing list archive at Nabble.com.
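A sketch of what the bq route could look like, reusing the $qqf/$qspec/$qname parameter names from the example above. This is hedged: $qname_qf is a hypothetical qf parameter standing in for the qt=dismaxname reference, and whether your handler version parses nested {!...} queries inside bq should be verified before relying on it:

```text
q  = {!type=dismax qf=$qqf v=$qspec}
bq = {!type=dismax qf=$qqf v=$qspec}
bq = {!type=dismax qf=$qname_qf v=$qname}
fq = {!type=dismax qf=$qqf v=$qspec}
fq = {!type=dismax qf=$qname_qf v=$qname}
```

The idea is that fq keeps restricting the result set while each bq adds its nested query's score contribution, leaving q itself free for the main dismax query.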
Re: term position question from analyzer stack for WordDelimiterFilterFactory
On Thu, Apr 21, 2011 at 8:06 PM, Robert Petersen wrote: > So if I don't put preserveOriginal=1 in my WordDelimiterFilterFactory > settings I cannot get a match between AppleTV on the indexing side and > appletv on the search side. Hmmm, that shouldn't be the case. The "text" field in the solr example config doesn't use preserveOriginal, and AppleTV is indexed as appl, tv/appletv And a search for appletv does match fine. Perhaps on the search side there is actually a phrase query like "big appletv"? One workaround for that is to add a little slop... "big appletv"~1 -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
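The slop workaround above, spelled out as a query (a sketch; "text" stands in for whatever field the phrase query actually targets):

```text
q = text:"big appletv"~1
```

The ~1 allows the phrase to match even though WordDelimiterFilter's catenated token shifted "appletv" by one position at index time.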
Re: Indexing 20M documents from MySQL with DIH
Can you post the data-config.xml? Probably you didn't use batchSize.

Sent from my iPhone

On Apr 21, 2011, at 5:09 PM, Scott Bigelow wrote:
> Thanks for the e-mail. I probably should have provided more details,
> but I was more interested in making sure I was approaching the problem
> correctly (using DIH, with one big SELECT statement for millions of
> rows) instead of solving this specific problem. Here's a partial
> stacktrace from this specific problem:
>
> ...
> Caused by: java.io.EOFException: Can not read response from server.
> Expected to read 4 bytes, read 0 bytes before connection was
> unexpectedly lost.
>   at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
>   at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
> ... 22 more
> Apr 21, 2011 3:53:28 AM
> org.apache.solr.handler.dataimport.EntityProcessorBase getNext
> SEVERE: getNext() failed for query 'REDACTED'
> org.apache.solr.handler.dataimport.DataImportHandlerException:
> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
> Communications link failure
>
> The last packet successfully received from the server was 128
> milliseconds ago. The last packet sent successfully to the server was
> 25,273,484 milliseconds ago.
> ...
>
> A custom indexer, so that's a fairly common practice? So when you are
> dealing with these large indexes, do you try not to fully rebuild them
> when you can? It's not a nightly thing, but something to do in case of
> a disaster? Is there a difference in the performance of an index that
> was built all at once vs. one that has had delta inserts and updates
> applied over a period of months?
>
> Thank you for your insight.
>
> On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter wrote:
>>
>> : For a new project, I need to index about 20M records (30 fields) and I
>> : have been running into issues with MySQL disconnects, right around
>> : 15M. I've tried several remedies I've found on blogs, changing
>>
>> if you can provide some concrete error/log messages and the details of how
>> you are configuring your datasource that might help folks provide better
>> suggestions -- you've said you run into a problem but you haven't provided
>> any details for people to go on in giving you feedback.
>>
>> : resolved the issue. It got me wondering: Is this the way everyone does
>> : it? What about 100M records up to 1B; are those all pulled using DIH
>> : and a single query?
>>
>> I've only recently started using DIH, and while it definitely has a lot
>> of quirks/annoyances, it seems like a pretty good 80/20 solution for
>> indexing with Solr -- but that doesn't mean it's perfect for all
>> situations.
>>
>> Writing custom indexer code can certainly make sense in a lot of cases --
>> particularly where you already have a data publishing system that you want
>> to tie into directly -- the trick is to ensure you have a decent strategy
>> for rebuilding the entire index should the need arise (but this is really
>> only an issue if your primary indexing solution is incremental -- many use
>> cases can be satisfied just fine with a brute force "full rebuild
>> periodically" implementation).
>>
>> -Hoss
Re: Indexing 20M documents from MySQL with DIH
Thanks for the e-mail. I probably should have provided more details, but I was more interested in making sure I was approaching the problem correctly (using DIH, with one big SELECT statement for millions of rows) instead of solving this specific problem. Here's a partial stacktrace from this specific problem: ... Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost. at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539) at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989) ... 22 more Apr 21, 2011 3:53:28 AM org.apache.solr.handler.dataimport.EntityProcessorBase getNext SEVERE: getNext() failed for query 'REDACTED' org.apache.solr.handler.dataimport.DataImportHandlerException: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure The last packet successfully received from the server was 128 milliseconds ago. The last packet sent successfully to the server was 25,273,484 milliseconds ago. ... A custom indexer, so that's a fairly common practice? So when you are dealing with these large indexes, do you try not to fully rebuild them when you can? It's not a nightly thing, but something to do in case of a disaster? Is there a difference in the performance of an index that was built all at once vs. one that has had delta inserts and updates applied over a period of months? Thank you for your insight. On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter wrote: > > : For a new project, I need to index about 20M records (30 fields) and I > : have been running into issues with MySQL disconnects, right around > : 15M. 
I've tried several remedies I've found on blogs, changing > > if you can provide some concrete error/log messages and the details of how > you are configuring your datasource that might help folks provide better > suggestions -- you've said you run into a problem but you haven't provided > any details for people to go on in giving you feedback. > > : resolved the issue. It got me wondering: Is this the way everyone does > : it? What about 100M records up to 1B; are those all pulled using DIH > : and a single query? > > I've only recently started using DIH, and while it definitely has a lot > of quirks/annoyances, it seems like a pretty good 80/20 solution for > indexing with Solr -- but that doesn't mean it's perfect for all > situations. > > Writing custom indexer code can certainly make sense in a lot of cases -- > particularly where you already have a data publishing system that you want > to tie into directly -- the trick is to ensure you have a decent strategy > for rebuilding the entire index should the need arise (but this is really > only an issue if your primary indexing solution is incremental -- many use > cases can be satisfied just fine with a brute force "full rebuild > periodically" implementation). > > > -Hoss >
term position question from analyzer stack for WordDelimiterFilterFactory
So if I don't put preserveOriginal=1 in my WordDelimiterFilterFactory settings I cannot get a match between AppleTV on the indexing side and appletv on the search side. Without that setting, the all-lowercase version of AppleTV is in term position two due to the catenateWords=1 or the catenateAll=1 settings. I am surprised. How does term position affect searching?

Here is my analysis with preserveOriginal=1 to make the lowercase form occur in both term positions 1 and 2 (offsets are shown as start,end):

Index Analyzer

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
  position 1: AppleTV (0,7)

org.apache.solr.analysis.SynonymFilterFactory {synonyms=index_synonyms.txt, expand=true, ignoreCase=true}
  position 1: AppleTV (0,7)

org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
  position 1: AppleTV (0,7)

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1, generateNumberParts=1, catenateWords=1, generateWordParts=1, catenateAll=1, catenateNumbers=1}
  positions 1-2: AppleTV (0,7), TV (5,7), Apple (0,5), AppleTV (0,7)

org.apache.solr.analysis.LowerCaseFilterFactory {}
  positions 1-2: appletv (0,7), tv (5,7), apple (0,5), appletv (0,7)

com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory {protected=protwords.txt}
  positions 1-2: appletv (0,7), tv (5,7), apple (0,5), appletv (0,7)

org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
  positions 1-2: appletv (0,7), tv (5,7), apple (0,5), appletv (0,7)

Query Analyzer

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
  position 1: appletv (0,7)

org.apache.solr.analysis.SynonymFilterFactory {synonyms=query_synonyms.txt, expand=true, ignoreCase=true}
  position 1: appletv (0,7)

org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
  position 1: appletv (0,7)

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1, generateNumberParts=1, catenateWords=1, generateWordParts=1, catenateAll=1, catenateNumbers=1}
  position 1: appletv (0,7)

org.apache.solr.analysis.LowerCaseFilterFactory {}
  position 1: appletv (0,7)

com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory {protected=protwords.txt}
  position 1: appletv (0,7)

org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
  position 1: appletv (0,7)
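For reference, the analyzer chain described in the listing above corresponds to a schema.xml fieldType roughly like the following. This is a hedged reconstruction from the analysis output: only the fieldType name and positionIncrementGap value are made up, everything else is taken from the filter listing.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
            expand="true" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- identical chain, but with synonyms="query_synonyms.txt" -->
  </analyzer>
</fieldType>
```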
Re: Indexing 20M documents from MySQL with DIH
: For a new project, I need to index about 20M records (30 fields) and I
: have been running into issues with MySQL disconnects, right around
: 15M. I've tried several remedies I've found on blogs, changing

if you can provide some concrete error/log messages and the details of how you are configuring your datasource that might help folks provide better suggestions -- you've said you run into a problem but you haven't provided any details for people to go on in giving you feedback.

: resolved the issue. It got me wondering: Is this the way everyone does
: it? What about 100M records up to 1B; are those all pulled using DIH
: and a single query?

I've only recently started using DIH, and while it definitely has a lot of quirks/annoyances, it seems like a pretty good 80/20 solution for indexing with Solr -- but that doesn't mean it's perfect for all situations.

Writing custom indexer code can certainly make sense in a lot of cases -- particularly where you already have a data publishing system that you want to tie into directly -- the trick is to ensure you have a decent strategy for rebuilding the entire index should the need arise (but this is really only an issue if your primary indexing solution is incremental -- many use cases can be satisfied just fine with a brute force "full rebuild periodically" implementation).

-Hoss
Re: Highest frequency terms for a subset of documents
Ok, thanks On Friday, April 22, 2011, Yonik Seeley wrote: > On Thu, Apr 21, 2011 at 6:50 PM, Ofer Fort wrote: >> Ok, I'll give it a try, as this is a server I am willing to risk. >> How is the competability between solrj of bulkpostings, trunk, 3.1 and 1.4.1? > > bulkpostings, trunk, and 3.1 should all be relatively solrj > compatible. But the SolrJ javabin format (used by default for > queries) changed for strings between 1.4.1 and 3.1 (SOLR-2034). > > -Yonik > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May > 25-26, San Francisco > > >> On Friday, April 22, 2011, Yonik Seeley wrote: >>> On Thu, Apr 21, 2011 at 6:34 PM, Ofer Fort wrote: So I'm guessing my best approach now would be to test trunk, and hope that as 3.1 cut the performance in half, trunk will do the same >>> >>> Trunk prob won't be much better... but the bulkpostings branch >>> possibly could be. >>> >>> -Yonik >>> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May >>> 25-26, San Francisco >>> Thanks for the info Ofer On Friday, April 22, 2011, Yonik Seeley wrote: > On Thu, Apr 21, 2011 at 6:25 PM, Ofer Fort wrote: >> Well, it was worth the try;-) >> But will using the facet.method=fc, will reducing the subset size >> reduce the time and memory? Meaning is it an O( ndocs of the set)? > > facet.method=fc builds a multi-valued fieldcache like structure > (UnInvertedField) the first time, that > is used for counting facets for all subsequent requests. So the > faceting time (after the first time) is O(ndocs of the set), > but the UnInvertedField singleton uses a large amout of memory > unrelated to any particular base docset. > > -Yonik > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May > 25-26, San Francisco > > >> Thanks >> On Thursday, April 21, 2011, Yonik Seeley >> wrote: >>> On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort wrote: So if i want to use the facet.method=fc, is there a way to speed it up? and remove the bucket size limitation? 
>>> >>> Not really - else we would have done it already ;-) >>> We don't really have great methods for faceting on full-text fields >>> (as opposed to shorter meta-data fields) today. >>> >>> -Yonik >>> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May >>> 25-26, San Francisco >>> >> > >>> >> >
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 6:50 PM, Ofer Fort wrote: > Ok, I'll give it a try, as this is a server I am willing to risk. > How is the competability between solrj of bulkpostings, trunk, 3.1 and 1.4.1? bulkpostings, trunk, and 3.1 should all be relatively solrj compatible. But the SolrJ javabin format (used by default for queries) changed for strings between 1.4.1 and 3.1 (SOLR-2034). -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco > On Friday, April 22, 2011, Yonik Seeley wrote: >> On Thu, Apr 21, 2011 at 6:34 PM, Ofer Fort wrote: >>> So I'm guessing my best approach now would be to test trunk, and hope >>> that as 3.1 cut the performance in half, trunk will do the same >> >> Trunk prob won't be much better... but the bulkpostings branch >> possibly could be. >> >> -Yonik >> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May >> 25-26, San Francisco >> >>> Thanks for the info >>> Ofer >>> >>> On Friday, April 22, 2011, Yonik Seeley wrote: On Thu, Apr 21, 2011 at 6:25 PM, Ofer Fort wrote: > Well, it was worth the try;-) > But will using the facet.method=fc, will reducing the subset size > reduce the time and memory? Meaning is it an O( ndocs of the set)? facet.method=fc builds a multi-valued fieldcache like structure (UnInvertedField) the first time, that is used for counting facets for all subsequent requests. So the faceting time (after the first time) is O(ndocs of the set), but the UnInvertedField singleton uses a large amout of memory unrelated to any particular base docset. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco > Thanks > On Thursday, April 21, 2011, Yonik Seeley > wrote: >> On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort wrote: >>> So if i want to use the facet.method=fc, is there a way to speed it up? >>> and >>> remove the bucket size limitation? 
>> >> Not really - else we would have done it already ;-) >> We don't really have great methods for faceting on full-text fields >> (as opposed to shorter meta-data fields) today. >> >> -Yonik >> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May >> 25-26, San Francisco >> > >>> >> >
Re: Highest frequency terms for a subset of documents
Ok, I'll give it a try, as this is a server I am willing to risk. How is the compatibility between the solrj of bulkpostings, trunk, 3.1 and 1.4.1? On Friday, April 22, 2011, Yonik Seeley wrote: > On Thu, Apr 21, 2011 at 6:34 PM, Ofer Fort wrote: >> So I'm guessing my best approach now would be to test trunk, and hope >> that as 3.1 cut the performance in half, trunk will do the same > > Trunk prob won't be much better... but the bulkpostings branch > possibly could be. > > -Yonik > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May > 25-26, San Francisco > >> Thanks for the info >> Ofer >> >> On Friday, April 22, 2011, Yonik Seeley wrote: >>> On Thu, Apr 21, 2011 at 6:25 PM, Ofer Fort wrote: Well, it was worth the try;-) But will using the facet.method=fc, will reducing the subset size reduce the time and memory? Meaning is it an O( ndocs of the set)? >>> >>> facet.method=fc builds a multi-valued fieldcache like structure >>> (UnInvertedField) the first time, that >>> is used for counting facets for all subsequent requests. So the >>> faceting time (after the first time) is O(ndocs of the set), >>> but the UnInvertedField singleton uses a large amount of memory >>> unrelated to any particular base docset. >>> >>> -Yonik >>> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May >>> 25-26, San Francisco >>> Thanks On Thursday, April 21, 2011, Yonik Seeley wrote: > On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort wrote: >> So if i want to use the facet.method=fc, is there a way to speed it up? >> and >> remove the bucket size limitation? > > Not really - else we would have done it already ;-) > We don't really have great methods for faceting on full-text fields > (as opposed to shorter meta-data fields) today. > > -Yonik > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May > 25-26, San Francisco > >>> >> >
Re: Multiple Tags and Facets
Thank you Hoss. I will try the comma-separated thing out. It seems to be what I searched for. :) Regards, Em Chris Hostetter-3 wrote: > > : I watched an online video with Chris Hostsetter from Lucidimagination. > He > : showed the possibility of having some Facets that exclude *all* filter > while > : also having some Facets that take care of some of the set filters while > : ignoring other filters. > > FWIW: That webinar is nearly identical to the apachecon talk i gave on the > same topic, slides of which can be found here... > > http://people.apache.org/~hossman/apachecon2010/facets/ > > This is the example i used on Slide #29... > > Same Facet, Different Exclusions > > * A key can be specified for a facet to change the name used to > identify it in the response. > * This allows you to have multiple instances of a facet, with >differnet exclusions. > > q = Hot Rod >fq = {!df=colors tag=cx}purple green > facet.field = {!key=all_colors ex=cx}colors > facet.field = {!key=overlap_colors}colors > > ...the point in that example is to treat a field (color) as two > differnt facets: one with exclusions and one without. > > it sounds like what you want is differnet -- i *think* what you > are asking for is multiple exclusions for a single facet. I didn't > mention that in my slides, but you can do that using a comma seperated > list of exclusions... > > q = Hot Rod >fq = {!df=body tag=bc}purple >fq = {!df=interior tag=ic}green > facet.field = {!ex=bc,ic}model > > -Hoss > -- View this message in context: http://lucene.472066.n3.nabble.com/Multiple-Tags-and-Facets-tp2843130p2849115.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 6:34 PM, Ofer Fort wrote: > So I'm guessing my best approach now would be to test trunk, and hope > that as 3.1 cut the performance in half, trunk will do the same Trunk prob won't be much better... but the bulkpostings branch possibly could be. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco > Thanks for the info > Ofer > > On Friday, April 22, 2011, Yonik Seeley wrote: >> On Thu, Apr 21, 2011 at 6:25 PM, Ofer Fort wrote: >>> Well, it was worth the try;-) >>> But will using the facet.method=fc, will reducing the subset size >>> reduce the time and memory? Meaning is it an O( ndocs of the set)? >> >> facet.method=fc builds a multi-valued fieldcache like structure >> (UnInvertedField) the first time, that >> is used for counting facets for all subsequent requests. So the >> faceting time (after the first time) is O(ndocs of the set), >> but the UnInvertedField singleton uses a large amout of memory >> unrelated to any particular base docset. >> >> -Yonik >> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May >> 25-26, San Francisco >> >> >>> Thanks >>> On Thursday, April 21, 2011, Yonik Seeley >>> wrote: On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort wrote: > So if i want to use the facet.method=fc, is there a way to speed it up? > and > remove the bucket size limitation? Not really - else we would have done it already ;-) We don't really have great methods for faceting on full-text fields (as opposed to shorter meta-data fields) today. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco >>> >> >
Re: Highest frequency terms for a subset of documents
So I'm guessing my best approach now would be to test trunk, and hope that as 3.1 cut the performance in half, trunk will do the same Thanks for the info Ofer On Friday, April 22, 2011, Yonik Seeley wrote: > On Thu, Apr 21, 2011 at 6:25 PM, Ofer Fort wrote: >> Well, it was worth the try;-) >> But will using the facet.method=fc, will reducing the subset size >> reduce the time and memory? Meaning is it an O( ndocs of the set)? > > facet.method=fc builds a multi-valued fieldcache like structure > (UnInvertedField) the first time, that > is used for counting facets for all subsequent requests. So the > faceting time (after the first time) is O(ndocs of the set), > but the UnInvertedField singleton uses a large amout of memory > unrelated to any particular base docset. > > -Yonik > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May > 25-26, San Francisco > > >> Thanks >> On Thursday, April 21, 2011, Yonik Seeley wrote: >>> On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort wrote: So if i want to use the facet.method=fc, is there a way to speed it up? and remove the bucket size limitation? >>> >>> Not really - else we would have done it already ;-) >>> We don't really have great methods for faceting on full-text fields >>> (as opposed to shorter meta-data fields) today. >>> >>> -Yonik >>> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May >>> 25-26, San Francisco >>> >> >
Re: Indexing 20M documents from MySQL with DIH
Thanks for your response! I think the issue is that the records are being returned TOO fast from MySQL. I can dump them to CSV in about 30 minutes, but building the solr index takes hours on the system I'm using. I may just need to use a more powerful Solr instance so it doesn't leave MySQL hanging for too long? What about autoCommit, does that factor in to your import strategy? 2011/4/21 Robert Gründler : > we're indexing around 10M records from a mysql database into > a single solr core. > > The DataImportHandler needs to join 3 sub-entities to denormalize > the data. > > We've run into some troubles for the first 2 attempts, but setting > batchSize="-1" for the dataSource resolved the issues. > > Do you need a lot of complex joins to import the data from mysql? > > > > -robert > > > > > On 4/21/11 8:08 PM, Scott Bigelow wrote: >> >> I've been using Solr for a while now, indexing 2-4 million records >> using the DIH to pull data from MySQL, which has been working great. >> For a new project, I need to index about 20M records (30 fields) and I >> have been running into issues with MySQL disconnects, right around >> 15M. I've tried several remedies I've found on blogs, changing >> autoCommit, batchSize etc., and none of them have seem to majorly >> resolved the issue. It got me wondering: Is this the way everyone does >> it? What about 100M records up to 1B; are those all pulled using DIH >> and a single query? >> >> I've used sphinx in the past, which uses multiple queries to pull out >> a subset of records ranged based on PrimaryKey, does Solr offer >> functionality similar to this? It seems that once a Solr index gets to >> a certain size, the indexing of a batch takes longer than MySQL's >> net_write_timeout, so it kills the connection. >> >> Thanks for your help, I really enjoy using Solr and I look forward to >> indexing even more data! > >
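On the autoCommit question raised above: it lives in solrconfig.xml, and for bulk imports a common approach is to keep automatic commits infrequent and issue one explicit commit at the end. A sketch with illustrative thresholds -- both values are assumptions to tune against your hardware, not recommendations:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- commit automatically after this many docs or this many ms,
       whichever comes first -->
  <autoCommit>
    <maxDocs>100000</maxDocs>
    <maxTime>600000</maxTime>
  </autoCommit>
</updateHandler>
```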
Index upgrade from 1.4.1 to 3.1 and 4.0
Hi all,

While doing some tests, I realized that an index that was created with Solr 1.4.1 is readable by Solr 3.1, but not readable by Solr 4.0. If I plan to migrate my index to 4.0, and I prefer not to reindex it all, what would be my best course of action? Will it be possible to continue to write to the index with 3.1? Will that make it readable from 4.0, or only the newly created segments? If I optimize it using 3.1, will that make it readable also from 4.0?

Thanks,
Ofer
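If the optimize-with-3.1 route is attempted (an optimize rewrites every segment, so the whole index ends up in the 3.x segment format; whether a given 4.0 build can then open it is exactly the open question above and should be tested on a copy first), the request itself is just an update call -- host and port are placeholders:

```text
http://localhost:8983/solr/update?optimize=true
```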
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 6:25 PM, Ofer Fort wrote:
> Well, it was worth the try;-)
> But will using the facet.method=fc, will reducing the subset size
> reduce the time and memory? Meaning is it an O( ndocs of the set)?

facet.method=fc builds a multi-valued FieldCache-like structure (UnInvertedField) the first time; that structure is used for counting facets for all subsequent requests. So the faceting time (after the first time) is O(ndocs of the set), but the UnInvertedField singleton uses a large amount of memory unrelated to any particular base docset.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco

> Thanks
> On Thursday, April 21, 2011, Yonik Seeley wrote:
>> On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort wrote:
>>> So if i want to use the facet.method=fc, is there a way to speed it up? and
>>> remove the bucket size limitation?
>>
>> Not really - else we would have done it already ;-)
>> We don't really have great methods for faceting on full-text fields
>> (as opposed to shorter meta-data fields) today.
>>
>> -Yonik
>> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
>
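Pulled together, the "highest frequency terms for a subset" request discussed in this thread looks roughly like the following. Field and filter names are placeholders, and the parenthetical notes are annotations, not part of the request:

```text
q=*:*
&fq=category:electronics     (defines the document subset)
&rows=0
&facet=true
&facet.field=body_text       (the full-text field whose top terms you want)
&facet.method=fc
&facet.limit=50
```

The fq defines the docset, and faceting on the text field with facet.method=fc returns its most frequent terms within that subset -- at the memory cost of the UnInvertedField structure described above.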
Re: Highest frequency terms for a subset of documents
Well, it was worth the try ;-) But when using facet.method=fc, will reducing the subset size reduce the time and memory? Meaning, is it O(ndocs of the set)? Thanks On Thursday, April 21, 2011, Yonik Seeley wrote: > On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort wrote: >> So if i want to use the facet.method=fc, is there a way to speed it up? and >> remove the bucket size limitation? > > Not really - else we would have done it already ;-) > We don't really have great methods for faceting on full-text fields > (as opposed to shorter meta-data fields) today. > > -Yonik > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May > 25-26, San Francisco >
Re: HTMLStripCharFilterFactory, highlighting and InvalidTokenOffsetsException
Perhaps a better place to start is here: http://wiki.apache.org/solr/HowToContribute#Contributing_Code_.28Features.2C_Big_Fixes.2C_Tests.2C_etc29 That page also has information about setting up Eclipse or IntelliJ environments. But the place to start is to get the source and get to the point where you can issue "ant clean test" from the command line. That should compile all the source and run the junit tests. "ant example" will build you a full deployment in the example directory that you can run the usual way "java -jar start.jar". The IDEs also have a wizardly way to apply patches if you don't want to apply them the command-line way. Best Erick 2011/4/21 Robert Gründler : > On 20.04.11 18:51, Robert Muir wrote: >> >> Hi, there is a proposed patch uploaded to the issue. Maybe you can >> help by reviewing/testing it? > > if i succeed in compiling solr, i can test the patch. Is this the right > starting point > for such an endeavour ? http://wiki.apache.org/solr/HackingSolr > > > > -robert > >> 2011/4/20 Robert Gründler: >>> >>> Hi all, >>> >>> i'm getting the following exception when using highlighting for a field >>> containing HTMLStripCharFilterFactory: >>> >>> org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token >>> ... >>> exceeds length of provided text sized 21 >>> >>> It seems this is a known issue: >>> >>> https://issues.apache.org/jira/browse/LUCENE-2208 >>> >>> Does anyone know if there's a fix implemented yet in solr? >>> >>> >>> thanks! >>> >>> >>> -robert >>> >>> >>> >>> > >
Re: Indexing 20M documents from MySQL with DIH
we're indexing around 10M records from a mysql database into a single solr core. The DataImportHandler needs to join 3 sub-entities to denormalize the data. We've run into some troubles for the first 2 attempts, but setting batchSize="-1" for the dataSource resolved the issues. Do you need a lot of complex joins to import the data from mysql? -robert On 4/21/11 8:08 PM, Scott Bigelow wrote: I've been using Solr for a while now, indexing 2-4 million records using the DIH to pull data from MySQL, which has been working great. For a new project, I need to index about 20M records (30 fields) and I have been running into issues with MySQL disconnects, right around 15M. I've tried several remedies I've found on blogs, changing autoCommit, batchSize etc., and none of them seems to have resolved the issue. It got me wondering: Is this the way everyone does it? What about 100M records up to 1B; are those all pulled using DIH and a single query? I've used sphinx in the past, which uses multiple queries to pull out a subset of records ranged based on PrimaryKey, does Solr offer functionality similar to this? It seems that once a Solr index gets to a certain size, the indexing of a batch takes longer than MySQL's net_write_timeout, so it kills the connection. Thanks for your help, I really enjoy using Solr and I look forward to indexing even more data!
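The batchSize="-1" fix mentioned above goes on the dataSource element in data-config.xml; with the MySQL Connector/J driver it makes DIH stream rows one at a time instead of buffering the entire result set in memory. A sketch -- the JDBC URL, credentials, and the query are placeholders:

```xml
<dataConfig>
  <!-- batchSize="-1" enables row-by-row streaming with the MySQL driver -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/mydb"
              user="user" password="pass"
              batchSize="-1"/>
  <document>
    <entity name="record" query="SELECT id, title, body FROM records">
      <!-- field mappings / sub-entities go here -->
    </entity>
  </document>
</dataConfig>
```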
Re: Multiple Tags and Facets
: I watched an online video with Chris Hostetter from Lucidimagination. He
: showed the possibility of having some Facets that exclude *all* filters while
: also having some Facets that take care of some of the set filters while
: ignoring other filters.

FWIW: That webinar is nearly identical to the apachecon talk i gave on the same topic, slides of which can be found here...

http://people.apache.org/~hossman/apachecon2010/facets/

This is the example i used on Slide #29...

Same Facet, Different Exclusions

* A key can be specified for a facet to change the name used to identify it in the response.
* This allows you to have multiple instances of a facet, with different exclusions.

q = Hot Rod
fq = {!df=colors tag=cx}purple green
facet.field = {!key=all_colors ex=cx}colors
facet.field = {!key=overlap_colors}colors

...the point in that example is to treat a field (colors) as two different facets: one with exclusions and one without.

it sounds like what you want is different -- i *think* what you are asking for is multiple exclusions for a single facet. I didn't mention that in my slides, but you can do that using a comma separated list of exclusions...

q = Hot Rod
fq = {!df=body tag=bc}purple
fq = {!df=interior tag=ic}green
facet.field = {!ex=bc,ic}model

-Hoss
MoreLikeThis
Hi all, I have an mlt search set up on my site with over 2 million records in the index. Normally, my results look like: 0 204 Some result. A similar result ... And there are 100 results under response. However, in some cases, there are no results under "response". Why is this the case and is there anything I can do about it? Here is my mlt configuration: title,score 1 100 *,score And here is the URL I use to get results: http://localhost:8983/solr/mlt/?q=title:Some random title Any help on this matter would be greatly appreciated. Thanks! Brian Lamb
Re: Multiple Tags and Facets
Hi Jay,

thank you for your reply. We must extend your example to reproduce what I mean. You have the following facets:

project:
- Solr
- Lucene
- Nutch
- Mahout

source:
- Documentation
- Mailinglist
- Wiki
- Commercial Websites

What I want now is: when I click on Solr + Documentation
(fq={!tag=p}project:Solr&fq={!tag=s}source:Documentation), I want to get back a result set where, on the one hand, I can see that there are no matches for Mahout given the filter queries. On the other hand, I also want to see that there are results available for my search that are merely excluded by the current filters.

This information is useful for creating a powerful UI: you can show the user that there is possibly valuable information available on commercial websites, but that it is excluded from the current search. Another point is that you can "fix" your UI: you always show all facets relevant to the current search, no matter which of them are active. Those that no longer apply to the given result set (like Mahout in our example) still remain in the list of available projects but are marked as unusable (displayed in a soft gray or something like that to show that they are inactive).

My problem is that I do not know how to create such a user experience, because things get complicated as soon as I add another dimension (like the source facet). Since Hoss showed in the Mastering Facets webinar that such cross-taggings are possible, I thought this was an already built-in option in Solr.

Regards,
Em

--
View this message in context: http://lucene.472066.n3.nabble.com/Multiple-Tags-and-Facets-tp2843130p2848085.html Sent from the Solr - User mailing list archive at Nabble.com.
Indexing 20M documents from MySQL with DIH
I've been using Solr for a while now, indexing 2-4 million records using the DIH to pull data from MySQL, which has been working great. For a new project, I need to index about 20M records (30 fields) and I have been running into issues with MySQL disconnects right around 15M. I've tried several remedies I've found on blogs (changing autoCommit, batchSize, etc.), and none of them seems to have resolved the issue. It got me wondering: is this the way everyone does it? What about 100M records, up to 1B; are those all pulled using DIH and a single query? I've used Sphinx in the past, which uses multiple queries to pull out subsets of records ranged by primary key; does Solr offer functionality similar to this? It seems that once a Solr index gets to a certain size, the indexing of a batch takes longer than MySQL's net_write_timeout, so it kills the connection. Thanks for your help, I really enjoy using Solr and I look forward to indexing even more data!
Re: Multiple Tags and Facets
I don't think I understand what you're trying to do. Are you trying to preserve all facets after a user clicks on a facet and thereby triggers a filter query, which would otherwise exclude the other facets? If that's the case, you can use local parameters to tag the filter queries so they are not used for the facets.

Let's say I have the following facets:
- Solr
- Lucene
- Nutch
- Mahout

And I do a search for "solr". All of these links will have a filter query:
- Solr [ ?q=solr&fq=project:solr ]
- Lucene [ ?q=solr&fq=project:lucene ]
- Nutch [ ?q=solr&fq=project:nutch ]
- Mahout [ ?q=solr&fq=project:mahout ]

But if a user clicks on the "Solr" facet, the resulting query will exclude the other facets, so you only see this facet:
- Solr

By using local parameters like this:

?q=solr&fq={!tag=myTag}project:solr
   &facet=on&facet.field={!ex=myTag}project

I can preserve all my facets, so that my query is filtered but all facets still remain:
- Solr
- Lucene
- Nutch
- Mahout

Hope this helps, but I'm not sure that's what you were after.

-Jay

On Wed, Apr 20, 2011 at 8:03 AM, Em wrote:
> Hello,
>
> I watched an online video with Chris Hostsetter from Lucidimagination. He
> showed the possibility of having some Facets that exclude *all* filter
> while
> also having some Facets that take care of some of the set filters while
> ignoring other filters.
>
> Unfortunately the Webinar did not explain how they made this and I wasn't
> able to give a filter/facet more than one tag. 
> > Here is an example: > > Facets and Filters: DocType, Author > > Facet: > - Author > -- George (10) > -- Brian (12) > -- Christian (78) > -- Julia (2) > > -Doctype > -- PDF (70) > -- ODT (10) > -- Word (20) > -- JPEG (1) > -- PNG (1) > > When clicking on "Julia" I would like to achieve the following: > Facet: > - Author > -- George (10) > -- Brian (12) > -- Christian (78) > -- Julia (2) > Julia's Doctypes: > -- JPEG (1) > -- PNG (1) > > -Doctype > -- PDF (70) > -- ODT (10) > -- Word (20) > -- JPEG (1) > -- PNG (1) > > Another example which adds special options to your GUI could be as > following: > Imagine a fashion store. > If you search for "shirt" you get a color-facet: > > colors: > - red (19) > - green (12) > - blue (4) > - black (2) > > As well as a brand-facet: > > brands: > - puma (18) > - nike (19) > > When I click on the red color-facet, I would like to get the following > back: > colors: > - red (19) > - green (12)* > - blue (4)* > - black (2)* > > brands: > - puma (18)* > - nike (19) > > All those filters marked by an "*" could be displayed half-transparent or > so > - they just show the user that those filter-options exist for his/her > search > but aren't included in the result-set, since he/she excluded them by > clicking the "red" filter. > > This case is more interesting, if not all red shirts were from nike. > This way you can show the user that i.e. 8 of 19 red - shirts are from the > brand you selected/you see 8 of 19 red shirts. > > I hope I explained what I want to achive. > > Thank you! > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Multiple-Tags-and-Facets-tp2843130p2843130.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Solr search based on list of terms. Order by max(score) for each term.
Hello,

I am trying to query a Solr server in order to obtain the most relevant results for a list of terms. For example, I have the list of words "nokia", "iphone", "charger". My schema contains the following data:

- nokia
- iphone
- nokia iphone otherwords
- nokia white
- iphone white

If I run a simple query like q=nokia OR iphone OR charger, I get "nokia iphone otherwords" as the most relevant result (because it contains more query terms). I would like to get "nokia", "iphone", or "iphone white" as the first results, because each of them would be the most relevant for an individual term. To obtain the correct list, I could do a query for each term, then aggregate the results and order them based on the maximum score. Can I make this query in one request?

This question has also been asked on http://stackoverflow.com/questions/5743264/solr-search-based-on-list-of-terms-order-by-maxscore-for-each-term

Thank you.
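The client-side fallback described above (one query per term, then merge and rank documents by the best single-term score) can be sketched like this; the document IDs and scores are made up for illustration, not real Solr output:

```python
def merge_by_max_score(per_term_results):
    """per_term_results: dict mapping term -> list of (doc_id, score).

    Keeps, for each document, the best score any single term gave it,
    then ranks documents by that maximum score."""
    best = {}
    for results in per_term_results.values():
        for doc_id, score in results:
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

# Pretend we ran one query per term and got these (doc, score) lists back:
per_term = {
    "nokia":  [("nokia", 2.0), ("nokia iphone otherwords", 0.8)],
    "iphone": [("iphone", 1.9), ("iphone white", 1.5),
               ("nokia iphone otherwords", 0.7)],
}
print(merge_by_max_score(per_term)[0])  # best single-term match first
```

In Lucene terms this max-over-clauses combination is essentially what a DisjunctionMaxQuery over the individual term queries does with a tie-break of 0, though whether that is reachable from plain Solr query syntax here is a separate question.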
RE: stemming filter analyzers, any favorites?
Nice! Thanks! -Original Message- From: Em [mailto:mailformailingli...@yahoo.de] Sent: Thursday, April 21, 2011 9:23 AM To: solr-user@lucene.apache.org Subject: RE: stemming filter analyzers, any favorites? As far as I know Lucene does not store an inverted index per field, so no, it would not double the size of the index. However, it could influence the score a little bit. For example: If both stemmers reduce "schools" to "school" and you are searching for "all schools in america" the term "school" has more weight to the resulting score, since it definitly occurs in two fields which consist of nearly the same value. To reduce this effect you could write your own queryParser which creates a disjunctionMaxQuery consisting of two boolean queries and a tie-break of 0 - so only the better scoring stemmed-field contributes to the total score of your document. Regards, Em Robert Petersen-3 wrote: > > Adding another field with another stemmer and searching both??? Wow never > thought of doing that. I guess that doesn't really double the size of > your index tho because all the terms are almost the same right? Let me > look into that. I'll raise the other issue in a separate thread and > thanks. > > -Original Message- > From: Em [mailto:mailformailingli...@yahoo.de] > Sent: Thursday, April 21, 2011 1:55 AM > To: solr-user@lucene.apache.org > Subject: RE: stemming filter analyzers, any favorites? > > Hi Robert, > > we often ran into the same issue with stemmers. This is why we created > more > than one field, each field with different stemmers. It adds some overhead > but worked quite well. > > Regarding your off-topic-question: > Look at the debugging-output of your searches. Sometimes you configured > your > tools, especially the WDF, wrong and the queryParser creates an unexpected > result which leads to unmatched but still relevant documents. > > Please, show us your debugging-output and the field-definition so that we > can provide you some help! 
> > Regards, > Em > > > Robert Petersen-3 wrote: >> >> I have been doing that, and for Bags example the trailing 's' is not >> being >> removed by the Kstemmer so if indexing the word bags and searching on bag >> you get no matches. Why wouldn't the trailing 's' get stemmed off? >> Kstemmer is dictionary based so bags isn't in the dictionary? That >> trailing 's' should always be dropped no? That seems like it would be >> better, we don't want to make synonyms for basic use cases like this. I >> fear I will have to return to the Porter stemmer. Are there other better >> ones is my main question. >> >> Off topic secondary question: sometimes I am puzzled by the output of the >> analysis page. It seems like there should be a match, but I don't get >> the >> results during a search that I'd expect... >> >> Like in the case if the WordDelimiterFilterFactory splits up a term into >> a >> bunch of terms before the K-stemmer is applied, sometimes if the matching >> term is in position two of the final analysis but the searcher had the >> partial term just alone and so thereby in position 1 in the analysis >> stack >> then when searching there wasn't a match. Am I reading this correctly? >> Is that right or should that match and I am misreading my analysis >> output? >> >> Thanks! >> >> Robi >> >> PS I have a category named Bags and am catching flack for it not coming >> up in a search for bag. hah >> PPS the term is not in protwords.txt >> >> >> com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory >> {protected=protwords.txt} >> term position1 >> term textbags >> term typeword >> source start,end 0,4 >> payload >> >> >> -Original Message- >> From: Erick Erickson [mailto:erickerick...@gmail.com] >> Sent: Wednesday, April 20, 2011 10:55 AM >> To: solr-user@lucene.apache.org >> Subject: Re: stemming filter analyzers, any favorites? 
>> >> You can get a better sense of exactly what tranformations occur when >> if you look at the analysis page (be sure to check the "verbose" >> checkbox). >> >> I'm surprised that "bags" doesn't match "bag", what does the analysis >> page say? >> >> Best >> Erick >> >> On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen>> wrote: >>> Stemming filter analyzers... anyone have any favorites for particular >>> search domains? Just wondering what people are using. I'm using Lucid >>> K Stemmer and having issues. Seems like it misses a lot of common >>> stems. We went to that because of excessively loose matches on the >>> solr.PorterStemFilterFactory >>> >>> >>> I understand K Stemmer is a dictionary based stemmer. Seems to me like >>> it is missing a lot of common stem reductions. Ie Bags does not match >>> Bag in our searches. >>> >>> Here is my analyzer stack: >>> >>> >> positionIncrementGap="100"> >>> >>> >> class="solr.WhitespaceTokenizerFact
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort wrote: > So if i want to use the facet.method=fc, is there a way to speed it up? and > remove the bucket size limitation? Not really - else we would have done it already ;-) We don't really have great methods for faceting on full-text fields (as opposed to shorter meta-data fields) today. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
RE: stemming filter analyzers, any favorites?
As far as I know Lucene does not store an inverted index per field, so no, it would not double the size of the index. However, it could influence the score a little bit. For example: If both stemmers reduce "schools" to "school" and you are searching for "all schools in america" the term "school" has more weight to the resulting score, since it definitly occurs in two fields which consist of nearly the same value. To reduce this effect you could write your own queryParser which creates a disjunctionMaxQuery consisting of two boolean queries and a tie-break of 0 - so only the better scoring stemmed-field contributes to the total score of your document. Regards, Em Robert Petersen-3 wrote: > > Adding another field with another stemmer and searching both??? Wow never > thought of doing that. I guess that doesn't really double the size of > your index tho because all the terms are almost the same right? Let me > look into that. I'll raise the other issue in a separate thread and > thanks. > > -Original Message- > From: Em [mailto:mailformailingli...@yahoo.de] > Sent: Thursday, April 21, 2011 1:55 AM > To: solr-user@lucene.apache.org > Subject: RE: stemming filter analyzers, any favorites? > > Hi Robert, > > we often ran into the same issue with stemmers. This is why we created > more > than one field, each field with different stemmers. It adds some overhead > but worked quite well. > > Regarding your off-topic-question: > Look at the debugging-output of your searches. Sometimes you configured > your > tools, especially the WDF, wrong and the queryParser creates an unexpected > result which leads to unmatched but still relevant documents. > > Please, show us your debugging-output and the field-definition so that we > can provide you some help! 
> > Regards, > Em > > > Robert Petersen-3 wrote: >> >> I have been doing that, and for Bags example the trailing 's' is not >> being >> removed by the Kstemmer so if indexing the word bags and searching on bag >> you get no matches. Why wouldn't the trailing 's' get stemmed off? >> Kstemmer is dictionary based so bags isn't in the dictionary? That >> trailing 's' should always be dropped no? That seems like it would be >> better, we don't want to make synonyms for basic use cases like this. I >> fear I will have to return to the Porter stemmer. Are there other better >> ones is my main question. >> >> Off topic secondary question: sometimes I am puzzled by the output of the >> analysis page. It seems like there should be a match, but I don't get >> the >> results during a search that I'd expect... >> >> Like in the case if the WordDelimiterFilterFactory splits up a term into >> a >> bunch of terms before the K-stemmer is applied, sometimes if the matching >> term is in position two of the final analysis but the searcher had the >> partial term just alone and so thereby in position 1 in the analysis >> stack >> then when searching there wasn't a match. Am I reading this correctly? >> Is that right or should that match and I am misreading my analysis >> output? >> >> Thanks! >> >> Robi >> >> PS I have a category named Bags and am catching flack for it not coming >> up in a search for bag. hah >> PPS the term is not in protwords.txt >> >> >> com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory >> {protected=protwords.txt} >> term position1 >> term textbags >> term typeword >> source start,end 0,4 >> payload >> >> >> -Original Message- >> From: Erick Erickson [mailto:erickerick...@gmail.com] >> Sent: Wednesday, April 20, 2011 10:55 AM >> To: solr-user@lucene.apache.org >> Subject: Re: stemming filter analyzers, any favorites? 
>> >> You can get a better sense of exactly what tranformations occur when >> if you look at the analysis page (be sure to check the "verbose" >> checkbox). >> >> I'm surprised that "bags" doesn't match "bag", what does the analysis >> page say? >> >> Best >> Erick >> >> On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen>> wrote: >>> Stemming filter analyzers... anyone have any favorites for particular >>> search domains? Just wondering what people are using. I'm using Lucid >>> K Stemmer and having issues. Seems like it misses a lot of common >>> stems. We went to that because of excessively loose matches on the >>> solr.PorterStemFilterFactory >>> >>> >>> I understand K Stemmer is a dictionary based stemmer. Seems to me like >>> it is missing a lot of common stem reductions. Ie Bags does not match >>> Bag in our searches. >>> >>> Here is my analyzer stack: >>> >>> >> positionIncrementGap="100"> >>> >>> >> class="solr.WhitespaceTokenizerFactory"/> >>> >> class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" >>> ignoreCase="true" expand="true"/> >>> >> ignoreCase="true" words="stopword
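The tie-break behaviour described at the top of this message (a DisjunctionMaxQuery with a tie-break of 0, so only the better-scoring stemmed field contributes) works out as: score = best clause score + tie * (sum of the other clause scores). A toy sketch with made-up numbers, not Lucene's actual similarity math:

```python
def dismax_score(clause_scores, tie=0.0):
    """Disjunction-max combination: the best-scoring clause counts fully,
    the remaining clauses are scaled by the tie-break factor."""
    best = max(clause_scores)
    return best + tie * (sum(clause_scores) - best)

# Two stemmed copies of the same field score nearly the same:
scores = [1.5, 1.0]
print(dismax_score(scores, tie=0.0))  # 1.5 -- only the better field counts
print(dismax_score(scores, tie=1.0))  # 2.5 -- both fields count fully
```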
Re: Multiple Tags and Facets
Are there any ideas on how to use multiple tags per filter, or how to combine tags to exclude more than one filter per facet? Regards, Em -- View this message in context: http://lucene.472066.n3.nabble.com/Multiple-Tags-and-Facets-tp2843130p2847569.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: stemming filter analyzers, any favorites?
Adding another field with another stemmer and searching both??? Wow never thought of doing that. I guess that doesn't really double the size of your index tho because all the terms are almost the same right? Let me look into that. I'll raise the other issue in a separate thread and thanks. -Original Message- From: Em [mailto:mailformailingli...@yahoo.de] Sent: Thursday, April 21, 2011 1:55 AM To: solr-user@lucene.apache.org Subject: RE: stemming filter analyzers, any favorites? Hi Robert, we often ran into the same issue with stemmers. This is why we created more than one field, each field with different stemmers. It adds some overhead but worked quite well. Regarding your off-topic-question: Look at the debugging-output of your searches. Sometimes you configured your tools, especially the WDF, wrong and the queryParser creates an unexpected result which leads to unmatched but still relevant documents. Please, show us your debugging-output and the field-definition so that we can provide you some help! Regards, Em Robert Petersen-3 wrote: > > I have been doing that, and for Bags example the trailing 's' is not being > removed by the Kstemmer so if indexing the word bags and searching on bag > you get no matches. Why wouldn't the trailing 's' get stemmed off? > Kstemmer is dictionary based so bags isn't in the dictionary? That > trailing 's' should always be dropped no? That seems like it would be > better, we don't want to make synonyms for basic use cases like this. I > fear I will have to return to the Porter stemmer. Are there other better > ones is my main question. > > Off topic secondary question: sometimes I am puzzled by the output of the > analysis page. It seems like there should be a match, but I don't get the > results during a search that I'd expect... 
> > Like in the case if the WordDelimiterFilterFactory splits up a term into a > bunch of terms before the K-stemmer is applied, sometimes if the matching > term is in position two of the final analysis but the searcher had the > partial term just alone and so thereby in position 1 in the analysis stack > then when searching there wasn't a match. Am I reading this correctly? > Is that right or should that match and I am misreading my analysis output? > > Thanks! > > Robi > > PS I have a category named Bags and am catching flack for it not coming > up in a search for bag. hah > PPS the term is not in protwords.txt > > > com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory > {protected=protwords.txt} > term position 1 > term text bags > term type word > source start,end 0,4 > payload > > > -Original Message- > From: Erick Erickson [mailto:erickerick...@gmail.com] > Sent: Wednesday, April 20, 2011 10:55 AM > To: solr-user@lucene.apache.org > Subject: Re: stemming filter analyzers, any favorites? > > You can get a better sense of exactly what tranformations occur when > if you look at the analysis page (be sure to check the "verbose" > checkbox). > > I'm surprised that "bags" doesn't match "bag", what does the analysis > page say? > > Best > Erick > > On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen> wrote: >> Stemming filter analyzers... anyone have any favorites for particular >> search domains? Just wondering what people are using. I'm using Lucid >> K Stemmer and having issues. Seems like it misses a lot of common >> stems. We went to that because of excessively loose matches on the >> solr.PorterStemFilterFactory >> >> >> I understand K Stemmer is a dictionary based stemmer. Seems to me like >> it is missing a lot of common stem reductions. Ie Bags does not match >> Bag in our searches. 
>> >> Here is my analyzer stack: >> >> > positionIncrementGap="100"> >> >> > class="solr.WhitespaceTokenizerFactory"/> >> > class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" >> ignoreCase="true" expand="true"/> >> > ignoreCase="true" words="stopwords.txt"/> >> > generateWordParts="1" >> generateNumberParts="1" >> catenateWords="1" >> catenateNumbers="1" >> catenateAll="1" >> preserveOriginal="1" >> /> > class="solr.LowerCaseFilterFactory"/> >> >> > class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory" >> protected="protwords.txt"/> >> > class="solr.RemoveDuplicatesTokenFilterFactory"/> >> >> >> > class="solr.WhitespaceTokenizerFactory"/> >> > class="solr.SynonymFilterFactory" synonyms="query_synonyms.txt" >> ignoreCase="true" expand=
Re: old searchers not closing after optimize or replication
Hey Bernd,

Check out https://issues.apache.org/jira/browse/SOLR-2469. There is a pretty bad bug in Solr 3.1 which occurs if you have startup set in your replication configuration in solrconfig.xml. See the thread between Yonik and myself from a few days ago titled "Solr 3.1: Old Index Files Not Removed on Optimize". You can disable startup replication and perform an optimize to see if this fixes your problem of old index files being left behind (though you may have some old index files left behind from before this change that you still need to clean up). Yonik has already pushed a patch into the 3x branch and trunk for this issue. I can confirm that applying the patch (or just removing startup replication) resolved the issue for us. Do you think this is your issue?

Thanks,

-Trey

On Thu, Apr 21, 2011 at 2:27 AM, Bernd Fehling wrote:
> Hi Erik,
>
> <str name="maxCommitsToKeep">1</str>
> <str name="maxOptimizedCommitsToKeep">0</str>
>
> Due to 44 minutes optimization time we do an optimization once a day
> during the night.
>
> I will try with an smaler index on my development system.
>
> Best regards,
> Bernd
>
>
> Am 20.04.2011 17:50, schrieb Erick Erickson:
>>
>> It looks OK, but still doesn't explain keeping the old files around. What
>> does your <deletionPolicy> in your solrconfig.xml look like? It's
>> possible that you're seeing Solr attempt to keep around several
>> optimized copies of the index, but that still doesn't explain why
>> restarting Solr removes them unless the deletionPolicy gets invoked
>> on sometime and you're index files are aging out (I don't know the
>> internals of deletion well enough to say).
>>
>> About optimization. It's become less important with recent code. Once
>> upon a time, it made a substantial difference in search speed. More
>> recently, it has very little impact on search speed, and is used
>> much more sparingly. Its greatest benefit is reclaiming unused resources
>> left over from deleted documents. So you might want to avoid the pain
>> of optimizing (44 minutes!)
and only optimize rarely of if you have >> deleted a lot of documents. >> >> It might be worthwhile to try (with a smaller index !) a bunch of optimize >> cycles and see if the idea has any merit. I'd expect >> your index to reach a maximum and stay there after the saved >> copies of the index was reached... >> >> But otherwise I'm puzzled... >> >> Erick >> >> On Wed, Apr 20, 2011 at 10:30 AM, Bernd Fehling >> wrote: >>> >>> Hi Erik, >>> >>> Am 20.04.2011 15:42, schrieb Erick Erickson: H, this isn't right. You've pretty much eliminated the obvious things. What does lsof show? I'm assuming it shows the files are being held open by your Solr instance, but it's worth checking. >>> >>> Just commited new content 3 times and finally optimized. >>> Again having old index files left. >>> >>> Then checked on my master, only the newest version of index files are >>> listed with lsof. No file handles to the old index files but the >>> old index files remain in data/index/. >>> Thats strange. >>> >>> This time replication worked fine and cleaned up old index on slaves. >>> I'm not getting the same behavior, admittedly on a Windows box. The only other thing I can think of is that you have a query that's somehow never ending, but that's grasping at straws. Do your log files show anything interesting? 
>>> >>> Lets see: >>> - it has the old generation (generation=12) and its files >>> - and recognizes that there have been several commits (generation=18) >>> >>> 20.04.2011 14:05:26 org.apache.solr.update.DirectUpdateHandler2 commit >>> INFO: start >>> >>> commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false) >>> 20.04.2011 14:05:26 org.apache.solr.core.SolrDeletionPolicy onInit >>> INFO: SolrDeletionPolicy.onInit: commits:num=2 >>> >>> >>> commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_c,version=1302159868435,generation=12,filenames=[_3xm.nrm, >>> _3xm.fdx, segment >>> s_c, _3xm.fnm, _3xm.fdt, _3xm.tis, _3xm.tii, _3xm.prx, _3xm.frq] >>> >>> >>> commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_i,version=1302159868447,generation=18,filenames=[_3xm.nrm, >>> _3xo.tis, _3xp.pr >>> x, _3xo.fnm, _3xp.fdx, _3xs.frq, _3xo.tii, _3xp.fdt, _3xn.tii, _3xm.fdx, >>> _3xn.nrm, _3xm.fdt, _3xs.prx, _3xn.tis, _3xn.fdx, _3xr.nrm, _3xm.prx, >>> _3xn.fdt, _3x >>> p.tii, _3xs.nrm, _3xp.tis, _3xo.prx, segments_i, _3xm.tii, _3xq.tii, >>> _3xs.fdx, _3xs.fdt, _3xo.frq, _3xn.prx, _3xm.tis, _3xr.prx, _3xq.tis, >>> _3xo.fdt, _3xp.fr >>> q, _3xq.fnm, _3xo.fdx, _3xp.fnm, _3xr.tis, _3xr.fnm, _3xq.frq, _3xr.tii, >>> _3xr.frq, _3xo.nrm, _3xs.tii, _3xq.fdx, _3xq.fdt, _3xp.nrm, _3xq.prx, >>> _3xs.tis, _3x >>> m.frq, _3xr.fdx, _3xm.fnm, _3xn.frq, _3xq.nrm, _3xs.fnm, _3xn.fnm, >>> _3xr.fdt] >>> 20.04.2011 14:05:26 org.apache.solr.core.SolrDeletionPolicy updateCommits >>> INFO: newest commit = 1302159868447 >>> >>> >>> - after 44 minutes of optimizing (over 140GB and 27.8 mio docs) it gets >>> the Sol
Re: Apache Spam Filter Blocking Messages
Good to know; I'll go change those settings, then. Thanks for the feedback. -Trey On Thu, Apr 21, 2011 at 4:42 AM, Em wrote: > > This really helps at the mailinglists. > If you send your mails with Thunderbird, be sure to check that you enforce > plain-text-emails. If not, it will often send HTML-mails. > > Regards, > Em > > > Marvin Humphrey wrote: > > > > On Thu, Apr 21, 2011 at 12:30:29AM -0400, Trey Grainger wrote: > >> (FREEMAIL_FROM,FS_REPLICA,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,RFC_ABUSE_POST,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL > > > > Note the "HTML_MESSAGE" in the list of things SpamAssassin didn't like. > > > >> Apparently I sound like spam when I write perfectly good English and > >> include > >> some xml and a link to a jira ticket in my e-mail (I tried a couple > >> different variations). Anyone know a way around this filter, or should I > >> just respond to those involved in the e-mail chain directly and avoid the > >> mailing list? > > > > Send plain text email instead of HTML. That solves the problem 99% of the > > time. > > > > Marvin Humphrey > > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Apache-Spam-Filter-Blocking-Messages-tp2845854p2846304.html > Sent from the Solr - User mailing list archive at Nabble.com.
Re: Highest frequency terms for a subset of documents
So if i want to use the facet.method=fc, is there a way to speed it up? and remove the bucket size limitation? On Thu, Apr 21, 2011 at 5:58 PM, Yonik Seeley wrote: > On Thu, Apr 21, 2011 at 10:41 AM, Ofer Fort wrote: > > I see, thanks. > > So if I would want to implement something that would fit my needs, would > > going through the subset of documents and counting all the terms in each > > one, would be faster? and easier to implement? > > That's not just your needs, that's everyone's needs (it's the > definition of field faceting). > There's no way to do what you're asking with a term enumerator (i.e. > facet.method=enum). > > Going through documents and counting all the terms in each is what > facet.method=fc does. > But it's also not great when the number of unique terms per document is > high. > If you can think of a better way, go for it! > > > -Yonik > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May > 25-26, San Francisco >
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 10:41 AM, Ofer Fort wrote: > I see, thanks. > So if I would want to implement something that would fit my needs, would > going through the subset of documents and counting all the terms in each > one, would be faster? and easier to implement? That's not just your needs, that's everyone's needs (it's the definition of field faceting). There's no way to do what you're asking with a term enumerator (i.e. facet.method=enum). Going through documents and counting all the terms in each is what facet.method=fc does. But it's also not great when the number of unique terms per document is high. If you can think of a better way, go for it! -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
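The two counting strategies contrasted above can be sketched over a toy inverted index (term -> doc ids); this only illustrates the order of iteration, not Solr's actual data structures:

```python
index = {            # toy inverted index: term -> set of doc ids
    "nokia":  {1, 2, 3},
    "iphone": {2, 4},
    "white":  {4, 5},
}
base_set = {2, 4}    # docs matching the base query

# facet.method=enum: step over EVERY term in the field and intersect
# its doc set with the base set -- cost grows with the term count.
enum_counts = {t: len(docs & base_set) for t, docs in index.items()}

# facet.method=fc: step over the docs in the base set and count the
# terms each one contains (found here by scanning the toy index;
# Solr uses an uninverted per-document view instead).
fc_counts = {}
for doc in base_set:
    for term, docs in index.items():
        if doc in docs:
            fc_counts[term] = fc_counts.get(term, 0) + 1

print(enum_counts)  # both strategies produce the same counts
```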
Re: Highest frequency terms for a subset of documents
I see, thanks. So if I would want to implement something that would fit my needs, would going through the subset of documents and counting all the terms in each one, would be faster? and easier to implement? On Thu, Apr 21, 2011 at 5:36 PM, Yonik Seeley wrote: > On Thu, Apr 21, 2011 at 9:44 AM, Ofer Fort wrote: > > Not sure i fully understand, > > If "facet.method=enum steps over all terms in the index for that field", > > than what does setting the q=field:subset do? if i set the q=*:*, than > how > > do i get the frequency only on my subset? > > It's an implementation detail. Faceting *does* just give you counts > that just match > q=field:subset. How it does it is a different matter (i.e. for > facet.method=enum, it > must step over all terms in the field), so it's closer to O(nterms in > field) rather than O(ndocs in base set) > > -Yonik > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May > 25-26, San Francisco > > > > Ofer > > > > On Thu, Apr 21, 2011 at 4:40 PM, Yonik Seeley < > yo...@lucidimagination.com> > > wrote: > >> > >> On Thu, Apr 21, 2011 at 9:24 AM, Ofer Fort wrote: > >> > Another strange behavior is that the Qtime seems pretty stable, no > >> > matter > >> > how many object match my query. 200K and 20K both take about 17s. > >> > I would have guessed that since the time is going over all the terms > of > >> > all > >> > the subset documents, would mean that the more documents, the more > time. > >> > >> facet.method=enum steps over all terms in the index for that field... > >> that takes time regardless of how many documents are in the base set. > >> > >> There are also short-circuit methods that avoid looking at the docs > >> for a term if it's docfreq is low enough that it couldn't possibly > >> make it into the priority queue. Because if this, it can actually be > >> faster to facet on a larger base set (try *:* as the base query). 
> >> > >> Actually, it might be interesting to see the query time if you set > >> facet.mincount equal to the number of docs in the base set - that will > >> test pretty much just the time to enumerate over the terms without > >> doing any set intersections at all. Be careful not to set mincount > >> greater than the number of docs in the base set though - solr will > >> short-circuit that too and skip enumeration altogether. > >> > >> The work on the bulkpostings branch should definitely speed up your > >> case even more - but I have no idea when it will "land" on trunk. > >> > >> > >> -Yonik > >> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May > >> 25-26, San Francisco > > > > >
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 9:44 AM, Ofer Fort wrote: > Not sure i fully understand, > If "facet.method=enum steps over all terms in the index for that field", > than what does setting the q=field:subset do? if i set the q=*:*, than how > do i get the frequency only on my subset? It's an implementation detail. Faceting *does* just give you counts that just match q=field:subset. How it does it is a different matter (i.e. for facet.method=enum, it must step over all terms in the field), so it's closer to O(nterms in field) rather than O(ndocs in base set) -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco > Ofer > > On Thu, Apr 21, 2011 at 4:40 PM, Yonik Seeley > wrote: >> >> On Thu, Apr 21, 2011 at 9:24 AM, Ofer Fort wrote: >> > Another strange behavior is that the Qtime seems pretty stable, no >> > matter >> > how many object match my query. 200K and 20K both take about 17s. >> > I would have guessed that since the time is going over all the terms of >> > all >> > the subset documents, would mean that the more documents, the more time. >> >> facet.method=enum steps over all terms in the index for that field... >> that takes time regardless of how many documents are in the base set. >> >> There are also short-circuit methods that avoid looking at the docs >> for a term if it's docfreq is low enough that it couldn't possibly >> make it into the priority queue. Because if this, it can actually be >> faster to facet on a larger base set (try *:* as the base query). >> >> Actually, it might be interesting to see the query time if you set >> facet.mincount equal to the number of docs in the base set - that will >> test pretty much just the time to enumerate over the terms without >> doing any set intersections at all. Be careful not to set mincount >> greater than the number of docs in the base set though - solr will >> short-circuit that too and skip enumeration altogether. 
>> >> The work on the bulkpostings branch should definitely speed up your >> case even more - but I have no idea when it will "land" on trunk. >> >> >> -Yonik >> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May >> 25-26, San Francisco > >
Re: PECL SOLR PHP extension, JSON output
On 21.04.2011 13:58, roySolr wrote: I have tried that but it seems like JSON is not supported Parameters responseWriter One of the following : - xml - phpnative -- View this message in context: http://lucene.472066.n3.nabble.com/PECL-SOLR-PHP-extension-JSON-output-tp2846092p2846728.html Sent from the Solr - User mailing list archive at Nabble.com. And I can't get phpnative working with SOLR 3.1 :-( -- Greets, Ralf Kraus
Re: Highest frequency terms for a subset of documents
Not sure I fully understand. If "facet.method=enum steps over all terms in the index for that field", then what does setting q=field:subset do? If I set q=*:*, then how do I get the frequency only on my subset? Ofer On Thu, Apr 21, 2011 at 4:40 PM, Yonik Seeley wrote: > On Thu, Apr 21, 2011 at 9:24 AM, Ofer Fort wrote: > > Another strange behavior is that the Qtime seems pretty stable, no matter > > how many object match my query. 200K and 20K both take about 17s. > > I would have guessed that since the time is going over all the terms of > all > > the subset documents, would mean that the more documents, the more time. > > facet.method=enum steps over all terms in the index for that field... > that takes time regardless of how many documents are in the base set. > > There are also short-circuit methods that avoid looking at the docs > for a term if it's docfreq is low enough that it couldn't possibly > make it into the priority queue. Because if this, it can actually be > faster to facet on a larger base set (try *:* as the base query). > > Actually, it might be interesting to see the query time if you set > facet.mincount equal to the number of docs in the base set - that will > test pretty much just the time to enumerate over the terms without > doing any set intersections at all. Be careful not to set mincount > greater than the number of docs in the base set though - solr will > short-circuit that too and skip enumeration altogether. > > The work on the bulkpostings branch should definitely speed up your > case even more - but I have no idea when it will "land" on trunk. > > > -Yonik > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May > 25-26, San Francisco >
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 9:24 AM, Ofer Fort wrote: > Another strange behavior is that the Qtime seems pretty stable, no matter > how many object match my query. 200K and 20K both take about 17s. > I would have guessed that since the time is going over all the terms of all > the subset documents, would mean that the more documents, the more time. facet.method=enum steps over all terms in the index for that field... that takes time regardless of how many documents are in the base set. There are also short-circuit methods that avoid looking at the docs for a term if its docfreq is low enough that it couldn't possibly make it into the priority queue. Because of this, it can actually be faster to facet on a larger base set (try *:* as the base query). Actually, it might be interesting to see the query time if you set facet.mincount equal to the number of docs in the base set - that will test pretty much just the time to enumerate over the terms without doing any set intersections at all. Be careful not to set mincount greater than the number of docs in the base set though - solr will short-circuit that too and skip enumeration altogether. The work on the bulkpostings branch should definitely speed up your case even more - but I have no idea when it will "land" on trunk. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
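Yonik's facet.mincount experiment translates into request parameters like these (a sketch; the core URL, field name, and the 200K base-set size are placeholders for the poster's setup):

```python
from urllib.parse import urlencode

# With facet.mincount equal to the base-set size, every term whose count
# could not reach the cutoff is skipped before any set intersection, so
# the measured time is mostly raw term enumeration.
params = {
    "q": "field:subset",
    "rows": 0,
    "facet": "true",
    "facet.field": "text",      # placeholder field name
    "facet.method": "enum",
    "facet.mincount": 200000,   # == docs in base set; never set it higher
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```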
Re: Highest frequency terms for a subset of documents
OK, so I copied my index and ran Solr 3.1 against it. Qtime dropped from about 40s to 17s! This is good news, but still longer than I hoped for. I tried to do the same test with 4.0, but I'm getting IndexFormatTooOldException since my index was created using 1.4.1. Is my only option to reindex using 3.1 or 4.0? Another strange behavior is that the Qtime seems pretty stable, no matter how many objects match my query. 200K and 20K both take about 17s. I would have guessed that, since the time goes to iterating over all the terms of all the subset documents, more documents would mean more time. Thanks for any insights ofer On Thu, Apr 21, 2011 at 3:07 AM, Ofer Fort wrote: > my documents are user entries, so i'm guessing they vary a lot. > Tomorrow i'll try 3.1 and also 4.0, and see if they have an improvement. > thanks guys! > > > On Thu, Apr 21, 2011 at 3:02 AM, Yonik Seeley > wrote: > >> On Wed, Apr 20, 2011 at 7:45 PM, Ofer Fort wrote: >> > Thanks >> > but i've disabled the cache already, since my concern is speed and i'm >> > willing to pay the price (memory) >> >> Then you should not disable the cache. >> >> >, and my subset are not fixed. >> > Does the facet search do any extra work that i don't need, that i might >> be >> > able to disable (either by a flag or by a code change), >> > Somehow i feel, or rather hope, that counting the terms of 200K >> documents >> > and finding the top 500 should take less than 30 seconds. >> >> Using facet.enum.cache.minDf should be a little faster than just >> disabling the cache - it's a different code path. >> Using the cache selectively will speed things up, so try setting that >> minDf to 1000 or so for example. >> >> How many unique terms do you have in the index? >> Is this Solr 3.1 - there were some optimizations when there were many >> terms to iterate over? >> You could also try trunk, which has even more optimizations, or the >> bulkpostings branch if you really want to experiment.
>> >> -Yonik >> > >
Re: Can't determine Sort Order error when using sort by function
On Thu, Apr 21, 2011 at 8:30 AM, Otis Gospodnetic wrote: > Hello, > > I'm trying out sorting by function with the new function queries and > invariably > getting this error: > > Can't determine Sort Order: 'termfreq(name,samsung)', pos=22 > > Here's an example call: > http://localhost:8983/solr/select/?q=*:*&sort=termfreq%28name,samsung%29 > > What am I doing wrong? Try adding the sort order "asc" or "desc" after the function. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco > Thanks, > Otis >
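Yonik's fix, spelled out: the sort parameter needs an explicit direction after the function. A sketch of the corrected request (same field and term as in Otis's example; URL encoding handled by the standard library):

```python
from urllib.parse import urlencode

# "termfreq(name,samsung)" alone triggers "Can't determine Sort Order";
# appending "desc" (or "asc") resolves it.
params = {"q": "*:*", "sort": "termfreq(name,samsung) desc"}
url = "http://localhost:8983/solr/select/?" + urlencode(params)
print(url)
```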
Re: entity name issue
Hi Em, Thanks a lot! But it still does not work. Actually, the "where" clause in my query was '${dataimporter.request.clean}' != 'false' and myschema.table_a.aid=${dataimporter.request.aid}" which I used to pass a value to the full-import process. It worked without the prefix "myschema." on the Sybase database, but did not work on Oracle either with or without the prefix. (Without the prefix it would complain that the table does not exist.) TJ -- View this message in context: http://lucene.472066.n3.nabble.com/entity-name-issue-tp2843812p2846816.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: PECL SOLR PHP extension, JSON output
Hm, yes, correct: there is an explicit validation of response writers in place. If you want to modify it yourself, check the current trunk (http://svn.php.net/repository/pecl/solr/trunk/), modify solr_constants.h to define another response_writer, add another check in solr_functions_helpers.c in the function solr_is_supported_response_writer, compile the module, and go ahead :) Regards Stefan On Thu, Apr 21, 2011 at 1:58 PM, roySolr wrote: > I have tried that but it seems like JSON is not supported > > Parameters > > responseWriter > > One of the following : > > - xml > - phpnative > > > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/PECL-SOLR-PHP-extension-JSON-output-tp2846092p2846728.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Can't determine Sort Order error when using sort by function
Hello, I'm trying out sorting by function with the new function queries and invariably getting this error: Can't determine Sort Order: 'termfreq(name,samsung)', pos=22 Here's an example call: http://localhost:8983/solr/select/?q=*:*&sort=termfreq%28name,samsung%29 What am I doing wrong? Thanks, Otis
Re: PECL SOLR PHP extension, JSON output
I have tried that, but it seems like JSON is not supported. Parameters responseWriter One of the following : - xml - phpnative -- View this message in context: http://lucene.472066.n3.nabble.com/PECL-SOLR-PHP-extension-JSON-output-tp2846092p2846728.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Unable to load EntityProcessor implementation for entity:16865747177753
Can I see your tikaconfig.xml? Meanwhile, have a look at this bug: https://issues.apache.org/jira/browse/SOLR-2116 A similar thread also exists: http://lucene.472066.n3.nabble.com/TikaEntityProcessor-td2839188.html -- View this message in context: http://lucene.472066.n3.nabble.com/Unable-to-load-EntityProcessor-implementation-for-entity-16865747177753-tp2846513p2846574.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: HTMLStripCharFilterFactory, highlighting and InvalidTokenOffsetsException
On 20.04.11 18:51, Robert Muir wrote: > Hi, there is a proposed patch uploaded to the issue. Maybe you can help by reviewing/testing it? If I succeed in compiling Solr, I can test the patch. Is this the right starting point for such an endeavour? http://wiki.apache.org/solr/HackingSolr -robert > 2011/4/20 Robert Gründler: >> Hi all, i'm getting the following exception when using highlighting for a field containing HTMLStripCharFilterFactory: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token ... exceeds length of provided text sized 21 It seems this is a known issue: https://issues.apache.org/jira/browse/LUCENE-2208 Does anyone know if there's a fix implemented yet in solr? thanks! -robert
Re: PECL SOLR PHP extension, JSON output
give it a try: http://php.net/manual/en/solrclient.setresponsewriter.php On Thu, Apr 21, 2011 at 9:03 AM, roySolr wrote: > Hello, > > I use the PECL php extension for SOLR. I want my output in JSON. > > This is not working: > > $query->set('wt', 'json'); > > How do i solve this problem? > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/PECL-SOLR-PHP-extension-JSON-output-tp2846092p2846092.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Unable to load EntityProcessor implementation for entity:16865747177753
Hello, I have one datasource which is a SQL Server DB, and a second datasource which is a file, but dynamic: based on a record from the first datasource's DB I want to fetch one file. That's why I tried to use the TikaEntityProcessor, but got the following error:

org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to load EntityProcessor implementation for entity:16865747177753 Processing Document # 1
	at org.apache.solr.handler.dataimport.DocBuilder.getEntityProcessor(DocBuilder.java:576)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:314)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
Caused by: java.lang.ClassNotFoundException: Unable to load TikaEntityProcessor or org.apache.solr.handler.dataimport.TikaEntityProcessor
	at org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:738)
	at org.apache.solr.handler.dataimport.DocBuilder.getEntityProcessor(DocBuilder.java:573)
	... 7 more
Caused by: org.apache.solr.common.SolrException: Error loading class 'TikaEntityProcessor'
	at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:375)
	at org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:728)
	... 8 more
Caused by: java.lang.ClassNotFoundException: TikaEntityProcessor
	at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:307)...

data config file: Please help me to solve this problem. Thanks, Vishal -- View this message in context: http://lucene.472066.n3.nabble.com/Unable-to-load-EntityProcessor-implementation-for-entity-16865747177753-tp2846513p2846513.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: The issue of import data from database using Solr DIH
Yes, it is like a left outer join. In my example, the table may be a table, view, or stored procedure; I cannot change it in the database. If for every id in table1 we need to search the fields by id from table2 in the database, it will hit performance issues, especially when the tables are very big. -Original Message- From: lboutros [mailto:boutr...@gmail.com] Sent: Thursday, April 21, 2011 5:25 PM To: solr-user@lucene.apache.org Subject: RE: The issue of import data from database using Solr DIH What you want to do is something like a left outer join, isn't it ? something like : select table2.OS06Y, f1,f2,f3,f4,f5 from table2 left outer join table1 on table2.OS06Y = table1.OS06Y where ... could you prepare a view in your RDBMS ? That could be another solution ? Ludovic. - Jouve France. -- View this message in context: http://lucene.472066.n3.nabble.com/The-issue-of-import-data-from-database-using-Solr-DIH-tp2845318p2846403.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Need to create dynamic indexes based on different document workspaces
Additionally, there is an already set-up example for a multicore setup in the example directory of your Solr distribution. Regards, Em -- View this message in context: http://lucene.472066.n3.nabble.com/Need-to-create-dyanamic-indexies-base-on-different-document-workspaces-tp2845919p2846417.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: The issue of import data from database using Solr DIH
As lboutros mentioned, if you can summarize it in a query, then yes, Solr can handle it. Make a step backward: do not think of Solr. Write a query (one! query) that shows exactly the output you expect. Afterwards, implement this query as a source for DIH. Regards, Em -- View this message in context: http://lucene.472066.n3.nabble.com/The-issue-of-import-data-from-database-using-Solr-DIH-tp2845318p2846414.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: The issue of import data from database using Solr DIH
What you want to do is something like a left outer join, isn't it? Something like: select table2.OS06Y, f1,f2,f3,f4,f5 from table2 left outer join table1 on table2.OS06Y = table1.OS06Y where ... Could you prepare a view in your RDBMS? That could be another solution? Ludovic. - Jouve France. -- View this message in context: http://lucene.472066.n3.nabble.com/The-issue-of-import-data-from-database-using-Solr-DIH-tp2845318p2846403.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Need to create dynamic indexes based on different document workspaces
Actually, you need to put a file named *solr.xml* in the solr.home directory to create the Solr core. You can do that programmatically if you want to make it dynamic based on your logic; please check the Solr CoreAdmin documentation. On Thu, Apr 21, 2011 at 2:52 PM, Gaurav Shingala < gaurav.shing...@hotmail.com> wrote: > > Is it possible to create solr core dyanamically? > > In our case we want each workspace to have its own solr index. > > > > Thanks > > > From: chandan.tamra...@nepasoft.com > > Date: Thu, 21 Apr 2011 11:57:53 +0545 > > Subject: Re: Need to create dyanamic indexies base on different document > workspaces > > To: solr-user@lucene.apache.org > > > > It depends on your application design how you want your index > > > > > > There is a feature called solr core . > http://wiki.apache.org/solr/CoreAdmin > > You could still have a single index but a field to differentiate the > items > > in index > > > > thanks > > > > > > On Thu, Apr 21, 2011 at 10:55 AM, Gaurav Shingala < > > gaurav.shing...@hotmail.com> wrote: > > > > > > > > > > > > > > > > > Hi, > > > > > > Is there a way to create different solr indexes for different > categories? > > > We have different document workspaces and ideally want each workspace > to > > > have its own solr index. > > > > > > Thanks, > > > Gaurav > > > > > > > > > > > > > -- > > Chandan Tamrakar > > * > > * > -- Chandan Tamrakar * *
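Following Chandan's pointer, a per-workspace core could be created through the CoreAdmin handler; a sketch (the core name, instanceDir, and host are hypothetical, and the instanceDir with its conf/ directory must already exist on disk):

```python
from urllib.parse import urlencode

def create_core_url(name, instance_dir, host="http://localhost:8983/solr"):
    """Build a CoreAdmin CREATE request for a per-workspace core."""
    qs = urlencode({"action": "CREATE", "name": name, "instanceDir": instance_dir})
    return "%s/admin/cores?%s" % (host, qs)

url = create_core_url("workspace_42", "workspace_42")
print(url)
# urllib.request.urlopen(url) would then issue the request to a live Solr
```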
RE: The issue of import data from database using Solr DIH
I tried "remove the OS06Y-field from your second entity"; importing the second entity failed. Here is an example:

Table1:
OS06Y=123, f1=100, f2=200, f3=300
OS06Y=456, f1=100, f2=200, f3=300

Table2:
OS06Y=123, f4=100, f5=200
OS06Y=456, f4=100
OS06Y=789, f4=100

I want this result:
OS06Y=123, f1=100, f2=200, f3=300, f4=100, f5=200
OS06Y=456, f1=100, f2=200, f3=300, f4=100
OS06Y=789, f4=100

Can Solr implement this? If yes, how should dataconfig.xml be configured? -Original Message- From: Em [mailto:mailformailingli...@yahoo.de] Sent: Thursday, April 21, 2011 4:59 PM To: solr-user@lucene.apache.org Subject: RE: The issue of import data from database using Solr DIH Not sure I understood you correct: You expect that OS06Y stores *two* different performanceIds? One from table1 and the other from table2? I think this may be a problem. If both OS06Y-keys are equal, than you can use the syntax as mentioned in the wiki without any problems. You just have to rewrite your config to make the second entity a sub-entity and to add a WHERE-clause. If this is really not possible for you, just a guess, what happens if you remove the OS06Y-field from your second entity? Regards, Em -- View this message in context: http://lucene.472066.n3.nabble.com/The-issue-of-import-data-from-database-using-Solr-DIH-tp2845318p2846347.html Sent from the Solr - User mailing list archive at Nabble.com.
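The merge Kevin describes is exactly the left outer join Ludovic suggested, driven from table2. A self-contained sketch of the query on the example data, with SQLite standing in for the real RDBMS:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE table1 (OS06Y, f1, f2, f3)")
cur.execute("CREATE TABLE table2 (OS06Y, f4, f5)")
cur.executemany("INSERT INTO table1 VALUES (?,?,?,?)",
                [(123, 100, 200, 300), (456, 100, 200, 300)])
cur.executemany("INSERT INTO table2 VALUES (?,?,?)",
                [(123, 100, 200), (456, 100, None), (789, 100, None)])

# table2 drives the join, so OS06Y=789 survives with NULL table1 fields.
rows = cur.execute("""
    SELECT table2.OS06Y, f1, f2, f3, f4, f5
    FROM table2 LEFT OUTER JOIN table1 ON table2.OS06Y = table1.OS06Y
    ORDER BY table2.OS06Y""").fetchall()
for r in rows:
    print(r)
# (123, 100, 200, 300, 100, 200)
# (456, 100, 200, 300, 100, None)
# (789, None, None, None, 100, None)
```

This single query could then feed one DIH entity, instead of a per-id sub-entity lookup against table2.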
RE: Need to create dynamic indexes based on different document workspaces
Yes, have a look at the wiki-page. It explains some configurations and REST-API-methods to create cores dynamically and if/how they are persisted. Regards, Em Gaurav Shingala wrote: > > Is it possible to create solr core dyanamically? > > In our case we want each workspace to have its own solr index. > > > > Thanks > >> From: chandan.tamra...@nepasoft.com >> Date: Thu, 21 Apr 2011 11:57:53 +0545 >> Subject: Re: Need to create dyanamic indexies base on different document >> workspaces >> To: solr-user@lucene.apache.org >> >> It depends on your application design how you want your index >> >> >> There is a feature called solr core . >> http://wiki.apache.org/solr/CoreAdmin >> You could still have a single index but a field to differentiate the >> items >> in index >> >> thanks >> >> >> On Thu, Apr 21, 2011 at 10:55 AM, Gaurav Shingala < >> gaurav.shing...@hotmail.com> wrote: >> >> > >> > >> > >> > >> > Hi, >> > >> > Is there a way to create different solr indexes for different >> categories? >> > We have different document workspaces and ideally want each workspace >> to >> > have its own solr index. >> > >> > Thanks, >> > Gaurav >> > >> >> >> >> >> -- >> Chandan Tamrakar >> * >> * > -- View this message in context: http://lucene.472066.n3.nabble.com/Need-to-create-dyanamic-indexies-base-on-different-document-workspaces-tp2845919p2846371.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: how to abort a running optimize
Hi Stockii, how did you configure your segments number in solrconfig.xml? Decrease the number to speed things up. Regards, Em -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-abort-a-running-optimize-tp2838721p2846369.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Need to create dynamic indexes based on different document workspaces
Is it possible to create a solr core dynamically? In our case we want each workspace to have its own solr index. Thanks > From: chandan.tamra...@nepasoft.com > Date: Thu, 21 Apr 2011 11:57:53 +0545 > Subject: Re: Need to create dyanamic indexies base on different document > workspaces > To: solr-user@lucene.apache.org > > It depends on your application design how you want your index > > > There is a feature called solr core . http://wiki.apache.org/solr/CoreAdmin > You could still have a single index but a field to differentiate the items > in index > > thanks > > > On Thu, Apr 21, 2011 at 10:55 AM, Gaurav Shingala < > gaurav.shing...@hotmail.com> wrote: > > > > > > > > > > > Hi, > > > > Is there a way to create different solr indexes for different categories? > > We have different document workspaces and ideally want each workspace to > > have its own solr index. > > > > Thanks, > > Gaurav > > > > > > > -- > Chandan Tamrakar > * > *
RE: The issue of import data from database using Solr DIH
Not sure I understood you correctly: You expect that OS06Y stores *two* different performanceIds? One from table1 and the other from table2? I think this may be a problem. If both OS06Y-keys are equal, then you can use the syntax as mentioned in the wiki without any problems. You just have to rewrite your config to make the second entity a sub-entity and to add a WHERE-clause. If this is really not possible for you, just a guess: what happens if you remove the OS06Y-field from your second entity? Regards, Em -- View this message in context: http://lucene.472066.n3.nabble.com/The-issue-of-import-data-from-database-using-Solr-DIH-tp2845318p2846347.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: stemming filter analyzers, any favorites?
Hi Robert, we often ran into the same issue with stemmers. This is why we created more than one field, each field with different stemmers. It adds some overhead but worked quite well. Regarding your off-topic-question: Look at the debugging-output of your searches. Sometimes you configured your tools, especially the WDF, wrong and the queryParser creates an unexpected result which leads to unmatched but still relevant documents. Please, show us your debugging-output and the field-definition so that we can provide you some help! Regards, Em Robert Petersen-3 wrote: > > I have been doing that, and for Bags example the trailing 's' is not being > removed by the Kstemmer so if indexing the word bags and searching on bag > you get no matches. Why wouldn't the trailing 's' get stemmed off? > Kstemmer is dictionary based so bags isn't in the dictionary? That > trailing 's' should always be dropped no? That seems like it would be > better, we don't want to make synonyms for basic use cases like this. I > fear I will have to return to the Porter stemmer. Are there other better > ones is my main question. > > Off topic secondary question: sometimes I am puzzled by the output of the > analysis page. It seems like there should be a match, but I don't get the > results during a search that I'd expect... > > Like in the case if the WordDelimiterFilterFactory splits up a term into a > bunch of terms before the K-stemmer is applied, sometimes if the matching > term is in position two of the final analysis but the searcher had the > partial term just alone and so thereby in position 1 in the analysis stack > then when searching there wasn't a match. Am I reading this correctly? > Is that right or should that match and I am misreading my analysis output? > > Thanks! > > Robi > > PS I have a category named Bags and am catching flack for it not coming > up in a search for bag. 
hah > PPS the term is not in protwords.txt > > > com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory > {protected=protwords.txt} > term position 1 > term text bags > term type word > source start,end 0,4 > payload > > > -Original Message- > From: Erick Erickson [mailto:erickerick...@gmail.com] > Sent: Wednesday, April 20, 2011 10:55 AM > To: solr-user@lucene.apache.org > Subject: Re: stemming filter analyzers, any favorites? > > You can get a better sense of exactly what tranformations occur when > if you look at the analysis page (be sure to check the "verbose" > checkbox). > > I'm surprised that "bags" doesn't match "bag", what does the analysis > page say? > > Best > Erick > > On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen> wrote: >> Stemming filter analyzers... anyone have any favorites for particular >> search domains? Just wondering what people are using. I'm using Lucid >> K Stemmer and having issues. Seems like it misses a lot of common >> stems. We went to that because of excessively loose matches on the >> solr.PorterStemFilterFactory >> >> >> I understand K Stemmer is a dictionary based stemmer. Seems to me like >> it is missing a lot of common stem reductions. Ie Bags does not match >> Bag in our searches. 
>>
>> Here is my analyzer stack:
>>
>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory" protected="protwords.txt"/>
>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.SynonymFilterFactory" synonyms="query_synonyms.txt" ignoreCase="true" expand="true"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="com.lu
RE: The issue of import data from database using Solr DIH
Thanks Em. Yes, OS06Y is the uniqueKey. Table1 and Table2 are parallel in my example. In the URL http://wiki.apache.org/solr/DIHQuickStart#Index_data_from_multiple_tables_into_Solr the tables don't have a parallel relation in the example. I want to know whether Solr can implement this case: 1. Get data from database table1; 2. Get data from database table2; 3. Merge the fields of table1 and table2. The configuration of db-data-config.xml is the following: Because I don't want to get one id and its data from table1 and then get the data by id from table2; that may hit performance issues. -Original Message- From: Em [mailto:mailformailingli...@yahoo.de] Sent: Thursday, April 21, 2011 4:38 PM To: solr-user@lucene.apache.org Subject: Re: The issue of import data from database using Solr DIH Hi Kevin, I think you made OS06Y the uniqueKey, right? So, in entity 1 you specify values for it, but in entity 2 you do so as well. I am not absolutely sure about this, but: It seems like your two entities create two documents and the second will overwrite the first. Have a look at this page: http://wiki.apache.org/solr/DIHQuickStart#Index_data_from_multiple_tables_into_Solr I think it will help you in rewriting your queries to fit your usecase. Regards, Em -- View this message in context: http://lucene.472066.n3.nabble.com/The-issue-of-import-data-from-database-using-Solr-DIH-tp2845318p2846296.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: entity name issue
Hi Tjong, seems like your XML was invalid. Try the following and compare it to your original config: Regards, Em -- View this message in context: http://lucene.472066.n3.nabble.com/entity-name-issue-tp2843812p2846326.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to return score without using _val_
Hi, I agree with Yonik here - I do not understand what you would like to do either. But some additional note from my side: Your FQs never influence the score! Of course you can specify the same query twice, once as a filter query and once as a regular query, but I do not see the reason to do so. It sounds like unnecessary effort without a win. Regards, Em Bill Bell wrote: > > I would like to influence the score but I would rather not mess with the > q= > field since I want the query to dismax for Q. > > Something like: > > fq={!type=dismax qf=$qqf v=$qspec}& > fq={!type=dismax qt=dismaxname v=$qname}& > q=_val_:"{!type=dismax qf=$qqf v=$qspec}" _val_:"{!type=dismax > qt=dismaxname v=$qname}" > > Is there a way to do a filter and add the FQ to the score by doing it > another way? > > Also does this do multiple queries? Is this the right way to do it? > -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-return-score-without-using-val-tp2841443p2846317.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Apache Spam Filter Blocking Messages
This really helps on the mailing lists. If you send your mails with Thunderbird, be sure to enforce plain-text emails; if not, it will often send HTML mails. Regards, Em Marvin Humphrey wrote: > > On Thu, Apr 21, 2011 at 12:30:29AM -0400, Trey Grainger wrote: >> (FREEMAIL_FROM,FS_REPLICA,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,RFC_ABUSE_POST,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL > > Note the "HTML_MESSAGE" in the list of things SpamAssassin didn't like. > >> Apparently I sound like spam when I write perfectly good English and >> include >> some xml and a link to a jira ticket in my e-mail (I tried a couple >> different variations). Anyone know a way around this filter, or should I >> just respond to those involved in the e-mail chain directly and avoid the >> mailing list? > > Send plain text email instead of HTML. That solves the problem 99% of the > time. > > Marvin Humphrey > -- View this message in context: http://lucene.472066.n3.nabble.com/Apache-Spam-Filter-Blocking-Messages-tp2845854p2846304.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: The issue of import data from database using Solr DIH
Hi Kevin, I think you made OS06Y the uniqueKey, right? So, in entity 1 you specify values for it, but in entity 2 you do so as well. I am not absolutely sure about this, but: It seems like your two entities create two documents and the second will overwrite the first. Have a look at this page: http://wiki.apache.org/solr/DIHQuickStart#Index_data_from_multiple_tables_into_Solr I think it will help you in rewriting your queries to fit your usecase. Regards, Em -- View this message in context: http://lucene.472066.n3.nabble.com/The-issue-of-import-data-from-database-using-Solr-DIH-tp2845318p2846296.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr - upgrade from 1.4.1 to 3.1 - finding AbstractSolrTestCase binaries - help please?
There is a jar for the tests in Solr. I added this dependency in my pom.xml:

<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-core</artifactId>
  <version>3.1-SNAPSHOT</version>
  <classifier>tests</classifier>
  <scope>test</scope>
  <type>jar</type>
</dependency>

Ludovic. - Jouve France. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-upgrade-from-1-4-1-to-3-1-finding-AbstractSolrTestCase-binaries-help-please-tp2845011p2846223.html Sent from the Solr - User mailing list archive at Nabble.com.
PECL SOLR PHP extension, JSON output
Hello, I use the PECL PHP extension for SOLR. I want my output in JSON. This is not working: $query->set('wt', 'json'); How do I solve this problem? -- View this message in context: http://lucene.472066.n3.nabble.com/PECL-SOLR-PHP-extension-JSON-output-tp2846092p2846092.html Sent from the Solr - User mailing list archive at Nabble.com.