date:20110601

> Will it be slow if there are 3-5 million key/value rows?
AFAIK it shouldn't affect search time significantly as Solr caches it
in memory after you reloading Solr core / issuing commit.

But obviously you need more memory and commit/reload will take more time.

Re: Better Spellcheck

> I've tried to use a spellcheck dictionary built from my own content, but my
> content ends up having a lot of misspelled words so the spellcheck ends up
> being less than effective.
You can try to use sp.dictionary.threshold parameter to solve this problem
* http://wiki.apache.org/solr/SpellCheckerRequestHandler#sp.dictionary.threshold

> It also misses phrases. When someone
> searches for "Untied States" I would hope the spellcheck would suggest
> "United States" but it just recognizes that "untied" is a valid word and
> doesn't suggest any thing.
So you are saying about auto suggest component and not spellcheck
right? These are two different use cases.

If you want auto suggest and you have some search logs for your system
then you can probably use the following solution:
* 
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

If you don't have significant search logs history and want to populate
your auto suggest dictionary from index or some text file you should
check
* http://wiki.apache.org/solr/Suggester

Re: DIH render html entities

Maybe HTMLStripTransformer is what you are looking for.

* http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer

On Tue, May 31, 2011 at 5:35 PM, Erick Erickson  wrote:
> Convert them to what? Individual fields in your docs? Text?
>
> If the former, you might get some joy from the XpathEntityProcessor.
> If you want to just strip the markup and index all the content you
> might get some joy from the various *html* analyzers listed here:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>
> Best
> Erick
>
> On Fri, May 27, 2011 at 5:19 AM, anass talby  wrote:
>> Sorry my question was not clear.
>> when I get data from database, some field contains some html special chars,
>> and what i want to do is just convert them automatically.
>>
>> On Fri, May 27, 2011 at 1:00 PM, Gora Mohanty  wrote:
>>
>>> On Fri, May 27, 2011 at 3:50 PM, anass talby 
>>> wrote:
>>> > Is there any way to render html entities in DIH for a specific field?
>>> [...]
>>>
>>> This does not make too much sense: What do you mean by
>>> "rendering HTML entities". DIH just indexes, so where would
>>> it render HTML to, even if it could?
>>>
>>> Please take a look at http://wiki.apache.org/solr/UsingMailingLists
>>>
>>> Regards,
>>> Gora
>>>
>>
>>
>>
>> --
>>       Anass
>>
>

Re: Solr memory consumption

Hey Denis,

* How big is your index in terms of number of documents and index size?
* Is it production system where you have many search requests?
* Is there any pattern for OOM errors? I.e. right after you start your
Solr app, after some search activity or specific Solr queries, etc?
* What are 1) cache settings 2) facets and sort-by fields 3) commit
frequency and warmup queries?
etc

Generally you might want to connect to your jvm using jconsole tool
and monitor your heap usage (and other JVM/Solr numbers)

* http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html
* http://wiki.apache.org/solr/SolrJmx#Remote_Connection_to_Solr_JMX

HTH,
Alexey

2011/6/1 Denis Kuzmenok :
> There  were  no  parameters  at  all,  and java hitted "out of memory"
> almost  every day, then i tried to add parameters but nothing changed.
> Xms/Xmx  -  did  not solve the problem too. Now i try the MaxPermSize,
> because it's the last thing i didn't try yet :(
>
>
> Wednesday, June 1, 2011, 9:00:56 PM, you wrote:
>
>> Could be related to your crazy high MaxPermSize like Marcus said.
>
>> I'm no JVM tuning expert either. Few people are, it's confusing. So if
>> you don't understand it either, why are you trying to throw in very
>> non-standard parameters you don't understand?  Just start with whatever
>> the Solr example jetty has, and only change things if you have a reason
>> to (that you understand).
>
>> On 6/1/2011 1:19 PM, Denis Kuzmenok wrote:
>>> Overall  memory on server is 24G, and 24G of swap, mostly all the time
>>> swap  is  free and is not used at all, that's why "no free swap" sound
>>> strange to me..
>
>
>
>
>

RE: Newbie question: how to deal with different # of search results per page due to pagination then grouping

2011-06-01 Thread Robert Petersen

Yes that is exactly the issue... we're thinking just maybe always have a
next button and if you go too far you just get zero results.  User gets
what the user asks for, and so user could simply back up if desired to
where the facet still has values.  Could also detect an empty facet
results on the front end.  You can also only expand one facet only to
allow paging only the facet pane and not the whole page using an ajax
call.



-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: Wednesday, June 01, 2011 2:30 PM
To: solr-user@lucene.apache.org
Cc: Robert Petersen
Subject: Re: Newbie question: how to deal with different # of search
results per page due to pagination then grouping

How do you know whether to provide a 'next' button, or whether you are 
the end of your facet list?

On 6/1/2011 4:47 PM, Robert Petersen wrote:
> I think facet.offset allows facet paging nicely by letting you index
> into the list of facet values.  It is working for me...
>
> http://wiki.apache.org/solr/SimpleFacetParameters#facet.offset
>
>
> -Original Message-
> From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
> Sent: Wednesday, June 01, 2011 12:41 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Newbie question: how to deal with different # of search
> results per page due to pagination then grouping
>
> There's no great way to do that.
>
> One approach would be using facets, but that will just get you the
> author names (as stored in fields), and not the documents under it. If
> you really only want to show the author names, facets could work. One
> issue with facets though is Solr won't tell you the total number of
> facet values for your query, so it's tricky to provide next/prev
paging
> through them.
>
> There is also a 'field collapsing' feature that I think is not in a
> released Solr, but may be in the Solr repo. I'm not sure it will quite
> do what you want either though, although it's related and worth a
look.
> http://wiki.apache.org/solr/FieldCollapsing
>
> Another vaguely related thing that is also not yet in a released Solr,
> is a 'join' function. That could possibly be used to do what you want,
> although it'd be tricky too.
> https://issues.apache.org/jira/browse/SOLR-2272
>
> Jonathan
>
> On 6/1/2011 2:56 PM, beccax wrote:
>> Apologize if this question has already been raised.  I tried
searching
> but
>> couldn't find the relevant posts.
>>
>> We've indexed a bunch of documents by different authors.  Then for
> search
>> results, we'd like to show the authors that have 1 or more documents
>> matching the search keywords.
>>
>> The problem is right now our solr search method first paginates
> results to
>> 100 documents per page, then we take the results and group by
authors.
> This
>> results in different number of authors per page.  (Some authors may
> only
>> have one matching document and others 5 or 10.)
>>
>> How do we change it to somehow show the same number of authors (say
> 25) per
>> page?
>>
>> I mean alternatively we could just show all the documents themselves
> ordered
>> by author, but it's not the user experience we're looking for.
>>
>> Thanks so much.  And please let me know if you need more details not
>> provided here.
>> B
>>
>> --
>> View this message in context:
>
http://lucene.472066.n3.nabble.com/Newbie-question-how-to-deal-with-diff
>
erent-of-search-results-per-page-due-to-pagination-then-grouping-tp30121
> 68p3012168.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>

Re: Newbie question: how to deal with different # of search results per page due to pagination then grouping

How do you know whether to provide a 'next' button, or whether you are 
the end of your facet list?


On 6/1/2011 4:47 PM, Robert Petersen wrote:

I think facet.offset allows facet paging nicely by letting you index
into the list of facet values.  It is working for me...

http://wiki.apache.org/solr/SimpleFacetParameters#facet.offset


-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Wednesday, June 01, 2011 12:41 PM
To: solr-user@lucene.apache.org
Subject: Re: Newbie question: how to deal with different # of search
results per page due to pagination then grouping

There's no great way to do that.

One approach would be using facets, but that will just get you the
author names (as stored in fields), and not the documents under it. If
you really only want to show the author names, facets could work. One
issue with facets though is Solr won't tell you the total number of
facet values for your query, so it's tricky to provide next/prev paging
through them.

There is also a 'field collapsing' feature that I think is not in a
released Solr, but may be in the Solr repo. I'm not sure it will quite
do what you want either though, although it's related and worth a look.
http://wiki.apache.org/solr/FieldCollapsing

Another vaguely related thing that is also not yet in a released Solr,
is a 'join' function. That could possibly be used to do what you want,
although it'd be tricky too.
https://issues.apache.org/jira/browse/SOLR-2272

Jonathan

On 6/1/2011 2:56 PM, beccax wrote:

Apologize if this question has already been raised.  I tried searching

but

couldn't find the relevant posts.

We've indexed a bunch of documents by different authors.  Then for

search

results, we'd like to show the authors that have 1 or more documents
matching the search keywords.

The problem is right now our solr search method first paginates

results to

100 documents per page, then we take the results and group by authors.

This

results in different number of authors per page.  (Some authors may

only

have one matching document and others 5 or 10.)

How do we change it to somehow show the same number of authors (say

25) per

page?

I mean alternatively we could just show all the documents themselves

ordered

by author, but it's not the user experience we're looking for.

Thanks so much.  And please let me know if you need more details not
provided here.
B

--
View this message in context:

http://lucene.472066.n3.nabble.com/Newbie-question-how-to-deal-with-diff
erent-of-search-results-per-page-due-to-pagination-then-grouping-tp30121
68p3012168.html

Sent from the Solr - User mailing list archive at Nabble.com.

RE: Newbie question: how to deal with different # of search results per page due to pagination then grouping

2011-06-01 Thread Robert Petersen

I think facet.offset allows facet paging nicely by letting you index
into the list of facet values.  It is working for me...

http://wiki.apache.org/solr/SimpleFacetParameters#facet.offset

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: Wednesday, June 01, 2011 12:41 PM
To: solr-user@lucene.apache.org
Subject: Re: Newbie question: how to deal with different # of search
results per page due to pagination then grouping

There's no great way to do that.

One approach would be using facets, but that will just get you the 
author names (as stored in fields), and not the documents under it. If 
you really only want to show the author names, facets could work. One 
issue with facets though is Solr won't tell you the total number of 
facet values for your query, so it's tricky to provide next/prev paging 
through them.

There is also a 'field collapsing' feature that I think is not in a 
released Solr, but may be in the Solr repo. I'm not sure it will quite 
do what you want either though, although it's related and worth a look. 
http://wiki.apache.org/solr/FieldCollapsing

Another vaguely related thing that is also not yet in a released Solr, 
is a 'join' function. That could possibly be used to do what you want, 
although it'd be tricky too.
https://issues.apache.org/jira/browse/SOLR-2272

Jonathan

On 6/1/2011 2:56 PM, beccax wrote:
> Apologize if this question has already been raised.  I tried searching
but
> couldn't find the relevant posts.
>
> We've indexed a bunch of documents by different authors.  Then for
search
> results, we'd like to show the authors that have 1 or more documents
> matching the search keywords.
>
> The problem is right now our solr search method first paginates
results to
> 100 documents per page, then we take the results and group by authors.
This
> results in different number of authors per page.  (Some authors may
only
> have one matching document and others 5 or 10.)
>
> How do we change it to somehow show the same number of authors (say
25) per
> page?
>
> I mean alternatively we could just show all the documents themselves
ordered
> by author, but it's not the user experience we're looking for.
>
> Thanks so much.  And please let me know if you need more details not
> provided here.
> B
>
> --
> View this message in context:
http://lucene.472066.n3.nabble.com/Newbie-question-how-to-deal-with-diff
erent-of-search-results-per-page-due-to-pagination-then-grouping-tp30121
68p3012168.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

RE: Newbie question: how to deal with different # of search results per page due to pagination then grouping

2011-06-01 Thread Robert Petersen

Don't manually group by author from your results, the list will always
be incomplete...  use faceting instead to show the authors of the books
you have found in your search.

http://wiki.apache.org/solr/SolrFacetingOverview

-Original Message-
From: beccax [mailto:bec...@gmail.com] 
Sent: Wednesday, June 01, 2011 11:56 AM
To: solr-user@lucene.apache.org
Subject: Newbie question: how to deal with different # of search results
per page due to pagination then grouping

Apologize if this question has already been raised.  I tried searching
but
couldn't find the relevant posts.

We've indexed a bunch of documents by different authors.  Then for
search
results, we'd like to show the authors that have 1 or more documents
matching the search keywords.  

The problem is right now our solr search method first paginates results
to
100 documents per page, then we take the results and group by authors.
This
results in different number of authors per page.  (Some authors may only
have one matching document and others 5 or 10.)

How do we change it to somehow show the same number of authors (say 25)
per
page?

I mean alternatively we could just show all the documents themselves
ordered
by author, but it's not the user experience we're looking for.

Thanks so much.  And please let me know if you need more details not
provided here.
B

--
View this message in context:
http://lucene.472066.n3.nabble.com/Newbie-question-how-to-deal-with-diff
erent-of-search-results-per-page-due-to-pagination-then-grouping-tp30121
68p3012168.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Spellcheck Phrases

2011-06-01 Thread Dyer, James

Tanner,

I just entered SOLR-2571 to fix the float-parsing-bug that breaks 
"thresholdTokenFrequency".  Its just a 1-line code fix so I also included a 
patch that should cleanly apply to solr 3.1.  See 
https://issues.apache.org/jira/browse/SOLR-2571 for info and patches.

This parameter appears absent from the wiki.  And as it has always been broken 
for me, I haven't tested it.  However, my understanding it should be set as the 
minimum percentage of documents in which a term has to occur in order for it to 
appear in the spelling dictionary.  For instance in the config below, a term 
would have to occur in at least 1% of the documents for it to be part of the 
spelling dictionary.  This might be a good setting for long fields but for the 
short fields in my application, I was thinking of setting this to something 
like 1/1000 of 1% ...

 text

  spellchecker
  Spelling_Dictionary
  text
  ./spellchecker
  .01 

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-Original Message-
From: Tanner Postert [mailto:tanner.post...@gmail.com] 
Sent: Friday, May 27, 2011 6:04 PM
To: solr-user@lucene.apache.org
Subject: Re: Spellcheck Phrases

are there any updates on this? any third party apps that can make this work
as expected?

On Wed, Feb 23, 2011 at 12:38 PM, Dyer, James wrote:

> Tanner,
>
> Currently Solr will only make suggestions for words that are not in the
> dictionary, unless you specifiy "spellcheck.onlyMorePopular=true".  However,
> if you do that, then it will try to "improve" every word in your query, even
> the ones that are spelled correctly (so while it might change "brake" to
> "break" it might also change "leg" to "log".)
>
> You might be able to alleviate some of the pain by setting the
> "thresholdTokenFrequency" so as to remove misspelled and rarely-used words
> from your dictionary, although I personally haven't been able to get this
> parameter to work.  It also doesn't seem to be documented on the wiki but it
> is in the 1.4.1. source code, in class IndexBasedSpellChecker.  Its also
> mentioned in Smiley&Pugh's book.  I tried setting it like this, but got a
> ClassCastException on the float value:
>
> 
>  text_spelling
>  
>  spellchecker
>  Spelling_Dictionary
>  text_spelling
>  true
>  .001
>  
> 
>
> I have it on my to-do list to look into this further but haven't yet.  If
> you decide to try it and can get it to work, please let me know how you do
> it.
>
> James Dyer
> E-Commerce Systems
> Ingram Content Group
> (615) 213-4311
>
> -Original Message-
> From: Tanner Postert [mailto:tanner.post...@gmail.com]
> Sent: Wednesday, February 23, 2011 12:53 PM
> To: solr-user@lucene.apache.org
> Subject: Spellcheck Phrases
>
> right now when I search for 'brake a leg', solr returns valid results with
> no indication of misspelling, which is understandable since all of those
> terms are valid words and are probably found in a few pieces of our
> content.
> My question is:
>
> is there any way for it to recognize that the phase should be "break a leg"
> and not "brake a leg" and suggest the proper phrase?
>

Re: Change default scoring formula

2011-06-01 Thread ngaurav2005

Thanks Tomas. Well I am sorting results by a function query. I donot want
solr to do extra effort in calculating score for each document and eat up my
cpu cycles. Also, I need to use "if" condition in score calculation, which I
emulated through "map" function, but map function do not accept a function
as one of the values. This causes me to write my own scoring algorithm.

Can you help me with the steps or link to any post which explains step by
step overriding(DefaultSimilarity class) default sorting algorithm?

Thanks in advance.
Gaurav


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Change-default-scoring-formula-tp3012196p3012372.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Searching using a PDF

I'm not quite sure what you mean by "regular search". When
you index a PDF (Presumably through Tika or Solr Cell) the text
is indexed into your index and you can certainly search that. Additionally,
there may be meta data indexed in specific fields (e.g. author,
date modified, etc).

But what does "search based on a PDF file" mean in your context?

Best
Erick

On Wed, Jun 1, 2011 at 3:41 PM, Brian Lamb
 wrote:
> Is it possible to do a search based on a PDF file? I know its possible to
> update the index with a PDF but can you do just a regular search with it?
>
> Thanks,
>
> Brian Lamb
>

NRT facet search options comparison

2011-06-01 Thread Andy

Hi,

I need to provide NRT search with faceting. Been looking at the options out 
there. Wondered if anyone could clarify some questions I have and perhaps share 
your NRT experiences.

The various NRT options:

1) Solr
-Solr doesn't have NRT, yet. What is the expected time frame for NRT? Is it a 
few months or more like a year?
-How would Solr faceting work with NRT? My understanding is that faceting in 
Solr relies on caching, which doesn't go well with NRT updates. When NRT 
arrives, would facet performance take a huge drop when using with NRT because 
of this caching issue?

2) ElasticSearch
-ES supports NRT so that's great. Does anyone have experiences with ES that 
they could share? Does faceting work with NRT in ES? Any Solr features that are 
missing in ES?

3) Solr-RA
-Read in this list about Solr-RA, which has NRT support. Has anyone used it? 
Can you share your experiences?
-Again not sure if facet would work with Solr-RA NRT. Solr-RA is based on Solr, 
so faceting in Solr-RA relies on caching I suppose. Does NRT affect facet 
performance?

4) Zoie plugin for Solr
-Zoie is a NRT search library. I tried but couldn't get the Zoie plugin to work 
with Solr. Always got the error message of opening too many Searchers. Has 
anyone got this to work?

Any other options?

Thanks
Andy

Re: Limit data stored from fmap.content with Solr cell

If you can live with an across-the-board limit, you can set maxFieldLength
in your solrconfig.xml file. Note that this is in terms rather than
chars though...

Best
Erick

On Wed, Jun 1, 2011 at 2:22 PM, Greg Georges  wrote:
> Hello everyone,
>
> I have just gotten extracting information from files with Solr Cell. Some of 
> the files we are indexing are large, and have much content. I would like to 
> limit the amount of data I index to a specified limit of characters (example 
> 300 chars) which I will use as a document preview. Is this possible to set as 
> a parameter with the fmap.content param, of must I index it all and then do a 
> copyfield but just with a specified number of characters? Thanks in advance
>
> Greg
>

Re: best way to update custom fieldcache after index commit?

How are you implementing your custom cache? If you're defining
it in the solrconfig, couldn't you implement the regenerator? See:
http://wiki.apache.org/solr/SolrCaching#User.2BAC8-Generic_Caches

Best
Erick

On Wed, Jun 1, 2011 at 12:38 PM, oleole  wrote:
> Hi,
>
> We use solr and lucene fieldcache like this
> static DocTerms myfieldvalues =
> org.apache.lucene.search.FieldCache.DEFAULT.getTerms(reader, "myField");
> which is initialized at first use and will stay in memory for fast retrieval
> of field values based on DocID
>
> The problem is after an index/commit, the lucene fieldcache is reloaded in
> the new searcher, but this static list need to updated as well,  what is the
> best way to handle this? Basically we want to update those custom filedcache
> whenever there is a commit. The possible solution I can think of:
>
> 1) manually call an request handler to clean up those custom stuffs after
> commit, which is a hack and ugly.
> 2) use some listener event (not sure whether I can use newSearcher event
> listener in Solr); also there seems to be a lucene ticket (
> https://issues.apache.org/jira/browse/LUCENE-2474, Allow to plug in a Cache
> Eviction Listener to IndexReader to eagerly clean custom caches that use the
> IndexReader (getFieldCacheKey)), not clear to me how to use it though
>
> Any of your suggestion/comments is much appreciated. Thanks!
>
> oleole
>

Re: Debugging a Solr/Jetty Hung Process

First guess (and it really is just a guess) would be Java garbage 
collection taking over. There are some JVM parameters you can use to 
tune the GC process, especially if the machine is multi-core, making 
sure GC happens in a seperate thread is helpful.


But figuring out exactly what's going on requires confusing JVM 
debugging of which I am no expert at either.


On 6/1/2011 3:04 PM, Chris Cowan wrote:

About once a day a Solr/Jetty process gets hung on my server consuming 100% of 
one of the CPU's. Once this happens the server no longer responds to requests. 
I've looked through the logs to try and see if anything stands out but so far 
I've found nothing out of the ordinary.

My current remedy is to log in and just kill the single processes that's hung. 
Once that happens everything goes back to normal and I'm good for a day or so.  
I'm currently  the running following:

solr-jetty-1.4.0+ds1-1ubuntu1

which is comprised of

Solr 1.4.0
Jetty 6.1.22
on Unbuntu 10.10

I'm pretty new to managing a Jetty/Solr instance so at this point I'm just 
looking for advice on how I should go about trouble shooting this problem.

Chris

Searching using a PDF

2011-06-01 Thread Brian Lamb

Is it possible to do a search based on a PDF file? I know its possible to
update the index with a PDF but can you do just a regular search with it?

Thanks,

Brian Lamb

Re: Newbie question: how to deal with different # of search results per page due to pagination then grouping

There's no great way to do that.

One approach would be using facets, but that will just get you the
author names (as stored in fields), and not the documents under it. If
you really only want to show the author names, facets could work. One
issue with facets though is Solr won't tell you the total number of
facet values for your query, so it's tricky to provide next/prev paging
through them.

There is also a 'field collapsing' feature that I think is not in a
released Solr, but may be in the Solr repo. I'm not sure it will quite
do what you want either though, although it's related and worth a look.
http://wiki.apache.org/solr/FieldCollapsing

Another vaguely related thing that is also not yet in a released Solr,
is a 'join' function. That could possibly be used to do what you want,
although it'd be tricky too. https://issues.apache.org/jira/browse/SOLR-2272

Jonathan

On 6/1/2011 2:56 PM, beccax wrote:

Apologize if this question has already been raised. I tried searching but
couldn't find the relevant posts.

We've indexed a bunch of documents by different authors. Then for search
results, we'd like to show the authors that have 1 or more documents
matching the search keywords.

The problem is right now our solr search method first paginates results to
100 documents per page, then we take the results and group by authors. This
results in different number of authors per page. (Some authors may only
have one matching document and others 5 or 10.)

How do we change it to somehow show the same number of authors (say 25) per
page?

I mean alternatively we could just show all the documents themselves ordered
by author, but it's not the user experience we're looking for.

Thanks so much. And please let me know if you need more details not
provided here.
B

--
View this message in context:
http://lucene.472066.n3.nabble.com/Newbie-question-how-to-deal-with-different-of-search-results-per-page-due-to-pagination-then-grouping-tp3012168p3012168.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Change default scoring formula

2011-06-01 Thread Tomás Fernández Löbbe

Hi Gaurav, not sure what your use case is (and if no sorting at all is ever
required, is Solr / Lucene what you need?).
You can certainly sort by a field (or more) in descendant or ascendant order
by using the "sort" parameter.
You can customize the scoring algorithm by overriding the DefaultSimilarity
class, but first make sure that this is what you need, as most use cases can
be implemented with the default similarity plus queries / filter queries /
function queries, etc.
Regards,

Tomás
On Wed, Jun 1, 2011 at 4:02 PM, ngaurav2005  wrote:

> Hi All,
>
> I need to change the default scoring formula of solr. How shall I hack the
> code to do so?
> also, is there any way to stop solr to do its default scoring and sorting?
>
> Thanks,
> Gaurav
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Change-default-scoring-formula-tp3012196p3012196.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Edgengram

2011-06-01 Thread Brian Lamb

I think in my case LowerCaseTokenizerFactory will be sufficient because
there will never be spaces in this particular field. But thank you for the
useful link!

Thanks,

Brian Lamb

On Wed, Jun 1, 2011 at 11:44 AM, Erick Erickson wrote:

> Be a little careful here. LowerCaseTokenizerFactory is different than
> KeywordTokenizerFactory.
>
> LowerCaseTokenizerFactory will give you more than one term. e.g.
> the string "Intelligence can't be MeaSurEd" will give you 5 terms,
> any of which may match. i.e.
> "intelligence", "can", "t", "be", "measured".
> whereas KeywordTokenizerFactory followed, by, say LowerCaseFilter
> would give you exactly one token:
> "intelligence can't be measured".
>
> So searching for "measured" would get a hit in the first case but
> not in the second. Searching for "intellig*" would hit both.
>
> Neither is better, just make sure they do what you want!
>
> This page will help a lot:
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LowerCaseTokenizerFactory
> as will the admin/analysis page.
>
> Best
> Erick
>
> On Wed, Jun 1, 2011 at 10:43 AM, Brian Lamb
>  wrote:
> > Hi Tomás,
> >
> > Thank you very much for your suggestion. I took another crack at it using
> > your recommendation and it worked ideally. The only thing I had to change
> > was
> >
> > 
> >  
> > 
> >
> > to
> >
> > 
> >  
> > 
> >
> > The first did not produce any results but the second worked beautifully.
> >
> > Thanks!
> >
> > Brian Lamb
> >
> > 2011/5/31 Tomás Fernández Löbbe 
> >
> >> ...or also use the LowerCaseTokenizerFactory at query time for
> consistency,
> >> but not the edge ngram filter.
> >>
> >> 2011/5/31 Tomás Fernández Löbbe 
> >>
> >> > Hi Brian, I don't know if I understand what you are trying to achieve.
> >> You
> >> > want the term query "abcdefg" to have an idf of 1 insead of 7? I think
> >> using
> >> > the KeywordTokenizerFilterFactory at query time should work. I would
> be
> >> > something like:
> >> >
> >> >  >> > positionIncrementGap="1000">
> >> >   
> >> >
> >> > 
> >> >  >> > maxGramSize="25" side="front" />
> >> >   
> >> >   
> >> >   
> >> >   
> >> > 
> >> >
> >> > this way, at query time "abcdefg" won't be turned to "a ab abc abcd
> abcde
> >> > abcdef abcdefg". At index time it will.
> >> >
> >> > Regards,
> >> > Tomás
> >> >
> >> >
> >> > On Tue, May 31, 2011 at 1:07 PM, Brian Lamb <
> >> brian.l...@journalexperts.com
> >> > > wrote:
> >> >
> >> >>  >> >> positionIncrementGap="1000">
> >> >>   
> >> >> 
> >> >>  >> >> maxGramSize="25" side="front" />
> >> >>   
> >> >> 
> >> >>
> >> >> I believe I used that link when I initially set up the field and it
> >> worked
> >> >> great (and I'm still using it in other places). In this particular
> >> example
> >> >> however it does not appear to be practical for me. I mentioned that I
> >> have
> >> >> a
> >> >> similarity class that returns 1 for the idf and in the case of an
> >> >> edgengram,
> >> >> it returns 1 * length of the search string.
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Brian Lamb
> >> >>
> >> >> On Tue, May 31, 2011 at 11:34 AM, bmdakshinamur...@gmail.com <
> >> >> bmdakshinamur...@gmail.com> wrote:
> >> >>
> >> >> > Can you specify the analyzer you are using for your queries?
> >> >> >
> >> >> > May be you could use a KeywordAnalyzer for your queries so you
> don't
> >> end
> >> >> up
> >> >> > matching parts of your query.
> >> >> >
> >> >> >
> >> >>
> >>
> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
> >> >> > This should help you.
> >> >> >
> >> >> > On Tue, May 31, 2011 at 8:24 PM, Brian Lamb
> >> >> > wrote:
> >> >> >
> >> >> > > In this particular case, I will be doing a solr search based on
> user
> >> >> > > preferences. So I will not be depending on the user to type
> >> "abcdefg".
> >> >> > That
> >> >> > > will be automatically generated based on user selections.
> >> >> > >
> >> >> > > The contents of the field do not contain spaces and since I am
> >> created
> >> >> > the
> >> >> > > search parameters, case isn't important either.
> >> >> > >
> >> >> > > Thanks,
> >> >> > >
> >> >> > > Brian Lamb
> >> >> > >
> >> >> > > On Tue, May 31, 2011 at 9:44 AM, Erick Erickson <
> >> >> erickerick...@gmail.com
> >> >> > > >wrote:
> >> >> > >
> >> >> > > > That'll work for your case, although be aware that string types
> >> >> aren't
> >> >> > > > analyzed at all,
> >> >> > > > so case matters, as do spaces etc.
> >> >> > > >
> >> >> > > > What is the use-case here? If you explain it a bit there might
> be
> >> >> > > > better answers
> >> >> > > >
> >> >> > > > Best
> >> >> > > > Erick
> >> >> > > >
> >> >> > > > On Fri, May 27, 2011 at 9:17 AM, Brian Lamb
> >> >> > > >  wrote:
> >> >> > > > > For this, I ended up just changing it to string and using
> >> >> "abcdefg*"
> >> >> > to
> >> >> > > > > match. That seems to work so far.
> >> >> > > > >
> >> >> > > > > Thanks,
> >> >> > > > >
> >> >> > > > > Brian Lamb
> >>

Re: Debugging a Solr/Jetty Hung Process

2011-06-01 Thread Chris Cowan

Sorry ... I just found it. I will try that next time. I have a feeling it wont 
work since the server usually stops accepting connections.

Chris

On Jun 1, 2011, at 12:12 PM, Chris Cowan wrote:

> I'm pretty green... is that something I  can do while the event is happening 
> or is there something I need to configure to capture the dump ahead of time. 
> 
> I've tried to reproduce the problem by putting the server under load but that 
> doesn't seem to be the issue.
> 
> Chris
> 
> On Jun 1, 2011, at 12:06 PM, Bill Au wrote:
> 
>> Taking a thread dump will take you what's going.
>> 
>> Bill
>> 
>> On Wed, Jun 1, 2011 at 3:04 PM, Chris Cowan 
>> wrote:
>> 
>>> About once a day a Solr/Jetty process gets hung on my server consuming 100%
>>> of one of the CPU's. Once this happens the server no longer responds to
>>> requests. I've looked through the logs to try and see if anything stands out
>>> but so far I've found nothing out of the ordinary.
>>> 
>>> My current remedy is to log in and just kill the single processes that's
>>> hung. Once that happens everything goes back to normal and I'm good for a
>>> day or so.  I'm currently  the running following:
>>> 
>>> solr-jetty-1.4.0+ds1-1ubuntu1
>>> 
>>> which is comprised of
>>> 
>>> Solr 1.4.0
>>> Jetty 6.1.22
>>> on Unbuntu 10.10
>>> 
>>> I'm pretty new to managing a Jetty/Solr instance so at this point I'm just
>>> looking for advice on how I should go about trouble shooting this problem.
>>> 
>>> Chris
>

Re: Debugging a Solr/Jetty Hung Process

2011-06-01 Thread Chris Cowan

I'm pretty green... is that something I  can do while the event is happening or 
is there something I need to configure to capture the dump ahead of time. 

I've tried to reproduce the problem by putting the server under load but that 
doesn't seem to be the issue.

Chris

On Jun 1, 2011, at 12:06 PM, Bill Au wrote:

> Taking a thread dump will take you what's going.
> 
> Bill
> 
> On Wed, Jun 1, 2011 at 3:04 PM, Chris Cowan 
> wrote:
> 
>> About once a day a Solr/Jetty process gets hung on my server consuming 100%
>> of one of the CPU's. Once this happens the server no longer responds to
>> requests. I've looked through the logs to try and see if anything stands out
>> but so far I've found nothing out of the ordinary.
>> 
>> My current remedy is to log in and just kill the single processes that's
>> hung. Once that happens everything goes back to normal and I'm good for a
>> day or so.  I'm currently  the running following:
>> 
>> solr-jetty-1.4.0+ds1-1ubuntu1
>> 
>> which is comprised of
>> 
>> Solr 1.4.0
>> Jetty 6.1.22
>> on Unbuntu 10.10
>> 
>> I'm pretty new to managing a Jetty/Solr instance so at this point I'm just
>> looking for advice on how I should go about trouble shooting this problem.
>> 
>> Chris

Re: CLOSE_WAIT after connecting to multiple shards from a primary shard

2011-06-01 Thread Mukunda Madhava

Hi Otis,
Sending to solr-user mailing list.

We see this CLOSE_WAIT connections even when i do a simple http request via
curl, that is, even when i do a simple curl using a primary and secondary
shard query, like for e.g.

curl "
http://primaryshardhost:8180/solr/core0/select?q=*%3A*&shards=secondaryshardhost1:8090/solr/appgroup1_11053000_11053100
"

While fetching data it is in ESTABLISHED state

-sh-3.2$ netstat | grep ESTABLISHED | grep 8090
tcp0  0 primaryshardhost:36805 secondaryshardhost1:8090
ESTABLISHED

After the request has come back, it is in CLOSE_WAIT state

-sh-3.2$ netstat | grep CLOSE_WAIT | grep 8090
tcp1  0 primaryshardhost:36805 secondaryshardhost1:8090
CLOSE_WAIT

why does Solr keep the connection to the shards in CLOSE_WAIT?

Is this a feature of Solr? If we modify an OS property (I dont know how) to
cleanup the CLOSE_WAITs will it cause an issue with subsequent searches?

Can someone help me please?

thanks,
Mukunda

On Mon, May 30, 2011 at 5:59 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Hi,
>
> A few things:
> 1) why not send this to the Solr list?
> 2) you talk about searching, but the code sample is about optimizing the
> index.
>
> 3) I don't have SolrJ API in front of me, but isn't there is
> CommonsSolrServe
> ctor that takes in a URL instead of HttpClient instance?  Try that one.
>
> Otis
> -
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> - Original Message 
> > From: Mukunda Madhava 
> > To: gene...@lucene.apache.org
> > Sent: Mon, May 30, 2011 1:54:07 PM
> > Subject: CLOSE_WAIT after connecting to multiple shards from a primary
> shard
> >
> > Hi,
> > We are having a "primary" Solr shard, and multiple "secondary" shards.
>  We
> > query data from the secondary shards by specifying the "shards" param in
>  the
> > query params.
> >
> > But we found that after recieving the data, there  are large number of
> > CLOSE_WAIT on the secondary shards from the primary  shards.
> >
> > Like for e.g.
> >
> > tcp1   0 primaryshardhost:56109  secondaryshardhost1:8090
> > CLOSE_WAIT
> > tcp1   0 primaryshardhost:51049  secondaryshardhost1:8090
> > CLOSE_WAIT
> > tcp1   0 primaryshardhost:49537  secondaryshardhost1:8089
> > CLOSE_WAIT
> > tcp1   0 primaryshardhost:44109  secondaryshardhost2:8090
> > CLOSE_WAIT
> > tcp1   0 primaryshardhost:32041  secondaryshardhost2:8090
> > CLOSE_WAIT
> > tcp1   0 primaryshardhost:48533  secondaryshardhost2:8089
> > CLOSE_WAIT
> >
> >
> > We open the Solr connections  as below..
> >
> > SimpleHttpConnectionManager cm =  new
> > SimpleHttpConnectionManager(true);
> >  cm.closeIdleConnections(0L);
> > HttpClient  httpClient = new HttpClient(cm);
> > solrServer = new  CommonsHttpSolrServer(url,httpClient);
> >  solrServer.optimize();
> >
> > But still we see these issues. Any ideas?
> > --
> > Thanks,
> > Mukunda
> >
>



-- 
Thanks,
Mukunda

Re: Debugging a Solr/Jetty Hung Process

2011-06-01 Thread Bill Au

Taking a thread dump will take you what's going.

Bill

On Wed, Jun 1, 2011 at 3:04 PM, Chris Cowan wrote:

> About once a day a Solr/Jetty process gets hung on my server consuming 100%
> of one of the CPU's. Once this happens the server no longer responds to
> requests. I've looked through the logs to try and see if anything stands out
> but so far I've found nothing out of the ordinary.
>
> My current remedy is to log in and just kill the single processes that's
> hung. Once that happens everything goes back to normal and I'm good for a
> day or so.  I'm currently  the running following:
>
> solr-jetty-1.4.0+ds1-1ubuntu1
>
> which is comprised of
>
> Solr 1.4.0
> Jetty 6.1.22
> on Unbuntu 10.10
>
> I'm pretty new to managing a Jetty/Solr instance so at this point I'm just
> looking for advice on how I should go about trouble shooting this problem.
>
> Chris

Debugging a Solr/Jetty Hung Process

2011-06-01 Thread Chris Cowan

About once a day a Solr/Jetty process gets hung on my server consuming 100% of 
one of the CPU's. Once this happens the server no longer responds to requests. 
I've looked through the logs to try and see if anything stands out but so far 
I've found nothing out of the ordinary. 

My current remedy is to log in and just kill the single processes that's hung. 
Once that happens everything goes back to normal and I'm good for a day or so.  
I'm currently  the running following:

solr-jetty-1.4.0+ds1-1ubuntu1

which is comprised of

Solr 1.4.0
Jetty 6.1.22
on Unbuntu 10.10

I'm pretty new to managing a Jetty/Solr instance so at this point I'm just 
looking for advice on how I should go about trouble shooting this problem.

Chris

Change default scoring formula

2011-06-01 Thread ngaurav2005

Hi All,

I need to change the default scoring formula of solr. How shall I hack the
code to do so?
also, is there any way to stop solr to do its default scoring and sorting?

Thanks,
Gaurav

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Change-default-scoring-formula-tp3012196p3012196.html
Sent from the Solr - User mailing list archive at Nabble.com.

Newbie question: how to deal with different # of search results per page due to pagination then grouping

2011-06-01 Thread beccax

Apologize if this question has already been raised.  I tried searching but
couldn't find the relevant posts.

We've indexed a bunch of documents by different authors.  Then for search
results, we'd like to show the authors that have 1 or more documents
matching the search keywords.  

The problem is right now our solr search method first paginates results to
100 documents per page, then we take the results and group by authors.  This
results in different number of authors per page.  (Some authors may only
have one matching document and others 5 or 10.)

How do we change it to somehow show the same number of authors (say 25) per
page?

I mean alternatively we could just show all the documents themselves ordered
by author, but it's not the user experience we're looking for.

Thanks so much.  And please let me know if you need more details not
provided here.
B

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Newbie-question-how-to-deal-with-different-of-search-results-per-page-due-to-pagination-then-grouping-tp3012168p3012168.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr memory consumption

There  were  no  parameters  at  all,  and java hitted "out of memory"
almost  every day, then i tried to add parameters but nothing changed.
Xms/Xmx  -  did  not solve the problem too. Now i try the MaxPermSize,
because it's the last thing i didn't try yet :(


Wednesday, June 1, 2011, 9:00:56 PM, you wrote:

> Could be related to your crazy high MaxPermSize like Marcus said.

> I'm no JVM tuning expert either. Few people are, it's confusing. So if
> you don't understand it either, why are you trying to throw in very 
> non-standard parameters you don't understand?  Just start with whatever
> the Solr example jetty has, and only change things if you have a reason
> to (that you understand).

> On 6/1/2011 1:19 PM, Denis Kuzmenok wrote:
>> Overall  memory on server is 24G, and 24G of swap, mostly all the time
>> swap  is  free and is not used at all, that's why "no free swap" sound
>> strange to me..

Limit data stored from fmap.content with Solr cell

2011-06-01 Thread Greg Georges

Hello everyone,

I have just gotten extracting information from files with Solr Cell. Some of 
the files we are indexing are large, and have much content. I would like to 
limit the amount of data I index to a specified limit of characters (example 
300 chars) which I will use as a document preview. Is this possible to set as a 
parameter with the fmap.content param, of must I index it all and then do a 
copyfield but just with a specified number of characters? Thanks in advance

Greg

Re: Solr memory consumption


Could be related to your crazy high MaxPermSize like Marcus said.

I'm no JVM tuning expert either. Few people are, it's confusing. So if 
you don't understand it either, why are you trying to throw in very 
non-standard parameters you don't understand?  Just start with whatever 
the Solr example jetty has, and only change things if you have a reason 
to (that you understand).


On 6/1/2011 1:19 PM, Denis Kuzmenok wrote:

Overall  memory on server is 24G, and 24G of swap, mostly all the time
swap  is  free and is not used at all, that's why "no free swap" sound
strange to me..



There is no simple answer.
All I can say is you don't usually want to use an Xmx that's more than
you actually have available RAM, and _can't_ use more than you have
available ram+swap, and the Java error seems to be suggesting you are
using more than is available in ram+swap. That may not be what's going
on, JVM memory issues are indeed confusing.
Why don't you start smaller, and see what happens.  But if you end up
needing more RAM for your Solr than you have available on the server,
then you're just going to need more RAM.
You may have to learn something about java/jvm to do memory tuning for
Solr. Or, just start with the default parameters from the Solr example
jetty, and if you don't run into any problems, then great.  Starting
with the example jetty shipped with Solr would be the easiest way to get
started for someone who doesn't know much about Java/JVM.
On 6/1/2011 12:37 PM, Denis Kuzmenok wrote:

So what should i do to evoid that error?
I can use 10G on server, now i try to run with flags:
java -Xms6G -Xmx6G -XX:MaxPermSize=1G -XX:PermSize=512M -D64

Or should i set xmx to lower numbers and what about other params?
Sorry, i don't know much about java/jvm =(



Wednesday, June 1, 2011, 7:29:50 PM, you wrote:


Are you in fact out of swap space, as the java error suggested?
The way JVM's work always, if you tell it -Xmx6g, it WILL use all 6g
eventually.  The JVM doesn't Garbage Collect until it's going to run out
of heap space, until it gets to your Xmx.  It will keep using RAM until
it reaches your Xmx.
If your Xmx is set so high you don't have enough RAM available, that
will be a problem, you don't want to set Xmx like this. Ideally you
don't even want to swap, but normally the OS will swap to give you
enough RAM if neccesary -- if you don't have swap space for it to do
that, to give the JVM the 6g you've configured it to take well, that
seems to be what the Java error message is telling you. Of course
sometimes error messages are misleading.
But yes, if you set Xmx to 6G, the process WILL use all 6G eventually.
This is just how the JVM works.

Re: Solr memory consumption

2011-06-01 Thread Markus Jelsma

PermSize and MaxPermSize don't need to be higher than 64M.  You should read on 
JVM tuning. The permanent generation is only used for the code that's being 
executed. 

> So what should i do to evoid that error?
> I can use 10G on server, now i try to run with flags:
> java -Xms6G -Xmx6G -XX:MaxPermSize=1G -XX:PermSize=512M -D64
> 
> Or should i set xmx to lower numbers and what about other params?
> Sorry, i don't know much about java/jvm =(
> 
> Wednesday, June 1, 2011, 7:29:50 PM, you wrote:
> > Are you in fact out of swap space, as the java error suggested?
> > 
> > The way JVM's work always, if you tell it -Xmx6g, it WILL use all 6g
> > eventually.  The JVM doesn't Garbage Collect until it's going to run out
> > of heap space, until it gets to your Xmx.  It will keep using RAM until
> > it reaches your Xmx.
> > 
> > If your Xmx is set so high you don't have enough RAM available, that
> > will be a problem, you don't want to set Xmx like this. Ideally you
> > don't even want to swap, but normally the OS will swap to give you
> > enough RAM if neccesary -- if you don't have swap space for it to do
> > that, to give the JVM the 6g you've configured it to take well, that
> > seems to be what the Java error message is telling you. Of course
> > sometimes error messages are misleading.
> > 
> > But yes, if you set Xmx to 6G, the process WILL use all 6G eventually.
> > This is just how the JVM works.

Re: Solr memory consumption

Overall  memory on server is 24G, and 24G of swap, mostly all the time
swap  is  free and is not used at all, that's why "no free swap" sound
strange to me..


> There is no simple answer.

> All I can say is you don't usually want to use an Xmx that's more than
> you actually have available RAM, and _can't_ use more than you have 
> available ram+swap, and the Java error seems to be suggesting you are 
> using more than is available in ram+swap. That may not be what's going
> on, JVM memory issues are indeed confusing.

> Why don't you start smaller, and see what happens.  But if you end up 
> needing more RAM for your Solr than you have available on the server, 
> then you're just going to need more RAM.

> You may have to learn something about java/jvm to do memory tuning for
> Solr. Or, just start with the default parameters from the Solr example
> jetty, and if you don't run into any problems, then great.  Starting 
> with the example jetty shipped with Solr would be the easiest way to get
> started for someone who doesn't know much about Java/JVM.

> On 6/1/2011 12:37 PM, Denis Kuzmenok wrote:
>> So what should i do to evoid that error?
>> I can use 10G on server, now i try to run with flags:
>> java -Xms6G -Xmx6G -XX:MaxPermSize=1G -XX:PermSize=512M -D64
>>
>> Or should i set xmx to lower numbers and what about other params?
>> Sorry, i don't know much about java/jvm =(
>>
>>
>>
>> Wednesday, June 1, 2011, 7:29:50 PM, you wrote:
>>
>>> Are you in fact out of swap space, as the java error suggested?
>>> The way JVM's work always, if you tell it -Xmx6g, it WILL use all 6g
>>> eventually.  The JVM doesn't Garbage Collect until it's going to run out
>>> of heap space, until it gets to your Xmx.  It will keep using RAM until
>>> it reaches your Xmx.
>>> If your Xmx is set so high you don't have enough RAM available, that
>>> will be a problem, you don't want to set Xmx like this. Ideally you
>>> don't even want to swap, but normally the OS will swap to give you
>>> enough RAM if neccesary -- if you don't have swap space for it to do
>>> that, to give the JVM the 6g you've configured it to take well, that
>>> seems to be what the Java error message is telling you. Of course
>>> sometimes error messages are misleading.
>>> But yes, if you set Xmx to 6G, the process WILL use all 6G eventually.
>>> This is just how the JVM works.
>>
>>

Re: Solr memory consumption


There is no simple answer.

All I can say is you don't usually want to use an Xmx that's more than 
you actually have available RAM, and _can't_ use more than you have 
available ram+swap, and the Java error seems to be suggesting you are 
using more than is available in ram+swap. That may not be what's going 
on, JVM memory issues are indeed confusing.


Why don't you start smaller, and see what happens.  But if you end up 
needing more RAM for your Solr than you have available on the server, 
then you're just going to need more RAM.


You may have to learn something about java/jvm to do memory tuning for 
Solr. Or, just start with the default parameters from the Solr example 
jetty, and if you don't run into any problems, then great.  Starting 
with the example jetty shipped with Solr would be the easiest way to get 
started for someone who doesn't know much about Java/JVM.


On 6/1/2011 12:37 PM, Denis Kuzmenok wrote:

So what should i do to evoid that error?
I can use 10G on server, now i try to run with flags:
java -Xms6G -Xmx6G -XX:MaxPermSize=1G -XX:PermSize=512M -D64

Or should i set xmx to lower numbers and what about other params?
Sorry, i don't know much about java/jvm =(



Wednesday, June 1, 2011, 7:29:50 PM, you wrote:


Are you in fact out of swap space, as the java error suggested?
The way JVM's work always, if you tell it -Xmx6g, it WILL use all 6g
eventually.  The JVM doesn't Garbage Collect until it's going to run out
of heap space, until it gets to your Xmx.  It will keep using RAM until
it reaches your Xmx.
If your Xmx is set so high you don't have enough RAM available, that
will be a problem, you don't want to set Xmx like this. Ideally you
don't even want to swap, but normally the OS will swap to give you
enough RAM if neccesary -- if you don't have swap space for it to do
that, to give the JVM the 6g you've configured it to take well, that
seems to be what the Java error message is telling you. Of course
sometimes error messages are misleading.
But yes, if you set Xmx to 6G, the process WILL use all 6G eventually.
This is just how the JVM works.

best way to update custom fieldcache after index commit?

2011-06-01 Thread oleole

Hi,

We use solr and lucene fieldcache like this
static DocTerms myfieldvalues =
org.apache.lucene.search.FieldCache.DEFAULT.getTerms(reader, "myField");
which is initialized at first use and will stay in memory for fast retrieval
of field values based on DocID

The problem is after an index/commit, the lucene fieldcache is reloaded in
the new searcher, but this static list need to updated as well,  what is the
best way to handle this? Basically we want to update those custom filedcache
whenever there is a commit. The possible solution I can think of:

1) manually call an request handler to clean up those custom stuffs after
commit, which is a hack and ugly.
2) use some listener event (not sure whether I can use newSearcher event
listener in Solr); also there seems to be a lucene ticket (
https://issues.apache.org/jira/browse/LUCENE-2474, Allow to plug in a Cache
Eviction Listener to IndexReader to eagerly clean custom caches that use the
IndexReader (getFieldCacheKey)), not clear to me how to use it though

Any of your suggestion/comments is much appreciated. Thanks!

oleole

Re: Solr memory consumption

So what should i do to evoid that error?
I can use 10G on server, now i try to run with flags:
java -Xms6G -Xmx6G -XX:MaxPermSize=1G -XX:PermSize=512M -D64

Or should i set xmx to lower numbers and what about other params?
Sorry, i don't know much about java/jvm =(



Wednesday, June 1, 2011, 7:29:50 PM, you wrote:

> Are you in fact out of swap space, as the java error suggested?

> The way JVM's work always, if you tell it -Xmx6g, it WILL use all 6g 
> eventually.  The JVM doesn't Garbage Collect until it's going to run out
> of heap space, until it gets to your Xmx.  It will keep using RAM until
> it reaches your Xmx.

> If your Xmx is set so high you don't have enough RAM available, that 
> will be a problem, you don't want to set Xmx like this. Ideally you 
> don't even want to swap, but normally the OS will swap to give you 
> enough RAM if neccesary -- if you don't have swap space for it to do 
> that, to give the JVM the 6g you've configured it to take well, that
> seems to be what the Java error message is telling you. Of course 
> sometimes error messages are misleading.

> But yes, if you set Xmx to 6G, the process WILL use all 6G eventually.
> This is just how the JVM works.

Re: Solr vs ElasticSearch

You _could_ configure it as a slave, if you plan to sometimes use it as 
a slave.  It can be configured as both a master and a slave. You can 
configure it as a slave, but turn off automatic polling.  And then issue 
one-off replicate commands whenever you want.


But yeah, it gets messy, your use case is definitely not what 
ReplicationHandler is expecting, definitely some Java improvements would 
be nice, agreed.


On 6/1/2011 12:20 PM, Upayavira wrote:

On Wed, 01 Jun 2011 11:47 -0400, "Jonathan Rochkind"
wrote:

On 6/1/2011 11:26 AM, Upayavira wrote:

Probably the ReplicationHandler would need a 'one-off' replication
command...

It's got one already, if you mean a command you can issue to a slave to
tell it to pull replication right now.  The thing is, you can only issue
this command if the core is configured as a slave.  You can turn off
polling though.

You can include a custom masterURL in the one-off pull command, which
over-rides whatever masterURL is configured in the core --- but you
still need a masterURL configured in the core, or Solr will complain on
startup if the core is configured as slave without a masterURL. (And if
it's not configured as a slave, you can't issue the one-off pull
command).

Right, but this wouldn't be a slave - so I'd want to wire the
destination core so that it can accept a 'pull request' without being
correctly configured. Stuff to look at.

Upayavira

Re: Solr memory consumption


Are you in fact out of swap space, as the java error suggested?

The way JVM's work always, if you tell it -Xmx6g, it WILL use all 6g 
eventually.  The JVM doesn't Garbage Collect until it's going to run out 
of heap space, until it gets to your Xmx.  It will keep using RAM until 
it reaches your Xmx.


If your Xmx is set so high you don't have enough RAM available, that 
will be a problem, you don't want to set Xmx like this. Ideally you 
don't even want to swap, but normally the OS will swap to give you 
enough RAM if neccesary -- if you don't have swap space for it to do 
that, to give the JVM the 6g you've configured it to take well, that 
seems to be what the Java error message is telling you. Of course 
sometimes error messages are misleading.


But yes, if you set Xmx to 6G, the process WILL use all 6G eventually.  
This is just how the JVM works.


On 6/1/2011 12:15 PM, Denis Kuzmenok wrote:

Here  is output after about 24 hours running solr. Maybe there is some
way to limit memory consumption? :(


test@d6 ~/solr/example $ java -Xms3g-Xmx6g-D64
-Dsolr.solr.home=/home/test/solr/example/multicore/ -jar start.jar
2011-05-31 17:05:14.265:INFO::Logging to STDERR via org.mortbay.log.StdErrLog
2011-05-31 17:05:14.355:INFO::jetty-6.1-SNAPSHOT
2011-05-31 17:05:16.447:INFO::Started SocketConnector@0.0.0.0:4900
#
# A fatal error has been detected by the Java Runtime Environment:
#
# java.lang.OutOfMemoryError: requested 32744 bytes for ChunkPool::allocate. 
Out of swap space?
#
#  Internal Error (allocation.cpp:117), pid=17485, tid=1090320704
#  Error: ChunkPool::allocate
#
# JRE version: 6.0_17-b17
# Java VM: OpenJDK 64-Bit Server VM (14.0-b16 mixed mode linux-amd64 )
# Derivative: IcedTea6 1.7.5
# Distribution: Custom build (Wed Oct 13 13:04:40 EDT 2010)
# An error report file with more information is saved as:
# /mnt/data/solr/example/hs_err_pid17485.log
#
# If you would like to submit a bug report, please include
# instructions how to reproduce the bug and visit:
#   http://icedtea.classpath.org/bugzilla
#
Aborted



I  run  multiple-core  solr with flags: -Xms3g -Xmx6g -D64, but i see
this in top after 6-8 hours and still raising:
17485  test214 10.0g 7.4g 9760 S 308.2 31.3 448:00.75 java
-Xms3g -Xmx6g -D64
-Dsolr.solr.home=/home/test/solr/example/multicore/ -jar start.jar

Are there any ways to limit memory for sure?
Thanks

Re: K-Stemmer for Solr 3.1

2011-06-01 Thread Mark


Thanks. Ill have to create a Jira account to vote i guess.

We are already using KStemmer in 1.4.2 production and I would like to 
upgrade to 3.1. In the meantime, what is another stemmer I could use out 
of the box that would be have similar to KStemmer?


Thanks

On 5/28/11 10:02 AM, Steven A Rowe wrote:

Hi Mark,

Yonik Seeley indicated on LUCENE-152 that he is considering contributing 
Lucid's KStemmer version to Lucene:



You can vote on the issue to communicate your interest.

Steve


-Original Message-
From: Mark [mailto:static.void@gmail.com]
Sent: Friday, May 27, 2011 7:31 PM
To: solr-user@lucene.apache.org
Subject: Re: K-Stemmer for Solr 3.1

Where can one find the KStemmer source for 4.0?

On 5/12/11 11:28 PM, Bernd Fehling wrote:

I backported a Lucid KStemmer version from solr 4.0 which I found
somewhere.
Just changed from
import org.apache.lucene.analysis.util.CharArraySet;  // solr4.0
to
import org.apache.lucene.analysis.CharArraySet;  // solr3.1

Bernd


Am 12.05.2011 16:32, schrieb Mark:

java.lang.AbstractMethodError:
org.apache.lucene.analysis.TokenStream.incrementToken()Z

Would you mind explaining your modifications? Thanks

On 5/11/11 11:14 PM, Bernd Fehling wrote:

Am 12.05.2011 02:05, schrieb Mark:

It appears that the older version of the Lucid Works KStemmer is
incompatible with Solr 3.1. Has anyone been able to get this to
work? If not,
what are you using as an alternative?

Thanks

Lucid KStemmer works nice with Solr3.1 after some minor mods to
KStemFilter.java and KStemFilterFactory.java.
What problems do you have?

Bernd

Re: Index vs. Query Time Aware Filters

2011-06-01 Thread Mike Schultz

I should have explained that the queryMode parameter is for our own custom
filter.  So the result is that we have 8 filters in our field definition. 
All the filter parameters (30 or so) of the query time and index time are
identical EXCEPT for our one custom filter which needs to know if it's in
query time or index time mode.  If we could determine inside our custom code
whether we're indexing or querying, then we could omit the query time
definition entirely and save about 50 lines of configuration and be much
less error prone.

One possible solution would be if we could get at the SolrCore from within a
filter.  Then at init time we could iterate through the filter chains and
determine when we find a factory == this.  (I've done this in other places
where it's useful to know the name of a ValueSourceParser for example)

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Index-vs-Query-Time-Aware-Filters-tp3009450p3011556.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr vs ElasticSearch

2011-06-01 Thread Upayavira

On Wed, 01 Jun 2011 11:47 -0400, "Jonathan Rochkind" 
wrote:
> On 6/1/2011 11:26 AM, Upayavira wrote:
> >
> > Probably the ReplicationHandler would need a 'one-off' replication
> > command...
> 
> It's got one already, if you mean a command you can issue to a slave to 
> tell it to pull replication right now.  The thing is, you can only issue 
> this command if the core is configured as a slave.  You can turn off 
> polling though.
> 
> You can include a custom masterURL in the one-off pull command, which 
> over-rides whatever masterURL is configured in the core --- but you 
> still need a masterURL configured in the core, or Solr will complain on 
> startup if the core is configured as slave without a masterURL. (And if 
> it's not configured as a slave, you can't issue the one-off pull
> command).

Right, but this wouldn't be a slave - so I'd want to wire the
destination core so that it can accept a 'pull request' without being
correctly configured. Stuff to look at.

Upayavira

Re: Solr memory consumption

Here  is output after about 24 hours running solr. Maybe there is some
way to limit memory consumption? :(


test@d6 ~/solr/example $ java -Xms3g-Xmx6g-D64
-Dsolr.solr.home=/home/test/solr/example/multicore/ -jar start.jar
2011-05-31 17:05:14.265:INFO::Logging to STDERR via org.mortbay.log.StdErrLog
2011-05-31 17:05:14.355:INFO::jetty-6.1-SNAPSHOT
2011-05-31 17:05:16.447:INFO::Started SocketConnector@0.0.0.0:4900
#
# A fatal error has been detected by the Java Runtime Environment:
#
# java.lang.OutOfMemoryError: requested 32744 bytes for ChunkPool::allocate. 
Out of swap space?
#
#  Internal Error (allocation.cpp:117), pid=17485, tid=1090320704
#  Error: ChunkPool::allocate
#
# JRE version: 6.0_17-b17
# Java VM: OpenJDK 64-Bit Server VM (14.0-b16 mixed mode linux-amd64 )
# Derivative: IcedTea6 1.7.5
# Distribution: Custom build (Wed Oct 13 13:04:40 EDT 2010)
# An error report file with more information is saved as:
# /mnt/data/solr/example/hs_err_pid17485.log
#
# If you would like to submit a bug report, please include
# instructions how to reproduce the bug and visit:
#   http://icedtea.classpath.org/bugzilla
#
Aborted


> I  run  multiple-core  solr with flags: -Xms3g -Xmx6g -D64, but i see
> this in top after 6-8 hours and still raising:

> 17485  test214 10.0g 7.4g 9760 S 308.2 31.3 448:00.75 java
> -Xms3g -Xmx6g -D64
> -Dsolr.solr.home=/home/test/solr/example/multicore/ -jar start.jar
>   
> Are there any ways to limit memory for sure?

> Thanks

Re: What's your query result cache's stats?

I believe you need SOME query cache even with low hit counts, for things 
like a user paging through results. You want the query to still be in 
the cache when they go to the next page or what have you. Other 
operations like this may depend on the query cache too for good 
performance.


So even with a low hit rate, you still want enough query cache that all 
the "current" queries, all the queries someone is in the middle of doing 
something with and may do more with can stay in the cache. (what things 
those are can depend on your particular client interface).   So the 
cache hit count may not actually be a good guide to sizing your query 
cache.


Correct me if I'm wrong, but this is what I've been thinking.

On 6/1/2011 12:03 PM, Shawn Heisey wrote:

On 5/31/2011 3:02 PM, Markus Jelsma wrote:

Hi,

I've seen the stats page many times, of quite a few installations and 
even
more servers. There's one issue that keeps bothering me: the 
cumulative hit

ratio of the query result cache, it's almost never higher than 50%.

What are your stats? How do you deal with it?


Below are my stats.

I will be lowering my warmcounts dramatically when I respin for 3.1.  
The 28 second warm time is too high for me.  I don't think it's going 
to make a lot of difference in performance.  I think most of the 
warming benefit is realized after the first few queries.


queryResultCache:
Concurrent LRU Cache(maxSize=1024, initialSize=1024, minSize=921, 
acceptableSize=972, cleanupThread=true, autowarmCount=64, 
regenerator=org.apache.solr.search.SolrIndexSearcher$3@60c0c8b5)


lookups : 932
hits : 528
hitratio : 0.56
inserts : 403
evictions : 0
size : 449
warmupTime : 28198
cumulative_lookups : 980357
cumulative_hits : 622726
cumulative_hitratio : 0.63
cumulative_inserts : 369692
cumulative_evictions : 83711

lookups : 68543
hits : 57286
hitratio : 0.83
inserts : 11357
evictions : 0
size : 11357
warmupTime : 0
cumulative_lookups : 219118491
cumulative_hits : 179119106
cumulative_hitratio : 0.81
cumulative_inserts : 3385
cumulative_evictions : 32833254


documentCache:
LRU Cache(maxSize=16384, initialSize=4096)

lookups : 68543
hits : 57286
hitratio : 0.83
inserts : 11357
evictions : 0
size : 11357
warmupTime : 0
cumulative_lookups : 219118491
cumulative_hits : 179119106
cumulative_hitratio : 0.81
cumulative_inserts : 3385
cumulative_evictions : 32833254


filterCache:
LRU Cache(maxSize=512, initialSize=512, autowarmCount=32, 
regenerator=org.apache.solr.search.SolrIndexSearcher$2@6910b640)


lookups : 859
hits : 464
hitratio : 0.54
inserts : 465
evictions : 0
size : 464
warmupTime : 27747
cumulative_lookups : 682600
cumulative_hits : 355130
cumulative_hitratio : 0.52
cumulative_inserts : 327479
cumulative_evictions : 161624

Re: What's your query result cache's stats?

2011-06-01 Thread Shawn Heisey


On 5/31/2011 3:02 PM, Markus Jelsma wrote:

Hi,

I've seen the stats page many times, of quite a few installations and even
more servers. There's one issue that keeps bothering me: the cumulative hit
ratio of the query result cache, it's almost never higher than 50%.

What are your stats? How do you deal with it?


Below are my stats.

I will be lowering my warmcounts dramatically when I respin for 3.1.  
The 28 second warm time is too high for me.  I don't think it's going to 
make a lot of difference in performance.  I think most of the warming 
benefit is realized after the first few queries.


queryResultCache:
Concurrent LRU Cache(maxSize=1024, initialSize=1024, minSize=921, 
acceptableSize=972, cleanupThread=true, autowarmCount=64, 
regenerator=org.apache.solr.search.SolrIndexSearcher$3@60c0c8b5)


lookups : 932
hits : 528
hitratio : 0.56
inserts : 403
evictions : 0
size : 449
warmupTime : 28198
cumulative_lookups : 980357
cumulative_hits : 622726
cumulative_hitratio : 0.63
cumulative_inserts : 369692
cumulative_evictions : 83711

lookups : 68543
hits : 57286
hitratio : 0.83
inserts : 11357
evictions : 0
size : 11357
warmupTime : 0
cumulative_lookups : 219118491
cumulative_hits : 179119106
cumulative_hitratio : 0.81
cumulative_inserts : 3385
cumulative_evictions : 32833254


documentCache:
LRU Cache(maxSize=16384, initialSize=4096)

lookups : 68543
hits : 57286
hitratio : 0.83
inserts : 11357
evictions : 0
size : 11357
warmupTime : 0
cumulative_lookups : 219118491
cumulative_hits : 179119106
cumulative_hitratio : 0.81
cumulative_inserts : 3385
cumulative_evictions : 32833254


filterCache:
LRU Cache(maxSize=512, initialSize=512, autowarmCount=32, 
regenerator=org.apache.solr.search.SolrIndexSearcher$2@6910b640)


lookups : 859
hits : 464
hitratio : 0.54
inserts : 465
evictions : 0
size : 464
warmupTime : 27747
cumulative_lookups : 682600
cumulative_hits : 355130
cumulative_hitratio : 0.52
cumulative_inserts : 327479
cumulative_evictions : 161624

Re: Solr vs ElasticSearch

2011-06-01 Thread Jason Rutherglen

Jonathan,

This is all true, however it ends up being hacky (this is from
experience) and the core on the source needs to be deleted.  Feel free
to post to the issue.

Jason

On Wed, Jun 1, 2011 at 8:44 AM, Jonathan Rochkind  wrote:
> On 6/1/2011 10:52 AM, Jason Rutherglen wrote:
>>
>> nightmarish to setup. The problem is, it freezes each core into a
>> respective role, so if I wanted to then 'move' the slave, I can't
>> because it's still setup as a slave.
>
> Don't know if this helps or not, but you CAN set up a core as both a master
> and a slave. Normally this is to make it a "repeater", still always taking
> from the same upstream and sending downstream. But there might be a way to
> hack it for your needs without actually changing Java code, a core _can_ be
> both a master and slave simultaneously, and there might be a way to change
> it's masterURL (where it pulls from when acting as a slave) without
> restarting the core too.  You can supply a 'custom' (not configured)
> masterURL in a manual 'pull' command (over HTTP), but of course usually
> slaves poll rather than be directed by manual 'pull' commands.
>
>

Re: Solr vs ElasticSearch


On 6/1/2011 11:26 AM, Upayavira wrote:


Probably the ReplicationHandler would need a 'one-off' replication
command...


It's got one already, if you mean a command you can issue to a slave to 
tell it to pull replication right now.  The thing is, you can only issue 
this command if the core is configured as a slave.  You can turn off 
polling though.


You can include a custom masterURL in the one-off pull command, which 
over-rides whatever masterURL is configured in the core --- but you 
still need a masterURL configured in the core, or Solr will complain on 
startup if the core is configured as slave without a masterURL. (And if 
it's not configured as a slave, you can't issue the one-off pull command).


This is all from my experience on 1.4, don't know if things change in 
3.1, probably not.

Re: Solr vs ElasticSearch


On 6/1/2011 10:52 AM, Jason Rutherglen wrote:

nightmarish to setup. The problem is, it freezes each core into a
respective role, so if I wanted to then 'move' the slave, I can't
because it's still setup as a slave.


Don't know if this helps or not, but you CAN set up a core as both a 
master and a slave. Normally this is to make it a "repeater", still 
always taking from the same upstream and sending downstream. But there 
might be a way to hack it for your needs without actually changing Java 
code, a core _can_ be both a master and slave simultaneously, and there 
might be a way to change it's masterURL (where it pulls from when acting 
as a slave) without restarting the core too.  You can supply a 'custom' 
(not configured) masterURL in a manual 'pull' command (over HTTP), but 
of course usually slaves poll rather than be directed by manual 'pull' 
commands.

Re: Edgengram

Be a little careful here. LowerCaseTokenizerFactory is different than
KeywordTokenizerFactory.

LowerCaseTokenizerFactory will give you more than one term. e.g.
the string "Intelligence can't be MeaSurEd" will give you 5 terms,
any of which may match. i.e.
"intelligence", "can", "t", "be", "measured".
whereas KeywordTokenizerFactory followed, by, say LowerCaseFilter
would give you exactly one token:
"intelligence can't be measured".

So searching for "measured" would get a hit in the first case but
not in the second. Searching for "intellig*" would hit both.

Neither is better, just make sure they do what you want!

This page will help a lot:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LowerCaseTokenizerFactory
as will the admin/analysis page.

Best
Erick

On Wed, Jun 1, 2011 at 10:43 AM, Brian Lamb
 wrote:
> Hi Tomás,
>
> Thank you very much for your suggestion. I took another crack at it using
> your recommendation and it worked ideally. The only thing I had to change
> was
>
> 
>  
> 
>
> to
>
> 
>  
> 
>
> The first did not produce any results but the second worked beautifully.
>
> Thanks!
>
> Brian Lamb
>
> 2011/5/31 Tomás Fernández Löbbe 
>
>> ...or also use the LowerCaseTokenizerFactory at query time for consistency,
>> but not the edge ngram filter.
>>
>> 2011/5/31 Tomás Fernández Löbbe 
>>
>> > Hi Brian, I don't know if I understand what you are trying to achieve.
>> You
>> > want the term query "abcdefg" to have an idf of 1 insead of 7? I think
>> using
>> > the KeywordTokenizerFilterFactory at query time should work. I would be
>> > something like:
>> >
>> > > > positionIncrementGap="1000">
>> >   
>> >
>> >     
>> >     > > maxGramSize="25" side="front" />
>> >   
>> >   
>> >   
>> >   
>> > 
>> >
>> > this way, at query time "abcdefg" won't be turned to "a ab abc abcd abcde
>> > abcdef abcdefg". At index time it will.
>> >
>> > Regards,
>> > Tomás
>> >
>> >
>> > On Tue, May 31, 2011 at 1:07 PM, Brian Lamb <
>> brian.l...@journalexperts.com
>> > > wrote:
>> >
>> >> > >> positionIncrementGap="1000">
>> >>   
>> >>     
>> >>     > >> maxGramSize="25" side="front" />
>> >>   
>> >> 
>> >>
>> >> I believe I used that link when I initially set up the field and it
>> worked
>> >> great (and I'm still using it in other places). In this particular
>> example
>> >> however it does not appear to be practical for me. I mentioned that I
>> have
>> >> a
>> >> similarity class that returns 1 for the idf and in the case of an
>> >> edgengram,
>> >> it returns 1 * length of the search string.
>> >>
>> >> Thanks,
>> >>
>> >> Brian Lamb
>> >>
>> >> On Tue, May 31, 2011 at 11:34 AM, bmdakshinamur...@gmail.com <
>> >> bmdakshinamur...@gmail.com> wrote:
>> >>
>> >> > Can you specify the analyzer you are using for your queries?
>> >> >
>> >> > May be you could use a KeywordAnalyzer for your queries so you don't
>> end
>> >> up
>> >> > matching parts of your query.
>> >> >
>> >> >
>> >>
>> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
>> >> > This should help you.
>> >> >
>> >> > On Tue, May 31, 2011 at 8:24 PM, Brian Lamb
>> >> > wrote:
>> >> >
>> >> > > In this particular case, I will be doing a solr search based on user
>> >> > > preferences. So I will not be depending on the user to type
>> "abcdefg".
>> >> > That
>> >> > > will be automatically generated based on user selections.
>> >> > >
>> >> > > The contents of the field do not contain spaces and since I am
>> created
>> >> > the
>> >> > > search parameters, case isn't important either.
>> >> > >
>> >> > > Thanks,
>> >> > >
>> >> > > Brian Lamb
>> >> > >
>> >> > > On Tue, May 31, 2011 at 9:44 AM, Erick Erickson <
>> >> erickerick...@gmail.com
>> >> > > >wrote:
>> >> > >
>> >> > > > That'll work for your case, although be aware that string types
>> >> aren't
>> >> > > > analyzed at all,
>> >> > > > so case matters, as do spaces etc.
>> >> > > >
>> >> > > > What is the use-case here? If you explain it a bit there might be
>> >> > > > better answers
>> >> > > >
>> >> > > > Best
>> >> > > > Erick
>> >> > > >
>> >> > > > On Fri, May 27, 2011 at 9:17 AM, Brian Lamb
>> >> > > >  wrote:
>> >> > > > > For this, I ended up just changing it to string and using
>> >> "abcdefg*"
>> >> > to
>> >> > > > > match. That seems to work so far.
>> >> > > > >
>> >> > > > > Thanks,
>> >> > > > >
>> >> > > > > Brian Lamb
>> >> > > > >
>> >> > > > > On Wed, May 25, 2011 at 4:53 PM, Brian Lamb
>> >> > > > > wrote:
>> >> > > > >
>> >> > > > >> Hi all,
>> >> > > > >>
>> >> > > > >> I'm running into some confusion with the way edgengram works. I
>> >> have
>> >> > > the
>> >> > > > >> field set up as:
>> >> > > > >>
>> >> > > > >> > >> > > > >> positionIncrementGap="1000">
>> >> > > > >>    
>> >> > > > >>      
>> >> > > > >>        > >> minGramSize="1"
>> >> > > > >> maxGramSize="100" side="front" />
>> >> > > > >>    
>> >> > > > >> 
>> >> > > > >>
>> >> > > > >> I've also set up my own similarity c

Re: Solr vs ElasticSearch

2011-06-01 Thread Jason Rutherglen

> And some way to delete the core when it has been transferred.

Right, I manually added that to CoreAdminHandler.  I opened an issue
to try to solve this problem: SOLR-2569

On Wed, Jun 1, 2011 at 8:26 AM, Upayavira  wrote:
>
>
> On Wed, 01 Jun 2011 07:52 -0700, "Jason Rutherglen"
>  wrote:
>> > I'm likely to try playing with moving cores between hosts soon. In
>> > theory it shouldn't be hard. We'll see what the practice is like!
>>
>> Right, in theory it's quite simple, in practice I've setup a master,
>> then a slave, then had to add replication to both, then call create
>> core, then replicate, then unload core on the master.  It's
>> nightmarish to setup.  The problem is, it freezes each core into a
>> respective role, so if I wanted to then 'move' the slave, I can't
>> because it's still setup as a slave.
>
> Yep, I'm expecting it to require some changes to both the
> CoreAdminHandler and the ReplicationHandler.
>
> Probably the ReplicationHandler would need a 'one-off' replication
> command. And some way to delete the core when it has been transferred.
>
> Upayavira
>
>> On Wed, Jun 1, 2011 at 4:14 AM, Upayavira  wrote:
>> >
>> >
>> > On Tue, 31 May 2011 19:38 -0700, "Jason Rutherglen"
>> >  wrote:
>> >> Mark,
>> >>
>> >> Nice email address.  I personally have no idea, maybe ask Shay Banon
>> >> to post an answer?  I think it's possible to make Solr more elastic,
>> >> eg, it's currently difficult to make it move cores between servers
>> >> without a lot of manual labor.
>> >
>> > I'm likely to try playing with moving cores between hosts soon. In
>> > theory it shouldn't be hard. We'll see what the practice is like!
>> >
>> > Upayavira
>> > ---
>> > Enterprise Search Consultant at Sourcesense UK,
>> > Making Sense of Open Source
>> >
>> >
>>
> ---
> Enterprise Search Consultant at Sourcesense UK,
> Making Sense of Open Source
>
>

Re: Solr vs ElasticSearch

2011-06-01 Thread Upayavira



On Wed, 01 Jun 2011 07:52 -0700, "Jason Rutherglen"
 wrote:
> > I'm likely to try playing with moving cores between hosts soon. In
> > theory it shouldn't be hard. We'll see what the practice is like!
> 
> Right, in theory it's quite simple, in practice I've setup a master,
> then a slave, then had to add replication to both, then call create
> core, then replicate, then unload core on the master.  It's
> nightmarish to setup.  The problem is, it freezes each core into a
> respective role, so if I wanted to then 'move' the slave, I can't
> because it's still setup as a slave.

Yep, I'm expecting it to require some changes to both the
CoreAdminHandler and the ReplicationHandler.

Probably the ReplicationHandler would need a 'one-off' replication
command. And some way to delete the core when it has been transferred.

Upayavira
 
> On Wed, Jun 1, 2011 at 4:14 AM, Upayavira  wrote:
> >
> >
> > On Tue, 31 May 2011 19:38 -0700, "Jason Rutherglen"
> >  wrote:
> >> Mark,
> >>
> >> Nice email address.  I personally have no idea, maybe ask Shay Banon
> >> to post an answer?  I think it's possible to make Solr more elastic,
> >> eg, it's currently difficult to make it move cores between servers
> >> without a lot of manual labor.
> >
> > I'm likely to try playing with moving cores between hosts soon. In
> > theory it shouldn't be hard. We'll see what the practice is like!
> >
> > Upayavira
> > ---
> > Enterprise Search Consultant at Sourcesense UK,
> > Making Sense of Open Source
> >
> >
> 
--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source

Re: Solr vs ElasticSearch

2011-06-01 Thread Jason Rutherglen

> I'm likely to try playing with moving cores between hosts soon. In
> theory it shouldn't be hard. We'll see what the practice is like!

Right, in theory it's quite simple, in practice I've setup a master,
then a slave, then had to add replication to both, then call create
core, then replicate, then unload core on the master.  It's
nightmarish to setup.  The problem is, it freezes each core into a
respective role, so if I wanted to then 'move' the slave, I can't
because it's still setup as a slave.

On Wed, Jun 1, 2011 at 4:14 AM, Upayavira  wrote:
>
>
> On Tue, 31 May 2011 19:38 -0700, "Jason Rutherglen"
>  wrote:
>> Mark,
>>
>> Nice email address.  I personally have no idea, maybe ask Shay Banon
>> to post an answer?  I think it's possible to make Solr more elastic,
>> eg, it's currently difficult to make it move cores between servers
>> without a lot of manual labor.
>
> I'm likely to try playing with moving cores between hosts soon. In
> theory it shouldn't be hard. We'll see what the practice is like!
>
> Upayavira
> ---
> Enterprise Search Consultant at Sourcesense UK,
> Making Sense of Open Source
>
>

Re: Synonyms valid only in specific categories of data

2011-06-01 Thread Spyros Kapnissis

Yes that would probably be a lot of fields.. I guess a way would be to extend 
the SynonymFilter and change the format of the synonyms.txt file to take the 
categories into account. 

Thanks again for your answer.

From: lee carroll 
To: solr-user@lucene.apache.org
Sent: Wednesday, June 1, 2011 12:23 PM
Subject: Re: Synonyms valid only in specific categories of data

I don't think you can assign a synonyms file dynamically to a field.
you would need to create multiple fields for each lang / cat phrases
and have their own synonyms file referenced for each field. that would
be a lot of fields.

On 1 June 2011 09:59, Spyros Kapnissis  wrote:
> Hello to all,
>
>
> I have a collection of text phrases in more than 20 languages that I'm 
> indexing
> in solr. Each phrase belongs to one of about 30 different phrase categories. I
> have specified different fields for each language and added a synonym filter 
> at
> query time. I would however like the synonym filter to take into account the
> category as well. So, a specific synonym should be valid and used only in one 
> or
> more categories per language. (the category is indexed in another field).
>
> Is this somehow possible in the current SynonymFilterFactory implementation?
>
> Hope it makes sense.
>
> Thank you,
> Spyros
>

Re: Edgengram

2011-06-01 Thread Brian Lamb

Hi Tomás,

Thank you very much for your suggestion. I took another crack at it using
your recommendation and it worked ideally. The only thing I had to change
was


  


to


  


The first did not produce any results but the second worked beautifully.

Thanks!

Brian Lamb

2011/5/31 Tomás Fernández Löbbe 

> ...or also use the LowerCaseTokenizerFactory at query time for consistency,
> but not the edge ngram filter.
>
> 2011/5/31 Tomás Fernández Löbbe 
>
> > Hi Brian, I don't know if I understand what you are trying to achieve.
> You
> > want the term query "abcdefg" to have an idf of 1 insead of 7? I think
> using
> > the KeywordTokenizerFilterFactory at query time should work. I would be
> > something like:
> >
> >  > positionIncrementGap="1000">
> >   
> >
> > 
> >  > maxGramSize="25" side="front" />
> >   
> >   
> >   
> >   
> > 
> >
> > this way, at query time "abcdefg" won't be turned to "a ab abc abcd abcde
> > abcdef abcdefg". At index time it will.
> >
> > Regards,
> > Tomás
> >
> >
> > On Tue, May 31, 2011 at 1:07 PM, Brian Lamb <
> brian.l...@journalexperts.com
> > > wrote:
> >
> >>  >> positionIncrementGap="1000">
> >>   
> >> 
> >>  >> maxGramSize="25" side="front" />
> >>   
> >> 
> >>
> >> I believe I used that link when I initially set up the field and it
> worked
> >> great (and I'm still using it in other places). In this particular
> example
> >> however it does not appear to be practical for me. I mentioned that I
> have
> >> a
> >> similarity class that returns 1 for the idf and in the case of an
> >> edgengram,
> >> it returns 1 * length of the search string.
> >>
> >> Thanks,
> >>
> >> Brian Lamb
> >>
> >> On Tue, May 31, 2011 at 11:34 AM, bmdakshinamur...@gmail.com <
> >> bmdakshinamur...@gmail.com> wrote:
> >>
> >> > Can you specify the analyzer you are using for your queries?
> >> >
> >> > May be you could use a KeywordAnalyzer for your queries so you don't
> end
> >> up
> >> > matching parts of your query.
> >> >
> >> >
> >>
> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
> >> > This should help you.
> >> >
> >> > On Tue, May 31, 2011 at 8:24 PM, Brian Lamb
> >> > wrote:
> >> >
> >> > > In this particular case, I will be doing a solr search based on user
> >> > > preferences. So I will not be depending on the user to type
> "abcdefg".
> >> > That
> >> > > will be automatically generated based on user selections.
> >> > >
> >> > > The contents of the field do not contain spaces and since I am
> created
> >> > the
> >> > > search parameters, case isn't important either.
> >> > >
> >> > > Thanks,
> >> > >
> >> > > Brian Lamb
> >> > >
> >> > > On Tue, May 31, 2011 at 9:44 AM, Erick Erickson <
> >> erickerick...@gmail.com
> >> > > >wrote:
> >> > >
> >> > > > That'll work for your case, although be aware that string types
> >> aren't
> >> > > > analyzed at all,
> >> > > > so case matters, as do spaces etc.
> >> > > >
> >> > > > What is the use-case here? If you explain it a bit there might be
> >> > > > better answers
> >> > > >
> >> > > > Best
> >> > > > Erick
> >> > > >
> >> > > > On Fri, May 27, 2011 at 9:17 AM, Brian Lamb
> >> > > >  wrote:
> >> > > > > For this, I ended up just changing it to string and using
> >> "abcdefg*"
> >> > to
> >> > > > > match. That seems to work so far.
> >> > > > >
> >> > > > > Thanks,
> >> > > > >
> >> > > > > Brian Lamb
> >> > > > >
> >> > > > > On Wed, May 25, 2011 at 4:53 PM, Brian Lamb
> >> > > > > wrote:
> >> > > > >
> >> > > > >> Hi all,
> >> > > > >>
> >> > > > >> I'm running into some confusion with the way edgengram works. I
> >> have
> >> > > the
> >> > > > >> field set up as:
> >> > > > >>
> >> > > > >>  >> > > > >> positionIncrementGap="1000">
> >> > > > >>
> >> > > > >>  
> >> > > > >> >> minGramSize="1"
> >> > > > >> maxGramSize="100" side="front" />
> >> > > > >>
> >> > > > >> 
> >> > > > >>
> >> > > > >> I've also set up my own similarity class that returns 1 as the
> >> idf
> >> > > > score.
> >> > > > >> What I've found this does is if I match a string "abcdefg"
> >> against a
> >> > > > field
> >> > > > >> containing "abcdefghijklmnop", then the idf will score that as
> a
> >> 7:
> >> > > > >>
> >> > > > >> 7.0 = idf(myfield: a=51 ab=23 abc=2 abcd=2 abcde=2 abcdef=2
> >> > abcdefg=2)
> >> > > > >>
> >> > > > >> I get why that's happening, but is there a way to avoid that?
> Do
> >> I
> >> > > need
> >> > > > to
> >> > > > >> do a new field type to achieve the desired affect?
> >> > > > >>
> >> > > > >> Thanks,
> >> > > > >>
> >> > > > >> Brian Lamb
> >> > > > >>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > Thanks and Regards,
> >> > DakshinaMurthy BM
> >> >
> >>
> >
> >
>

Re: collapse component with pivot faceting

You might have more luck going the other way, applying the
field collapsing patch to trunk. This is currently being worked
on, see:
https://issues.apache.org/jira/browse/SOLR-2564

Best
Erick

On Wed, Jun 1, 2011 at 12:22 AM, Isha Garg  wrote:
> Hi,
>          Actually  currently I am using solr version 3.0 . I applied the
> field collapsing patch of solr . The field collapsing work fine with
> collapse.facet=after for any facet.field but when I  try to use  facet.pivot
> query after collapse.facet=after it does nt show any results. Also pivot
> faceting feature is not present in solr 3.0.
> So which pivot faceting patch should I use with solr 3.0,solr 4.0 support
> the pivot faceting but it does not have field collapsing feature.Can anyone
> guide me regarding which Solr version support both field collapsing and
> pivot faceting .
>
>
> Thanks in Advance!
> Isha Garg
>
>
>
>
> On Tuesday 31 May 2011 07:39 PM, Erick Erickson wrote:
>>
>> Please provide a more detailed request. This is so general that it's hard
>> to
>> respond. What is the use-case you're trying to understand/implement?
>>
>> You might review:
>> http://wiki.apache.org/solr/UsingMailingLists
>>
>> Best
>> Erick
>>
>> On Mon, May 30, 2011 at 4:31 AM, Isha Garg  wrote:
>>
>>>
>>> Hi All!
>>>
>>>         Can anyone tell me how pivot faceting works in combination with
>>> field collapsing.?
>>> Please guide me in this respect.
>>>
>>>
>>> Thanks!
>>> Isha Garg
>>>
>>>
>
>

Re: Problem with caps and star symbol

Take a look here:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

I think you want generateWordParts=1, catenateWords=1 and preserveOriginal=1,
but check it out with the admin/analysis page.

Oh, and your index-time and query-time patterns for WDFF will probably
be different, see
the example schema.

Best
Erick

On Wed, Jun 1, 2011 at 7:40 AM, Saumitra Chowdhury
 wrote:
> Thanks for your point. I was really tripping that issue. But Now I need a
> bit help more.
> As far I have noticed that in the case of a value like "*role_delete*" ,
> WordDelimiterFilterFactory
> index two words like "*role"* and "*delete"* and in both search result with
> the term "*role*" and "*delete*" will
> include that document.
>
> Now In the case of the value like "*role_delete*" I want to index all four
> terms like [ *role_delete, roledelete, role, delete ].*
> In total both the original and processed word by WordDelimiterFilterFactory
> will be indexed.
>
> Is it possible ?? Does any additional filter with WordDelimiterFilterFactory
>  can do that ?? Or
> any filter can do such like operation ??
>
> On Tue, May 31, 2011 at 8:07 PM, Erick Erickson 
> wrote:
>
>> I think you're tripping over the issue that wildcards aren't analyzed, they
>> don't go through your analysis chain. So the casing matters. Try
>> lowercasing
>> the input and I believe you'll see more like what you expect...
>>
>> Best
>> Erick
>>
>> On Mon, May 30, 2011 at 12:07 AM, Saumitra Chowdhury
>>  wrote:
>> > I am sending some xml to understand the scenario.
>> > Indexed term = ROLE_DELETE
>> > Search Term = roledelete
>> > 
>> > 
>> > 0
>> > 4
>> > 
>> > on
>> > 0
>> > name : roledelete
>> > 2.2
>> > 10
>> > 
>> > 
>> > 
>> >
>> > Indexed term = ROLE_DELETE
>> > Search Term = role
>> > 
>> > 
>> > 0
>> > 5
>> > 
>> > on
>> > 0
>> > name : role
>> > 2.2
>> > 10
>> > 
>> > 
>> > 
>> > 
>> > Mon May 30 13:09:14 BDST 2011
>> > Global Role for Deletion
>> > role:9223372036854775802
>> > Mon May 30 13:09:14 BDST 2011
>> > ROLE_DELETE
>> > 
>> > 
>> > 
>> > 
>> > Mon May 30 13:09:14 BDST 2011
>> > Global Role for Deletion
>> > role:9223372036854775802
>> > Mon May 30 13:09:14 BDST 2011
>> > ROLE_DELETE
>> > 
>> > 
>> > 
>> >
>> >
>> > Indexed term = ROLE_DELETE
>> > Search Term = role*
>> > 
>> > 
>> > 0
>> > 4
>> > 
>> > on
>> > 0
>> > name : role*
>> > 2.2
>> > 10
>> > 
>> > 
>> > 
>> > 
>> > Mon May 30 13:09:14 BDST 2011
>> > Global Role for Deletion
>> > role:9223372036854775802
>> > Mon May 30 13:09:14 BDST 2011
>> > ROLE_DELETE
>> > 
>> > 
>> > 
>> >
>> >
>> > Indexed term = ROLE_DELETE
>> > Search Term = Role*
>> > 
>> > 
>> > 0
>> > 4
>> > 
>> > on
>> > 0
>> > name : Role*
>> > 2.2
>> > 10
>> > 
>> > 
>> > 
>> > 
>> >
>> >
>> > Indexed term = ROLE_DELETE
>> > Search Term = ROLE_DELETE*
>> > 
>> > 
>> > 0
>> > 4
>> > 
>> > on
>> > 0
>> > name : ROLE_DELETE*
>> > 2.2
>> > 10
>> > 
>> > 
>> > 
>> > 
>> > I am also adding a analysis html.
>> >
>> >
>> > On Mon, May 30, 2011 at 7:19 AM, Erick Erickson > >
>> > wrote:
>> >>
>> >> I'd start by looking at the analysis page from the Solr admin page. That
>> >> will give you an idea of the transformations the various steps carry
>> out,
>> >> it's invaluable!
>> >>
>> >> Best
>> >> Erick
>> >> On May 26, 2011 12:53 AM, "Saumitra Chowdhury" <
>> >> saumi...@smartitengineering.com> wrote:
>> >> > Hi all ,
>> >> > In my schema.xml i am using WordDelimiterFilterFactory,
>> >> > LowerCaseFilterFactory, StopFilterFactory for index analyzer and an
>> >> > extra
>> >> > SynonymFilterFactory for query analyzer. I am indexing a field name
>> >> > '*name*'.Now
>> >> > if a value with all caps like "NAME_BILL" is indexed I am able get
>> this
>> >> > as
>> >> > search result with the term " *name_bill *", " *NAME_BILL *", "
>> >> > *namebill
>> >> *",
>> >> > "*namebill** ", " *nameb** " ... But for the term like following " *
>> >> > NAME_BILL** ", " *name_bill** ", " *namebill** ", " *NAME** " the
>> result
>> >> > does mot show this document. Can anyone please explain why this is
>> >> > happening? .In fact star " * " is not giving any result in many
>> >> > cases specially if it is used after full value of a field.
>> >> >
>> >> > Portion of my schema is given below.
>> >> >
>> >> > > >> positionIncrementGap="100">
>> >> > -
>> >> > 
>> >> > 
>> >> > 
>> >> > 
>> >> > -
>> >> > > >> > positionIncrementGap="100">
>> >> > -
>> >> > 
>> >> > 
>> >> > > >> > generateNumberParts="0" catenateWords="1" catenateNumbers="1"
>> >> > catenateAll="0"/>
>> >> > 
>> >> > > >> > words="stopwords.txt" enablePositionIncrements="true"/>
>> >> > 
>> >> > -
>> >> > 
>> >> > 
>> >> > > >> > generateNumberParts="0" catenateWords="1" catenateNumbers="1"
>> >> > catenateAll="0"/>
>> >> > 
>> >> > > >> > ignoreCase="true" expand="true"/>
>> >> > > >> > words="stopwords.txt" enablePositionIncrements="true"/>
>> >> > 
>> >> > 
>> >> > -
>> >> > > >> > positionIncrementGap="100">
>> >> > -
>>

Re: Query problem in Solr

If I read this correctly, one approach is to specify an
increment gap in a multiValued field, then search for phrases
with a slop less than that increment gap. i.e.
incrementGap=100 in your definition, and search for
"apple orange"~99

If this is gibberish, please post some examples and we'll
try something else.

Best
Erick

On Wed, Jun 1, 2011 at 4:21 AM, Kurt Sultana  wrote:
>  Hi all,
>
> We're using Solr to search on a Shop index and a Product index. Currently a
> Shop has a field `shop_keyword` which also contains the keywords of the
> products assigned to it. The shop keywords are separated by a space.
> Consequently, if there is a product which has a keyword "apple" and another
> which has "orange", a search for shops having `Apple AND Orange` would
> return the shop for these products.
>
> However, this is incorrect since we want that a search for shops having
> `Apple AND Orange` returns shop(s) having products with both "apple" and
> "orange" as keywords.
>
> We tried solving this problem, by making shop keywords multi-valued and
> assigning the keywords of every product of the shop as a new value in shop
> keywords. However as was confirmed in another post
> http://markmail.org/thread/xce4qyzs5367yplo#query:+page:1+mid:76eerw5yqev2aanu+state:results,
> Solr does not support "all words must match in the same value of a
> multi-valued field".
>
> (Hope I explained myself well)
>
> How can we go about this? Ideally, we shouldn't change our search
> infrastructure dramatically.
>
> Thanks!
>
> Krt_Malta
>

Re: Index vs. Query Time Aware Filters

Could you post one of your pairs of definitions? Because
I don't recognize queryMode and a web search doesn't turn
anything up, so I'm puzzled.

Best
Erick

On Wed, Jun 1, 2011 at 1:13 AM, Mike Schultz  wrote:
> We have very long schema files for each of our language dependent query
> shards.  One thing that is doubling the configuration length of our main
> text processing field definition is that we have to repeat the exact same
> filter chain for query time version EXCEPT with a queryMode=true parameter.
>
> Is there a way for a filter to figure out if it's the index vs. query time
> version?
>
> A similar wish would be for the filter to be able to figure out the name of
> the field currently being indexed.  This would allow a filter to set a
> parameter at runtime based on fieldname, instead of boilerplate copying the
> same filterchain definition in schema.xml EXCEPT for one parameter.  The
> motivation is again to reduce errors and increase readability of the schema
> file.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Index-vs-Query-Time-Aware-Filters-tp3009450p3009450.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Bulk indexing, UpdateProcessor overwriteDupes and poor IO performances

2011-06-01 Thread Tanguy Moal


Lee,

Thank you very much for your answer.

Using the signature field as the uniqueKey is effectively what I was 
doing, so the "overwriteDupes=true" parameter in my solrconfig was 
somehow redundant, although I wasn't aware of it! =D


In practice it works perfectly and that's the nice part.

By the way, I wonder what happens when we enter in the following code 
snippet when the id field is the same as the signature field, from 
addDoc@DirectUpdateHandler2(AddUpdateCommand) :

  if(del) { // ensure id remains unique
  BooleanQuery bq = new BooleanQuery();
  bq.add(new BooleanClause(new TermQuery(updateTerm), 
Occur.MUST_NOT));

  bq.add(new BooleanClause(new TermQuery(idTerm), Occur.MUST));
  writer.deleteDocuments(bq);
}

May be all my problems started from here...

I'll try to reproduce using a different uniqueKey field and turning 
overwriteDupes back to "on" to see if the problem was because of the 
signature field being the same as the uniqueKey field *and* having 
overwriteDupes on, when I'll have some time. If so, maybe that a simple 
configuration check should be performed to avoid the issue. Otherwise it 
means that having overwriteDupes turned on simply doesn't scale and that 
should be added to the wiki's Deduplication page, IMHO.


Thank you again.
Regards,

--
Tanguy

On 31/05/2011 14:58, lee carroll wrote:

Tanguy

You might have tried this already but can you set overwritedupes to
false and set the signiture key to be the id. That way solr
will manage updates?

from the wiki

http://wiki.apache.org/solr/Deduplication



HTH

Lee


On 30 May 2011 08:32, Tanguy Moal  wrote:

Hello,

Sorry for re-posting this but it seems my message got lost in the mailing 
list's messages stream without hitting anyone's attention... =D

Shortly, has anyone already experienced dramatic indexing slowdowns during 
large bulk imports with overwriteDupes turned on and a fairly high duplicates 
rate (around 4-8x) ?

It seems to produce a lot of deletions, which in turn appear to make the 
merging of segments pretty slow, by fairly increasing the number of little 
reads operations occuring simultaneously with the regular large write 
operations of the merge. Added to the poor IO performances of a commodity SATA 
drive, indexing takes ages.

I temporarily bypassed that limitation by disabling the overwriting of 
duplicates, but that changes the way I request the index, requiring me to turn 
on field collapsing at search time.

Is this a known limitation ?

Has anyone a few hints on how to optimize the handling of index time 
deduplication ?

More details on my setup and the state of my understanding are in my previous 
message here-after.

Thank you very much in advance.

Regards,

Tanguy

On 05/25/11 15:35, Tanguy Moal wrote:

Dear list,

I'm posting here after some unsuccessful investigations.
In my setup I push documents to Solr using the StreamingUpdateSolrServer.

I'm sending a comfortable initial amount of documents (~250M) and wished to 
perform overwriting of duplicated documents at index time, during the update, 
taking advantage of the UpdateProcessorChain.

At the beginning of the indexing stage, everything is quite fast; documents 
arrive at a rate of about 1000 doc/s.
The only extra processing during the import is computation of a couple of 
hashes that are used to identify uniquely documents given their content, using 
both stock (MD5Signature) and custom (derived from Lookup3Signature) update 
processors.
I send a commit command to the server every 500k documents sent.

During a first period, the server is CPU bound. After a short while (~10 
minutes), the rate at which documents are received starts to fall dramatically, 
the server being IO bound.
I've been firstly thinking of a normal speed decrease during the commit, while 
my push client is waiting for the flush to occur. That would have been a normal 
slowdown.

The thing that retained my attention was the fact that unexpectedly, the server 
was performing a lot of small reads, way more the number writes, which seem to 
be larger.
The combination of the many small reads with the constant amount of bigger 
writes seem to be creating a lot of IO contention on my commodity SATA drive, 
and the ETA of my built index started to increase scarily =D

I then restarted the JVM with JMX enabled so I could start investigating a 
little bit more. I've the realized that the UpdateHandler was performing many 
reads while processing the update request.

Are there any known limitations around the UpdateProcessorChain, when 
overwriteDupes is set to true ?
I turned that off, which of course breaks the intent of my built index, but for 
comparison purposes it's good.

That did the trick, indexing is fast again, even with the periodic commits.

I therefor have two questions, an interesting first  one and a boring second 
one :

1 / What's the workflow of the UpdateProcessorChain when one or more processors 
have overwriti

Re: Anyway to know changed documents?


On 6/1/2011 6:12 AM, pravesh wrote:

SOLR wiki will provide help on this. You might be interested in pure Java
based replication too. I'm not sure,whether SOLR operational will have this
feature(synch'ing only changed segments). You might need to change
configuration in searchconfig.xml


Yes, this feature is there in the Java/HTTP based replication since Solr 1.4

Re: Anyway to know changed documents?

You may be interested in Solr's replication feature? 
http://wiki.apache.org/solr/SolrReplication


On 6/1/2011 2:07 AM,  wrote:

Hi everyone,
If I have two server ,their indexes should be synchronized. I changed A's 
index via HTTP send document objects, Is there any config or some plug-ins to 
let solr know which objects are changed and can push it B ?
   Any suggestion will be appreciate.
   Thanks :)

London open source search social - 13th June

2011-06-01 Thread Richard Marr

Hi guys,

Just to let you know we're meeting up to talk all-things-search on Monday
13th June. There's usually a good mix of backgrounds and experience levels
so if you're free and in the London area then it'd be good to see you there.

Details:
7pm - The Elgin - 96 Ladbrooke Grove
http://www.meetup.com/london-search-social/events/20387881/



Greetings search geeks!

We've booked the next meetup for the 13th June. As usual, the plan is to
meet up and geek out over a friendly beer.

I know my co-organiser René has been working on some interesting search
projects, and I've recently left Empora to work on my own project so by June
I should hopefully have some war stories about using @elasticsearch in
production. The format is completely open though so please bring your own
topics if you've got them.

Hope to see you there!

--
Richard Marr



-- 
Richard Marr

Re: Solr memory consumption