What is the difference between SolrCloud and Solr+Hadoop?

2010-09-12 Thread 郭芸
Dear All:
I need Solr for distributed search, and I found there are two choices:
SolrCloud and Solr+Hadoop.
I would like to know what the differences between them are.
Also, SolrCloud can be downloaded from svn - how can we get Solr+Hadoop?
Please help me! Thank you!

2010-09-13 



郭芸 


Re: Tuning Solr caches with high commit rates (NRT)

2010-09-12 Thread Dennis Gearon
BTW, what is a segment?

I've only heard about them in the last 2 weeks here on the list.
Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Sun, 9/12/10, Jason Rutherglen  wrote:

> From: Jason Rutherglen 
> Subject: Re: Tuning Solr caches with high commit rates (NRT)
> To: solr-user@lucene.apache.org
> Date: Sunday, September 12, 2010, 7:52 PM
> [Jason Rutherglen's reply and the quoted thread appear in full later in
> this digest; see "Re: Tuning Solr caches with high commit rates (NRT)"
> and the original post from Peter Sturge.]

RE: multivalued fields in result

2010-09-12 Thread Jason Chaffee
My schema.xml was fine.  The problem was that the top 10 documents returned by 
my test queries didn't have data in those fields.  Once I increased the rows 
parameter, I saw the results.
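
For example (a hedged illustration - this query is made up, and 10 is
Solr's default for rows), raising rows was enough to bring back the
documents that did have values:

    http://localhost:8983/solr/select?q=*:*&rows=50&fl=*,score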

Definitely user error.  :)

Thanks for help though.

Jason


-Original Message-
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Sun 9/12/2010 6:23 PM
To: solr-user@lucene.apache.org
Subject: Re: multivalued fields in result
 
Also, the 'v' is capitalized: multiValued. (This is one reason why 
posting your schema helps.)

Erick Erickson wrote:
> Can we see your schema file? Because it sounds like you didn't
> really declare your field multivalued="true" on the face of things.
>
> But if it is multivalued AND you changed it, did you reindex after
> you changed the schema?
>
> Best
> Erick
>
> On Sun, Sep 12, 2010 at 4:21 AM, Jason Chaffee  wrote:
>
>
>> But it doesn't seem to be returning multivalued fields that are stored.  It
>> is returning all of the single value fields though.
>>
>>
>> -Original Message-
>> From: Markus Jelsma [mailto:markus.jel...@buyways.nl]
>> Sent: Sat 9/11/2010 4:19 AM
>> To: solr-user@lucene.apache.org
>> Subject: RE: multivalued fields in result
>>
>> Yes, you'll get what is stored and asked for.
>>
>> -Original message-
>> From: Jason Chaffee
>> Sent: Sat 11-09-2010 05:27
>> To: solr-user@lucene.apache.org;
>> Subject: multivalued fields in result
>>
>> Is it possible to return multivalued fields in the result?
>>
>> I would like to have a multivalued field that is stored and not indexed (I
>> also copy the same field into another field where it is tokenized and
>> indexed).  I would then like all the values of this field returned in the
>> result set.  Is there a way to do this?
>>
>> If it is not possible, could someone elaborate why that is so that I may
>> see if I can make it work.
>>
>> thanks,
>>
>> Jason
>>
>>
>>  
>



Re: Tuning Solr caches with high commit rates (NRT)

2010-09-12 Thread Jason Rutherglen
Yeah there's no patch... I think Yonik can write it. :-)  Yah... The
Lucene version shouldn't matter.  The distributed faceting could in
theory easily be applied to multiple segments; however, the way it's
written is, for me, a challenge to untangle and apply successfully to a
working patch.  Also, I don't have this as an itch to scratch at the
moment.

On Sun, Sep 12, 2010 at 7:18 PM, Peter Sturge  wrote:
> [Peter Sturge's reply, the earlier exchange, and his original
> cache-tuning notes are quoted in full; each appears as its own message
> later in this digest.]

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-12 Thread Chris Haggstrom
Thanks, Peter.  This is really great info.

One setting I've found to be very useful for the problem of overlapping 
onDeckSearchers is to reduce the value of maxWarmingSearchers in 
solrconfig.xml.  I've reduced this to 1, so if a slave is already busy doing 
pre-warming, it won't try to also pre-warm additional updates.  This has 
greatly reduced our time to incorporate updates, with no visible downsides 
other than an uglier snapinstaller.log (we're still using 1.3 w/rsync-based 
replication).
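
For reference, a minimal sketch of the one-line solrconfig.xml change
described above:

    <maxWarmingSearchers>1</maxWarmingSearchers>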

-Chris

On Sep 12, 2010, at 9:26 AM, Peter Sturge wrote:

> [Peter Sturge's original cache-tuning notes quoted in full; see the
> post "Tuning Solr caches with high commit rates (NRT)" later in this
> digest.]

Re: multivalued fields in result

2010-09-12 Thread Lance Norskog
Also, the 'v' is capitalized: multiValued. (This is one reason why 
posting your schema helps.)
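
As a hedged illustration (the field names and types are made up), the
stored-but-not-indexed multivalued field plus a copyField into an indexed
twin that Jason describes would look something like this in schema.xml:

    <field name="tags" type="string" indexed="false" stored="true"
           multiValued="true"/>
    <field name="tags_search" type="text" indexed="true" stored="false"
           multiValued="true"/>
    <copyField source="tags" dest="tags_search"/>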


Erick Erickson wrote:

> [Erick's question and the earlier exchange quoted in full; see "Re:
> multivalued fields in result" from Erick Erickson later in this digest.]


Re: Tuning Solr caches with high commit rates (NRT)

2010-09-12 Thread Lance Norskog

Bravo!

Other tricks: here is a policy for deciding when to merge segments that 
attempts to balance merging with performance. It was contributed by 
LinkedIn - they also run index & search in the same instance (not Solr, a 
different Lucene app).


lucene/contrib/misc/src/java/org/apache/lucene/index/BalancedSegmentMergePolicy.java

The optimize command now includes a partial optimize option, so you can 
do larger controlled merges.


Peter Sturge wrote:

> [Peter Sturge's original cache-tuning notes quoted in full; see the
> post "Tuning Solr caches with high commit rates (NRT)" later in this
> digest.]

Re: No more trunk support for 2.9 indexes

2010-09-12 Thread Ryan McKinley
> I suppose an index 'remaker' might be something like a DIH reader for
> a Solr index - streams everything out of the existing index, writing
> it into the new one?

This works fine if all fields are stored (and copyField does not go
to a stored field); otherwise you would need/want to start with the
original source.

ryan


Re: Tuning Solr caches with high commit rates (NRT)

2010-09-12 Thread Peter Sturge
Hi Jason,

I've tried some limited testing with the 4.x trunk using fcs, and I
must say, I really like the idea of per-segment faceting.
I was hoping to see it in 3.x, but I don't see this option in the
branch_3x trunk. Is your SOLR-1606 patch referred to in SOLR-1617 the
one to use with 3.1?
There seem to be a number of Solr issues tied to this - one of them
being LUCENE-1785. Can the per-segment faceting patch work with Lucene
2.9/branch_3x?

Thanks,
Peter



On Mon, Sep 13, 2010 at 12:05 AM, Jason Rutherglen wrote:
> [Jason's question and Peter Sturge's original cache-tuning notes quoted
> in full; both appear as their own messages later in this digest.]

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-12 Thread Jason Rutherglen
Peter,

Are you using per-segment faceting, eg, SOLR-1617?  That could help
your situation.

On Sun, Sep 12, 2010 at 12:26 PM, Peter Sturge  wrote:
> [Peter Sturge's original cache-tuning notes quoted in full; see the
> post "Tuning Solr caches with high commit rates (NRT)" later in this
> digest.]

Saravanan Chinnadurai/Actionimages is out of the office.

2010-09-12 Thread Saravanan . Chinnadurai
I will be out of the office starting  12/09/2010 and will not return until
14/09/2010.

Please email itsta...@actionimages.com for any urgent issues.

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-12 Thread Peter Karich
Peter,

thanks a lot for your in-depth explanations!
Your findings will definitely be helpful for my next performance
improvement tests :-)

Two questions:

1. How would I do that:

> or a local read-only instance that reads the same core as the indexing 
> instance (for the latter, you'll need something that periodically refreshes - 
> i.e. runs commit()).


2. Did you try sharding with your current setup (e.g. one big,
nearly-static index and a tiny write+read index)?

Regards,
Peter.

> [Peter Sturge's original cache-tuning notes quoted in full; see the
> post "Tuning Solr caches with high commit rates (NRT)" later in this
> digest.]
Re: Solr and jvm Garbage Collection tuning

2010-09-12 Thread Grant Ingersoll

On Sep 10, 2010, at 7:01 PM, Burton-West, Tom wrote:

> We have noticed that when the first query hits Solr after starting it up, 
> memory use increases significantly, from about 1GB to about 16GB, and then as 
> queries are received it goes up to about 19GB at which point there is a Full 
> Garbage Collection which takes about 30 seconds and then memory use drops 
> back down to 16GB.  Under a relatively heavy load, the full GC happens about 
> every 10-20 minutes.
> 
> We are running 3 Solr shards under one Tomcat with 20GB allocated to the jvm. 
>  Each shard has a total index size of about 400GB and a tii size of about 
> 600MB and indexes about 650,000 full-text books. (The server has a total of 
> 72GB of memory, so we are leaving quite a bit of memory for the OS disk 
> cache).
> 
> Is there some argument we could give the jvm so that it would collect garbage 
> more frequently? Or some other JVM tuning action that might reduce the amount 
> of time where Solr is waiting on GC?
> 
> If we could get the time for each GC to take under a second, with the 
> trade-off being that GC  would occur much more frequently, that would help us 
> avoid the occasional query taking more than 30 seconds at the cost of a 
> larger number of queries taking at least a second.
> 

What are your current GC settings?  Also, I guess I'd look at ways you can 
reduce the heap size needed: caching, field type choices, faceting choices.  
You could also try playing with the termIndexInterval, which will load fewer 
terms into memory at the cost of longer seeks.  At some point, though, you may 
just need more shards and the resulting smaller indexes.  How many CPU cores 
do you have on each machine?
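
For readers wanting a starting point: a hedged sketch of the kind of
HotSpot flags often tried for large heaps in this era - the CMS collector
trades throughput for shorter pauses, which matches the goal of many small
collections instead of a 30-second full GC (heap size and thresholds are
illustrative, not a recommendation):

    java -Xms20g -Xmx20g \
         -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
         -XX:CMSInitiatingOccupancyFraction=70 \
         -XX:+UseCMSInitiatingOccupancyOnly \
         -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
         -jar start.jar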

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-12 Thread Dennis Gearon
Wow! Thanks for that. This email is DEFINITELY being filed.

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Sun, 9/12/10, Peter Sturge  wrote:

> From: Peter Sturge 
> Subject: Tuning Solr caches with high commit rates (NRT)
> To: solr-user@lucene.apache.org
> Date: Sunday, September 12, 2010, 9:26 AM
> [Peter Sturge's original cache-tuning notes quoted in full; see the
> post "Tuning Solr caches with high commit rates (NRT)" later in this
> digest.]

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-12 Thread Erick Erickson
Peter:

This kind of information is extremely useful to document, thanks! Do you
have the time/energy to put it up on the Wiki? Anyone can edit it by
creating a logon. If you don't, would it be OK if someone else did it
(with attribution, of course)? I guess that by bringing it up I'm
volunteering :)...

Best
Erick

On Sun, Sep 12, 2010 at 12:26 PM, Peter Sturge wrote:

> [Peter Sturge's original cache-tuning notes quoted in full; see the
> post "Tuning Solr caches with high commit rates (NRT)" below.]

Tuning Solr caches with high commit rates (NRT)

2010-09-12 Thread Peter Sturge
Hi,

Below are some notes regarding Solr cache tuning that should prove
useful for anyone who uses Solr with frequent commits (e.g. <5min).

Environment:
Solr 1.4.1 or branch_3x trunk.
Note the 4.x trunk has lots of neat new features, so the notes here
are likely less relevant to the 4.x environment.

Overview:
Our Solr environment makes extensive use of faceting, we perform
commits every 30secs, and the indexes tend to be on the large-ish side
(>20million docs).
Note: For our data, when we commit, we are always adding new data,
never changing existing data.
This type of environment can be tricky to tune, as Solr is more geared
toward fast reads than frequent writes.

Symptoms:
If anyone has used faceting in searches where you are also performing
frequent commits, you've likely encountered the dreaded OutOfMemory or
GC Overhead Exceeded errors.
In high commit rate environments, this is almost always due to
multiple 'onDeck' searchers and autowarming - i.e. new searchers don't
finish autowarming their caches before the next commit()
comes along and invalidates them.
Once this starts happening on a regular basis, it is likely your
Solr's JVM will run out of memory eventually, as the number of
searchers (and their cache arrays) will keep growing until the JVM
dies of thirst.
To check if your Solr environment is suffering from this, turn on INFO
level logging, and look for: 'PERFORMANCE WARNING: Overlapping
onDeckSearchers=x'.

In tests, we've only ever seen this problem when using faceting, and
facet.method=fc.

Some solutions to this are:
    - Reduce the commit rate to allow searchers to fully warm before the
      next commit
    - Reduce or eliminate the autowarming in caches
    - Both of the above

The trouble is, if you're doing NRT commits, you likely have a good
reason for it, and reducing/eliminating autowarming will very
significantly impact search performance in high commit rate
environments.

Solution:
Here are some setup steps we've used that allow lots of faceting (we
typically search with at least 20-35 different facet fields, and date
faceting/sorting) on large indexes, and still keep decent search
performance:

1. Firstly, you should consider using the enum method for facet
searches (facet.method=enum) unless you've got A LOT of memory on your
machine. In our tests, this method uses a lot less memory and
autowarms more quickly than fc. (Note, I've not tried the new
segment-based 'fcs' option, as I can't find support for it in
branch_3x - looks nice for 4.x though.)
Admittedly, for our data, enum is not quite as fast for searching as
fc, but short of purchasing a Taiwanese RAM factory, it's a worthwhile
tradeoff.
If you do have access to LOTS of memory, AND you can guarantee that
the index won't grow beyond the memory capacity (i.e. you have some
sort of deletion policy in place), fc can be a lot faster than enum
when searching with lots of facets across many terms.
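
As a hedged illustration (the field name is made up), the method can be
set per request:

    .../select?q=*:*&facet=true&facet.field=category&facet.method=enum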

2. Secondly, we've found that LRUCache is faster at autowarming than
FastLRUCache - in our tests, about 20% faster. Maybe this is just our
environment - your mileage may vary.

So, our filterCache section in solrconfig.xml looks like this:

    <filterCache
      class="solr.LRUCache"
      size="3600"
      initialSize="1400"
      autowarmCount="3600"/>

For a 28GB index, running in a quad-core x64 VMWare instance, 30
warmed facet fields, Solr is running at ~4GB. Stats filterCache size
shows usually in the region of ~2400.

3. It's also a good idea to have some sort of
firstSearcher/newSearcher event listener queries to allow new data to
populate the caches.
Of course, what you put in these is dependent on the facets you need/use.
We've found a good combination is a firstSearcher with as many facets
in the search as your environment can handle, then a subset of the
most common facets for the newSearcher.
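
A hedged sketch of such a listener in solrconfig.xml (the query and facet
field are illustrative):

    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="facet">true</str>
          <str name="facet.field">category</str>
        </lst>
      </arr>
    </listener>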

4. We also set:
    <useColdSearcher>true</useColdSearcher>
just in case.

5. Another key area for search performance with high commits is to use
2 Solr instances - one for the high commit rate indexing, and one for
searching.
The read-only searching instance can be a remote replica, or a local
read-only instance that reads the same core as the indexing instance
(for the latter, you'll need something that periodically refreshes -
i.e. runs commit()).
This way, you can tune the indexing instance for writing performance
and the searching instance as above for max read performance.
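
For the local read-only case, the periodic refresh can be as simple as
posting an empty commit to the read-only instance (a hedged sketch - host
and port are illustrative):

    curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' \
         --data-binary '<commit/>'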

Using the setup above, we get fantastic searching speed for small
facet sets (well under 1sec), and really good searching for large
facet sets (a couple of secs depending on index size, number of
facets, unique terms etc. etc.),
even when searching against largeish indexes (>20million docs).
We have yet to see any OOM or GC errors using the techniques above,
even in low memory conditions.

I hope there are people who find this useful. I know I've spent a lot
of time looking for stuff like this, so hopefully, this will save
someone some time.


Peter


Re: Solr memory use, jmap and TermInfos/tii

2010-09-12 Thread Robert Muir
On Sun, Sep 12, 2010 at 9:57 AM, Simon Willnauer <
simon.willna...@googlemail.com> wrote:

> > To change the divisor in your solrconfig, for example to 4, it looks like
> > you need to do this.
> >
> >  <indexReaderFactory name="IndexReaderFactory"
> >      class="org.apache.solr.core.StandardIndexReaderFactory">
> >    <int name="termIndexDivisor">4</int>
> >  </indexReaderFactory>
>
> Ah, thanks robert! I didn't know about that one either!
>
> simon


Actually I'm wrong: for Solr 1.4, use "setTermIndexDivisor".

I was looking at 3.1/trunk, and there is a bug in the name of this parameter:
https://issues.apache.org/jira/browse/SOLR-2118
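
So under 1.4, the divisor line would presumably read:

    <int name="setTermIndexDivisor">4</int>

with the surrounding indexReaderFactory element unchanged.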

-- 
Robert Muir
rcm...@gmail.com


Re: multivalued fields in result

2010-09-12 Thread Erick Erickson
Can we see your schema file? Because on the face of it, it sounds like
you didn't actually declare your field multiValued="true".

But if it is multivalued AND you changed it, did you reindex after
you changed the schema?

Best
Erick

On Sun, Sep 12, 2010 at 4:21 AM, Jason Chaffee  wrote:

> But it doesn't seem to be returning multivalued fields that are stored.  It
> is returning all of the single-valued fields, though.
>
>
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@buyways.nl]
> Sent: Sat 9/11/2010 4:19 AM
> To: solr-user@lucene.apache.org
> Subject: RE: multivalued fields in result
>
> Yes, you'll get what is stored and asked for.
>
> -Original message-
> From: Jason Chaffee 
> Sent: Sat 11-09-2010 05:27
> To: solr-user@lucene.apache.org;
> Subject: multivalued fields in result
>
> Is it possible to return multivalued fields in the result?
>
> I would like to have a multivalued field that is stored and not indexed (I
> also copy the same field into another field where it is tokenized and
> indexed).  I would then like all the values of this field returned in the
> result set.  Is there a way to do this?
>
> If it is not possible, could someone elaborate why that is so that I may
> see if I can make it work.
>
> thanks,
>
> Jason
>
>


Re: Invalid version or the data in not in 'javabin' format

2010-09-12 Thread h00kpub...@gmail.com
 That was the solution!! I packaged the current Lucene and SolrJ
repositories (dev 4.0) and copied the necessary jars into Nutch's libs (after
removing the old ones), rebuilt Nutch and ran it - it works!! Thank you Peter :)


marcel

On 09/12/2010 03:40 PM, Peter Sturge wrote:

Could be a solrj .jar version compatibility issue. Check that the client's
and server's solrj version jars match up.

Peter


On Sun, Sep 12, 2010 at 1:16 PM, h00kpub...@gmail.com
  wrote:

  hi... currently I am integrating Nutch (release 1.2) into Solr (trunk). When
I index into Solr with Nutch, I get this exception:

java.lang.RuntimeException: Invalid version or the data in not in 'javabin' format
        at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
        at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:39)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:466)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
        at org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:98)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2010-09-12 11:44:55,101 ERROR solr.SolrIndexer - java.io.IOException: Job
failed!

can you tell me what's wrong, or how I can fix this?

best regards marcel :)



Re: mm=0?

2010-09-12 Thread Erick Erickson
Could you explain the use-case a bit? Because the very
first response I would have is "why in the world did
product management make this a requirement?", and I'd try
to get the requirement changed...

As a user, I'm having a hard time imagining being well
served by getting a document in response to a search that
had no relation to my search; it would just be a random doc
selected from the corpus.

All that said, I don't think a single query would do the trick.
You could include a "very special" document with a field
that no other document has, containing very special text. Say
the field is named "bogusmatch" and filled with the text "bogustext";
then at least the second query would match one and only
one document, and would take minimal time. Or you could
tack on "OR bogusmatch:bogustext^0.001" to each and every query
(which would be really inexpensive) and filter it out if there
was more than one response. By boosting it really low, it should
always appear at the end of the list, which wouldn't be a bad thing.
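
As a sketch, using the made-up field above, each request would then look
something like:

    q=(alpha) OR bogusmatch:bogustext^0.001

and the client would drop the bogusmatch document whenever numFound > 1.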

DisMax might help you here...

But do ask whether it's really a requirement, or just something nobody's
objected to, before going to the bother IMO...

Best
Erick

On Sat, Sep 11, 2010 at 1:10 PM, Satish Kumar <
satish.kumar.just.d...@gmail.com> wrote:

> Hi,
>
> We have a requirement to show at least one result every time -- i.e., even
> if the user-entered term is not found in any of the documents. I was hoping
> that setting mm to 0 would return results in all cases, but it does not.
>
> For example, if the user entered the term "alpha" and it is *not* in any of
> the documents in the index, any document in the index can be returned. If the
> term "alpha" is in the document set, only documents containing the term
> "alpha" must be returned.
>
> My idea so far is to perform a search using user entered term. If there are
> any results, return them. If there are no results, perform another search
> without the query term-- this means doing two searches. Any suggestions on
> implementing this requirement using only one search?
>
>
> Thanks,
> Satish
>


Re: Solr memory use, jmap and TermInfos/tii

2010-09-12 Thread Simon Willnauer
On Sun, Sep 12, 2010 at 12:42 PM, Robert Muir  wrote:
> On Sat, Sep 11, 2010 at 7:51 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> On Sat, Sep 11, 2010 at 11:07 AM, Burton-West, Tom 
>> wrote:
>> >  Is there an example of how to set up the divisor parameter in
>> solrconfig.xml somewhere?
>>
>> Alas I don't know how to configure terms index divisor from Solr...
>>
>>
> To change the divisor in your solrconfig, for example to 4, it looks like
> you need to do this.
>
>  <indexReaderFactory name="IndexReaderFactory"
>      class="org.apache.solr.core.StandardIndexReaderFactory">
>    <int name="termIndexDivisor">4</int>
>  </indexReaderFactory>

Ah, thanks robert! I didn't know about that one either!

simon
>
> This parameter was added in SOLR-1296 so its in Solr 1.4
>
> Tom, i would recommend altering this parameter, instead of the default
> (1)... especially since you don't have to reindex to take advantage of it.
>
> --
> Robert Muir
> rcm...@gmail.com
>


Re: Invalid version or the data in not in 'javabin' format

2010-09-12 Thread Peter Sturge
Could be a solrj .jar version compatibility issue. Check that the client's
and server's solrj version jars match up.

Peter


On Sun, Sep 12, 2010 at 1:16 PM, h00kpub...@gmail.com
 wrote:
>  hi... currently I am integrating Nutch (release 1.2) into Solr (trunk). When
> I index into Solr with Nutch, I get this exception:
>
> java.lang.RuntimeException: Invalid version or the data in not in 'javabin'
> format
>        at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
>        at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:39)
>        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:466)
>        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
>        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
>        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
>        at org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:98)
>        at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
>        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
>        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
>        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> 2010-09-12 11:44:55,101 ERROR solr.SolrIndexer - java.io.IOException: Job
> failed!
>
> can you tell me what's wrong, or how I can fix this?
>
> best regards marcel :)
>
>
>
>


Invalid version or the data in not in 'javabin' format

2010-09-12 Thread h00kpub...@gmail.com
 hi... currently I am integrating Nutch (release 1.2) into Solr
(trunk). When I index into Solr with Nutch, I get this exception:


java.lang.RuntimeException: Invalid version or the data in not in 'javabin' format
        at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
        at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:39)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:466)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
        at org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:98)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2010-09-12 11:44:55,101 ERROR solr.SolrIndexer - java.io.IOException: Job
failed!


can you tell me what's wrong, or how I can fix this?

best regards marcel :)





Re: Solr memory use, jmap and TermInfos/tii

2010-09-12 Thread Robert Muir
On Sat, Sep 11, 2010 at 7:51 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Sat, Sep 11, 2010 at 11:07 AM, Burton-West, Tom 
> wrote:
> >  Is there an example of how to set up the divisor parameter in
> solrconfig.xml somewhere?
>
> Alas I don't know how to configure terms index divisor from Solr...
>
>
To change the divisor in your solrconfig, for example to 4, it looks like
you need to do this.

  <indexReaderFactory name="IndexReaderFactory"
      class="org.apache.solr.core.StandardIndexReaderFactory">
    <int name="termIndexDivisor">4</int>
  </indexReaderFactory>

This parameter was added in SOLR-1296, so it's in Solr 1.4.

Tom, I would recommend altering this parameter rather than keeping the default
(1)... especially since you don't have to reindex to take advantage of it.

-- 
Robert Muir
rcm...@gmail.com


RE: Delta Import with something other than Date

2010-09-12 Thread Ephraim Ofir
Alternatively, you could use the deltaQuery to retrieve the last indexed
id from the DB (you'd have to save it there on your previous import).
Your entity would look something like:
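
(A sketch only - the table and column names here are assumptions:)

    <entity name="item" pk="id"
        query="SELECT * FROM item"
        deltaQuery="SELECT last_id AS id FROM last_id_table"
        deltaImportQuery="SELECT * FROM item
                          WHERE id &gt; '${dataimporter.delta.id}'"/>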

You could implement your deltaImportQuery as a stored procedure which
would store the appropriate id in last_id_table (for the next
delta-import) in addition to returning the data from the query.

Ephraim Ofir


-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Friday, September 10, 2010 4:54 AM
To: solr-user@lucene.apache.org
Subject: Re: Delta Import with something other than Date

  On 9/9/2010 1:23 PM, Vladimir Sutskever wrote:
> Shawn,
>
> Can you provide a sample of passing the parameter via URL? And how
using it would look in the data-config.xml
>

Here's the URL that I send to do a full build on my last shard:

http://idxst5-a:8983/solr/build/dataimport?command=full-import&optimize=true&commit=true&dataTable=ncdat&numShards=6&modVal=5&minDid=0&maxDid=242895591

If I want to do a delta, I just change the command to delta-import and 
give it a proper minDid value, rather than 0.

Below is the entity from my data-config.xml.  You have to have a
deltaQuery defined for delta-import to work, but if you're going to use
your own placeholders, just put something in that returns a single value
very quickly.  In my case, my query and deltaImportQuery are actually
identical.
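
A sketch consistent with the URL parameters above (the column name 'did' is
inferred from minDid/maxDid, the real entity may well differ, and per the
note above deltaImportQuery would be identical to query):

    <entity name="ncdat" pk="did"
        query="SELECT * FROM ${dataimporter.request.dataTable}
               WHERE did &gt; ${dataimporter.request.minDid}
               AND did &lt;= ${dataimporter.request.maxDid}
               AND (did % ${dataimporter.request.numShards}) =
                   ${dataimporter.request.modVal}"
        deltaQuery="SELECT 1"
        deltaImportQuery="...same SELECT as query..."/>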


Re: Solr memory use, jmap and TermInfos/tii

2010-09-12 Thread Michael McCandless
One thing that the Codec API makes possible ("in theory", anyway)...
is a variable-gap terms index.

I.e., Lucene today makes an indexed term at regular intervals (every N
terms -- 128 in 3.x, 32 in 4.0).

But this is rather silly.  Imagine the terms you are going through are
all singletons (they occur in only one doc each, e.g. if they are OCR
noise or whatever).  Maybe you have 500 such terms in sequence and then
you hit a "real" term with a high freq.  In this case, you don't really
need to add any indexed terms from those 500; instead, make the real term
an indexed term.

Because... a TermQuery against those singleton terms is going to be
wicked fast, so you can afford the extra term-seek time.  Whereas a
TermQuery against a high-frequency term will be costly, so you want to
minimize term-seek time.

Such an approach could tremendously reduce the RAM required by the
terms index w/ no appreciable hit to the worst-case queries (and
possibly a slight improvement).

Mike

On Sat, Sep 11, 2010 at 7:51 PM, Michael McCandless
 wrote:
> On Sat, Sep 11, 2010 at 11:07 AM, Burton-West, Tom  wrote:
>>  Is there an example of how to set up the divisor parameter in 
>> solrconfig.xml somewhere?
>
> Alas I don't know how to configure terms index divisor from Solr...
>
>>> In 4.0, w/ flex indexing, the RAM efficiency is much better -- we use large
>>> parallel arrays instead of separate objects, and we hold much less in RAM.
>>> Simply upgrading to 4.0 and re-indexing will show this gain...
>>
>> I'm looking forward to a number of the developments in 4.0, but am a bit 
>> wary of using it in production.   I've wanted to work in some tests with 
>> 4.0, but other more pressing issues have so far prevented this.
>
> Understood.
>
>> What about Lucene 2205?  Would that be a way to get some of the benefit 
>> similar to the changes in flex without the rest of the changes in flex and 
>> 4.0?
>
> 2205 was a similar idea (don't create tons of small objects), but it
> was never committed...
>
>>> I'd be really curious to test the RAM reduction in 4.0 on your terms
>>> dict/index -- is there any way I could get a copy of just the tii/tis
>>> files in your index?  Your index is a great test for Lucene!
>>
>> We haven't been able to make much data available due to copyright and other 
>> legal issues.  However, since there is absolutely no way anyone could 
>> reconstruct copyrighted works from the tii/tis index alone, that should be 
>> ok on that front.  On Monday I'll try to get legal/administrative clearance 
>> to provide the data and also ask around and see if I can get the ok to 
>> either find a spare hard drive to ship, or make some kind of sftp 
>> arrangement.  Hopefully we will find a way to be able to do this.
>
> That would be awesome, thanks!
>
>> BTW, most of the terms are probably the result of dirty OCR, and the impact
>> is probably increased by our present "punctuation filter".  When we re-index
>> we plan to use a more intelligent filter that will truncate extremely long
>> tokens on punctuation, and we also plan to do some minimal prefiltering prior
>> to sending documents to Solr for indexing.  However, since we now have
>> over 400 languages, we will have to be conservative in our filtering, since
>> we would rather index dirty OCR than risk not indexing legitimate content.
>
> Got it... it's a great test case for Lucene :)
>
> Mike
>


RE: multivalued fields in result

2010-09-12 Thread Jason Chaffee
But it doesn't seem to be returning multivalued fields that are stored.  It is
returning all of the single-valued fields, though.


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@buyways.nl]
Sent: Sat 9/11/2010 4:19 AM
To: solr-user@lucene.apache.org
Subject: RE: multivalued fields in result
 
Yes, you'll get what is stored and asked for. 
 
-Original message-
From: Jason Chaffee 
Sent: Sat 11-09-2010 05:27
To: solr-user@lucene.apache.org; 
Subject: multivalued fields in result

Is it possible to return multivalued fields in the result?

I would like to have a multivalued field that is stored and not indexed (I also 
copy the same field into another field where it is tokenized and indexed).  I 
would then like all the values of this field returned in the result set.  Is 
there a way to do this?

If it is not possible, could someone elaborate why that is so that I may see if 
I can make it work.

thanks,

Jason