Re: is replication eating up OldGen space

2011-05-30 Thread Bernd Fehling

Some more info,
after one week the servers have the following status:

Master (indexing only)
+ looks good and has heap size of about 6g from 10g OldGen
+ has meanwhile loaded the index from scratch 2 times via DIH
+ has added new documents into existing index via DIH
+ has optimized and replicated
+ no full GC within one week

Slave A (search only) Online
- looks bad and has heap size of 9.5g from 10g OldGen
+ was replicated
- several full GC

Slave B (search only) Backup
+ looks good and has heap size of 4g from 10g OldGen
+ was replicated
+ no full GC within one week

Conclusion:
+ DIH, processing, indexing, replication are fine
- the search is crap and "eats up" OldGen heap which can't be
  cleaned up by a full GC. Maybe memory leaks or whatever...

Due to this, Solr 3.1 can _NOT_ be recommended as a high-availability,
high-search-load search engine because of unclear heap problems
caused by the search. The search is used "out of the box", so there are
no self-produced programming errors.

Are there any tools available for Java to analyze this?
(like valgrind or Electric Fence for C++)

Is it possible to analyze a heap dump produced with jvisualvm?
Which tools?
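For reference, a heap dump can be taken from a running HotSpot JVM either with jmap or programmatically, and the resulting .hprof file can then be opened offline in jvisualvm, jhat, or the Eclipse Memory Analyzer (MAT) to look for leak suspects. A minimal JDK-only sketch (class and file names are arbitrary):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.lang.management.ManagementFactory;

public class HeapDump {
    public static void main(String[] args) throws Exception {
        // Proxy to the HotSpot diagnostic MBean of the current JVM.
        HotSpotDiagnosticMXBean mxBean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        File dump = new File("solr-heap.hprof");
        if (dump.exists()) {
            dump.delete(); // dumpHeap() refuses to overwrite an existing file
        }
        // true = dump only live objects (forces a full GC first)
        mxBean.dumpHeap(dump.getPath(), true);
        System.out.println("dumped=" + dump.exists());
    }
}
```

From outside the process, `jmap -dump:format=b,file=solr-heap.hprof <pid>` produces the same kind of file.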


Bernd


On 30.05.2011 15:51, Bernd Fehling wrote:

Dear list,
after switching from FAST to Solr I get the first _real_ data.
This includes search times, memory consumption, performance of Solr, ...

What I recognized so far is that something eats up my OldGen and
I assume it might be replication.

Current Data:
one master - indexing only
two slaves - search only
over 28 million docs
single instance
single core
index size 140g
current heap size 16g

After startup I have about 4g heap in use and about 3.5g of OldGen.
After one week and some replications OldGen is filled close to 100 percent.
If I start an optimize under these conditions I get a heap OOM.
So my assumption is that something is eating up my heap.

Any idea how to trace this down?

Maybe a memory leak somewhere?

Best regards
Bernd



Re: How to display solr search results in Json format

2011-05-30 Thread Romi
Thanks for the reply, but I want to know how the JSON output is produced
internally, I mean how it displays results as field:value.

-
Thanks & Regards
Romi
--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-display-solr-search-results-in-Json-format-tp3004734p3004768.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to display solr search results in Json format

2011-05-30 Thread bmdakshinamur...@gmail.com
Hi Romi,

When querying the Solr index, use 'wt=json' as part of your query string to
get the results back in json format.
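For illustration, the json response writer returns the standard response structure with each document as a map of field:value pairs, roughly in this shape (field names and values below are invented):

```json
{
  "responseHeader": {"status": 0, "QTime": 4},
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {"id": "doc1", "name": "example value"}
    ]
  }
}
```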

On Tue, May 31, 2011 at 11:35 AM, Romi  wrote:

> I have indexed all my database data in Solr, and now I want to run a search
> on it and display the results in JSON. What do I need to do for that?
>
>
> -
> Thanks & Regards
> Romi
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-display-solr-search-results-in-Json-format-tp3004734p3004734.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Thanks and Regards,
DakshinaMurthy BM


How to display solr search results in Json format

2011-05-30 Thread Romi
I have indexed all my database data in Solr, and now I want to run a search
on it and display the results in JSON. What do I need to do for that?


-
Thanks & Regards
Romi
--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-display-solr-search-results-in-Json-format-tp3004734p3004734.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing files Solr cell and Amazon S3

2011-05-30 Thread Jan Høydahl
Hi,

You can use the stream.file parameter to tell Solr to read the file from 
local disk instead of streaming it across the network:
http://lucene.472066.n3.nabble.com/Example-of-using-quot-stream-file-quot-to-post-a-binary-file-to-solr-td781172.html
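For example, a Solr Cell extract request over a local file could look like this (host, path and literal.id below are placeholders, not values from the thread; stream.file also requires remote streaming to be enabled in solrconfig.xml):

```
http://localhost:8983/solr/update/extract?stream.file=/data/docs/report.pdf&literal.id=doc1&commit=true
```

With stream.file, Solr opens the file itself, so the document bytes never travel from the client across the network.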

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 30 May 2011, at 22.46, Greg Georges wrote:

> Hello everyone,
> 
> We have our infrastructure on Amazon cloud servers, and we use the S3 file 
> system. We need to index files using Solr Cell. From what I have read, we 
> need to stream files to Solr in order for it to extract the metadata into the 
> index. If we stream data through a public url there will be costs associated 
> to the transfer on the Amazon cloud. We have planned to have a directory with 
> the files, is it possible to tell solr to add documents from a specific 
> folder location? Or must we stream them into Solr? In SolrJ I see that the 
> only option is streaming. Thank you very much.
> 
> Greg



Resolved- Re: Replication Error - Index fetch failed - File Not Found & OverlappingFileLockException

2011-05-30 Thread Renaud Delbru

Hi,

I found out the problem by myself.
The reason was a bad deployment of Solr on Tomcat. Two instances of 
Solr were instantiated instead of one. The two instances were managing 
the same indexes and therefore were trying to write at the same time.


My apologies for the noise created on the mailing list,
--
Renaud Delbru

On 30/05/11 21:52, Renaud Delbru wrote:

Hi,

For months we were using Apache Solr 3.1.0 snapshots without problems.
Recently we upgraded our index to Apache Solr 3.1.0,
and also moved to a multi-core infrastructure (4 cores per node, each
core having its own index).

We found that one of the index slaves started to show failures, i.e.,
query errors. By looking at the log, we observed some errors during the
latest snappull, due to two types of exceptions:
- java.io.FileNotFoundException: File does not exist ...
and
- java.nio.channels.OverlappingFileLockException: null

Then, after the failed pull, the index started to show some
index-related failures:

java.io.IOException: read past EOF at
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:207)]


However, after manually restarting the node, everything went back to
normal.

You can find a more detailed log at [1].

We are afraid of seeing this problem occur again. Do you have any idea
what the cause could be? Or a solution to avoid such a problem?

[1] http://pastebin.com/vbnyrUgJ

Thanks in advance




Replication Error - Index fetch failed - File Not Found & OverlappingFileLockException

2011-05-30 Thread Renaud Delbru

Hi,

For months we were using Apache Solr 3.1.0 snapshots without problems.
Recently we upgraded our index to Apache Solr 3.1.0,
and also moved to a multi-core infrastructure (4 cores per node, each 
core having its own index).


We found that one of the index slaves started to show failures, i.e., 
query errors. By looking at the log, we observed some errors during the 
latest snappull, due to two types of exceptions:

- java.io.FileNotFoundException: File does not exist ...
and
- java.nio.channels.OverlappingFileLockException: null

Then, after the failed pull, the index started to show some 
index-related failures:


java.io.IOException: read past EOF at 
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:207)]


However, after manually restarting the node, everything went back to normal.

You can find a more detailed log at [1].

We are afraid of seeing this problem occur again. Do you have any idea 
what the cause could be? Or a solution to avoid such a problem?


[1] http://pastebin.com/vbnyrUgJ

Thanks in advance
--
Renaud Delbru


Indexing files Solr cell and Amazon S3

2011-05-30 Thread Greg Georges
Hello everyone,

We have our infrastructure on Amazon cloud servers, and we use the S3 file 
system. We need to index files using Solr Cell. From what I have read, we need 
to stream files to Solr in order for it to extract the metadata into the index. 
If we stream data through a public url there will be costs associated to the 
transfer on the Amazon cloud. We have planned to have a directory with the 
files, is it possible to tell solr to add documents from a specific folder 
location? Or must we stream them into Solr? In SolrJ I see that the only option 
is streaming. Thank you very much.

Greg


Solr Dismax bf & bq vs. q:{boost ...}

2011-05-30 Thread chazzuka
I tried to do this:

#1. search phrases in title^3 & text^1
#2. based on result #1 add boost for field closed:0^2
#3. based on result in #2 boost based on last_modified
 
and I tried it like this:

/solr/select
?q={!boost b=$dateboost v=$qq defType=dismax}
&dateboost=recip(ms(NOW/HOUR,modified),8640,2,1)
&qq=video
&qf=title^3+text
&pf=title^3+text
&bq=closed:0^2
&debugQuery=true

Then I tried a different approach by changing solrconfig.xml like this:

<str name="qf">title^3 text</str>
<str name="pf">title^3 text</str>
<str name="bf">recip(ms(NOW/HOUR,modified),8640,2,1)</str>
<str name="bq">closed:0^2</str>

with query:
/solr/select
?q=video
&debugQuery=true

Both seem to give wrong results. Does anyone have an idea about doing those tasks?

Thanks in advance



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Dismax-bf-bq-vs-q-boost-tp3003028p3003028.html
Sent from the Solr - User mailing list archive at Nabble.com.


Explain the difference in similarity and similarityProvider

2011-05-30 Thread Brian Lamb
I'm looking over the patch notes from
https://issues.apache.org/jira/browse/SOLR-2338 and I do not understand the
difference between


  param value


and


  is there an echo?


When would I use one over the other?

Thanks,

Brian Lamb


Re: SOLR-1155 on 3.1

2011-05-30 Thread Otis Gospodnetic
I think the answers to both are negative.

Vote for it!

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Ofer Fort 
> To: solr-user@lucene.apache.org
> Sent: Mon, May 30, 2011 7:50:15 AM
> Subject: SOLR-1155 on 3.1
> 
> Hey all,
> In the last comment on SOLR-1155 by Jayson Minard (
>https://issues.apache.org/jira/browse/SOLR-1155?focusedCommentId=13019955&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13019955
> )
> "I'll look at updating this for 3.1"
> was it integrated into 3.1? If not, is there a patch one can use?
> thanks
> 


Re: Can we stream binary data with StreamingUpdateSolrServer ?

2011-05-30 Thread Otis Gospodnetic
I'm not looking at the source code, but this doesn't sound right.  I think it 
uses javabin.
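If it does, switching SolrJ to javabin on the client side is just a matter of setting the request writer; a sketch (untested here, assuming the Solr 1.4 SolrJ API, where StreamingUpdateSolrServer inherits setRequestWriter() from CommonsHttpSolrServer; URL, queue size and thread count are placeholder values):

```java
// Sketch: send updates as javabin instead of XML.
StreamingUpdateSolrServer server =
    new StreamingUpdateSolrServer("http://localhost:8983/solr", 20, 4);
server.setRequestWriter(new BinaryRequestWriter());
```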

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: pravesh 
> To: solr-user@lucene.apache.org
> Sent: Mon, May 30, 2011 8:40:28 AM
> Subject: Can we stream binary data with StreamingUpdateSolrServer ?
> 
> Hi,
> 
> I'm using StreamingUpdateSolrServer to post a batch of content to Solr 1.4.1.
> By looking at the StreamingUpdateSolrServer code, it looks like it only
> provides the content to be streamed in XML format.
> 
> Can we use it to stream data in binary format?
> 
> 
> 
> --
> View this message in  context: 
>http://lucene.472066.n3.nabble.com/Can-we-stream-binary-data-with-StreamingUpdateSolrServer-tp3001813p3001813.html
>
> Sent  from the Solr - User mailing list archive at Nabble.com.
> 


Re: n-gram speed

2011-05-30 Thread Otis Gospodnetic
Denis,

Also, what are your documents and queries like? Maybe give a few examples so 
we can help.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Tor Henning Ueland 
> To: solr-user@lucene.apache.org
> Sent: Mon, May 30, 2011 8:40:34 AM
> Subject: Re: n-gram speed
> 
> 2011/5/30 Denis Kuzmenok :
> > I have a database with an n-gram field, about 5 million documents. QTime
> > is about 200-1000 ms; the index is not optimized because it must reply
> > to queries all the time and data are updated often. Is this normal?
> > Solr: 3.1, java -Xms2048M -Xmx4096M
> > Server: i7, 12Gb
> 
> Start by optimizing it; it won't "stop working" due to an optimize. Some
> other vital info is the size of the index, disk type used, etc. (SSD,
> SATA, IDE...)
> 
> -- 
> Mvh
> Tor  Henning Ueland
> 


Re: DataImportHandler

2011-05-30 Thread Jeffrey Chang
I faced the same problem before; in my case some parent classloader had 
loaded the DataImportHandler class instead of the SolrResourceLoader's 
delegated classloader.

How are you starting your Solr? Via Eclipse? If you start Solr from the 
command line, do you encounter the same issue?
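One thing worth checking (a sketch, assuming the stock Solr 3.1 example layout; the dir and regex values are placeholders): the DataImportHandler jar should be loaded through solrconfig.xml's lib directive so that Solr's own classloader picks it up, e.g.:

```xml
<!-- in solrconfig.xml; dir is resolved relative to the core's instance dir -->
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />
```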



On May 30, 2011, at 9:28 PM, adpablos  wrote:

> Hi,
> 
> I've tried to install DataImportHandler but I have some problems when
> starting up Solr.
> 
> 
> GRAVE: org.apache.solr.common.SolrException: Error Instantiating Request
> Handler, 
> org.apache.solr.handler.dataimport.DataImportHandler is not a
> org.apache.solr.request.SolrRequestHandler
> 
> This is the log.
> 
> I've 
> 
> <requestHandler name="/dataimport"
>     class="org.apache.solr.handler.dataimport.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">db-data-config.xml</str>
>   </lst>
> </requestHandler>
> 
> in my solrconfig.xml
> 
> I'm working in a Java project, and in my Eclipse project I can write
> something like this without a problem:
> SolrRequestHandler srh = new DataImportHandler();
> 
> Sorry about my English, and thank you in advance.
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/DataImportHandler-tp3001957p3001957.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: return unaltered complete multivalued fields with Highlighted results

2011-05-30 Thread lboutros
Hi Alexei,

We have the same issue/behavior.
The highlighting component fragments the fields to highlight and chooses
the best fragments to be returned and highlighted.
You can return all fragments with the maximum size for each one, but it
will never return fragments with a score equal to 0, i.e., fragments
without any matched words.
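For reference, the "all fragments at maximum size" behaviour is driven by query parameters along these lines (field name and snippet count are placeholders; hl.fragsize=0 asks for the whole field value as a single fragment):

```
q=value2 something&hl=true&hl.fl=myfield&hl.snippets=50&hl.fragsize=0
```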

To return the whole multi-valued field, the highlighting component needs
to be modified for this specific case.
That is something we plan to do in the next few weeks.

If I missed something, I would be happy to find another solution too :)

Ludovic.

-
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/return-unaltered-complete-multivalued-fields-with-Highlighted-results-tp2967146p3002357.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr 3.1 commit errors

2011-05-30 Thread Denis Kuzmenok
After a restart I have these errors every time I do a commit via post.jar.

Config: multicore / 5 cores, Solr 3.1

Lock obtain timed out:
SimpleFSLock@/home/ava/solr/example/multicore/context/data/index/write.lock

org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out:
SimpleFSLock@/home/ava/solr/example/multicore/context/data/index/write.lock
  at org.apache.lucene.store.Lock.obtain(Lock.java:84)
  at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1097)
  at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:83)
  at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:102)
  at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:174)
  at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:222)
  at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
  at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:147)
  at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
  at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
  at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
  at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
  at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
  at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
  at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
  at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
  at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
  at org.mortbay.j

I tried to Google a bit, but without any luck...



Re: return unaltered complete multivalued fields with Highlighted results

2011-05-30 Thread alexei
Thank you for the reply, Erick. 
I can return the stored content, but I would like to show the highlighted
results. 
With multivalued fields there seems to be some sorting of highlighted
results (in order of importance?) going on.
The problems are: 
1 - I could not find a way to keep the original order of my text. 
2 - I could not display all of the values in my multivalued field.

So if I have a multivalued field with four values:
value1
value2 with text
value3 
value4 and something

and the search is: "value2 something"

the highlighted result would be:
value2 with text
value4 and something

value1 and value3 will be skipped completely. When a field is not
multivalued, everything works as advertised.

Any suggestions? 

Regards,
Alexei

--
View this message in context: 
http://lucene.472066.n3.nabble.com/return-unaltered-complete-multivalued-fields-with-Highlighted-results-tp2967146p3002248.html
Sent from the Solr - User mailing list archive at Nabble.com.


Spellcheck component not returned with numeric queries

2011-05-30 Thread Markus Jelsma
Hi,

The spellcheck component's output is not written when sending queries that 
consist of numbers only. Clients depending on the availability of the 
spellcheck output need to check whether the output is actually there.

This is with a very recent Solr 3.x checkout. Is this a feature or a bug? 
Should I file an issue?

Cheers,
-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


is replication eating up OldGen space

2011-05-30 Thread Bernd Fehling

Dear list,
after switching from FAST to Solr I get the first _real_ data.
This includes search times, memory consumption, performance of Solr, ...

What I recognized so far is that something eats up my OldGen and
I assume it might be replication.

Current Data:
one master - indexing only
two slaves - search only
over 28 million docs
single instance
single core
index size 140g
current heap size 16g

After startup I have about 4g heap in use and about 3.5g of OldGen.
After one week and some replications OldGen is filled close to 100 percent.
If I start an optimize under these conditions I get a heap OOM.
So my assumption is that something is eating up my heap.

Any idea how to trace this down?

Maybe a memory leak somewhere?
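One low-cost way to narrow this down (a sketch; this flag set targets the Sun/Oracle JVMs of that era, and the log path is a placeholder) is to run the slave with GC logging enabled and watch what survives each full GC:

```
-Xms16g -Xmx16g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/solr-gc.log
```

If OldGen usage keeps climbing after full GCs, the retained objects can then be identified from a heap dump.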

Best regards
Bernd

--
*
Bernd Fehling                      Universitätsbibliothek Bielefeld
Dipl.-Inform. (FH)                 Universitätsstr. 25
Tel. +49 521 106-4060   Fax. +49 521 106-4052
bernd.fehl...@uni-bielefeld.de     33615 Bielefeld

BASE - Bielefeld Academic Search Engine - www.base-search.net
*


DataImportHandler

2011-05-30 Thread adpablos
Hi,

I've tried to install DataImportHandler but I have some problems when
starting up Solr.


GRAVE: org.apache.solr.common.SolrException: Error Instantiating Request
Handler, 
org.apache.solr.handler.dataimport.DataImportHandler is not a
org.apache.solr.request.SolrRequestHandler

This is the log.

I've 

<requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>

in my solrconfig.xml

I'm working in a Java project, and in my Eclipse project I can write
something like this without a problem:
SolrRequestHandler srh = new DataImportHandler();

Sorry about my English, and thank you in advance.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/DataImportHandler-tp3001957p3001957.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: n-gram speed

2011-05-30 Thread Tor Henning Ueland
2011/5/30 Denis Kuzmenok :
> I have a database with an n-gram field, about 5 million documents. QTime
> is about 200-1000 ms; the index is not optimized because it must reply
> to queries all the time and data are updated often. Is this normal?
> Solr: 3.1, java -Xms2048M -Xmx4096M
> Server: i7, 12Gb

Start by optimizing it; it won't "stop working" due to an optimize. Some
other vital info is the size of the index, disk type used, etc. (SSD,
SATA, IDE...)

-- 
Mvh
Tor Henning Ueland


Can we stream binary data with StreamingUpdateSolrServer ?

2011-05-30 Thread pravesh
Hi,

I'm using StreamingUpdateSolrServer to post a batch of content to Solr 1.4.1.
By looking at the StreamingUpdateSolrServer code, it looks like it only
provides the content to be streamed in XML format.

Can we use it to stream data in binary format?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-we-stream-binary-data-with-StreamingUpdateSolrServer-tp3001813p3001813.html
Sent from the Solr - User mailing list archive at Nabble.com.


SOLR-1155 on 3.1

2011-05-30 Thread Ofer Fort
Hey all,
In the last comment on SOLR-1155 by Jayson Minard (
https://issues.apache.org/jira/browse/SOLR-1155?focusedCommentId=13019955&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13019955
)
"I'll look at updating this for 3.1"
was it integrated into 3.1? If not, is there a patch one can use?
thanks


collapse component with pivot faceting

2011-05-30 Thread Isha Garg

Hi All!

Can anyone tell me how pivot faceting works in combination
with field collapsing?

Please guide me in this respect.


Thanks!
Isha Garg


n-gram speed

2011-05-30 Thread Denis Kuzmenok
I have a database with an n-gram field, about 5 million documents. QTime
is about 200-1000 ms; the index is not optimized because it must reply
to queries all the time and data are updated often. Is this normal?
Solr: 3.1, java -Xms2048M -Xmx4096M
Server: i7, 12Gb




Re: wildcards and German umlauts

2011-05-30 Thread Jan Høydahl
Hi,

Agree that this is annoying for foreign languages. I get the idea behind the 
original behaviour, but there could be more elegant ways of handling it. It 
would make sense to always run the CharFilters. Perhaps a mechanism where 
TokenFilters can be tagged for exclusion from wildcard terms would be an idea. 
That way we could skip stemming, synonyms and phonetics for wildcard terms, but 
still do lowercasing and character normalization.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 29 May 2011, at 19.24, mdz-munich wrote:

> Ah, NOW I got it. It's not a bug, it's a feature. 
> 
> But that would mean that every character manipulation (e.g.
> char mapping/replacement, the Porter stemmer in some cases ...) would cause
> a wildcard query to fail. That's too bad.
> 
> But why? What's the problem with passing the prefix through the
> analyzer/filter chain?
> 
> Greetz,
> 
> Sebastian
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/wildcards-and-German-umlauts-tp499972p2999237.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Problem with spellchecking, dont want multiple request to SOLR

2011-05-30 Thread Jan Høydahl
Hi,

Define two searchComponents with different names. Then refer to both in
<arr name="last-components"> in your Search Request Handler config.
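A sketch of that layout (component and field names follow the config quoted below; the spellchecker bodies are elided, and the handler name is a placeholder):

```xml
<searchComponent name="spellcheck_what" class="solr.SpellCheckComponent">
  <!-- spellchecker definition for the spell_what field -->
</searchComponent>

<searchComponent name="spellcheck_where" class="solr.SpellCheckComponent">
  <!-- spellchecker definition for the spell_where field -->
</searchComponent>

<requestHandler name="/search" class="solr.SearchHandler">
  <arr name="last-components">
    <str>spellcheck_what</str>
    <str>spellcheck_where</str>
  </arr>
</requestHandler>
```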

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 27. mai 2011, at 10.01, roySolr wrote:

> Hmm, OK. I configured 2 spellcheckers:
> 
> 
>
>   spell_what
>   spell_what
>   true
>   spellchecker_what
>   
>   
>   spell_where
>   spell_where
>   true
>   spellchecker_where
>   
> 
> 
> How can I enable them in my search request handler and search both in one
> request?
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Problem-with-spellchecking-dont-want-multiple-request-to-SOLR-tp2988167p2992076.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Bulk indexing, UpdateProcessor overwriteDupes and poor IO performances

2011-05-30 Thread Tanguy Moal

Hello,

Sorry for re-posting this, but it seems my message got lost in the 
mailing list's message stream without catching anyone's attention... =D


Shortly: has anyone already experienced dramatic indexing slowdowns 
during large bulk imports with overwriteDupes turned on and a fairly 
high duplicate rate (around 4-8x)?


It seems to produce a lot of deletions, which in turn appear to make the 
merging of segments pretty slow, by substantially increasing the number of 
small read operations occurring simultaneously with the regular large 
write operations of the merge. Added to the poor IO performance of a 
commodity SATA drive, indexing takes ages.


I temporarily bypassed that limitation by disabling the overwriting of 
duplicates, but that changes the way I request the index, requiring me 
to turn on field collapsing at search time.


Is this a known limitation ?

Does anyone have a few hints on how to optimize the handling of index-time 
deduplication?
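For context, the index-time deduplication under discussion is configured via an update request processor chain along these lines (a sketch following the Solr wiki's Deduplication example; the signatureField, fields, and signatureClass values are placeholders, not the thread's actual config):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <!-- true makes each add also delete older docs with the same signature -->
    <bool name="overwriteDupes">true</bool>
    <str name="fields">title,content</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```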


More details on my setup and the state of my understanding are in my 
previous message here-after.


Thank you very much in advance.

Regards,

Tanguy

On 05/25/11 15:35, Tanguy Moal wrote:

Dear list,

I'm posting here after some unsuccessful investigations.
In my setup I push documents to Solr using the StreamingUpdateSolrServer.

I'm sending a comfortable initial amount of documents (~250M) and 
wished to perform overwriting of duplicated documents at index time, 
during the update, taking advantage of the UpdateProcessorChain.


At the beginning of the indexing stage, everything is quite fast; 
documents arrive at a rate of about 1000 doc/s.
The only extra processing during the import is computation of a couple 
of hashes that are used to identify uniquely documents given their 
content, using both stock (MD5Signature) and custom (derived from 
Lookup3Signature) update processors.

I send a commit command to the server every 500k documents sent.

During a first period, the server is CPU bound. After a short while 
(~10 minutes), the rate at which documents are received starts to fall 
dramatically, the server being IO bound.
I've been firstly thinking of a normal speed decrease during the 
commit, while my push client is waiting for the flush to occur. That 
would have been a normal slowdown.


The thing that caught my attention was the fact that, unexpectedly, 
the server was performing a lot of small reads, way more than the number 
of writes, which seem to be larger.
The combination of the many small reads with the constant amount of 
bigger writes seems to be creating a lot of IO contention on my 
commodity SATA drive, and the ETA of my built index started to 
increase scarily =D


I then restarted the JVM with JMX enabled so I could start 
investigating a little bit more. I then realized that the 
UpdateHandler was performing many reads while processing the update 
request.


Are there any known limitations around the UpdateProcessorChain, when 
overwriteDupes is set to true ?
I turned that off, which of course breaks the intent of my built 
index, but for comparison purposes it's good.


That did the trick, indexing is fast again, even with the periodic 
commits.


I therefore have two questions, an interesting first one and a boring 
second one:


1 / What's the workflow of the UpdateProcessorChain when one or more 
processors have overwriting of duplicates turned on ? What happens 
under the hood ?


I tried to answer that myself looking at DirectUpdateHandler2 and my 
understanding stopped at the following :

- The document is added to the lucene IW
- The duplicates are deleted from the lucene IW
The dark magic I couldn't understand seems to occur around the idTerm 
and updateTerm things, in the addDoc method. The deletions seem to be 
buffered somewhere, I just didn't get it :-)


I might be wrong since I didn't read the code more than that, but the 
point might be how Solr handles deletions, which is something still 
unclear to me. In any case, a lot of reads seem to occur for that 
precise task and it tends to produce a lot of IO, killing indexing 
performance when overwriteDupes is on. I don't even understand why so 
many read operations occur at this stage, since my process had a 
comfortable amount of RAM (Xms=Xmx=8GB), with only 4.5GB used 
so far.


Any help, recommendation or idea is welcome :-)

2 / In case there isn't a simple fix for this, I'll have to live with 
duplicates in my index. I don't mind, since Solr offers a great 
grouping feature, which I already use in some other applications. The 
only thing I don't know yet: if I rely on grouping at search time, in 
combination with the Stats component (which is the intent of that 
index), limiting the results to 1 document per group, will the 
computed statistics take those duplicates into account or not? 
Shortly, how well does the Stats component behave when combined with 
hits collapsing?


I had firstly implemented my solution using overwriteDupes becau

Re: Problem with caps and star symbol

2011-05-30 Thread Saumitra Chowdhury
I am sending some of the XML responses to illustrate the scenario.

Indexed term = ROLE_DELETE
Search Term = roledelete
(status 0, QTime 4; params: indent=on, start=0, q=name : roledelete,
version=2.2, rows=10)
Result: empty, no documents found.


Indexed term = ROLE_DELETE
Search Term = role
(status 0, QTime 5; params: indent=on, start=0, q=name : role,
version=2.2, rows=10)
Result: two documents, each containing:
    Mon May 30 13:09:14 BDST 2011
    Global Role for Deletion
    role:9223372036854775802
    Mon May 30 13:09:14 BDST 2011
    ROLE_DELETE


Indexed term = ROLE_DELETE
Search Term = role*
(status 0, QTime 4; params: indent=on, start=0, q=name : role*,
version=2.2, rows=10)
Result: one document, containing:
    Mon May 30 13:09:14 BDST 2011
    Global Role for Deletion
    role:9223372036854775802
    Mon May 30 13:09:14 BDST 2011
    ROLE_DELETE


Indexed term = ROLE_DELETE
Search Term = Role*
(status 0, QTime 4; params: indent=on, start=0, q=name : Role*,
version=2.2, rows=10)
Result: empty, no documents found.


Indexed term = ROLE_DELETE
Search Term = ROLE_DELETE*
(status 0, QTime 4; params: indent=on, start=0, q=name : ROLE_DELETE*,
version=2.2, rows=10)
Result: empty, no documents found.


I am also attaching an analysis HTML page.



On Mon, May 30, 2011 at 7:19 AM, Erick Erickson wrote:

> I'd start by looking at the analysis page from the Solr admin page. That
> will give you an idea of the transformations the various steps carry out,
> it's invaluable!
>
> Best
> Erick
> On May 26, 2011 12:53 AM, "Saumitra Chowdhury" <
> saumi...@smartitengineering.com> wrote:
> > Hi all,
> > In my schema.xml I am using WordDelimiterFilterFactory,
> > LowerCaseFilterFactory and StopFilterFactory for the index analyzer, and
> > an extra SynonymFilterFactory for the query analyzer. I am indexing a
> > field named '*name*'. Now if a value with all caps like "NAME_BILL" is
> > indexed, I am able to get this as a search result with the terms
> > " *name_bill *", " *NAME_BILL *", " *namebill *", "*namebill** ",
> > " *nameb** " ... But for terms like " * NAME_BILL** ", " *name_bill** ",
> > " *namebill** ", " *NAME** " the result does not show this document. Can
> > anyone please explain why this is happening? In fact star " * " is not
> > giving any result in many cases, especially if it is used after the full
> > value of a field.
> >
> > Portion of my schema is given below.
> >
> > <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="..."/>
> >     <filter class="solr.WordDelimiterFilterFactory" ...
> >         generateNumberParts="0" catenateWords="1" catenateNumbers="1"
> >         catenateAll="0"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.StopFilterFactory"
> >         words="stopwords.txt" enablePositionIncrements="true"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="..."/>
> >     <filter class="solr.WordDelimiterFilterFactory" ...
> >         generateNumberParts="0" catenateWords="1" catenateNumbers="1"
> >         catenateAll="0"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.SynonymFilterFactory" ...
> >         ignoreCase="true" expand="true"/>
> >     <filter class="solr.StopFilterFactory"
> >         words="stopwords.txt" enablePositionIncrements="true"/>
> >   </analyzer>
> > </fieldType>
> >
> > <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
> >   <analyzer>
> >     <tokenizer class="..."/>
> >     <filter class="solr.WordDelimiterFilterFactory" ...
> >         generateNumberParts="0" catenateWords="1" catenateNumbers="1"
> >         catenateAll="0"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.SynonymFilterFactory" ...
> >         ignoreCase="true" expand="false"/>
> >     <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
> >   </analyzer>
> > </fieldType>
> > 
>