Query performance very slow even after autowarming

2010-12-02 Thread johnnyisrael

Hi,

I am using EdgeNGramFilterFactory on Solr 1.4.1 for my indexing.

Each document will have about 5 fields in it and only one field is indexed
with EdgeNGramFilterFactory.
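For context, a field type using EdgeNGramFilterFactory is typically declared
along these lines in schema.xml (the gram sizes and names here are illustrative,
not necessarily the actual settings in this setup):

<fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" side="front" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>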

I have about 1.4 million documents in my index now and my index size is
approx 296MB.

I made the field that is indexed with EdgeNGramFilterFactory the default
search field. All my query responses are very slow, some of them taking more
than 10 seconds to respond. Even queries with single letters are very slow,
for example:

/select/?q=m

So I tried query warming as follows.

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">a</str></lst>
    <lst><str name="q">b</str></lst>
    <lst><str name="q">c</str></lst>
    <lst><str name="q">d</str></lst>
    <lst><str name="q">e</str></lst>
    <lst><str name="q">f</str></lst>
    <lst><str name="q">g</str></lst>
    <lst><str name="q">h</str></lst>
    <lst><str name="q">i</str></lst>
    <lst><str name="q">j</str></lst>
    <lst><str name="q">k</str></lst>
    <lst><str name="q">l</str></lst>
    <lst><str name="q">m</str></lst>
    <lst><str name="q">n</str></lst>
    <lst><str name="q">o</str></lst>
    <lst><str name="q">p</str></lst>
    <lst><str name="q">q</str></lst>
    <lst><str name="q">r</str></lst>
    <lst><str name="q">s</str></lst>
    <lst><str name="q">t</str></lst>
    <lst><str name="q">u</str></lst>
    <lst><str name="q">v</str></lst>
    <lst><str name="q">w</str></lst>
    <lst><str name="q">x</str></lst>
    <lst><str name="q">y</str></lst>
    <lst><str name="q">z</str></lst>
  </arr>
</listener>

The same above is done for firstSearcher as well.

My cache settings are as follows.







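For reference, autowarm-enabled caches in solrconfig.xml are typically declared
like this (the sizes below are illustrative placeholders, not the actual values
from this message):

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
<documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>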
Still, after query warming, a few single-character searches are taking up to 3
seconds to respond.

Am I doing anything wrong in my cache or autowarm settings, or am I missing
anything here?

Thanks,

Johnny


Solr Multi-thread Update Transaction Control

2010-12-02 Thread wangjb

Hi,
 Now we are using Solr 1.4.1 and have encountered a problem.
 When multiple threads update Solr data at the same time, can every thread
have its own separate transaction?

 If this is possible, how can we achieve it?
 Is there any suggestion here?
 Awaiting your reply.
 Thank you for any useful reply.





PDF text extracted without spaces

2010-12-02 Thread Ganesh
Hello all,

I know this is not the right group to ask this question, but I thought some of
you might have experienced it.

I am a newbie with Tika, using the latest version, 0.8. I extracted text
from a PDF document but found the spaces and newlines missing. Indexing that
data gives wrong results. Could anyone in this group help me? I am using Tika
directly to extract the contents, which later get indexed.

Regards
Ganesh
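
For reference, the kind of standalone Tika call involved would look roughly
like this (a sketch only; the file name is a placeholder, not a file from this
thread):

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaPdfExtract {
    public static void main(String[] args) throws Exception {
        // "sample.pdf" is a placeholder path
        InputStream in = new FileInputStream("sample.pdf");
        try {
            // -1 disables the default write limit on the handler
            BodyContentHandler handler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            // AutoDetectParser delegates PDFs to the PDFBox-based parser
            new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
            // If spaces/newlines are already missing here, the problem is in
            // extraction, before the text ever reaches Solr
            System.out.println(handler.toString());
        } finally {
            in.close();
        }
    }
}

Running something like this against the same PDF shows whether the whitespace
is lost by Tika/PDFBox itself or somewhere later in the indexing pipeline.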


Limit number of characters returned

2010-12-02 Thread Mark

Is there a way to limit the number of characters returned from a stored field?

For example:

Say I have a document (~2K words) and I search for a word that's 
somewhere in the middle. I would like the document to match the search 
query, but the stored field should only return the first 200 characters 
of the document. Is there any way to accomplish this that doesn't involve 
two fields?


Thanks


Re: solr/admin/dataimport Not Found

2010-12-02 Thread Ruixiang Zhang
Thank you so much, Koji, the example-DIH works. I'm reading for details...

Richard


On Thu, Dec 2, 2010 at 4:39 PM, Koji Sekiguchi  wrote:

> (10/12/03 9:29), Ruixiang Zhang wrote:
>
>> Hi Koji
>>
>> Thanks for your reply.
>> I pasted the wrong link.
>> Actually I tried this first http://mydomain.com:8983/solr/dataimport
>> It didn't work.
>> The page should be there after installation, right? Did I miss something?
>>
>> Thanks a lot!
>> Richard
>>
>
> For that URL to work, you have to have a request handler in your
> solrconfig.xml:
>
>  <requestHandler name="/dataimport"
>      class="org.apache.solr.handler.dataimport.DataImportHandler">
>    <lst name="defaults">
>      <str name="config">db-data-config.xml</str>
>    </lst>
>  </requestHandler>
>
> If you try DIH for the first time, please read
> solr/example/example-DIH/README.txt
> and try example-DIH first.
>
>
> Koji
> --
> http://www.rondhuit.com/en/
>


Re: ramBufferSizeMB not reflected in segment sizes in index

2010-12-02 Thread Michael McCandless
On Thu, Dec 2, 2010 at 4:31 PM, Burton-West, Tom  wrote:

> We turned on infostream.   Is there documentation about how to interpret it, 
> or should I just grep through the codebase?

There isn't any documentation... and it changes over time as we add
new diagnostics.

> Is the excerpt below what I am looking for as far as understanding the 
> relationship between ramBufferSize and size on disk?
> is newFlushedSize the size on disk in bytes?

Yes -- so IW's buffer was using 329.782 MB RAM, and was flushed to a
69,848,046 byte segment.

Mike
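
As a rough cross-check against the infostream excerpt in Tom's original
message (quoted later in this digest), the new/old percentage is simply the
flushed segment size divided by the RAM the writer was using:

  new/old = newFlushedSize / ramUsed
  74,520,060 bytes / (329.782 MB * 1,048,576 bytes/MB) ~= 0.2155  -> "new/old=21.55%"
  69,848,046 bytes / (325.772 MB * 1,048,576 bytes/MB) ~= 0.2045  -> "new/old=20.447%"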


Re: solr/admin/dataimport Not Found

2010-12-02 Thread Koji Sekiguchi

(10/12/03 9:29), Ruixiang Zhang wrote:

Hi Koji

Thanks for your reply.
I pasted the wrong link.
Actually I tried this first http://mydomain.com:8983/solr/dataimport
It didn't work.
The page should be there after installation, right? Did I miss something?

Thanks a lot!
Richard


For that URL to work, you have to have a request handler in your solrconfig.xml:

  <requestHandler name="/dataimport"
      class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
  </requestHandler>

If you try DIH for the first time, please read 
solr/example/example-DIH/README.txt
and try example-DIH first.

Koji
--
http://www.rondhuit.com/en/
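
For reference, a minimal db-data-config.xml for a MySQL import along the lines
discussed in this thread could look like the following (the connection URL,
table and column names are placeholders, not taken from Richard's setup):

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/mydb"
              user="dbuser" password="dbpass"/>
  <document>
    <entity name="item" query="SELECT id, name, description FROM item">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
      <field column="description" name="description"/>
    </entity>
  </document>
</dataConfig>

With the request handler above in place, a full import is then triggered with
http://host:8983/solr/dataimport?command=full-import.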


Re: solr/admin/dataimport Not Found

2010-12-02 Thread Ruixiang Zhang
Hi Koji

Thanks for your reply.
I pasted the wrong link.
Actually I tried this first http://mydomain.com:8983/solr/dataimport
It didn't work.
The page should be there after installation, right? Did I miss something?

Thanks a lot!
Richard





On Thu, Dec 2, 2010 at 4:23 PM, Koji Sekiguchi  wrote:

> (10/12/03 8:58), Ruixiang Zhang wrote:
>
>> I tried to import data from mysql. When I tried to run
>> http://mydomain.com:8983/solr/admin/dataimport , I got these error
>> message:
>>
>> HTTP ERROR: 404
>>
>> NOT_FOUND
>>
>> RequestURI=/solr/admin/dataimport
>>
>> Powered by Jetty://
>>
>> Any help will be appreciated!!!
>> Thanks
>> Richard
>>
>
> Richard,
>
> Usually, it should be http://mydomain.com:8983/solr/dataimport
>
> Koji
> --
> http://www.rondhuit.com/en/
>


Re: solr/admin/dataimport Not Found

2010-12-02 Thread Koji Sekiguchi

(10/12/03 8:58), Ruixiang Zhang wrote:

I tried to import data from mysql. When I tried to run
http://mydomain.com:8983/solr/admin/dataimport , I got this error message:

HTTP ERROR: 404

NOT_FOUND

RequestURI=/solr/admin/dataimport

Powered by Jetty://
Any help will be appreciated!!!
Thanks
Richard


Richard,

Usually, it should be http://mydomain.com:8983/solr/dataimport

Koji
--
http://www.rondhuit.com/en/


solr/admin/dataimport Not Found

2010-12-02 Thread Ruixiang Zhang
I tried to import data from mysql. When I tried to run
http://mydomain.com:8983/solr/admin/dataimport , I got this error message:

HTTP ERROR: 404

NOT_FOUND

RequestURI=/solr/admin/dataimport

Powered by Jetty://
Any help will be appreciated!!!
Thanks
Richard


Re: TermsComponent prefix query with field's analyzers

2010-12-02 Thread Ahmet Arslan
> Does anyone know how to apply some analyzers over a prefix
> query?

Lucene has a special QueryParser for this.

http://lucene.apache.org/java/3_0_2/api/contrib-misc/org/apache/lucene/queryParser/analyzing/AnalyzingQueryParser.html

Someone provided a patch to use it in Solr. It was an attachment to a thread on
Nabble; I couldn't find it just now.

Similar discussion : http://search-lucene.com/m/oMtRJQPgGb1/
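
A minimal sketch of using that contrib parser directly (the field name,
analyzer and prefix are illustrative; to actually drop accents you would plug
in an analyzer chain that lowercases and ASCII-folds):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.analyzing.AnalyzingQueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class AnalyzedPrefixExample {
    public static void main(String[] args) throws Exception {
        // Unlike the standard QueryParser, AnalyzingQueryParser also runs the
        // analyzer over wildcard, prefix and fuzzy terms.
        AnalyzingQueryParser parser = new AnalyzingQueryParser(
                Version.LUCENE_30, "title", new StandardAnalyzer(Version.LUCENE_30));
        Query q = parser.parse("Análi*");  // the prefix itself gets analyzed
        System.out.println(q);
    }
}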


  


Cannot start Solr anymore

2010-12-02 Thread Ruixiang Zhang
Hi, I'm new here.

First, could anyone tell me how to restart solr?

I started solr and killed the process. Then when I tried to start it again,
it failed:

$ java -jar start.jar
2010-12-02 14:28:00.011::INFO:  Logging to STDERR via
org.mortbay.log.StdErrLog
2010-12-02 14:28:00.099::INFO:  jetty-6.1.3
2010-12-02 14:28:00.231::WARN:  Failed startup of context
org.mortbay.jetty.webapp.webappcont...@73901437
{/solr,jar:file:/.../solr/apache-solr-1.4.1/example/webapps/solr.war!/}
java.util.zip.ZipException: invalid END header (bad central directory
offset)


Thanks
Richard


Re: spatial query parsing error: org.apache.lucene.queryParser.ParseException

2010-12-02 Thread Dennis Gearon
It WORKED

Thank you so much everybody!

I feel like jumping up and down like 'Hiro' on Heroes

 Dennis Gearon
- Original Message - From: "Dennis Gearon" 

To: 
Sent: Wednesday, December 01, 2010 7:51 PM
Subject: spatial query parsing error: 
org.apache.lucene.queryParser.ParseException


I am trying to get spatial search to work on my Solr installation. I am running
version 1.4.1 with the Jayway Team spatial-solr-plugin. I am performing the
search with the following url:

http://localhost:8080/solr/select?wt=json&indent=true&q=title:Art%20Loft{!spatial%20lat=37.326375%20lng=-121.892639%20radius=3%20unit=km%20threadCount=3}



The result that I get is the following error:

HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Cannot parse
'title:Art Loft{!spatial lat=37.326375 lng=-121.892639 radius=3 unit=km
threadCount=3}': Encountered "  "lng=-121.892639 "" at line 1,
column 38. Was expecting: "}"

Not sure why it would be complaining about the lng parameter in the query. I
double-checked to make sure that I had the right name for the longitude field in
my solrconfig.xml file.

Any help/suggestions would be greatly appreciated

Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a better
idea to learn from others’ mistakes, so you do not have to make them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.


Re: Joining Fields in an Index

2010-12-02 Thread Adam Estrada
Hi,

I was hoping to do it directly in the index but it was more out of curiosity 
than anything. I can certainly map it in the DAO but again...I was hoping to 
learn if it was possible in the index.

Thanks for the feedback!

Adam

On Dec 2, 2010, at 5:48 PM, Savvas-Andreas Moysidis wrote:

> Hi,
> 
> If you are able to do a full re-index then you could index the full names
> and not the codes. When you later facet on the Country field you'll get the
> actual name rather than the code.
> If you are not able to re-index then probably this conversion could be added
> at your application layer prior to displaying your results.(e.g. in your DAO
> object)
> 
> On 2 December 2010 22:05, Adam Estrada wrote:
> 
>> All,
>> 
>> I have an index that has a field with country codes in it. I have 7 million
>> or so documents in the index and when displaying facets the country codes
>> don't mean a whole lot to me. Is there any way to add a field with the full
>> country names then join the codes in there accordingly? I suppose I can do
>> this before updating the records in the index but before I do that I would
>> like to know if there is a way to do this sort of join.
>> 
>> Example: US -> United States
>> 
>> Thanks,
>> Adam



Re: Joining Fields in an Index

2010-12-02 Thread Savvas-Andreas Moysidis
Hi,

If you are able to do a full re-index then you could index the full names
and not the codes. When you later facet on the Country field you'll get the
actual name rather than the code.
If you are not able to re-index then probably this conversion could be added
at your application layer prior to displaying your results (e.g. in your DAO
object).

On 2 December 2010 22:05, Adam Estrada wrote:

> All,
>
> I have an index that has a field with country codes in it. I have 7 million
> or so documents in the index and when displaying facets the country codes
> don't mean a whole lot to me. Is there any way to add a field with the full
> country names then join the codes in there accordingly? I suppose I can do
> this before updating the records in the index but before I do that I would
> like to know if there is a way to do this sort of join.
>
> Example: US -> United States
>
> Thanks,
> Adam


Joining Fields in an Index

2010-12-02 Thread Adam Estrada
All,

I have an index that has a field with country codes in it. I have 7 million or 
so documents in the index and when displaying facets the country codes don't 
mean a whole lot to me. Is there any way to add a field with the full country 
names then join the codes in there accordingly? I suppose I can do this before 
updating the records in the index but before I do that I would like to know if 
there is a way to do this sort of join. 

Example: US -> United States

Thanks,
Adam

Re: ramBufferSizeMB not reflected in segment sizes in index

2010-12-02 Thread Yonik Seeley
On Wed, Dec 1, 2010 at 3:01 PM, Shawn Heisey  wrote:
> I have seen this.  In Solr 1.4.1, the .fdt, .fdx, and the .tv* files do not
> segment, but all the other files do.  I can't remember whether it behaves
> the same under 3.1, or whether it also creates these files in each segment.

Yep, that's the shared doc store (where stored fields go.. the
non-inverted part of the index), and it works like that in 3.x and
trunk too.
It's nice because when you merge segments, you don't have to re-copy
the docs (provided you're within a single indexing session).
There have been discussions about removing it in trunk though... we'll see.

-Yonik
http://www.lucidimagination.com


RE: ramBufferSizeMB not reflected in segment sizes in index

2010-12-02 Thread Burton-West, Tom
Hi Mike,

We turned on infostream.   Is there documentation about how to interpret it, or 
should I just grep through the codebase?

Is the excerpt below what I am looking for as far as understanding the 
relationship between ramBufferSize and size on disk?
is newFlushedSize the size on disk in bytes?


DW:   ramUsed=329.782 MB newFlushedSize=74520060 docs/MB=0.943 new/old=21.55%

RAM: now balance allocations: usedMB=325.997 vs trigger=320 deletesMB=0.048 
byteBlockFree=0.125 perDocFree=0.006 charBlockFree=0
...
DW: after free: freedMB=0.225 usedMB=325.82
Dec 1, 2010 5:40:22 PM IW 0 [Wed Dec 01 17:40:22 EST 2010; 
http-8091-Processor12]: flush: now pause all indexing threads
Dec 1, 2010 5:40:22 PM IW 0 [Wed Dec 01 17:40:22 EST 2010; 
http-8091-Processor12]:   flush: segment=_5h docStoreSegment=_5e 
docStoreOffset=266 flushDocs=true flushDeletes=false 
flushDocStores=false numDocs=40 numBufDelTerms=40
... Dec 1, 2010 5:40:22 PM   purge field=geographic
Dec 1, 2010 5:40:22 PM   purge field=serialTitle_ab
Dec 1, 2010 5:40:33 PM IW 0 [Wed Dec 01 17:40:33 EST 2010; 
http-8091-Processor12]: DW:   ramUsed=325.772 MB newFlushedSize=69848046 
docs/MB=0.6 new/old=20.447%
Dec 1, 2010 5:40:33 PM IW 0 [Wed Dec 01 17:40:33 EST 2010; 
http-8091-Processor12]: flushedFiles=[_5h.frq, _5h.tis, _5h.prx, _5h.nrm, 
_5h.fnm, _5h.tii]



Tom


-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Wednesday, December 01, 2010 3:43 PM
To: solr-user@lucene.apache.org
Subject: Re: ramBufferSizeMB not reflected in segment sizes in index

On Wed, Dec 1, 2010 at 3:16 PM, Burton-West, Tom  wrote:
> Thanks Mike,
>
> Yes we have many unique terms due to dirty OCR and 400 languages and probably 
> lots of low doc freq terms as well (although with the ICUTokenizer and 
> ICUFoldingFilter we should get fewer terms due to bad tokenization and 
> normalization.)

OK likely this explains the lowish RAM efficiency.

> Is this additional overhead because each unique term takes a certain amount 
> of space compared to adding entries to a list for an existing term?

Exactly.  There's a highish "startup cost" for each term but then
appending docs/positions to that term is more efficient especially for
higher frequency terms.  In the limit, a single unique term  across
all docs will have very high RAM efficiency...

> Does turning on IndexWriters infostream have a significant impact on memory 
> use or indexing speed?

I don't believe so

Mike


Re: TermsComponent prefix query with field's analyzers

2010-12-02 Thread Jonathan Rochkind
I don't believe you can.  If you just need query-time transformation, 
can't you just do it in your client app? If you need index-time 
transformation... well, you can do that, but it's up to your schema.xml 
and will of course apply to the field as a whole, not just for 
termscomponent queries, because that's just how solr works.


I'd note for your example, you'll also have to lowercase that capital A 
if you want it to match a lowercased a in a termscomponent prefix query.


To my mind (others may disagree), robust flexible auto-complete like 
this is still a somewhat unsolved problem in Solr; the TermsComponent 
approach has its definite limitations.


On 12/2/2010 12:24 PM, Nestor Oviedo wrote:

Hi everyone
Does anyone know how to apply some analyzers over a prefix query?
What I'm looking for is a way to build an autosuggest using the
termsComponent that could be able to remove the accents from the
query's prefix.
For example, I have the term "analisis" in the index and I want to
retrieve it with the prefix "Análi" (notice the accent in the third
letter).
I think the regexp function won't help here, so I was wondering if
specifying some analyzers (LowerCase and ASCIIFolding) in the
termComponents configuration, it would be applied over the prefix.

Thanks in advance.
Nestor
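
For the index-time route Jonathan mentions, a schema.xml field type along these
lines would fold case and accents before the terms are ever indexed (a sketch;
the type name is made up):

<fieldType name="autosuggest" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

The prefix sent in terms.prefix still has to be lowercased and accent-folded on
the client side, since the TermsComponent does not analyze it.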



Re: SOLR Thesaurus

2010-12-02 Thread Jonathan Rochkind
No, it doesn't.  And it's not entirely clear what (if any) simple way 
there is to use Solr to expose hierarchically related documents in a 
way that preserves and usefully allows navigation of the relationships.  
At least in general, for sophisticated stuff.


On 12/2/2010 3:55 AM, lee carroll wrote:

Hi List,

Coming to an end of a prototype evaluation of Solr (all very good etc etc).
Getting to the point of looking at bells and whistles. Does Solr have a
thesaurus? Can't find any reference
to one in the docs and on the wiki etc. (Apart from a few mail threads which
describe the synonym.txt as a thesaurus.)

I mean something like:

PT: xxx
BT: xxx, xxx
NT: xxx, xxx
RT: xxx, xxx, xxx
Scope Note: xxx

Like I say, bells and whistles.

cheers Lee



Re: Import Data Into Solr

2010-12-02 Thread Erick Erickson
You can just point your Solr instance at your Lucene index. Really, copy the
Lucene index into the right place to be found by Solr.

HOWEVER, you need to take great care that the field definitions that you
used
when you built your Lucene index are compatible with the ones configured in
your
schema.xml file. This is NOT a trivial task.

I'd recommend that you try having Solr build your index; you'll probably
want to
sometime in the future anyway, so you might as well bite the bullet now if
possible...

Plus, I'm not quite sure about index version issues.

Best
Erick

On Thu, Dec 2, 2010 at 11:54 AM, Bing Li  wrote:

> Hi, all,
>
> I am a new user of Solr. Before using it, all of the data is indexed myself
> with Lucene. According to the Chapter 3 of the book, Solr. 1.4 Enterprise
> Search Server written by David Smiley and Eric Pugh, data in the formats of
> XML, CSV and even PDF, etc, can be imported to Solr.
>
> If I wish to import the Lucene indexes into Solr, may I have any other
> approaches? I know that Solr is a serverized Lucene.
>
> Thanks,
> Bing Li
>


Re: Return Lucene DocId in Solr Results

2010-12-02 Thread Erick Erickson
You have to call termDocs.next() after termDocs.seek(). Something like:

termDocs.seek(term);
if (termDocs.next()) {
    // there was a term/doc match, so termDocs.doc() is valid here
}

On Thu, Dec 2, 2010 at 10:22 AM, Lohrenz, Steven
wrote:

> I must be missing something as I'm getting a NPE on the line: docIds[i] =
> termDocs.doc();
> here's what I came up with:
>
> private int[] getDocIdsFromPrimaryKey(SolrQueryRequest req, List
> favsBeans) throws ParseException {
>// open the core & get data directory
>String indexDir = req.getCore().getIndexDir();
>
> FSDirectory indexDirectory = null;
>try {
>indexDirectory = FSDirectory.open(new File(indexDir));
> } catch (IOException e) {
>throw new ParseException("IOException, cannot open the index at:
> " + indexDir + " " + e.getMessage());
>}
>
> //String pkQueryString = "resourceId:" + favBean.getResourceId();
> //Query pkQuery = new QueryParser(Version.LUCENE_CURRENT,
> "resourceId", new StandardAnalyzer()).parse(pkQueryString);
>
>IndexSearcher searcher = null;
>TopScoreDocCollector collector = null;
> IndexReader indexReader = null;
>TermDocs termDocs = null;
>
>try {
>searcher = new IndexSearcher(indexDirectory, true);
>indexReader = new FilterIndexReader(searcher.getIndexReader());
>termDocs = indexReader.termDocs();
> } catch (IOException e) {
>throw new ParseException("IOException, cannot open the index at:
> " + indexDir + " " + e.getMessage());
>}
>
>int[] docIds = new int[favsBeans.size()];
>int i = 0;
>for(Favorites favBean: favsBeans) {
> Term term = new Term("resourceId", favBean.getResourceId());
>try {
>termDocs.seek(term);
>docIds[i] = termDocs.doc();
>} catch (IOException e) {
>throw new ParseException("IOException, cannot seek to the
> primary key " + favBean.getResourceId() + " in : " + indexDir + " " +
> e.getMessage());
> }
>//ScoreDoc[] hits = collector.topDocs().scoreDocs;
>//if(hits != null && hits[0] != null) {
>
> i++;
>//}
>}
>
>Arrays.sort(docIds);
>return docIds;
>}
>
> Thanks,
> Steve
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: 02 December 2010 14:20
> To: solr-user@lucene.apache.org
> Subject: Re: Return Lucene DocId in Solr Results
>
> Ahhh, you're already down in Lucene. That makes things easier...
>
> See TermDocs. Particularly seek(Term). That'll directly access the indexed
> unique key rather than having to form a bunch of queries.
>
> Best
> Erick
>
>
> On Thu, Dec 2, 2010 at 8:59 AM, Lohrenz, Steven
> wrote:
>
> > I would be interested in hearing about some ways to improve the
> algorithm.
> > I have done a very straightforward Lucene query within a loop to get the
> > docIds.
> >
> > Here's what I did to get it working where favsBean are objects returned
> > from a query of the second core, but there is probably a better way to do
> > it:
> >
> > private int[] getDocIdsFromPrimaryKey(SolrQueryRequest req,
> List
> > favsBeans) throws ParseException {
> >// open the core & get data directory
> >String indexDir = req.getCore().getIndexDir();
> >FSDirectory index = null;
> >try {
> >index = FSDirectory.open(new File(indexDir));
> >} catch (IOException e) {
> >throw new ParseException("IOException, cannot open the index
> at:
> > " + indexDir + " " + e.getMessage());
> >}
> >
> >int[] docIds = new int[favsBeans.size()];
> >int i = 0;
> >for(Favorites favBean: favsBeans) {
> >String pkQueryString = "resourceId:" +
> favBean.getResourceId();
> >Query pkQuery = new QueryParser(Version.LUCENE_CURRENT,
> > "resourceId", new StandardAnalyzer()).parse(pkQueryString);
> >
> >IndexSearcher searcher = null;
> >TopScoreDocCollector collector = null;
> >try {
> >searcher = new IndexSearcher(index, true);
> >collector = TopScoreDocCollector.create(1, true);
> >searcher.search(pkQuery, collector);
> >} catch (IOException e) {
> >throw new ParseException("IOException, cannot search the
> > index at: " + indexDir + " " + e.getMessage());
> >}
> >
> >ScoreDoc[] hits = collector.topDocs().scoreDocs;
> >if(hits != null && hits[0] != null) {
> >docIds[i] = hits[0].doc;
> >i++;
> >}
> >}
> >
> >Arrays.sort(docIds);
> >return docIds;
> > }
> >
> > -Original Message-
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: 02 December 2010 13:46
> > To: solr-user

RE: disabled replication setting

2010-12-02 Thread Xin Li
Does anyone know?

Thanks,

-Original Message-
From: Xin Li [mailto:xin.li@gmail.com] 
Sent: Thursday, December 02, 2010 12:25 PM
To: solr-user@lucene.apache.org
Subject: disabled replication setting

For Solr replication, we can send a command to disable replication. Does
anyone know where I can verify the replication enabled/disabled
setting? I cannot seem to find it on the dashboard or in the details command
output.

Thanks,

Xin



Exceptions in Embedded Solr

2010-12-02 Thread Tharindu Mathew
Hi everyone,

I suddenly get the exception below when using embedded Solr. If I
delete the Solr index it goes back to normal, but it obviously has to
start indexing from scratch. Any idea what the cause of this is?

java.lang.RuntimeException: java.io.FileNotFoundException:
/home/evanthika/WSO2/CARBON/GREG/3.6.0/23-11-2010/normal/wso2greg-3.6.0/solr/data/index/segments_2
(No such file or directory)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068)
at org.apache.solr.core.SolrCore.(SolrCore.java:579)
at 
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
at org.wso2.carbon.registry.indexing.solr.SolrClient.(SolrClient.java:103)
at 
org.wso2.carbon.registry.indexing.solr.SolrClient.getInstance(SolrClient.java:115)
... 44 more
Caused by: java.io.FileNotFoundException:
/home/evanthika/WSO2/CARBON/GREG/3.6.0/23-11-2010/normal/wso2greg-3.6.0/solr/data/index/segments_2
(No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.(RandomAccessFile.java:212)
at 
org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput$Descriptor.(SimpleFSDirectory.java:78)
at 
org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.(SimpleFSDirectory.java:108)
at 
org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.(NIOFSDirectory.java:94)
at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:70)
at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:691)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:236)
at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:72)
at 
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:476)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:403)
at 
org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:38)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1057)
... 48 more

[2010-11-23 14:14:46,568] ERROR {org.apache.solr.core.SolrCore} -
REFCOUNT ERROR: unreferenced org.apache.solr.core.solrc...@58f24b6
(null) has a reference count of 1
[2010-11-23 14:14:46,568] ERROR {org.apache.solr.core.SolrCore} -
REFCOUNT ERROR: unreferenced org.apache.solr.core.solrc...@654dbbf6
(null) has a reference count of 1
[2010-11-23 14:14:46,568] ERROR {org.apache.solr.core.CoreContainer} -
CoreContainer was not shutdown prior to finalize(), indicates a bug --
POSSIBLE RESOURCE LEAK!!!
[2010-11-23 14:14:46,568] ERROR {org.apache.solr.core.CoreContainer} -
CoreContainer was not shutdown prior to finalize(), indicates a bug --
POSSIBLE RESOURCE LEAK!!!

-- 
Regards,

Tharindu



-- 
Regards,

Tharindu


disabled replication setting

2010-12-02 Thread Xin Li
For Solr replication, we can send a command to disable replication. Does
anyone know where I can verify the replication enabled/disabled
setting? I cannot seem to find it on the dashboard or in the details command
output.

Thanks,

Xin


TermsComponent prefix query with field's analyzers

2010-12-02 Thread Nestor Oviedo
Hi everyone
Does anyone know how to apply some analyzers over a prefix query?
What I'm looking for is a way to build an autosuggest using the
termsComponent that could be able to remove the accents from the
query's prefix.
For example, I have the term "analisis" in the index and I want to
retrieve it with the prefix "Análi" (notice the accent in the third
letter).
I think the regexp function won't help here, so I was wondering if
specifying some analyzers (LowerCase and ASCIIFolding) in the
termComponents configuration, it would be applied over the prefix.

Thanks in advance.
Nestor


Re: Dynamically change master

2010-12-02 Thread Tommaso Teofili
Back with my master resiliency need: talking with Upayavira, we discovered we
were proposing the same solution :-)
This can be useful if you don't have a VIP with a master/backup polling
policy.
It goes like this: there are 2 hosts for indexing, a main one and a
backup one. The backup is a slave of the main one, and the main one is
also master of N hosts which will be used for searching. If the main master
goes down then the backup one will be used for indexing and/or serving the
search slaves.
This last feature can be done by defining an external properties file for each
search slave which contains the URL of the master (referenced inside the
replication request handler tag of solrconfig.xml), so if these search
slaves run multi-core, one only has to change the properties file URL and
issue a http://SLAVEURL/solr/admin/cores?action=RELOAD&core=core0 to get
them polling the backup master.
Cheers,
Tommaso
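
A sketch of the per-slave setup Tommaso describes (the file name, property name
and URLs are illustrative): each slave carries a solrcore.properties holding the
current master URL, and solrconfig.xml references it via property substitution.

# solrcore.properties on each search slave
master.replication.url=http://main-master:8983/solr/core0/replication

<!-- slave section of the ReplicationHandler in solrconfig.xml -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">${master.replication.url}</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>

Switching to the backup master is then a matter of editing the property and
issuing the core RELOAD call mentioned above.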



2010/12/1 Tommaso Teofili 

> Thanks Upayavira, that sounds very good.
>
> p.s.:
> I read that page some weeks ago and didn't get back to check on it.
>
>
> 2010/12/1 Upayavira 
>
>> Note, all extracted from http://wiki.apache.org/solr/SolrReplication
>>
>> You'd put:
>>
>> <requestHandler name="/replication" class="solr.ReplicationHandler">
>>   <lst name="master">
>>     <str name="replicateAfter">startup</str>
>>     <str name="replicateAfter">commit</str>
>>   </lst>
>> </requestHandler>
>>
>> into every box you want to be able to act as a master, then use:
>>
>> http://slave_host:port/solr/replication?command=fetchindex&masterUrl=<new master URL>
>>
>> As the above page says better than I can, "It is possible to pass on
>> extra attribute 'masterUrl' or other attributes like 'compression' (or
>> any other parameter which is specified in the  tag) to
>> do a one time replication from a master. This obviates the need for
>> hardcoding the master in the slave."
>>
>> HTH, Upayavira
>>
>> On Wed, 01 Dec 2010 06:24 +0100, "Tommaso Teofili"
>>  wrote:
>> > Hi Upayavira,
>> > this is a good start for solving my problem, can you please tell how
>> does
>> > such a replication URL look like?
>> > Thanks,
>> > Tommaso
>> >
>> > 2010/12/1 Upayavira 
>> >
>> > > Hi Tommaso,
>> > >
>> > > I believe you can tell each server to act as a master (which means it
>> > > can have its indexes pulled from it).
>> > >
>> > > You can then include the master hostname in the URL that triggers a
>> > > replication process. Thus, if you triggered replication from outside
>> > > solr, you'd have control over which master you pull from.
>> > >
>> > > Does this answer your question?
>> > >
>> > > Upayavira
>> > >
>> > >
>> > > On Tue, 30 Nov 2010 09:18 -0800, "Ken Krugler"
>> > >  wrote:
>> > > > Hi Tommaso,
>> > > >
>> > > > On Nov 30, 2010, at 7:41am, Tommaso Teofili wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > in a replication environment if the host where the master is
>> running
>> > > > > goes
>> > > > > down for some reason, is there a way to communicate to the slaves
>> to
>> > > > > point
>> > > > > to a different (backup) master without manually changing
>> > > > > configuration (and
>> > > > > restarting the slaves or their cores)?
>> > > > >
>> > > > > Basically I'd like to be able to change the replication master
>> > > > > dinamically
>> > > > > inside the slaves.
>> > > > >
>> > > > > Do you have any idea of how this could be achieved?
>> > > >
>> > > > One common approach is to use VIP (virtual IP) support provided by
>> > > > load balancers.
>> > > >
>> > > > Your slaves are configured to use a VIP to talk to the master, so
>> that
>> > > > it's easy to dynamically change which master they use, via updates
>> to
>> > > > the load balancer config.
>> > > >
>> > > > -- Ken
>> > > >
>> > > > --
>> > > > Ken Krugler
>> > > > +1 530-210-6378
>> > > > http://bixolabs.com
>> > > > e l a s t i c   w e b   m i n i n g
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > >
>> >
>>
>
>


Import Data Into Solr

2010-12-02 Thread Bing Li
Hi, all,

I am a new user of Solr. Before using it, all of my data was indexed by myself
with Lucene. According to Chapter 3 of the book Solr 1.4 Enterprise
Search Server, written by David Smiley and Eric Pugh, data in the formats of
XML, CSV and even PDF, etc., can be imported into Solr.

If I wish to import the Lucene indexes into Solr, are there any other
approaches? I know that Solr is a serverized Lucene.

Thanks,
Bing Li


Re: SOLR Thesaurus

2010-12-02 Thread lee carroll
Hi

Stephen, yes, sorry, I should have been more plain.

A term can have a Preferred Term (PT), many Broader Terms (BT), many Narrower
Terms (NT), Related Terms (RT), etc.

So

User supplied Term is say : Ski

Preferred term: Skiing
Broader terms could be : Ski and Snow Boarding, Mountain Sports, Sports
Narrower terms: down hill skiing, telemark, cross country
Related terms: boarding, snow boarding, winter holidays

Michael,

Yes, exactly, SKOS, although maybe without the overweening ambition to take
over the world.

By the sounds of it, though, out of the box you get a simple (but pretty
effective) synonym list and ring. Anything more we'd need to write
ourselves, i.e. your thesaurus filter plus a change to the response, as
broader terms, narrower terms etc. would be good to be suggested to the UI.

No plugins out there ?

On 2 December 2010 16:16, Michael Zach  wrote:

> Hello Lee,
>
> these bells sound like "SKOS" ;o)
>
> AFAIK Solr does not support thesauri just plain flat synonym lists.
>
> One could implement a thesaurus filter and put it into the end of the
> analyzer chain of solr.
>
> The filter would then do a thesaurus lookup for each token it receives and
> possibly
> * expand the query
> or
> * kind of "stem" document tokens to some prefered variants according to the
> thesaurus
>
> Maybe even taking term relations from thesaurus into account and boost
> queries or doc fields at index time.
>
> Maybe have a look at http://poolparty.punkt.at/ a full features SKOS
> thesaurus management server.
> It's also providing webservices which could feed such a Solr filter.
>
> Kind regards
> Michael
>
>
> - Ursprüngliche Mail -
> Von: "lee carroll" 
> An: solr-user@lucene.apache.org
> Gesendet: Donnerstag, 2. Dezember 2010 09:55:54
> Betreff: SOLR Thesaurus
>
> Hi List,
>
> Coming to and end of a proto type evaluation of SOLR (all very good etc
> etc)
> Getting to the point at looking at bells and whistles. Does SOLR have a
> thesuarus. Cant find any refrerence
> to one in the docs and on the wiki etc. (Apart from a few mail threads
> which
> describe the synonym.txt as a thesuarus)
>
> I mean something like:
>
> PT: 
> BT: xxx,,
> NT: xxx,,
> RT:xxx,xxx,xxx
> Scope Note: xx,
>
> Like i say bells and whistles
>
> cheers Lee
>
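
Out of the box, the closest thing is the flat synonyms.txt mapping mentioned
above; a sketch of bending it toward preferred/broader-term expansion (the
mappings themselves are illustrative, based on lee's Ski example):

# synonyms.txt -- flat stand-in for PT/BT expansion
ski => skiing, ski and snow boarding, mountain sports

<!-- applied in the field's analyzer chain -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"/>

This gives expansion but no notion of relation types or scope notes, which
matches what the thread concludes.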


Re: SOLR Thesaurus

2010-12-02 Thread Michael Zach
Hello Lee,

these bells sound like "SKOS" ;o)

AFAIK Solr does not support thesauri just plain flat synonym lists.

One could implement a thesaurus filter and put it into the end of the analyzer 
chain of solr.

The filter would then do a thesaurus lookup for each token it receives and 
possibly 
* expand the query 
or
* kind of "stem" document tokens to some preferred variants according to the 
thesaurus

Maybe even taking term relations from thesaurus into account and boost queries 
or doc fields at index time.

Maybe have a look at http://poolparty.punkt.at/ a full-featured SKOS thesaurus 
management server.
It's also providing webservices which could feed such a Solr filter.

Kind regards
Michael


- Ursprüngliche Mail -
Von: "lee carroll" 
An: solr-user@lucene.apache.org
Gesendet: Donnerstag, 2. Dezember 2010 09:55:54
Betreff: SOLR Thesaurus

Hi List,

Coming to an end of a prototype evaluation of Solr (all very good etc etc).
Getting to the point of looking at bells and whistles. Does Solr have a
thesaurus? Can't find any reference
to one in the docs and on the wiki etc. (Apart from a few mail threads which
describe the synonym.txt as a thesaurus.)

I mean something like:

PT: xxx
BT: xxx, xxx
NT: xxx, xxx
RT: xxx, xxx, xxx
Scope Note: xxx

Like I say, bells and whistles.

cheers Lee


Multi-valued poly fields & search

2010-12-02 Thread Vincent Cautaerts
Hi,

(should this be on solr-dev mailing list?)

I have this kind of data, about articles in newspapers:

article A-001
.  published on 2010-10-31, in newspaper "N-1", edition "E1"
.  published on 2010-10-30, in newspaper "N-2", edition "E2"
article A-002
.  published on 2010-10-30, in newspaper "N-1", edition "E1"

I have to be able to search on those "sub-fields", eg:

all articles published on 2010-10-30 in newspaper "N-1" (all editions)

I expect to find document A-002, but not document A-001

I control the indexing, analyzers,... but I would like use standard Solr
query syntax (or an extension of it)

If I index those documents:


<doc>
  <field name="id">A-001</field>
  <field name="pubDate">2010-10-31</field>
  <field name="ns">N-1</field>
  <field name="ed">E1</field>
  <field name="pubDate">2010-10-30</field>
  <field name="ns">N-2</field>
  <field name="ed">E2</field>
</doc>
<doc>
  <field name="id">A-002</field>
  <field name="pubDate">2010-10-30</field>
  <field name="ns">N-1</field>
  <field name="ed">E1</field>
</doc>



(ie: flattening the structure, losing the link between newspapers and dates)
then a search for "pubDate=2010-10-30 AND ns=N-1" will give me both
documents (because A-001 has been published in newspaper N-1 (at another
date) and has been published on 2010-10-30 (but in another newspaper))

Is there any way to index the data/express the search/... to be able to find
only document "A-002"?

In Solr terms, I believe that this is a multi-valued "poly" field (not yet
in the current stable version 1.4...)

Will this be supported by the next release? (what syntax?)

Some idea that I've had (usable with Solr 1.4)

(1)
Add fields like this for doc A-001:
 <field name="combined">N-1/E1/2010-10-31</field>
 <field name="combined">N-2/E2/2010-10-30</field>
and make a wildcard search "N-1/*/2010-10-30"


this will work for simple queries, but:
. I think that it will not allow range queries: "all articles published in
newspaper N-1 between 2009-08-01 and 2010-10-15"
. a wildcard query on N-1/E2/* will be very inefficient!
. writing queries will be more difficult (sometimes the user has to use the
field "ns", sometimes the field "combined",...)


(2)
Make the simple query "pubDate=2010-10-30 AND ns=N-1", but filter the
results (the above query will give all correct results, plus some more).
This is not a generic solution, and writing the filter will be difficult if
the query is more complex:
(pubDate=2010-10-31 AND ns=N-1 ) OR (text contains "Barack")

(3)
On the same field as (1) here above, use an analyzer that will cheat the
proximity search, in issuing the following terms:

term 1: "ns:N-1"
term 2: "ed:E1"
term 3: "pubDate:2010-10-31"
term 11: "ns:N-2"
term 12: "ed:E2"
term 13: "pubDate:2010-10-30"
...

then a proximity search

(combined:"ns:N-1" AND combined:"pubDate:2010-10-30")~3

will give me only document A-002, not document A-001

Again, this will make problems with range queries, won't it?

Isn't there any better way to do this?

Ideally, I would index this (with my own syntax...):


<doc>
  <field name="id">A-001</field>
  <publication>
    <field name="pubDate">2010-10-31</field>
    <field name="ns">N-1</field>
    <field name="ed">E1</field>
  </publication>
  <publication>
    <field name="pubDate">2010-10-30</field>
    <field name="ns">N-2</field>
    <field name="ed">E2</field>
  </publication>
</doc>


 and then search:

(pubDate=2010-10-31 AND ns=N-1){sameSet}

or something like this...

I've found references to similar questions, but no answer that I could use
in my case.
(this one being the closer to my problem:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201001.mbox/%3c9b742a34aa31814594f2bed8dfd9cceec96ca...@sparky.office.techtarget.com%3e

or *http://tinyurl.com/3527w4u*)

Thanks in advance for your ideas!
(and sorry for any english mistakes)


RE: Return Lucene DocId in Solr Results

2010-12-02 Thread Lohrenz, Steven
I must be missing something as I'm getting an NPE on the line: docIds[i] = 
termDocs.doc(); 
here's what I came up with:

private int[] getDocIdsFromPrimaryKey(SolrQueryRequest req, List<Favorites> 
favsBeans) throws ParseException {
// open the core & get data directory
String indexDir = req.getCore().getIndexDir();

FSDirectory indexDirectory = null;
try {
indexDirectory = FSDirectory.open(new File(indexDir));
} catch (IOException e) {
throw new ParseException("IOException, cannot open the index at: " 
+ indexDir + " " + e.getMessage());
}

//String pkQueryString = "resourceId:" + favBean.getResourceId();
//Query pkQuery = new QueryParser(Version.LUCENE_CURRENT, "resourceId", 
new StandardAnalyzer()).parse(pkQueryString);

IndexSearcher searcher = null;
TopScoreDocCollector collector = null;
IndexReader indexReader = null;
TermDocs termDocs = null;

try {
searcher = new IndexSearcher(indexDirectory, true);
indexReader = new FilterIndexReader(searcher.getIndexReader());
termDocs = indexReader.termDocs();
} catch (IOException e) {
throw new ParseException("IOException, cannot open the index at: " 
+ indexDir + " " + e.getMessage());
}

int[] docIds = new int[favsBeans.size()];
int i = 0;
for(Favorites favBean: favsBeans) {
Term term = new Term("resourceId", favBean.getResourceId());
try {
termDocs.seek(term);
docIds[i] = termDocs.doc();
} catch (IOException e) {
throw new ParseException("IOException, cannot seek to the 
primary key " + favBean.getResourceId() + " in : " + indexDir + " " + 
e.getMessage());
}
//ScoreDoc[] hits = collector.topDocs().scoreDocs;
//if(hits != null && hits[0] != null) {

i++;
//}
}

Arrays.sort(docIds);
return docIds;
}

Thanks,
Steve
-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 02 December 2010 14:20
To: solr-user@lucene.apache.org
Subject: Re: Return Lucene DocId in Solr Results

Ahhh, you're already down in Lucene. That makes things easier...

See TermDocs. Particularly seek(Term). That'll directly access the indexed
unique key rather than having to form a bunch of queries.

Best
Erick


On Thu, Dec 2, 2010 at 8:59 AM, Lohrenz, Steven
wrote:

> I would be interested in hearing about some ways to improve the algorithm.
> I have done a very straightforward Lucene query within a loop to get the
> docIds.
>
> Here's what I did to get it working where favsBean are objects returned
> from a query of the second core, but there is probably a better way to do
> it:
>
> private int[] getDocIdsFromPrimaryKey(SolrQueryRequest req, List
> favsBeans) throws ParseException {
>// open the core & get data directory
>String indexDir = req.getCore().getIndexDir();
>FSDirectory index = null;
>try {
>index = FSDirectory.open(new File(indexDir));
>} catch (IOException e) {
>throw new ParseException("IOException, cannot open the index at:
> " + indexDir + " " + e.getMessage());
>}
>
>int[] docIds = new int[favsBeans.size()];
>int i = 0;
>for(Favorites favBean: favsBeans) {
>String pkQueryString = "resourceId:" + favBean.getResourceId();
>Query pkQuery = new QueryParser(Version.LUCENE_CURRENT,
> "resourceId", new StandardAnalyzer()).parse(pkQueryString);
>
>IndexSearcher searcher = null;
>TopScoreDocCollector collector = null;
>try {
>searcher = new IndexSearcher(index, true);
>collector = TopScoreDocCollector.create(1, true);
>searcher.search(pkQuery, collector);
>} catch (IOException e) {
>throw new ParseException("IOException, cannot search the
> index at: " + indexDir + " " + e.getMessage());
>}
>
>ScoreDoc[] hits = collector.topDocs().scoreDocs;
>if(hits != null && hits[0] != null) {
>docIds[i] = hits[0].doc;
>i++;
>}
>}
>
>Arrays.sort(docIds);
>return docIds;
> }
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: 02 December 2010 13:46
> To: solr-user@lucene.apache.org
> Subject: Re: Return Lucene DocId in Solr Results
>
> Sounds good, especially because your old scenario was fragile. The doc IDs
> in
> your first core could change as a result of a single doc deletion and
> optimize. So
> the doc IDs stored in the second core would then be wrong...
>
> Your user-defined unique key is definitely a better way to go. There are
> some tricks
> you could try if there are performance issues
>

Re: Dataimport destroys our harddisks

2010-12-02 Thread Sven Almgren
That's the same series we use... we had problems when running other
disk-heavy operations like rsync and backup on them too.

But in our case we mostly had hangs or load > 180 :P... Can you
simulate very heavy random disk I/O? If so, you could check whether you
still have the same problems...

That's all I can be of help with, good luck :)

/Sven

2010/12/2 Robert Gründler :
> On Dec 2, 2010, at 15:43 , Sven Almgren wrote:
>
>> What Raid controller do you use, and what kernel version? (Assuming
>> Linux). We hade problems during high load with a 3Ware raid controller
>> and the current kernel for Ubuntu 10.04, we hade to downgrade the
>> kernel...
>>
>> The problem was a bug in the driver that only showed up with very high
>> disk load (as is the case when doing imports)
>>
>
> We're running freebsd:
>
> RaidController  3ware 9500S-8
> Corrupt unit: Raid-10 3725.27GB 256K Stripe Size without BBU
> Freebsd 7.2, UFS Filesystem.
>
>
>
>> /Sven
>>
>> 2010/12/2 Robert Gründler :
 The very first thing I'd ask is "how much free space is on your disk
 when this occurs?" Is it possible that you're simply filling up your
 disk?
>>>
>>> no, i've checked that already. all disks have plenty of space (they have
>>> a capacity of 2TB, and are currently filled up to 20%.
>>>

 do note that an optimize may require up to 2X the size of your index
 if/when it occurs. Are you sure you aren't optimizing as you add
 items to your index?

>>>
>>> index size is not a problem in our case. Our index currently has about 3GB.
>>>
>>> What do you mean with "optimizing as you add items to your index"?
>>>
 But I've never heard of Solr causing hard disk crashes,
>>>
>>> neither did we, and google is the same opinion.
>>>
>>> One thing that i've found is the mergeFactor value:
>>>
>>> http://wiki.apache.org/solr/SolrPerformanceFactors#mergeFactor
>>>
>>> Our sysadmin speculates that maybe the chunk size of our raid/harddisks
>>> and the segment size of the lucene index does not play well together.
>>>
>>> Does the lucene segment size affect how the data is written to the disk?
>>>
>>>
>>> thanks for your help.
>>>
>>>
>>> -robert
>>>
>>>
>>>
>>>
>>>
>>>
>>>

 Best
 Erick

 2010/12/2 Robert Gründler 

> Hi,
>
> we have a serious harddisk problem, and it's definitely related to a
> full-import from a relational
> database into a solr index.
>
> The first time it happened on our development server, where the
> raidcontroller crashed during a full-import
> of ~ 8 Million documents. This happened 2 weeks ago, and in this period 2
> of the harddisks where the solr
> index files are located stopped working (we needed to replace them).
>
> After the crash of the raid controller, we decided to move the development
> of solr/index related stuff to our
> local development machines.
>
> Yesterday i was running another full-import of ~10 Million documents on my
> local development machine,
> and during the import, a harddisk failure occurred. Since this failure, my
> harddisk activity seems to
> be around 100% all the time, even if no solr server is running at all.
>
> I've been googling the last 2 days to find some info about solr related
> harddisk problems, but i didn't find anything
> useful.
>
> Are there any steps we need to take care of in respect to harddisk 
> failures
> when doing a full-import? Right now,
> our steps look like this:
>
> 1. Delete the current index
> 2. Restart solr, to load the updated schemas
> 3. Start the full import
>
> Initially, the solr index and the relational database were located on the
> same harddisk. After the crash, we moved
> the index to a separate harddisk, but nevertheless this harddisk crashed
> too.
>
> I'd really appreciate any hints on what we might do wrong when importing
> data, as we can't release this
> on our production servers when there's the risk of harddisk failures.
>
>
> thanks.
>
>
> -robert
>
>
>
>
>
>
>>>
>>>
>
>


Re: Dataimport destroys our harddisks

2010-12-02 Thread Robert Gründler
On Dec 2, 2010, at 15:43 , Sven Almgren wrote:

> What Raid controller do you use, and what kernel version? (Assuming
> Linux). We hade problems during high load with a 3Ware raid controller
> and the current kernel for Ubuntu 10.04, we hade to downgrade the
> kernel...
> 
> The problem was a bug in the driver that only showed up with very high
> disk load (as is the case when doing imports)
> 

We're running freebsd:

RaidController  3ware 9500S-8
Corrupt unit: Raid-10 3725.27GB 256K Stripe Size without BBU
Freebsd 7.2, UFS Filesystem.



> /Sven
> 
> 2010/12/2 Robert Gründler :
>>> The very first thing I'd ask is "how much free space is on your disk
>>> when this occurs?" Is it possible that you're simply filling up your
>>> disk?
>> 
>> no, i've checked that already. all disks have plenty of space (they have
>> a capacity of 2TB, and are currently filled up to 20%.
>> 
>>> 
>>> do note that an optimize may require up to 2X the size of your index
>>> if/when it occurs. Are you sure you aren't optimizing as you add
>>> items to your index?
>>> 
>> 
>> index size is not a problem in our case. Our index currently has about 3GB.
>> 
>> What do you mean with "optimizing as you add items to your index"?
>> 
>>> But I've never heard of Solr causing hard disk crashes,
>> 
>> neither did we, and google is the same opinion.
>> 
>> One thing that i've found is the mergeFactor value:
>> 
>> http://wiki.apache.org/solr/SolrPerformanceFactors#mergeFactor
>> 
>> Our sysadmin speculates that maybe the chunk size of our raid/harddisks
>> and the segment size of the lucene index does not play well together.
>> 
>> Does the lucene segment size affect how the data is written to the disk?
>> 
>> 
>> thanks for your help.
>> 
>> 
>> -robert
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>> 
>>> Best
>>> Erick
>>> 
>>> 2010/12/2 Robert Gründler 
>>> 
 Hi,
 
 we have a serious harddisk problem, and it's definitely related to a
 full-import from a relational
 database into a solr index.
 
 The first time it happened on our development server, where the
 raidcontroller crashed during a full-import
 of ~ 8 Million documents. This happened 2 weeks ago, and in this period 2
 of the harddisks where the solr
 index files are located stopped working (we needed to replace them).
 
 After the crash of the raid controller, we decided to move the development
 of solr/index related stuff to our
 local development machines.
 
 Yesterday i was running another full-import of ~10 Million documents on my
 local development machine,
 and during the import, a harddisk failure occurred. Since this failure, my
 harddisk activity seems to
 be around 100% all the time, even if no solr server is running at all.
 
 I've been googling the last 2 days to find some info about solr related
 harddisk problems, but i didn't find anything
 useful.
 
 Are there any steps we need to take care of in respect to harddisk failures
 when doing a full-import? Right now,
 our steps look like this:
 
 1. Delete the current index
 2. Restart solr, to load the updated schemas
 3. Start the full import
 
 Initially, the solr index and the relational database were located on the
 same harddisk. After the crash, we moved
 the index to a separate harddisk, but nevertheless this harddisk crashed
 too.
 
 I'd really appreciate any hints on what we might do wrong when importing
 data, as we can't release this
 on our production servers when there's the risk of harddisk failures.
 
 
 thanks.
 
 
 -robert
 
 
 
 
 
 
>> 
>> 



Re: Dataimport destroys our harddisks

2010-12-02 Thread Sven Almgren
What RAID controller do you use, and what kernel version? (Assuming
Linux.) We had problems during high load with a 3ware RAID controller
and the current kernel for Ubuntu 10.04; we had to downgrade the
kernel...

The problem was a bug in the driver that only showed up with very high
disk load (as is the case when doing imports).

/Sven

2010/12/2 Robert Gründler :
>> The very first thing I'd ask is "how much free space is on your disk
>> when this occurs?" Is it possible that you're simply filling up your
>> disk?
>
> no, i've checked that already. all disks have plenty of space (they have
> a capacity of 2TB, and are currently filled up to 20%.
>
>>
>> do note that an optimize may require up to 2X the size of your index
>> if/when it occurs. Are you sure you aren't optimizing as you add
>> items to your index?
>>
>
> index size is not a problem in our case. Our index currently has about 3GB.
>
> What do you mean with "optimizing as you add items to your index"?
>
>> But I've never heard of Solr causing hard disk crashes,
>
> neither did we, and google is the same opinion.
>
> One thing that i've found is the mergeFactor value:
>
> http://wiki.apache.org/solr/SolrPerformanceFactors#mergeFactor
>
> Our sysadmin speculates that maybe the chunk size of our raid/harddisks
> and the segment size of the lucene index does not play well together.
>
> Does the lucene segment size affect how the data is written to the disk?
>
>
> thanks for your help.
>
>
> -robert
>
>
>
>
>
>
>
>>
>> Best
>> Erick
>>
>> 2010/12/2 Robert Gründler 
>>
>>> Hi,
>>>
>>> we have a serious harddisk problem, and it's definitely related to a
>>> full-import from a relational
>>> database into a solr index.
>>>
>>> The first time it happened on our development server, where the
>>> raidcontroller crashed during a full-import
>>> of ~ 8 Million documents. This happened 2 weeks ago, and in this period 2
>>> of the harddisks where the solr
>>> index files are located stopped working (we needed to replace them).
>>>
>>> After the crash of the raid controller, we decided to move the development
>>> of solr/index related stuff to our
>>> local development machines.
>>>
>>> Yesterday i was running another full-import of ~10 Million documents on my
>>> local development machine,
>>> and during the import, a harddisk failure occurred. Since this failure, my
>>> harddisk activity seems to
>>> be around 100% all the time, even if no solr server is running at all.
>>>
>>> I've been googling the last 2 days to find some info about solr related
>>> harddisk problems, but i didn't find anything
>>> useful.
>>>
>>> Are there any steps we need to take care of in respect to harddisk failures
>>> when doing a full-import? Right now,
>>> our steps look like this:
>>>
>>> 1. Delete the current index
>>> 2. Restart solr, to load the updated schemas
>>> 3. Start the full import
>>>
>>> Initially, the solr index and the relational database were located on the
>>> same harddisk. After the crash, we moved
>>> the index to a separate harddisk, but nevertheless this harddisk crashed
>>> too.
>>>
>>> I'd really appreciate any hints on what we might do wrong when importing
>>> data, as we can't release this
>>> on our production servers when there's the risk of harddisk failures.
>>>
>>>
>>> thanks.
>>>
>>>
>>> -robert
>>>
>>>
>>>
>>>
>>>
>>>
>
>


Re: Dataimport destroys our harddisks

2010-12-02 Thread Robert Gründler
> The very first thing I'd ask is "how much free space is on your disk
> when this occurs?" Is it possible that you're simply filling up your
> disk?

No, I've checked that already. All disks have plenty of space (they have
a capacity of 2 TB, and are currently filled up to 20%).

> 
> do note that an optimize may require up to 2X the size of your index
> if/when it occurs. Are you sure you aren't optimizing as you add
> items to your index?
> 

Index size is not a problem in our case. Our index currently has about 3 GB.

What do you mean by "optimizing as you add items to your index"? 

> But I've never heard of Solr causing hard disk crashes,

Neither did we, and Google is of the same opinion. 

One thing that i've found is the mergeFactor value:

http://wiki.apache.org/solr/SolrPerformanceFactors#mergeFactor

Our sysadmin speculates that maybe the chunk size of our RAID/hard disks
and the segment size of the Lucene index do not play well together.

Does the lucene segment size affect how the data is written to the disk?


thanks for your help.


-robert







> 
> Best
> Erick
> 
> 2010/12/2 Robert Gründler 
> 
>> Hi,
>> 
>> we have a serious harddisk problem, and it's definitely related to a
>> full-import from a relational
>> database into a solr index.
>> 
>> The first time it happened on our development server, where the
>> raidcontroller crashed during a full-import
>> of ~ 8 Million documents. This happened 2 weeks ago, and in this period 2
>> of the harddisks where the solr
>> index files are located stopped working (we needed to replace them).
>> 
>> After the crash of the raid controller, we decided to move the development
>> of solr/index related stuff to our
>> local development machines.
>> 
>> Yesterday i was running another full-import of ~10 Million documents on my
>> local development machine,
>> and during the import, a harddisk failure occurred. Since this failure, my
>> harddisk activity seems to
>> be around 100% all the time, even if no solr server is running at all.
>> 
>> I've been googling the last 2 days to find some info about solr related
>> harddisk problems, but i didn't find anything
>> useful.
>> 
>> Are there any steps we need to take care of in respect to harddisk failures
>> when doing a full-import? Right now,
>> our steps look like this:
>> 
>> 1. Delete the current index
>> 2. Restart solr, to load the updated schemas
>> 3. Start the full import
>> 
>> Initially, the solr index and the relational database were located on the
>> same harddisk. After the crash, we moved
>> the index to a separate harddisk, but nevertheless this harddisk crashed
>> too.
>> 
>> I'd really appreciate any hints on what we might do wrong when importing
>> data, as we can't release this
>> on our production servers when there's the risk of harddisk failures.
>> 
>> 
>> thanks.
>> 
>> 
>> -robert
>> 
>> 
>> 
>> 
>> 
>> 



Re: Return Lucene DocId in Solr Results

2010-12-02 Thread Erick Erickson
Ahhh, you're already down in Lucene. That makes things easier...

See TermDocs. Particularly seek(Term). That'll directly access the indexed
unique key rather than having to form a bunch of queries.
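
Roughly, a sketch of what that could look like with the Lucene 3.x TermDocs API
(a hypothetical helper, reusing the names from your snippet below; it assumes the
resourceId values are indexed as plain string terms):

private int[] getDocIdsViaTermDocs(IndexReader reader, List<Favorites> favsBeans)
        throws IOException {
    int[] docIds = new int[favsBeans.size()];
    int found = 0;
    for (Favorites favBean : favsBeans) {
        // reader.termDocs(term) returns a TermDocs already positioned on the term,
        // i.e. the same as termDocs() followed by seek(term)
        TermDocs td = reader.termDocs(
                new Term("resourceId", String.valueOf(favBean.getResourceId())));
        try {
            if (td.next()) {
                docIds[found++] = td.doc();   // internal Lucene doc id
            }
        } finally {
            td.close();
        }
    }
    int[] result = Arrays.copyOf(docIds, found);  // drop slots for keys with no match
    Arrays.sort(result);
    return result;
}

This skips the per-key QueryParser/IndexSearcher round trip entirely.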

Best
Erick


On Thu, Dec 2, 2010 at 8:59 AM, Lohrenz, Steven
wrote:

> I would be interested in hearing about some ways to improve the algorithm.
> I have done a very straightforward Lucene query within a loop to get the
> docIds.
>
> Here's what I did to get it working where favsBean are objects returned
> from a query of the second core, but there is probably a better way to do
> it:
>
> private int[] getDocIdsFromPrimaryKey(SolrQueryRequest req, List<Favorites>
> favsBeans) throws ParseException {
>// open the core & get data directory
>String indexDir = req.getCore().getIndexDir();
>FSDirectory index = null;
>try {
>index = FSDirectory.open(new File(indexDir));
>} catch (IOException e) {
>throw new ParseException("IOException, cannot open the index at:
> " + indexDir + " " + e.getMessage());
>}
>
>int[] docIds = new int[favsBeans.size()];
>int i = 0;
>for(Favorites favBean: favsBeans) {
>String pkQueryString = "resourceId:" + favBean.getResourceId();
>Query pkQuery = new QueryParser(Version.LUCENE_CURRENT,
> "resourceId", new StandardAnalyzer()).parse(pkQueryString);
>
>IndexSearcher searcher = null;
>TopScoreDocCollector collector = null;
>try {
>searcher = new IndexSearcher(index, true);
>collector = TopScoreDocCollector.create(1, true);
>searcher.search(pkQuery, collector);
>} catch (IOException e) {
>throw new ParseException("IOException, cannot search the
> index at: " + indexDir + " " + e.getMessage());
>}
>
>ScoreDoc[] hits = collector.topDocs().scoreDocs;
>if(hits != null && hits[0] != null) {
>docIds[i] = hits[0].doc;
>i++;
>}
>}
>
>Arrays.sort(docIds);
>return docIds;
> }
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: 02 December 2010 13:46
> To: solr-user@lucene.apache.org
> Subject: Re: Return Lucene DocId in Solr Results
>
> Sounds good, especially because your old scenario was fragile. The doc IDs
> in
> your first core could change as a result of a single doc deletion and
> optimize. So
> the doc IDs stored in the second core would then be wrong...
>
> Your user-defined unique key is definitely a better way to go. There are
> some tricks
> you could try if there are performance issues
>
> Best
> Erick
>
> On Thu, Dec 2, 2010 at 7:47 AM, Lohrenz, Steven
> wrote:
>
> > I know the doc ids from one core have nothing to do with the other. I was
> > going to use the docId returned from the first core in the solr results
> and
> > store it in the second core that way the second core knows about the doc
> ids
> > from the first core. So when you query the second core from the Filter in
> > the first core you get returned a set of data that includes the docId
> from
> > the first core that the document relates to.
> >
> > I have backed off from this approach and have a user defined primary key
> in
> > the firstCore, which is stored as the reference in the secondCore and
> when
> > the filter performs the search it goes off and queries the firstCore for
> > each primary key and gets the lucene docId from the returned doc.
> >
> > Thanks,
> > Steve
> >
> > -Original Message-
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: 02 December 2010 02:19
> > To: solr-user@lucene.apache.org
> > Subject: Re: Return Lucene DocId in Solr Results
> >
> > On the face of it, this doesn't make sense, so perhaps you can explain a
> > bit.The doc IDs
> > from one Solr instance have no relation to the doc IDs from another Solr
> > instance. So anything
> > that uses doc IDs from one Solr instance to create a filter on another
> > instance doesn't seem
> > to be something you'd want to do...
> >
> > Which may just mean I don't understand what you're trying to do. Can you
> > back up a bit
> > and describe the higher-level problem? This seems like it may be an XY
> > problem, see:
> > http://people.apache.org/~hossman/#xyproblem
> >
> > Best
> > Erick
> >
> > On Tue, Nov 30, 2010 at 6:57 AM, Lohrenz, Steven
> > wrote:
> >
> > > Hi,
> > >
> > > I was wondering how I would go about getting the lucene docid included
> in
> > > the results from a solr query?
> > >
> > > I've built a QueryParser to query another solr instance and and join
> the
> > > results of the two instances through the use of a Filter.  The Filter
> > needs
> > > the lucene docid to work. This is the only bit I'm missing right now.
> > >
> > > Thanks,
> > > Steve
> > >
> > >
> >
>


RE: Return Lucene DocId in Solr Results

2010-12-02 Thread Lohrenz, Steven
I would be interested in hearing about some ways to improve the algorithm. I 
have done a very straightforward Lucene query within a loop to get the docIds.

Here's what I did to get it working, where favsBeans are objects returned from a
query of the second core, but there is probably a better way to do it:

private int[] getDocIdsFromPrimaryKey(SolrQueryRequest req, List<Favorites> favsBeans)
        throws ParseException {
    // open the core & get data directory
    String indexDir = req.getCore().getIndexDir();
    FSDirectory index = null;
    try {
        index = FSDirectory.open(new File(indexDir));
    } catch (IOException e) {
        throw new ParseException("IOException, cannot open the index at: "
                + indexDir + " " + e.getMessage());
    }

    int[] docIds = new int[favsBeans.size()];
    int i = 0;
    for (Favorites favBean : favsBeans) {
        // one primary-key query per favourite
        String pkQueryString = "resourceId:" + favBean.getResourceId();
        Query pkQuery = new QueryParser(Version.LUCENE_CURRENT,
                "resourceId", new StandardAnalyzer()).parse(pkQueryString);

        IndexSearcher searcher = null;
        TopScoreDocCollector collector = null;
        try {
            searcher = new IndexSearcher(index, true);
            collector = TopScoreDocCollector.create(1, true);
            searcher.search(pkQuery, collector);
        } catch (IOException e) {
            throw new ParseException("IOException, cannot search the index at: "
                    + indexDir + " " + e.getMessage());
        }

        ScoreDoc[] hits = collector.topDocs().scoreDocs;
        if (hits != null && hits.length > 0) {   // guard against keys with no match
            docIds[i] = hits[0].doc;
            i++;
        }
    }

    Arrays.sort(docIds);
    return docIds;
}

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 02 December 2010 13:46
To: solr-user@lucene.apache.org
Subject: Re: Return Lucene DocId in Solr Results

Sounds good, especially because your old scenario was fragile. The doc IDs
in
your first core could change as a result of a single doc deletion and
optimize. So
the doc IDs stored in the second core would then be wrong...

Your user-defined unique key is definitely a better way to go. There are
some tricks
you could try if there are performance issues

Best
Erick

On Thu, Dec 2, 2010 at 7:47 AM, Lohrenz, Steven
wrote:

> I know the doc ids from one core have nothing to do with the other. I was
> going to use the docId returned from the first core in the solr results and
> store it in the second core that way the second core knows about the doc ids
> from the first core. So when you query the second core from the Filter in
> the first core you get returned a set of data that includes the docId from
> the first core that the document relates to.
>
> I have backed off from this approach and have a user defined primary key in
> the firstCore, which is stored as the reference in the secondCore and when
> the filter performs the search it goes off and queries the firstCore for
> each primary key and gets the lucene docId from the returned doc.
>
> Thanks,
> Steve
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: 02 December 2010 02:19
> To: solr-user@lucene.apache.org
> Subject: Re: Return Lucene DocId in Solr Results
>
> On the face of it, this doesn't make sense, so perhaps you can explain a
> bit.The doc IDs
> from one Solr instance have no relation to the doc IDs from another Solr
> instance. So anything
> that uses doc IDs from one Solr instance to create a filter on another
> instance doesn't seem
> to be something you'd want to do...
>
> Which may just mean I don't understand what you're trying to do. Can you
> back up a bit
> and describe the higher-level problem? This seems like it may be an XY
> problem, see:
> http://people.apache.org/~hossman/#xyproblem
>
> Best
> Erick
>
> On Tue, Nov 30, 2010 at 6:57 AM, Lohrenz, Steven
> wrote:
>
> > Hi,
> >
> > I was wondering how I would go about getting the lucene docid included in
> > the results from a solr query?
> >
> > I've built a QueryParser to query another solr instance and and join the
> > results of the two instances through the use of a Filter.  The Filter
> needs
> > the lucene docid to work. This is the only bit I'm missing right now.
> >
> > Thanks,
> > Steve
> >
> >
>


RE: SOLR Thesaurus

2010-12-02 Thread Steven A Rowe
Hi Lee,

Can you describe your thesaurus format (it's not exactly self-descriptive) and 
how you would like it to be applied?

I gather you're referring to a thesaurus feature in another product (or product 
class)?  Maybe if you describe that it would help too.

Steve

> -Original Message-
> From: lee carroll [mailto:lee.a.carr...@googlemail.com]
> Sent: Thursday, December 02, 2010 3:56 AM
> To: solr-user@lucene.apache.org
> Subject: SOLR Thesaurus
> 
> Hi List,
> 
> Coming to and end of a proto type evaluation of SOLR (all very good etc
> etc)
> Getting to the point at looking at bells and whistles. Does SOLR have a
> thesuarus. Cant find any refrerence
> to one in the docs and on the wiki etc. (Apart from a few mail threads
> which
> describe the synonym.txt as a thesuarus)
> 
> I mean something like:
> 
> PT: 
> BT: xxx,,
> NT: xxx,,
> RT:xxx,xxx,xxx
> Scope Note: xx,
> 
> Like i say bells and whistles
> 
> cheers Lee


Re: Return Lucene DocId in Solr Results

2010-12-02 Thread Erick Erickson
Sounds good, especially because your old scenario was fragile. The doc IDs
in
your first core could change as a result of a single doc deletion and
optimize. So
the doc IDs stored in the second core would then be wrong...

Your user-defined unique key is definitely a better way to go. There are
some tricks
you could try if there are performance issues

Best
Erick

On Thu, Dec 2, 2010 at 7:47 AM, Lohrenz, Steven
wrote:

> I know the doc ids from one core have nothing to do with the other. I was
> going to use the docId returned from the first core in the solr results and
> store it in the second core that way the second core knows about the doc ids
> from the first core. So when you query the second core from the Filter in
> the first core you get returned a set of data that includes the docId from
> the first core that the document relates to.
>
> I have backed off from this approach and have a user defined primary key in
> the firstCore, which is stored as the reference in the secondCore and when
> the filter performs the search it goes off and queries the firstCore for
> each primary key and gets the lucene docId from the returned doc.
>
> Thanks,
> Steve
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: 02 December 2010 02:19
> To: solr-user@lucene.apache.org
> Subject: Re: Return Lucene DocId in Solr Results
>
> On the face of it, this doesn't make sense, so perhaps you can explain a
> bit.The doc IDs
> from one Solr instance have no relation to the doc IDs from another Solr
> instance. So anything
> that uses doc IDs from one Solr instance to create a filter on another
> instance doesn't seem
> to be something you'd want to do...
>
> Which may just mean I don't understand what you're trying to do. Can you
> back up a bit
> and describe the higher-level problem? This seems like it may be an XY
> problem, see:
> http://people.apache.org/~hossman/#xyproblem
>
> Best
> Erick
>
> On Tue, Nov 30, 2010 at 6:57 AM, Lohrenz, Steven
> wrote:
>
> > Hi,
> >
> > I was wondering how I would go about getting the lucene docid included in
> > the results from a solr query?
> >
> > I've built a QueryParser to query another solr instance and and join the
> > results of the two instances through the use of a Filter.  The Filter
> needs
> > the lucene docid to work. This is the only bit I'm missing right now.
> >
> > Thanks,
> > Steve
> >
> >
>


Re: Dataimport destroys our harddisks

2010-12-02 Thread Erick Erickson
The very first thing I'd ask is "how much free space is on your disk
when this occurs?" Is it possible that you're simply filling up your
disk?

do note that an optimize may require up to 2X the size of your index
if/when it occurs. Are you sure you aren't optimizing as you add
items to your index?

But I've never heard of Solr causing hard disk crashes, it doesn't do
anything special but read/write...

Best
Erick

2010/12/2 Robert Gründler 

> Hi,
>
> we have a serious harddisk problem, and it's definitely related to a
> full-import from a relational
> database into a solr index.
>
> The first time it happened on our development server, where the
> raidcontroller crashed during a full-import
> of ~ 8 Million documents. This happened 2 weeks ago, and in this period 2
> of the harddisks where the solr
> index files are located stopped working (we needed to replace them).
>
> After the crash of the raid controller, we decided to move the development
> of solr/index related stuff to our
> local development machines.
>
> Yesterday i was running another full-import of ~10 Million documents on my
> local development machine,
> and during the import, a harddisk failure occurred. Since this failure, my
> harddisk activity seems to
> be around 100% all the time, even if no solr server is running at all.
>
> I've been googling the last 2 days to find some info about solr related
> harddisk problems, but i didn't find anything
> useful.
>
> Are there any steps we need to take care of in respect to harddisk failures
> when doing a full-import? Right now,
> our steps look like this:
>
> 1. Delete the current index
> 2. Restart solr, to load the updated schemas
> 3. Start the full import
>
> Initially, the solr index and the relational database were located on the
> same harddisk. After the crash, we moved
> the index to a separate harddisk, but nevertheless this harddisk crashed
> too.
>
> I'd really appreciate any hints on what we might do wrong when importing
> data, as we can't release this
> on our production servers when there's the risk of harddisk failures.
>
>
> thanks.
>
>
> -robert
>
>
>
>
>
>


Re: Best practice for Delta every 2 Minutes.

2010-12-02 Thread Erick Erickson
In fact, having a master/slave where the master is the
indexing/updating machine and the slave(s) are searchers
is one of the recommended configurations. The replication
is used in many, many sites so it's pretty solid.
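
For reference, the Solr 1.4 Java-based replication is configured in solrconfig.xml;
a minimal sketch might look like this (host name, port and poll interval are
placeholders to adapt):

On the master:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">commit</str>
      <str name="confFiles">schema.xml,stopwords.txt</str>
    </lst>
  </requestHandler>

On each slave:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master-host:8983/solr/replication</str>
      <str name="pollInterval">00:02:00</str>
    </lst>
  </requestHandler>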

It's generally not recommended, though, to run separate
instances on the *same* server. No matter how many
cores/instances/etc, you're still running on the same
physical hardware so I/O contention, memory issues, etc
are still bounded by your hardware

Best
Erick

On Thu, Dec 2, 2010 at 5:12 AM, stockii  wrote:

>
> at the time no OOM occurs. but we are not in correct live system ...
>
> i thougt maybe i get this problem ...
>
> we are running seven cores and each want be update very fast. only one core
> have a huge index with 28M docs. maybe it makes sense for the future to use
> solr with replication !? or can i runs two instances, one for search and
> one
> for updating ? or is there the danger of corrupt indizes ?
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Best-practice-for-Delta-every-2-Minutes-tp1992714p2005108.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Tuning Solr caches with high commit rates (NRT)

2010-12-02 Thread Peter Sturge
In order for the 'read-only' instance to see any new/updated
documents, it needs to do a commit (since it's read-only, it is a
commit of 0 documents).
You can do this via a client service that issues periodic commits, or
use autorefresh from within solrconfig.xml. Be careful that you don't
do anything in the read-only instance that will change the underlying
index - like optimize.

Peter


On Thu, Dec 2, 2010 at 12:51 PM, stockii  wrote:
>
> great thread and exactly my problems :D
>
> i set up two solr-instances, one for update the index and another for
> searching.
>
> When i perform an update. the search-instance dont get the new documents.
> when i start a commit on searcher he found it. how can i say the searcher
> that he alwas look not only the "old" index. automatic refresh ? XD
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Tuning-Solr-caches-with-high-commit-rates-NRT-tp1461275p2005738.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Tuning Solr caches with high commit rates (NRT)

2010-12-02 Thread stockii

Great thread, and exactly my problem :D

I set up two Solr instances, one for updating the index and another for
searching.

When I perform an update, the search instance doesn't see the new documents.
When I issue a commit on the searcher, it finds them. How can I tell the searcher
to always look at the current index and not only the "old" one? Automatic refresh? XD
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Tuning-Solr-caches-with-high-commit-rates-NRT-tp1461275p2005738.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Return Lucene DocId in Solr Results

2010-12-02 Thread Lohrenz, Steven
I know the doc ids from one core have nothing to do with the other. I was going
to use the docId returned from the first core in the Solr results and store it
in the second core; that way the second core knows about the doc ids from the
first core. So when you query the second core from the Filter in the first core,
you get back a set of data that includes the docId of the first-core document
it relates to.

I have backed off from this approach and now have a user-defined primary key in
the firstCore, which is stored as the reference in the secondCore. When the
filter performs the search, it queries the firstCore for each
primary key and gets the Lucene docId from the returned doc.

Thanks,
Steve

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 02 December 2010 02:19
To: solr-user@lucene.apache.org
Subject: Re: Return Lucene DocId in Solr Results

On the face of it, this doesn't make sense, so perhaps you can explain a
bit.The doc IDs
from one Solr instance have no relation to the doc IDs from another Solr
instance. So anything
that uses doc IDs from one Solr instance to create a filter on another
instance doesn't seem
to be something you'd want to do...

Which may just mean I don't understand what you're trying to do. Can you
back up a bit
and describe the higher-level problem? This seems like it may be an XY
problem, see:
http://people.apache.org/~hossman/#xyproblem

Best
Erick

On Tue, Nov 30, 2010 at 6:57 AM, Lohrenz, Steven
wrote:

> Hi,
>
> I was wondering how I would go about getting the lucene docid included in
> the results from a solr query?
>
> I've built a QueryParser to query another solr instance and and join the
> results of the two instances through the use of a Filter.  The Filter needs
> the lucene docid to work. This is the only bit I'm missing right now.
>
> Thanks,
> Steve
>
>


Re: Troubles with forming query for solr.

2010-12-02 Thread Savvas-Andreas Moysidis
Hello,

would something along these lines:
(field1:term AND field2:term AND field3:term)^2 OR (field1:term AND
field2:term)^0.8 OR (field2:term AND field3:term)^0.5

work? You'll probably need to experiment with the boost values to get the
desired result.

Another option could be investigating the Dismax handler.
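
For example, something like this (field names and boosts are placeholders, and it
assumes the dismax query parser is available, as it is in Solr 1.4):

/select?defType=dismax&q=term&qf=field1^2.0+field2^1.5+field3^0.5

Note that dismax scores each term by its best-matching field (plus a tie factor),
so it approximates the three-tier ordering rather than reproducing it exactly.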

On 1 December 2010 02:38, kolesman  wrote:

>
> Hi,
>
> I have some troubles with forming query for solr.
>
> Here is my task:
> I'm indexing objects with 3 fields, for example {field1, field2, field3}.
> In Solr's response I want to get objects in a specific order:
> 1. First I want to get objects where all 3 fields are matched
> 2. Then I want to get objects where ONLY field1 and field2 are matched
> 3. And finally I want to get objects where ONLY field2 and field3 are
> matched.
>
> Could you explain how to form the query for my task?
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Troubles-with-forming-query-for-solr-tp1996630p1996630.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Preventing index segment corruption when windows crashes

2010-12-02 Thread Michael McCandless
On Thu, Dec 2, 2010 at 4:53 AM, Peter Sturge  wrote:
> As I'm not familiar with the syncing in Lucene, I couldn't say whether
> there's a specific problem with regards Win7/2008 server etc.
>
> Windows has long had the somewhat odd behaviour of deliberately
> caching file handles after an explicit close(). This has been part of
> NTFS since NT 4 days, but there may be some new behaviour introduced
> in Windows 6.x (and there is a lot of new behaviour) that causes an
> issue. I have also seen this problem in Windows Server 2008 (server
> version of Win7 - same file system).
>
> I'll try some further testing on previous Windows versions, but I've
> not previously come across a single segment corruption on Win 2k3/XP
> after hard failures. In fact, it was when I first encountered this
> problem on Server 2008 that I even discovered CheckIndex existed!
>
> I guess a good question for the community is: Has anyone else
> seen/reproduced this problem on Windows 6.x (i.e. Server 2008 or
> Win7)?
>
> Mike, are there any diagnostics/config etc. that I could try to help
> isolate the problem?

Actually it might be easiest to make a standalone Java test, maybe
using Lucene's FSDir, that opens files in sequence (0.bin, 1.bin,
2.bin...), writes verifiable data to them (e.g. random bytes from a fixed seed)
and then closes & syncs each one.  Then, crash the box while this is
running.  Finally, run a verify step that checks that the data is
"correct"?  I.e. that our attempt to fsync "worked"?

It could very well be that Windows 6.x is now "smarter" about fsync, in
that it only syncs bytes actually written with the currently open file
descriptor, and not bytes written against the same file by past file
descriptors (i.e. via a global buffer cache, like Linux).

Mike


Dataimport destroys our harddisks

2010-12-02 Thread Robert Gründler
Hi,

we have a serious harddisk problem, and it's definitely related to a 
full-import from a relational
database into a solr index.

The first time it happened on our development server, where the RAID controller
crashed during a full-import
of ~8 million documents. This happened 2 weeks ago, and in this period 2 of
the harddisks where the solr
index files are located stopped working (we needed to replace them).

After the crash of the raid controller, we decided to move the development of 
solr/index related stuff to our
local development machines. 

Yesterday I was running another full-import of ~10 million documents on my
local development machine,
and during the import a harddisk failure occurred. Since this failure, my
harddisk activity seems to
be around 100% all the time, even if no Solr server is running at all.

I've been googling for the last 2 days to find some info about Solr-related
harddisk problems, but I didn't find anything
useful.

Are there any steps we need to take care of with respect to harddisk failures
when doing a full-import? Right now,
our steps look like this:

1. Delete the current index
2. Restart solr, to load the updated schemas
3. Start the full import

Initially, the solr index and the relational database were located on the same 
harddisk. After the crash, we moved
the index to a separate harddisk, but nevertheless this harddisk crashed too.

I'd really appreciate any hints on what we might do wrong when importing data, 
as we can't release this
on our production servers when there's the risk of harddisk failures.


thanks.


-robert







Re: Best practice for Delta every 2 Minutes.

2010-12-02 Thread stockii

At the moment no OOM occurs, but we are not on the real live system yet ...

I thought I might run into this problem ...

We are running seven cores and each needs to be updated very quickly. Only one core
has a huge index with 28M docs. Maybe it makes sense in the future to use
Solr with replication!? Or can I run two instances, one for searching and one
for updating? Or is there a danger of corrupting the indexes?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Best-practice-for-Delta-every-2-Minutes-tp1992714p2005108.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Preventing index segment corruption when windows crashes

2010-12-02 Thread Peter Sturge
As I'm not familiar with the syncing in Lucene, I couldn't say whether
there's a specific problem with regard to Win7/2008 Server etc.

Windows has long had the somewhat odd behaviour of deliberately
caching file handles after an explicit close(). This has been part of
NTFS since NT 4 days, but there may be some new behaviour introduced
in Windows 6.x (and there is a lot of new behaviour) that causes an
issue. I have also seen this problem in Windows Server 2008 (server
version of Win7 - same file system).

I'll try some further testing on previous Windows versions, but I've
not previously come across a single segment corruption on Win 2k3/XP
after hard failures. In fact, it was when I first encountered this
problem on Server 2008 that I even discovered CheckIndex existed!

I guess a good question for the community is: Has anyone else
seen/reproduced this problem on Windows 6.x (i.e. Server 2008 or
Win7)?

Mike, are there any diagnostics/config etc. that I could try to help
isolate the problem?

Many thanks,
Peter



On Thu, Dec 2, 2010 at 9:28 AM, Michael McCandless
 wrote:
> On Thu, Dec 2, 2010 at 4:10 AM, Peter Sturge  wrote:
>> The Win7 crashes aren't from disk drivers - they come from, in this
>> case, a Broadcom wireless adapter driver.
>> The corruption comes as a result of the 'hard stop' of Windows.
>>
>> I would imagine this same problem could/would occur on any OS if the
>> plug was pulled from the machine.
>
> Actually, Lucene should be robust to this -- losing power, OS crash,
> hardware failure (as long as the failure doesn't flip bits), etc.
> This is because we do not delete files associated with an old commit
> point until all files referenced by the new commit point are
> successfully fsync'd.
>
> However it sounds like something is wrong, at least on Windows 7.
>
> I suspect it may be how we do the fsync -- if you look in
> FSDirectory.fsync, you'll see that we take a String fileName in.  We
> then open a new read/write RandomAccessFile, and call its
> .getFD().sync().
>
> I think this is potentially risky, ie, it would be better if we called
> .sync() on the original file we had opened for writing and written
> lots of data to, before closing it, instead of closing it, opening a
> new FileDescriptor, and calling sync on it.  We could conceivably take
> this approach, entirely in the Directory impl, by keeping the pool of
> file handles for write open even after .close() was called.  When a
> file is deleted we'd remove it from that pool, and when it's finally
> sync'd we'd then sync it and remove it from the pool.
>
> Could it be that on Windows 7 the way we fsync (opening a new
> FileDescriptor long after the first one was closed) doesn't in fact
> work?
>
> Mike
>


Re: Preventing index segment corruption when windows crashes

2010-12-02 Thread Michael McCandless
On Thu, Dec 2, 2010 at 4:10 AM, Peter Sturge  wrote:
> The Win7 crashes aren't from disk drivers - they come from, in this
> case, a Broadcom wireless adapter driver.
> The corruption comes as a result of the 'hard stop' of Windows.
>
> I would imagine this same problem could/would occur on any OS if the
> plug was pulled from the machine.

Actually, Lucene should be robust to this -- losing power, OS crash,
hardware failure (as long as the failure doesn't flip bits), etc.
This is because we do not delete files associated with an old commit
point until all files referenced by the new commit point are
successfully fsync'd.

However it sounds like something is wrong, at least on Windows 7.

I suspect it may be how we do the fsync -- if you look in
FSDirectory.fsync, you'll see that we take a String fileName in.  We
then open a new read/write RandomAccessFile, and call its
.getFD().sync().

I think this is potentially risky, ie, it would be better if we called
.sync() on the original file we had opened for writing and written
lots of data to, before closing it, instead of closing it, opening a
new FileDescriptor, and calling sync on it.  We could conceivably take
this approach, entirely in the Directory impl, by keeping the pool of
file handles for write open even after .close() was called.  When a
file is deleted we'd remove it from that pool, and when it's finally
sync'd we'd then sync it and remove it from the pool.
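
For illustration, the two patterns in plain java.io terms (a rough sketch of
what is being described, not the actual Lucene source; dir, fileName and data
are placeholders):

// pattern used today, roughly: re-open the already-closed file and
// sync a brand-new descriptor
RandomAccessFile reopened = new RandomAccessFile(new File(dir, fileName), "rw");
try {
    reopened.getFD().sync();
} finally {
    reopened.close();
}

// proposed alternative: sync the descriptor that actually did the writing,
// before it is ever closed
FileOutputStream out = new FileOutputStream(new File(dir, fileName));
out.write(data);
out.getFD().sync();
out.close();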

Could it be that on Windows 7 the way we fsync (opening a new
FileDescriptor long after the first one was closed) doesn't in fact
work?

Mike


Re: Preventing index segment corruption when windows crashes

2010-12-02 Thread Peter Sturge
The Win7 crashes aren't from disk drivers - they come from, in this
case, a Broadcom wireless adapter driver.
The corruption comes as a result of the 'hard stop' of Windows.

I would imagine this same problem could/would occur on any OS if the
plug was pulled from the machine.

Thanks,
Peter


On Thu, Dec 2, 2010 at 4:07 AM, Lance Norskog  wrote:
> Is there any way that Windows 7 and disk drivers are not honoring the
> fsync() calls? That would cause files and/or blocks to get saved out
> of order.
>
> On Tue, Nov 30, 2010 at 3:24 PM, Peter Sturge  wrote:
>> After a recent Windows 7 crash (:-\), upon restart, Solr starts giving
>> LockObtainFailedException errors: (excerpt)
>>
>>   30-Nov-2010 23:10:51 org.apache.solr.common.SolrException log
>>   SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock
>> obtain timed out:
>> nativefsl...@solr\.\.\data0\index\lucene-ad25f73e3c87e6f192c4421756925f47-write.lock
>>
>>
>> When I run CheckIndex, I get: (excerpt)
>>
>>  30 of 30: name=_2fi docCount=857
>>    compound=false
>>    hasProx=true
>>    numFiles=8
>>    size (MB)=0.769
>>    diagnostics = {os.version=6.1, os=Windows 7, lucene.version=3.1-dev 
>> ${svnver
>> sion} - 2010-09-11 11:09:06, source=flush, os.arch=amd64, 
>> java.version=1.6.0_18,
>> java.vendor=Sun Microsystems Inc.}
>>    no deletions
>>    test: open reader.FAILED
>>    WARNING: fixIndex() would remove reference to this segment; full 
>> exception:
>> org.apache.lucene.index.CorruptIndexException: did not read all bytes from 
>> file
>> "_2fi.fnm": read 1 vs size 512
>>        at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:367)
>>        at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
>>        at 
>> org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReade
>> r.java:119)
>>        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:583)
>>        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:561)
>>        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:467)
>>        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:878)
>>
>> WARNING: 1 broken segments (containing 857 documents) detected
>>
>>
>> This seems to happen every time Windows 7 crashes, and it would seem
>> extraordinary bad luck for this tiny test index to be in the middle of
>> a commit every time.
>> (it is set to commit every 40secs, but for such a small index it only
>> takes millis to complete)
>>
>> Does this seem right? I don't remember seeing so many corruptions in
>> the index - maybe it is the world of Win7 dodgy drivers, but it would
>> be worth investigating if there's something amiss in Solr/Lucene when
>> things go down unexpectedly...
>>
>> Thanks,
>> Peter
>>
>>
>> On Tue, Nov 30, 2010 at 9:19 AM, Peter Sturge  wrote:
>>> The index itself isn't corrupt - just one of the segment files. This
>>> means you can read the index (less the offending segment(s)), but once
>>> this happens it's no longer possible to
>>> access the documents that were in that segment (they're gone forever),
>>> nor write/commit to the index (depending on the env/request, you get
>>> 'Error reading from index file..' and/or WriteLockError)
>>> (note that for my use case, documents are dynamically created so can't
>>> be re-indexed).
>>>
>>> Restarting Solr fixes the write lock errors (an indirect environmental
>>> symptom of the problem), and running CheckIndex -fix is the only way
>>> I've found to repair the index so it can be written to (rewrites the
>>> corrupted segment(s)).
>>>
>>> I guess I was wondering if there's a mechanism that would support
>>> something akin to a transactional rollback for segments.
>>>
>>> Thanks,
>>> Peter
>>>
>>>
>>>
>>> On Mon, Nov 29, 2010 at 5:33 PM, Yonik Seeley
>>>  wrote:
 On Mon, Nov 29, 2010 at 10:46 AM, Peter Sturge  
 wrote:
> If a Solr index is running at the time of a system halt, this can
> often corrupt a segments file, requiring the index to be -fix'ed by
> rewriting the offending file.

 Really?  That shouldn't be possible (if you mean the index is truly
 corrupt - i.e. you can't open it).

 -Yonik
 http://www.lucidimagination.com

>>>
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


SOLR Thesaurus

2010-12-02 Thread lee carroll
Hi List,

Coming to an end of a prototype evaluation of SOLR (all very good etc etc).
Getting to the point of looking at bells and whistles. Does SOLR have a
thesaurus? I can't find any reference
to one in the docs or on the wiki etc. (apart from a few mail threads which
describe the synonym.txt as a thesaurus).

I mean something like:

PT: 
BT: xxx,,
NT: xxx,,
RT: xxx,xxx,xxx
Scope Note: xx,

Like I say, bells and whistles.

cheers Lee


Re: Restrict access to localhost

2010-12-02 Thread Peter Karich
For 1) use the Tomcat connector configuration in conf/server.xml with
address="127.0.0.1" port="8080" ...
For 2) if they have direct access to Solr, either insert a middleware
layer or create a write lock ;-)
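
For example, a minimal HTTP Connector bound to localhost (the other attributes
shown are just the stock Tomcat defaults):

  <Connector port="8080" protocol="HTTP/1.1" address="127.0.0.1"
             connectionTimeout="20000" redirectPort="8443" />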



Hello all,

1)
I want to restrict access to Solr to localhost only. How do I achieve that?

2)
What if I want to allow clients to search but not to delete? How do I restrict
access?

Any thoughts?

Regards
Ganesh.
Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download 
Now! http://messenger.yahoo.com/download.php




--
http://jetwick.com twitter search prototype