Re: advice on creating a solr index when data source is from many unrelated db tables

2010-07-30 Thread Chantal Ackermann
Hi Ahmed, fields that are empty do not impact the index. It's different from a database. I have text fields for different languages and per document there is always only one of the languages set (the text fields for the other languages are empty/not set). It all works very well and fast. I

Re: advice on creating a solr index when data source is from many unrelated db tables

2010-07-30 Thread Gora Mohanty
On Thu, 29 Jul 2010 15:33:42 -0400 S Ahmed sahmed1...@gmail.com wrote: I understand (and it's straightforward) when you want to create an index for something simple like Products. But how do you go about creating a Solr index when you have data coming from 10-15 database tables, and the

Re: Get unique values

2010-07-30 Thread Rafal Bluszcz Zawadzki
2010/7/28 Rafal Bluszcz Zawadzki ra...@headnet.dk Hi, In my schema I have (inter alia) fields CollectionID, and CollectionName. These two values always match together, which means that for every value of CollectionID there is a matching value from CollectionName. I am interested in query

Re: wildcard and proximity searches

2010-07-30 Thread Ahmet Arslan
What approach should I use to perform wildcard and proximity searches? Like: solr mail*~10 For getting docs where solr is within 10 words of mailing for instance? You can do it with the plug-in described here: https://issues.apache.org/jira/browse/SOLR-1604 It would be great

Solr and Lucene in South Africa

2010-07-30 Thread Jaco Olivier
Hi to all Solr/Lucene Users... Our team had a discussion today regarding the Solr/Lucene community closer to home. I am hereby putting out an SOS to all Solr/Lucene users in the South African market and wish to organize a meet-up (or user support group) if at all possible. It would be great to

RE: wildcard and proximity searches

2010-07-30 Thread Frederico Azeiteiro
Hi Ahmet, Thank you. I'll be happy to test it if I manage to install it ok. I'm a newbie at solr but I'm going to try the instructions in the thread to load it. Another doubt I have about wildcard searches: a) I think wildcard search is by default case sensitive? Is there a way to make case

RE: wildcard and proximity searches

2010-07-30 Thread Ahmet Arslan
a) I think wildcard search is by default case sensitive? Is there a way to make case insensitive? Wildcard searches are not analyzed. For case-insensitive search you can lowercase query terms on the client side (while using a lowercase filter at index time), e.g. Mail* = mail* I discovered that
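
The client-side lowercasing suggested above could look roughly like the following Python sketch. It is only an illustration: the function name lowercase_wildcard_terms and the example field name Subject are made up here, and it assumes that only whitespace-delimited tokens containing * or ? need lowercasing, so that operators such as AND and any field prefix are left alone.

```python
import re

# Any whitespace-delimited token that contains a wildcard character.
_WILDCARD_TOKEN = re.compile(r"\S*[*?]\S*")


def _lowercase_term(match):
    token = match.group(0)
    # Keep an optional "field:" prefix untouched; lowercase only the term.
    field, colon, term = token.rpartition(":")
    return field + colon + term.lower()


def lowercase_wildcard_terms(query):
    """Lowercase wildcard terms, since they bypass index-time analysis."""
    return _WILDCARD_TOKEN.sub(_lowercase_term, query)


print(lowercase_wildcard_terms("Mail* AND Solr"))
print(lowercase_wildcard_terms("Subject:Mail* OR Te?t"))
```

Non-wildcard terms are deliberately not touched, because they still go through the field's query analyzer on the server.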

Re: Can't find org.apache.solr.client.solrj.embedded

2010-07-30 Thread Uwe Reh
Sorry, I had inspected the ...core.jar three times, without recognizing the package. I was really blind. =8-) thanks Uwe On 26.07.2010 20:48, Chris Hostetter wrote: : where is a Jar, containing org.apache.solr.client.solrj.embedded? Classes in the embedded package are useless w/o the rest

RE: wildcard and proximity searches

2010-07-30 Thread Frederico Azeiteiro
Hi Ahmet, a) I think wildcard search is by default case sensitive? Is there a way to make case insensitive? Wildcard searches are not analyzed. For case-insensitive search you can lowercase query terms on the client side (while using a lowercase filter at index time), e.g. Mail* = mail* I discovered

Re: Customize order field list ???

2010-07-30 Thread kenf_nc
I believe they come back alphabetically sorted (not sure if this is language specific or not), so a quick way might be to change the name from createdate to zz_createdate or something like that. Generally with XML you should not be worried about order however. It's usually a sign of a design

Re: Using Solr to perform range queries in Dspace

2010-07-30 Thread Mckeane
Thank you for your reply. This is a background as to what I am trying to achieve. I want to be able to perform a search across numeric index ranges and get the results in logical ordering instead of a lexicographic ordering using dspace. Currently if I do a search using the query: var:[10 TO
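
The difference between lexicographic and numeric range handling described above can be illustrated outside Solr with a small Python sketch. The sample values and the upper bound 20 are assumptions made for the example (the query in the message is cut off), and the helper names are hypothetical; this does not reproduce DSpace or Solr behaviour, only the string-versus-number comparison behind it.

```python
values = ["9", "10", "25", "100"]


def in_string_range(value, low, high):
    # Compares character by character, as a plain string field would.
    return low <= value <= high


def in_numeric_range(value, low, high):
    # Compares by numeric value, as a numeric field type would.
    return int(low) <= int(value) <= int(high)


string_hits = [v for v in values if in_string_range(v, "10", "20")]
numeric_hits = [v for v in values if in_numeric_range(v, "10", "20")]
string_order = sorted(values)
numeric_order = sorted(values, key=int)

print(string_hits, numeric_hits)
print(string_order, numeric_order)
```

With string comparison, "100" falls inside the range 10 to 20 and "9" sorts last, which is the kind of surprise a numeric field type is meant to avoid.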

Re: Solr searching performance issues, using large documents

2010-07-30 Thread Li Li
Highlighting time is mainly spent on getting the field which you want to highlight and tokenizing it (if you don't store term vectors). You can check what's wrong, 2010/7/30 Peter Spam ps...@mac.com: If I don't do highlighting, it's really fast. Optimize has no effect. -Peter On Jul

Re: question about relevance

2010-07-30 Thread Bharat Jain
Hi, Thanks a lot for the info and your time. I think field collapse will work for us. I looked at https://issues.apache.org/jira/browse/SOLR-236 but which file should I use for the patch? We use solr-1.3. Thanks Bharat Jain On Fri, Jul 30, 2010 at 12:53 AM, Chris Hostetter

Document Boost with Solr Extraction - SolrContentHandler

2010-07-30 Thread jayendra patil
We are using Solr Extract Handler for indexing document metadata with attachments. (/update/extract) However, the SolrContentHandler doesn't seem to support index time document boost attribute. Probably , document.setDocumentBoost(Float.parseFloat(boost)) is missing. Regards, Jayendra

Re: advice on creating a solr index when data source is from many unrelated db tables

2010-07-30 Thread S Ahmed
So I have tables like this: Users UserSales UserHistory UserAddresses UserNotes ClientAddress CalenderEvent Articles Blogs It just seems odd to me, jamming all these tables into a single index. But I guess the idea of using a 'type' field to qualify exactly what I am searching is a good idea, in

Re: Solr searching performance issues, using large documents

2010-07-30 Thread Peter Spam
I do store term vector: <field name="body" type="text_pl" indexed="true" stored="true" multiValued="false" termVectors="true" termPositions="true" termOffsets="true" /> -Pete On Jul 30, 2010, at 7:30 AM, Li Li wrote: Highlighting time is mainly spent on getting the field which you want to highlight and

Programmatically retrieving numDocs (or any other statistic)

2010-07-30 Thread John DeRosa
I want to programmatically retrieve the number of indexed documents. I.e., get the value of numDocs. The only two ways I've come up with are searching for *:* and reporting the hit count, or sending an HTTP GET to http://xxx.xx.xxx.xxx:8080/solr/admin/stats.jsp#core and searching for stat

Re: Help with schema design

2010-07-30 Thread Erick Erickson
I'd just index the eventtype, eventby and eventtime as separate fields. Then query with something like eventtype:update AND eventtime:[targettime TO *]. Similarly for events updated by pramod, the query would be something like: eventby:pramod AND eventtype:update HTH Erick On Wed, Jul 28, 2010 at
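
As a rough illustration of the two queries above, the Python sketch below builds the query strings and a select URL. The field names come from the message; the helper names, the base URL and the sample timestamp are assumptions made for the example, and nothing is sent over the network.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr"  # hypothetical Solr location


def events_since(event_type, since):
    # Events of one type at or after a given time (open-ended range).
    return "eventtype:%s AND eventtime:[%s TO *]" % (event_type, since)


def events_by(user, event_type):
    # Events of one type performed by one user.
    return "eventby:%s AND eventtype:%s" % (user, event_type)


def select_url(base_url, query):
    # urlencode takes care of escaping ':' and spaces in the q parameter.
    return base_url + "/select?" + urlencode({"q": query})


print(events_since("update", "2010-07-01T00:00:00Z"))
print(select_url(BASE_URL, events_by("pramod", "update")))
```

Keeping the three values in separate fields is what lets each clause be combined freely with AND.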

Re: Solr using 1500 threads - is that normal?

2010-07-30 Thread Erick Erickson
Glad to help. Do be aware that there are several config values that influence the commit frequency, they might also be relevant. Best Erick On Thu, Jul 29, 2010 at 5:11 AM, Christos Constantinou ch...@simpleweb.co.uk wrote: Eric, Thank you very much for the indicators! I had a closer look

Re: Solr Indexing slows down

2010-07-30 Thread Erick Erickson
See the subject about 1500 threads. The first place I'd look is how often you're committing. If you're committing before the warmup queries from the previous commit have done their magic, you might be getting into a death spiral. HTH Erick On Thu, Jul 29, 2010 at 7:02 AM, Peter Karich

Problems running on tomcat

2010-07-30 Thread Claudio Devecchi
Hi, I'm new with solr and I'm doing my first installation under tomcat, I followed the documentation on link ( http://wiki.apache.org/solr/SolrTomcat#Installing_Tomcat_6) but there are some problems. The http://localhost:8080/solr/admin works fine, but in some cases, for example to see my

Re: Solr Indexing slows down

2010-07-30 Thread Peter Karich
Hi Erick! thanks for the response! I will answer your questions ;-) How often are you making changes to your index? Every 30-60 seconds. Too heavy? Do you have autocommit on? No. Do you commit when updating each document? No. I commit after a batch update of 200 documents Committing

Re: Programmatically retrieving numDocs (or any other statistic)

2010-07-30 Thread Peter Karich
Both approaches are ok, I think. (although I don't know the python API) BTW: If you query q=*:* then add rows=0 to avoid some traffic. Regards, Peter. I want to programmatically retrieve the number of indexed documents. I.e., get the value of numDocs. The only two ways I've come up with are
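
A minimal Python sketch of the q=*:* with rows=0 approach follows. It assumes the JSON response writer (wt=json) is enabled and that the response has the usual response/numFound shape; the base URL, the helper names and the sample body are made up for the example, and the actual HTTP request is left out so the snippet runs offline.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr"  # hypothetical Solr location


def count_url(base_url):
    # rows=0 asks for the hit count only, so no documents are transferred.
    params = {"q": "*:*", "rows": 0, "wt": "json"}
    return base_url + "/select?" + urlencode(params)


def num_found(response_body):
    # Pull the total hit count out of a JSON select response.
    return json.loads(response_body)["response"]["numFound"]


# Hand-written sample body, standing in for a real server response.
SAMPLE_BODY = (
    '{"responseHeader": {"status": 0},'
    ' "response": {"numFound": 1234, "start": 0, "docs": []}}'
)

print(count_url(BASE_URL))
print(num_found(SAMPLE_BODY))
```

In real use the body would come from fetching count_url(BASE_URL), for example with urllib.request.urlopen.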

Re: Solr searching performance issues, using large documents

2010-07-30 Thread Peter Karich
Hi Peter :-), did you already try other values for hl.maxAnalyzedChars=2147483647 ? Also regular expression highlighting is more expensive, I think. What does the 'fuzzy' variable mean? If you use this to query via ~someTerm instead of someTerm then you should try the trunk of solr which is a lot

Good list of English words that get butchered by Porter Stemmer

2010-07-30 Thread Otis Gospodnetic
Hello, I'm looking for a list of English words that, when stemmed by Porter stemmer, end up in the same stem as some similar, but unrelated words. Below are some examples: # this gets stemmed to iron, so if you search for ironic, you'll get iron matches ironic # same stem as animal anime

Re: Solr Indexing slows down

2010-07-30 Thread Otis Gospodnetic
Peter, there are events in solrconfig where you define warm up queries when a new searcher is opened. There are also cache settings that play a role here. 30-60 seconds is pretty frequent for Solr. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search ::

Re: Programmatically retrieving numDocs (or any other statistic)

2010-07-30 Thread Otis Gospodnetic
I suppose you could write a component that just gets this info from SolrIndexSearcher and write that in the response? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: John DeRosa

Re: question about relevance

2010-07-30 Thread Otis Gospodnetic
May I suggest looking at some of the related issues, say SOLR-1682 This issue is related to: SOLR-1682 Implement CollapseComponent SOLR-1311 pseudo-field-collapsing LUCENE-1421 Ability to group search results by field SOLR-1773 Field Collapsing (lightweight version)

Re: Good list of English words that get butchered by Porter Stemmer

2010-07-30 Thread Walter Underwood
Some collisions are listed here: http://www.attivio.com/blog/34-attivio-blog/333-doing-things-with-words-part-three-stemming-and-lemmatization.html Have you asked Martin Porter? You can find his e-mail here: http://tartarus.org/~martin/ wunder On Jul 30, 2010, at 1:41 PM, Otis Gospodnetic

Stem collision, word protection, synonym hack

2010-07-30 Thread Otis Gospodnetic
Hello, I'm wondering if anyone has good ideas for handling the following (Porter) stemming problem. The word city gets stemmed to citi. But citi is short for citibank, so we have a conflict - the stems of both city and citi are citi, so when you search for city, you will get matches that
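
To make the collision concrete, here is a small Python sketch that groups words by stem and reports the groups with more than one member. It does not run a real Porter stemmer: the STEMS table is a hand-written stand-in that only records the stems mentioned on this list (city and citi, ironic and iron), plus "solr" as an assumed non-colliding word, and the function name find_collisions is hypothetical.

```python
from collections import defaultdict

# Stand-in for a stemmer: word -> stem, limited to stems quoted in the
# threads above; the "solr" entry is an assumption added as a control.
STEMS = {
    "city": "citi",
    "citi": "citi",
    "ironic": "iron",
    "iron": "iron",
    "solr": "solr",
}


def find_collisions(words, stem):
    """Return {stem: [words]} for every stem shared by several words."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return {s: sorted(ws) for s, ws in groups.items() if len(ws) > 1}


collisions = find_collisions(STEMS, STEMS.get)
print(collisions)
```

Swapping STEMS.get for a real stemming function over a larger word list would give a candidate list of words to protect or override.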

Re: Stem collision, word protection, synonym hack

2010-07-30 Thread Robert Zotter
Otis, https://issues.apache.org/jira/browse/LUCENE-2055 may be of some help. cheers On 7/30/10 2:18 PM, Otis Gospodnetic wrote: Hello, I'm wondering if anyone has good ideas for handling the following (Porter) stemming problem. The word city gets stemmed to citi. But citi is short for

Re: Programmatically retrieving numDocs (or any other statistic)

2010-07-30 Thread John DeRosa
Thanks! On Jul 30, 2010, at 1:11 PM, Peter Karich wrote: Both approaches are ok, I think. (although I don't know the python API) BTW: If you query q=*:* then add rows=0 to avoid some traffic. Regards, Peter. I want to programmatically retrieve the number of indexed documents. I.e.,

Re: Good list of English words that get butchered by Porter Stemmer

2010-07-30 Thread Yonik Seeley
On Fri, Jul 30, 2010 at 4:41 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: I'm looking for a list of English words that, when stemmed by Porter stemmer, end up in the same stem as some similar, but unrelated words. Below are some examples: # this gets stemmed to iron, so if you

Some basic DataImportHandler questions

2010-07-30 Thread Harry Smith
Just starting with DataImportHandler and had a few simple questions. Is there a location for more in depth documentation other than http://wiki.apache.org/solr/DataImportHandler? Specifically I was looking for a detailed document outlining data-config.xml, the fields and attributes and how they

Re: Solr Indexing slows down

2010-07-30 Thread Peter Karich
Hi Otis, does it mean that a new searcher is opened after I commit? I thought only on startup...(?) Regards, Peter. Peter, there are events in solrconfig where you define warm up queries when a new searcher is opened. There are also cache settings that play a role here. 30-60 seconds is

RE: Good list of English words that get butchered by Porter Stemmer

2010-07-30 Thread Burton-West, Tom
A good starting place might be the list of stemming errors for the original Porter stemmer in this article that describes k-stem: Krovetz, R. (1993). Viewing morphology as an inference process. In Proceedings of the 16th annual international ACM SIGIR conference on Research and development in

Re: Good list of English words that get butchered by Porter Stemmer

2010-07-30 Thread Robert Muir
Otis, I think this is a great idea. you could also go even further by making a better example for StemmerOverrideFilter (stemdict.txt) ( http://wiki.apache.org/solr/LanguageAnalysis#solr.StemmerOverrideFilterFactory ) for example: animated tab animate animation tab animation animations tab
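
The "word tab stem" entries above suggest a tab-separated override file, one pair per line. The Python sketch below formats such lines from a dictionary; it uses only the two complete pairs visible in the message, the function name format_stemdict is hypothetical, and the exact file format should be checked against the wiki page linked above.

```python
def format_stemdict(overrides):
    """Render word -> stem pairs as tab-separated lines, one per word."""
    return "".join("%s\t%s\n" % (word, stem) for word, stem in overrides.items())


# Only the pairs that appear in full in the message above.
OVERRIDES = {
    "animated": "animate",
    "animation": "animation",
}

stemdict_text = format_stemdict(OVERRIDES)
print(stemdict_text, end="")
```

The resulting text could then be written to a file such as stemdict.txt and referenced from the filter configuration.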

Re: Solr searching performance issues, using large documents

2010-07-30 Thread Lance Norskog
Wait- how much text are you highlighting? You say these logfiles are X big- how big are the actual documents you are storing? On Fri, Jul 30, 2010 at 1:16 PM, Peter Karich peat...@yahoo.de wrote: Hi Peter :-), did you already try other values for hl.maxAnalyzedChars=2147483647 ? Also

Re: Solr Indexing slows down

2010-07-30 Thread Otis Gospodnetic
As you make changes to your index, you probably want to see the new/modified documents in your search results. In order to do that, the new searcher needs to be reopened, and this happens on commit. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search ::

Re: Some basic DataImportHandler questions

2010-07-30 Thread Shalin Shekhar Mangar
On Sat, Jul 31, 2010 at 3:40 AM, Harry Smith harrysmith...@gmail.com wrote: Just starting with DataImportHandler and had a few simple questions. Is there a location for more in depth documentation other than http://wiki.apache.org/solr/DataImportHandler? Umm, no, but let us know what is not

Re: Programmatically retrieving numDocs (or any other statistic)

2010-07-30 Thread Chris Hostetter
: I want to programmatically retrieve the number of indexed documents. I.e., get the value of numDocs. Index level stats like this can be fetched from the LukeRequestHandler in any recent version of Solr... http://localhost:8983/solr/admin/luke?numTerms=0 In future releases (ie: