Re: Can I exclude certain terms from MoreLikeThis query?

2009-08-25 Thread Koji Sekiguchi
Paras Chopra wrote: Hi All, I am tinkering with MoreLikeThis component of Solr and had a particular use case where I would like to exclude certain terms from consideration while MoreLikeThis makes a query vector out of a document. Is it possible with Solr? I searched for this in the

Re: Can I exclude certain terms from MoreLikeThis query?

2009-08-25 Thread Paras Chopra
Hi Koji, Thank you for your reply. Actually, the terms I would like to exclude would be based on the document I use for MoreLikeThis Query. As I understand from StopFilter, it is a static method to exclude terms such as stop words. My problem is that I want to return theme/area specific results

Re: why would a search for a specific field value fail when data is present?

2009-08-25 Thread Shalin Shekhar Mangar
On Tue, Aug 25, 2009 at 2:04 AM, Brian Klippel br...@theport.com wrote: Hopefully, someone can tell me what is going wrong here. I have a field, SearchObjectType, and a large number of the documents indexed in a given core have a value of USER_PROFILE. When I examine the schema browser

Re: defining qf in your own request handler

2009-08-25 Thread Shalin Shekhar Mangar
On Tue, Aug 25, 2009 at 5:26 AM, darniz rnizamud...@edmunds.com wrote: Continuing on this, I have a use case where I have to strip out single quotes for certain fields. For example, for testing I added the following fieldType in the schema.xml file: fieldType name=removeComma

Re: Solr Query help - sorting

2009-08-25 Thread Constantijn Visinescu
make a new multivalued field in your schema.xml, copy both width and length into that field, and then sort on that field? On Tue, Aug 25, 2009 at 5:40 AM, erikea...@yahoo.com erikea...@yahoo.com wrote: Clever... but if more than one row adds up to the same value I may get the wrong order (like

Re: solr nutch url indexing

2009-08-25 Thread Thibaut Lassalle
Thanks for your help. I use the default Nutch configuration and I use solrindex to give the Nutch result to Solr. I have results when I query, therefore Nutch works properly (it gives a url, title, content ...). I would like to query Solr so that it emphasizes the title field rather than the content field.

Re: Solr Query help - sorting

2009-08-25 Thread Erik Hatcher
You couldn't sort on a multiValued field though. I'd simply index a max_side field, and have the indexing client add a single valued field with max(length,width) to it. Then sort on max_side. Erik On Aug 25, 2009, at 4:00 AM, Constantijn Visinescu wrote: make a new multivalued
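A minimal sketch of the client-side approach Erik Hatcher describes, assuming documents are plain dictionaries with numeric `length` and `width` values. The helper name `add_max_side` and the sample documents are hypothetical; only the `max_side` field name comes from the thread. The local sort stands in for what Solr would do when sorting on `max_side`.

```python
def add_max_side(doc):
    """Return a copy of doc with a single-valued max_side field added."""
    enriched = dict(doc)
    enriched["max_side"] = max(doc["length"], doc["width"])
    return enriched


# Hypothetical sample documents, as the indexing client might see them.
docs = [
    {"id": "a", "length": 3, "width": 7},
    {"id": "b", "length": 5, "width": 5},
    {"id": "c", "length": 9, "width": 2},
]

indexed = [add_max_side(d) for d in docs]

# Solr would sort on max_side; here we sort locally to show the ordering.
ordered_ids = [d["id"] for d in sorted(indexed, key=lambda d: d["max_side"])]
print(ordered_ids)
```

Because the value is computed before the document is sent, the field stays single-valued and therefore sortable.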

Re: Can I exclude certain terms from MoreLikeThis query?

2009-08-25 Thread Koji Sekiguchi
Hi Paras, As I understand from StopFilter, it is a static method to exclude terms such as stop words. Correct. As far as I know, to control what words MLT component chooses for generating BooleanQuery, what you can do is that you specify the following parameters: mlt.mintf Minimum Term
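As a rough illustration of passing MLT parameters on a request, the sketch below builds a query URL with the standard library. The host, path, document id, and field names are assumptions for illustration. Only `mlt.mintf` is named in the message above; `mlt.fl` and `mlt.mindf` are other MoreLikeThis parameters as I understand them, so check them against your Solr version.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; adjust host, port, and path to your installation.
SOLR_SELECT = "http://localhost:8983/solr/select"

params = {
    "q": "id:12345",          # the source document (hypothetical id)
    "mlt": "true",            # enable the MoreLikeThis component
    "mlt.fl": "title,body",   # fields to draw terms from (hypothetical names)
    "mlt.mintf": 2,           # minimum term frequency in the source document
    "mlt.mindf": 5,           # minimum number of documents containing the term
}

url = SOLR_SELECT + "?" + urlencode(params)
print(url)
```

Raising `mlt.mintf` or `mlt.mindf` trims rare terms from the generated query, but neither parameter lets you name specific terms to drop.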

Re: Solr Query help - sorting

2009-08-25 Thread Koji Sekiguchi
Hi Erik Earle, Ahh, I read your mail too fast... Erik Hatcher's method should work. Thanks! Koji Erik Hatcher wrote: You couldn't sort on a multiValued field though. I'd simply index a max_side field, and have the indexing client add a single valued field with max(length,width) to it.

Re: Can I exclude certain terms from MoreLikeThis query?

2009-08-25 Thread Paras Chopra
Hi Koji, I have already used MLT parameters to refine the query but still I'd like to exclude additional terms. I was just going through some docs online and came across filterQuery mechanism. Won't specifying fq=~term1+~term2 do the job? Thanks Paras Chopra On Tue, Aug 25, 2009 at 4:08 PM, Koji

Re: solr nutch url indexing

2009-08-25 Thread Uri Boness
It seems to me that this configuration actually does what you want - queries on title mostly. The default search field doesn't influence a dismax query. I suggest including the debugQuery=true parameter; it will help you figure out how the matching is performed. You can read more
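A small sketch of a dismax request that weights the title field and turns on debug output. The host, query text, and boost value are assumptions. Depending on the Solr version and solrconfig.xml, dismax may be selected through a dedicated request handler rather than `defType`.

```python
from urllib.parse import urlencode

params = {
    "q": "solr nutch",          # hypothetical user query
    "defType": "dismax",        # assumes the dismax parser is chosen this way
    "qf": "title^5 content",    # boost title matches over content matches
    "debugQuery": "true",       # include scoring explanations in the response
}

url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```

The debug section of the response shows the parsed query, which makes it easy to confirm that the `qf` boosts were applied.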

Re: Can I exclude certain terms from MoreLikeThis query?

2009-08-25 Thread Koji Sekiguchi
Hi Paras, Won't specifying fq=~term1+~term2 do the job? From a brief look at the source, it seems that the MLT handler (not the component) uses the fq parameter, so if you use the MLT handler, it does the job. Koji Paras Chopra wrote: Hi Koji, I have already used MLT parameters to refine the query but still
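One caveat worth sketching: in Lucene query syntax `~` marks fuzzy or proximity matching rather than negation, so an exclusion filter would more likely use a leading `-`. The example below assumes a MoreLikeThis handler registered at `/mlt` and a hypothetical `text` field; the helper `exclusion_filter` is also hypothetical. Note that `fq` filters matching documents out of the results; it does not remove terms from the query that MLT generates.

```python
from urllib.parse import urlencode


def exclusion_filter(field, terms):
    """Build a filter query that rejects documents containing any term."""
    return " ".join(f"-{field}:{term}" for term in terms)


fq = exclusion_filter("text", ["term1", "term2"])

params = {
    "q": "id:12345",    # the source document (hypothetical id)
    "mlt.fl": "text",   # field to draw similar-document terms from
    "fq": fq,           # drop results that contain the unwanted terms
}

url = "http://localhost:8983/solr/mlt?" + urlencode(params)
print(fq)
print(url)
```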

Lucene Meetup - September 3, Mountain View, CA

2009-08-25 Thread Erik Hatcher
Announcing a new Meetup for SFBay Apache Lucene/Solr Meetup! What: SFBay Apache Lucene/Solr June Meetup When: September 3, 2009 6:30 PM Where: Computer History Museum, 1401 N Shoreline Blvd, Mountain View, CA 94043 Presentations and discussions on Lucene/Solr, the Apache Open Source Search

Re: multi-language search

2009-08-25 Thread Elaine Li
Uri, Thanks a lot! I don't need to do cross-language search. So Option 2 sounds better, because my corpus is very large. I am still looking for help on Chinese language search. I tried ChineseTokenizerFactory as my analyzer, but it did not help. Only words with whitespace, commas, etc. around them

query time relevancy tuning - need details

2009-08-25 Thread Fuad Efendi
query time relevancy tuning It is mentioned at http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ -What is it? Just GET request parameters for standard handler? Thanks

Re: query time relevancy tuning - need details

2009-08-25 Thread Erik Hatcher
On Aug 25, 2009, at 11:29 AM, Fuad Efendi wrote: query time relevancy tuning It is mentioned at http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ -What is it? Just GET request parameters for standard handler? To me, this primarily refers to dismax client-side parameterization of

RE: solr nutch url indexing

2009-08-25 Thread Fuad Efendi
Thanks for the link, so, SolrIndex is NOT plugin, it is an application... I use similar approach... -Original Message- From: Uri Boness Hi, Nutch comes with support for Solr out of the box. I suggest you follow the steps as described here:

Re: solr nutch url indexing

2009-08-25 Thread Uri Boness
Well... yes, it's a tool that Nutch ships with. It also ships with an example Solr schema which you can use. Fuad Efendi wrote: Thanks for the link, so, SolrIndex is NOT plugin, it is an application... I use similar approach... -Original Message- From: Uri Boness Hi, Nutch comes

Re: Responses getting truncated

2009-08-25 Thread Rupert Fiasco
Using wt=json also yields an invalid document. So after more investigation it appears that I can always break the response by pulling back a specific field via the fl parameter. If I leave off a field then the response is valid, if I include it then Solr yields an invalid document - a truncated

Re: Responses getting truncated

2009-08-25 Thread Avlesh Singh
Can you copy-paste the source data indexed in this field which causes the error? Cheers Avlesh On Tue, Aug 25, 2009 at 10:01 PM, Rupert Fiasco rufia...@gmail.com wrote: Using wt=json also yields an invalid document. So after more investigation it appears that I can always break the response

Re: Responses getting truncated

2009-08-25 Thread Rupert Fiasco
The text file at: http://brockwine.com/solr.txt Represents one of these truncated responses (this one in XML). It starts out great, then look at the bottom, boom, game over. :) I found this document by first running our bigger search which breaks and then zeroing in a specific broken document

Re: solr.StopFilterFactory not filtering words

2009-08-25 Thread darniz
Thanks Yonik. So the way the StopFilter works is that if I give a string like "the elephant is an animal", then when I retrieve the document the stored value will always be the same; only the indexing will be done on "elephant" and "animal". I was under the impression that Solr automatically takes out those words

Re: Adding cores dynamically

2009-08-25 Thread Chris Hostetter
: We're doing similar thing with multi-core - when a core reaches : capacity (in our case 200 million records) we start a new core. We are : doing this via web service call (Create web service), this whole thread perplexes me ... while i can understand not wanting to let an index grow without

Re: solr.StopFilterFactory not filtering words

2009-08-25 Thread darniz
Thanks Yonik. So it's basically about how the field is indexed, not how it is stored. So if I give "the elephant is an animal" and try to get back the document, I should see the entire string; only the indexing is done on "elephant" and "animal". I was under the impression that when Solr loads that document it strips out
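The distinction between stored and indexed values can be shown with a toy analyzer. This is only a simulation in plain Python, not Solr code; the stop word set is an illustrative subset and `analyze` is a hypothetical stand-in for a tokenizer plus stop filter.

```python
STOP_WORDS = {"the", "is", "an"}  # illustrative subset, not Solr's default list


def analyze(text):
    """Lowercase, split on whitespace, and drop stop words (a toy analyzer)."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


stored_value = "the elephant is an animal"
indexed_terms = analyze(stored_value)

print(stored_value)    # stored field: returned to the client unchanged
print(indexed_terms)   # indexed terms: what queries are matched against
```

A search for "elephant" matches the document, while the response still carries the full original string.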

Re: Adding cores dynamically

2009-08-25 Thread Lance Norskog
One problem is the IT logistics of handling the file set. At 200 million records you have at least 20G of data in one Lucene index. It takes hours to optimize this, and 10s of minutes to copy the optimized index around to query servers. Another problem is that indexing speed drops off after the

Re: Solr Query help - sorting

2009-08-25 Thread Erik Earle
Is there a way to have the max_side field only in Solr ...as in a conditional copyField or something like that? I'd like to push as much of this into Solr as I can because the app and db that Solr is indexing are not really the best place to add this type of functionality. -

Re: Solr Query help - sorting

2009-08-25 Thread Erik Hatcher
If you're using DataImportHandler, a custom (Java or script) transformer could do this. Also an UpdateProcessor could do it. But there is no conditional copyField capabilities otherwise. Keep in mind that pragmatically, if you're doing your own indexing code, why not have a line like this?

Solr Replication

2009-08-25 Thread J G
Hello, We are running multiple slices in our environment. I have enabled JMX and I am inspecting the replication handler mbean to obtain some information about the master/slave configuration for replication. Is the replication handler mbean a singleton? I only see one mbean for the entire

Re: Responses getting truncated

2009-08-25 Thread Uri Boness
Hi, This is a very strange behavior and the fact that it is caused by one specific field, again, leads me to believe it's still a data issue. Did you try using SolrJ to query the data as well? If the same thing happens when using the binary protocol, then it's probably not a data issue. On

Solr index - Size and indexing speed

2009-08-25 Thread engy.ali
Summary === I had about 120,000 objects of total size 71.2 GB; those objects are already indexed using Lucene. The index size is about 111 GB. I tried to use a Solr 1.4 nightly build to index the same collection. I divided the collection across three servers; each server had 5 Solr instances

Re: Solr Query help - sorting

2009-08-25 Thread Erik Earle
I am indexing my data both through DataImportHandler and per transaction from JPA using @PostXXX listeners. UpdateRequestProcessor looks like exactly what I need I don't suppose there's a scriptable subclass available in 1.4 that is configured from schema.xml? :-) Thanks guys!

Re: multi-language search

2009-08-25 Thread Erik Hatcher
On Aug 25, 2009, at 10:34 AM, Elaine Li wrote: I am still looking for help on Chinese language search. I tried ChineseTokenizerFactory as my analyzer, but it did not help. Only words with whitespace, commas, etc. around them can be found. Try using the StandardTokenizerFactory - it handles

frequency of commit when building index from scratch

2009-08-25 Thread Bill Au
Just curious, how often do folks commit when building their Solr/Lucene index from scratch for index with millions of documents? Should I just wait and do a single commit at the end after adding all the documents to the index? Bill

Re: Responses getting truncated

2009-08-25 Thread Rupert Fiasco
So I whipped up a quick SolrJ client and ran it against the document that I referenced earlier. When I retrieve the doc and just print its field/value pairs to stdout it ends like this: http://brockwine.com/images/output1.png It appears to be some kind of garbage characters. -Rupert On Tue,

Re: frequency of commit when building index from scratch

2009-08-25 Thread Edward Capriolo
On Tue, Aug 25, 2009 at 5:29 PM, Bill Au bill.w...@gmail.com wrote: Just curious, how often do folks commit when building their Solr/Lucene index from scratch for index with millions of documents? Should I just wait and do a single commit at the end after adding all the documents to the index?

Re: frequency of commit when building index from scratch

2009-08-25 Thread Bill Au
That's my gut feeling (start big and go lower if OOM occurs) too. Bill On Tue, Aug 25, 2009 at 5:34 PM, Edward Capriolo edlinuxg...@gmail.com wrote: On Tue, Aug 25, 2009 at 5:29 PM, Bill Au bill.w...@gmail.com wrote: Just curious, how often do folks commit when building their Solr/Lucene
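A sketch of the "add everything, commit once" pattern raised in the question, with batch size as the value to lower if memory becomes a problem. The `send_batch` and `commit` callbacks are hypothetical stand-ins for whatever indexing client is in use; nothing here calls Solr.

```python
def batches(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def index_all(docs, batch_size, send_batch, commit):
    """Send docs in batches, then issue a single commit at the end."""
    sent = 0
    for batch in batches(docs, batch_size):
        send_batch(batch)
        sent += len(batch)
    commit()
    return sent


# Record calls in a list instead of talking to a real server.
log = []
total = index_all(
    [{"id": i} for i in range(10)],
    batch_size=4,
    send_batch=lambda b: log.append(("add", len(b))),
    commit=lambda: log.append(("commit",)),
)
print(total, log)
```

Keeping the batch size as a parameter makes it cheap to start large and reduce it later without touching the rest of the loop.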

RE: Solr index - Size and indexing speed

2009-08-25 Thread Fuad Efendi
Hi, Can you try to use single SOLR instance with heavy RAM (so that ramBufferSizeMB=8192 for instance) and mergeFactor=10? Single SOLR instance is fast enough ( 100 client threads of Tomcat; configurable) - I usually prefer single instance for single writable box with heavy RAM allocation and

RE: frequency of commit when building index from scratch

2009-08-25 Thread Fuad Efendi
I do commit once a day, millions of small docs... it takes 20 minutes on average... why OOM? I see only reduced I/O... -Original Message- From: Edward Capriolo [mailto:edlinuxg...@gmail.com] Sent: August-25-09 5:35 PM To: solr-user@lucene.apache.org Subject: Re: frequency of commit when

com.ctc.wstx.exc.WstxUnexpectedCharException error

2009-08-25 Thread Phillip Farber
I have a valid xml document that begins: <add><doc><field name="id">mdp.39015052775379</field> <field name="rights">2</field> <field name="title">Technology transfer and in-house R&amp;D in Indian industry : in the later 1990s / edited and with an introduction by Binay Kumar Pattnaik. v.1</field> <field

Incremental Deletes to Index

2009-08-25 Thread KaktuChakarabati
Hey, I was wondering - is there a mechanism in lucene and/or solr to mark a document in the index as deleted and then have this change reflect in query serving without performing the whole commit/warmup cycle? this seems to me largely appealing as it allows a kind of solution where deletes are

solr 1.4: extending StatsComponent to recognize localparm {!ex}

2009-08-25 Thread Britske
Hi, I'm looking for a way to extend StatsComponent to recognize localparams, especially the {!ex}-param. To my knowledge this isn't implemented in the current trunk. One of my use-cases for this is to be able to have a javascript price-slider, where the user can operate the slider and thus set

Re: Incremental Deletes to Index

2009-08-25 Thread Jason Rutherglen
This will be implemented as you're stating when IndexWriter.getReader is incorporated. This will carry over deletes in RAM until IW.commit is called (i.e. Solr commit). It's a fairly simple change though perhaps too late for 1.4 release? On Tue, Aug 25, 2009 at 3:10 PM,

Seattle / NW Hadoop, HBase Lucene, etc. Meetup , Wed August 26th, 6:45pm

2009-08-25 Thread Bradford Stephens
Hey there, Apologies for this not going out sooner -- apparently it was sitting as a draft in my inbox. A few of you have pinged me, so thanks for your vigilance. It's time for another Hadoop/Lucene/Apache Stack meetup! We've had great attendance in the past few months, let's keep it up! I'm

Re: Adding cores dynamically

2009-08-25 Thread vivek sar
There were two main reasons we went with multi-core solution, 1) We found the indexing speed starts dipping once the index grows to a certain size - in our case around 50G. We don't optimize, but we have to maintain a consistent index speed. The only way we could do that was keep creating new

Re: Incremental Deletes to Index

2009-08-25 Thread Jason Rutherglen
I can give an overview, IW.getReader replaces IR.reopen. So you'd replace in SolrCore.getSearcher. However as per another discussion IW isn't public yet, so all you'd need to do is expose it from UpdateHandler. Then it should work as you want, though there would need to be a new method to create a

Re: Adding cores dynamically

2009-08-25 Thread Chris Hostetter
: 1) We found the indexing speed starts dipping once the index grows to a : certain size - in our case around 50G. We don't optimize, but we have : to maintain a consistent index speed. The only way we could do that : was keep creating new cores (on the same box, though we do use Hmmm... it seems

Re: Incremental Deletes to Index

2009-08-25 Thread KaktuChakarabati
Jason, sounds like a very promising change to me - so much that I would gladly work toward creating a patch myself. Are there any specific points in the code u could point me to if I wanna look at how to start off implementing it? Lucene/Solr Classes involved etc? i'll start looking myself anyhow

Re: Responses getting truncated

2009-08-25 Thread Chris Hostetter
1. Exactly which version of Solr / SolrJ are you using? 2. ... : I am using the SolrJ client to add documents to my index. My field : is a normal text field type and the text itself is the first 1000 : characters of an article. Can you put the original (pre solr, pre solrj, raw

Re: frequency of commit when building index from scratch

2009-08-25 Thread Lance Norskog
The latest Solr 1.4 can index 200k records in several minutes, then commit in a few seconds. I don't know but I'm guessing it is due to Lucene improvements. It does not use much memory doing this. Lance On Tue, Aug 25, 2009 at 2:43 PM, Fuad Efendi f...@efendi.ca wrote: I do commit once a day,

Re: Boolean logic in distributed searches

2009-08-25 Thread Chris Hostetter
Matthew: did you ever resolve your issue? I'm not an expert on the distributed searching code, but there's no reason I know of why a basic OR type query should fail just because you're using the shards param. Are you sure both of your solr instances (solr-archway and solr-portal) are using the

Re: Which versions?

2009-08-25 Thread Chris Hostetter
: Which versions of Lucene, Nutch and Solr work together? I've : discovered that the Nutch trunk and the Solr trunk use wildly : different versions of the Lucene jars, and it's causing me problems. The Solr and Nutch projects don't really target any sort of strict binary compatibility with

Re: Putting a something as first query result

2009-08-25 Thread Chris Hostetter
: I'm a bit new to solr and have the following problem, it's about events and : venues. : If a user types a name of a venue, then I'd like to return the exact match : for the venue first and then the list of events taking place at this venue. : Currently I have defined a document bound to a

Re: Calculate Theoretical Max

2009-08-25 Thread Chris Hostetter
: Is there a way to calculate a theoretical max score for the current query? there's been some discussion on this on the java-user list over the years ... the short answer is yes it's possible, but only in very controlled situations ... as i recall it depended on limiting the set of possible

Problem with ResponseBuilder

2009-08-25 Thread Daniel Cassiano
Hi folks, I'm writing a search component for Solr and I'm having some trouble with the ResponseBuilder. I'd like to add to the response, e.g., only 5 documents of a search. My problem is when I try to add these docs to the ResponseBuilder. A snippet of the code: [...] QParser parser =

Re: frequency of commit when building index from scratch

2009-08-25 Thread Yonik Seeley
On Tue, Aug 25, 2009 at 8:37 PM, Lance Norskog goks...@gmail.com wrote: The latest Solr 1.4 can index 200k records in several minutes, then commit in a few seconds. I don't know but I'm guessing it is due to Lucene improvements. It does not use much memory doing this. If you're using SolrJ,

Re: Responses getting truncated

2009-08-25 Thread Rupert Fiasco
1. Exactly which version of Solr / SolrJ are you using? Solr Specification Version: 1.3.0 Solr Implementation Version: 1.3.0 694707 - grantingersoll - 2008-09-12 11:06:47 Latest SolrJ that I downloaded a couple of days ago. Can you put the original (pre solr, pre solrj, raw untouched,

RE: frequency of commit when building index from scratch

2009-08-25 Thread Fuad Efendi
But again, why someone has OOM??? I never had... What I discovered is: committing millions docs (in SOLR-1.4) may take several days (although adding docs takes a day) if you have somehow _many_segments_ and bad I/O with = 2 CPUs; I am using heavy ramBufferSizeMB instead of heavy mergeFactor, and

Re: Incremental Deletes to Index

2009-08-25 Thread KaktuChakarabati
So basically the idea is to replace the underlying IndexReader currently associated with a searcher/solrCore following an update without calling commit explicitly? This will also have the effect of bringing in inserts btw? or is it just usable for deletes? In terms of cache invalidation etc there

Re: Responses getting truncated

2009-08-25 Thread Chris Hostetter
: We are running an instance of MediaWiki so the text goes through a : couple of transformations: wiki markup -> html -> plain text. : It's at this last step that I take a snippet and insert that into Solr. ... : doc.addField("text_snippet_t", article.getSnippet(1000)); ok, well first off:

Re: solr 1.4: extending StatsComponent to recognize localparm {!ex}

2009-08-25 Thread Erik Hatcher
On Aug 25, 2009, at 6:35 PM, Britske wrote: Moreover, I can't seem to find the actual code in FacetComponent or anywhere else for that matter where the {!ex}-param case is treated. I assume it's in FacetComponent.refineFacets but I can't seem to get a grip on it.. Perhaps it's late here..

encoding problem

2009-08-25 Thread Bernadette Houghton
We have an encoding problem with our Solr application. That is, non-ASCII chars display fine in Solr, but as gobbledegook in our application. Our Tomcat server.xml file already contains URIEncoding="UTF-8" under the relevant connector. A Google search reveals that I should set the encoding
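One common cause of this symptom, which may or may not be what is happening here, is UTF-8 bytes being decoded with a single-byte charset somewhere between Solr and the application. The sketch below reproduces that effect with a sample string; the string and the choice of Latin-1 as the wrong charset are assumptions for illustration.

```python
original = "café"

# What a UTF-8 response body looks like on the wire.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes with the wrong charset produces mojibake.
misread = utf8_bytes.decode("latin-1")
print(misread)

# Reversing the wrong step recovers the text, which confirms the diagnosis.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired)
```

If the garbled output in the application matches this pattern, the fix is usually to make the consuming side decode the response as UTF-8 rather than to change anything in the index.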

Re: Solr Replication

2009-08-25 Thread Noble Paul നോബിള്‍ नोब्ळ्
The ReplicationHandler is not enforced as a singleton, but for all practical purposes it is a singleton for one core. If an instance (a slice as you say) is set up as a repeater, it can act as both a master and a slave. In the repeater the configuration should be as follows: MASTER