Re: Fine Tuning Lucene implementation

2007-07-25 Thread Dmitry
Askar, why do you need to add +id:idWeCareAbout? thanks, dt, www.ejinz.com search engine news forms - Original Message - From: Askar Zaidi [EMAIL PROTECTED] To: java-user@lucene.apache.org; [EMAIL PROTECTED] Sent: Wednesday, July 25, 2007 12:39 AM Subject: Re: Fine Tuning Lucene

Re: Lucene and Eastern languages (Japanese, Korean and Chinese)

2007-07-25 Thread Mathieu Lecarme
Le mardi 24 juillet 2007 à 13:01 -0700, Shaw, James a écrit : Hi, guys, I found Analyzers for Japanese, Korean and Chinese, but not stemmers; the Snowball stemmers only include European languages. Does stemming not make sense for ideograph-based languages (i.e., no stemming is needed for

Re: What replaced org.apache.lucene.document.Field.Text?

2007-07-25 Thread Patrick Kimber
Hi Andy I think: Field.Text(name, value); has been replaced with: new Field(name, value, Field.Store.YES, Field.Index.TOKENIZED); Patrick On 25/07/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Please reference How do I get code written for Lucene 1.4.x to work with Lucene 2.x?
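For readers hitting the same 1.4-to-2.x migration, a minimal sketch of the replacement Patrick describes (field name and value are placeholder examples):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldMigration {
    public static void main(String[] args) {
        Document doc = new Document();
        // Lucene 1.4.x: doc.add(Field.Text("title", "Lucene in Action"));
        // Lucene 2.x equivalent: stored and tokenized
        doc.add(new Field("title", "Lucene in Action",
                          Field.Store.YES, Field.Index.TOKENIZED));
        System.out.println(doc.get("title"));
    }
}
```

The other 1.4 factory methods map the same way: Field.Keyword becomes Store.YES plus Index.UN_TOKENIZED, and Field.UnStored becomes Store.NO plus Index.TOKENIZED.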

Re: Search for null

2007-07-25 Thread daniel rosher
You will be unable to search for fields that do not exist which is what you originally wanted to do, instead you can do something like: -Establish the query that will select all non-null values TermQuery tq1 = new TermQuery(new Term(field,value1)); TermQuery tq2 = new TermQuery(new
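A sketch of the approach Daniel outlines, assuming the set of non-null values is known up front (field and value names are hypothetical): match everything, then exclude documents carrying any of the known values.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class NullFieldQuery {
    // Select documents where "field" has none of the enumerated values.
    public static Query build(String field, String[] knownValues) {
        BooleanQuery bq = new BooleanQuery();
        bq.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
        for (int i = 0; i < knownValues.length; i++) {
            bq.add(new TermQuery(new Term(field, knownValues[i])),
                   BooleanClause.Occur.MUST_NOT);
        }
        return bq;
    }
}
```

This only works when the values can be enumerated; the follow-up messages in this thread address the free-text case with a filter instead.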

Recovering from a Crash

2007-07-25 Thread Simon Wistow
We were affected by the great SF outage yesterday and apparently the indexing machine crashed without being shutdown properly. I've taken a backup of the indexes which has the usual smattering of write.lock segments.gen, .cfs, .fdt, .fnm and .fdx etc files and looks to be about the right size.

Re: Recovering from a Crash

2007-07-25 Thread Michael McCandless
Simon Wistow [EMAIL PROTECTED] wrote: We were affected by the great SF outage yesterday and apparently the indexing machine crashed without being shutdown properly. Eek, sorry! We are so reliant on electricity these days I've taken a backup of the indexes which has the usual smattering

Re: Recovering from a Crash

2007-07-25 Thread Simon Wistow
On Wed, Jul 25, 2007 at 10:08:56AM +0100, me said: The data appears to be there - please tell me that I'm doing something stupid and I can recover from this. It appears by deleting the write.lock files everything has recovered. Is this best practice? Have I just done something so terribly
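Rather than deleting the write.lock file by hand, the Lucene 2.x API offers a programmatic way to clear a stale lock. A sketch (the index path is an example, and this is only safe once you are certain no writer process is still alive):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class Unlock {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.getDirectory("/path/to/index");
        if (IndexReader.isLocked(dir)) {
            IndexReader.unlock(dir);  // releases the stale write lock
        }
        dir.close();
    }
}
```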

Re: Lucene and Eastern languages (Japanese, Korean and Chinese)

2007-07-25 Thread Maximilian Hütter
Mathieu Lecarme schrieb: Le mardi 24 juillet 2007 à 13:01 -0700, Shaw, James a écrit : Hi, guys, I found Analyzers for Japanese, Korean and Chinese, but not stemmers; the Snowball stemmers only include European languages. Does stemming not make sense for ideograph-based languages (i.e., no

Re: Recovering from a Crash

2007-07-25 Thread Simon Wistow
On Wed, Jul 25, 2007 at 05:19:31AM -0400, Michael McCandless said: It's somewhat spooky that you have a write.lock present because that means you backed up while a writer was actively writing to the index which is a bit dangerous because if the timing is unlucky (backup does an ls but before

Re: Recovering from a Crash

2007-07-25 Thread Michael McCandless
Simon Wistow [EMAIL PROTECTED] wrote: On Wed, Jul 25, 2007 at 05:19:31AM -0400, Michael McCandless said: It's somewhat spooky that you have a write.lock present because that means you backed up while a writer was actively writing to the index which is a bit dangerous because if the timing

Re: Recovering from a Crash

2007-07-25 Thread Michael McCandless
The data appears to be there - please tell me that I'm doing something stupid and I can recover from this. It appears by deleting the write.lock files everything has recovered. Hmmm -- it's odd that the existence of the write.lock caused you to lose most of your index. All that should have

Re: Recovering from a Crash

2007-07-25 Thread Simon Wistow
On Wed, Jul 25, 2007 at 05:49:41AM -0400, Michael McCandless said: Ahhh, OK. But do you have a segments_N file? Yup. Yes, this is perfect. This is the simple option I described. The more complex option is to use a custom deletion policy which enables you to safely do backups (even if the

Re: Recovering from a Crash

2007-07-25 Thread Michael McCandless
Simon Wistow [EMAIL PROTECTED] wrote: On Wed, Jul 25, 2007 at 05:49:41AM -0400, Michael McCandless said: Ahhh, OK. But do you have a segments_N file? Yup. OK, though I still don't understand why the existence of write.lock caused you to lose most of your index on creating a new writer.

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Stanislaw Osinski
Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really limited by JavaCC speed. You cannot shave much more performance out of the grammar as it is already about as simple as it gets. JavaCC is slow indeed. We used it for a while for Carrot2, but then (3 years ago :) switched to

Which field matched ?

2007-07-25 Thread makkhar
This problem has been baffling me for quite some time now and has no perfect solution in the forum! I have 10 documents, each with 10 fields with parameterName and parameterValue. Now, when I search for some term and I get 5 hits, how do I find out which paramName-Value pair matched ? I am

Re: Which field matched ?

2007-07-25 Thread makkhar
Currently, we use regular expression pattern matching to get hold of which field matched. Again a pathetic solution since we have to agree upon the subset of the lucene search and pattern matching. We cannot use Boolean queries etc in this case. makkhar wrote: This problem has been

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Stanislaw Osinski
I am sure a faster StandardAnalyzer would be greatly appreciated. I'm increasing the priority of that task then :) StandardAnalyzer appears widely used and horrendously slow. Even better would be a StandardAnalyzer that could have different recognizers enabled/disabled. For example,

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Grant Ingersoll
On Jul 25, 2007, at 7:19 AM, Stanislaw Osinski wrote: Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really limited by JavaCC speed. You cannot shave much more performance out of the grammar as it is already about as simple as it gets. JavaCC is slow indeed. We used it for

Lucene Highlighter linkage Error

2007-07-25 Thread ki
Hello! I am working with Tomcat. I have put the Lucene highlighter.jar in the folder lib. And I have created an extra CSS file, where I say that the background color has to be yellow. The search word has to be highlighted now. I have got a dataTable in which the result of the following Lucene method

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Askar Zaidi
Hey Guys, I need to know how I can use the HitCollector class? I am using Hits and looping over all the possible document hits (turns out it's 92 times I am looping; for 300 searches, it's 300*92 !!). Can I avoid this using HitCollector? I can't seem to understand how it's used. thanks a lot,
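For context, the Lucene 2.x HitCollector is an abstract class with a single callback; instead of looping over a Hits object, the searcher pushes every matching doc/score pair at you. A minimal sketch (what you do inside collect() is up to the application):

```java
import org.apache.lucene.search.HitCollector;

// Accumulates scores as the searcher visits each matching document.
public class ScoreCollector extends HitCollector {
    public float total = 0.0f;

    public void collect(int doc, float score) {
        total += score;  // called once per matching document
    }
}
// usage: ScoreCollector c = new ScoreCollector();
//        searcher.search(query, c);
//        float sum = c.total;
```

Unlike Hits, this avoids re-executing the search when iterating past the first 100 results, but it gives you raw, unsorted doc IDs.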

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Grant Ingersoll
Hi Askar, I suggest we take a step back, and ask the question, what are you trying to accomplish? That is, what is your application trying to do? Forget the code, etc. just explain what you want the end result to be and we can work from there. Based on what you have described, I am

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Askar Zaidi
Hi Grant, Thanks for the response. Here's what I am trying to accomplish: 1. Iterate over itemID (unique) in the database using one SQL query. 2. For every itemID found, run 4 searches on Lucene Index. 3. doTagSearch(itemID) ; collect score 4. doTitleSearch(itemID...) ; collect score 5.

Re: Search for null

2007-07-25 Thread Jay Yu
what if I do not know all possible values of that field which is a typical case in a free text search? daniel rosher wrote: You will be unable to search for fields that do not exist which is what you originally wanted to do, instead you can do something like: -Establish the query that will

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Grant Ingersoll
So, you really want a single Lucene score (based on the scores of your 4 fields) for every itemID, correct? And this score consists of scoring the title, tag, summary and body against some keywords correct? Here's what I would do: while (rs.next()) { doc = getDocument(itemId); // Get

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Askar Zaidi
Instead of refactoring the code, would there be a way to just modify the query in each search routine ? Such as, search contents:text and item:itemID; This means it would just collect the score of that one document whose itemID field = itemID passed from while(rs.next()). I just need to collect
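The query modification Askar describes can be built programmatically rather than by string concatenation. A sketch, assuming the field names from the thread ("contents", "itemID") and that itemID is indexed untokenized:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class SingleItemQuery {
    // Builds +contents:<keywords> +itemID:<id>, restricting the score
    // to the one document for that item.
    public static Query build(String keywords, String itemId) throws Exception {
        Query contents = new QueryParser("contents",
                new StandardAnalyzer()).parse(keywords);
        BooleanQuery bq = new BooleanQuery();
        bq.add(contents, BooleanClause.Occur.MUST);
        bq.add(new TermQuery(new Term("itemID", itemId)),
               BooleanClause.Occur.MUST);
        return bq;
    }
}
```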

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Grant Ingersoll
Yes, you can do that. On Jul 25, 2007, at 12:31 PM, Askar Zaidi wrote: Heres what I mean: http://lucene.apache.org/java/docs/queryparsersyntax.html#Fields title:The Right Way AND text:go Although, I am not searching for the title the right way , I am looking for the score by specifying

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Askar Zaidi
Here's what I mean: http://lucene.apache.org/java/docs/queryparsersyntax.html#Fields title:The Right Way AND text:go Although, I am not searching for the title the right way, I am looking for the score by specifying a unique field (itemID). when I do System.out.println(query); I get:

Re: Search for null

2007-07-25 Thread daniel rosher
In this case you should look at the source for RangeFilter.java. Using this you could create your own filter using TermEnum and TermDocs to find all documents that had some value for the field. You would then flip this filter (perhaps write a FlipFilter.java, that takes an existing filter in
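A sketch of the "FlipFilter" Daniel suggests, against the Lucene 2.x Filter API (where a filter returns a BitSet; later versions replaced this with DocIdSet): wrap an existing filter and invert its bits, so documents *without* any value in the field are selected.

```java
import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Filter;

public class FlipFilter extends Filter {
    private final Filter inner;

    public FlipFilter(Filter inner) { this.inner = inner; }

    public BitSet bits(IndexReader reader) throws IOException {
        // Copy so the wrapped filter's (possibly cached) bits are untouched.
        BitSet bits = (BitSet) inner.bits(reader).clone();
        bits.flip(0, reader.maxDoc());
        return bits;
    }
}
```

Note that the flipped set will also include deleted documents; the searcher normally skips those, but it is worth remembering if you count bits directly.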

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Askar Zaidi
Hey guys, One last question and I think I'll have an optimized algorithm. How can I build a query in my program? This is what I am doing: QueryParser queryParser = new QueryParser(contents, new StandardAnalyzer()); queryParser.setDefaultOperator(QueryParser.Operator.AND); Query q =

Re: What replaced org.apache.lucene.document.Field.Text?

2007-07-25 Thread Lindsey Hess
Andy, Patrick, Thank you. I replaced Field.Text with new Field(name, value, Field.Store.YES, Field.Index.TOKENIZED); and it works just fine. Cheers, Lindsey Patrick Kimber [EMAIL PROTECTED] wrote: Hi Andy I think: Field.Text(name, value); has been replaced with:

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Grant Ingersoll
On Jul 25, 2007, at 1:26 PM, Askar Zaidi wrote: Hey guys, One last question and I think I'll have an optimized algorithm. How can I build a query in my program ? This is what I am doing: QueryParser queryParser = new QueryParser(contents, new StandardAnalyzer());

MoreLikeThis for multiple documents

2007-07-25 Thread Jens Grivolla
Hello, I'm looking to extract significant terms characterizing a set of documents (which in turn relate to a topic). This basically comes down to functionality similar to determining the terms with the greatest offer weight (as used for blind relevance feedback), or maximizing tf.idf (as is

Assembling a query from multiple fields

2007-07-25 Thread Joe Attardi
Hi all, Apologies for the cryptic subject line, but I couldn't think of a more descriptive one-liner to describe my problem/question to you all. Still fairly new to Lucene here, although I'm hoping to have more of a clue once I get a chance to read Lucene In Action. I am implementing a search
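One common answer to this kind of problem is MultiFieldQueryParser, which parses a single user query string against several fields at once. A sketch (the field names are illustrative, not from the original message):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.search.Query;

public class MultiFieldSearch {
    public static Query build(String userInput) throws Exception {
        String[] fields = { "title", "summary", "body" };
        MultiFieldQueryParser parser =
                new MultiFieldQueryParser(fields, new StandardAnalyzer());
        // A term matches if it occurs in any of the listed fields.
        return parser.parse(userInput);
    }
}
```

For finer control (e.g. requiring a term in one field but merely boosting it in another), building a BooleanQuery by hand from per-field parsers gives the same result with explicit clauses.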

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Yonik Seeley
On 7/25/07, Stanislaw Osinski [EMAIL PROTECTED] wrote: JavaCC is slow indeed. JavaCC is a very fast parser for a large document... the issue is small fields and JavaCC's use of an exception for flow control at the end of a value. As JVMs have advanced, exception-as-control-flow has gotten

Linear Hashing in Lucene?

2007-07-25 Thread Dmitry
Hey, Some common questions about Lucene. 1. Does an Ontology Wrapper exist in the Lucene implementation? 2. Does Lucene use Linear Hashing? thanks, DT, www.ejinz.com Search news - To unsubscribe, e-mail: [EMAIL PROTECTED] For

Re: Search for null

2007-07-25 Thread Daniel Noll
On Thursday 26 July 2007 03:12:20 daniel rosher wrote: In this case you should look at the source for RangeFilter.java. Using this you could create your own filter using TermEnum and TermDocs to find all documents that had some value for the field. That's certainly the way to do it for speed.

Highlighter strategy in Lucene

2007-07-25 Thread Dmitry
What kind of Highlighter strategy is Lucene using? thanks, Dt www.ejinz.com Search Engine for News

Displaying results in the order

2007-07-25 Thread Dmitry
Is there a way to update a document in the Index without causing any change to the order in which it comes up in searches? thanks, DT, www.ejinz.com Search everything news, tech, movies, music

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Askar Zaidi
Hey Guys, Thanks for all the responses. I finally got it working with some query modification. The idea was to pick an itemID from the database and for that itemID in the Index, get the scores across 4 fields; add them up and ta-da ! I still have to verify my scores. Thanks a ton, I'll be

java gc with a frequently changing index?

2007-07-25 Thread Tim Sturge
Hi, I am indexing a set of constantly changing documents. The change rate is moderate (about 10 docs/sec over a 10M document collection with a 6G total size) but I want to be right up to date (ideally within a second but within 5 seconds is acceptable) with the index. Right now I have code
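The usual pattern for this in the Lucene of that era is to share one IndexSearcher across queries and swap in a fresh one on a timer, rather than opening a searcher per query. A sketch (the path is an example, and the close() here is unsafe if queries are still in flight; production code needs reference counting or a grace period):

```java
import org.apache.lucene.search.IndexSearcher;

public class SearcherHolder {
    private IndexSearcher current;
    private final String indexPath = "/path/to/index";

    public synchronized IndexSearcher get() {
        return current;
    }

    // Call periodically (e.g. every few seconds) after the writer commits.
    public synchronized void refresh() throws Exception {
        IndexSearcher fresh = new IndexSearcher(indexPath);
        IndexSearcher old = current;
        current = fresh;
        if (old != null) {
            old.close();  // see caveat above about in-flight queries
        }
    }
}
```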

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Doron Cohen
Askar Zaidi wrote: ... Heres what I am trying to accomplish: 1. Iterate over itemID (unique) in the database using one SQL query. 2. For every itemID found, run 4 searches on Lucene Index. 3. doTagSearch(itemID) ; collect score 4. doTitleSearch(itemID...) ; collect score 5.

Re: Query parsing?

2007-07-25 Thread Daniel Naber
On Wednesday 25 July 2007 00:44, Lindsey Hess wrote: Now, I do not need Lucene to index anything, but I'm wondering if Lucene has query parsing classes that will allow me to transform the queries. The Lucene QueryParser class can parse the format described at

Delete corrupted doc

2007-07-25 Thread Rafael Rossini
Hi guys, Is there a way of deleting a document that, because of some corruption, got a docID larger than maxDoc()? I'm trying to do this but I get this Exception: Exception in thread main java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 106577 at