Linking Fields to Documents possible?

2009-11-23 Thread sameerpatil
Hi, I have a requirement where I have a list of Suppliers (documents for the Lucene index) and a list of Products (documents again). Each Product has a supplier, e.g.: Product - RouterX, Supplier - DLink, Netgear; Product - RouterY, Supplier - Cisco. If I search for Cisco, RouterY should show up.

How to find the fields that are indexed?

2009-11-23 Thread DHIVYA M
Sir, I am using Lucene 2.3.2. I would like to know which fields have been indexed. Ex: doc.get(path); this statement returns the path of the document. Like path, what are the other fields of the document used by Lucene? I went through converting all the class files to java

Re: How to find the fields that are indexed?

2009-11-23 Thread Ian Lea
Lucene will index and store the fields that you tell it to when a document is written to the index. In Lucene 2.4, doc.getFields() returns a List of all the fields in a document, and probably in 2.3.2 as well. See the javadoc. That will tell you the fields that have been stored but I think not
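A minimal sketch of listing a document's stored fields with the 2.x API (the index path and doc id are placeholders; note that only stored fields are visible this way):

```java
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.index.IndexReader;

public class ListStoredFields {
    public static void main(String[] args) throws Exception {
        // "/path/to/index" and doc id 0 are placeholders.
        IndexReader reader = IndexReader.open("/path/to/index");
        Document doc = reader.document(0);

        // getFields() exposes the *stored* fields of this document;
        // indexed-but-unstored fields will not appear here.
        List fields = doc.getFields();
        for (Iterator it = fields.iterator(); it.hasNext();) {
            Fieldable f = (Fieldable) it.next();
            System.out.println(f.name() + " = " + f.stringValue());
        }
        reader.close();
    }
}
```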

Re: Linking Fields to Documents possible?

2009-11-23 Thread Ian Lea
Lucene is not a database. You'll need to flatten the data and yes, that does mean duplication. -- Ian.
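A sketch of the flattening Ian describes, indexing one document per product with the supplier duplicated onto it (field names, analyzer, and index path are illustrative):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class FlattenedProductIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/path/to/index",
                new StandardAnalyzer(), true,
                IndexWriter.MaxFieldLength.UNLIMITED);

        // One document per product, with the supplier(s) duplicated onto it.
        Document routerY = new Document();
        routerY.add(new Field("product", "RouterY", Field.Store.YES, Field.Index.ANALYZED));
        routerY.add(new Field("supplier", "Cisco", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(routerY);

        Document routerX = new Document();
        routerX.add(new Field("product", "RouterX", Field.Store.YES, Field.Index.ANALYZED));
        // A product with two suppliers simply gets the field twice.
        routerX.add(new Field("supplier", "DLink", Field.Store.YES, Field.Index.ANALYZED));
        routerX.add(new Field("supplier", "Netgear", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(routerX);

        writer.close();
        // A query on supplier:cisco now returns the RouterY document.
    }
}
```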

Re: How to find the fields that are indexed?

2009-11-23 Thread Shashi Kant
Use this tool to examine the index: http://www.getopt.org/luke/ I would also suggest getting hold of a Lucene book such as Lucene In Action (http://www.manning.com/hatcher2/) to get familiar with the basics of Lucene.

Re: How to find the fields that are indexed?

2009-11-23 Thread DHIVYA M
That was a good solution to my problem and I found my fields for the document. Actually I was trying it to find out how to implement autosuggest with Lucene. Can you suggest an idea of how to use autosuggest with Lucene? Thanks in advance, Dhivya

Re: Efficient filtering advise

2009-11-23 Thread Eran Sevi
After commenting out the collector logic, the time is still more or less the same. Anyway, since collecting the documents is very fast without the filter, it's probably something with the filter itself. I don't know how the filter (or boolean query) works internally but probably for 10K or 50K

Re: How to find the fields that are indexed?

2009-11-23 Thread Ian Lea
That was a good solution to my problem and I found my fields for the document. Good. Actually I was trying it to find out how to implement autosuggest with Lucene. Can you suggest an idea of how to use autosuggest with Lucene. There was something about it recently on this list. Take a

Re: How to find the fields that are indexed?

2009-11-23 Thread Anshum
By autosuggest, would you mean similar documents? In that case you could try the Lucene 'MoreLikeThis' class. -- Anshum Gupta Naukri Labs! http://ai-cafe.blogspot.com The facts expressed here belong to everybody, the opinions to me. The distinction is yours to draw
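A rough sketch of using the contrib MoreLikeThis class for "similar documents" (field name, doc id, and index path are placeholders):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similar.MoreLikeThis;   // contrib/queries

public class SimilarDocs {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");
        IndexSearcher searcher = new IndexSearcher(reader);

        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setFieldNames(new String[] { "contents" });  // field name is illustrative

        // Find documents similar to an already-indexed document (doc id 42 here).
        Query q = mlt.like(42);
        TopDocs hits = searcher.search(q, 10);
        System.out.println("similar docs: " + hits.totalHits);

        searcher.close();
        reader.close();
    }
}
```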

autosuggest - in the sense of autocomplete

2009-11-23 Thread DHIVYA M
Sir, I actually meant autosuggest such as is available in Google Suggest, similar to autocomplete, where users need not type the entire text and can instead go with the suggestions available. Thanks in advance, Dhivya

Re: scoring adjacent terms without proximity search

2009-11-23 Thread liat oren
Hi Joel, I encountered the same problem. Could you please elaborate a bit on this? Many thanks, Liat. 2009/11/2 Joel Halbert wrote: I opted to use the following query to solve this problem, since it meets my requirements for the time being: +(cheese sandwich) cheese

Re: Efficient filtering advise

2009-11-23 Thread Erick Erickson
Now I'm really confused, which usually means I'm making some assumptions that aren't true. So here they are... 1) You're talking about Filters that contain BitSets, right? Not some other kind of filter. 2) When you create your 10-50K filters, you wind up with a single filter by combining

Re: Linking Fields to Documents possible?

2009-11-23 Thread Erick Erickson
There are some tricks you can apply, but they amount to keeping your own lists and manipulating them manually. As Ian says, Lucene isn't a database, and if you find yourself spending much time trying to *make* it behave like a database you should probably re-think your approach. But in this case,

Re: Efficient filtering advise

2009-11-23 Thread Eran Sevi
Erick, Maybe I didn't make myself clear enough. I'm talking about high-level filters used when searching. I construct a very big BooleanQuery and add 50K clauses to it (I removed the limit on max clauses). Each clause is a TermQuery on the same field. I don't know the internal doc ids that I

Re: Efficient filtering advise

2009-11-23 Thread Erick Erickson
Oh my goodness yes. No wonder nothing I suggested made any difference <G>. Ignore everything I've written. OK, here's something to try, and it goes back to a Filter. Rather than make this enormous bunch of ORs, try creating a Filter. Use TermDocs to run through your list of IDs assembling a
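A sketch of the TermDocs-based filter Erick describes, under the assumption that the external IDs are indexed as terms in a single field (field name and ID list are placeholders):

```java
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.OpenBitSet;

// Walk TermDocs for each external ID and set the matching internal doc ids
// in a bit set, then hand that bit set to the search as a Filter.
public class IdListFilter extends Filter {
    private final String field;
    private final String[] ids;   // the 10K-50K external ids to allow

    public IdListFilter(String field, String[] ids) {
        this.field = field;
        this.ids = ids;
    }

    public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
        OpenBitSet bits = new OpenBitSet(reader.maxDoc());
        TermDocs termDocs = reader.termDocs();
        try {
            for (int i = 0; i < ids.length; i++) {
                termDocs.seek(new Term(field, ids[i]));
                while (termDocs.next()) {
                    bits.set(termDocs.doc());
                }
            }
        } finally {
            termDocs.close();
        }
        return bits;   // OpenBitSet implements DocIdSet in 2.9
    }
}
```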

Re: SpanQuery for Terms at same position

2009-11-23 Thread Christopher Tignor
Tested it out. It doesn't work. A slop of zero indicates no words between the provided terms. E.g. my query of "plan _n" returns entries like "contingency plan". My workaround for this problem is to use a PhraseQuery, where you can explicitly set Terms to occur at the same location, to recover
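For reference, a small sketch of that PhraseQuery workaround, with both terms added at position 0 (field name and terms are illustrative):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;

public class SamePositionQuery {
    public static Query build() {
        PhraseQuery pq = new PhraseQuery();
        // PhraseQuery.add(Term, int) places both terms at position 0,
        // so they must co-occur at the same token position.
        pq.add(new Term("tokens", "plan"), 0);
        pq.add(new Term("tokens", "_n"), 0);
        return pq;
    }
}
```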

Re: Efficient filtering advise

2009-11-23 Thread Eran Sevi
I've taken TermsFilter from contrib, which does exactly that, and indeed the time was cut in half, which starts to be reasonable for my needs. I've researched the regular QueryFilter and what I write here might not be the complete picture: I found out that most of the time is spent on scoring
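A sketch of the contrib TermsFilter usage being described (field name is a placeholder):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermsFilter;   // contrib/queries

public class IdTermsFilter {
    public static TermsFilter build(String[] ids) {
        TermsFilter filter = new TermsFilter();
        for (int i = 0; i < ids.length; i++) {
            filter.addTerm(new Term("id", ids[i]));
        }
        return filter;   // pass as the Filter argument to IndexSearcher.search(...)
    }
}
```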

Re: SpanQuery for Terms at same position

2009-11-23 Thread Christopher Tignor
A slop of -1 doesn't work either. I get no results returned. This would be a *really* helpful feature for me if someone might suggest an implementation, as I would really like to be able to do arbitrary span searches where tokens may be at the same position and also in other positions where the

Re: SpanQuery for Terms at same position

2009-11-23 Thread Paul Elschot
On Monday 23 November 2009 17:27:56, Christopher Tignor wrote: A slop of -1 doesn't work either. I get no results returned. I think the problem is in the NearSpansOrdered.docSpansOrdered methods. Could you replace the < by <= in there (4 times) and try again? That will allow spans at the same

Re: SpanQuery for Terms at same position

2009-11-23 Thread Mark Miller
You're trying -1 with ordered, right? Try it with non-ordered. Christopher Tignor wrote: A slop of -1 doesn't work either. I get no results returned.

Re: autosuggest - in the sense of autocomplete

2009-11-23 Thread Anshum
For autocomplete, you could try the following: 1) Run a prefix query [could be a fuzzy query]. 2) Index using something like edge n-grams, e.g. the term "term" is indexed as 4 terms, viz: t, te, ter, term. -- Anshum Gupta Naukri Labs! http://ai-cafe.blogspot.com The facts expressed here belong to everybody, the opinions to me. The distinction is yours to draw
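A minimal sketch of option 1, the prefix query (field name is illustrative):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.TopDocs;

public class PrefixSuggest {
    // "title" is an illustrative field name; 'fragment' is whatever the user has typed so far.
    public static TopDocs suggest(IndexSearcher searcher, String fragment) throws Exception {
        PrefixQuery query = new PrefixQuery(new Term("title", fragment));
        return searcher.search(query, 10);
    }
}
```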

Re: Efficient filtering advise

2009-11-23 Thread Erick Erickson
See: http://issues.apache.org/jira/browse/LUCENE-1427 Short form: this is fixed, but not until 2.9. If you don't want to upgrade, you could always leave the Filter off your initial query and have your Collector ensure that any docs were in the

RE: autosuggest - in the sense of autocomplete

2009-11-23 Thread Uwe Schindler
If you just want to autocomplete the current term the user enters, initialize a TermEnum with the currently entered term fragment. If you then iterate through the TermEnum, you get all terms that exist in the index *after* that term (in unicode codepoint order). Stop iterating when the term does
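A sketch of the TermEnum walk Uwe describes, assuming the terms live in a single field (field name is a placeholder):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class TermEnumSuggest {
    // "contents" is an illustrative field name; 'fragment' is the text typed so far.
    public static List suggest(IndexReader reader, String fragment, int max) throws Exception {
        List suggestions = new ArrayList();
        // terms(Term) positions the enumeration at the first term >= (contents, fragment).
        TermEnum termEnum = reader.terms(new Term("contents", fragment));
        try {
            do {
                Term t = termEnum.term();
                // Stop as soon as we leave the field or the terms no longer share the prefix.
                if (t == null || !"contents".equals(t.field()) || !t.text().startsWith(fragment)) {
                    break;
                }
                suggestions.add(t.text());
            } while (suggestions.size() < max && termEnum.next());
        } finally {
            termEnum.close();
        }
        return suggestions;
    }
}
```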

Re: SpanQuery for Terms at same position

2009-11-23 Thread Christopher Tignor
Thanks so much for this. Using an un-ordered query, the -1 slop indeed returns the correct results, matching tokens at the same position. I tried the same query but ordered both after and before rebuilding the source with Paul's changes to NearSpansOrdered but the query was still failing,

RE: ConcurrentMergeScheduler, Exception and transaction

2009-11-23 Thread Teruhiko Kurosaka
Thank you, Mike, for the explanation. So I understand that all the data is kept even if any of these merging threads fails. Will Lucene keep attempting the merge every time addDocument is called afterwards once this has happened (and the error is persistent - such as the filesystem being full)? Will

Re: ConcurrentMergeScheduler, Exception and transaction

2009-11-23 Thread Michael McCandless
IndexWriter will try the merge again the next time it checks merges (e.g. after flushing a new segment, but not after adding a new document). You'll only get an exception out of addDocument/commit/flush if they hit the problem, e.g. if on flushing a new segment it runs out of space. But often

Re: SpanQuery for Terms at same position

2009-11-23 Thread Christopher Tignor
Also, I noticed that with the above edit to NearSpansOrdered I am getting erroneous results for normal ordered searches using searches like "_n followed by work", where, because _n and work are at the same position, the code changes accept their pairing as a valid in-order result now that the equal

Re: Linking Fields to Documents possible?

2009-11-23 Thread sameerpatil
Thanks guys, I get the point, it is best to reindex (hope it isn't very expensive). And yes, it's true that the suppliers don't change often. I

Searching while optimizing

2009-11-23 Thread vsevel
Hi, I am using Lucene 2.9.1 to index a continuous flow of events. My server keeps an index writer open at all times and writes events as groups of a few hundred followed by a commit. While writing, users invoke my server to perform searches. Once a day I optimize the index, while writes happen and

Re: Searching while optimizing

2009-11-23 Thread Michael McCandless
When you say "getting a reader of the writer", do you mean writer.getReader(), i.e. the new near-real-time API in 2.9? For that API (and in general whenever you open a reader), you must close it. I think all your files are there because you're not closing your old readers. Reopening readers during optimize
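A sketch of the reader lifecycle being described for the 2.9 near-real-time API (names are illustrative; the key point is closing the old reader after a successful reopen):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;

public class NearRealTimeSearch {
    // Reopen to pick up new segments and close the *old* reader;
    // otherwise the files of the previous snapshot stay referenced.
    public static IndexReader refresh(IndexReader current) throws Exception {
        IndexReader newReader = current.reopen();
        if (newReader != current) {
            current.close();   // release files held by the previous snapshot
        }
        return newReader;
    }

    public static void example(IndexWriter writer) throws Exception {
        IndexReader reader = writer.getReader();   // near-real-time reader (2.9)
        IndexSearcher searcher = new IndexSearcher(reader);
        // ... search, then later:
        searcher.close();
        reader = refresh(reader);                  // and eventually reader.close()
    }
}
```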

Re: SpanQuery for Terms at same position

2009-11-23 Thread Paul Elschot
On Monday 23 November 2009 20:07:58, Christopher Tignor wrote: Also, I noticed that with the above edit to NearSpansOrdered I am getting erroneous results for normal ordered searches using searches like "_n followed by work", where because _n and work are at the same position the code

Re: Efficient filtering advise

2009-11-23 Thread Erick Erickson
This was a really silly idea I had <G>. If your time is being spent in the scoring in the first place, keeping the Filter out of the query and checking against it later in your Collector won't change the timing because you'll have done all the scoring anyway. But I only thought about it on the way

Is Lucene a good choice for PB scale mailbox search?

2009-11-23 Thread fulin tang
We are going to add full-text search for our mailbox service. The problem is we have more than 1 PB of mails there, and obviously we don't want to add another PB of storage for the search service, so we hope the index data will be small enough for storage while search stays fast. The lucky thing is that

Re: Is Lucene a good choice for PB scale mailbox search?

2009-11-23 Thread Shashi Kant
Hi, I have not worked on a petascale (yet!) - mostly on the scale of tens of terabytes - but I do think Lucene would be very helpful for such a use case. I would indeed suggest partitioning the index by users (it seems the most logical, straightforward way, and also offers the security of insulating one

updating spell index

2009-11-23 Thread m.harig
Hello all, is there any way to update the spell index directory? Please, anyone help me out of this.

Re: Is Lucene a good choice for PB scale mailbox search?

2009-11-23 Thread Jason Rutherglen
A sharded architecture (i.e. smaller indexes) used by Google for example and implemented by open source in the Katta project may be best for scaling to sizable levels. Katta is also useful for redundancy and fault tolerance.

Re: did you mean issue

2009-11-23 Thread m.harig
String[] suggestions = spellChecker.suggestSimilar("hoem", 3, indexReader, "contents", true); this is how I am retrieving my "did you mean" words. Grant Ingersoll wrote: How are you invoking the spell checker?

Re: Searching while optimizing

2009-11-23 Thread vsevel
1) Correct: I am using IndexWriter.getReader(). I guess I was assuming that was a privately owned object and I had no business dealing with its lifecycle. The API would be clearer if the operation were renamed createReader(). 2) How much transient disk space should I expect? Isn't this pretty much

RamDirectory and FS at the same moment

2009-11-23 Thread Rafal Janik
Hi all! I've just started my adventure with Lucene and I've got one question regarding indexing. Does Lucene have a built-in mechanism to store indexes first in RAM and, after some time or after some number of documents have been added, move them to the FS? And search docs all the time in both

Re: Is Lucene a good choice for PB scale mailbox search?

2009-11-23 Thread Kay Kay
fulin tang wrote: We are going to add full-text search for our mailbox service. The problem is we have more than 1 PB of mails there, and obviously we don't want to add another PB of storage for the search service, so we hope the index data will be small enough for storage while search stays fast

Re: RamDirectory and FS at the same moment

2009-11-23 Thread Anshum
Hi Rafal, If what I understand about your implementation is correct, you could try a ParallelMultiSearcher: http://lucene.apache.org/java/2_9_1/api/core/org/apache/lucene/search/ParallelMultiSearcher.html -- Anshum Gupta Naukri Labs! http://ai-cafe.blogspot.com The facts expressed here belong to everybody, the opinions to me. The distinction is yours to draw
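A sketch of searching a RAM index and an FS index together with ParallelMultiSearcher, assuming both directories already contain a committed index (paths and setup are placeholders):

```java
import java.io.File;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class RamPlusDiskSearch {
    // ramDir is assumed to already hold a small index of recent documents
    // (written and committed by an IndexWriter elsewhere); fsPath points at
    // the long-lived on-disk index.
    public static ParallelMultiSearcher open(RAMDirectory ramDir, String fsPath) throws Exception {
        Directory fsDir = FSDirectory.open(new File(fsPath));

        IndexSearcher ramSearcher = new IndexSearcher(ramDir, true);   // read-only
        IndexSearcher fsSearcher = new IndexSearcher(fsDir, true);

        // One search hits both indexes and the results are merged.
        return new ParallelMultiSearcher(new Searchable[] { ramSearcher, fsSearcher });
    }
}
```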