Re: Open an IndexWriter in parallel with an IndexReader on the same index.
Chris Hostetter [EMAIL PROTECTED] wrote on 22/02/2006 03:24:58 AM:

> : It would have been nice if someone wrote something like IndexModifier,
> : but with a cache, similar to what Yonik suggested above: deletions will
> : not be done immediately, but rather cached and later done in batches.
> : Of course, batched deletions should not remember the term to delete,
> : but rather the matching document numbers at the time of the deletion -
> : because after the addition of the modified document, if we search for
> : the term again we'll find two documents.
>
> That's not a safe sequence of events. An add can trigger a segment merge,
> which can renumber documents.

I see. Then maybe there's a way to catch this merge and do the deletions
just before it, because...

> As Yonik said, you want to queue up the adds/updates, then do a delete
> for each update in your queue, then do your adds in one batch.

The problem with this solution is that unlike queuing deletes, queuing
additions requires you to queue the actual document contents. Doing this
in memory might add a large memory penalty, which is undesirable for
applications that try to maintain a small memory footprint.

> Knowing when/what to delete requires knowing a key for your records --
> which isn't a native Lucene concept, but it is certainly a general enough
> one that a helper class could be written for this.

I realise that the name of this delete key isn't defined by Lucene, but I
believe that the concept of such a key was officially sanctioned by Lucene
with the deleteDocuments(Term) method (whose documentation even mentions
the unique ID string scenario). So indeed a helper class of this sort will
probably be useful to more than a few people.

--
Nadav Har'El
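A helper of the kind discussed might look like the sketch below. It is a
hypothetical class (the names and the "id" key field are assumptions, not
anything from Lucene itself) and assumes the Lucene 1.9 API, where
IndexReader.deleteDocuments(Term) exists. It queues update keys and
documents, deletes in one batch with a reader, then adds in one batch with
a writer:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    // Hypothetical helper: batches updates keyed on a unique "id" field.
    public class BatchedUpdater {
        private final String indexPath;
        private final List keys = new ArrayList();  // String ids to delete
        private final List docs = new ArrayList();  // Documents to re-add

        public BatchedUpdater(String indexPath) { this.indexPath = indexPath; }

        public void update(String id, Document doc) {
            keys.add(id);
            docs.add(doc);
        }

        // All deletes happen before any adds, so a merge triggered by an
        // add cannot invalidate the deletes (the problem described above).
        public void flush() throws IOException {
            IndexReader reader = IndexReader.open(indexPath);
            for (int i = 0; i < keys.size(); i++) {
                reader.deleteDocuments(new Term("id", (String) keys.get(i)));
            }
            reader.close();

            IndexWriter writer =
                new IndexWriter(indexPath, new StandardAnalyzer(), false);
            for (int i = 0; i < docs.size(); i++) {
                writer.addDocument((Document) docs.get(i));
            }
            writer.close();
            keys.clear();
            docs.clear();
        }
    }

Note this keeps whole Documents in memory until flush(), which is exactly
the memory penalty mentioned above.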
Re: How can I get a term's frequency?
En, but IndexReader.getTermFreqVector is an abstract method, I do not know
how to implement it in an efficient way. Anyone has good advice?

I search with a group of query terms, and I can get a document from the
search result:

    Query(term1, term2, term3) -- search index -- Hits(doc1, doc2, doc3, ...)

I want to get term1's frequency in doc1. I think the tf value is calculated
in the index procedure. Can I get the tf (term frequency) value of term1
directly? I can do it this way:

    QueryTermVector vector = new QueryTermVector(Document.getValues(field));
    freq = result.getTermFrequencies();

but I think this is a very inefficient way. Anyone can help me? thx

sog

----- Original Message -----
From: Daniel Noll [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Wednesday, February 22, 2006 1:19 PM
Subject: Re: How can I get a term's frequency?

> sog wrote:
> > I search the index with a group of terms. I want to get every term's
> > frequency in each document of the search result.
>
> Are you looking for this?
>
>     TermFreqVector vector = IndexReader.getTermFreqVector(docNum, field);
>
> That gives you the frequency of every term, but you can just look up the
> ones you're interested in.
>
> Daniel
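A minimal sketch of the lookup Daniel describes. It assumes the field was
indexed with term vectors enabled (Field.TermVector.YES in 1.9), that
"hits" is the Hits object from the search, and that the field name
"contents" stands in for whatever field was actually indexed:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;

    // Frequency of one query term in one result document, read straight
    // from the stored term vector (no re-analysis of the document text).
    IndexReader reader = IndexReader.open("/path/to/index");
    int docNum = hits.id(0);  // internal doc number of the first hit
    TermFreqVector vector = reader.getTermFreqVector(docNum, "contents");
    if (vector != null) {
        int idx = vector.indexOf("term1");  // -1 if the term is absent
        int tf = (idx >= 0) ? vector.getTermFrequencies()[idx] : 0;
        System.out.println("tf(term1, doc) = " + tf);
    }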
Re: Phrase query vs span query
On Wednesday 22 February 2006 00:45, Rajesh Munavalli wrote:
> I am trying to adopt Lucene for a special IR system. The following
> scenario is an approximation of what I am trying to do. Please bear with
> me if some things don't make sense. I need some suggestions on
> formulating queries for the following scenario.
>
> Each document consists of a set of fields (standard in Lucene). But in
> my case, the field is somewhat different, as explained below.
>
> Field:
> Each field consists of a set of conceptual sections. Each of these
> sections is separated by say N (say 1000) index positions but they are
> in the same field. Sizes of sections vary and do not have any lower or
> upper bound on the number of terms they may contain.
>
> Ex: Let's say "Field contents" has: section 1 of 100 terms, gap of 1000
> term positions, section 2 of 1500 terms, gap of 1000 term positions, gap
> of 1000 term positions, section 3 of 10 terms.
>
> NOTE: At index time, I am assuming I somehow know how to form these
> sections.

One more choice you have is to index both the full document and each
section as a Lucene document.

> Typical Query:
> Consists of 15 to 30 query terms. In other words, these query terms
> represent a conceptual section.

Would you need synonyms of these terms, too?

> Aim of the Query formation: I want to rank the documents proportional to
> the number of query terms

For this there is the coord() factor used in Lucene boolean queries. But
scoring exactly proportional to the number of query terms is difficult to
do because the Lucene score is not bounded by default.

> appearing in the SAME SECTION and IN ORDER.

To query the exact order, you can use PhraseQuery and SpanQuery.

> Documents containing terms with the
>
> My Questions:
> Considering the structure of the fields/documents and the number of
> query terms:
>
> (1) Is there an effective way of formulating a query with the existing
> query types in Lucene?

I don't think so, see below.

> (2) After considering the way different queries work and their
> limitations, I think forming phrase/span queries of groups of query
> terms might approximate the rankings I am expecting. In that case which
> of the following queries will perform better (in terms of QUERY SPEED
> and RANKING): (a) phrase query with certain slop factor (b) span query

SpanQuery is slower than PhraseQuery, but it has the advantage that it can
be nested. Nesting here means the possibility to use e.g. a short phrase
as a unit to be matched and scored.

Concerning this:

> Rank 2: Documents containing section containing all terms but randomly
> ordered

SpanQuery can also match unordered occurrences; I don't know about
PhraseQuery.

To formulate a single query for your requirements, there is still the
problem that PhraseQuery and SpanQuery only work when all their terms are
present in an indexed Lucene document field. Putting it differently, when
fewer terms are present, their order cannot be taken into account, unless
the query contains an (non)ordered query specifying a subset of the terms
present in the documents.

An alternative to the current span query implementation is here:
http://issues.apache.org/jira/browse/LUCENE-413
but this will only help to get an impression of how to match in the
ordered and unordered cases. It might be possible to generalize the
various span algorithms there and in the trunk to work with fewer terms.

Regards,
Paul Elschot
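To illustrate the nesting Paul mentions, a minimal sketch (the field name,
terms, and slop values are placeholders, not anything from the thread): a
short ordered phrase is built as a span and then used as one unit inside a
larger unordered span. Keeping the outer slop well below the 1000-position
section gap means a match cannot straddle two sections.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    // Inner span: "information retrieval" as an ordered unit (slop 0).
    SpanQuery phrase = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("contents", "information")),
        new SpanTermQuery(new Term("contents", "retrieval"))
    }, 0, true);

    // Outer span: the phrase plus another term, unordered, within 50
    // positions -- far under the 1000-position section gap, so a match
    // stays inside one section.
    SpanQuery section = new SpanNearQuery(new SpanQuery[] {
        phrase,
        new SpanTermQuery(new Term("contents", "system"))
    }, 50, false);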
Re: Index missing documents
I'm using Lucene 1.4.3, and maxBufferedDocs only appears to be in the new
(unreleased?) version of IndexWriter in CVS. Looking at the code though,
setMaxBufferedDocs(n) just translates to minMergeDocs = n. My index was
constructed using the default minMergeDocs = 10, so somehow this doesn't
seem to be the culprit that caused all 2 million+ documents to be missing
from the crashed index. It seems more likely that none of the index files
were registered in Lucene's segments file.

Is there perhaps some other trigger that causes Lucene to register the
indexes in the segments file, or is there some way of flushing the
segments file every so often to ensure that its list is up to date?

Thanks again for your assistance.

Michael.

----- Original Message -----
From: Otis Gospodnetic [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Monday, February 20, 2006 8:39 PM
Subject: Re: Index missing documents

> No, using the same IndexWriter is the way to go. If you want things to
> be written to disk more frequently, lower the maxBufferedDocs setting.
> Go down to 1, if you want. You'll use less memory (RAM), Documents will
> be written to disk without getting buffered in RAM, but the indexing
> process will be slower.
>
> Otis
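For reference, a sketch of the two spellings of the same setting (in 1.4.3
minMergeDocs is a public field on IndexWriter; the CVS/1.9 code wraps it
in an accessor). The value 1 mirrors Otis's "go down to 1" above:

    // Lucene 1.4.3: flush buffered documents to disk more often by
    // lowering the public minMergeDocs field.
    writer.minMergeDocs = 1;

    // Lucene 1.9 / CVS: the same knob through the new accessor.
    writer.setMaxBufferedDocs(1);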
Re: How can I get a term's frequency?
En, I describe my question more clearly:

I search with a group of query terms, and I can get a document from the
search result:

    Query(term1, term2, term3) -- search index -- Hits(doc1, doc2, doc3, ...)

I want to get term1's frequency in doc1:

    Hits(doc1((term1,freq),(term2,freq),(term3,freq)),
         doc2((term1,freq),(term2,freq),(term3,freq)), ...)

I think the tf value is calculated in the index procedure. Can I get the
tf (term frequency) value of term1 directly? I can do it this way:

    QueryTermVector vector = new QueryTermVector(Document.getValues(field));
    freq = result.getTermFrequencies();

but I think this is a very inefficient way. Anyone can help me? thx

sog

----- Original Message -----
From: Daniel Noll [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Wednesday, February 22, 2006 1:19 PM
Subject: Re: How can I get a term's frequency?

> sog wrote:
> > I search the index with a group of terms. I want to get every term's
> > frequency in each document of the search result.
>
> Are you looking for this?
>
>     TermFreqVector vector = IndexReader.getTermFreqVector(docNum, field);
>
> That gives you the frequency of every term, but you can just look up the
> ones you're interested in.
>
> Daniel
Re: webserverless search with lucene on offline HTML doc
The signed applet is surely a simpler and more elegant solution. In some
projects, however, this may not be a viable option: the System properties
problem you have pointed out (and I had missed :-) is hopefully going to
be solved in 1.9 (http://issues.apache.org/jira/browse/LUCENE-369).

Fabio

P.S.: is there any possibility to have a look at your quick and dirty
implementation of the JarDirectory? I've written a JarReadOnlyDirectory
but it was very dirty and not even so quick for me to write :-(

> I wrote a quick and dirty implementation of a JarDirectory - it works,
> but a new problem is encountered soon after: the IndexWriter requires
> information from the System properties; an applet is allowed to read
> only a limited set of properties. Especially with an offline applet I
> would stick to the solution of signing the applet.
>
> Dolf.
>
> On 2/21/06, Trieschnigg, R.B. (Dolf) [EMAIL PROTECTED] wrote:
> > Wouldn't this be a good case for the JarDirectory implementation
> > somebody asked for? The index could then be statically written in a
> > jar file downloaded with the applet (the original mail refers to
> > static offline HTML files).
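Not the JarDirectory itself, but one workaround sketch for the read-only
case (a hypothetical loader class; assumes the Lucene 1.9 Directory API
and a known list of index file names, since jars cannot be listed
portably): copy each index file out of the jar into a RAMDirectory and
search that.

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.lucene.store.IndexOutput;
    import org.apache.lucene.store.RAMDirectory;

    // Hypothetical loader: copies the index files packaged under /index
    // in the applet's jar into a RAMDirectory. The file names must be
    // known up front (e.g. from a small manifest packaged next to them).
    public class JarIndexLoader {
        public static RAMDirectory load(String[] fileNames) throws IOException {
            RAMDirectory dir = new RAMDirectory();
            byte[] buf = new byte[4096];
            for (int i = 0; i < fileNames.length; i++) {
                InputStream in =
                    JarIndexLoader.class.getResourceAsStream("/index/" + fileNames[i]);
                IndexOutput out = dir.createOutput(fileNames[i]);
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.writeBytes(buf, n);
                }
                out.close();
                in.close();
            }
            return dir;
        }
    }

An IndexSearcher opened on the returned directory never touches the
filesystem, so the read-only path should avoid the signing problem; the
System properties issue with IndexWriter is, of course, untouched.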
:Lucene 1.9 RC1 is not working properly with older version of Code 1.43:
Hi,

I got the latest source code of Lucene 1.9 RC1 and modified my code
accordingly by removing the deprecated methods. But since I updated to
this version the search is not working at all. If I try with Luke it
works fine, but if I try with my program it returns no error and no
results. Please let me know if there are any known problems with this
version. If so, I will go back to the old version 1.4.3, which is working
fine.

Thanks,
Ravi Kumar Jaladanki
Lucene, Cannot rename segments.new to segments
I am getting intermittent errors with Lucene. Here are two examples:

    java.io.IOException: Cannot rename E:\lucene\segments.new to E:\lucene\segments
    java.io.IOException: Cannot rename E:\lucene\_8ya.tmp to E:\lucene\_8ya.del

This issue has an open Bugzilla entry:
http://issues.apache.org/bugzilla/show_bug.cgi?id=36241

I thought this error must be caused by an error in my application. To try
to solve it I used the LuceneIndexAccessor in my application:
http://issues.apache.org/bugzilla/show_bug.cgi?id=34995

I am still getting the error.

1) Is there a reason (other than time and resources) why the bug report
   is still set to NEW after 6 months (since August 2005)?
2) Is the problem likely to be in my application? Any ideas how I could
   go about solving this issue?

Thanks for your help,
Patrick
Re: :Lucene 1.9 RC1 is not working properly with older version of Code 1.43:
Hi Ravi,

Could you try 1.9RC1 without changing your code to remove the deprecated
calls first? If that works, try changing one type of deprecated call at a
time until the culprit is found. It may either be a bug in API usage in
your code, or a bug in Lucene.

-Yonik

On 2/22/06, Ravi [EMAIL PROTECTED] wrote:
> I got the latest source code of Lucene 1.9 RC1 and modified my code
> accordingly by removing the deprecated methods. But since I updated to
> this version the search is not working at all. If I try with Luke it
> works fine, but if I try with my program it returns no error and no
> results. Please let me know if there are any known problems with this
> version. If so, I will go back to the old version 1.4.3, which is
> working fine.
TREC,INEX and Lucene
Hi all,

I am planning on participating in INEX and hopefully, passively, in a
couple of TREC tracks, mainly using the Lucene API. Is anyone else on this
list planning on using Lucene during participation? I am particularly
interested in the SPAM, Blog and ADHOC tracks.

Malcolm Clark
Re: Phrase query vs span query
On 2/22/06, Paul Elschot [EMAIL PROTECTED] wrote:
> > Typical Query:
> > Consists of 15 to 30 query terms. In other words, these query terms
> > represent a conceptual section.
>
> Would you need synonyms of these terms, too?

Yes.

> > (2) After considering the way different queries work and their
> > limitations, I think forming phrase/span queries of groups of query
> > terms might approximate the rankings I am expecting. In that case
> > which of the following queries will perform better (in terms of QUERY
> > SPEED and RANKING): (a) phrase query with certain slop factor (b) span
> > query
>
> SpanQuery is slower than PhraseQuery, but it has the advantage that it
> can be nested. Nesting here means the possibility to use e.g. a short
> phrase as a unit to be matched and scored.

I wasn't aware of the capability to nest span queries. Is there a link
where I could read more about this?

> To formulate a single query for your requirements, there is still the
> problem that PhraseQuery and SpanQuery only work when all their terms
> are present in an indexed Lucene document field. Putting it differently,
> when fewer terms are present, their order cannot be taken into account,
> unless the query contains an (non)ordered query specifying a subset of
> the terms present in the documents.

I was thinking of building a boolean combination of phrase/span queries
on subsets of the terms. Though it's not exhaustive, it might be
sufficient in the majority of cases.

> An alternative to the current span query implementation is here:
> http://issues.apache.org/jira/browse/LUCENE-413
> but this will only help to get an impression of how to match in the
> ordered and unordered cases. It might be possible to generalize the
> various span algorithms there and in the trunk to work with fewer terms.

I will consider that option.

Thanks,
Rajesh Munavalli
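A sketch of that boolean-of-subsets idea (field name, terms, and slop are
placeholders; assumes the Lucene 1.9 BooleanClause.Occur API): each SHOULD
clause is an ordered span over one subset of the query terms, so documents
matching more subsets score higher via coord().

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    // One ordered span per subset of the query terms.
    SpanQuery subsetAB = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("contents", "a")),
        new SpanTermQuery(new Term("contents", "b"))
    }, 10, true);
    SpanQuery subsetBC = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("contents", "b")),
        new SpanTermQuery(new Term("contents", "c"))
    }, 10, true);

    // SHOULD clauses: any subset may match; documents matching more
    // subsets are rewarded through the coord() factor.
    BooleanQuery q = new BooleanQuery();
    q.add(subsetAB, BooleanClause.Occur.SHOULD);
    q.add(subsetBC, BooleanClause.Occur.SHOULD);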
ArrayIndexOutOfBoundsException being thrown ...
Getting an ArrayIndexOutOfBoundsException ... Line 31 in
IndexSearcherManager.java:

    ...
    public static IndexSearcher getIndexSearcher(String indexPath) {
        logger.debug("indexPath = " + indexPath);
        searcher = new IndexSearcher(indexPath);   // LINE 31
        return searcher;
    }
    ...

I get the following exception:

    28628 DEBUG com.allegrocentral.tandoori.managers.search.IndexSearcherManager [21] - indexPath = /opt/tomcat/webapps/ROOT/WEB-INF/search-index
    28666 WARN org.apache.struts.action.RequestProcessor [516] - Unhandled Exception thrown: class java.lang.ArrayIndexOutOfBoundsException
    28669 ERROR org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/].[action] [704] - Servlet.service() for servlet action threw exception
    java.lang.ArrayIndexOutOfBoundsException: -1
        at java.util.ArrayList.get(ArrayList.java:323)
        at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:155)
        at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:151)
        at org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java:149)
        at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:115)
        at org.apache.lucene.index.TermInfosReader.readIndex(TermInfosReader.java:86)
        at org.apache.lucene.index.TermInfosReader.<init>(TermInfosReader.java:45)
        at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:112)
        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:89)
        at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:118)
        at org.apache.lucene.store.Lock$With.run(Lock.java:109)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:95)
        at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:38)
        at com.allegrocentral.tandoori.managers.search.IndexSearcherManager.getIndexSearcher(IndexSearcherManager.java:31)

Any ideas as to why this might be happening? (Am using
lucene-core-1.9-rc1.jar)

-Thanks.
IndexSearcher
Maybe too general a question, but is there anything about creating an
IndexSearcher(directory) object that would make the instantiation really
slow?

I have one index where the instantiation is very fast, to the point where
I don't need to do any pooling. A new index I have created takes a very
long time to create the IndexSearcher object. With a 30MB index, it can
take about 30 seconds just to instantiate an IndexSearcher(). It almost
seems like it is reading the index at that point.

The only difference between the indexes has been the # of fields indexed,
the newer one having only one field indexed.

Any ways to speed up that instantiation? Or do I have to use a pooling
system?

Thanks for any suggestions,
-Gus
RE: IndexSearcher
This doesn't really address your question, but... once you have the
single IndexSearcher, do you need any others? Could your app just use the
single instance?

-----Original Message-----
From: Gus Kormeier [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 22, 2006 11:28 AM
To: java-user@lucene.apache.org
Subject: IndexSearcher

Maybe too general a question, but is there anything about creating an
IndexSearcher(directory) object that would make the instantiation really
slow?

I have one index where the instantiation is very fast, to the point where
I don't need to do any pooling. A new index I have created takes a very
long time to create the IndexSearcher object. With a 30MB index, it can
take about 30 seconds just to instantiate an IndexSearcher(). It almost
seems like it is reading the index at that point.

The only difference between the indexes has been the # of fields indexed,
the newer one having only one field indexed.

Any ways to speed up that instantiation? Or do I have to use a pooling
system?

Thanks for any suggestions,
-Gus
search a subdirectory (New to Lucene)
I'm new to Lucene and was wondering what is the best way to perform a
search on a subdirectory or subdirectories within the index? My thought
at this point is to build a query to first search for files in the
required directory(ies), then use that query to make a QueryFilter, and
use that QueryFilter in the actual search. Is there an easier way?

On an unrelated note, does anybody know of a way to get results at the
section level within a document? For example, could I find not just a
document that matches my query, but the paragraph within that document
that best matches the query?

thanks,
John
RE: IndexSearcher
I guess what I meant was to have all your servlets use the same instance.
They could get it from the class or from a parent of all your servlets.
Then you can let the IndexSearcher take care of all the search requests.

-----Original Message-----
From: Gus Kormeier [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 22, 2006 12:42 PM
To: 'java-user@lucene.apache.org'
Subject: RE: IndexSearcher

It's in a servlet, so one workaround I have been going with is to just
open it at init(). That gives me some threading concerns. And I didn't
have to do that in the past.
-Gus

-----Original Message-----
From: John Powers [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 22, 2006 9:35 AM
To: java-user@lucene.apache.org
Subject: RE: IndexSearcher

This doesn't really address your question, but... once you have the
single IndexSearcher, do you need any others? Could your app just use the
single instance?

-----Original Message-----
From: Gus Kormeier [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 22, 2006 11:28 AM
To: java-user@lucene.apache.org
Subject: IndexSearcher

Maybe too general a question, but is there anything about creating an
IndexSearcher(directory) object that would make the instantiation really
slow?

I have one index where the instantiation is very fast, to the point where
I don't need to do any pooling. A new index I have created takes a very
long time to create the IndexSearcher object. With a 30MB index, it can
take about 30 seconds just to instantiate an IndexSearcher(). It almost
seems like it is reading the index at that point.

The only difference between the indexes has been the # of fields indexed,
the newer one having only one field indexed.

Any ways to speed up that instantiation? Or do I have to use a pooling
system?

Thanks for any suggestions,
-Gus
Re: Throughput doesn't increase when using more concurrent threads
Hmmm, not sure what that could be. You could try using the default FSDir
instead of MMapDir to see if the differences are still there. Some things
that could be different:
- thread scheduling (shouldn't make too much of a difference though)
- synchronization workings
- page replacement policy: how the OS figures out which pages to swap in
  and which to swap out, especially for the memory-mapped files

You could also try a profiler on both platforms to try and see where the
difference is.

-Yonik

On 2/22/06, Peter Keegan [EMAIL PROTECTED] wrote:
> I am doing a performance comparison of Lucene on Linux vs Windows. I
> have 2 identically configured servers (8 CPUs (real) x 3GHz Xeon
> processors, 64GB RAM). One is running CentOS 4 Linux, the other is
> running Windows Server 2003 Enterprise Edition x64. Both have 64-bit
> JVMs from Sun. The Lucene server is using MMapDirectory. I'm running the
> JVM with -Xmx16000M. Peak memory usage of the JVM on Linux is about 6GB
> and 7.8GB on Windows.
>
> I'm observing query rates of 330 queries/sec on the Wintel server, but
> only 200 qps on the Linux box. At first, I suspected a network
> bottleneck, but when I 'short-circuited' Lucene, the query rates were
> identical. I suspect that there are some things to be tuned in Linux,
> but I'm not sure what. Any advice would be appreciated.
>
> Peter
>
> On 1/30/06, Peter Keegan [EMAIL PROTECTED] wrote:
> > I cranked up the dial on my query tester and was able to get the rate
> > up to 325 qps. Unfortunately, the machine died shortly thereafter
> > (memory errors :-( ) Hopefully, it was just a coincidence. I haven't
> > measured 64-bit indexing speed, yet.
> >
> > Peter
> >
> > On 1/29/06, Daniel Noll [EMAIL PROTECTED] wrote:
> > > Peter Keegan wrote:
> > > > I tried the AMD 64-bit JVM from Sun with MMapDirectory and I'm now
> > > > getting 250 queries/sec and excellent cpu utilization (equal
> > > > concurrency on all cpus)!! Yonik, thanks for the pointer to the
> > > > 64-bit jvm. I wasn't aware of it.
> > >
> > > Wow. That's fast.
> > >
> > > Out of interest, does indexing time speed up much on 64-bit
> > > hardware? I'm particularly interested in this side of things because
> > > for our own application, any query response under half a second is
> > > good enough, but the indexing side could always be faster. :-)
> > >
> > > Daniel
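A sketch of the swap Yonik suggests, assuming the Lucene 1.9 API, where
the FSDirectory implementation is selected through a system property;
treat the property name below as an assumption to verify against your
Lucene version.

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Plain file I/O: the default FSDirectory implementation.
    Directory dir = FSDirectory.getDirectory("/path/to/index", false);
    IndexSearcher searcher = new IndexSearcher(dir);

    // Memory-mapped I/O: select MMapDirectory before the first
    // getDirectory() call (FSDirectory caches instances per path).
    System.setProperty("org.apache.lucene.FSDirectory.class",
                       "org.apache.lucene.store.MMapDirectory");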
Re: Searching/sorting strategy for many properties for semantic web app
One very nice implementation to take a look at is the Simile project at
MIT. The Piggy Bank and Longwell projects use Lucene to index RDF and
integrate full-text and structural queries nicely together.

http://simile.mit.edu

Erik

On Feb 21, 2006, at 10:20 PM, David Pratt wrote:
> Hi there. I am new to Lucene and I have been developing a semantic
> application for a while, and it appears to me Lucene could help me get a
> much needed search with reasonable speed. I have some general questions
> to start:
>
> 1) Since my app is virtually all metadata, what should I store in the
>    indexes, if anything?
> 2) Should I only index the most common properties that people will
>    search, and combine the rest (and index this combined text as a
>    field)?
> 3) I would like to sort and filter results but am concerned this could
>    be very memory intensive.
> 4) Some general guidance on organizing indexes in an app would be
>    appreciated.
>
> My schema is fairly large but I generally expect people to search on
> about 6 to 8 properties for the most part. I have the data stored in an
> SQL database, but not in a conventional way. I am willing to accept a
> slower advanced search on less common properties (accommodating this
> with SQL search) but I really want some speed for the main properties
> with full-text search. Pretty much everything in the app is metadata, so
> I am most interested in focussing on the 6-8 properties that people will
> use to search on for the most part. I am thinking of combining the text
> of the remaining properties (quite a number) into a single description
> type field so that essentially all information gets indexed and ranked.
> Is this a reasonable approach?
>
> I see that there are advanced possibilities with the indexes to sort and
> filter. How advisable is using sort for large record sets? For example,
> say you have got 2 records returned from your search. Because this will
> have a web interface I will likely only be showing the first 20, so I
> will be batching results. Is the sorting/filtering highly memory
> intensive?
>
> Hopefully, someone can provide some initial advice. Many thanks.
>
> Regards,
> David
Re: search a subdirectory (New to Lucene)
I presume by saying subdirectory you're referring to filesystem
directories, and you're indexing a directory tree of files. If you index
the path (perhaps relative from the root is best) as a keyword field
(untokenized, but indexed), you could perform filtering in a path/subpath
sort of way using PrefixQuery.

As for paragraphs - how you index a document is entirely application
dependent. Maybe it makes sense to parse the documents before handing
them to Lucene, such that you're creating a Lucene Document for each
paragraph rather than for each entire file. Slicing the granularity of a
domain into Documents is a fascinating topic :)

Erik

On Feb 22, 2006, at 1:00 PM, John Hamilton wrote:
> I'm new to Lucene and was wondering what is the best way to perform a
> search on a subdirectory or subdirectories within the index? My thought
> at this point is to build a query to first search for files in the
> required directory(ies), then use that query to make a QueryFilter, and
> use that QueryFilter in the actual search. Is there an easier way?
>
> On an unrelated note, does anybody know of a way to get results at the
> section level within a document? For example, could I find not just a
> document that matches my query, but the paragraph within that document
> that best matches the query?
>
> thanks,
> John
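A sketch of Erik's suggestion (the field name "path", the sample paths,
and the pre-existing "searcher" and "contentQuery" are assumptions;
assumes the Lucene 1.9 Field API): index the relative path untokenized,
then restrict a search to one subtree with a PrefixQuery wrapped in a
QueryFilter.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryFilter;

    // At index time: store the path relative to the indexed root,
    // untokenized so the whole path is a single indexed term.
    Document doc = new Document();
    doc.add(new Field("path", "docs/guide/intro.html",
                      Field.Store.YES, Field.Index.UN_TOKENIZED));

    // At search time: filter any content query down to one subtree.
    Query filterQuery = new PrefixQuery(new Term("path", "docs/guide/"));
    Hits hits = searcher.search(contentQuery, new QueryFilter(filterQuery));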
Lucene 1.9 RC1 release available
Release 1.9 RC1 of Lucene is now available from:

http://www.apache.org/dyn/closer.cgi/lucene/java/

This release candidate has many improvements since release 1.4.3,
including new features, performance improvements, bug fixes, etc. For
details, see:

http://svn.apache.org/viewcvs.cgi/*checkout*/lucene/java/branches/lucene_1_9/CHANGES.txt?rev=379190

1.9 will be the last 1.x release. It is both back-compatible with 1.4.3
and forward-compatible with the upcoming 2.0 release. Many methods and
classes in 1.4.3 have been deprecated in 1.9 and will be removed in 2.0.
Applications must compile against 1.9 without deprecation warnings before
they are compatible with 2.0.

Doug
Re: IndexSearcher
: I have one index where the instantiation is very fast, to the point where I
: don't need to do any pooling. A new index I have created takes a very long
: time to create the IndexSearcher object. With a 30MB index, it can take
: about 30 seconds just to instantiate an IndexSearcher(). It almost seems
: like it is reading the index at that point.
:
: The only difference between the indexes has been the # of fields indexed,
: the newer one having only one field indexed.

If I remember correctly, the IndexSearcher constructor doesn't do anything
but open an IndexReader ... IndexReader.open() opens a MultiReader on all
of the segments, and each of the SegmentReaders opens up a bunch of files.

So off the top of my head, one thing that can make a difference in the new
IndexSearcher times is how many segments you have in your index (ie: is it
optimized?) ... using the compound file format can probably make a
difference as well.

: Any ways to speed up that instantiation? Or do I have to use a pooling
: system?

Even if you get it down to 0.1 seconds, I would still reuse the same
IndexSearcher as much as possible. See previous replies from me in the
archive about memory for my reasoning.

-Hoss
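A minimal sketch of the reuse Hoss recommends (hypothetical holder class;
the index path is an assumption): open one IndexSearcher lazily and share
it, which works because IndexSearcher is thread-safe for concurrent
searches.

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;

    // Hypothetical holder: one IndexSearcher shared by all callers.
    public class SearcherHolder {
        private static IndexSearcher searcher;

        public static synchronized IndexSearcher get() throws IOException {
            if (searcher == null) {
                searcher = new IndexSearcher("/opt/indexes/main");  // assumed path
            }
            return searcher;
        }
    }

The shared searcher sees the index as of the moment it was opened; to pick
up newly added documents, close it and open a fresh one.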
RE: IndexSearcher
Thanks Hoss,

I did figure out that I was putting about 400 stored fields per document
into my new index - more than my prior indexes. Reducing the number of
stored fields seems to have helped significantly. I do call
writer.optimize() after loading in documents, but I'm not sure how I
would set the # of segments?

I think I will keep the IndexSearcher statically for all instances. The
slow times I was seeing weren't even sufficient for that, though.

Since this is a case of really only needing to search on one field and
using the index as a storage medium for the rest of the data (pretty much
textual data), I'm thinking it would make sense to get the latest version
of Lucene and create a two-field index. Something like:

    Field 1: id
    Field 2: serialized data object

Any reason why that wouldn't be fast? I have been having elusive memory
issues with my other usage; maybe you just helped me find that solution
as well.

Thanks,
-Gus

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 22, 2006 4:02 PM
To: java-user@lucene.apache.org
Subject: Re: IndexSearcher

If I remember correctly, the IndexSearcher constructor doesn't do anything
but open an IndexReader ... IndexReader.open() opens a MultiReader on all
of the segments, and each of the SegmentReaders opens up a bunch of files.

So off the top of my head, one thing that can make a difference in the new
IndexSearcher times is how many segments you have in your index (ie: is it
optimized?) ... using the compound file format can probably make a
difference as well.

Even if you get it down to 0.1 seconds, I would still reuse the same
IndexSearcher as much as possible. See previous replies from me in the
archive about memory for my reasoning.

-Hoss
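A sketch of that two-field layout (assumes the Lucene 1.9 Field API;
serializing the object to a String is just one option - 1.9 also added
binary stored fields): the id is the only indexed field, and the payload
is stored but never analyzed or indexed.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Two-field document: search on "id", use "data" purely as storage.
    Document doc = new Document();
    doc.add(new Field("id", "12345",
                      Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("data", serializedObjectAsString,
                      Field.Store.YES, Field.Index.NO));

On the segments question: writer.optimize() always merges down to a
single segment, and enabling the compound format with
writer.setUseCompoundFile(true) further collapses each segment to a
handful of files, both of which reduce the per-open cost Hoss describes.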
Re: How can I get a term's frequency?
sog wrote:
> En, but IndexReader.getTermFreqVector is an abstract method, I do not
> know how to implement it in an efficient way. Anyone has good advice?

You probably don't need to implement it; it's been implemented already.
Just call the method.

> I can do it this way:
>
>     QueryTermVector vector = new QueryTermVector(Document.getValues(field));
>     freq = result.getTermFrequencies();

I'm not sure because I've never used QueryTermVector before, but the fact
that QueryTermVector doesn't take an IndexReader as a parameter is a good
indication that it can't tell you anything about the frequency of the term
in your documents.

Daniel
Re: Searching/sorting strategy for many properties for semantic web app
Hi Erik. Many thanks for your reply. I'll likely see if I can find a list
to pose a couple of questions their way. I am having fun with Lucene since
it is new to me, and I am impressed with the speed I am getting. I am
reading anything I can get hold of and trying different code experiments.
So far, the code is fairly straightforward, so I'm not so concerned about
this at the moment.

I am really hoping to hear from experienced people like yourself more on
strategically what to index, what sort of things it would be a good idea
to store, and what to do about a fairly large schema that has much
metadata to offer. Also, perhaps, when sorting and filtering get too
expensive. I realize that just because the metadata is available doesn't
necessarily mean you want to put it all in an index.

I think these issues are pretty general; however, I know there are folks
on this list that would likely advise some particular path or direction
because of their own experiences with Lucene. I would really like to hear
from anyone that has been working with metadata particularly, or anyone
generally, about these topics.

Regards,
David

Erik Hatcher wrote:
> One very nice implementation to take a look at is the Simile project at
> MIT. The Piggy Bank and Longwell projects use Lucene to index RDF and
> integrate full-text and structural queries nicely together.
>
> http://simile.mit.edu
>
> Erik
Re: TREC,INEX and Lucene
Malcolm,

I've used Lucene in TREC last year in my QA list module, as have many of
my contemporaries.

On 2/22/06, Malcolm Clark [EMAIL PROTECTED] wrote:
> Hi all,
> I am planning on participating in INEX and hopefully passively in a
> couple of TREC tracks, mainly using the Lucene API. Is anyone else on
> this list planning on using Lucene during participation? I am
> particularly interested in the SPAM, Blog and ADHOC tracks.
> Malcolm Clark

--
Dave Kor, Research Assistant
Center for Information Mining and Extraction
School of Computing
National University of Singapore.
hyphen not being removed by standard filter
Hi,

I might be missing something. I have a custom analyzer, the gist of which
is:

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, stopSet);
        result = new PorterStemFilter(result);
        return result;
    }

I test my above analyzer with the following query string:

    "the is EOS-20D canon amazing"

In my test code I do this to see what my analyzed query string looks like:

    PerFieldAnalyzerWrapper analyzer =
        new PerFieldAnalyzerWrapper(new StandardStemmingAnalyzer());
    analyzer.addAnalyzer("categoryNames", new KeywordAnalyzer());

    TokenStream stream = analyzer.tokenStream(null, new StringReader(queryString));
    String analyzedQueryString = "";
    while (true) {
        Token token = stream.next();
        if (token == null) {
            break;
        }
        analyzedQueryString = analyzedQueryString + token.termText() + " ";
    }
    analyzedQueryString = analyzedQueryString.trim();
    log.debug("analyzedQueryString = " + analyzedQueryString);

The output of the log statement above is:

    analyzedQueryString = eos-20d canon amaz

I see that the common stop words have been removed, everything has been
lower-cased, and the query has even been stemmed, but why was the hyphen
not removed by the standard filter? Or does the standard analyzer remove
hyphens only from phrases like "eos - 20d" and not from "eos-20d"?

Thanks.