Tuning Indexing performance question ..
Hi,

I am using a multi-threaded app to index a bunch of data. The app spawns X threads. Each thread writes to a RAMDirectory. When a thread finishes its work, the contents of its RAMDirectory are written into the FSDirectory. All threads are passed an instance of the FSWriter when they are created.

Now, reading the Lucene docs, I understand that indexing performance can be tweaked further by adjusting mergeFactor, maxMergeDocs and minMergeDocs. Am I right that these three parameters affect the writing of the index to the FSDirectory and not to the RAMDirectory (since a RAMDirectory exists entirely in memory)? In other words, does tweaking mergeFactor, maxMergeDocs and minMergeDocs affect the performance of writing to the RAMDirectory?

-Thanks

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Regarding Indexes
The solution to your problem lies in the answers to several business-domain questions, such as:

1. Will each company only want to search its own data, or ALL the data?
2. If you do not know the answer to that, is there a chance that some companies would want to search only their own data, others would want to search data from company a and company b, and yet another company would want to search all the data?
3. How does having one index, as opposed to individual indexes, affect indexing given the load it will have to handle from one or more companies? [Note: you could also index data from a group of companies together, according to how much data they have on average.]
4. It may turn out that at this point neither you nor the companies have any way of knowing how much data they will have. In that case you will have to use your best judgement about which path to take, and build your app in such a way that it is abstracted from whether the data is indexed in one index or in multiple indexes. Later you can experiment with different setups as you gain more understanding of how the application is used.

One way is to index all the data from a particular company with one of the terms being companyIdentifier. This way you will be able to search within company d's data, within a few different companies' data, or across the entire index.

-Mufaddal.

-----Original Message-----
From: Ravi [mailto:[EMAIL PROTECTED]
Sent: Friday, March 31, 2006 9:22 AM
To: java-user@lucene.apache.org
Subject: Regarding Indexes

Hi Luceners,

This is my problem. Can anybody give a solution for it?

I am going to implement search for a company that will support the ASP (Application Service Provider) model. In this model, around 200 companies are going to register with us, add their documents and search them. The question is whether I should maintain an individual index for each company or a single index for all the companies.

If I maintain individual indexes, then I need to create 200 searcher objects, because each index has to be searched. If I maintain a single index, I can have one index searcher, but I need to add a condition for each document. Moreover, if in future anybody needs their own data, we cannot provide it to them. So please tell me which model can help us solve this problem. The key point in this application is that add/modify/delete will occur very frequently.

Please help me; I am waiting for your feedback.

Thanks
Ravi Kumar Jaladanki
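The single-index approach suggested in the reply above can be sketched as a small query-string wrapper. This is only an illustration under stated assumptions: the field name companyIdentifier comes from the message, the output uses standard Lucene query-parser syntax (where + marks a required clause), and the class and method names are hypothetical. It also assumes the company ids are indexed as single untokenized terms and contain no characters that need escaping.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper: restricts a user query to one or more companies. */
final class CompanyScopedQuery {

    // Field name taken from the message above; everything else is illustrative.
    static final String COMPANY_FIELD = "companyIdentifier";

    /**
     * Makes userQuery required and additionally requires the document to
     * belong to at least one of the given companies. An empty company list
     * means "search the entire index".
     * Assumes userQuery is already valid query-parser syntax.
     */
    static String scope(String userQuery, List<String> companyIds) {
        String required = "+(" + userQuery + ")";
        if (companyIds.isEmpty()) {
            return required;
        }
        // Optional clauses inside a required group: at least one must match.
        String companies = companyIds.stream()
                .map(id -> COMPANY_FIELD + ":" + id)
                .collect(Collectors.joining(" "));
        return required + " +(" + companies + ")";
    }

    public static void main(String[] args) {
        System.out.println(scope("invoice 2006", List.of("d")));
        System.out.println(scope("invoice 2006", List.of("a", "b")));
        System.out.println(scope("invoice 2006", List.of()));
    }
}
```

Because the company restriction lives in the query rather than in the index layout, the same code path serves "one company", "a few companies" and "everything", which is the abstraction the reply recommends building in from the start.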
RE: Update or Delete Document for Lucene 1.4.x
The way you update a document in Lucene is by deleting the current one and adding a new one.

-Mufaddal.

-----Original Message-----
From: Don Vaillancourt [mailto:[EMAIL PROTECTED]
Sent: Friday, March 31, 2006 1:37 PM
To: java-user@lucene.apache.org
Subject: Update or Delete Document for Lucene 1.4.x

Hi All,

I need to implement the ability to update one document within a Lucene collection. I haven't been able to find anything in the API. Is there a way to update one document, or to delete a document so that I can add an updated version?

Thank You

--
Don Vaillancourt
Director of Software Development
WEB IMPACT INC.
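The delete-then-add pattern described in the reply can be illustrated with a toy in-memory stand-in for an index. This is not Lucene's API; the class, its methods and the "id" field are hypothetical, and the sketch only shows why each document needs a unique identifier field that the delete step can target.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Toy in-memory stand-in for an index; NOT Lucene's API. */
final class ToyIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();

    void add(Map<String, String> doc) {
        docs.add(doc);
    }

    /** Removes every document whose field equals value; returns how many were removed. */
    int deleteByTerm(String field, String value) {
        Objects.requireNonNull(value, "value");
        int removed = 0;
        for (Iterator<Map<String, String>> it = docs.iterator(); it.hasNext();) {
            if (value.equals(it.next().get(field))) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    /** "Update" = delete the old version by its unique id, then add the new one. */
    void update(String idField, Map<String, String> newDoc) {
        deleteByTerm(idField, newDoc.get(idField));
        add(newDoc);
    }

    /** Returns all documents whose field equals value. */
    List<Map<String, String>> find(String field, String value) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> d : docs) {
            if (value.equals(d.get(field))) {
                out.add(d);
            }
        }
        return out;
    }

    int size() {
        return docs.size();
    }
}
```

A usage example: after adding a document with id 42 and then calling update("id", ...) with a new version carrying the same id, find("id", "42") returns exactly one document, the new one. Without a unique id field there is nothing reliable to delete by, so the old version would stay in the index next to the new one.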
Getting no hits ...
I have been trying to figure out why my query below does not return any hits.

I use two custom analyzers, one for indexing and one for searching. The one I use for indexing does this:

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, stopSet);
        result = new SynonymFilter(result, new MySynonymEngine());
        result = new PorterStemFilter(result);
        return result;
    }

The one I use for searching does this:

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, stopSet);
        result = new PorterStemFilter(result);
        return result;
    }

(Basically, while searching I do not use the SynonymFilter.)

Quite a few of the products I index contain the text I am querying for.

I do a search for this: ES-20D

This is the final query that I run:

    +(+content:es\-20d) +entity:product +(title:es\-20d~2^40.0 ((title:es\-20d)^10.0) content:es\-20d~2^20.0 (content:es\-20d) categoryName:es\-20d^80.0)

(The content and title fields are Indexed, Tokenized and Stored. The categoryName field is Indexed and Stored.)

I get no hits. Where am I going wrong with this? Any pointers?

-Thanks.
Re: Getting no hits ...
In my earlier email I gave the wrong search string. The correct one is: EOS-20D

And this is the query in question, which still produces no hits:

    +(+content:eos\-20d) +entity:product +(title:eos\-20d~2^40.0 ((title:eos\-20d)^10.0) content:eos\-20d~2^20.0 (content:eos\-20d) categoryName:eos\-20d^80.0)

I have used AnalyzerUtils.displayTokensWithFullDetails(analyzer, string) (AnalyzerUtils is from the LIA book).

This is part of its log output when this product gets indexed:

    119: [013803044430:857-869:ALPHANUM]
    120: [eos-20d:870-877:NUM]
    121: [011-eos-20d:878-889:NUM]

This is part of its log output when I do the search:

    1: [eos-20d:0-6:NUM]

From what I can see, the analyzer produces the same token at indexing time and at search time.

Chris Hostetter wrote:

1) Have you looked at what tokens your indexing analyzer produces when you tokenize ES-20D?
2) Have you looked at what tokens your query analyzer produces when you tokenize ES-20D?
3) Have you tried a simpler query (i.e. just content:es\-20d)?
4) When giving QueryParser a (quoted) phrase search, I don't think you really want to escape that - character.

-Hoss
Re: Getting no hits ...
Follow-up on my previous email ...

When I execute this query in Luke, using the standard analyzer on the same index, I get 8 hits:

    +(+content:eos\-20d) +entity:product +(title:eos\-20d~2^40.0 ((title:eos\-20d)^10.0) content:eos\-20d~2^20.0 (content:eos\-20d) categoryName:eos\-20d^80.0)

I modified my searching code to use the standard analyzer, but I still did not get any hits back. I am still trying to figure the problem out. Any ideas?
ArrayIndexOutOfBoundsException being thrown ...
Getting an ArrayIndexOutOfBoundsException ...

Line 31 in IndexSearcherManager.java:

    ...
    public static IndexSearcher getIndexSearcher(String indexPath) {
        logger.debug("indexPath = " + indexPath);
        searcher = new IndexSearcher(indexPath); // line 31
        return searcher;
    }
    ...

I get the following exception:

    28628 DEBUG com.allegrocentral.tandoori.managers.search.IndexSearcherManager [21] - indexPath = /opt/tomcat/webapps/ROOT/WEB-INF/search-index
    28666 WARN org.apache.struts.action.RequestProcessor [516] - Unhandled Exception thrown: class java.lang.ArrayIndexOutOfBoundsException
    28669 ERROR org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/].[action] [704] - Servlet.service() for servlet action threw exception
    java.lang.ArrayIndexOutOfBoundsException: -1
        at java.util.ArrayList.get(ArrayList.java:323)
        at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:155)
        at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:151)
        at org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java:149)
        at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:115)
        at org.apache.lucene.index.TermInfosReader.readIndex(TermInfosReader.java:86)
        at org.apache.lucene.index.TermInfosReader.<init>(TermInfosReader.java:45)
        at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:112)
        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:89)
        at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:118)
        at org.apache.lucene.store.Lock$With.run(Lock.java:109)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:95)
        at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:38)
        at com.allegrocentral.tandoori.managers.search.IndexSearcherManager.getIndexSearcher(IndexSearcherManager.java:31)

Any ideas as to why this might be happening? (I am using lucene-core-1.9-rc1.jar.)

-Thanks.
hyphen not being removed by standard filter
Hi,

I might be missing something. I have a custom analyzer, the gist of which is:

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, stopSet);
        result = new PorterStemFilter(result);
        return result;
    }

I test the analyzer with the following query string:

    the is EOS-20D canon amazing

In my test code I do this to see what the analyzed query string looks like:

    PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardStemmingAnalyzer());
    analyzer.addAnalyzer("categoryNames", new KeywordAnalyzer());
    TokenStream stream = analyzer.tokenStream(null, new StringReader(queryString));
    String analyzedQueryString = "";
    while (true) {
        Token token = stream.next();
        if (token == null) {
            break;
        }
        // Join the analyzed terms with single spaces.
        analyzedQueryString = analyzedQueryString + token.termText() + " ";
    }
    analyzedQueryString = analyzedQueryString.trim();
    log.debug("analyzedQueryString = " + analyzedQueryString);

The output of the log statement above is:

    analyzedQueryString = eos-20d canon amaz

I see that the common stop words have been removed, everything has been lower-cased and the query has been stemmed. Why was the hyphen not removed by the standard filter? Or does the standard analyzer remove hyphens only from phrases like "eos - 20d" and not from "eos-20d"?

Thanks.
get results by relevance, limiting results and then sort the results by some criterion
When I do a search on, for example, "batteries", I get 1200+ results. I would like to show the user, let's say, 300. I can do that by extracting only the first 300 hits (sorted by decreasing relevance by default) and displaying those to the user.

Now, on the search results page, I have a drop-down box that lets the user sort the results by price. When the user selects "Sort by price low to high", I would like to sort the same 300 hits I got above by price.

Essentially, I want to sort the first 300 relevant search results by price. In other words, I would like to get search results by relevance, limit the results, and then sort the limited results by some criterion.

What would be a good way to do this in Lucene?

-Thanks.
Re: get results by relevance, limiting results and then sort the results by some criterion
So yes, if the (x+1)th item happens to be a camera, and its price happens to be lower than that of the previous x cameras, it won't be included in this view, and that is exactly what we want.

Mufaddal Khumri wrote:

In my case, when we search for, let's say, "cameras", my top x results are all sorts of cameras, and after those I get documents that match camera casings etc. As a company we want to show as many cameras as possible, and not other camera-related products, for this one web view on a specific page we have. On this same page we also want to let the user select "price high to low" or "price low to high" and sort these top x results.

Essentially, the hard part is to come up with an X such that you ideally don't prune any cameras. As a business we want to get as many cameras as possible into the search results, but at the same time we don't mind if a few cameras do not appear in those results, provided we can fine-tune the search results to show only cameras and not camera casings, camera batteries etc.

I have been looking at QueryFilter and the Sort API, but haven't yet figured out a way to do what I am trying to do. Any pointers are greatly appreciated.

-Thanks,

John Powers wrote:

I'm sure you've taken care of this, but I am curious myself: if the 301st document only has a single term "batteries" (and thus is far down in the Hits), but has a price of seven cents, then a sort of all the documents with "batteries" would put it near the top. By eliminating all documents beyond 300, this one doesn't appear in the solution you are working towards, correct? Why is that a good thing?

It seems you would want to sort the full document list, and then return the top 300 that you want the user to see. I think I'm just curious why getting rid of some documents that could (in a new sort) rank higher is a good thing.
Re: get results by relevance, limiting results and then sort the results by some criterion
Currently I am doing exactly that. I am boosting relevant docs and sorting in Java to get the desired effect. I was just trying to see whether I could do the same thing using QueryFilter or Sort.

-Thanks.

John Powers wrote:

Also, if you don't like the tag solution, you could borrow something right from LIA: boost the documents that are significant products with 1.5 (or whatever value higher than 1), and boost the support/ancillary products with 0.1. If there is nothing relevant among the significant products, at least you'll get some of the others. After all, if they search for "bolt", maybe they do want an ancillary product.
Re: get results by relevance, limiting results and then sort the results by some criterion
Hi,

That's exactly what I am doing currently. I was just wondering if there is a Lucene way to do what I am doing, using QueryFilter etc.

-Thanks.

Dan Armbrust wrote:

If you are only talking about ordering the items that you are going to show to the user, that seems to imply that the number will be small. Why don't you just re-sort the items that you are going to display somewhere in your code after you get the documents back from Lucene? It may not be quite as clean, but I doubt that there will be any performance impact.

Dan
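The "re-sort in your own code" approach discussed in this thread can be sketched in plain Java. This is a minimal illustration, not a Lucene feature: the Result class and its fields (name, score, price) are hypothetical stand-ins for whatever is read back from the search hits.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TopNThenSort {

    /** Hypothetical search result; the fields are illustrative. */
    static final class Result {
        final String name;
        final float score;   // relevance score reported by the search
        final double price;

        Result(String name, float score, double price) {
            this.name = name;
            this.score = score;
            this.price = price;
        }
    }

    /**
     * Keeps the `limit` most relevant hits, then orders only those by price.
     * Hits outside the relevance cut-off are dropped even if they are cheaper.
     */
    static List<Result> topByRelevanceThenByPrice(List<Result> hits, int limit, boolean lowToHigh) {
        List<Result> byScore = new ArrayList<>(hits);
        byScore.sort(Comparator.comparingDouble((Result r) -> r.score).reversed());

        List<Result> top = new ArrayList<>(byScore.subList(0, Math.min(limit, byScore.size())));

        Comparator<Result> byPrice = Comparator.comparingDouble(r -> r.price);
        top.sort(lowToHigh ? byPrice : byPrice.reversed());
        return top;
    }
}
```

With hits A (score 0.9, price 300), B (score 0.8, price 100) and C (score 0.1, price 1) and a limit of 2, a low-to-high sort returns B then A. C is pruned despite being cheapest, which is the behaviour the original poster says the business wants.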
exact match ..
Let's say I do this while indexing:

    doc.add(Field.Text("categoryNames", categoryNames));

Now, while searching categoryNames, I search for "digital cameras". I only want to match documents that have exactly the phrase "digital cameras" in the categoryNames field. I do not want documents with "digital camera batteries" to be part of the result.

What's the best way to accomplish this?

Thanks.
span first query and boosting ..
Hi,

I do this:

    SpanFirstQuery fullPhraseInCategoryNamesQuery = new SpanFirstQuery(new SpanTermQuery(new Term("categoryNames", "digital cameras")), 2);
    fullPhraseInCategoryNamesQuery.setBoost(8);

In my log output I get this:

    spanFirst(categoryNames:digit camera, 2))

The boost does not show up in the output. Why can't I boost a span query? What am I doing wrong?

-Thanks
Re: exact match ..
Hi Steve,

If I understand you right, I could use something like the KeywordAnalyzer to tokenize the entire stream as a single token and store that in the index. I could definitely use the KeywordAnalyzer while indexing this particular field, categoryNames.

Now my question is how to search and boost this, since it is part of a bigger boolean query in my case. My typical query actually looks like:

    +(+content:digit +content:camera) +entity:product +(title:digit camera~2^40.0 ((title:digit title:camera)^10.0) content:digit camera~2^20.0 (content:digit content:camera) categoryNames:digit camera^80.0)

As you can see, I was trying to do a phrase query on the categoryNames field and boost it by 80.0. I am also using the Porter stemming filter to stem while searching (I do this while indexing as well).

If I go with the KeywordAnalyzer approach, I can index the categoryNames field using this analyzer. Would I then use the QueryParser to create my query, passing it the KeywordAnalyzer for the search on categoryNames, and make that query part of my global boolean query?

-Thanks.

Steven Rowe wrote:

Hi Mufaddal,

One way to do this is to use the KeywordAnalyzer (in the Lucene Subversion trunk, but not in v1.4.3; will be in the forthcoming v1.9) for the categoryNames field. This analyzer does not tokenize field contents, so "digital cameras" would be a single token, and the only thing that would match it would be the exact same single token. Be careful when you search to construct the search tokens similarly.
If you have other fields you want to search, and you want to tokenize their contents when you index them, you could use the PerFieldAnalyzerWrapper, so that the KeywordAnalyzer is only used for the categoryNames field.

Steve
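The difference between a tokenized field and a single-token (keyword) field, as described in Steve's reply, can be illustrated with a toy model in plain Java. This is not Lucene code: the two methods below are hypothetical and only mimic, in a simplified way, how a phrase match over tokens differs from an exact match on one unsplit token. Stemming and lower-casing filters are deliberately left out, apart from a plain toLowerCase.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ExactFieldMatch {

    /** Simplified tokenizer: lower-case and split on whitespace. */
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    /** Tokenized view: the phrase matches if its words appear consecutively anywhere in the field. */
    static boolean phraseMatches(String fieldValue, String phrase) {
        return Collections.indexOfSubList(tokens(fieldValue), tokens(phrase)) >= 0;
    }

    /** Keyword view: the whole field value is one token, so only an identical value matches. */
    static boolean keywordMatches(String fieldValue, String query) {
        return fieldValue.equals(query);
    }
}
```

For a field value of "digital cameras and accessories" (a made-up category name), phraseMatches with "digital cameras" is true while keywordMatches is false; for a field value of exactly "digital cameras", both are true. Note that the keyword comparison is exact, including case, which is why Steve warns that the search-side token has to be constructed the same way as the indexed one.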
StandardAnalyzer .. stemming
The SnowballAnalyzer seems to offer stemming. The StandardAnalyzer, on the other hand, has a bunch of other niceness. What is the best practice for leveraging both of these analyzers while indexing and searching? Do I chain them up somehow, and if so, what APIs should I look at for doing so? Do I implement my own analyzer and use both of these to process the tokens?

Thanks,
Re: StandardAnalyzer .. stemming
Thank you. I think in my case I can just use the last approach you suggested. One more question: what jar is SnowballFilter part of?

Chris Hostetter wrote:

The Analyzer class is already designed to make chaining very easy, but not Analyzer chaining: TokenFilter chaining. If you take a look at the source for StandardAnalyzer and SnowballAnalyzer, it should (hopefully) be very obvious how to write your own (10 lines or less) Analyzer that gives you all the goodness you want from both...

http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardAnalyzer.java?rev=219090&view=markup
http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/snowball/src/java/org/apache/lucene/analysis/snowball/SnowballAnalyzer.java?rev=151459&view=markup

...if you literally just want to add Snowball stemming to the end of StandardAnalyzer, then I *think* something like this would work...

    Analyzer a = new StandardAnalyzer(stoplist) {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // Wrap StandardAnalyzer's own token stream with the stemmer.
            return new SnowballFilter(super.tokenStream(fieldName, reader), yourChoiceOfStemmerName);
        }
    };

-Hoss
Lucene Query ... understanding
Hi, I am just trying to see if I understand the Lucene query below correctly.

    +(+contentNew:radio +contentNew:mp3) +entity:product +(name:"radio mp3"^4.0 (contentNew:radio contentNew:mp3) contentNew:"radio mp3"^2.0)

Here is my reading of it:
1. the contentNew field has the word radio AND the word mp3, AND
2. the entity field has the word product, AND
3. the phrase "radio mp3" is in the field name, boosted by 4, OR the word radio is in the field contentNew, OR the word mp3 is in the field contentNew, OR the phrase "radio mp3" is in the field contentNew, boosted by 2.

(I am trying to understand the above query in terms of ANDs, ORs, groupings and boosting, as opposed to prohibited and required.) Am I correct in my understanding? Thanks,
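The AND/OR reading in the question can be written down as a toy matcher. This is a sketch in plain Java, not Lucene code: a document is modeled as a map from field name to an ordered token list, and `hasTerm`, `hasPhrase` and `matches` are hypothetical helpers. Boosts are left out because they change ranking, not whether a document matches.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Toy model of the query in the question; NOT Lucene code. All names are hypothetical.
class QueryReading {
    // True if the field contains the given token.
    static boolean hasTerm(Map<String, List<String>> doc, String field, String term) {
        List<String> tokens = doc.get(field);
        return tokens != null && tokens.contains(term);
    }

    // True if the words occur consecutively and in order in the field.
    static boolean hasPhrase(Map<String, List<String>> doc, String field, String... words) {
        List<String> tokens = doc.get(field);
        return tokens != null
                && Collections.indexOfSubList(tokens, Arrays.asList(words)) >= 0;
    }

    static boolean matches(Map<String, List<String>> doc) {
        // 1. +(+contentNew:radio +contentNew:mp3): both terms are required.
        boolean both = hasTerm(doc, "contentNew", "radio")
                && hasTerm(doc, "contentNew", "mp3");
        // 2. +entity:product
        boolean isProduct = hasTerm(doc, "entity", "product");
        // 3. +( ... ): a required group whose clauses are all optional,
        //    so at least one of them has to match.
        boolean any = hasPhrase(doc, "name", "radio", "mp3")
                || hasTerm(doc, "contentNew", "radio")
                || hasTerm(doc, "contentNew", "mp3")
                || hasPhrase(doc, "contentNew", "radio", "mp3");
        return both && isProduct && any;
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = Map.of(
                "entity", List.of("product"),
                "name", List.of("portable", "radio"),
                "contentNew", List.of("radio", "with", "mp3", "playback"));
        System.out.println(matches(doc));
    }
}
```

One consequence of this reading: any document that satisfies part 1 already satisfies part 3 (contentNew contains radio), so the third group does not narrow the result set. Its practical effect is on scoring, where the boosted phrase clauses push name and phrase matches up the ranking.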
Strange Problem ... Luke returns results Lucene api does not.
Hi, I have a query that gets hits via Luke. I can see the documents it finds. But when I run the same query via my Java code it returns 0 hits.

Note:
1. I am using the StandardAnalyzer while indexing and searching.
2. I have made sure that I am querying the same index via Luke and through my Java program.

This is the call I make in my Java code:

    BooleanQuery finalQuery = new BooleanQuery();
    .
    .
    log.debug(finalQuery.toString());
    hits = IndexSearcherManager.getIndexSearcher(indexPath).search(finalQuery);
    log.debug("Hits length = " + hits.length());

The output of the first log statement above is:

    +(+contentNew:Wireless +contentNew:fm +contentNew:car +contentNew:transmitter) +entity:category +(name:"Wireless fm car transmitter"^40.0 ((name:Wireless name:fm name:car name:transmitter)^10.0) contentNew:"Wireless fm car transmitter"^20.0 (contentNew:Wireless contentNew:fm contentNew:car contentNew:transmitter))

The output of the second log statement above is:

    Hits length = 0

I run the above query against the same index via Luke and I get the search results I expect. Any ideas as to why my Java call does not return any hits? How might I debug this? Thanks,
Re: Strange Problem ... Luke returns results Lucene api does not.
I am using the StandardAnalyzer with Luke. The StandardAnalyzer lowercases while indexing and searching. Both of the following finalQuery.toString() outputs:

    +(+contentNew:wireless +contentNew:fm +contentNew:car +contentNew:transmitter) +entity:product +(name:"wireless fm car transmitter"^40.0 ((name:wireless name:fm name:car name:transmitter)^10.0) contentNew:"wireless fm car transmitter"^20.0 (contentNew:wireless contentNew:fm contentNew:car contentNew:transmitter))

OR

    +(+contentNew:Wireless +contentNew:fm +contentNew:car +contentNew:transmitter) +entity:category +(name:"Wireless fm car transmitter"^40.0 ((name:Wireless name:fm name:car name:transmitter)^10.0) contentNew:"Wireless fm car transmitter"^20.0 (contentNew:Wireless contentNew:fm contentNew:car contentNew:transmitter))

work in Luke just fine. But when I try to execute the above BooleanQuery via a call to IndexSearcher.search(finalQuery), it returns no hits.

Erik Hatcher wrote:

How are you constructing your BooleanQuery and what Analyzer are you using with Luke? You have some capitalized words in your query, and most analyzers would lowercase those, which may be the issue (perhaps you indexed the capitalized words?). Erik

On Feb 16, 2006, at 2:41 PM, Mufaddal Khumri wrote:

Hi, I have a query that gets hits via luke. I can see the documents it finds. But when I run the same query via my java code it returns 0 hits. [...]
Re: Strange Problem ... Luke returns results Lucene api does not.
Yes, that's exactly the problem. I just found out that the analyzer was not being set correctly. Thanks,

Chris Hostetter wrote:

: Standard analyzer lower cases while indexing and searching.

Correct, but since the toString() of your query still has capital words in it (like contentNew:Wireless) you obviously didn't build this query using the StandardAnalyzer -- IndexSearcher doesn't apply any Analyzers for you when you search -- it's the responsibility of whatever is constructing your query (be that custom code you've written, or QueryParser) to run the input through the appropriate Analyzer. When you paste that query into Luke, it *does* run it through the QueryParser for you -- so the text gets analyzed and lowercased.

-Hoss
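The mismatch described in the reply above can be reproduced with a toy inverted index. This is a sketch in plain Java, not Lucene: `analyze` is a hypothetical stand-in for an analyzer that splits on whitespace and lowercases, and `hitsForRawTerm` looks a term up exactly as given, the way a hand-built term query would.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy inverted index, NOT Lucene. Tokens are lowercased at index time only.
class AnalyzerMismatch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Hypothetical stand-in for analysis: split on whitespace, lowercase each token.
    static String[] analyze(String text) {
        return text.toLowerCase(Locale.ROOT).trim().split("\\s+");
    }

    void index(int docId, String text) {
        for (String token : analyze(text)) {
            postings.computeIfAbsent(token, k -> new HashSet<>()).add(docId);
        }
    }

    // Looks the term up exactly as given; no analysis is applied here.
    int hitsForRawTerm(String term) {
        Set<Integer> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    public static void main(String[] args) {
        AnalyzerMismatch idx = new AnalyzerMismatch();
        idx.index(1, "Wireless FM car transmitter");
        // Raw term with a capital W: the index only contains "wireless".
        System.out.println(idx.hitsForRawTerm("Wireless"));             // 0
        // Same input run through the analysis step first.
        System.out.println(idx.hitsForRawTerm(analyze("Wireless")[0])); // 1
    }
}
```

A quick check that follows from this: if the toString() of the query shows capital letters but the index was built with a lowercasing analyzer, the query terms were not analyzed.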
de pluralization
Hello, I am posting this question here since this might be a common problem and some of you might have good pointers. Are there algorithms or APIs built into Lucene that would help de-pluralize words while indexing and/or while searching the index? Are there analyzers that do this already? There is a lot of ongoing academic work in this area, and I was wondering what the best way to solve this problem is. We have ideas and heuristics ourselves, but would love input from the community. Any pointers or ideas? Thank you, Mufaddal.

-- This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error please notify the system manager. Please note that any views or opinions presented in this email are solely those of the author and do not necessarily represent those of the company. Finally, the recipient should check this email and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email. Consult your physician prior to the use of any medical supplies or product. --
RE: Question regarding boosting
Hi, After a little probing and experimenting I formulated this query:

    queryString = "entity:\"" + en + "\" AND (name:\"" + queryString + "\"^2 OR content:\"" + queryString + "\")";
    Query q = QueryParser.parse(queryString, "content", analyzer);

When I execute the above, the following query gets executed in Lucene:

    +entity:product +(name:"audio cable"^2.0 content:"audio cable")

Note: "audio cable" is the contents of the search box. I also saw that my OR gets represented as a blank in the query. Is that fine? The results from executing this query seem alright, but is this a good way of achieving what I was trying to achieve? (NOTE: My original post explains what I am trying to do.) Any insight would be appreciated. Mufaddal.

-Original Message- From: Mufaddal Khumri [mailto:[EMAIL PROTECTED] Sent: Friday, May 20, 2005 3:34 PM To: java-user@lucene.apache.org Subject: Question regarding boosting

Hi, I wanted to know the best way to do what I describe below. I am creating an index of all my products and categories. While indexing, I create the following documents for my products and categories:

Product:

    doc.add(Field.UnIndexed("id", (String) obj[0]));
    doc.add(Field.Keyword("entity", "product"));
    doc.add(Field.Text("name", name));
    doc.add(Field.Text("content", content));

Category:

    doc.add(Field.UnIndexed("id", (String) obj[0]));
    doc.add(Field.Keyword("entity", "category"));
    doc.add(Field.Text("name", name));
    doc.add(Field.Text("content", content));

As you can see above, the id is stored so that the objects can be retrieved from the database. The entity field distinguishes whether I want to carry out my search on products or categories. The content field is a combination of the name and description of the product or category. The name field is the name of the product or the name of the category. My searching and indexing work great.

This is how I am searching:

    Query query1 = QueryParser.parse(queryString, "content", analyzer);
    Term term = null;
    if (entity.equals("product"))
        term = new Term("entity", "product");
    else if (entity.equals("category"))
        term = new Term("entity", "category");
    TermQuery query2 = new TermQuery(term);
    BooleanQuery bq = new BooleanQuery();
    bq.add(query1, true, false);
    bq.add(query2, true, false);
    return indexSearcher.search(bq);

As you can see above, I am using the content and entity fields to do my search and everything works fine. What I want to do now is boost the results so that a match on the name field gives a higher rank. How do I do this? For example, by adding something like this:

    ...
    Query query3 = QueryParser.parse(queryString, "name", analyzer);
    query3.setBoost(2);
    ...
    bq.add(query3, true, false);

When I do the above, I print a toString() on my final BooleanQuery, which is:

    +content:radio +entity:category +name:radio^2.0

When I am doing my search for products, let's say, how do I tell Lucene: "Show me all products, ordered in such a way that the more a product's name matches the query string, the higher its relevance"? So the relevance should be in the following order:
1. Product name matches: highest relevance.
2. Product content matches: still relevant, but less than a name match in 1.

Any ideas? Thanks.
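The query string built in the follow-up above can be put together with a small helper. This is a sketch in plain Java that only assembles the text handed to the query parser; `quote` and `build` are hypothetical names, and the escaping assumes the parser accepts a backslash before a double quote or backslash inside a quoted phrase, which is worth checking against the Lucene version in use.

```java
// Sketch of assembling the boosted query string; it does not call Lucene.
class BoostedQueryString {
    // Wraps the input in double quotes, escaping backslashes and quotes so that
    // text typed into the search box cannot close the phrase early.
    // (Assumes the query parser honours backslash escapes.)
    static String quote(String userInput) {
        return "\"" + userInput.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // entity must match; name and content are alternatives, with name boosted.
    static String build(String entity, String userInput, float nameBoost) {
        String phrase = quote(userInput);
        return "entity:" + quote(entity)
                + " AND (name:" + phrase + "^" + nameBoost
                + " OR content:" + phrase + ")";
    }

    public static void main(String[] args) {
        // prints: entity:"product" AND (name:"audio cable"^2.0 OR content:"audio cable")
        System.out.println(build("product", "audio cable", 2.0f));
    }
}
```

On the question of OR showing up as a blank: in the printed form of a BooleanQuery a `+` prefix marks a required clause and a clause with no prefix is optional, so `+entity:product +(name:... content:...)` is the expected rendering of `entity AND (name OR content)`. A document matching both name and content scores higher than one matching content alone, and the `^2.0` boost adds further weight to the name match.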
Lucene loosing documents?
Hi, I am trying to index 20349 records. When I index using the FSDirectory I get 20349 documents, which is correct. Now when I use the RAMDirectory to create my index and write all documents from the RAMDirectory to the FSDirectory, I consistently get only 20340 documents. This is the only change I made. Why do I lose 9 documents?

    int counter = 1;
    while (counter <= 20349) {
        ramWriter.addDocument(doc);
        counter++;
    }
    Directory d[] = {ramDir};
    fsWriter.addIndexes(d);
    fsWriter.optimize();
    ramWriter.close();
    fsWriter.close();

Any ideas as to why I am missing the 9 documents? Thanks.
RE: Lucene loosing documents?
Hi, Thanks. That seems to work. I guess calling close() before the add causes the last few documents to be flushed out or something?

-Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Thursday, April 28, 2005 2:19 PM To: java-user@lucene.apache.org Subject: Re: Lucene loosing documents?

Can you close the ramWriter first and then add the RAMDirectory via fsWriter and see if that solves it? Otis

--- Mufaddal Khumri [EMAIL PROTECTED] wrote:

Hi, I am trying to index 20349 records. When I index using the FSDirectory I get 20349 documents - this is correct. [...]
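One possible explanation for the number 9, offered as an assumption rather than something confirmed in the thread: if the writer on the RAMDirectory holds added documents in a buffer and flushes them in groups of 10 (I believe 10 was the default minMergeDocs in Lucene releases of that period), then 20349 documents leave 9 sitting in the buffer until close() is called, and 20349 - 9 = 20340. The toy model below is plain Java, not Lucene; `BufferedWriterModel` and its buffer size are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a writer that buffers documents before flushing; NOT Lucene code.
class BufferedWriterModel {
    private final List<String> directory;              // what a reader of the directory can see
    private final List<String> buffer = new ArrayList<>();
    private final int bufferSize;                       // assumed flush threshold

    BufferedWriterModel(List<String> directory, int bufferSize) {
        this.directory = directory;
        this.bufferSize = bufferSize;
    }

    // Documents only become visible once a full buffer is flushed.
    void addDocument(String doc) {
        buffer.add(doc);
        if (buffer.size() == bufferSize) {
            flush();
        }
    }

    // Closing flushes whatever is still buffered.
    void close() {
        flush();
    }

    private void flush() {
        directory.addAll(buffer);
        buffer.clear();
    }

    public static void main(String[] args) {
        List<String> ramDir = new ArrayList<>();
        BufferedWriterModel ramWriter = new BufferedWriterModel(ramDir, 10);
        for (int counter = 1; counter <= 20349; counter++) {
            ramWriter.addDocument("doc" + counter);
        }
        System.out.println(ramDir.size()); // 20340: copying now would miss 9 documents
        ramWriter.close();
        System.out.println(ramDir.size()); // 20349: safe to copy after close()
    }
}
```

Under that assumption the fix suggested in the thread follows directly: close the writer on the RAMDirectory before passing the directory to fsWriter.addIndexes().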
Lucene bulk indexing
Hi, I am sure this question must have been raised before, and maybe it has even been answered. I would be grateful if someone could point me in the right direction or share their thoughts on this topic.

The problem: I have approximately over 2 products that I need to index. At the moment I fetch X products at a time and index them. This process takes about 26 minutes (I am indexing the database id, product name and product description). I was thinking of ways to make this indexing faster. One idea is to write a threaded module that would index X products simultaneously. For instance, I could spawn (number of products / X) threads and do the indexing. I am guessing this would be faster, but by what factor? (I understand the writes to the index are synchronized by Lucene.) Is there any other approach by which I could speed up the indexing? Thoughts? Suggestions? Thanks, Mufaddal.
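The batching idea in the question can be sketched with a fixed-size thread pool instead of one thread per batch, so that the number of threads stays bounded however many products there are. This is plain Java with no Lucene or database calls: `indexProduct` is a hypothetical placeholder for "load one product and add it to the index", and the batch size and thread count are illustrative values only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of splitting the work into batches processed by a bounded pool.
class BatchIndexing {
    static final AtomicInteger indexed = new AtomicInteger();

    // Hypothetical placeholder: a real version would fetch the product and
    // hand a document to the (shared) index writer.
    static void indexProduct(int productId) {
        indexed.incrementAndGet();
    }

    // Returns how many products were processed.
    static int indexAll(int productCount, int batchSize, int threads) {
        indexed.set(0);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int start = 0; start < productCount; start += batchSize) {
                final int from = start;
                final int to = Math.min(start + batchSize, productCount);
                pending.add(pool.submit(() -> {
                    for (int id = from; id < to; id++) {
                        indexProduct(id);
                    }
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for the batch and surface any worker failure
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("indexing batch failed", e);
        } finally {
            pool.shutdown();
        }
        return indexed.get();
    }

    public static void main(String[] args) {
        System.out.println(indexAll(1000, 100, 4)); // 1000
    }
}
```

How much faster this is cannot be read off the code. If most of the 26 minutes goes to fetching rows from the database, several threads can overlap that waiting; if most of it goes to the synchronized index writes mentioned in the question, extra threads will help much less. Timing the fetch and the addDocument calls separately on a single thread is a cheap way to find out which case applies.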