Frequency of phrase
This is somewhat related to a question sent to this list a while ago: Is there an efficient way to count the number of occurrences of a phrase (not term) in an index? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: How can I get a term's frequency?
You need to make sure you are indexing with term vectors in order for IndexReader.getTermFreqVector to return anything meaningful. You do not need to implement it. QueryTermVector is meant to provide, for Queries, information similar to what term vectors provide on the Document side.

For an example demo of indexing and using term vectors, go to http://www.cnlp.org/apachecon2005. All the examples are under the Apache license and there is some documentation too.

-Grant

Daniel Noll wrote:
> sog wrote:
>> Hmm, but IndexReader.getTermFreqVector is an abstract method, and I do not know how to implement it in an efficient way. Does anyone have good advice?
>
> You probably don't need to implement it, it's been implemented already. Just call the method.
>
>> I can do it in this way:
>> QueryTermVector vector = new QueryTermVector(Document.getValues(field));
>> freq = vector.getTermFrequencies();
>
> I'm not sure because I've never used QueryTermVector before, but the fact that QueryTermVector doesn't take an IndexReader as a parameter is a good indication that it can't tell you anything about the frequency of the term in your documents.
>
> Daniel

--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University School of Information Studies
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org
Voice: 315-443-5484
Fax: 315-443-6886
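To make the distinction in this thread concrete, here is a minimal plain-Java sketch of the kind of data a per-document term frequency vector holds. It is not Lucene code: the class name TermFreqSketch and the whitespace tokenization are assumptions made for illustration only. In Lucene the corresponding data would come from IndexReader.getTermFreqVector, and only if the field was indexed with term vectors, as Grant notes.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not part of the Lucene API.
class TermFreqSketch {
    // Returns a sorted map of term -> number of occurrences in one document's text.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freqs = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token for leading whitespace
            }
            freqs.merge(token, 1, Integer::sum);
        }
        return freqs;
    }

    public static void main(String[] args) {
        // Prints {be=2, not=1, or=1, to=2}
        System.out.println(termFrequencies("to be or not to be"));
    }
}
```

The point of the sketch is that the frequencies belong to one document's text. A structure built only from the query string, which is what Daniel suspects QueryTermVector is, can describe the query but not the documents.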
Re: Throughput doesn't increase when using more concurrent threads
Can nutch be made to use lucene query parser?

Rgds
Prabhu

On 2/23/06, Peter Keegan [EMAIL PROTECTED] wrote:
> Hi Otis,
>
> The Lucene server is actually CPU and network bound, as the index gets memory mapped pretty quickly. There is little disk activity observed.
>
> I was also able to run the server on a Sun box last night with 4 dual core opterons (same Linux and JVM) and I'm observing query rates of 400 qps! Has Linux been optimized to run on this hardware? I imagine that Sun's JVM has been.
>
> Peter
>
> On 2/22/06, Otis Gospodnetic [EMAIL PROTECTED] wrote:
> Hi,
>
> Some things that could be different:
> - thread scheduling (shouldn't make too much of a difference though)
> ---
> I would also play with disk IO schedulers, if you can. CentOS is based on RedHat, I believe, and RedHat (ext3, really) now has about 4 different IO schedulers that, according to articles I recently read, can have an impact on disk read/write performance. These schedulers can be specified at mount time, I believe, and maybe at boot time (kernel line in Grub/LILO).
>
> Otis
>
> On 2/22/06, Peter Keegan [EMAIL PROTECTED] wrote:
> I am doing a performance comparison of Lucene on Linux vs Windows.
>
> I have 2 identically configured servers (8-CPUs (real) x 3GHz Xeon processors, 64GB RAM). One is running CentOS 4 Linux, the other is running Windows Server 2003 Enterprise Edition x64. Both have 64-bit JVMs from Sun. The Lucene server is using MMapDirectory. I'm running the JVM with -Xmx16000M. Peak memory usage of the JVM on Linux is about 6GB and 7.8GB on Windows.
>
> I'm observing query rates of 330 queries/sec on the Wintel server, but only 200 qps on the Linux box. At first, I suspected a network bottleneck, but when I 'short-circuited' Lucene, the query rates were identical.
>
> I suspect that there are some things to be tuned in Linux, but I'm not sure what. Any advice would be appreciated.
>
> Peter
>
> On 1/30/06, Peter Keegan [EMAIL PROTECTED] wrote:
> I cranked up the dial on my query tester and was able to get the rate up to 325 qps. Unfortunately, the machine died shortly thereafter (memory errors :-( ) Hopefully, it was just a coincidence. I haven't measured 64-bit indexing speed, yet.
>
> Peter
>
> On 1/29/06, Daniel Noll [EMAIL PROTECTED] wrote:
> Peter Keegan wrote:
> I tried the AMD64-bit JVM from Sun and with MMapDirectory and I'm now getting 250 queries/sec and excellent cpu utilization (equal concurrency on all cpus)!! Yonik, thanks for the pointer to the 64-bit jvm. I wasn't aware of it.
>
> Wow. That's fast.
>
> Out of interest, does indexing time speed up much on 64-bit hardware? I'm particularly interested in this side of things because for our own application, any query response under half a second is good enough, but the indexing side could always be faster. :-)
>
> Daniel
>
> --
> Daniel Noll
> Nuix Australia Pty Ltd
> Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
> Phone: (02) 9280 0699
> Fax: (02) 9212 6902
Re: Throughput doesn't increase when using more concurrent threads
I would give the IBM or Blackdown JVM a try on Linux - I've seen pretty wide variance in their speed on different operations. Sometimes better than Sun, sometimes worse - it depended on the task. (I did some ad hoc tests at one point that showed Sun was faster for indexing, but IBM was faster for querying - but that was quite a while ago.)

Dan

--
Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/
SQL DISTINCT functionality in Lucene
Hi, I need to find all distinct values for a keyword field in a Lucene index. Is this easily done? If so how? Many thanks, Hugh
Hierarchical Navigation in Lucene
Hi, We have a custom-built document repository which is searchable / indexed via Lucene. I want to put together some kind of navigation framework based on the repository metadata (which is also indexed with Lucene). Is there a best-practice way to do this? Thanks, Hugh
RE: SQL DISTINCT functionality in Lucene
Many thanks.

Hugh

-----Original Message-----
From: Michael D. Curtin [mailto:[EMAIL PROTECTED]
Sent: 23 February 2006 17:39
To: java-user@lucene.apache.org
Subject: Re: SQL DISTINCT functionality in Lucene

Hugh Ross wrote:
> I need to find all distinct values for a keyword field in a Lucene index.

I think the IndexReader.terms() method will do what you want.

Good luck!

--MDC
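The reason term enumeration answers the DISTINCT question is that an inverted index stores each distinct term of a field once in a sorted dictionary. The following is a minimal plain-Java sketch of that idea, not Lucene code: the class DistinctValuesSketch and its methods are hypothetical names chosen for illustration, standing in for what walking IndexReader.terms() over a single keyword field would yield.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration of a per-field term dictionary; not the Lucene API.
class DistinctValuesSketch {
    // field name -> sorted set of distinct terms seen in that field
    private final Map<String, SortedSet<String>> termsByField = new HashMap<>();

    // Record one untokenized (keyword) value for a field of some document.
    void addKeyword(String field, String value) {
        termsByField.computeIfAbsent(field, f -> new TreeSet<>()).add(value);
    }

    // Enumerating the dictionary for a field gives its distinct values, in sorted order.
    List<String> distinctValues(String field) {
        return new ArrayList<>(termsByField.getOrDefault(field, new TreeSet<>()));
    }

    public static void main(String[] args) {
        DistinctValuesSketch index = new DistinctValuesSketch();
        index.addKeyword("author", "smith");
        index.addKeyword("author", "jones");
        index.addKeyword("author", "smith"); // duplicate value, stored once
        System.out.println(index.distinctValues("author")); // prints [jones, smith]
    }
}
```

Because the dictionary already deduplicates, the cost of listing distinct values depends on the number of distinct terms rather than the number of documents.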
RE: search a subdirectory (New to Lucene)
I reindexed with the path as a keyword field and now the PrefixQuery filter does exactly what I need. Thanks! I'm going to hold off on the paragraph-level indexing for now, but that does sound interesting.

many thanks,
John

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 22, 2006 3:18 PM
To: java-user@lucene.apache.org
Subject: Re: search a subdirectory (New to Lucene)

I presume by saying subdirectory you're referring to filesystem directories and you're indexing a directory tree of files. If you index the path (perhaps relative from the root is best) as a keyword field (untokenized, but indexed) you could perform filtering in a /path/subpath sort of way using PrefixQuery.

As for paragraphs - how you index a document is entirely application dependent. Maybe it makes sense to parse the documents before handing them to Lucene such that you're creating a Lucene Document for each paragraph rather than for each entire file. Slicing the granularity of a domain into Documents is a fascinating topic :)

Erik

On Feb 22, 2006, at 1:00 PM, John Hamilton wrote:
> I'm new to Lucene and was wondering what is the best way to perform a search on a subdirectory or subdirectories within the index? My thought at this point is to build a query to first search for files in the required directory(ies) and then use that query to make a QueryFilter and use that QueryFilter in the actual search. Is there an easier way?
>
> On an unrelated note, does anybody know of a way to get results at the section level within a document? For example, could I find not just a document that matches my query, but the paragraph within that document that best matches the query?
>
> thanks,
> John
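Erik's suggestion works because an untokenized path is a single term, so "everything under a directory" becomes a prefix match on that term. Here is a minimal plain-Java sketch of the matching rule only. It is not Lucene code: PathPrefixSketch and underDirectory are hypothetical names, and the trailing-slash handling is an assumption about how one might avoid matching sibling directories.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical illustration of prefix filtering on stored paths; not the Lucene API.
class PathPrefixSketch {
    // Returns the paths that live under the given directory (at any depth).
    static List<String> underDirectory(Collection<String> paths, String dir) {
        // Append "/" so that "docs/api" does not also match "docs/api2/...".
        String prefix = dir.endsWith("/") ? dir : dir + "/";
        List<String> hits = new ArrayList<>();
        for (String path : paths) {
            if (path.startsWith(prefix)) {
                hits.add(path);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "docs/api/index.txt", "docs/api2/notes.txt", "src/main.txt");
        System.out.println(underDirectory(paths, "docs/api")); // prints [docs/api/index.txt]
    }
}
```

The same care with the separator applies when building the prefix term for a real query: a prefix of "docs/api" without the slash would also pick up "docs/api2".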
Re: Throughput doesn't increase when using more concurrent threads
Hi,

Please ask on the Nutch mailing list (I answered your question in general@ already). Also, please don't steal other people's threads - it's considered impolite for obvious reasons.

Otis

----- Original Message ----
From: Raghavendra Prabhu [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Thursday, February 23, 2006 11:10:11 AM
Subject: Re: Throughput doesn't increase when using more concurrent threads

Can nutch be made to use lucene query parser?

Rgds
Prabhu
Re: Throughput doesn't increase when using more concurrent threads
Hi,

Sorry for the trouble. I was sending my first mail to the group and replied to this thread, and then later on sent a direct mail. I would like to apologise for the inconvenience caused.

Rgds
Prabhu
Re: Throughput doesn't increase when using more concurrent threads
We discovered that the kernel was only using 8 CPUs. After recompiling for 16 (8 + hyperthreads), it looks like the query rate will settle in around 280-300 qps. Much better, although still quite a bit slower than the Opteron.

Peter

On 2/22/06, Yonik Seeley [EMAIL PROTECTED] wrote:
> Hmmm, not sure what that could be. You could try using the default FSDir instead of MMapDir to see if the differences are there.
>
> Some things that could be different:
> - thread scheduling (shouldn't make too much of a difference though)
> - synchronization workings
> - page replacement policy... how to figure out what pages to swap in and which to swap out, esp of the memory mapped files.
>
> You could also try a profiler on both platforms to try and see where the difference is.
>
> -Yonik
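Peter's finding (the kernel exposing only 8 of 16 logical CPUs) is the kind of thing that can be checked from inside the JVM before tuning anything else. The following is a rough sketch under stated assumptions, not Peter's test harness: CpuCheckSketch, visibleCpus and measureQps are hypothetical names, and the "query" is a stand-in Runnable rather than a Lucene search.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; the workload is a placeholder, not a Lucene query.
class CpuCheckSketch {
    // Number of processors the operating system exposes to this JVM.
    static int visibleCpus() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Runs `queries` copies of `query` on `threads` worker threads and
    // returns the observed rate in tasks per second.
    static double measureQps(int threads, int queries, Runnable query) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int i = 0; i < queries; i++) {
            pool.execute(query);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return queries / seconds;
    }

    public static void main(String[] args) {
        int cpus = visibleCpus();
        System.out.println("CPUs visible to the JVM: " + cpus);
        // Double the thread count up to the CPU count and watch where the rate flattens.
        for (int threads = 1; threads <= cpus; threads *= 2) {
            double qps = measureQps(threads, 10_000, () -> Math.sqrt(42.0));
            System.out.printf("%d threads: %.0f tasks/sec%n", threads, qps);
        }
    }
}
```

If the reported CPU count is lower than the hardware should provide, adding threads beyond that number cannot raise throughput for a CPU-bound workload, which matches what the thread title describes.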
Re: Hierarchical Navigation in Lucene
On Feb 23, 2006, at 12:37 PM, Hugh Ross wrote:
> Hi, We have a custom-built document repository which is searchable / indexed via Lucene. I want to put together some kind of navigation framework based on the repository metadata (which is also indexed with Lucene). Is there a best-practice way to do this?

I don't know about a best practice, but I've used term enumeration coupled with PrefixQuery to enable hierarchical navigation on my (very dusty and way outdated) blog: http://www.blogscene.org/erik

Erik
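One way to picture "term enumeration coupled with PrefixQuery" is a browse step that, for a given node, lists its immediate children with document counts. The sketch below shows that step in plain Java under stated assumptions: it is not Erik's implementation or Lucene code, HierarchyNavSketch and children are hypothetical names, and each document is assumed to carry one slash-separated category path.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of one navigation step; not the Lucene API.
class HierarchyNavSketch {
    // For the given parent ("" means the root), returns each immediate child
    // segment together with the number of documents filed at or below it.
    static Map<String, Integer> children(List<String> docPaths, String parent) {
        String prefix = parent.isEmpty() ? "" : parent + "/";
        Map<String, Integer> counts = new TreeMap<>();
        for (String path : docPaths) {
            if (!path.startsWith(prefix)) {
                continue; // not under this parent
            }
            String rest = path.substring(prefix.length());
            if (rest.isEmpty()) {
                continue;
            }
            int slash = rest.indexOf('/');
            String child = slash < 0 ? rest : rest.substring(0, slash);
            counts.merge(child, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "books/fiction/scifi", "books/fiction/crime", "books/history", "music/jazz");
        System.out.println(children(docs, ""));      // prints {books=3, music=1}
        System.out.println(children(docs, "books")); // prints {fiction=2, history=1}
    }
}
```

Selecting a child then narrows the result set with a prefix match on the chosen path, and the same step is repeated one level down.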
Re: Searching/sorting strategy for many properties for semantic web app
On Feb 22, 2006, at 9:01 PM, David Pratt wrote:
> Hi Erik. Many thanks for your reply. I'll likely see if I can find a list to pose a couple of questions their way. I am having fun with Lucene since it is new to me and I am impressed with the speed I am getting. I am reading anything I can get hold of and trying different code experiments. So far, the code is fairly straightforward so I'm not so concerned about this at the moment.
>
> I am really hoping to hear from experienced people like yourself more on strategically what to index, what sort of things it would be a good idea to store and what to do about a fairly large schema that has much metadata to offer. Also perhaps when sorting and filtering gets too expensive. I realize that just because the metadata is available doesn't necessarily mean you want to even put it all in an index.
>
> I think these issues are pretty general, however I know there are folks on this list that would likely advise some particular path or direction because of their own experiences with Lucene. I would really like to hear from anyone that has been working with metadata particularly or anyone generally about these topics.

In my University job, I'm dealing with a fair bit of metadata in the form of RDF about 19th century literature objects. I'm indexing basic Dublin Core data such as title and author as individual fields, and also dropping all indexed metadata into a single searchable field. I've been using Kowari as the metadata store, but it also has Lucene integration (that I've not tried myself yet).

I'm not sure what else to add as your query is a bit general. I think you'll find if you post more specific questions you're more likely to get detailed responses. General queries tend to be too general to respond to, I find. There really are no best practices with Lucene in terms of what to index or what to store - these are all highly application dependent and are often something I tune as the application itself evolves.

Erik
Re: Throughput doesn't increase when using more concurrent threads
Chris,

I tried JRockit a while back on 8-cpu/windows and it was slower than Sun's. Since I seem to be cpu-bound right now, I'll be trying a 16-cpu system next (32 with hyperthreading), on LinTel. I may give JRockit another go around then.

Thanks,
Peter

On 2/23/06, Chris Lamprecht [EMAIL PROTECTED] wrote:
> Peter,
>
> Have you given JRockit JVM a try? I've seen it help throughput compared to Sun's JVM on a dual xeon/linux machine, especially with concurrency (up to 6 concurrent searches happening). I'm curious to see if it makes a difference for you.
>
> -chris
Getting no hits ...
I have been trying to figure out why my query below would not return any hits.

I use two custom analyzers for indexing and searching. The one I use for indexing uses this:

    public TokenStream tokenStream(String fieldName, Reader reader)
    {
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, stopSet);
        result = new SynonymFilter(result, new MySynonymEngine());
        result = new PorterStemFilter(result);
        return result;
    }

The one I use for searching uses this:

    public TokenStream tokenStream(String fieldName, Reader reader)
    {
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, stopSet);
        result = new PorterStemFilter(result);
        return result;
    }

(Basically while searching I do not use the SynonymFilter.)

I have quite a few products that I index that have the text on which I am querying.

I do a search for this: ES-20D

This is the final query that I run:

    +(+content:es\-20d) +entity:product +(title:es\-20d~2^40.0 ((title:es\-20d)^10.0) content:es\-20d~2^20.0 (content:es\-20d) categoryName:es\-20d^80.0)

(The content and title fields are Indexed, Tokenized and Stored. The categoryName field is Indexed and Stored.)

I get no hits.

Where am I going wrong with this? Any pointers?

-Thanks.
Re: Throughput doesn't increase when using more concurrent threads
Wow, some resources! Would it be cheaper / more scalable to copy the index to multiple boxes and load-balance requests across them?

-Yonik

On 2/23/06, Peter Keegan [EMAIL PROTECTED] wrote:
> Since I seem to be cpu-bound right now, I'll be trying a 16-cpu system next (32 with hyperthreading), on LinTel. I may give JRockit another go around then.
>
> Thanks,
> Peter
Re: Throughput doesn't increase when using more concurrent threads
Yonik,

We're investigating both approaches. Yes, the resources (and permutations) are dizzying!

Peter
Re: Getting no hits ...
1) Have you looked at what tokens your indexing analyzer produces when you tokenize ES-20D?

2) Have you looked at what tokens your query analyzer produces when you tokenize ES-20D?

3) Have you tried a simpler query (i.e. just content:es\-20d)?

4) When giving QueryParser a (quoted) phrase search, I don't think you really want to escape that - character.

-Hoss
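Points 1 and 2 amount to running the same text through both analysis chains and comparing the output: a query term can only match if it is identical to a term that was written at index time. The sketch below illustrates that check in plain Java. It is not Lucene code: AnalyzerCompareSketch and both toy tokenizers are hypothetical, chosen only to show how a difference in hyphen handling between index time and query time produces zero hits.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical toy "analyzers"; real Lucene analyzers behave differently.
class AnalyzerCompareSketch {
    // Lower-cases and splits on whitespace only, so "ES-20D" stays one token.
    static List<String> keepHyphens(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    // Lower-cases and splits on whitespace and hyphens, so "ES-20D" becomes two tokens.
    static List<String> splitHyphens(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("[\\s-]+"));
    }

    // A query can match only if every query token exists among the indexed tokens.
    static boolean allQueryTokensIndexed(List<String> indexed, List<String> query) {
        return new HashSet<>(indexed).containsAll(query);
    }

    public static void main(String[] args) {
        List<String> indexed = keepHyphens("Canon ES-20D body");
        System.out.println(indexed);                                                 // [canon, es-20d, body]
        System.out.println(allQueryTokensIndexed(indexed, keepHyphens("ES-20D")));   // true
        System.out.println(allQueryTokensIndexed(indexed, splitHyphens("ES-20D")));  // false
    }
}
```

Printing both token lists side by side, as suggested above, is usually faster than reasoning about the final query string.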
Re: Getting no hits ...
In my earlier email I put in the wrong search term. The correct one is: EOS-20D

And this is the query in question, which still produces no hits:

+(+content:eos\-20d) +entity:product +(title:eos\-20d~2^40.0 ((title:eos\-20d)^10.0) content:eos\-20d~2^20.0 (content:eos\-20d) categoryName:eos\-20d^80.0)

I have used AnalyzerUtils.displayTokensWithFullDetails(analyzer, string) (AnalyzerUtils is from the LIA book). This is part of its log output when this product gets indexed:

119: [013803044430:857-869:ALPHANUM]
120: [eos-20d:870-877:NUM]
121: [011-eos-20d:878-889:NUM]

This is part of its log output when I do the search:

1: [eos-20d:0-6:NUM]

From what I understand, the analyzer is producing the same tokens during indexing and during searching.
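The comparison of index-time and query-time tokens described in this message can also be done mechanically. The following is a minimal sketch, assuming the log entries have exactly the shape shown in the message ([term:start-end:TYPE]); the class TokenLogCheck and its methods are hypothetical names for this example and are not part of Lucene or of the AnalyzerUtils class from the book.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenLogCheck {
    // Matches entries such as "[eos-20d:870-877:NUM]": term text, offsets, token type.
    private static final Pattern TOKEN =
        Pattern.compile("\\[([^\\]]+?):(\\d+)-(\\d+):([A-Z_]+)\\]");

    /** Extracts the term text of every token entry found in the log text. */
    public static List<String> terms(String log) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(log);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    /** Returns the query-side terms that never occur among the index-side terms. */
    public static List<String> missing(String indexLog, String queryLog) {
        List<String> indexed = terms(indexLog);
        List<String> result = new ArrayList<>();
        for (String t : terms(queryLog)) {
            if (!indexed.contains(t)) {
                result.add(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String indexLog = "119: [013803044430:857-869:ALPHANUM] "
            + "120: [eos-20d:870-877:NUM] 121: [011-eos-20d:878-889:NUM]";
        String queryLog = "1: [eos-20d:0-6:NUM]";
        // An empty list means every query term text also appeared at index time.
        System.out.println(missing(indexLog, queryLog));
    }
}
```

Matching term text is only a necessary condition: the tokens must also end up in the field the query targets, so an empty result here does not by itself explain or rule out the missing hits.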
Re: ArrayIndexOutOfBounds being thrown ...
Hi everyone, Sorry for not replying to the original post (from Mufaddal Khumri, 22/2) - I'm new to the list. I also had this problem, but it seems not to be in the source - downloading and building the 1.9-rc1 source fixed the problem for me. Steve Stephen Gray Archive Research Officer Australian Social Science Data Archive 18 Balmain Crescent (Building #66) The Australian National University Canberra ACT 0200 Phone +61 2 6125 2185 Fax +61 2 6125 0627 Web http://assda.anu.edu.au/
Re: Getting no hits ...
Follow-up on my previous email ... When I execute this query in Luke, using the standard analyzer on the same index, I get 8 hits:

+(+content:eos\-20d) +entity:product +(title:eos\-20d~2^40.0 ((title:eos\-20d)^10.0) content:eos\-20d~2^20.0 (content:eos\-20d) categoryName:eos\-20d^80.0)

I modified my searching code to use the standard analyzer, but I did not get any hits back. I am still trying to figure out the problem. Any ideas?
phrase frequency??
I searched the mail archive for my question and found that what I really want is a phrase frequency; it is an old question that was never resolved well. I traced the Lucene source code and discovered that I can get a phrase's IDF from the Hits object:

weight= PhraseQuery$PhraseWeight (id=62)
    idf= 8.3973465
    queryNorm= 0.11908524
    queryWeight= 1.0
    similarity= DefaultSimilarity (id=66)
    this$0= PhraseQuery (id=29)
    value= 8.3973465

From this we can get an approximate formula:

score = tf * idf

so:

tf(phrase) = score / idf(phrase)

Is this correct?

- Original Message -
From: Daniel Noll [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Thursday, February 23, 2006 8:57 AM
Subject: Re: How can I get a term's frequency?

sog wrote: en, but IndexReader.getTermFreqVector is an abstract method, I do not know how to implement it in an efficient way. Anyone has good advise?

You probably don't need to implement it, it's been implemented already. Just call the method.

I can do it in this way: QueryTermVector vector= new QueryTermVector(Document.getValues(field)); freq = result.getTermFrequencies();

I'm not sure because I've never used QueryTermVector before, but the fact that QueryTermVector doesn't take an IndexReader as a parameter is a good indication that it can't tell you anything about the frequency of the term in your documents.

Daniel

--
Daniel Noll Nuix Australia Pty Ltd Suite 79, 89 Jones St, Ultimo NSW 2007, Australia Phone: (02) 9280 0699 Fax: (02) 9212 6902
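To make the question concrete, here is a small plain-Java sketch of the two quantities involved; it does not use any Lucene classes, and the names PhraseCount, count and approxFreq are hypothetical. The first method counts phrase occurrences directly over an already tokenized document. The second inverts the proposed formula under the assumption that tf is the square root of the raw frequency, as in DefaultSimilarity, and that field norms and any score normalization are ignored, so its result should be read as a rough estimate at best.

```java
import java.util.Arrays;
import java.util.List;

class PhraseCount {
    /** Counts the positions where the phrase tokens occur consecutively in the document. */
    public static int count(List<String> doc, List<String> phrase) {
        if (phrase.isEmpty() || phrase.size() > doc.size()) {
            return 0;
        }
        int hits = 0;
        for (int i = 0; i + phrase.size() <= doc.size(); i++) {
            if (doc.subList(i, i + phrase.size()).equals(phrase)) {
                hits++;
            }
        }
        return hits;
    }

    /**
     * Inverts score = sqrt(freq) * idf.
     * Assumption: norms, boosts and score normalization are all ignored.
     */
    public static double approxFreq(double score, double idf) {
        double tf = score / idf;
        return tf * tf;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
            "new", "york", "city", "is", "not", "new", "york", "state");
        System.out.println(count(doc, Arrays.asList("new", "york")));
        System.out.println(approxFreq(2 * 8.3973465, 8.3973465));
    }
}
```

Counting over positions gives an exact per-document answer but requires the token stream (or stored positions), whereas the score-based estimate needs only the search result and is correspondingly less reliable.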