Re: [ot] a reverse lucene
Maybe an RSS feed is a solution: provide an RSS feed as the search result for each query, and people subscribing to that feed would get notifications at regular intervals. They would need to install RSS clients, which can re-run the queries on a schedule.

--- On Sun, 11/23/08, Ian Holsman wrote:

Anshum wrote: Hi Ian, I guess that could be achieved if you write code to read the queries and run each one against the document (using Lucene). Assuming I got the question right! :)

yes.. that is one way, but probably not the most efficient one. Think of something like http://www.google.com/alerts, but instead of running once a day, it would run each time it sees a document (as-it-happens mode), and you would have a couple of million queries to run through.

regards
Ian

--
Anshum Gupta, Naukri Labs! http://ai-cafe.blogspot.com
The facts expressed here belong to everybody, the opinions to me. The distinction is yours to draw.

On Sun, Nov 23, 2008 at 9:27 AM, Ian Holsman wrote: Hi. Apologies for the off-topic question. I was wondering if anyone knew of an open source solution (or a pointer to the algorithms) that does the reverse of Lucene. By that I mean: store a whole lot of queries, and run them against a document to see which queries match it (with a score etc.). The use case I can see for this is a news article, with several people writing queries to get alerted if it matches a certain condition.

Regards
Ian

---
To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [ot] a reverse lucene
On Nov 22, 2008, at 10:57 PM, Ian Holsman wrote: Hi. Apologies for the off-topic question.

Not off-topic at all!

I was wondering if anyone knew of an open source solution (or a pointer to the algorithms) that does the reverse of Lucene. By that I mean: store a whole lot of queries, and run them against a document to see which queries match it (with a score etc.).

This use case was the reason MemoryIndex was created. It is a fast single-document index: incoming documents can be sent to it (in parallel with the main index) and a bunch of queries slammed against each one. There's also InstantiatedIndex to compare it to, as that one can handle multiple documents.

Erik
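MemoryIndex itself handles arbitrary analyzed Lucene queries with real scoring. To make the reverse-matching idea concrete without depending on the Lucene API, here is a minimal, self-contained Java sketch (all names hypothetical) that treats each stored query as a set of required terms and checks one pre-tokenized document against all of them, which is the conjunctive core of the "one document, many queries" loop:

```java
import java.util.*;

// Hypothetical sketch: each stored "query" is a set of required terms;
// a document matches a query when it contains all of them. Real
// MemoryIndex matching additionally handles analysis, phrases, and
// scoring, which are omitted here for brevity.
class ReverseMatcher {
    private final Map<String, Set<String>> queries = new HashMap<>();

    void addQuery(String name, Set<String> requiredTerms) {
        queries.put(name, requiredTerms);
    }

    // Return the names of all queries whose required terms are all
    // present in the (pre-tokenized, lowercased) document.
    List<String> match(Set<String> docTerms) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : queries.entrySet()) {
            if (docTerms.containsAll(e.getValue())) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```

The linear scan over all stored queries is exactly what later posts in this thread try to avoid when the query count reaches millions.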
Re: [ot] a reverse lucene
Thanks Erik. I'll start looking at that.

regards
Ian
Re: [ot] a reverse lucene
I am using MemoryIndex in a similar scenario. I don't have as many queries (fewer than 100), but several "articles" come in per second. It works nicely.
Re: [ot] a reverse lucene
The formal name for this is document filtering, or just filtering. You can get started by looking at TREC, which ran a filtering track for a number of years: http://trec.nist.gov/tracks.html

At any rate, one approach is to store your queries as Lucene documents, albeit short ones. Then, as others have said, you index each new, incoming document into a MemoryIndex. From that, you can extract the key terms, which can then be used to build a query to run against your query index. The MoreLikeThis functionality should help in determining the important terms. Then you need to decide how to handle the results: you probably don't want to route the document to each and every query that matches.

-Grant

--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
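The "extract the key terms" step above can be sketched in plain Java. This is only a stand-in for MoreLikeThis, which also weights terms by inverse document frequency against a background index; a raw term-frequency count is used here just to keep the example self-contained (all names are hypothetical):

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of key-term extraction: count term frequencies
// in the incoming document and keep the top k. MoreLikeThis would also
// apply idf weighting and stopword handling; plain tf is used here
// only for illustration.
class KeyTerms {
    static List<String> topTerms(String text, int k) {
        Map<String, Long> tf = Arrays.stream(text.toLowerCase().split("\\W+"))
            .filter(t -> t.length() > 2)   // crude short-token filter
            .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
        return tf.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(k)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

The resulting terms would then be OR-ed into a query and run against the index of stored queries, as Grant describes.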
Re: [ot] a reverse lucene
On Sun, Nov 23, 2008 at 02:57:28PM +1100, Ian Holsman wrote: The use case I can see for this is a news article, with several people writing queries to get alerted if it matches a certain condition.

I haven't tried this, but if you have lots of queries and few documents, then consider using Lucene, but reconsider how you design your documents. Turn the queries into documents in the index, and turn the document into a query. For something like Google Alerts you can have a document which is "match: keyword". Then the incoming document can become a boolean query over each word in it: "match:foo OR match:bar". Obviously, good choices of analyzers and restrictions on the queries you allow will make this work better.

If you have fewer than 10k stored queries, then running all the queries against an in-memory document will probably be faster (depending on your incoming document rate, though you can batch documents up and run the queries every 15 minutes or so if you don't mind the lag and you're getting lots of incoming documents).

Just an idea.

David

--
About the use of language: it is impossible to sharpen a pencil with a blunt ax. It is equally vain to try to do it with ten blunt axes instead. -- Edsger Dijkstra
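The inversion David describes — the document becomes the query — can be sketched as a plain string transform. The field name `match` follows his example; the whitespace tokenization is deliberately naive and would be replaced by a real analyzer in practice:

```java
import java.util.*;
import java.util.stream.*;

// Sketch of the inversion: the incoming document becomes one big
// disjunctive query string over its distinct terms, to be run against
// an index whose "documents" are the stored queries. Field name
// "match" is taken from the example in the mail.
class DocToQuery {
    static String toBooleanQuery(String doc) {
        return Arrays.stream(doc.toLowerCase().split("\\W+"))
            .filter(t -> !t.isEmpty())
            .distinct()
            .map(t -> "match:" + t)
            .collect(Collectors.joining(" OR "));
    }
}
```

The resulting string could be fed to Lucene's query parser; note that very long documents would need the BooleanQuery clause limit raised.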
Re: AW: Transforming german umlaute like ö,ä ,ü,ß into oe, ae, ue, ss
Where do I get the CharFilter library? I'm using Lucene, not Solr.

Thanks, Sascha

Koji wrote: CharFilter is included in recent Solr nightly builds. It is not an out-of-the-box solution for Lucene right now, sorry. If I have time, I will port it to Lucene this weekend. The patch is now available for Lucene at: https://issues.apache.org/jira/browse/LUCENE-1466

Koji
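Until the LUCENE-1466 patch lands, the folding itself is simple enough to apply as a plain string transform before analysis. Below is a minimal sketch (not the CharFilter API — just the standard German transliteration table applied character by character):

```java
import java.util.*;

// Sketch of German character folding done as a pre-analysis string
// transform, using the conventional transliterations: ä->ae, ö->oe,
// ü->ue, ß->ss (and uppercase variants). A real CharFilter would do
// this inside the analysis chain and keep offsets corrected.
class GermanFolder {
    private static final Map<Character, String> MAP = Map.of(
        'ä', "ae", 'ö', "oe", 'ü', "ue",
        'Ä', "Ae", 'Ö', "Oe", 'Ü', "Ue",
        'ß', "ss");

    static String fold(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            out.append(MAP.getOrDefault(c, String.valueOf(c)));
        }
        return out.toString();
    }
}
```

The same transform must be applied at both index time and query time, or queries for "Müller" and "Mueller" will stop matching each other.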
Re: [ot] a reverse lucene
Ian Holsman wrote: I was wondering if anyone knew of an open source solution (or a pointer to the algorithms) that does the reverse of Lucene.

http://www.seas.upenn.edu/~svilen/publications/subscribe.pdf

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
Contact: info at sigram dot com
[ANN] Luke 0.9.1 - bugfix release
Hi all,

A bugfix release of Luke is now available at the usual place: http://www.getopt.org/luke

New features and improvements:
* Added the ability to set the maximum count of boolean clauses in BooleanQuery.

Bug fixes:
* Unbalanced commit tags breaking the XML export. Reported by Teruhiko Kurosaka.
* Opening a non-existent index from the command line created an empty directory. This is worked around in Luke, although it's a Lucene bug. Reported by Chris Pimlott. See also LUCENE-1464.
* IndexGate inadvertently deleted previous commit points, even if the "Keep all commits" option was specified. Reported by Mark Harwood.
* An empty index with no fields was reported as invalid. Discovered by Andrew Zhang and Michael McCandless (LUCENE-1454).

Thank you!

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
Contact: info at sigram dot com
Re: [ot] a reverse lucene
Thanks for all the suggestions, guys. This is great!
Re: [ot] a reverse lucene
If you index the queries, consider also that they can potentially be indexed in an optimised form. For example, take a phrase query for "Alonso Smith". You need only index one of these terms: an incoming document must contain both terms to be considered a match. If you choose to index this query under the rare term "Alonso", you will get far fewer requests to run it than if you choose to index the comparatively more common "Smith". Basically, any query with mandatory terms can be index-optimised to record only the rarest mandatory term (rarity typically being measured by a look-up on some background index).

Cheers,
Mark
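Mark's optimisation boils down to one lookup per query at registration time. A minimal sketch, assuming the document frequencies come from some background index (the `docFreq` map here is a hypothetical stand-in for that lookup):

```java
import java.util.*;

// Sketch of rarest-term routing: register each conjunctive query under
// only its rarest mandatory term. A document then needs to trigger only
// the queries registered under terms it actually contains, instead of
// every stored query.
class RarestTermRouter {
    static String rarestTerm(Set<String> mandatoryTerms,
                             Map<String, Integer> docFreq) {
        return mandatoryTerms.stream()
            .min(Comparator.comparingInt(
                (String t) -> docFreq.getOrDefault(t, 0)))
            .orElseThrow(() -> new IllegalArgumentException("no terms"));
    }
}
```

Terms absent from the background index default to frequency 0, i.e. they are treated as maximally rare, which is the safe choice: such a query fires only when its unknown term actually appears.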
Re: Doing the lucene remove character \n (break line)
What I'd do is write my own filter, probably based on one of the pre-existing ones: modify the call to nextToken, examine that token, and if it ends in a hyphen, get the next token and return the concatenation of the two. I don't believe there's a pre-existing filter that does this, but you might want to check, because I haven't looked at them in a while.

Best
Erick

On Sun, Nov 23, 2008 at 4:40 PM, farnetani wrote:

I need Lucene to find the sentence:

ARLEI FERREIRA FARNETANI JUNIOR -> [arlei] [ferreira] [farnetani] [junior] (1)

and also:

ARLEI FERREIRA FAR-
NETANI JUNIOR

I'm using the BrazilianAnalyzer, but the result is: [ARLEI] [FERREIRA] [FAR] [NETANI] [JUNIOR]. I need Lucene to produce [ARLEI] [FERREIRA] [FARNETANI] [JUNIOR], equal to sentence (1). So I need Lucene to remove the hyphen and the line break (\n). I managed to remove the hyphen (-), but not the line break. How do I do it?
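An alternative to a custom TokenFilter is to de-hyphenate the raw text before it ever reaches the analyzer. A sketch of that pre-processing step:

```java
// Sketch: join words hyphenated across a line break before analysis,
// instead of writing a custom TokenFilter. The regex removes a hyphen
// followed by a line break plus any surrounding whitespace, re-joining
// the two halves of the word.
class DehyphenateLineBreaks {
    static String dehyphenate(String text) {
        return text.replaceAll("-\\s*\\r?\\n\\s*", "");
    }
}
```

Caveat: this also joins genuinely hyphenated compounds that happen to be split across a line break, so whether it is acceptable depends on the corpus.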
Using multiple index searchers.
Hi all,

After reading the FAQ I have a question regarding the use of multiple indexes, and thus multiple IndexSearchers, on one server. I work on e-commerce websites and am looking at replacing our current method of full-text searching product descriptions and names with a Lucene implementation. I envisaged creating a separate index for each of the sites running on our main webserver (about 10 sites, each with different product listings). However, this would mean having many instances of IndexSearcher open, and potentially running into file-handle limit problems (as outlined in the FAQ) as well as consuming lots of memory. Is this a valid concern? Would I be able to use Lucene this way?

An alternative would be to combine all the sites' data into one index and have a field identifying which site each product entry belongs to. However, I would rather not mix the data together.

Thanks, Henrik
Re: Doing the lucene remove character \n (break line)
I got it. I had just finished when you sent your message, but thanks for your comments! :-D Have a nice day!

Jr.
Re: Using multiple index searchers.
If the data is unrelated, separate indexes will lead to the best performance. Memory usage should be less than or equal to that of one big index. File descriptor usage can be minimized by calling optimize before opening a new IndexSearcher (depending on how often you want to see updates), lowering the merge factor, or using the compound file format.

-Yonik
Re: Using multiple index searchers.
Thanks for the quick reply, time to get to work on a prototype!