Re: [ot] a reverse lucene

2008-11-23 Thread Cool The Breezer
Maybe an RSS feed is a solution. Just provide an RSS feed of the search results for each 
query, and people subscribing to these feeds would get notifications at regular 
intervals. They need to install RSS clients, which can run the queries at regular 
intervals. 


--- On Sun, 11/23/08, Ian Holsman [EMAIL PROTECTED] wrote:

 From: Ian Holsman [EMAIL PROTECTED]
 Subject: Re: [ot] a reverse lucene
 To: java-user@lucene.apache.org
 Date: Sunday, November 23, 2008, 2:35 AM
 Anshum wrote:
  Hi Ian,
  I guess that could be achieved if you write code to read the queries and
  query for each document (using lucene).
  Assuming that I got the question right! :)

 yes.. that is one way, but probably not the most efficient one.

 think of something like http://www.google.com/alerts, but instead of
 running once a day, it would run each time it sees a document.
 (as-it-happens mode)
 and you would have a couple of million queries to run through.

 regards
 Ian
  --
  Anshum Gupta
  Naukri Labs!
  http://ai-cafe.blogspot.com

  The facts expressed here belong to everybody, the opinions to me. The
  distinction is yours to draw

  On Sun, Nov 23, 2008 at 9:27 AM, Ian Holsman [EMAIL PROTECTED] wrote:

  Hi. apologies for the off-topic question.

  I was wondering if anyone knew of a open source solution (or a pointer to
  the algorithms)
  that do the reverse of lucene.
  By that I mean store a whole lot of queries, and run them against a
  document to see which queries match it. (with a score etc)

  I can see the case for this would be a news-article and several people
  writing queries to get alerted if it matched a certain condition.

  Regards
  Ian
 
 


  

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [ot] a reverse lucene

2008-11-23 Thread Erik Hatcher


On Nov 22, 2008, at 10:57 PM, Ian Holsman wrote:

Hi. apologies for the off-topic question.


Not off-topic at all!

I was wondering if anyone knew of a open source solution (or a  
pointer to the algorithms)

that do the reverse of lucene.
By that I mean store a whole lot of queries, and run them against a  
document to see which queries match it. (with a score etc)


I can see the case for this would be a news-article and several  
people writing queries to get alerted if it matched a certain  
condition.


This use case was the reason MemoryIndex was created.  It's a fast  
single-document index: incoming documents can be sent to it in  
parallel with the main index, and a bunch of queries slammed against it.   
There's also InstantiatedIndex to compare to, as it can handle  
multiple documents.
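
As a rough, library-free illustration of the idea (this is not MemoryIndex's API or its internals), the sketch below builds a throwaway in-memory term set for one incoming document and runs every stored query against it. The class name SingleDocMatcher, its methods, the naive ASCII tokenization, and the restriction to "all terms required" queries are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: plain JDK only, no Lucene classes involved.
final class SingleDocMatcher {
    // Stored queries: query id -> terms that must all appear in the document.
    private final Map<String, Set<String>> queries = new LinkedHashMap<>();

    void addQuery(String id, String... requiredTerms) {
        Set<String> terms = new HashSet<>();
        for (String term : requiredTerms) {
            terms.add(term.toLowerCase(Locale.ROOT));
        }
        queries.put(id, terms);
    }

    // Build a one-document "index" (just a term set), then test every stored query.
    List<String> match(String documentText) {
        Set<String> docTerms = new HashSet<>(
                Arrays.asList(documentText.toLowerCase(Locale.ROOT).split("\\W+")));
        List<String> matched = new ArrayList<>();
        for (Map.Entry<String, Set<String>> query : queries.entrySet()) {
            if (docTerms.containsAll(query.getValue())) {
                matched.add(query.getKey());
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        SingleDocMatcher matcher = new SingleDocMatcher();
        matcher.addQuery("q1", "lucene", "index");
        matcher.addQuery("q2", "solr");
        System.out.println(matcher.match("A fast Lucene index for one document"));
    }
}
```

Note that this loop visits every stored query for every document, so its cost grows linearly with the number of queries; with millions of queries some way of pre-selecting candidate queries would probably be needed.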


Erik





Re: [ot] a reverse lucene

2008-11-23 Thread Ian Holsman

Thanks Erik.
I'll start looking at that.

regards
Ian



Re: [ot] a reverse lucene

2008-11-23 Thread jm
I am using MemoryIndex in a similar scenario. I don't have as many
queries, though (fewer than 100), but several 'articles' arrive per
second.

Works nicely.




Re: [ot] a reverse lucene

2008-11-23 Thread Grant Ingersoll
The formal name for this stuff is document filtering, or just  
filtering.  You can start by looking at TREC, which had a  
filtering task for a number of years: http://trec.nist.gov/tracks.html


At any rate, one approach is to store your queries as Lucene  
documents, albeit short ones.  Then, as others have said, you index  
each new incoming doc into a MemoryIndex.  From that, you can extract  
the key terms, which can then be used to build a Query to run  
against your query index.  The MoreLikeThis functionality should  
help in determining the important terms.  Then you need to decide how  
to handle the results.  You probably don't want to route  
the document to each and every query that matches.
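
The "extract the key terms" step can be sketched without any Lucene classes. The example below is only in the spirit of that step, not the MoreLikeThis API: it scores each term of the incoming document by its frequency in the document times a log-scaled rarity taken from a background document-frequency table, and keeps the top few. The class KeyTermPicker, the scoring formula, and the background table are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: plain JDK only; this is not the MoreLikeThis API.
final class KeyTermPicker {
    // backgroundDocFreq: term -> number of background documents containing it (assumed input).
    static List<String> topTerms(String text, Map<String, Integer> backgroundDocFreq,
                                 int totalDocs, int maxTerms) {
        // Count how often each term occurs in the incoming document.
        Map<String, Integer> termFreq = new HashMap<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                termFreq.merge(term, 1, Integer::sum);
            }
        }
        // Score = frequency here * log(totalDocs / (docFreq + 1)); rare background terms win.
        final Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> entry : termFreq.entrySet()) {
            int docFreq = backgroundDocFreq.getOrDefault(entry.getKey(), 0);
            score.put(entry.getKey(),
                    entry.getValue() * Math.log((double) totalDocs / (docFreq + 1)));
        }
        List<String> terms = new ArrayList<>(score.keySet());
        // Highest score first; ties are broken alphabetically so the result is deterministic.
        terms.sort((a, b) -> {
            int byScore = Double.compare(score.get(b), score.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(maxTerms, terms.size())));
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("the", 990);
        docFreq.put("lucene", 10);
        docFreq.put("alerts", 5);
        System.out.println(topTerms("the lucene alerts the the", docFreq, 1000, 2));
    }
}
```

The selected terms would then be OR-ed together into the query that is run against the index of stored queries.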


-Grant

--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ














Re: [ot] a reverse lucene

2008-11-23 Thread David Sheldon
On Sun, Nov 23, 2008 at 02:57:28PM +1100, Ian Holsman wrote:
 I can see the case for this would be a news-article and several people 
 writing queries to get alerted if it matched a certain condition.

I haven't tried this, but if you have lots of queries and few documents
then consider using lucene, but reconsidering how you design your
documents.

Turn the queries into documents in the index, and turn the document
into a query.

For something like Google Alerts, you can have a document which is
   match: keyword

Then the document can become a boolean query for each word in it:
   match:foo OR match:bar

Obviously good choices of analysers and simplification of the queries
that you allow will make this better.
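
A minimal sketch of this inversion, using a plain Java map in place of a real Lucene index: each stored alert is filed under its keyword (the "match: keyword" document), and an incoming document is turned into an OR over its words by looking each word up. The class KeywordAlertIndex, its method names, and the single-keyword restriction are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: plain JDK only, standing in for a real Lucene index of queries.
final class KeywordAlertIndex {
    // Inverted index over stored alerts: keyword -> ids of the alerts watching for it.
    private final Map<String, List<String>> alertsByKeyword = new HashMap<>();

    // Rough equivalent of indexing a tiny document "match: keyword" for this alert.
    void addAlert(String alertId, String keyword) {
        alertsByKeyword
                .computeIfAbsent(keyword.toLowerCase(Locale.ROOT), k -> new ArrayList<>())
                .add(alertId);
    }

    // Rough equivalent of running "match:w1 OR match:w2 OR ..." built from the document's words.
    Set<String> alertsFor(String documentText) {
        Set<String> hits = new TreeSet<>();
        for (String word : documentText.toLowerCase(Locale.ROOT).split("\\W+")) {
            List<String> ids = alertsByKeyword.get(word);
            if (ids != null) {
                hits.addAll(ids);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        KeywordAlertIndex index = new KeywordAlertIndex();
        index.addAlert("a1", "foo");
        index.addAlert("a2", "bar");
        index.addAlert("a3", "baz");
        System.out.println(index.alertsFor("Foo and bar walk into a document"));
    }
}
```

The work per document here depends on the number of words in the document rather than on the number of stored alerts, which is the point of turning the problem around.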

If you have fewer than 10k stored queries then running all
the queries against a document in memory will probably be faster
(depending on your incoming document rate, though you can batch them up
and run the queries every 15 minutes or so if you don't mind the
lag and you're getting lots of incoming documents).

Just an idea.

David
-- 
About the use of language: it is impossible to sharpen a pencil with a blunt
ax.  It is equally vain to try to do it with ten blunt axes instead.
-- Edsger Dijkstra




Re: AW: Transforming german umlaute like ö,ä ,ü,ß into oe, ae, ue, ss

2008-11-23 Thread Koji Sekiguchi

  Where do I get the CharFilter library? I'm using Lucene, not Solr.
 
  Thanks,
  Sascha
 CharFilter is included in recent Solr nightly builds.
 It is not an out-of-the-box solution for Lucene yet, sorry.
 If I have time, I will port it to Lucene this weekend.

The patch for Lucene is now available at:
https://issues.apache.org/jira/browse/LUCENE-1466

Koji





Re: [ot] a reverse lucene

2008-11-23 Thread Andrzej Bialecki

Ian Holsman wrote:

Hi. apologies for the off-topic question.

I was wondering if anyone knew of a open source solution (or a pointer 
to the algorithms)

that do the reverse of lucene.
By that I mean store a whole lot of queries, and run them against a 
document to see which queries match it. (with a score etc)


I can see the case for this would be a news-article and several people 
writing queries to get alerted if it matched a certain condition.



http://www.seas.upenn.edu/~svilen/publications/subscribe.pdf



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





[ANN] Luke 0.9.1 - bugfix release

2008-11-23 Thread Andrzej Bialecki

Hi all,

A bugfix release of Luke is now available at the usual place:

http://www.getopt.org/luke

* New features and improvements:

  o Added ability to set the maximum count of boolean clauses in
    BooleanQuery.

* Bug fixes:

  o Unbalanced commit tags breaking the XML export. Reported by
    Teruhiko Kurosaka.

  o Opening a non-existent index from command-line creates an empty
    directory. This is worked-around in Luke although it's a Lucene bug.
    Reported by Chris Pimlott. See also LUCENE-1464.

  o IndexGate inadvertently deleting previous commit points, even if
    "Keep all commits" option was specified. Reported by Mark Harwood.

  o Empty index with no fields was reported as invalid. Discovered by
    Andrew Zhang and Michael McCandless (LUCENE-1454).



Thank you!

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: [ot] a reverse lucene

2008-11-23 Thread Ian Holsman

Thanks for all the suggestions guys..
This is great!





Re: [ot] a reverse lucene

2008-11-23 Thread markharw00d
If you index the queries, consider also that they can potentially be 
indexed in an optimised form.


For example, take a phrase query for "Alonso Smith". You need only index 
one of these terms, since an incoming document must contain both terms to be 
considered a match. If you chose to index this query on the rare term 
"Alonso" you would get far fewer requests to run this query than if you 
chose to index it on the comparatively more common "Smith". Basically any 
query with mandatory terms can be index-optimised to record only the 
rarest mandatory term (rarity typically being measured by a 
look-up on some background index).
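
A small sketch of this optimisation, again with plain Java collections rather than Lucene: each stored query is filed under only its rarest mandatory term, and only the queries filed under a term that occurs in the incoming document are fully checked. The class RarestTermQueryIndex, the background frequency map, and the simplification of the final check to "all terms present" (a real phrase query would also need a position check) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: plain JDK only; names and data structures are illustrative.
final class RarestTermQueryIndex {
    // Background statistics: term -> document frequency in some reference corpus.
    private final Map<String, Integer> backgroundDocFreq;
    private final Map<String, Set<String>> termsByQueryId = new HashMap<>();
    // Each query id is filed under exactly one term: its rarest mandatory term.
    private final Map<String, List<String>> queryIdsByRarestTerm = new HashMap<>();

    RarestTermQueryIndex(Map<String, Integer> backgroundDocFreq) {
        this.backgroundDocFreq = backgroundDocFreq;
    }

    // mandatoryTerms: at least one lower-case term; all must occur in a matching document.
    void addQuery(String queryId, String... mandatoryTerms) {
        String rarest = mandatoryTerms[0];
        for (String term : mandatoryTerms) {
            if (docFreq(term) < docFreq(rarest)) {
                rarest = term;
            }
        }
        termsByQueryId.put(queryId, new HashSet<>(Arrays.asList(mandatoryTerms)));
        queryIdsByRarestTerm.computeIfAbsent(rarest, k -> new ArrayList<>()).add(queryId);
    }

    private int docFreq(String term) {
        return backgroundDocFreq.getOrDefault(term, 0); // unseen terms count as rarest
    }

    // Queries whose rarest term occurs in the document; only these get the full check.
    List<String> candidates(String documentText) {
        List<String> ids = new ArrayList<>();
        for (String term : terms(documentText)) {
            List<String> filed = queryIdsByRarestTerm.get(term);
            if (filed != null) {
                ids.addAll(filed);
            }
        }
        Collections.sort(ids);
        return ids;
    }

    // Full check on the candidates: every mandatory term must be present.
    List<String> match(String documentText) {
        Set<String> docTerms = terms(documentText);
        List<String> matched = new ArrayList<>();
        for (String id : candidates(documentText)) {
            if (docTerms.containsAll(termsByQueryId.get(id))) {
                matched.add(id);
            }
        }
        return matched;
    }

    private static Set<String> terms(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+")));
    }
}
```

With a background table in which "alonso" is rare and "smith" is common, a document mentioning only Smith produces no candidates at all, so the stored query is never run against it.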


Cheers,
Mark




Re: Doing the lucene remove character \n (break line)

2008-11-23 Thread Erick Erickson
What I'd do is make my own filter, probably one based upon one of
the pre-existing ones, and modify the call to nextToken: examine that
token, and if it ends in a hyphen, get the next token and return the
concatenation of the two. I don't believe that there's a pre-existing
filter that does this, but you might want to check because I haven't
looked at them in a while.
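
The joining logic itself can be sketched independently of Lucene's TokenFilter classes. The example below works on a plain list of tokens and assumes the upstream tokenizer splits on whitespace only, so that "FAR-" still carries its hyphen when the filter sees it; the class HyphenJoiner and its method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: plain JDK iterator logic, not a Lucene TokenFilter subclass.
final class HyphenJoiner {
    // Assumes tokens come from a whitespace-only split, so "FAR-" keeps its hyphen.
    static List<String> join(List<String> tokens) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = tokens.iterator();
        while (it.hasNext()) {
            String token = it.next();
            // While the token ends in a hyphen and another token follows, glue them together.
            while (token.endsWith("-") && it.hasNext()) {
                token = token.substring(0, token.length() - 1) + it.next();
            }
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "ARLEI FERREIRA FAR-\nNETANI JUNIOR";
        // Splitting on any whitespace also swallows the line break.
        System.out.println(join(Arrays.asList(text.split("\\s+"))));
    }
}
```

Run on the example from the question, this prints [ARLEI, FERREIRA, FARNETANI, JUNIOR]. A real filter would also have to decide what to do with words that are legitimately hyphenated at the end of a line.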

Best
Erick

On Sun, Nov 23, 2008 at 4:40 PM, farnetani [EMAIL PROTECTED] wrote:


 I need Lucene to find the sentence:
 ARLEI FERREIRA FARNETANI JUNIOR
 [arlei] [ferreira] [farnetani] [junior](1)

 and also:

 ARLEI FERREIRA FAR-   (line break)
 NETANI JUNIOR

 I'm using the Brazilian Analyzer, but the result is:
 [ARLEI] [FERREIRA] [FAR] [NETANI] [JUNIOR]

 I need the Lucene result to be:
 [ARLEI] [FERREIRA] [FARNETANI] [JUNIOR], matching sentence (1)

 So I need Lucene to remove the hyphen (-) and the line break (\n).

 I managed to remove the hyphen, but not the line break.

 How do I do that?
 --
 View this message in context:
 http://www.nabble.com/Doing-the-lucene-remove-character-%5Cn-%28break-line%29-tp20650540p20650540.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.






Using multiple index searchers.

2008-11-23 Thread Henrik Axelsson
Hi all,

After reading the FAQ I have a question regarding the use of multiple
indexes, and thus multiple IndexSearchers, on one server.

I work on ecommerce websites and am looking at replacing our current method
of full text searching product descriptions and names with a Lucene
implementation. I envisaged creating a separate index for each of the
sites running on our main webserver (about 10 sites, each with different
product listings). However this would mean that I would need to have many
instances of IndexSearcher open and potentially come across file handle
limit problems (as outlined in the FAQ) as well as consuming lots of memory.
Is this a valid concern? Would I be able to use Lucene this way?

An alternative would be to combine all the sites' data into one index, and
have a field identifying which site each product entry belonged to. However
I would rather not mix the data together.

Thanks,
Henrik


Re: Doing the lucene remove character \n (break line)

2008-11-23 Thread farnetani

I got it. I finished just now, before you sent your message, but thanks for
your comments! :-D

Have a nice day!

Jr.



-- 
View this message in context: 
http://www.nabble.com/Doing-the-lucene-remove-character-%5Cn-%28break-line%29-tp20650540p20654025.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.








Re: Using multiple index searchers.

2008-11-23 Thread Yonik Seeley
If the data is unrelated, separate indexes will lead to the best performance.
Memory usage should be less than or equal to that of one big index.
File descriptor usage can be minimized by either calling optimize
before opening a new IndexSearcher (depends on how often you want to
see updates), lowering the merge factor, or using the compound file
format.

-Yonik





Re: Using multiple index searchers.

2008-11-23 Thread Henrik Axelsson
Thanks for the quick reply, time to get to work on a prototype!
