indexing in lucene 1.9.1

2006-05-22 Thread Harini Raghavan

Hi All,
We have recently upgraded from Lucene 1.4.3 to Lucene 1.9.1.
After the upgrade, we are facing some issues:
1. Indexing seems to be behaving differently. There were more than 300
segment files (.cfs) in the index, and the IndexSearcher is taking forever
to refresh the index. Have there been any changes in 1.9.1 with respect to
the default values for merging segment files / indexing?
2. Our application downloads documents and indexes them every minute as a
continuous process. So, we have a Quartz job that refreshes the
IndexSearcher every 4 hours. Would this have any effect on the indexing
process / add more segments?

Any help would be appreciated.
Thanks,
Harini
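A minimal sketch of the merge knobs in question, assuming the Lucene 1.9
IndexWriter setters; the path and values here are illustrative only:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
writer.setMergeFactor(10);       // how many segments accumulate before a merge
writer.setMaxBufferedDocs(100);  // docs buffered in RAM before a segment is flushed
// optimize() collapses everything into a single segment; it is expensive,
// so run it periodically (e.g. nightly), not after every one-minute batch.
writer.optimize();
writer.close();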




should I avoid create many Fields for a Document?

2006-05-22 Thread Paulo Silveira

Hello

What is the best way to search? Should I separate all the fields, or
create one big field that has all of them? Does this impact
performance dramatically?

With one big field I would not need to create a BooleanQuery...

Last time I did not get any clues; let's see if this time will be better...

thanks!

--
Paulo E. A. Silveira
Caelum Ensino e Soluções em Java
http://www.caelum.com.br/




Re: Need some Advice on Searching

2006-05-22 Thread David Ahlschläger

On 19/05/06, Chris Hostetter [EMAIL PROTECTED] wrote:



i assume when you say this...

: 1. I need to temporarily index sets of documents on the fly, say 100 at a
: time.

you mean that you'll have lots of temporary indexes of a few hundred
documents and then you'll do a bunch of queries and throw the index away.
Even if i'm wrong, most of the rest of my advice will still be useful, but
it's good to clarify.



Correct, I will throw them away!

: My problem is that for these queries I need to know which Documents hit. I
: also need to know which terms hit and, if possible,
: the location of the hits for each term in the hit Document.

knowing which docs match your query is easy.  knowing where in a document a
particular term matches can be done using the TermPositions APIs ... but
it gives you that info as a number of terms, which for HTML content may be
confusing depending on how your analyzer deals with that HTML.



Okay, based on your answer and a little testing just to see what it gives me,
I assume Lucene only stores the term offset (which is analyzer-dependent) and
not the actual offset of the term as retrieved from the plain text stream.

if you have complex boolean queries and you need to know which individual
part of the query matched, that's not really trivial.  you didn't mention
anything about score or relevancy in your email, so i'm guessing all
you care about is boolean "did it match or not" logic .. in that case
using Filters directly (without ever searching) is your friend.  You can
build a Filter for each individual clause, intersect/union the bitsets to
get the final set of matching documents for your whole query, but
inspect the individual bitsets to know the specifics about which ones match
which documents.



Score/relevance is not important. I need the yes/no logic with the "what
caused the match" info. Could you maybe explain the intersect/union of the
bitsets and the interrogating to know what matched?
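A minimal sketch of the intersect/union idea described above, assuming
Lucene 1.9's QueryFilter.bits(IndexReader); field and term names are made up:

import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

IndexReader reader = IndexReader.open("/path/to/index");
// one filter (and therefore one bitset) per clause of the boolean query
BitSet a = new QueryFilter(new TermQuery(new Term("body", "apple"))).bits(reader);
BitSet b = new QueryFilter(new TermQuery(new Term("body", "pear"))).bits(reader);

BitSet both = (BitSet) a.clone();
both.and(b);   // intersection: docs matching both clauses (or() gives the union)
for (int doc = both.nextSetBit(0); doc >= 0; doc = both.nextSetBit(doc + 1)) {
    // doc matched the whole query; a.get(doc) / b.get(doc) on other docs
    // tell you which individual clauses matched them
}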

some people don't like Filters because of how much space they take up for
really large indexes, but if you've only got 100 docs ... there's no
reason not to use them



Nope, I will never have any really large indexes here; 100 to 200 docs at
the most.

-Hoss




Thanks for the reply, much appreciated.


Searching API: QueryParser vs Programatic queries

2006-05-22 Thread Irving, Dave
Hi,

I'm very new to Lucene - so sorry if my question seems pretty dumb.

In the application I'm writing, I've been struggling with myself over
whether I should be building up queries programmatically, or using the
QueryParser.

My searchable fields are driven by meta-data, and I only want to support
a few query types. It seems cleaner to build the queries up
programmatically rather than converting the query to a string and
throwing it through the QueryParser.

However, then we hit the problem that the QueryParser takes care of
analysing the search strings - so to do this we'd have to write some
utility stuff to perform the analysis as we're building up the queries /
terms.

And then I think "might as well just use the QueryParser!".

So here's what I'm wondering (which probably sounds very dumb to
experienced Lucene users):

- Is there maybe some room for more utility classes in Lucene which make
this easier? E.g. when building up a document, we don't have to worry
about running content through an analyser - but unless we use
QueryParser, there doesn't seem to be corresponding behaviour on the
search side.
- So, I'm thinking of some kind of factory / builder or something, where
you can register an Analyser (possibly a per-field wrapper), and then it
is applied per field as the query is being built up programmatically.

Maybe this is just an "extraction" refactoring to take this behaviour
out of QueryParser (which could delegate to it).

The result could be that more users opt for a programmatic build-up of
queries (because it's become easier to do..) rather than falling back on
QueryParser in cases where it may not be the best choice.


Sorry if I rambled too much :o)

Dave






Re: indexing in lucene 1.9.1

2006-05-22 Thread Harini Raghavan

Hi Mike,

Yes, you are right: when we run optimize(), it creates one large
segment file and makes searching faster. But the issue is that our index
keeps growing every minute as we download documents and add them to the
index, so we cannot call optimize() that often. The indexing seemed to be
fine until we migrated to Lucene 1.9.1.


I just compared the IndexWriter classes in the 1.4.3 and 1.9.1 versions and
found that there are some changes with respect to creating new segments.
Any idea if that has impacted indexing? Has anyone else faced a similar
issue with the new version of Lucene?


-Harini

Mike Richmond wrote:


Hello Harini,

When you are finished indexing the documents, are you running the
optimize() method on the IndexWriter before closing it?  This should
reduce the number of segments and make searching faster.  Just a
thought.


--Mike



On 5/22/06, Harini Raghavan [EMAIL PROTECTED] wrote:


Hi All,
We have recently upgraded from lucene 1.4.3 to lucene 1.9.1 version.
After the upgrade, we are facing some issues:
1. Indexing seems to be behaving differently. There were more than 300
segment files(.cfs) in the index and the IndexSearcher is taking forever
to refresh the index. Have there been any changes in 1.9.1 wrt default
values for merging segment files/ indexing?
2. Our application downloads documents and indexes them every min as a
continuous process. So, we have a Quartz job that refreshes the Index
Searcher every 4 hours. Would this have any effect on the indexing
process/ add more no of segments?
Any help would be appreciated.
Thanks,
Harini












Re: Aggregating category hits

2006-05-22 Thread Kapil Chhabra

Hi Jelda,
Is there any way by which I can achieve sorting of search results while
also overriding the collect method of the HitCollector in this case?

I have been using

srch.search(query,sort);

If I replace it with srch.search(query, new HitCollector(){ impl of the 
collect method to collect counts }),

I will have no way to sort my results.

Any pointers?

Regards,
kapilChhabra

Kapil Chhabra wrote:

Thanks a lot Jelda.
I'll try this and get back with the performance comparison chart.

Regards,
kapilChhabra

Ramana Jelda wrote:

Hi Kapil,
As I remember, FieldCache has been in the Lucene API since 1.4.
Ok. Anyhow, here is pseudo code that can help.

// 1. On opening the reader, initialize the documentId-to-categoryId
// relation as below. Depending on your requirement you can also use
// getStringIndex(); I get StringIndex in my project.
String[] docId2CategoryIdRelation = FieldCache.DEFAULT.getStrings(reader,
categoryFieldName);

// 2. cache it
// 3. search as usual with your Query, providing your own HitCollector
// 4. use docId2CategoryIdRelation to retrieve the category id for each
// result document
String yourCategoryId = docId2CategoryIdRelation[resultDocId];
// 5. increment the count for yourCategoryId (do lazy initialization of
// the categoryCounts holder)
// 6. you are done.. :)

All the best,
Jelda
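A fuller Java rendering of that pseudo code, assuming the Lucene 1.9
FieldCache and HitCollector APIs; the field name is made up and `query` is
assumed to be built elsewhere:

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;

IndexReader reader = IndexReader.open("/path/to/index");
// steps 1+2: load and cache the docId -> categoryId relation once per reader
final String[] docId2CategoryId = FieldCache.DEFAULT.getStrings(reader, "category");
final Map categoryCounts = new HashMap();

// steps 3-5: search with a custom HitCollector and count hits per category
IndexSearcher searcher = new IndexSearcher(reader);
searcher.search(query, new HitCollector() {
    public void collect(int doc, float score) {
        String cat = docId2CategoryId[doc];
        Integer n = (Integer) categoryCounts.get(cat);
        categoryCounts.put(cat, new Integer(n == null ? 1 : n.intValue() + 1));
    }
});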




 

-Original Message-
From: Kapil Chhabra [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 16, 2006 11:50 AM

To: java-user@lucene.apache.org
Subject: Re: Aggregating category hits

Hi Jelda,
I have not yet migrated to Lucene 1.9 and I guess FieldCache has 
been introduced in this release.

Can you please give me a pointer to your strategy of FieldCache?

Thanks & Regards,
Kapil Chhabra


Ramana Jelda wrote:
   
But this BitSet strategy is more memory-consuming, mainly if you have
documents in the millions and categories in the thousands.
So I preferred the FieldCache strategy in my project.

Jelda

   

-Original Message-
From: Kapil Chhabra [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 16, 2006 7:38 AM
To: java-user@lucene.apache.org
Subject: Re: Aggregating category hits

Even I am doing the same in my application.
Once a day, all the filters [for different categories] are
initialized. Each time a query is fired, the query BitSet is ANDed
with the BitSet of each filter. The cardinality obtained is the
desired output.
@Erik: I would like to know more about the implementation with DocSet
in place of BitSet.

Regards,
kapilChhabra


Erik Hatcher wrote:

On May 15, 2006, at 5:07 PM, Marvin Humphrey wrote:

If you needed to know not just the total number of hits, but the
number of hits in each category, how would you handle that?

For instance, a search for "egg" would have to produce the 20 most
relevant documents for "egg", but also a list like this:

Holiday & Seasonal / Easter 75
Books / Cooking 52
Miscellaneous 44
Kitchen Collectibles 43
Hobbies / Crafts 17
[...]

It seems to me that you'd have to retrieve each hit's stored fields
and examine the contents of a category field. That's a lot of
overhead. Is there another way?

My first implementation of faceted browsing uses BitSets that get
pre-loaded for each category value (each unique term in a category
field, for example). And to intersect that with an actual Query, it
gets run through the QueryFilter to get its BitSet and then AND'd
together with each of the category BitSets. Sounds like a lot, but
for my applications there are not tons of these BitSets and the
performance has been outstanding. Now that I'm doing more with Solr,
I'm beginning to leverage its amazing caching infrastructure and
replacing BitSets with DocSets.

Erik








RE: Aggregating category hits

2006-05-22 Thread Ramana Jelda
I think if you dig a little into what Lucene does when asked to sort, then
you will get the information you are looking for.

Here is some help.
Lucene uses TopFieldDocCollector for sorting purposes (look at the
implementation of IndexSearcher).
So your HitCollector can extend TopFieldDocCollector, so that you do
whatever work you want and also let TopFieldDocCollector do its work
(sorting). I think I don't need to explain more.

Then you are done.

Have fun,
Jelda
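A minimal sketch of that subclassing, assuming the 1.9-era
TopFieldDocCollector constructor (reader, sort, numHits); the counting body
is left as a placeholder:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopFieldDocCollector;

class CountingCollector extends TopFieldDocCollector {
    CountingCollector(IndexReader reader, Sort sort, int numHits) throws IOException {
        super(reader, sort, numHits);
    }
    public void collect(int doc, float score) {
        super.collect(doc, score); // let the parent keep the sorted top hits
        // ... increment your per-category count for doc here ...
    }
}

// usage: searcher.search(query, new CountingCollector(reader, sort, 100));
// the sorted results then come back via topDocs() on the collector.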





Re: OutOfMemory and IOException Access Denied errors

2006-05-22 Thread Dan Armbrust

Your out of memory error is likely due to a mysql bug outlined here:

  http://bugs.mysql.com/bug.php?id=7698

Thanks for the article. My query executed in no time without any errors !!!



The MySQL drivers are horrible at dealing with large result sets - that
article gives you the workaround to tell it to bring the results back as
they are needed (like it should in the first place), but I have found
that it isn't reliable - it tends to drop out at random points during
the query, so you will get a different number of rows each time you
rerun the query.  In MySQL, the only reliable way I have found to get
all of the results from a large table is to use the limit keyword in
the query, and only ask it for X rows at a time (I usually use 10,000,
but use whatever works best with your system), and then keep
rerunning the query, incrementing the start position of the limit
keyword.  This issue also varies a lot from version to version of the
driver - some versions have been completely broken, and others are only
slightly broken.  Too bad we can't get Lucene-quality code everywhere :)
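A minimal sketch of that limit-based paging; the connection URL, table,
columns, and page size are made up:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
int pageSize = 10000;
for (int offset = 0; ; offset += pageSize) {
    Statement st = conn.createStatement();
    ResultSet rs = st.executeQuery(
        "SELECT id, body FROM docs LIMIT " + offset + ", " + pageSize);
    int rows = 0;
    while (rs.next()) {
        rows++;
        // ... hand the row off to the Lucene indexer here ...
    }
    rs.close();
    st.close();
    if (rows < pageSize) break; // a short page means we've read the last one
}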




 Exception in thread "main" java.io.IOException: Access is denied

To me, that really seems like you have an issue with the location that 
you are writing the index to.  I would make sure you have full write 
permissions to the location, and make sure there aren't some old / 
invalid files sitting in there.


Dan


--

Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/




Re: What is more efficient?

2006-05-22 Thread Otis Gospodnetic
The usual answer: it depends :)
Over on http://www.simpy.com I have similar functionality (groups), and I have 
them as separate indices.

If you want to be able to reindex individual groups separately, you'll want
them in separate indices.
If groups in aggregate will get very large, perhaps keeping them separate is 
more scalable.
If you want to distribute groups over multiple servers, keep them separate.
If they are heterogeneous (different fields), this may be another reason to 
keep them separate.
etc.

Of course, if some of the above don't hold or are not a requirement, a single
index may be the way to go for you.

Otis
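If you do keep one index per group, a MultiSearcher still lets you search
them all at once; a minimal sketch with made-up paths:

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;

Searchable[] perGroup = new Searchable[] {
    new IndexSearcher("/indexes/group-a"),
    new IndexSearcher("/indexes/group-b"),
};
MultiSearcher all = new MultiSearcher(perGroup);
// Hits hits = all.search(query);  // one search across every group's index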


- Original Message 
From: Dan Wiggin [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Monday, May 22, 2006 6:03:25 AM
Subject: What is more efficient?

If I work with groups, what's the best option? Use a separate Lucene
index for every group, or is a single index better?
For example:
I'm working with groups of people, and the add or delete action is at
group level, but the search is across all groups.
What do you think is the best implementation in Lucene? Is there any
limitation on the number of indexes in MultiSearcher?







Re: should I avoid create many Fields for a Document?

2006-05-22 Thread Otis Gospodnetic
Uh, another "it depends" answer.
Some people prefer one aggregate field, others do not.
If you care about field normalization (shorter fields with matches in them
scoring higher than longer fields with an equal number of matches), I'd
say keep them separate.
If you want to boost individual fields differently at search time, keep them
separate.

Over at http://www.simpy.com/ I tend to keep fields separate.  Some of the
fields that the indices at Simpy have are: title, tags, url, etc.  When a user
performs a search I can use MultiFieldQueryParser, and soon I'll be able to
boost these fields differently (e.g. crowd-supplied tags may get a boost over
web page author-supplied titles).

Also, I probably don't care about the URL length, so I don't need normalization 
there.  That saves some RAM and doesn't hurt scoring.

Otis
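A minimal sketch of that search-time boosting with separate fields; the
field names, term, and boost value are made up:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

BooleanQuery q = new BooleanQuery();
Query tags = new TermQuery(new Term("tags", "lucene"));
tags.setBoost(2.0f);  // crowd-supplied tags count more than titles
Query title = new TermQuery(new Term("title", "lucene"));
q.add(tags, BooleanClause.Occur.SHOULD);
q.add(title, BooleanClause.Occur.SHOULD);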

- Original Message 
From: Paulo Silveira [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Monday, May 22, 2006 2:08:24 AM
Subject: should I avoid create many Fields for a Document?

Hello

What is the best way to search? Should I separate all the fields, or
create a big one that have all fields? Does this impact the
performance dramatically?

Creating a big field I would not need to create a BooleanQuery...

last time I did not get any clues, lets see if this time will be better...

thanks!

-- 
Paulo E. A. Silveira
Caelum Ensino e Soluções em Java
http://www.caelum.com.br/









Performance ...

2006-05-22 Thread Dragon Fly

Hi,

The search results of my Lucene application are always sorted 
alphabetically.

Therefore, score and relevance are not needed.  With that said, is there
anything that I can disable to:

(a) Improve the search performance
(b) Reduce the size of the index
(c) Shorten the indexing time

Thank you.







Re: Searching API: QueryParser vs Programatic queries

2006-05-22 Thread Otis Gospodnetic
Dave,
You said you are new to Lucene and you didn't mention this class explicitly, so 
you may not be aware of it yet: PerFieldAnalyzerWrapper.
It sounds like this may be what you are after.

Otis
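A minimal sketch of PerFieldAnalyzerWrapper in use; the field choices here
are made up:

import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;

PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
analyzer.addAnalyzer("sku", new WhitespaceAnalyzer()); // don't stem/stop product codes
QueryParser parser = new QueryParser("body", analyzer);
// Query q = parser.parse("sku:AB-123 some body text");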

- Original Message 
From: Irving, Dave [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Monday, May 22, 2006 5:15:23 AM
Subject: Searching API: QueryParser vs Programatic queries

Hi,

Im very new to Lucene - so sorry if my question seems pretty dumb.

In the application Im writing, I've been struggling with myself over
whether I should be building up queries programatically, or using the
Query Parser.

My searchable fields are driven by meta-data, and I only want to support
a few query types. It seems cleaner to build the queries up
programatically rather than converting the query to a string and
throwing it through the QueryParser.

However, then we hit the problem that the QueryParser takes care of
Analysing the search strings - so to do this we'd have to write some
utility stuff to perform the analysis as we're building up the queries /
terms.

And then I think might as well just use the QueryParser!.

So here's what Im wondering (which probably sounds very dumb to
experienced Lucene'rs):

- Is there maybe some room for more utility classes in Lucene which make
this easier? E.g: When building up a document, we don't have to worry
about running content through an analyser - but unless we use
QueryParser, there doesn't seem to be corresponding behaviour on the
search side.
- So, Im thinking some kind of factory / builder or something, where you
can register an Analyser (possibly a per field wrapper), and then it is
applied per field as the query is being built up programatically.

Maybe this is just an extraction refactoring to take this behaviour
out of QueryParser (which could delegate to it).

The result could be that more users opt for a programatic build up of
queries (because it's become easier to do..) rather than falling back on
QueryParser in cases where it may not be the best choice.


Sorry if I rambled too much :o)

Dave





RE: Searching API: QueryParser vs Programatic queries

2006-05-22 Thread Irving, Dave
Hi Otis,

Thanks for your reply.
Yeah, I'm aware of PerFieldAnalyzerWrapper - and I think it could help in
the solution - but not on its own.
Here's what I mean:

When we build a document Field, we supply either a String or a Reader.
The framework takes care of running the contents through an Analyser
(per field or otherwise) when we add the document to an index.

However, on the searching side of things, we don't have similar
functionality unless we use the QueryParser.
If we build queries programmatically, then we have to make sure (by hand)
that we run search terms through the appropriate analyser whilst
constructing the query.

It's in this area that I wonder whether additional utility classes could
make programmatic construction of queries somewhat easier.

Dave
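A minimal sketch of the kind of utility meant here - run the user's text
through the analyzer by hand and build the query from the resulting tokens;
the field name, OR semantics, and `userInput` are made up:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

Analyzer analyzer = new StandardAnalyzer();
TokenStream tokens = analyzer.tokenStream("title", new StringReader(userInput));
BooleanQuery query = new BooleanQuery();
for (Token t = tokens.next(); t != null; t = tokens.next()) {
    // each analyzed token becomes an optional (OR) term clause
    query.add(new TermQuery(new Term("title", t.termText())),
              BooleanClause.Occur.SHOULD);
}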

 -Original Message-
 From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
 Sent: 22 May 2006 15:59
 To: java-user@lucene.apache.org
 Subject: Re: Searching API: QueryParser vs Programatic queries
 
 Dave,
 You said you are new to Lucene and you didn't mention this 
 class explicitly, so you may not be aware of it yet: 
 PerFieldAnalyzerWrapper.
 It sounds like this may be what you are after.
 
 Otis
 
 - Original Message 
 From: Irving, Dave [EMAIL PROTECTED]
 To: java-user@lucene.apache.org
 Sent: Monday, May 22, 2006 5:15:23 AM
 Subject: Searching API: QueryParser vs Programatic queries
 
 Hi,
 
 Im very new to Lucene - so sorry if my question seems pretty dumb.
 
 In the application Im writing, I've been struggling with 
 myself over whether I should be building up queries 
 programatically, or using the Query Parser.
 
 My searchable fields are driven by meta-data, and I only want 
 to support a few query types. It seems cleaner to build the 
 queries up programatically rather than converting the query 
 to a string and throwing it through the QueryParser.
 
 However, then we hit the problem that the QueryParser takes 
 care of Analysing the search strings - so to do this we'd 
 have to write some utility stuff to perform the analysis as 
 we're building up the queries / terms.
 
 And then I think might as well just use the QueryParser!.
 
 So here's what Im wondering (which probably sounds very dumb 
 to experienced Lucene'rs):
 
 - Is there maybe some room for more utility classes in Lucene 
 which make this easier? E.g: When building up a document, we 
 don't have to worry about running content through an analyser 
 - but unless we use QueryParser, there doesn't seem to be 
 corresponding behaviour on the search side.
 - So, Im thinking some kind of factory / builder or 
 something, where you can register an Analyser (possibly a per 
 field wrapper), and then it is applied per field as the query 
 is being built up programatically.
 
 Maybe this is just an extraction refactoring to take this 
 behaviour out of QueryParser (which could delegate to it).
 
 The result could be that more users opt for a programatic 
 build up of queries (because it's become easier to do..) 
 rather than falling back on QueryParser in cases where it may 
 not be the best choice.
 
 
 Sorry if I rambled too much :o)
 
 Dave
 
 
 
 




Re: Searching API: QueryParser vs Programatic queries

2006-05-22 Thread Raghavendra Prabhu

If I understand correctly, is it that you don't want to make use of the
query parser?

You need to parse a query string without using the query parser, construct
the query, and still have an analyzer applied to the resulting search.


On 5/22/06, Irving, Dave [EMAIL PROTECTED] wrote:


Hi Otis,

Thanks for your reply.
Yeah, Im aware of PerFieldAnalyserWrapper - and I think it could help in
the solution - but not on its own.
Here's what I mean:

When we build a document Field, we suppy either a String or a Reader.
The framework takes care of running the contents through an Analyser
(per field or otherwise) when we add the document to an index.

However, on the searching side of things, we don't have similar
functionality unless we use the QueryParser.
If we build queries programatically, then we have to make sure (by hand)
that we run search terms through the appropriate analyser whilst
constructing the query.

Its in this area that I wonder whether additional utility classes could
make programatic construction of queries somewhat easier.

Dave

 -Original Message-
 From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
 Sent: 22 May 2006 15:59
 To: java-user@lucene.apache.org
 Subject: Re: Searching API: QueryParser vs Programatic queries

 Dave,
 You said you are new to Lucene and you didn't mention this
 class explicitly, so you may not be aware of it yet:
 PerFieldAnalyzerWrapper.
 It sounds like this may be what you are after.

 Otis

 - Original Message 
 From: Irving, Dave [EMAIL PROTECTED]
 To: java-user@lucene.apache.org
 Sent: Monday, May 22, 2006 5:15:23 AM
 Subject: Searching API: QueryParser vs Programatic queries

 Hi,

 Im very new to Lucene - so sorry if my question seems pretty dumb.

 In the application Im writing, I've been struggling with
 myself over whether I should be building up queries
 programatically, or using the Query Parser.

 My searchable fields are driven by meta-data, and I only want
 to support a few query types. It seems cleaner to build the
 queries up programatically rather than converting the query
 to a string and throwing it through the QueryParser.

 However, then we hit the problem that the QueryParser takes
 care of Analysing the search strings - so to do this we'd
 have to write some utility stuff to perform the analysis as
 we're building up the queries / terms.

 And then I think might as well just use the QueryParser!.

 So here's what Im wondering (which probably sounds very dumb
 to experienced Lucene'rs):

 - Is there maybe some room for more utility classes in Lucene
 which make this easier? E.g: When building up a document, we
 don't have to worry about running content through an analyser
 - but unless we use QueryParser, there doesn't seem to be
 corresponding behaviour on the search side.
 - So, Im thinking some kind of factory / builder or
 something, where you can register an Analyser (possibly a per
 field wrapper), and then it is applied per field as the query
 is being built up programatically.

 Maybe this is just an extraction refactoring to take this
 behaviour out of QueryParser (which could delegate to it).

 The result could be that more users opt for a programatic
 build up of queries (because it's become easier to do..)
 rather than falling back on QueryParser in cases where it may
 not be the best choice.


 Sorry if I rambled too much :o)

 Dave









RE: Searching API: QueryParser vs Programatic queries

2006-05-22 Thread Irving, Dave
 You need to parse a query string without using query parser and 
 construct the query and still want an analyzer applied on the outcome
search

Not quite. The user is presented with a list of (UI) fields, and each
field already knows whether it's an OR, AND, etc.
So, there is no query string as such.
For this reason, it seems to make more sense to build the query up
programmatically - as my field meta-data can drive this.
However, if I do that, I have to do the work of extracting terms by
running them through an analyser for each field manually.
This is also done by the query parser.

So, right now, if I'm being lazy, the easiest thing to do is construct a
query string based on the meta-data, and then run that through the query
parser. This just doesn't -- feel right -- from a design perspective
though :o)

The logic I could see being extracted out would be some of the stuff in
QueryParser#getFieldQuery(String field, String queryText).


 -Original Message-
 From: Raghavendra Prabhu [mailto:[EMAIL PROTECTED] 
 Sent: 22 May 2006 16:17
 To: java-user@lucene.apache.org
 Subject: Re: Searching API: QueryParser vs Programatic queries
 
 If i understand correctly, is it that you dont want to make 
 use of query parse?
 
 You need to parse a query string without using query parser 
 and construct the query and still want an analyzer applied on 
 the outcome search.
 
 
 On 5/22/0 p6, Irving, Dave [EMAIL PROTECTED] wrote:
 
  Hi Otis,
 
  Thanks for your reply.
  Yeah, Im aware of PerFieldAnalyserWrapper - and I think it 
 could help 
  in the solution - but not on its own.
  Here's what I mean:
 
  When we build a document Field, we suppy either a String or 
 a Reader.
  The framework takes care of running the contents through an 
 Analyser 
  (per field or otherwise) when we add the document to an index.
 
  However, on the searching side of things, we don't have similar 
  functionality unless we use the QueryParser.
  If we build queries programatically, then we have to make sure (by 
  hand) that we run search terms through the appropriate 
 analyser whilst 
  constructing the query.
 
  Its in this area that I wonder whether additional utility classes 
  could make programatic construction of queries somewhat easier.
 
  Dave
 
   -Original Message-
   From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
   Sent: 22 May 2006 15:59
   To: java-user@lucene.apache.org
   Subject: Re: Searching API: QueryParser vs Programatic queries
  
   Dave,
   You said you are new to Lucene and you didn't mention this class 
   explicitly, so you may not be aware of it yet:
   PerFieldAnalyzerWrapper.
   It sounds like this may be what you are after.
  
   Otis
  
   - Original Message 
   From: Irving, Dave [EMAIL PROTECTED]
   To: java-user@lucene.apache.org
   Sent: Monday, May 22, 2006 5:15:23 AM
   Subject: Searching API: QueryParser vs Programatic queries
  
   Hi,
  
   Im very new to Lucene - so sorry if my question seems pretty dumb.
  
   In the application Im writing, I've been struggling with myself 
   over whether I should be building up queries programatically, or 
   using the Query Parser.
  
   My searchable fields are driven by meta-data, and I only want to 
   support a few query types. It seems cleaner to build 
 the queries 
   up programatically rather than converting the query to a 
 string and 
   throwing it through the QueryParser.
  
   However, then we hit the problem that the QueryParser 
 takes care of 
   Analysing the search strings - so to do this we'd have to 
 write some 
   utility stuff to perform the analysis as we're building up the 
   queries / terms.
  
   And then I think might as well just use the QueryParser!.
  
   So here's what Im wondering (which probably sounds very dumb to 
   experienced Lucene'rs):
  
   - Is there maybe some room for more utility classes in 
 Lucene which 
   make this easier? E.g: When building up a document, we 
 don't have to 
   worry about running content through an analyser
   - but unless we use QueryParser, there doesn't seem to be 
   corresponding behaviour on the search side.
   - So, Im thinking some kind of factory / builder or 
 something, where 
   you can register an Analyser (possibly a per field wrapper), and 
   then it is applied per field as the query is being built up 
   programatically.
  
   Maybe this is just an extraction refactoring to take this 
   behaviour out of QueryParser (which could delegate to it).
  
   The result could be that more users opt for a programatic 
 build up 
   of queries (because it's become easier to do..) rather 
 than falling 
   back on QueryParser in cases where it may not be the best choice.
  
  
   Sorry if I rambled too much :o)
  
   Dave
  
  

Re: Searching API: QueryParser vs Programatic queries

2006-05-22 Thread Marvin Humphrey


On May 22, 2006, at 8:44 AM, Irving, Dave wrote:

So, right now, if I'm being lazy, the easiest thing to do is construct a
query string based on the meta-data, and then run that through the query
parser. This just doesn't -- feel right -- from a design perspective
though :o)


How about building a larger BooleanQuery by combining the output of  
the QueryParser with custom-built Query objects based on your metadata?


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
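A minimal sketch of that combination; the fields and `userText` are made up:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

Query userPart = new QueryParser("body", new StandardAnalyzer()).parse(userText);
BooleanQuery combined = new BooleanQuery();
combined.add(userPart, BooleanClause.Occur.MUST);          // parsed user text
combined.add(new TermQuery(new Term("docType", "report")), // metadata clause
             BooleanClause.Occur.MUST);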





Re: Searching API: QueryParser vs Programatic queries

2006-05-22 Thread J.J. Larrea
At 10:15 AM +0100 5/22/06, Irving, Dave wrote:
- Is there maybe some room for more utility classes in Lucene which make
this easier? E.g: When building up a document, we don't have to worry
about running content through an analyser - but unless we use
QueryParser, there doesn't seem to be corresponding behaviour on the
search side.
- So, Im thinking some kind of factory / builder or something, where you
can register an Analyser (possibly a per field wrapper), and then it is
applied per field as the query is being built up programatically.

Maybe this is just an extraction refactoring to take this behaviour
out of QueryParser (which could delegate to it).

The result could be that more users opt for a programatic build up of
queries (because it's become easier to do..) rather than falling back on
QueryParser in cases where it may not be the best choice.

I concur with your thoughts that there is room for such utility classes, and 
that those would increase the use of programmatic queries.  I say this as a 
developer who also lazed out and opted to simply construct a string and let 
the QP do all the work (but who then had to subclass and finally 
copy-and-modify QP to make it conform to requirements).

The underlying issue may be that there are two quite different concerns bundled 
into QueryParser:
 - Parsing a string into a set of discrete query requests
 - Constructing Query objects to meet those requests

If you take a look at http://issues.apache.org/jira/browse/LUCENE-344 you'll 
see that someone else (Matthew Denner) also had this belief, and went so far as 
to implement a QueryFactory interface and a couple of implementing classes.  
One has the construction logic now found in QueryParser.  Then there is a 
decorator class which adds the functionality of MultiFieldQueryParser and 
another which lower-cases terms.

Perhaps something along those lines should be considered for the next
break in API continuity, e.g. Lucene 2.0.  It seems much cleaner than
subclassing QP when all that is needed is a variant in Query construction
logic, and it also provides a higher-level interface for constructing Query
objects (especially TermQuery) like you were proposing.  Unfortunately the
actual LUCENE-344 patch appears out of date with changes in QueryParser,
MultiFieldQueryParser, etc.  But perhaps just the QueryFactory part would
be a good starting point for what you want to do.

Anyway, just a thought.

- J.J.




Searching missing documents after doing an addIndexes

2006-05-22 Thread Jim Wilson
I am using 1.9.1(java).

I am trying to add documents to an existing index, where each document may
or may not already be present. I use a RAMDirectory to build a temp index
that is later merged. Before adding a new document, I search the existing
index (using a unique key) to see if it is there. If not, I add it.

In reading the documentation, I understood that I can search while an
index is being updated. It was not clear if that search would include
recently added items. I had assumed it would. However, it appears to not
find them unless I close and re-open the searcher...

The net result is that I get duplicate documents as the search does not
find the document if I recently added it.

Note that the duplicate CANNOT come from the RAMDirectory (i.e.,
getNewDocuments), as it is guaranteed to have no duplicates in it. The
search is failing to find documents that have been recently added via an
addIndexes call.

Can anyone clarify this behavior, i.e., why does search not find
recently added documents unless I close and re-open it?

I have code that does roughly the following:

RAMDirectory added = new RAMDirectory(...)

IndexWriter writer = new IndexWriter(the main index);
IndexWriter ramWriter = new IndexWriter(added)
IndexSearcher searcher = new IndexSearcher(...)

for (Document d : getNewDocuments()) {
... build a query
  if (searcher.search(query).length() == 0) {
 // doesn't exist, so we can add it
  }
  if (timeToMerge) {
writer.addIndexes(new Directory[] {added});
added.close();
added = new RAMDirectory();
ramWriter.close();
ramWriter = new IndexWriter(added, new StandardAnalyzer(), true);
// for some reason the searcher won't see the new indices
// unless the following two lines are here
searcher.close();
searcher = new IndexSearcher(current.getDirectory());
  }
}

Jim Wilson
Colorado Springs, CO 
719-266-4431 (Home)
719-661-6768 (Cell)
[EMAIL PROTECTED]
IM:jwilsonsprings
Registered Linux User # 302849





Re: does anybody have the experience to do some pooling upon lucene?

2006-05-22 Thread Erik Hatcher


On May 21, 2006, at 10:56 PM, Zhenjian YU wrote:
I didn't dig into the source code of lucene deeply enough, but I noticed
that the IndexSearcher uses an IndexReader, while the cost of initializing
an IndexReader is a bit high.


The key is the IndexReader.


My application is a webapp, so I think it may be good if I cache some
instances of IndexSearcher to provide service for my webapp. I haven't done
any performance testing yet. Maybe I'll test it later to see the difference
between caching and not caching.


It is best to keep only a single IndexSearcher/IndexReader  
combination around.  There is no need to have more than one instance,  
and in fact it is a waste of resources to do so.


Erik





Re: Searching missing documents after doing an addIndexes

2006-05-22 Thread Chris Hostetter

: Can anyone clarify this behavior, i.e., why does search not find
: recently added documents unless I close and re-open it?

this is by design .. an IndexReader (and hence an IndexSearcher) maintains a
consistent view of the index at the moment it was opened, by hanging on
to the open filehandles and segment information.  no changes made to the
index after it was opened ever show up in that instance (but they will
show up in other instances you open after those changes).

the two main reasons for this behavior that i know of are:

  1) it gives you a consistent view for as long as you want it -- you can
choose when you get to see updates.
  2) it allows the IndexReader to maintain caches of information (like the
FieldCache and CachingWrapperFilter for example .. i'm sure there are
other pieces of information that get cached, but i can't think of
specifics off the top of my head)



-Hoss
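A minimal sketch of the usual consequence - reopen only when the index
version has moved on; `reader`, `searcher`, and `directory` are assumed to
exist already, and the version accessors are the 1.9-era IndexReader API:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

if (reader.getVersion() != IndexReader.getCurrentVersion(directory)) {
    // the index changed after this reader was opened; open a fresh view
    searcher.close();
    reader = IndexReader.open(directory);
    searcher = new IndexSearcher(reader);
}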





incremental updates

2006-05-22 Thread Van Nguyen
I'm pretty new to lucene and was wondering if there are any resources on
how to do incremental updates in lucene.
 
Thanks!
 
Van Nguyen
Wynne Systems, Inc.
19800 MacArthur Blvd., Suite 900
Irvine, CA  92612-2421
949.224.6300 ext 223
949.225.6540 (fax)
866.901.9284 (toll-free)
www.wynnesystems.com


Re: Searching API: QueryParser vs Programatic queries

2006-05-22 Thread Erick Erickson

There's a long screed that I'm leaving at the bottom because I put effort
into it and I like to rant. But here's, perhaps, an approach.

Maybe I'm mis-interpreting what you're trying to do. I'm assuming that you
have several search fields (I'm not exactly sure what "driven by meta-data"
means in this case, but what the heck).

It seems to me that you can always do something like:

BooleanQuery bq = new BooleanQuery();

QueryParser qp1 = new QueryParser("field1", analyzer);
Query q1 = qp1.parse("your query fragment here");
bq.add(q1, BooleanClause.Occur.MUST);

QueryParser qp2 = new QueryParser("field2", analyzer);
Query q2 = qp2.parse("your query fragment here");
bq.add(q2, BooleanClause.Occur.MUST);

.
.
.

and eventually submit the query you've built up in bq.

You can arbitrarily build these up. In other words, your q1, q2, q3, etc can
be the same field for the first N clauses, and another field for the second
M clauses. Or you could build up the query fragment to consist of all the
terms for a particular field.


As I said, I have no clue whether this is possible in your application. If
not, see below <g>.

Screed starts here***

I've had similar arguments with myself. But I'm getting less forgiving with
myself when I reinvent wheels, and firmly slap my own wrists.

Pretend you are talking to your boss/technical lead/coworker. I'm assuming
you actually want to get a product out the door. Your manager asks: How can
you justify spending the time to create, debug and maintain code that has
already been written for you for the sake of cleanliness at the expense of
the other things you could be contributing instead?

There are some very good answers to this, but most of the ones I've tried to
use involve a lot of hand-waving on the order of "If we ever extend the
application.." or "It would be cleaner."  At which point the
conversation *should* go something like this:

Manager: let me get this straight. You can spend 10 minutes right now
implementing the pass-to-the-query-parser solution and an unknown amount
(but probably way more than your initial estimate)
implementing/debugging/testing a 'cleaner' solution. Is that right?

You: Yes but.

Manager: Furthermore, the functionality you want to add is *already* built
into the 'use-the-parser' solution, right?

You: Yes, but

Manager: And the amount of time you'll spend debugging this, not to mention
the amount of *other* people's time you'll spend identifying any bugs and
figuring out that it's in this new code, will only increase the longer any
bugs go undetected, right?

You: Yes, but...

Manager: Do it the use-the-parser way. We can always implement it the other
way if we have time. It doesn't cost us *any* time to implement the 'use the
query parser' way, whereas your way has a measurable cost now, an unknown
cost in the future and no measurable gain. Add a big comment if you want
about how I forced you to do this ugly thing..

Of course there are good reasons to take the time now *if* it will save
time/effort in the future. But this sure doesn't seem like one of those
situations to me. Not to mention that it'll be MUCH simpler for the next
person looking at it to understand. Here are several things off the top of
my head that'll become maintenance issues for a custom solution, that are
*all* taken care of by the use-the-parser solution

1. How are you going to handle stop words?
2. Will you ever want to change analyzers to, say, keep URLs together? Or
maybe break them up?
3. What happens if you want to use the RegularExpressionAnalyzer to, say,
remove all punctuation or other user-entered junk?
4. Will you remember all the ins-and-outs of this code in even 1 month? What
about the next poor joker who has to figure it out?

None of this is to say that your suggestion that there be utility classes
that allow this sort of thing doesn't have merit. But I have to wonder
whether it would be effort well spent for you at this time, in this project
<g>.

As you can see, this is one of my hot-button issues <g>. If you want to
really see me go off the deep end, just *mention* "premature
optimizations"...

Best
Erick


Checking for duplicates inside index

2006-05-22 Thread Hannes Carl Meyer

Hi All,

I'm indexing ~1 documents per day, but since I'm getting a lot of
real duplicates (100% the same document content) I want to check the
content before indexing...


My idea is to create a checksum of the document's content and store it
with the document inside the index; before indexing a new document I will
compare the new document's checksum with the ones in the index.

Is that a good idea? Does someone have experience with that method? Any
tools available?


Thank you and kind regards

Hannes




RE: Searching API: QueryParser vs Programatic queries

2006-05-22 Thread Chris Hostetter

: Not quite. The user is presented with a list of (UI) fields, and each
: field already knows whether its an OR AND etc.
: So, there is no query String as such.
: For this reason, it seems to make more sense to build the query up
: programmatically - as my field meta data can drive this.
: However, if I do that, I have to do the work of extracting terms by
: running through an analyser for each field manually.
: This is also done by the query parser.

typically, when building queries up from form data, each piece of data falls
into one of 2 categories:

  1) data which doesn't need analyzing because the field it's going to
 query on wasn't tokenized (ie: a date field, or a numeric field, or a
 boolean field)
  2) data which is typed by the user in a text box, and not only needs
 analyzing, but may also need some parsing (ie: to support quoted
 phrases or +mandatory and -prohibited terms)

in the first case, build your query clauses programmatically.

in the second case, make a QueryParser on the fly with the defaultField set
to whatever makes sense and let it handle parsing the user's text (applying
the correct analyzer using PerFieldAnalyzerWrapper).  if there are
special characters you want it to ignore, then escape them first.

i discussed this a little more recently...

http://www.nabble.com/RE%3A+Building+queries-t1635907.html#a4436416
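A minimal sketch of the two cases; field names and `userTypedText` are made up:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

BooleanQuery q = new BooleanQuery();
// case 1: untokenized field -- no analysis needed, build the clause directly
q.add(new TermQuery(new Term("date", "20060522")), BooleanClause.Occur.MUST);
// case 2: free text -- let a throwaway QueryParser parse and analyze it
QueryParser p = new QueryParser("body", new StandardAnalyzer());
q.add(p.parse(userTypedText), BooleanClause.Occur.MUST);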



-Hoss





Re: How are results merged from a multisearcher?

2006-05-22 Thread Doug Cutting

Tom Emerson wrote:

Thanks for the clarification. What then is the difference between a
MultiSearcher and using an IndexSearcher on a MultiReader?


The results should be identical.  A MultiSearcher permits use of 
ParallelMultiSearcher and RemoteSearchable, for parallel and/or 
distributed operation.  But, for single-threaded searching, a 
MultiReader is probably fastest.


Doug




Re: Changing the scoring (newest doc date first)

2006-05-22 Thread Doug Cutting

Marcus Falck wrote:
There is however one LARGE problem that we have run into. All search results
should be displayed sorted with the newest document at top. We tried to
accomplish this using Lucene's sort capabilities but quickly ran into large
performance bottlenecks. So I figured, since the default sort is by
relevance, I would like to change the relevance so that we don't even need
to sort the documents. I guess a lot of people on this mailing list can give
me valuable hints about how to accomplish this!
(Since I know about the ability to sort by index id (which I haven't tried),
I can also add that I can't guarantee that all documents will be added in
correct date order (remember the several systems; the future plan is to buy
content from different actors on the market and index it).)


A HitCollector should help.  Matching documents are passed to a 
HitCollector in the order they were added to the index.  So if newer 
documents were added to your index later, then the newest N documents 
are simply the last N documents passed to the HitCollector.


Could that work?
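A minimal sketch of such a collector - a ring buffer of the last N matching
doc ids (N is made up; `searcher` and `query` are assumed to exist):

import org.apache.lucene.search.HitCollector;

final int n = 20;
final int[] newest = new int[n];
final int[] count = { 0 };
searcher.search(query, new HitCollector() {
    public void collect(int doc, float score) {
        newest[count[0]++ % n] = doc; // docs arrive in index (insertion) order
    }
});
// after the search, the newest matches are the last n entries written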

Cheers,

Doug




RE: Checking for duplicates inside index

2006-05-22 Thread Omar Didi
you have two choices that I can think of:
1- before adding a document, check that it doesn't already exist in the
index. You can do this by querying on a unique field if you have one.
2- you can index all your documents, and once the indexing is done you can
dedupe. (Lucene has built-in methods that can help with this.)


if your index doesn't have a unique key, you need to add one like the one
you suggested.
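A minimal sketch of the checksum check; `toHex` is a hypothetical
hex-encoding helper (not a Lucene API), `content` is the document text, and
the field name is made up:

import java.security.MessageDigest;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

String sum = toHex(MessageDigest.getInstance("MD5").digest(content.getBytes("UTF-8")));
IndexReader reader = IndexReader.open("/path/to/index");
// docFreq > 0 means a document with this exact content is already indexed
if (reader.docFreq(new Term("md5", sum)) == 0) {
    // not a duplicate: index it, storing the sum in an untokenized "md5" field
}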

-Original Message-
From: karl wettin [mailto:[EMAIL PROTECTED]
Sent: Monday, May 22, 2006 6:05 PM
To: java-user@lucene.apache.org
Subject: Re: Checking for duplicates inside index


On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote:
 
 I'm indexing ~1 documents per day but since I'm getting a lot of 
 real duplicates (100% the same document content) I want to check the 
 content before indexing...
 
 My idea is to create a checksum of the documents content and store it 
 within document inside the index, before indexing a new document I
 will compare the new documents checksum with the ones in the index.
 
 Is that a good idea? does someone have experiences with that method?
 any tools available? 

That could work.

You will need a big sum though. MD5?






RE: Checking for duplicates inside index

2006-05-22 Thread Eugene Tuan

I have created a method that can delete duplicate docs. Basically,
during indexing, a doc is associated with an id (a term field defined by
you) that is indexed. Then, call the method to delete duplicates
whenever you update the index.

I haven't contributed back to Lucene community yet because our code is
in beta testing now. 

My former colleague, Chris, has received agreement from Doug Cutting
since last August that this feature is nice to have.

Eugene


-Original Message-
From: Omar Didi [mailto:[EMAIL PROTECTED] 
Sent: Monday, May 22, 2006 6:47 PM
To: java-user@lucene.apache.org
Subject: RE: Checking for duplicates inside index

you have two choices that I can think of:
1- before adding a document, check if it does't exist in the index. you
can do this by querying on a unique field if you have it .
2- you can index all your documents, and once the indexing is done you
can dedupe. (Lucene has built in methods that can help with this)


if your index doesn't have a unique key, you need to add one like the
one you suggested.

-Original Message-
From: karl wettin [mailto:[EMAIL PROTECTED]
Sent: Monday, May 22, 2006 6:05 PM
To: java-user@lucene.apache.org
Subject: Re: Checking for duplicates inside index


On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote:
 
 I'm indexing ~1 documents per day but since I'm getting a lot of 
 real duplicates (100% the same document content) I want to check the 
 content before indexing...
 
 My idea is to create a checksum of the documents content and store it 
 within document inside the index, before indexing a new document I
 will compare the new documents checksum with the ones in the index.
 
 Is that a good idea? does someone have experiences with that method?
 any tools available? 

That could work.

You will need a big sum though. MD5?






Re: does anybody have the experience to do some pooling upon lucene?

2006-05-22 Thread Zhenjian YU

OK, got it. Thanks.

On 5/23/06, Erik Hatcher [EMAIL PROTECTED] wrote:



On May 21, 2006, at 10:56 PM, Zhenjian YU wrote:
 I didn't dig the source code of lucence deep enough, but I noticed
 that the
 IndexSearcher uses an IndexReader, while the cost of initializing
 IndexReader is a bit high.

The key is the IndexReader.

 My application is a webapp, so I think it may be good if I cache some
 instances of IndexSearcher to provide service for my webapp. I
 haven't done
 any performance testing yet. Maybe I test it later to see the
 difference
 between caching and without caching.

It is best to keep only a single IndexSearcher/IndexReader
combination around.  There is no need to have more than one instance,
and in fact it is a waste of resources to do so.

Erik






Re: Checking for duplicates inside index

2006-05-22 Thread Ken Krugler

On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote:



  I'm indexing ~1 documents per day but since I'm getting a lot of

 real duplicates (100% the same document content) I want to check the
 content before indexing...


  My idea is to create a checksum of the documents content and store it
  within document inside the index, before indexing a new document I
  will compare the new documents checksum with the ones in the index.
 

 Is that a good idea? does someone have experiences with that method?
 any tools available?


That could work.

You will need a big sum though. MD5?


Just as a reference, Nutch uses an MD5 digest to detect duplicate web 
pages. It works fine, except of course when two docs differ by only 
an insignificant text delta. There's some recent work in this area - 
check out TextProfileSignature.


-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
Find Code, Find Answers




Re: SpanScorer Out Of Bounds

2006-05-22 Thread Michael Chan

Hi Otis,

Thanks for that. I found out that it's a memory usage problem rather
than one on Lucene's part.

Thanks.

Michael

On 5/22/06, Otis Gospodnetic [EMAIL PROTECTED] wrote:

Hi Michael,
I don't see any responses to your problem.  It's early, so you may get some, 
but this sounds like a case for JIRA.
Also, please try to write and attach (to your JIRA case) a unit test that
demonstrates the problem, something we can run and debug.  Without that we
may not be able to fix this.

Otis


- Original Message 
From: Michael Chan [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Sunday, May 21, 2006 7:37:35 AM
Subject: SpanScorer Out Of Bounds

Hi,

Somehow, after running many searches using instances of SpanQuery
(mostly SpanNearQuery), I get the ArrayIndexOutOfBounds exception:

bash-2.03$ java.lang.ArrayIndexOutOfBoundsException: 2147483647
at org.apache.lucene.search.spans.SpanScorer.score(SpanScorer.java:72)
at 
org.apache.lucene.search.ConjunctionScorer.score(ConjunctionScorer.java:82)
at 
org.apache.lucene.search.BooleanScorer2$2.score(BooleanScorer2.java:186)
at 
org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:327)
at 
org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:291)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:132)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:99)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
at org.apache.lucene.search.Hits.<init>(Hits.java:44)
at org.apache.lucene.search.Searcher.search(Searcher.java:44)
at org.apache.lucene.search.Searcher.search(Searcher.java:36)
...
traces to my program

Is there a counter or something in play? Is it a cache or some sort?

Any help will be much appreciated.

Thanks.

Michael













Making SpanQuery more effiicent

2006-05-22 Thread Michael Chan

Hi,

As I use SpanQuery purely for the slop, I was wondering how to
make SpanQuery more efficient. Since I don't need any span
information, is there a way to disable the computation of spans and
other unneeded overhead?

Thanks.

Michael
