Re: Open an IndexWriter in parallel with an IndexReader on the same index.

2006-02-22 Thread Nadav Har'El
Chris Hostetter [EMAIL PROTECTED] wrote on 22/02/2006 03:24:58 AM:

 : It would have been nice if someone wrote something like indexModifier,
 : but with a cache, similar to what Yonik suggested above: deletions will
 : not be done immediately, but rather cached and later done in batches.
 : Of course, batched deletions should not remember the term to delete,
 : but rather the matching document numbers at the time of the deletion -
 : because after the addition of the modified document if we search for
 : the term again we'll find two documents.

 That's not a safe sequence of events.  An Add can trigger a segment
 merge, which can renumber documents.

I see. Then maybe there's a way to catch this merge and do the deletions
just before it, because...

 As Yonik said, you want to queue up the adds/updates, then do a delete
 for each update in your queue, then do your adds in one batch.  Knowing

The problem with this solution is that, unlike queuing deletes, queuing
additions requires you to queue the actual document contents. Doing
this in memory may impose a large memory penalty, which is undesirable
for applications that try to maintain a small memory footprint.

 when/what to delete requires knowing a key for your records -- which
 isn't a native Lucene concept, but it is certainly a general enough one
 that a helper class could be written for this.

I realise that the name of this delete key isn't defined by Lucene,
but I believe that the concept of such a key was officially
sanctioned by Lucene with the deleteDocuments(Term) method (whose
documentation even mentions the unique ID string scenario).
So indeed a helper class of this sort will probably be useful to
more than a few people.
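
For concreteness, here is a minimal sketch of such a helper. The class and
method names are hypothetical, it assumes every document carries a unique
"id" keyword field, and it pays exactly the memory cost described above:
queued Documents are held in RAM until flush() is called.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class BatchedUpdater {
    private final String indexPath;
    private final List pending = new ArrayList(); // queued Documents awaiting (re)addition

    public BatchedUpdater(String indexPath) {
        this.indexPath = indexPath;
    }

    // Queue a new or changed document; its "id" field doubles as the delete key.
    public void update(Document doc) {
        pending.add(doc);
    }

    // One reader pass deletes every stale version, then one writer pass adds the
    // queued documents, so reader and writer are never open at the same time.
    public void flush(Analyzer analyzer) throws IOException {
        IndexReader reader = IndexReader.open(indexPath);
        for (Iterator it = pending.iterator(); it.hasNext();) {
            Document doc = (Document) it.next();
            // deleteDocuments(Term) in 1.9; the 1.4.3 equivalent is delete(Term)
            reader.deleteDocuments(new Term("id", doc.get("id")));
        }
        reader.close(); // commits the deletions

        IndexWriter writer = new IndexWriter(indexPath, analyzer, false);
        for (Iterator it = pending.iterator(); it.hasNext();) {
            writer.addDocument((Document) it.next());
        }
        writer.close();
        pending.clear();
    }
}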

--
Nadav Har'El


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How can I get a term's frequency?

2006-02-22 Thread sog
En, but IndexReader.getTermFreqVector is an abstract method; I do not know
how to implement it in an efficient way. Does anyone have good advice?


I search with a group of query terms and get documents back in the search
result:


Query(term1, term2, term3)--search index--Hits(doc1, doc2, doc3, ..)

I want to get term1's frequency in doc1.

I think the tf value is calculated during indexing. Can I get the tf (term
frequency) value of term1 directly?


I can do it in this way:


QueryTermVector vector = new QueryTermVector(doc.getValues(field));
int[] freq = vector.getTermFrequencies();


but I think this is a very inefficient way.

Can anyone help me? Thanks.


sog




- Original Message - 
From: Daniel Noll [EMAIL PROTECTED]

To: java-user@lucene.apache.org
Sent: Wednesday, February 22, 2006 1:19 PM
Subject: Re: How can I get a term's frequency?



sog wrote:


I search the index with a group of terms. I want to get every term's
frequency in each document of the search result.


Are you looking for this?

TermFreqVector vector = IndexReader.getTermFreqVector(docNum, field);

That gives you the frequency of every term, but you can just look up the
ones you're interested in.

Daniel


--
Daniel Noll

Nuix Australia Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Phone: (02) 9280 0699
Fax:   (02) 9212 6902

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Phrase query vs span query

2006-02-22 Thread Paul Elschot
On Wednesday 22 February 2006 00:45, Rajesh Munavalli wrote:
 I am trying to adopt Lucene for a special IR system. The following scenario
 is an approximation of what I am trying to do. Please bear with me if some
 things don't make sense. I need some suggestions on formulating queries for
 the following scenario.
 
 Each document consists of a set of fields (standard in lucene). But in my
 case, the field is somewhat different as explained below.
 
 Field:
 -
 Each field consists of a set of conceptual sections. Each of these sections
 is separated by say N (say 1000) index positions but are in the same field.
 Sizes of sections vary and do not have any lower or upper bound on the
 number of terms they may contain.
 Ex: Let's say the field contents are:
 [section 1 of 100 terms] [gap of 1000 term positions] [section 2 of 1500
 terms] [gap of 1000 term positions] [gap of 1000 term positions] [section 3
 of 10 terms]
 
 NOTE: At index time, I am assuming I somehow know how to form these
 sections.

One more choice you have is to index both the full document and each section
as a Lucene document.

 
 Typical Query:
 -
 Consists of 15 to 30 query terms. In other words, these query terms
 represent a conceptual section.

Would you need synonyms of these terms, too?
 
 Aim of the Query formation:
 
 I want to rank the documents proportional to the number of query terms

For this there is the coord() factor used in Lucene boolean queries.
But scoring exactly proportional to the number of query terms is difficult
to do because the Lucene score is not bounded by default.

 appearing in the SAME SECTION and IN ORDER. Documents containing terms with

To query the exact order, you can use PhraseQuery and SpanQuery.

 
 My Questions:
 -
 Considering the structure of the fields/documents and the number of query
 terms.
 
 (1) Is there an effective way of formulating a query with the existing query
 types in Lucene?

I don't think so, see below.

 (2) After considering the way different queries work and their limitations,
 I think forming phrase/span queries of groups of query terms
 might approximate the rankings I am expecting. In that case which of the
 following queries will perform better (in terms of QUERY SPEED and RANKING)
   (a) phrase query with certain slope factor
   (b) span query

SpanQuery is slower than PhraseQuery, but it has the advantage that it can
be nested. Nesting here means the possibility to use, e.g., a short phrase
as a unit to be matched and scored.
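
For illustration, a rough sketch of such nesting; the field name and slop
values here are invented:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// An ordered, zero-slop span: behaves like the exact phrase "information retrieval".
SpanQuery phrase = new SpanNearQuery(new SpanQuery[] {
    new SpanTermQuery(new Term("contents", "information")),
    new SpanTermQuery(new Term("contents", "retrieval"))
}, 0, true);

// The phrase then becomes one unit inside a wide, unordered span: it must
// co-occur with "lucene" within 500 positions, e.g. inside one conceptual section.
SpanQuery section = new SpanNearQuery(new SpanQuery[] {
    phrase,
    new SpanTermQuery(new Term("contents", "lucene"))
}, 500, false);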

Concerning this:
Rank 2: Documents containing section containing all terms but randomly
ordered

SpanQuery can also match unordered occurrences, I don't know about
PhraseQuery.

To formulate a single query for your requirements,
there is still the problem that PhraseQuery and SpanQuery only work when
all their terms are present in an indexed Lucene document field.
Putting it differently, when fewer terms are present, their order cannot
be taken into account, unless the query contains a (non-)ordered query
specifying a subset of the terms present in the documents.

An alternative to the current span query implementation is here:
http://issues.apache.org/jira/browse/LUCENE-413
but this will only help to get an impression of how to match in the ordered
and unordered cases.
It might be possible to generalize the various span algorithms there and
in the trunk to work with fewer terms.

Regards,
Paul Elschot

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index missing documents

2006-02-22 Thread Michael van Rooyen
I'm using Lucene 1.4.3, and maxBufferedDocs only appears to be in the new 
(unreleased?) version of IndexWriter in CVS.  Looking at the code though, 
setMaxBufferedDocs(n) just translates to minMergeDocs = n.  My index was 
constructed using the default minMergeDocs = 10, so somehow this doesn't 
seem to be the culprit that caused all 2 million+ documents to be missing 
from the crashed index.  It seems more likely that none of the index files 
were registered in Lucene's segments file.  Is there perhaps some other
trigger that causes Lucene to register the indexes in the segments file,
or is there some way of flushing the segments file every so often to ensure
that its list is up to date?  Thanks again for your assistance.


Michael.

- Original Message - 
From: Otis Gospodnetic [EMAIL PROTECTED]

To: java-user@lucene.apache.org
Sent: Monday, February 20, 2006 8:39 PM
Subject: Re: Index missing documents


No, using the same IndexWriter is the way to go.  If you want things to be 
written to disk more frequently, lower the maxBufferedDocs setting.  Go 
down to 1, if you want.  You'll use less memory (RAM), Documents will be 
written to disk without getting buffered in RAM, but the indexing process 
will be slower.


Otis
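
For reference, a minimal sketch of lowering that threshold; in 1.4.3 it is
the public minMergeDocs field, while 1.9 adds the setMaxBufferedDocs(int)
setter (the path and analyzer below are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
writer.minMergeDocs = 1;         // Lucene 1.4.3: flush buffered docs to disk after every add
// writer.setMaxBufferedDocs(1); // Lucene 1.9 equivalent of the line above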




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How can I get a term's frequency?

2006-02-22 Thread sog


En, let me describe my question more clearly:

I search with a group of query terms and get documents back in the search
result:


Query(term1, term2, term3)--search index--Hits(doc1, doc2, doc3, ..)

I want to get term1's frequency in doc1.

Hits(docs1((term1,freq),(term2,freq),(term3,freq)),
docs2((term1,freq),(term2,freq),(term3,freq)),..)


I think the tf value is calculated during indexing. Can I get the tf (term
frequency) value of term1 directly?


I can do it in this way:


QueryTermVector vector = new QueryTermVector(doc.getValues(field));
int[] freq = vector.getTermFrequencies();


but I think this is a very inefficient way.

Can anyone help me? Thanks.


sog


- Original Message - 
From: Daniel Noll [EMAIL PROTECTED]

To: java-user@lucene.apache.org
Sent: Wednesday, February 22, 2006 1:19 PM
Subject: Re: How can I get a term's frequency?



sog wrote:


I search the index with a group of terms. I want to get every term's 
frequency in each document of the search result.


Are you looking for this?

TermFreqVector vector = IndexReader.getTermFreqVector(docNum, field);

That gives you the frequency of every term, but you can just look up the
ones you're interested in.

Daniel


--
Daniel Noll

Nuix Australia Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Phone: (02) 9280 0699
Fax:   (02) 9212 6902

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: webserverless search with lucene on offline HTML doc

2006-02-22 Thread Fabio Insaccanebbia
The signed applet is surely a simpler and more elegant solution.

In some projects, however, this might not be a viable option:
the System properties problem you pointed out (and I had missed
:-) will hopefully be solved in 1.9
(http://issues.apache.org/jira/browse/LUCENE-369)

Fabio

P.S.: Is there any possibility of having a look at your quick and dirty
implementation of the JarDirectory? I've written a
JarReadOnlyDirectory, but it was very dirty and not even that quick
for me to write :-(

 I wrote a quick and dirty implementation of a JarDirectory - it works, but a 
 new problem is encountered soon after: The indexWriter requires information 
 from the System properties; an applet is allowed to read only a limited set 
 of Properties.

 Especially with an offline applet I would stick to the solution of signing 
 the applet.

 Dolf.


On 2/21/06, Trieschnigg, R.B. (Dolf) [EMAIL PROTECTED] wrote:
  Wouldn't this be a good case for the JarDirectory implementation
  somebody asked for?
  The index could then be statically written in a jar file downloaded
  with the applet (the original mail refers to static offline HTML
  files).

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



:Lucene 1.9 RC1 is not working properly with older version of Code 1.43:

2006-02-22 Thread Ravi
Hi,

 

I got the latest source code of Lucene 1.9 RC1 and modified my code
accordingly, removing the deprecated methods. But since updating to this
version, search is not working at all. If I try with Luke it works fine, but
my program returns no error and no results. Please let me know if there are
any known problems with this version; if so, I will go back to the old
version, 1.4.3, which works fine.

 

 

Thanks

Ravi Kumar Jaladanki



Lucene, Cannot rename segments.new to segments

2006-02-22 Thread Patrick Kimber
I am getting intermittent errors with Lucene.  Here are two examples:
java.io.IOException: Cannot rename E:\lucene\segments.new to E:\lucene\segments
java.io.IOException: Cannot rename E:\lucene\_8ya.tmp to E:\lucene\_8ya.del

This issue has an open BugZilla entry:
http://issues.apache.org/bugzilla/show_bug.cgi?id=36241

I thought this error must be caused by an error in my application.  To
try and solve the error I used the LuceneIndexAccessor in my
application:
http://issues.apache.org/bugzilla/show_bug.cgi?id=34995

I am still getting the error.

1) Is there a reason (other than time and resources) why the bug report
is still set to NEW after 6 months (since August 2005)?

2) Is the problem likely to be in my application?  Any ideas how I
could go about solving this issue?

Thanks for your help
Patrick

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: :Lucene 1.9 RC1 is not working properly with older version of Code 1.43:

2006-02-22 Thread Yonik Seeley
Hi Ravi,

Could you try 1.9RC1 without changing your code to remove the
deprecated calls first?
If that works, try changing one type of deprecated call at a time
until the culprit is found.
It may either be a bug in API usage in your code, or a bug in Lucene.

-Yonik

On 2/22/06, Ravi [EMAIL PROTECTED] wrote:
 I got the latest source code of Lucene 1.9 RC1 and modified my code
 accordingly, removing the deprecated methods. But since updating to this
 version, search is not working at all. If I try with Luke it works fine,
 but my program returns no error and no results. Please let me know if there
 are any known problems with this version; if so, I will go back to the old
 version, 1.4.3, which works fine.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



TREC,INEX and Lucene

2006-02-22 Thread Malcolm Clark

Hi all,
I am planning on participating in the INEX and hopefully passively on a 
couple of TREC tracks mainly using the Lucene API.

Is anyone else on this list planning on using Lucene during participation?
I am particularly interested in the SPAM, Blog and ADHOC tracks.
Malcolm Clark 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Phrase query vs span query

2006-02-22 Thread Rajesh Munavalli
On 2/22/06, Paul Elschot [EMAIL PROTECTED] wrote:

 
  Typical Query:
  -
  Consists of 15 to 30 query terms. In other words, these query terms
  represent a conceptual section.

 Would you need synonyms of these terms, too?


Yes.


  (2) After considering the way different queries work and their
 limitations,
  I think forming phrase/span queries of groups of query terms
  might approximate the rankings I am expecting. In that case which of the
  following queries will perform better (in terms of QUERY SPEED and
 RANKING)
(a) phrase query with certain slope factor
(b) span query

 SpanQuery is slower than PhraseQuery, but it has the advantage that it can
 be nested. Nesting here means the possibility to use, e.g., a short phrase
 as a unit to be matched and scored.


I wasn't aware of the capability to nest SpanQuery. Is there a link where I
can read more about this?

 To formulate a single query for your requirements,
 there is still the problem that PhraseQuery and SpanQuery only work when
 all their terms are present in an indexed Lucene document field.
 Putting it differently, when fewer terms are present, their order cannot
 be taken into account, unless the query contains a (non-)ordered query
 specifying a subset of the terms present in the documents.


I was thinking of building a boolean combination of phrase/span queries on
subsets of the terms. Though it's not exhaustive, it might be sufficient in
the majority of cases.
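
Roughly, with the 1.4.3 BooleanQuery API, that combination might look like
the sketch below; the span subqueries and their names are placeholders. The
coord() factor mentioned earlier then rewards documents that match more of
the subset clauses.

import org.apache.lucene.search.BooleanQuery;

// Each spanForSubsetN covers one subset of the query terms; none is required on its own.
BooleanQuery combined = new BooleanQuery();
combined.add(spanForSubset1, false, false); // required=false, prohibited=false, i.e. "should"
combined.add(spanForSubset2, false, false);
combined.add(spanForSubset3, false, false);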

An alternative to the current span query implementation is here:
 http://issues.apache.org/jira/browse/LUCENE-413
 but this will only help to get an impression of how to match in the
 ordered and unordered cases.
 It might be possible to generalize the various span algorithms there and
 in the trunk to work with fewer terms.

I will consider that option.

Thanks,

Rajesh Munavalli


ArrayIndexOutOfBoundsException being thrown ...

2006-02-22 Thread Mufaddal Khumri
Getting an ArrayIndexOutOfBoundsException ...

Line 31 in IndexSearcherManager.java:
...

public static IndexSearcher getIndexSearcher(String indexPath)
{
    logger.debug("indexPath = " + indexPath);

    searcher = new IndexSearcher(indexPath);  // <-- LINE 31

    return searcher;
}
...
...

I get the following exception:

28628 DEBUG com.allegrocentral.tandoori.managers.search.IndexSearcherManager 
[21] - indexPath = /opt/tomcat/webapps/ROOT/WEB-INF/search-index
28666 WARN  org.apache.struts.action.RequestProcessor [516] - Unhandled 
Exception thrown: class java.lang.ArrayIndexOutOfBoundsException
28669 ERROR 
org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/].[action] 
[704] - Servlet.service() for servlet action threw exception
java.lang.ArrayIndexOutOfBoundsException: -1
        at java.util.ArrayList.get(ArrayList.java:323)
        at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:155)
        at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:151)
        at org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java:149)
        at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:115)
        at org.apache.lucene.index.TermInfosReader.readIndex(TermInfosReader.java:86)
        at org.apache.lucene.index.TermInfosReader.<init>(TermInfosReader.java:45)
        at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:112)
        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:89)
        at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:118)
        at org.apache.lucene.store.Lock$With.run(Lock.java:109)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:95)
        at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:38)
        at com.allegrocentral.tandoori.managers.search.IndexSearcherManager.getIndexSearcher(IndexSearcherManager.java:31)

Any ideas as to why this might be happening? (Am using lucene-core-1.9-rc1.jar)

-Thanks.


IndexSearcher

2006-02-22 Thread Gus Kormeier
Maybe too general a question, but is there anything about creating an
IndexSearcher(directory) object that would make the instantiation really
slow?


I have one index where the instantiation is very fast, to the point where I
don't need to do any pooling.  A new index I have created takes a very long
time to create the IndexSearcher object.  With a 30MB index, it can take
about 30 seconds just to instantiate an IndexSearcher().  It almost seems
like it is reading the index at that point.


The only difference between the indexes has been the number of fields indexed,
the newer one having only one field indexed.

Any ways to speed up that instantiation? Or do I have to use a pooling
system?

Thanks for any suggestions,
-Gus

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: IndexSearcher

2006-02-22 Thread John Powers
This doesn't really address your question, but...

Once you have the single IndexSearcher, do you need any others?  Could
your app just use the single instance?

-Original Message-
From: Gus Kormeier [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, February 22, 2006 11:28 AM
To: java-user@lucene.apache.org
Subject: IndexSearcher

Maybe too general a question, but is there anything about creating an
IndexSearcher(directory) object that would make the instantiation really
slow?


I have one index where the instantiation is very fast, to the point where I
don't need to do any pooling.  A new index I have created takes a very long
time to create the IndexSearcher object.  With a 30MB index, it can take
about 30 seconds just to instantiate an IndexSearcher().  It almost seems
like it is reading the index at that point.


The only difference between the indexes has been the number of fields indexed,
the newer one having only one field indexed.

Any ways to speed up that instantiation? Or do I have to use a pooling
system?

Thanks for any suggestions,
-Gus

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



search a subdirectory (New to Lucene)

2006-02-22 Thread John Hamilton
I'm new to Lucene and was wondering what is the best way to perform a search on 
a subdirectory or subdirectories within the index?  My thought at this point is 
to build a query to first search for files in the required directory(ies) and 
then use that query to make a QueryFilter and use that QueryFilter in the 
actual search.  Is there an easier way?
 
On an unrelated note, does anybody know of a way to get results at the section
level within a document?  For example, could I find not just a document that 
matches my query, but the paragraph within that document that best matches the 
query?
 
thanks,
 
John


RE: IndexSearcher

2006-02-22 Thread John Powers
I guess what I meant was to have all your servlets use the same
instance.  They could get it from the class or from a parent of all your
servlets.  Then you can let the IndexSearcher take care of all the
search requests.
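
A minimal sketch of that sharing; the holder class and its name are made up,
and searching on a single IndexSearcher from multiple threads is safe:

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;

public final class SearcherHolder {
    private static IndexSearcher searcher;

    // All servlets fetch the same instance; IndexSearcher handles concurrent searches.
    public static synchronized IndexSearcher get(String indexPath) throws IOException {
        if (searcher == null) {
            searcher = new IndexSearcher(indexPath);
        }
        return searcher;
    }
}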

-Original Message-
From: Gus Kormeier [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, February 22, 2006 12:42 PM
To: 'java-user@lucene.apache.org'
Subject: RE: IndexSearcher

It's in a servlet, so one workaround I have been going with is to just open
it at init().  That gives me some threading concerns.

And I didn't have to do that in the past,
-Gus

-Original Message-
From: John Powers [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 22, 2006 9:35 AM
To: java-user@lucene.apache.org
Subject: RE: IndexSearcher


This doesn't really address your question, but...

Once you have the single IndexSearcher, do you need any others?  Could
your app just use the single instance?

-Original Message-
From: Gus Kormeier [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, February 22, 2006 11:28 AM
To: java-user@lucene.apache.org
Subject: IndexSearcher

Maybe too general a question, but is there anything about creating an
IndexSearcher(directory) object that would make the instantiation really
slow?


I have one index where the instantiation is very fast, to the point where I
don't need to do any pooling.  A new index I have created takes a very long
time to create the IndexSearcher object.  With a 30MB index, it can take
about 30 seconds just to instantiate an IndexSearcher().  It almost seems
like it is reading the index at that point.


The only difference between the indexes has been the number of fields indexed,
the newer one having only one field indexed.

Any ways to speed up that instantiation? Or do I have to use a pooling
system?

Thanks for any suggestions,
-Gus

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Throughput doesn't increase when using more concurrent threads

2006-02-22 Thread Yonik Seeley
Hmmm, not sure what that could be.
You could try using the default FSDir instead of MMapDir to see if the
differences are there.
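
If it helps, in 1.9 the FSDirectory implementation can be selected with a
system property before the index is opened; as far as I remember it looks
roughly like this (the path is a placeholder, and leaving the property unset
gives the default FSDirectory):

// Select the memory-mapped implementation (Lucene 1.9); comment out for the default.
System.setProperty("org.apache.lucene.FSDirectory.class",
                   "org.apache.lucene.store.MMapDirectory");
IndexSearcher searcher = new IndexSearcher("/path/to/index");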

Some things that could be different:
- thread scheduling (shouldn't make too much of a difference though)
- synchronization workings
- page replacement policy... how to figure out which pages to swap in
and which to swap out, especially for the memory-mapped files.

You could also try a profiler on both platforms to try and see where
the difference is.

-Yonik

On 2/22/06, Peter Keegan [EMAIL PROTECTED] wrote:
 I am doing a performance comparison of Lucene on Linux vs Windows.

 I have 2 identically configured servers (8-CPUs (real) x 3GHz Xeon
 processors, 64GB RAM). One is running CentOS 4 Linux, the other is running
 Windows server 2003 Enterprise Edition x64. Both have 64-bit JVMs from Sun.
 The Lucene server is using MMapDirectory. I'm running the jvm with
 -Xmx16000M. Peak memory usage of the jvm on Linux is about 6GB and 7.8GB on
 windows.

 I'm observing query rates of 330 queries/sec on the Wintel server, but only
 200 qps on the Linux box. At first, I suspected a network bottleneck, but
 when I 'short-circuited' Lucene, the query rates were identical.

 I suspect that there are some things to be tuned in Linux, but I'm not sure
 what. Any advice would be appreciated.

 Peter



 On 1/30/06, Peter Keegan [EMAIL PROTECTED] wrote:
 
  I cranked up the dial on my query tester and was able to get the rate up
  to 325 qps. Unfortunately, the machine died shortly thereafter (memory
  errors :-( ) Hopefully, it was just a coincidence. I haven't measured 64-bit
  indexing speed, yet.
 
  Peter
 
  On 1/29/06, Daniel Noll [EMAIL PROTECTED] wrote:
  
   Peter Keegan wrote:
I tried the AMD64-bit JVM from Sun and with MMapDirectory and I'm now
getting 250 queries/sec and excellent cpu utilization (equal
   concurrency on
all cpus)!! Yonik, thanks for the pointer to the 64-bit jvm. I wasn't
   aware
of it.
   
   Wow.  That's fast.
  
   Out of interest, does indexing time speed up much on 64-bit hardware?
   I'm particularly interested in this side of things because for our own
   application, any query response under half a second is good enough, but
   the indexing side could always be faster. :-)
  
   Daniel
  
   --
   Daniel Noll
  
   Nuix Australia Pty Ltd
   Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
   Phone: (02) 9280 0699
   Fax:   (02) 9212 6902
  
   This message is intended only for the named recipient. If you are not
   the intended recipient you are notified that disclosing, copying,
   distributing or taking any action in reliance on the contents of this
   message or attachment is strictly prohibited.
  
  
   -
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
  
  
 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Searching/sorting strategy for many properties for semantic web app

2006-02-22 Thread Erik Hatcher
One very nice implementation to take a look at is the Simile project  
at MIT.   The Piggy Bank and Longwell projects use Lucene to index  
RDF and integrate full-text and structural queries nicely together.
http://simile.mit.edu


Erik

On Feb 21, 2006, at 10:20 PM, David Pratt wrote:

Hi there. I am new to Lucene and I have been developing a semantic  
application for a while and it appears to me Lucene could help me  
to get a much needed search with reasonable speed. I have some  
general question to start:


1) Since my app is virtually all metadata, what should I store in  
the indexes if anything?
2) Should I only index the most common properties that people will  
search and combine the rest (and index this combined text as a field)?
3) I would like to sort and filter results but am concerned this  
could be very memory intensive
4) Some general guidance on organizing indexes in an app would be  
appreciated.


My schema is fairly large but I generally expect people to search  
on about 6 to 8 properties for the most part. I have the data  
stored in an SQL database but not in a conventional way. I am
willing to accept a slower advanced search on less common
properties (accommodating this with SQL search) but I really want
some speed for the main properties with full text search.


Pretty much everything in the app is metadata so I am most  
interested in  focussing on the 6-8 properties that people will use  
to search on for the most part. I am thinking of combining the text  
of the remaining properties (quite a number) into a single  
description type field so that essentially all information gets  
indexed and ranked. Is this a reasonable approach?


I see that there are advanced possibilities with the indexes to  
sort and filter. How advisable is using sort for large record sets.  
For example, say you have got 2 records returned from your  
search. Because this will have a web interface I will only be  
showing first 20 likely so I will be batching results. Is the  
sorting/filtering highly memory intensive?


Hopefully, someone can provide some initial advice. Many thanks.

Regards,
David

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: search a subdirectory (New to Lucene)

2006-02-22 Thread Erik Hatcher
I presume by saying subdirectory you're referring to filesystem
directories and that you're indexing a directory tree of files.  If you
index the path (perhaps relative from the root is best) as a keyword
field (untokenized, but indexed) you could perform filtering in a
/path/subpath sort of way using PrefixQuery.
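
Something along these lines; the field names, path prefix, and query are
invented, and searcher is an existing IndexSearcher:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

Query userQuery = new TermQuery(new Term("contents", "lucene"));
// Restrict hits to documents whose "path" keyword field starts with the subtree prefix.
Filter subtree = new QueryFilter(new PrefixQuery(new Term("path", "docs/api/")));
Hits hits = searcher.search(userQuery, subtree);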


As for paragraphs - how you index a document is entirely  
application dependent.  Maybe it makes sense to parse the documents  
before handing them to Lucene such that you're creating a Lucene  
Document for each paragraph rather than for each entire file.   
Slicing the granularity of a domain into Documents is a fascinating  
topic :)


Erik


On Feb 22, 2006, at 1:00 PM, John Hamilton wrote:

I'm new to Lucene and was wondering what is the best way to perform  
a search on a subdirectory or subdirectories within the index?  My  
thought at this point is to build a query to first search for files  
in the required directory(ies) and then use that query to make a  
QueryFilter and use that QueryFilter in the actual search.  Is  
there an easier way?


On an unrelated note, does anybody know of a way to get results at
the section level within a document?  For example, could I find not  
just a document that matches my query, but the paragraph within  
that document that best matches the query?


thanks,

John



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Lucene 1.9 RC1 release available

2006-02-22 Thread Doug Cutting

Release 1.9 RC1 of Lucene is now available from:

http://www.apache.org/dyn/closer.cgi/lucene/java/

This release candidate has many improvements since release 1.4.3,
including new features, performance improvements, bug fixes, etc.  For
details, see:

http://svn.apache.org/viewcvs.cgi/*checkout*/lucene/java/branches/lucene_1_9/CHANGES.txt?rev=379190

1.9 will be the last 1.x release. It is both back-compatible with 1.4.3
and forward-compatible with the upcoming 2.0 release. Many methods and
classes in 1.4.3 have been deprecated in 1.9 and will be removed in 2.0.
 Applications must compile against 1.9 without deprecation warnings
before they are compatible with 2.0.

Doug


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: IndexSearcher

2006-02-22 Thread Chris Hostetter
: I have one index where the instantiation is very fast, to the point where I
: don't need to do any pooling.  A new index I have created, takes a very long
: time to create the IndexSearcher object.  With a 30mb index, it can take
: about 30 seconds just to instantiate an IndexSearcher().  It almost seems
: like it is reading the index at that point.
:
:
: The only difference between the indexes has been the # of fields indexed.
: The newer one only having one field indexed.

If I remember correctly, the IndexSearcher constructor doesn't do anything
but open an IndexReader ... IndexReader.open() opens a MultiReader on all
of the segments, and each of the SegmentReaders opens up a bunch of files.

So off the top of my head, one thing that can make a difference in the
new IndexSearcher times is how many segments you have in your index
(i.e., is it optimized?) ... using the compound file format can probably
make a difference as well.
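
For example, something like this when building the index; writer is an
existing IndexWriter, and both calls exist in 1.4.3:

writer.setUseCompoundFile(true); // fewer files per segment -> fewer opens per IndexSearcher
writer.optimize();               // collapse the index down to a single segment
writer.close();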

: Any ways to speed up that instantiation? Or do I have to use a pooling
: system?

Even if you get it down to 0.1 seconds, I would still reuse the same
IndexSearcher as much as possible.  See previous replies from me in
the archive about memory for my reasoning.



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: IndexSearcher

2006-02-22 Thread Gus Kormeier
Thanks Hoss,
I did figure out that I was putting about 400 stored fields per
document into my new index; more than my prior indexes. 
Reducing the number of stored fields seems to have helped significantly.

I do call writer.optimize() after loading in documents, but I'm not sure how
I would set the number of segments.
I think I will keep the IndexSearcher statically for all instances. The slow
times I was seeing weren't even acceptable for that, though.

Since this is a case of really only needing to search on one field and use
the index as a storage medium for the rest of the data (pretty much textual
data), I'm thinking it would make sense to get the latest version of Lucene
and create a two-field index.
Something like:
Field1: id
Field2: serialized data object.

Any reason why that wouldn't be fast?
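
A sketch of that layout with the 1.4.3 Field factories, assuming the data
object is serialized to a String (e.g. XML or Base64), since stored fields
hold text; id and dataBlob here are placeholders:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
doc.add(Field.Keyword("id", id));           // indexed, untokenized: the searchable key
doc.add(Field.UnIndexed("data", dataBlob)); // stored only: retrieved with hits, never searched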

I have been having elusive memory issues with my other usage, maybe you just
helped me find that solution as well.
Thanks,
-Gus

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 22, 2006 4:02 PM
To: java-user@lucene.apache.org
Subject: Re: IndexSearcher


: I have one index where the instantiation is very fast, to the point where
I
: don't need to do any pooling.  A new index I have created, takes a very
long
: time to create the IndexSearcher object.  With a 30mb index, it can take
: about 30 seconds just to instantiate an IndexSearcher().  It almost seems
: like it is reading the index at that point.
:
:
: The only difference between the indexes has been the # of fields indexed.
: The newer one only having one field indexed.

If I remember correctly, the IndexSearcher constructor doesn't do anything
but open an IndexReader ... IndexReader.open() opens a MultiReader on all
of the segments, and each of the SegmentReaders opens up a bunch of files.

So off the top of my head, one thing that can make a difference in the
new IndexSearcher times is how many segments you have in your index
(i.e., is it optimized?) ... using the compound file format can probably
make a difference as well.

: Any ways to speed up that instantiation? Or do I have to use a pooling
: system?

Even if you get it down to 0.1 seconds, I would still reuse the same
IndexSearcher as much as possible.  See previous replies from me in
the archive about memory for my reasoning.



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How can I get a term's frequency?

2006-02-22 Thread Daniel Noll
sog wrote:
 En, but IndexReader.getTermFreqVector is an abstract method; I do not
 know how to implement it in an efficient way. Does anyone have good advice?

You probably don't need to implement it, it's been implemented already.
 Just call the method.
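
For example, a rough sketch: it assumes the field was indexed with term
vectors enabled, that hits comes from an earlier search, and that the path,
field, and term names are placeholders.

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

IndexReader reader = IndexReader.open("/path/to/index");
int docNum = hits.id(0); // internal document number of the first hit
TermFreqVector vector = reader.getTermFreqVector(docNum, "contents");
if (vector != null) { // null when no term vector was stored for this field
    int idx = vector.indexOf("term1");
    int freq = (idx >= 0) ? vector.getTermFrequencies()[idx] : 0;
    System.out.println("tf(term1) = " + freq);
}
reader.close();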

 I can do it in this way:
 
 QueryTermVector vector = new QueryTermVector(doc.getValues(field));
 int[] freq = vector.getTermFrequencies();

I'm not sure because I've never used QueryTermVector before, but the
fact that QueryTermVector doesn't take an IndexReader as a parameter is
a good indication that it can't tell you anything about the frequency of
the term in your documents.

Daniel




-- 
Daniel Noll

Nuix Australia Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Phone: (02) 9280 0699
Fax:   (02) 9212 6902

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Searching/sorting strategy for many properties for semantic web app

2006-02-22 Thread David Pratt
Hi Erik. Many thanks for your reply. I'll likely see if I can find a
list to pose a couple of questions their way. I am having fun with
Lucene since it is new to me and I am impressed with the speed I am
getting. I am reading anything I can get hold of and trying different
code experiments. So far, the code is fairly straightforward, so I'm not
too concerned about this at the moment.


I am really hoping to hear from experienced people like yourself more on 
strategically what to index, what sort of things it would be a good idea 
to store and what to do about a fairly large schema that has much 
metadata to offer. Also perhaps when sorting and filtering gets too 
expensive. I realize that just because the metadata is available doesn't 
necessarily mean you want to even put it all in an index. I think these 
issues are pretty general; however, I know there are folks on this list that
would likely advise some particular path or direction because of their 
own experiences with Lucene. I would really like to hear from anyone 
that has been working with metadata particularly or anyone generally 
about these topics.


Regards,
David


Erik Hatcher wrote:
One very nice implementation to take a look at is the Simile project  at 
MIT.   The Piggy Bank and Longwell projects use Lucene to index  RDF and 
integrate full-text and structural queries nicely together.
http://simile.mit.edu


Erik

On Feb 21, 2006, at 10:20 PM, David Pratt wrote:

Hi there. I am new to Lucene and I have been developing a semantic  
application for a while and it appears to me Lucene could help me  to 
get a much needed search with reasonable speed. I have some  general 
question to start:


1) Since my app is virtually all metadata, what should I store in  the 
indexes if anything?
2) Should I only index the most common properties that people will  
search and combine the rest (and index this combined text as a field)?
3) I would like to sort and filter results but am concerned this  
could be very memory intensive
4) Some general guidance on organizing indexes in an app would be  
appreciated.


My schema is fairly large but I generally expect people to search  on 
about 6 to 8 properties for the most part. I have the data  stored in 
an SQL database but not in a conventional way. I am willing to accept
a slower advanced search on less common properties (accommodating this
with SQL search) but I really want some speed for the main properties
with full text search.


Pretty much everything in the app is metadata so I am most  interested 
in  focussing on the 6-8 properties that people will use  to search on 
for the most part. I am thinking of combining the text  of the 
remaining properties (quite a number) into a single  description type 
field so that essentially all information gets  indexed and ranked. Is 
this a reasonable approach?


I see that there are advanced possibilities with the indexes to  sort 
and filter. How advisable is using sort for large record sets.  For 
example, say you have got 2 records returned from your  search. 
Because this will have a web interface I will only be  showing first 
20 likely so I will be batching results. Is the sorting/filtering
highly memory intensive?


Hopefully, someone can provide some initial advice. Many thanks.

Regards,
David

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: TREC,INEX and Lucene

2006-02-22 Thread Dave Kor
Malcolm, I used Lucene in TREC last year in my QA list module, as
have many of my contemporaries.

On 2/22/06, Malcolm Clark [EMAIL PROTECTED] wrote:
 Hi all,
 I am planning on participating in the INEX and hopefully passively on a
 couple of TREC tracks mainly using the Lucene API.
 Is anyone else on this list planning on using Lucene during participation?
 I am particularly interested in the SPAM, Blog and ADHOC tracks.
 Malcolm Clark


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




--
Dave Kor, Research Assistant
Center for Information Mining and Extraction
School of Computing
National University of Singapore.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



hyphen not being removed by standard filter

2006-02-22 Thread Mufaddal Khumri
Hi,

I might be missing something. I have a custom analyzer the gist of which is:

public TokenStream tokenStream(String fieldName, Reader reader)
{
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopSet);
    result = new PorterStemFilter(result);
    return result;
}

I test my above analyzer with the following query string:
"the is EOS-20D canon amazing"

In my test code I do this to see what my analyzed query string looks like:

PerFieldAnalyzerWrapper analyzer =
    new PerFieldAnalyzerWrapper(new StandardStemmingAnalyzer());
analyzer.addAnalyzer("categoryNames", new KeywordAnalyzer());

TokenStream stream = analyzer.tokenStream(null, new StringReader(queryString));
String analyzedQueryString = "";

while (true)
{
    Token token = stream.next();
    if (token == null)
    {
        break;
    }

    analyzedQueryString = analyzedQueryString + token.termText() + " ";
}

analyzedQueryString = analyzedQueryString.trim();

log.debug("analyzedQueryString = " + analyzedQueryString);

The output of the log statement above is:

analyzedQueryString = eos-20d canon amaz

I see that the common stop words have been removed, everything has been
lowercased, and the query has even been stemmed. Why was the hyphen not
removed by the standard filter? Or does the standard analyzer remove hyphens
only from phrases like "eos - 20d" and not from "eos-20d"?

Thanks.