Steven Rowe wrote:
2.1 is much more likely to be the label used for the next release than
2.0.1.
The roadmap in Jira shows 21 issues scheduled for 2.0.1. If there is in
fact no intent to merge these into the 2.0 branch, these should probably
be retargeted for 2.1.0, and the 2.0.1 version
Karl Koch wrote:
Are there any other papers that regard the combination of coordination level matching and TFxIDF as advantageous?
We independently developed coordination-level matching combined with
TFxIDF when I worked at Apple. This is documented in:
Marcelo Ochoa wrote:
Then I'll move the code outside the lucene-2.0 code tree to be
packed as a subdirectory of the contrib area, for example.
Another alternative is to make a small zip file and send it to the
list as an attachment, as a preliminary (alpha-alpha version ;)
This sounds like great
Erick Erickson wrote:
Something like
Document doc = new Document();
doc.add(new Field("flag1", "Y", Field.Store.NO, Field.Index.UN_TOKENIZED));
doc.add(new Field("flag2", "Y", Field.Store.NO, Field.Index.UN_TOKENIZED));
writer.addDocument(doc);
Fields have overheads. It would be more efficient to implement this as
a single field with a different value for each boolean flag (as others
have suggested).
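A minimal plain-Java sketch of that single-field approach (the field and flag names are hypothetical, and Lucene itself is not used here): all set flags become token values of one "flags" field, so testing a flag is a single term match rather than one field per boolean.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class FlagField {
    // Encode the set flags as the token values of a single "flags" field.
    static String encodeFlags(Set<String> setFlags) {
        return String.join(" ", setFlags);
    }

    // A term query on the "flags" field is equivalent to this membership test.
    static boolean hasFlag(String encoded, String flag) {
        return Arrays.asList(encoded.split(" ")).contains(flag);
    }

    public static void main(String[] args) {
        Set<String> flags = new TreeSet<>(Arrays.asList("flag1", "flag2"));
        String encoded = encodeFlags(flags);
        System.out.println(encoded);                   // flag1 flag2
        System.out.println(hasFlag(encoded, "flag1")); // true
        System.out.println(hasFlag(encoded, "flag3")); // false
    }
}
```

In Lucene terms, the single field would be indexed with one token per set flag, and each boolean test becomes a TermQuery on that field.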
Michael J. Prichard wrote:
I get this output:
Tue Aug 01 21:15:45 EDT 2006
That's August 2, 2006 at 01:15:45 GMT.
20060802
Huh?! Should it be:
20060801
DateTools uses GMT.
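The timezone shift can be reproduced with plain java.text (a sketch; DateTools itself is not used here): 21:15:45 EDT is 01:15:45 the next day in GMT, so the yyyyMMdd rendering advances by a day.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class GmtDay {
    // Render a Date's day in GMT, as DateTools effectively does.
    static String gmtDay(String dateString) {
        try {
            SimpleDateFormat in =
                new SimpleDateFormat("EEE MMM dd HH:mm:ss zzz yyyy", Locale.US);
            Date d = in.parse(dateString);

            SimpleDateFormat out = new SimpleDateFormat("yyyyMMdd", Locale.US);
            out.setTimeZone(TimeZone.getTimeZone("GMT"));
            return out.format(d);
        } catch (java.text.ParseException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // 21:15:45 EDT = 01:15:45 GMT on the following day.
        System.out.println(gmtDay("Tue Aug 01 21:15:45 EDT 2006")); // 20060802
    }
}
```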
Doug
-
To unsubscribe, e-mail:
Rob Staveley (Tom) wrote:
Is there a tool I can use to see how much of the index is occupied by the
different fields I am indexing?
Note that IndexReader has a main() that will list the contents of
compound index files.
Doug
Tom Emerson wrote:
Thanks for the clarification. What then is the difference between a
MultiSearcher and using an IndexSearcher on a MultiReader?
The results should be identical. A MultiSearcher permits use of
ParallelMultiSearcher and RemoteSearchable, for parallel and/or
distributed
Marcus Falck wrote:
There is however one LARGE problem that we have run into. All search results should be displayed sorted with the newest document at top. We tried to accomplish this using Lucene's sort capabilities but quickly ran into large performance bottlenecks. So I figured, since the
hu andy wrote:
Hi, I have an application that needs to mark the retrieved documents which
have been read, so that the next time I needn't read the marked documents again.
You could mark the documents as deleted, then later clear deletions. So
long as you don't close the IndexReader, the deletions
Sunil Kumar PK wrote:
I want to know whether there is any possibility or method to merge the weight
calculation of index 1 and its search into a single RPC, instead of doing
both functions in separate steps.
To score correctly, weights from all indexes must be created before any
can be searched.
Is this markedly faster than using an MMapDirectory? Copying all this
data into the Java heap (as RAMDirectory does) puts a tremendous burden
on the garbage collector. MMapDirectory should be nearly as fast, but
keeps the index out of the Java heap.
Doug
z shalev wrote:
I've rewritten
karl wettin wrote:
Do I have to worry about passing a null Directory to the default
constructor?
A null Directory should not cause you problems.
Doug
Peter Keegan wrote:
Oops. I meant to say: Does this mean that an IndexSearcher constructed from
a MultiReader doesn't merge the search results and sort the results as if
there was only one index?
It doesn't have to, since a MultiReader *is* a single index.
A quick test indicates that it does
Dmitry Goldenberg wrote:
For an enterprise-level application, Lucene appears too file-system-centric
and too byte-sequence-centric a technology. Just my opinion. The Directory
API is just too low-level.
There are good reasons why Lucene is not built on top of a RDBMS. An
inverted index is not
Dan Armbrust wrote:
My indexing process works as follows (and some of this is hold-over from
the time before lucene had a compound file format - so bear with me)
I open up a File based index - using a merge factor of 90, and in my
current test, the compound index format. When I have added
I talked about this a bit in a presentation at Haifa last year:
http://www.haifa.ibm.com/Workshops/ir2005/papers/DougCutting-Haifa05.pdf
See the section on Seek versus Transfer.
Doug
Prasenjit Mukherjee wrote:
It seems to me that lucene doesn't use B-tree for its indexing storage.
Any
thomasg wrote:
Hi, we are currently intending to implement a document storage / search tool
using Jackrabbit and Lucene. We have been approached by a commercial search
and indexing organisation called ISYS who are suggesting the following
problems with using Lucene. We do have a requirement to
Vincent Le Maout wrote:
Am I missing something? Is it intended or is it a bug?
Looks like a bug. Can you submit a patch?
Doug
Vincent Le Maout wrote:
Am I missing something? Is it intended or is it a bug?
Looks like a bug. Can you please submit a bug report, and, ideally,
attach a patch?
Thanks,
Doug
Igor Bolotin wrote:
If somebody is interested - I can post our changes in TermInfosWriter and
SegmentTermEnum code, although they are pretty trivial.
Please submit this as a patch attached to a bug report.
I contemplated making this change to Lucene myself, when writing Nutch's
FsDirectory,
Igor Bolotin wrote:
Does it make sense to change TermInfosWriter.FORMAT in the patch?
Yes. This should be updated for any change to the format of the file,
and this certainly constitutes a format change. This discussion should
move to [EMAIL PROTECTED]
Doug
Olivier Jaquemet wrote:
IndexReader.unlock(indexDir); // unlock directory in case of improper
shutdown
This should be used very carefully. In particular, you should only call
it when you are certain that no other applications are accessing the index.
Doug
Dai, Chunhe wrote:
Does anyone know whether Lucene plans to support NFS in a later
release (2.0)? We are planning to integrate Lucene into our products and
cluster support is definitely needed. We want to check whether NFS
support is in the plan or not before implementing a new file locking
The Hits-based search API is optimized for returning earlier hits. If
you want the lowest-scoring matches, then you could reverse-sort the
hits, so that these are returned first. Or you could use the
TopDocs-based API to retrieve hits up to your toHits. (Hits-based
search is implemented
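The reverse-sort idea can be sketched in plain Java over hypothetical (docId, score) pairs; in Lucene the equivalent is a search with the sort order on score reversed, so the lowest-scoring matches arrive first.

```java
import java.util.Arrays;
import java.util.Comparator;

public class LowestFirst {
    // Reorder hits (each {docId, score}) so the lowest score comes first.
    static double[][] lowestFirst(double[][] hits) {
        Arrays.sort(hits, Comparator.comparingDouble(h -> h[1]));
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical hits, as a best-first search would return them.
        double[][] hits = { {7, 0.91}, {3, 0.55}, {12, 0.12} };
        double[][] reversed = lowestFirst(hits);
        System.out.println((int) reversed[0][0]); // 12: the lowest-scoring doc
    }
}
```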
Peter Keegan wrote:
I did some additional testing with Chris's patch and mine (based on Doug's
note) vs. no patch and found that all 3 produced the same throughput - about
330 qps - over a longer period.
Was CPU utilization 100%? If not, where do you think the bottleneck now
is? Network? Or
Michael Wechner wrote:
Maybe it would make sense to sort it alphabetically [ ... ]
+1 This should be sorted alphabetically by business name or last name.
That's what it says on the page, although a few entries are out of
place. Please feel free to fix this.
Doug
is basically the same, if not better!
If anyone is interested, let me know.
Doug Cutting [EMAIL PROTECTED] wrote:
RAMDirectory is indeed currently limited to 2GB. This would not be too
hard to fix. Please file a bug report. Better yet, attach a patch.
I assume you're running a 64bit JVM. If so
Erick Erickson wrote:
Could you point me to any explanation of *why* range queries expand this
way?
It's just what they do. They were contributed a long time ago, before
things like RangeFilter or ConstantScoreRangeQuery were written. The
latter are relatively recent additions to Lucene
Are you changing the default mergeFactor or other settings? If so, how?
Large mergeFactors are generally a bad idea: they don't make things
faster in the long run and they chew up file handles.
Are all searches reusing a single IndexReader? They should. This is
the other most common
Dawid Weiss wrote:
I get the concept implemented in PhraseQuery but isn't calling it an
edit distance a little bit far-fetched?
Yes, it should probably be called edit-distance-like or something.
Only the marginal elements
(minimum and maximum distance from their respective query positions)
RAMDirectory is indeed currently limited to 2GB. This would not be too
hard to fix. Please file a bug report. Better yet, attach a patch.
I assume you're running a 64bit JVM. If so, then MMapDirectory might
also work well for you.
Doug
z shalev wrote:
this is in continuation of a
WATHELET Thomas wrote:
I've created an index with Lucene version 1.9, and when I try to open
this index I always get this error message:
java.lang.ArrayIndexOutOfBoundsException.
If I use an index built with Lucene version 1.4.3, it works.
What's wrong?
Are you perhaps trying to open
Peter Keegan wrote:
I ran a query performance tester against 8-cpu and 16-cpu Xeon servers
(16/32 CPUs hyperthreaded) on Linux. Here are the results:
8-cpu: 275 qps
16-cpu: 305 qps
(the dual-core Opteron servers are still faster)
Here is the stack trace of 8 of the 16 query threads during the
Jeff Rodenburg wrote:
Following on the Range Query approach, how is performance? I found the
range approach (albeit with the exact values) to be slower than the
parsed-string approach I posited.
Note that Hoss suggested RangeFilter, not RangeQuery. Or perhaps
ConstantScoreRangeQuery, which
Release 1.9-final of Lucene is now available from:
http://www.apache.org/dyn/closer.cgi/lucene/java/
This release has many improvements since release 1.4.3, including new
features, performance improvements, bug fixes, etc. For details, see:
revati joshi wrote:
hi all,
I just wanted to know how to increase the speed of indexing files.
I tried a multithreading approach but couldn't get much better
performance; it was the same as the usual sequential indexing. Is there
any other approach to get better
Eric Jain wrote:
This gives you the number of documents containing the phrase, rather
than the number of occurrences of the phrase itself, but that may in
fact be good enough...
If you use a span query then you can get the actual number of phrase
instances.
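The distinction can be illustrated outside Lucene: a document containing the phrase twice contributes 1 to the document count but 2 to the occurrence count, and the latter is what a span query exposes. A plain-Java sketch (the text is hypothetical):

```java
public class PhraseCounts {
    // Count non-overlapping occurrences of a phrase within one document's text.
    static int occurrences(String text, String phrase) {
        int count = 0;
        for (int i = text.indexOf(phrase);
             i >= 0;
             i = text.indexOf(phrase, i + phrase.length())) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String doc = "open source search; open source wins";
        // One matching document, but two phrase instances.
        System.out.println(occurrences(doc, "open source")); // 2
    }
}
```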
Doug
Release 1.9 RC1 of Lucene is now available from:
http://www.apache.org/dyn/closer.cgi/lucene/java/
This release candidate has many improvements since release 1.4.3,
including new features, performance improvements, bug fixes, etc. For
details, see:
Trieschnigg, R.B. (Dolf) wrote:
I would like to implement the Okapi BM25 weighting function using my own
Similarity implementation. Unfortunately BM25 requires the document length in
the score calculation, which is not provided by the Scorer.
How do you want to measure document length? If
Sebastian Menge wrote:
Or, to put it more simply, what does a boost of 2 or 10 _mean_ in
contrast to a boost of 0.5 or 0.1 !?
Boosts are simply multiplied into scores. So they only mean something
in the context of the rest of the scoring mechanism.
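Since boosts simply multiply into scores, a boost of 2 doubles a clause's contribution relative to a boost of 1, and 0.5 halves it; the absolute number means nothing in isolation. A toy sketch (the raw score is hypothetical):

```java
public class BoostDemo {
    // Boosts are simply multiplied into scores.
    static double boosted(double rawScore, double boost) {
        return rawScore * boost;
    }

    public static void main(String[] args) {
        double raw = 0.4; // hypothetical unboosted clause score
        System.out.println(boosted(raw, 2.0) / boosted(raw, 1.0)); // 2.0: twice the weight
        System.out.println(boosted(raw, 0.5) / boosted(raw, 1.0)); // 0.5: half the weight
    }
}
```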
Daniel Pfeifer wrote:
We are sporting Solaris 10 on a Sun Fire-machine with four cores and
12GB of RAM and mirrored Ultra 320-disks. I guess I could try switching
to FSDirectory and hope for the best.
Or, since you're on a 64-bit platform, try MMapDirectory, which supports
greater parallelism
Daniel Pfeifer wrote:
Are we both talking about Lucene? I am using Lucene 1.4.3 and can't find
a class called MapDirectory or MMapDirectory.
It is post-1.4.
You can download a nightly build of the current trunk at:
http://cvs.apache.org/dist/lucene/java/nightly/
Doug
Doug Cutting wrote:
A 64-bit JVM with NioDirectory would really be optimal for this.
Oops. I meant MMapDirectory, not NioDirectory.
Doug
Peter Keegan wrote:
This is just fyi - in my stress tests on a 8-cpu box (that's 8 real cpus),
the maximum throughput occurred with just 4 query threads. The query
throughput decreased with fewer than 4 or greater than 4 query threads. The
entire index was most likely in the file system cache,
Klaus wrote:
I have tried to study the Lucene scoring in the default similarity. Can
anyone explain to me how this similarity was designed? I have read a lot of IR
literature, but I have never seen an equation like the one used in Lucene.
Why is this better than the normal cosine measure?
It
B-trees are best for random, incremental updates. They require
log_b(N) disk accesses for inserts, deletes and accesses, where b is the
number of entries per page and N is the total number of entries in the
tree. But that's too slow for text indexing. Rather, Lucene uses a
combination of
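Plugging in numbers shows the cost: with b = 128 entries per page and N = 10^9 entries, each B-tree operation needs about log_128(10^9) ≈ 4.3, i.e. up to 5, page accesses, and paying that per posting during bulk indexing is far slower than sequential writes and merges. A quick check (the values are illustrative):

```java
public class BTreeCost {
    // Disk accesses per B-tree operation: log base b of N.
    static double accesses(double b, double N) {
        return Math.log(N) / Math.log(b);
    }

    public static void main(String[] args) {
        double perOp = accesses(128, 1e9);
        System.out.println(perOp);                  // ~4.27
        System.out.println((int) Math.ceil(perOp)); // 5 page reads, worst case
    }
}
```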
J.J. Larrea wrote:
So... I notice that both IndexWriter.addIndexes(...) merge methods start
and end with calls to optimize() on the target index. I'm not sure
whether that is causing the unpacking and repacking I observe, but it
does make me wonder whether they truly need to be there:
I don't
Paul Elschot wrote:
Querying the host field like this in a web page index can be dangerous
business. For example when term1 is wikipedia and term2 is org,
the query will match at least all pages from wikipedia.org.
Note that if you search for wikipedia.org in Nutch this is interpreted
as an
Andrzej Bialecki wrote:
It's nice to have these couple percent... however, it doesn't solve the
main problem; I need 50 or more percent increase... :-) and I suspect
this can be achieved only by some radical changes in the way Nutch uses
Lucene. It seems the default query structure is too
Andrzej Bialecki wrote:
For a simple TermQuery, if the DF(term) is above 10%, the response time
from IndexSearcher.search() is around 400ms (repeatable, after warm-up).
For such complex phrase queries the response time is around 1 sec or
more (again, after warm-up).
Are you specifying
IndexReader locks the index while opening it to prohibit an IndexWriter
from deleting any of the files in that index until all are opened.
Lock files are not stored in the index directory since write access to
an index should not be required to lock it while opening an IndexReader.
Doug
Daniel Noll wrote:
I actually did throw a lot of terms in, and eventually chose one for
the tests because it was the slowest query to complete of them all
(hence I figured it was already spending some fairly long time in I/O,
and would be penalised the most.) Every other query was around 7ms
Daniel Noll wrote:
Timings were obtained by performing the same search 1,000 times and
averaging the total time. This was then performed five times in a row
to get the range that's displayed below. Memory usage was obtained
using a 20-second sleep after loading the index, and then using the
Greg K wrote:
Now, however, I'd like to be able restrict the search to certain documents
in the index, so I don't have to stream through a couple of thousand spans
to produce the 10 excerpts on a subset of the documents.
I've tried added a term to the SpanNearQueries that targets a keyword
Marvin Humphrey wrote:
You *can't* set it on the reader end. If you could set it, the reader
would get out of sync and break. The value is set per-segment at write
time, and the reader has to be able to adapt on the fly.
It would actually not be too hard to change things so that there was
Chris Hostetter wrote:
: One thing that I know has bogged me is when matching a phrase where I
: would expect mathematical formula (which is just a subphrase). I
: would have liked the phrase-query to extend as far as it wishes but not
: passed a given token... would this be possible ?
:
Marc Hadfield wrote:
In the SpanNear (or for that matter PhraseQuery), one can set a slop
value where 0 (zero) means one following after the other.
How can one differentiate between Terms at the **same** position vs. one
after the other?
The following queries only match x and y at the same
Marc Hadfield wrote:
I actually mention your option in my email:
In principle I could store the full text in two fields with the second
field containing the types without incrementing the token index.
Then, do a SpanQuery for Johnson and name with a distance of 0.
The resulting match
Marc Hadfield wrote:
I'll give span queries a try, as they can handle the 0-increment issue.
Note that PhraseQuery can now handle this too.
Doug
Peter Kim wrote:
I noticed one way to get around this is to use IndexReader.isDeleted()
to check if it's deleted or not. The problem with that is I only have
access to a MultiSearcher in my HitCollector which doesn't give me
access to the underlying IndexReader. I don't want to have to open an
Palmer, Andrew MMI Woking wrote:
I am looking at changing the value BufferedIndexOutput.BUFFER_SIZE from
1024 to maybe 8192. Has anyone done anything similar, and did they get
any performance improvements?
I doubt this will speed things much.
Generally I am looking to reduce the time it
Dawid Weiss wrote:
I have a very technical question. I need to alter document score (or in
fact: document boosts) for an existing index, but for each query. In
other words, I'd like these to have pseudo-queries of the form:
1. civil war PREFER:shorter
2. civil war PREFER:longer
for these two
Jeff Rodenburg wrote:
My suggestion to you: pick up a copy of Lucene in Action. [ ...]
The authors lurk on this list.
They're pretty chatty for lurkers.
http://en.wikipedia.org/wiki/Lurker
But good advice nonetheless!
Cheers,
Doug
Tony Schwartz wrote:
I think you're jumping into the conversation too late. What you have said here
does not
address the problem at hand. That is, in TermInfosReader, all terms in the
segment get
loaded into three very large arrays.
That's not true. Only 1/128th of the terms are loaded by
Fredrik wrote:
Opening the index with Luke, I can see the following:
Number of fields: 17
Number of documents: 1165726
Number of terms: 6721726
The size of the index is approx. 5.3 GB.
Lucene version is 1.4.3.
The index contains Norwegian terms, but lots of inline HTML, etc
is probably
Chris D wrote:
Well in my case field order is important, but the order of the
individual fields isn't. So I can speed up getFields to roughly O(1)
by implementing Document as follows.
Have you actually found getFields to be a performance bottleneck in your
application? I'd be surprised if it
Tony Schwartz wrote:
What about the TermInfosReader class? It appears to read the entire term set
for the
segment into 3 arrays. Am I seeing double on this one?
p.s. I am looking at the current sources.
see TermInfosReader.ensureIndexIsRead();
The index only has 1/128 of the terms, by
Ali Rouhi wrote:
I can think of 3 reasons why search methods returning Hits objects
are not exposed in Searchable:
1) Someone forgot to declare Hits Serializable
2) There is a fundamental reason the forms of search which return Hits
objects cannot be called remotely, some non optimal form of
Lokesh Bajaj wrote:
For a very large index where we might want to delete/replace some documents,
this would require a lot of memory (for 100 million documents, this would need
381 MB of memory). Is there any reason why this was implemented this way?
In practice this has not been an issue. A
The method Similarity.queryNorm() normalizes query term weights. To
disable this you could define it to return 1.0 in your own Similarity
implementation.
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#queryNorm(float)
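The default query normalization is 1/sqrt(sum of squared term weights), so an override returning 1.0 disables the scaling. The arithmetic can be sketched in plain Java (the weights are hypothetical; this stands in for a Similarity subclass):

```java
public class QueryNormDemo {
    // Default-style query normalization: 1 / sqrt(sum of squared term weights).
    static double defaultQueryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    // A Similarity overriding queryNorm to disable normalization returns 1.
    static double disabledQueryNorm(double sumOfSquaredWeights) {
        return 1.0;
    }

    public static void main(String[] args) {
        double sumSq = 4.0; // e.g. two term weights, each sqrt(2)
        System.out.println(defaultQueryNorm(sumSq));  // 0.5
        System.out.println(disabledQueryNorm(sumSq)); // 1.0
    }
}
```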
Doug
Robichaud, Jean-Philippe wrote:
Tansley, Robert wrote:
What if we're trying to index multiple languages in the same site? Is
it best to have:
1/ one index for all languages
2/ one index for all languages, with an extra language field so searches
can be constrained to a particular language
3/ separate indices for each
Sebastian Marius Kirsch wrote:
I took up your suggestion to use a ParallelReader for adding more
fields to existing documents. I now have two indexes with the same
number of documents, but different fields.
Does search work using the ParallelReader?
One field is duplicated
(the id field.)
Fred Toth wrote:
I'm thinking we need something like HTMLTokenizer which bridges the
gap between StandardAnalyzer and an external HTML parser. Since so
many of us are dealing with HTML, I would think this would be generally
useful for many problems. It could work this way:
Given this input:
Matt Quail wrote:
I have a similar problem, for which ParallelReader looks like a good
solution -- except for the problem of creating a set of indices with
matching document numbers.
I have wondered about this as well. Are there any *sure-fire* ways of
creating (and updating) two indices so
Steven J. Owens wrote:
A friend just asked me for advice about synchronizing lucene
indexes across a very large number of servers. I haven't really
delved that deeply into this sort of stuff, but I've seen a variety of
comments here about similar topics. Are there any well-known
Chris Lamprecht wrote:
I've done exactly what you describe, using N threads where N is the
number of processors on the machine, plus one more thread that writes
to the file system index (since that is I/O-bound anyway). Since most
of the CPU time is tokenizing/stemming/etc, the method works well.
Chuck Williams wrote:
I found this to be a problem as well and created
alternative classes, DistributedMultiFieldQueryParser and
MaxDisjunctionQuery, which are available here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=32674
You might check these out and see if they provide the ranking
Paul Smith wrote:
So it sounds like there isn't a perfect solution, but I think the best
tradeoff for me is to put them all in the same position unless
anyone has more input on the subject?
If they're all at the same position you can still use slop to match the
phrase. So if 'power', 'query'
Paul Libbrecht wrote:
I am currently evaluating the need for an elaborate query data-structure
(to be exchanged over XML-RPC) as opposed to working with plain strings.
I'd opt for both. For example:
search
boolean-query
required
query-parser analyzer=...java based
Roy Klein wrote:
I think this is a better way of asking my original questions:
Why was this designed this way?
In order to optimize updates.
Can it be changed to optimize updates?
Updates are fastest when additions and deletions are separately batched.
That is the design.
Doug
Chuck Williams wrote:
index.setMaxBufferedDocs(10); // Buffer 10 documents at a time
in memory (they could be big)
You might use a larger value here for the index with the small
documents. I've successfully used values as high as 1000 when indexing
documents that average a few
Antony Sequeira wrote:
A user does a search for, say, condominium, and I show him the 50,000
properties that meet that description.
I need two other pieces of information for display -
1. I want to show a select box on the UI, which contains all the
cities that appear in those 50,000 documents
2.
Omar Didi wrote:
I have a large index (100 GB), and when I run the following code:
String indexLocation = servlet.getServletContext().getInitParameter(
    "com.lucene.index" );
logger.log( Level.INFO, "got the index location from: " + indexLocation );
searcher = new IndexSearcher(indexLocation);
Daniel Naber wrote:
If that doesn't help: are you sure
you're using Lucene the right way, e.g. having only one
IndexReader/Searcher and using it for all searches?
That's my first suggestion too. Memory consumption should not primarily
grow per query, rather per IndexSearcher. You're seeing
Chuck Williams wrote:
If there is going to
be any generalization to built-in sorting representations, I'd like to
suggest two things be included:
1. Fix issue 34028 (delete the one word final)
Done.
2. Include a provision for query-time parameters
Can you provide a proposal?
Doug
sergiu gordea wrote:
So .. here is an example of how I parse a simple query string provided
by a user ...
the user checks a few flags and writes test ko AND NOT bo
and the resulting query.toString() is saved in the database:
+(+(subject:test description:test keywordsTerms:test koProperties:test