Hmmm... I never use range queries, but that "" part looks suspicious.
Sorry, can't help more now, maybe somebody else will have the answer for you.
Sit tight.
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share
John,
Look at the coord() Similarity method. That may help you solve the problem where, e.g., a partial match for
"Nissan Altima Sports Package" will be
the #1 hit even though there was an exact document matching every term.
Otis
Simpy.com has a similar setup. You have to be careful about open files, making
sure you don't run out of open file descriptors. You'll also want to minimize
IndexReader/Searcher/Writer open/close as much as you can. The good side of
this setup is that searches go against small indices, you do
Thanks for your reply Otis,
wquery.toString() returns westbc:[* TO ]
query.toString() returns westbc:[* TO ]
If I compare these two strings for equality like
wquery.toString().equals(query.toString()) I get true. I also got bytes of
those strings and compared them bytewise - they are
> Also, I don't understand why the encode/decode functions have a range of
7x10^9 to 2x10^-9, when it seems to me the most common values are (with boosts
set to 1.0) something between 1.0 and 0. When would somebody have a monster
huge value like 7x10^9? Even with a huge index-time boost of 20.0 or
s
Thank you kindly for the responses.
This was the solution that I dreamed up initially as well: overriding
lengthNorm and making the returned values for small numTerms values (e.g. 3
and 4) more distinct.
So I did that in multiple ways, and I ran into a different problem. If
lengthNorm returns
Wow. Thanks Erick! So I guess the issue isn't with the test code...
I wonder what kind of environmental problem I could have? I am also running on
XP with JDK 1.5, Lucene 2.1, default memory and gc... The queries I am running
are a bit more complex, and return 0-10,000 hits. When I close and re-
Having to put a counter in and close/open your searcher should not
be necessary. I'm afraid I'm not going to be very helpful, because
I took your test case and made some very minor modifications
to make it run in an environment I happen to have lying around
(mostly, just instantiated the Quer
My approach to dealing with these kinds of issues (which has worked well for
me thus far) is:
- Run java with -XX:+HeapDumpOnOutOfMemoryError command-line option
- use jhat to inspect the heap dump, like so:
$ /usr/java/jdk1.6/bin/jhat ./java_pid1347.hprof
jhat will take a while to parse the hea
So, forgetting the RMI stuff, I put together a test client very similar to the
one in the book "Lucene in Action" page 182.
The client:
1. instantiates a IndexSearcher
2. loops through queries, searches, prints hit count, saves nothing
I am only able to run through about 40 searches before I
What's the aggregate size of all your user indexes? And how many
servers could you potentially spread the load across? What kind
of queries do you allow? wildcards? simple term? Arbitrary Boolean
expressions? What kind of throughput are you expecting?
Opening and closing a reader for each search
Hi,
I am currently working on indexing the documents present in a web-based
document management system. The system currently has around 200,000 users and
each user has approximately 10 to 100 documents. We currently have around 50 GB
of data. The system should allow the users only to search a
On 4/5/07, moraleslos <[EMAIL PROTECTED]> wrote:
something specific as this, or are there better algorithms and/or software
out there that does name matching. Thanks in advance!
Approximate string matching is an active research field. There are
many systems that implement different algorithms t
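To make the idea concrete, here is a small self-contained sketch (my own illustration, not code from any of the systems mentioned) of two common baselines for name matching: plain Levenshtein edit distance, plus a token-sort normalization so "Smith John" lines up with "John Smith":

```java
import java.util.Arrays;

public class NameMatch {
    // Classic dynamic-programming edit distance, two rolling rows.
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Normalize word order before comparing, so "Smith John" == "John Smith".
    public static String sortTokens(String name) {
        String[] parts = name.toLowerCase().split("\\s+");
        Arrays.sort(parts);
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("Smith", "Smit"));                      // small distance = likely match
        System.out.println(sortTokens("Smith John").equals(sortTokens("John Smith")));
    }
}
```

In practice you would combine the two: normalize token order first, then allow a small edit distance per token.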
See below
On 4/5/07, Ryan O'Hara <[EMAIL PROTECTED]> wrote:
Hey Erick,
Thanks for the quick response. I need a truly exact match. What I
ended up doing was using a TOKENIZED field, but altering the
StandardAnalyzer's stop word list to include only the word/letter
'a'. Below is my searching
1) which version of FunctionQuery are you using (from the solr repository
or from a Jira issue attachment?)
2) what is the full stacktrace? (ie: which function/line is throwing the
Exception)
FunctionQuery supports explain just fine, not sure why you'd have
problems, oh wait ... I see exactly wha
: The problem comes when your float value is encoded into that 8 bit
: field norm, the 3 length and 4 length both become the same 8 bit
: value. Call Similarity.encodeNorm on the values you calculate for the
: different numbers of terms and make sure they return different byte
: values.
bingo.
: I am new to Lucene. I find that the output
: of the Query.toString() method cannot be parsed
: back to the same query. Is it true? If it is
: true, I am wondering why not make the output of
: Query.toString() parsable to the same query again?
some of the more simplified query classes generate a
Hey Erick,
Thanks for the quick response. I need a truly exact match. What I
ended up doing was using a TOKENIZED field, but altering the
StandardAnalyzer's stop word list to include only the word/letter
'a'. Below is my searching code:
String[] stopWords = {"a"};
StandardA
Hi Grant!
Thanks for the reply. I'll look into the links you suggested. Just curious
though, what did you do to implement this--if you can spill some of the
beans ;-) You think what you did was better than the FuzzyQuery approach?
Was it a custom algorithm or did you utilize some framework f
It's like deja vu all over again. I literally just finished up a
similar task (about 2 hours ago). I didn't use Lucene for it,
although I suppose I could have. Lucene does have the FuzzyQuery
(http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/
The problem comes when your float value is encoded into that 8 bit
field norm, the 3 length and 4 length both become the same 8 bit
value. Call Similarity.encodeNorm on the values you calculate for the
different numbers of terms and make sure they return different byte
values.
Andrew
On 4/5/07,
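A rough way to see the collapse Andrew describes, without Lucene on the classpath: quantize a float down to its exponent plus three mantissa bits, which is in the spirit of (but NOT identical to) the real Similarity.encodeNorm codec. Two distinct lengthNorm values can land on the same quantized value:

```java
public class NormQuantizeDemo {
    // Keep the sign, exponent, and top 3 mantissa bits of a float: a rough
    // stand-in for the precision of an 8-bit norm encoding (an illustration
    // of the effect, not Lucene's actual byte format).
    public static int quantize(float f) {
        return Float.floatToIntBits(f) >>> 20;
    }

    public static void main(String[] args) {
        float norm9  = (float) (1.0 / Math.sqrt(9));   // 0.3333...
        float norm10 = (float) (1.0 / Math.sqrt(10));  // 0.3162...
        // Distinct floats, same quantized value:
        System.out.println(quantize(norm9) == quantize(norm10)); // prints true
    }
}
```

With the real codec the same check is one call per value: encode both norms with Similarity.encodeNorm and compare the resulting bytes.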
I was wondering if anyone has done people name matching using Lucene. For
example, I have a name coming from some external source that I would like to
match with the one I have in my DB. Let's say my DB contains the name "John
Smith". If the external source has something like "Smith John", "Smit
Yes, you can search on UN_TOKENIZED fields, but they're exact,
really, really exact.
I'd recommend that you get a copy of Luke (google lucene luke) and
examine your index to see what you actually have in your index.
Also, you haven't provided us a clue what the actual query is. I'd
use Query.to
Hey,
I was just wondering if you are supposed to be able to search on
UN_TOKENIZED fields? It seems like you can from the docs, but I have
been unsuccessful. I want to do exact string matching on a certain
field without analyzer interference.
Thanks,
Ryan
--
As far as I know, this is the case where you want your custom Similarity that
knows how to deal with a small number of terms.
public float lengthNorm(String fieldName, int numTerms) {
    if (numTerms < N) {
        // return something smart
    }
    return (float) (1.0 / Math.sqrt(numTerms));
}
I thi
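For what it's worth, one hypothetical way to fill in the "return something smart" branch (the cutoff N and the step values here are made up for illustration) is to hand out norms for short fields that are spaced widely enough apart to survive the coarse 8-bit encoding:

```java
public class ShortFieldNorm {
    static final int N = 5; // hypothetical cutoff for "short" fields

    // Sketch of a custom lengthNorm: for very short fields, use widely spaced
    // linear steps so 3 terms and 4 terms stay distinguishable even after
    // coarse quantization; fall back to the standard 1/sqrt elsewhere.
    public static float lengthNorm(int numTerms) {
        if (numTerms < N) {
            return 2.0f - 0.25f * numTerms; // 1.75, 1.5, 1.25, 1.0 for 1..4 terms
        }
        return (float) (1.0 / Math.sqrt(numTerms));
    }
}
```

The gap between adjacent short-field values here is 0.25, versus about 0.08 between 1/sqrt(3) and 1/sqrt(4), which gives the encoder room to keep them as different bytes.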
It is the right forum; silence just means either no one knows the
answer or no one who knows the answer has read it... Such is the
nature of the community.
Have you looked at overriding similarity with your own
implementation? Have you done explain() calls on the docs to see
where the s
You can't really rely on Query.toString() to produce a valid query identical to
the query in that Query instance. Are you sure both produce the same query
string? You didn't include that.
Otis
Sorry to re-post -- is this the correct forum for questions like this? I
think that writing a new encode/decode operation should help alleviate my
problem, but thought that this must be a fairly widespread issue for people
using Lucene for "non-web-page" searches (i.e., shorter documents)
Thanks a
Hello everybody,
I need to index and search real numbers in Lucene. I found the NumberUtils class
in the Solr project, which permits one to encode doubles into strings so that
alphanumeric ordering correctly corresponds to the ordering on the
numbers. When I use ConstantScoreRangeQuery programmatically e
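The trick such an encoder relies on can be sketched in a few lines (this is the general idea, not Solr's actual NumberUtils format): remap the double's bit pattern so unsigned bit order equals numeric order, then print it as fixed-width hex so string order equals numeric order:

```java
public class SortableDouble {
    // Map a double to a fixed-width hex string whose lexicographic order
    // matches numeric order (a sketch of the idea, not Solr's real format).
    public static String encode(double d) {
        long bits = Double.doubleToLongBits(d);
        // Negative doubles have the sign bit set and compare "backwards" as
        // raw bit patterns; flip all bits for negatives and only the sign bit
        // for positives so unsigned bit order equals numeric order.
        bits = (bits < 0) ? ~bits : bits ^ Long.MIN_VALUE;
        return String.format("%016x", bits); // fixed width, lowercase hex
    }

    public static void main(String[] args) {
        // String comparison now agrees with numeric comparison:
        System.out.println(encode(-1.0).compareTo(encode(1.5)) < 0); // prints true
    }
}
```

Because every encoded value has the same width and alphabet, a lexicographic range query over these strings behaves like a numeric range query.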
Lucene has no built-in recognition of anything. You have to parse
the header and index the relevant bits as you need to.
There are projects *based* upon lucene that do web crawls that
you might want to look into, Nutch comes to mind.
Erick
On 4/5/07, Developer Developer <[EMAIL PROTECTED]> wrot
I am using wget to download content from the web with the --save-headers option.
The --save-headers option saves the HTTP header to the downloaded files.
Does Lucene make use of the content type while indexing, or do I have to parse
the header, determine the content-type, and determine the right set of
action
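Lucene itself won't look at the header, but pulling the content type out of a file saved with --save-headers is straightforward, since the HTTP header block comes first and is terminated by a blank line. A hypothetical helper (names are my own):

```java
import java.util.Arrays;
import java.util.List;

public class HeaderSniff {
    // Scan the leading HTTP header block of a wget --save-headers file and
    // return the Content-Type value, or null if none is present.
    public static String contentType(List<String> lines) {
        for (String line : lines) {
            if (line.isEmpty()) break; // blank line ends the header block
            int colon = line.indexOf(':');
            if (colon > 0 && line.substring(0, colon).equalsIgnoreCase("Content-Type")) {
                return line.substring(colon + 1).trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> saved = Arrays.asList(
                "HTTP/1.1 200 OK",
                "Content-Type: text/html; charset=UTF-8",
                "",
                "<html>body starts here</html>");
        System.out.println(contentType(saved)); // prints text/html; charset=UTF-8
    }
}
```

You could then branch on the returned type to pick the right parser (HTML stripper, PDF extractor, etc.) before handing the text to Lucene.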
I'm running lukeall-0.7.jar.
In the Search tab, when I try to use the SnowballAnalyzer with the name
"German" for a query,
I receive the message
"java.lang.ClassNotFound:
net.sf.snowball.ext.GermansStemmer".
In the Plugins tab, when I'm using the SnowballAnalyzer, I get
"Couldn't instantiate
Well,
Philipp and Ronnie, thank you very much indeed
--
Regards,
Mohammad
As long as there are no deletions, the ids will remain unchanged and
it is safe to use them outside.
But in a case where you delete some document, the resulting gap in the
document list will be filled during the next optimize (triggered
manually) or merge operation (may be triggered automatically
Thanks Philipp
2007/4/5, Philipp Nanz <[EMAIL PROTECTED]>:
> That *is* the actual id in the index. There is no other.
> You should be careful using it outside of Lucene though, because
> Lucene may rearrange the document ids during optimization for example.
>
> If you need an application id, ad
Ahh, now I know what you mean...
Forget the above :-)
Use result.id( i )
2007/4/5, Philipp Nanz <[EMAIL PROTECTED]>:
That *is* the actual id in the index. There is no other.
You should be careful using it outside of Lucene though, because
Lucene may rearrange the document ids during optimizati
It's in the FAQ:
http://wiki.apache.org/lucene-java/LuceneFAQ#head-e1de2630fe33fb6eb6733747a5bf870f600e1b4c
Mohammad Norouzi wrote:
but the question is, if I add, say, a document to my index, is Lucene going
to rearrange the internal IDs? Can't I trust them?
Would you tell me in exactly which
Hi
I need the id of the document that is returned by Hits as a result of a query.
Hits result = searchable.find(myQuery);
now I need something like result.getId()
is there any way to get it?
Thanks so much
--
Regards,
Mohammad Norouzi
That *is* the actual id in the index. There is no other.
You should be careful using it outside of Lucene though, because
Lucene may rearrange the document ids during optimization for example.
If you need an application id, add it as an additional stored field to
each document and retrieve that.
sorry to correct my answer:
I need something like this: result.doc( i ).getId();
this id from the result (the i ) starts from 1, but I need the actual id
in the index.
On 4/5/07, Mohammad Norouzi <[EMAIL PROTECTED]> wrote:
Hi
I need the id of the document that is returned by Hits as a result
deja vu ... didn't someone else just ask about "tolerant" query
parsing, and then follow up that they had found this suggestion from past
me...
http://www.nabble.com/Error-tolerant-query-parsing-tf108987.html
...inspecting the ParseException should allow you to do all sorts of
iterative "fixi