A couple of thoughts here...
You could hash (e.g. MD5) all the documents in your index and eliminate
duplicates that way. Just pick one of the docs in each hash bucket as
the non-duplicate document and then delete the other dups. This could be
run as a batch job to eliminate the duplicates off-line
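A minimal stdlib-only sketch of that batch job's core idea. The Lucene index iteration and the actual delete calls are left out; `Dedup`, `findDuplicates`, and the plain-string doc ids are illustrative stand-ins, not Lucene API:

```java
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hash each document's text, bucket by hash, keep one doc per bucket
// and mark the rest for deletion.
public class Dedup {

    // Hex-encoded MD5 of a document's text.
    public static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes("UTF-8"));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Returns the ids that are duplicates: everything after the first
    // doc seen in each hash bucket.
    public static List<String> findDuplicates(Map<String, String> idToText) {
        Map<String, String> seen = new HashMap<String, String>(); // hash -> first id kept
        List<String> dups = new ArrayList<String>();
        for (Map.Entry<String, String> e : idToText.entrySet()) {
            String h = md5Hex(e.getValue());
            if (seen.containsKey(h)) {
                dups.add(e.getKey());    // same hash bucket: a duplicate
            } else {
                seen.put(h, e.getKey()); // keep this one as the non-dup
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new HashMap<String, String>();
        docs.put("doc1", "the quick brown fox");
        docs.put("doc2", "the quick brown fox"); // exact duplicate
        docs.put("doc3", "something else");
        System.out.println(findDuplicates(docs)); // one of doc1/doc2
    }
}
```

Note this only catches byte-identical text; near-duplicates need something like the Nutch approach mentioned later in the thread.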
Where shall I post this issue?
You are currently posting to a list named java-user; this is for user-related
questions about the Java Lucene project.
If you have questions about Lucene.Net you should be asking them on the
Lucene.Net user list:
http://incubator.apache.org/lucene.net/
Hey Lukas,
I was being simplistic when I said that the text and TokenStream must be
exactly the same. It's difficult to think of a reason why you would not
want them to be the same, though. Each Token records the offsets where it
can be found in the original text -- that is how the Highlighter
I believe Nutch has a duplicate detection algorithm. I don't know
how easy it would be to run independently on a Lucene index.
-Grant
On Jul 29, 2007, at 2:18 AM, Dmitry wrote:
We are trying to find whether there is any implementation for Lucene for
detecting index duplicates.
Assuming we have a set of
On 30 Jul 2007, at 14:43, Grant Ingersoll wrote:
I believe Nutch has a duplicate detection algorithm. I don't know
how easy it would be to run independently on a Lucene index.
There have also been a bunch of near-duplicate ideas that have been
presented on the forums before.
This is one of
Hi guys,
Do you think LUCENE-843 is stable enough? If so, do you think it's worth
releasing it with, say, LUCENE 2.2.1? It would be nice so that people could
take advantage of it right away without risking other breaking changes
in the HEAD branch or waiting for the 2.3 release.
Thanks,
--
Hi everyone,
I told you I'd be back with more questions! :-)
Here is my situation. In my application, the field to be searched is
selected via a drop-down box. I want my searches to basically be "contains"
searches - I take what the user typed in and put a wildcard character at the
beginning and end,
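A tiny illustration of that wrapping step, assuming Lucene's query syntax where `*` is the wildcard metacharacter. The helper name is made up, and real code would usually also escape metacharacters already present in the user's input (e.g. via QueryParser.escape):

```java
public class ContainsQuery {
    // Wrap user input in leading/trailing wildcards to get a
    // "contains" style pattern, e.g. "old" -> "*old*".
    public static String toContainsPattern(String userInput) {
        return "*" + userInput.trim() + "*";
    }

    public static void main(String[] args) {
        System.out.println(toContainsPattern("old")); // *old*
    }
}
```

Keep in mind that a leading wildcard is expensive in Lucene, since it cannot use the term index to narrow the candidate terms.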
Hello,
Hi everyone,
I told you I'd be back with more questions! :-)
Here is my situation. In my application, the field to be searched is
selected via a drop-down box. I want my searches to basically be contains
searches - I take what the user typed in, put a wildcard character at the
You might want to search the mail archive for facets or faceted search
(no quotes), as I *think* this might be relevant.
Best
Erick
On 7/26/07, Ramana Jelda [EMAIL PROTECTED] wrote:
Hi ,
Of course this statement is very expensive.
--document.get(CAMPCATID)==null?:document.get(CAMPCATID);
I've built a production index with this patch and done some query stress
testing with no problems.
I'd give it a thumbs up.
Peter
On 7/30/07, testn [EMAIL PROTECTED] wrote:
Hi guys,
Do you think LUCENE-843 is stable enough? If so, do you think it's worth to
release it with probably LUCENE
See IndexWriter.setMaxFieldLength(). 87,300 is odd, since the default
max field length, last I knew, was 10,000. But this sounds like
it might relate to your issue.
Best
Erick
On 7/27/07, Eduardo Botelho [EMAIL PROTECTED] wrote:
Hi guys,
I would like to know if there exists some limit on the size of
It does sound very strange to me to default to a WildcardQuery! Suppose I
am looking for bold: I am getting hits for old.
I know - but that's what the requirements dictate. A better example might be
a MAC or IP address, where someone might be searching for a string in the
middle - like, I
Hey Jeff, I didn't have any luck; I don't think your approach is going to
help me, but thanks for the reply. I'll try a solution that avoids
this kind of problem.
[]s
Rossini
On 7/29/07, Jeff French [EMAIL PROTECTED] wrote:
Rossini, have you had any luck with this? I don't know if
Following up on my recent question. It has been suggested to me that I can
run the query text through an Analyzer without using the QueryParser. For
example, if I know which field is to be searched I can create a PrefixQuery or
WildcardQuery, but still want to process the search text with the same
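For illustration only, the analysis step can be approximated in plain Java: lowercase and whitespace-split the raw query text the way a simple analyzer chain would, then feed each surviving token into a PrefixQuery or WildcardQuery (the Lucene calls are omitted here; a real analyzer may also strip stop words, stem, etc.):

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleAnalyze {
    // Rough stand-in for a lowercase + whitespace analyzer chain.
    public static List<String> tokens(String text) {
        List<String> out = new ArrayList<String>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (t.length() > 0) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Each token would then become e.g. a PrefixQuery on the chosen field.
        System.out.println(tokens("Quick  Brown Fox")); // [quick, brown, fox]
    }
}
```

The important point (echoed later in the thread) is that the query text must be processed the same way the field was processed at index time, or the terms won't match.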
Or check out Solr and see if you can use that, or see how they do it,
Regards Ard
You might want to search the mail archive for facets or faceted search
(no quotes), as I *think* this might be relevant.
Best
Erick
On 7/26/07, Ramana Jelda [EMAIL PROTECTED] wrote:
Hi ,
Of
It does sound very strange to me, to default to a WildCardQuery! Suppose I
am looking for bold, I am getting hits for old.
I know - but that's what the requirements dictate. A better example might be
a MAC or IP address, where someone might be searching for a string in the
middle -
We found that a fast way to do this is simply to run a query for each
category and get the maxDocs. There would be one query per category,
each getting a single hit.
Dennis Kubes
Erick Erickson wrote:
You might want to search the mail archive for facets or faceted search
(no quotes), as I
Hi,
I am getting different results for the following queries:
1. ABST:spring-elastic^3 AND SPEC:internal combustion^2 OR ABST:cylinder^3
2. SPEC:internal combustion^2 AND ABST:spring-elastic^3 OR ABST:cylinder^3
I think the above two queries are equivalent and should give the same results.
I have two questions.
First, is there a tokenizer that takes every word and simply makes a token
out of it? So it looks for two whitespace characters and takes the characters
between them and makes a token out of them?
If this tokenizer exists, is there a difference between doing that and
simply storing
Hello,
I have two questions.
First, is there a tokenizer that takes every word and simply makes a token
out of it?
org.apache.lucene.analysis.WhitespaceTokenizer
So it looks for two white spaces and takes the characters
between them and makes a token out of them?
If this tokenizer
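As a rough illustration of what WhitespaceTokenizer does, here is plain Java mimicking the behavior (not the Lucene class itself): split on runs of whitespace only, leaving case and punctuation intact:

```java
import java.util.Arrays;

public class WhitespaceDemo {
    // Split on runs of whitespace only; punctuation and case survive,
    // which is the key difference from e.g. StandardTokenizer.
    public static String[] split(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("Hello, spring-elastic  World")));
        // [Hello,, spring-elastic, World]
    }
}
```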
Check this out:
http://www.gossamer-threads.com/lists/lucene/java-user/35433?search_string=category;#35433
On 7/30/07, Dennis Kubes [EMAIL PROTECTED] wrote:
We found that a fast way to do this simply by running a query for each
category and getting the maxDocs. There would be one query for
Yeah, it's a surprise, isn't it? I'm afraid there isn't a good answer.
http://wiki.apache.org/lucene-java/BooleanQuerySyntax
The best practice appears to be to require parens everywhere to force the
evaluation order - e.g. (ABST:spring-elastic^3 AND SPEC:internal combustion^2)
OR ABST:cylinder^3. Not very satisfying, but it does work 100%.
-Original Message-
From:
Would this work?
TokenStream ts = new StandardAnalyzer().tokenStream("field", new StringReader(text));
Token tok;
while ((tok = ts.next()) != null) {
    // do whatever with tok
}
Best
Erick
On 7/30/07, Joe Attardi [EMAIL PROTECTED] wrote:
Following up on my recent question. It has been suggested to me that I can
run the query text through an
So then would I just concatenate the tokens together to form the query text?
--
Joe Attardi
[EMAIL PROTECTED]
http://thinksincode.blogspot.com/
On 7/30/07, Erick Erickson [EMAIL PROTECTED] wrote:
Would this work?
TokenStream ts = StandardAnalyzer.tokenStream();
while ((Token tok =
So then would I just concatenate the tokens together to form
the query text?
You might better create a TermQuery for each token instead of concatenating,
and combine them in a BooleanQuery, saying whether all terms must or should
occur. Very simple, see [1]
Regards Ard
[1]
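A sketch of that suggestion against the Lucene API of this era (circa 2.x; the field name "contents" is an assumption, and this needs the Lucene jar on the classpath):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class TokenQueryBuilder {
    public static BooleanQuery build(String[] tokens, boolean allRequired) {
        BooleanQuery bq = new BooleanQuery();
        for (String tok : tokens) {
            // One TermQuery per analyzed token, combined with MUST or
            // SHOULD depending on whether every term has to occur.
            bq.add(new TermQuery(new Term("contents", tok)),
                   allRequired ? BooleanClause.Occur.MUST
                               : BooleanClause.Occur.SHOULD);
        }
        return bq;
    }
}
```

Building Query objects this way sidesteps QueryParser's escaping and precedence quirks entirely, which is why it is usually preferred over concatenating query strings.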
Hi,
I am not a French speaker, but here are some questions regarding the
French analyzer:
Is there any analyzer that can do this? Analyze accented letters into their
non-accented counterparts (é, è, ê, ë -> e), so that
a search for fenêtre (= window) finds all docs with fenêtre or fenetre,
and
a search
Gosh, I sure hope not, because that would mean that we rolled our
own for no good reason. We wound up just collapsing
the input stream by substituting plain old 'e' for all the accented
variants before indexing and before searching. Be *really* careful
what character set you're using.
Actually,
Hi,
Take a look at the class ISOLatin1AccentFilter! Add it to your analyzer
and it should work!
Hope this helps,
Samir
-Original Message-
From: Chris Lu [mailto:[EMAIL PROTECTED]
Sent: Monday, 30 July 2007 20:06
To: java-user@lucene.apache.org
Subject: a question for
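Not the Lucene filter itself, but its effect can be approximated in plain Java with java.text.Normalizer (Java 6+): decompose to NFD, then strip the combining marks. Note this does not cover ligatures such as œ, which ISOLatin1AccentFilter handles specially (see the special cases mentioned later in the thread):

```java
import java.text.Normalizer;

public class AccentStrip {
    // Approximate accent folding: é, è, ê, ë -> e, etc.
    public static String strip(String s) {
        // NFD splits accented letters into base letter + combining mark.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // Remove the combining diacritical marks left by decomposition.
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("fenêtre")); // fenetre
    }
}
```

As with any such filter, it has to be applied both at index time and at query time, or the terms won't line up.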
I have a set of tags associated with content in my corpus. I also have
normal text. Our system tries to figure out which words are tags and
which are text, and falls back on text when tags fail. I'm wondering,
is there anything in Lucene which might help disambiguate multi-word
tags from text?
Thanks for the reply Erick,
I believe it is the gc for four reasons:
- I've tried the warmup approach already and it didn't change the
situation.
- The server completely pauses for several seconds. I ran jstack to find
out where the pause was, and it also pauses for several seconds before
I believe there is an issue in JIRA about reopening an IndexReader
without reopening segments that have not changed.
On 7/30/07, Tim Sturge [EMAIL PROTECTED] wrote:
Thanks for the reply Erick,
I believe it is the gc for four reasons:
- I've tried the warmup approach already and it
And by the way, I cannot see it ever making sense to keep reopening an index
reader every second or so. It has to be MUCH more efficient to wait even
2 or 4 seconds... even that is going to be pretty nasty, but you have
to allow for a bit of batching, man. You will waste so much time opening
Hi, Erick,
I added ISOLatin1AccentFilter to the FrenchAnalyzer following Samir's tip,
and it works great! And I think it's the right way to go. Problems
like "You have to store the data raw for display purposes if you want
the accents to show" will go away, since the Analyzer already has
the
Oh, yeah, I know now :-). But I really do have a requirement to show
search results from items that came in 5 seconds ago. We have an
application where a common usage pattern is:
add an item
navigate to another item
search for the first item (to associate it with the second item)
and the gap
Being a French speaker, I will mention the following special cases:
- plus ça change -> plus ca change
- œuf -> oeuf
- lætitia -> laetitia
But I just looked, and it looks like ISOLatin1AccentFilter handles these.
Better test to be sure...
--Renaud
-Original Message-
From: Chris Lu
What about the case where I want to search a MAC address? For example,
00:14:da:81:21:4f will be split by the StandardTokenizer into the tokens
00, 14, da, 81, 21, and 4f.
Suppose I want to search for 00:14:da:81:21:4f. In the search box, I type
00:14:da:81:21:4f. But because these are all separate
Hi Tim!
On Jul 25, 2007, at 8:41 PM, Tim Sturge wrote:
I am indexing a set of constantly changing documents. The change
rate is moderate (about 10 docs/sec over a 10M document collection
with a 6G total size) but I want to be right up to date (ideally
within a second but within 5 seconds
Not that I know of.
Erick
On 7/30/07, Max Metral [EMAIL PROTECTED] wrote:
I have a set of tags associated with content in my corpus. I also have
normal text. Our system tries to figure out which words are tags and
which are text, and falls back on text when tags fail. I'm wondering,
SpanNearQuery(SpanQuery[] clauses, int slop, boolean inOrder)
(javadoc: lucene-2.1.0/docs/api/org/apache/lucene/search/spans/SpanNearQuery.html)
However, is there any special case that you have?
Yes, the character set we use is, as I remember,
MARC-8, which I don't think is ISO Latin.
But since I didn't know about that filter when we had our problem,
I didn't even look. Oh well, smarter/braver/lazier next time <g>...
Which is why I
Greetings All,
I have been trying out Lucene recently and am very happy with the search
performance. But I just noticed that when Lucene performs a search or indexes,
the CPU usage on my machine rises to 100%. Because of this, some of my
other backend processes eventually slow down. Just
Hi,
I am using a Nutch index to search with Lucene. One of my classes uses the
makeStopTable method (which is deprecated) of class StopFilter in
org.apache.lucene.analysis. When I run my program with Lucene 2.1.0:
~/j2sdk1.4.2/bin/java -classpath .:lucene-core-2.1.0.jar SearchFiles
Exception in