RE: Chinese Segmentation with Phrase Query

2007-11-09 Thread Steven A Rowe
Hi Cedric, On 11/08/2007, Cedric Ho wrote: a sentence containing characters ABC may be segmented into AB, C or A, BC. [snip] In this case we would like to index both segmentations: AB offset (0,1) position 0; A offset (0,0) position 0; C offset (2,2) position 1
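The dual-segmentation layout in this snippet can be sketched with plain position bookkeeping (a minimal sketch: `Tok` and its fields are illustrative stand-ins, not Lucene's Token API). Tokens that share a position act as alternatives, which is what lets a phrase query match either segmentation.

```java
import java.util.List;

public class OverlapDemo {
    // Illustrative token record (not Lucene's Token class): term text,
    // character offsets, and position in the token stream.
    record Tok(String term, int start, int end, int position) {}

    // Both segmentations of "ABC" indexed at once: "AB" and "A" share
    // position 0, so phrase matching can start from either alternative.
    static final List<Tok> TOKENS = List.of(
            new Tok("AB", 0, 1, 0),
            new Tok("A",  0, 0, 0),
            new Tok("C",  2, 2, 1));

    static long countAtPosition(int position) {
        return TOKENS.stream().filter(t -> t.position() == position).count();
    }

    public static void main(String[] args) {
        System.out.println(countAtPosition(0)); // 2 alternatives at position 0
    }
}
```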

RE: FuzzyQuery + QueryParser - I'm puzzled

2007-12-17 Thread Steven A Rowe
Hi anjana m, You're going to have lots of trouble getting a response, for two reasons: 1. You are replying to an existing thread and changing the subject. Don't do that. When you have a question, start a new thread by creating a new email instead of replying. 2. You are not telling the list

RE: Lucene multifield query problem

2007-12-18 Thread Steven A Rowe
Hi Rakesh, Set the default QueryParser operator to AND (default default operator :) is OR): http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/queryParser/QueryParser.html#setDefaultOperator(org.apache.lucene.queryParser.QueryParser.Operator) Steve On 12/18/2007 at 1:22 PM, Rakesh Shete

RE: Lucene multifield query problem

2007-12-18 Thread Steven A Rowe
that D is not anymore sufficient to qualify a doc. Hope this helps (otherwise let this reply be forever disqualified : - ) ) Doron On Dec 18, 2007 9:28 PM, Steven A Rowe [EMAIL PROTECTED] wrote: Hi Rakesh, This doesn't look like a user-generated query. Have you considered

RE: anyone

2007-12-19 Thread Steven A Rowe
Hi James, Over the last two months, it has averaged roughly 15 messages per day. Feels like more than semi-active to me. Steve On 12/19/2007 at 2:00 PM, Hartrich, James CTR USTRANSCOM J6 wrote: Is this at least a semi-active list? James

RE: Which file in the lucene package is used to manipulate results..

2007-12-28 Thread Steven A Rowe
Hi Sumit, Here's a good place to start: http://lucene.apache.org/java/docs/scoring.html Steve On 12/28/2007 at 12:30 PM, sumittyagi wrote: also what is the lucene ranking (scoring documents) formula sumittyagi wrote: hi which file can i edit to change the scoring factors in

RE: Is there a mavenized Lucene bundle in the apache maven repo and what's the url?

2008-01-03 Thread Steven A Rowe
Hi, It's in the global maven repo at: http://repo1.maven.org/maven2/org/apache/lucene/ The 2.2.0 core jar is at: http://repo1.maven.org/maven2/org/apache/lucene/lucene-core/2.2.0/ Steve On 01/03/2008 at 11:26 AM, tgospodinov wrote: I couldn't find the url to the lucene maven repo if

RE: boost scores with non-content based information

2008-01-03 Thread Steven A Rowe
Hi Ted, On 01/03/2008 at 3:35 PM, Ted Chen wrote: I'd like to make sure that my search engine can take into account of some non-content based factors. [snip] P.S. My last email didn't get any response. Au contraire, mon frère:

RE: Why is lucene so slow indexing in nfs file system ?

2008-01-09 Thread Steven A Rowe
Hi Ariel, On 01/09/2008 at 8:50 AM, Ariel wrote: Do you know of other distributed architecture applications that use Lucene to index big amounts of documents? Apache Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit

RE: Empty lucene-similarity jars on maven mirrors

2008-01-09 Thread Steven A Rowe
Hi Sanjay, On 01/09/2008 at 3:02 PM, Sanjay Dahiya wrote: lucene-similarity (2.1.0 and 2.2.0) jar files available on maven mirrors don't contain any files. That's because the o.a.l.search.similar package (the sole contents of the contrib/similarity/ directory) has been empty as of the 2.1.0

RE: Retrieve the number of deleted documents

2008-01-11 Thread Steven A Rowe
Hi Shai, On 01/11/2008 at 7:42 AM, Shai Erera wrote: Will IndexReader.maxDoc() - IndexReader.numDocs() give the correct result? Or is this just a heuristic? I think your expression gives the correct result - the abstract IndexReader.numDocs() method is implemented in SegmentReader as:
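The arithmetic behind the suggestion is simple enough to sketch directly (the values below are hypothetical; in real code they come from IndexReader's maxDoc() and numDocs() calls):

```java
public class DeletedDocs {
    // maxDoc() counts every document slot ever allocated (deleted ones
    // included); numDocs() counts only live documents. Their difference
    // is therefore the number of deleted documents.
    static int numDeleted(int maxDoc, int numDocs) {
        return maxDoc - numDocs;
    }

    public static void main(String[] args) {
        // Hypothetical index: 100 slots, 97 live documents.
        System.out.println(numDeleted(100, 97)); // 3
    }
}
```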

RE: generate-maven-artifacts

2008-01-15 Thread Steven A Rowe
Hi Sergey, On 01/15/2008 at 9:57 AM, Sergey Kabashnyuk wrote: Hi all. I am trying to build maven artifacts from tags/lucene_2_2_0, by calling ant generate-maven-artifacts. But BUILD FAILED /java/src/lucene/svn/java/tags/lucene_2_2_0/build.xml:366: The following error occurred while

RE: Lucene, HTML and Hebrew

2008-01-22 Thread Steven A Rowe
Hi Itamar, In another thread, you wrote: Yesterday I sent an email to this group querying about some very important (to me...) features of Lucene. I'm giving it another chance before it goes unnoticed or forgotten. If it was too long please let me know and I will email a shorter list of

RE: Lucene, HTML and Hebrew

2008-01-22 Thread Steven A Rowe
On 01/22/2008 at 8:49 PM, Grant Ingersoll wrote: On Jan 22, 2008, at 6:06 PM, Steven A Rowe wrote: On 01/21/2008 at 2:59 PM, Itamar Syn-Hershko wrote: 2) How would I set the boosts for the headers and footnotes? I'd rather have it stored within the index file than have to append

RE: Lucene, HTML and Hebrew

2008-01-24 Thread Steven A Rowe
Hi Itamar, On 01/24/2008 at 2:55 PM, Itamar Syn-Hershko wrote: Lucene does not store proximity relations between data in different fields, only within individual fields So are 2 calls for doc-add with the same field but different texts considered as 1 field (latter call being

RE: Apostrophe filtering in StandardFilter

2008-01-29 Thread Steven A Rowe
On 01/29/2008 at 10:05 AM, Grant Ingersoll wrote: On Jan 29, 2008, at 9:29 AM, christophe blin wrote: thanks for the pointer to the ellision filter, but I am currently stuck with lucene-core-2.2.0 found in maven2 central repository (do not contain this class). I'll watch for an upgrade to

RE: Apostrophe filtering in StandardFilter

2008-01-29 Thread Steven A Rowe
Hi Chris, Looks like the ElisionFilter handles the French problems you mentioned: http://lucene.apache.org/java/2_3_0/api/org/apache/lucene/analysis/fr/ElisionFilter.html See the code for the list of /X'/ constructions it handles:

RE: lucene 2.3 in production

2008-02-05 Thread Steven A Rowe
Hi GokulAnand, On 02/05/2008 at 12:33 AM, GokulAnand wrote: Can some one get me the link to get lucene 2.3 jars. It is considered bad form on this list to reply to an existing thread with a message on a different topic than the one already being discussed - this is called thread hijacking.

RE: Searching the backlog of mailing list for lucene java users

2008-02-08 Thread Steven A Rowe
Hi Erica, Another good place to look is at the FAQ: http://wiki.apache.org/lucene-java/LuceneFAQ Steve On 02/08/2008 at 8:10 AM, Grant Ingersoll wrote: http://wiki.apache.org/lucene-java/MailingListArchives has a variety of options (although the readlist one is not listed) On Feb 8,

RE: Delete problems O.O

2008-02-11 Thread Steven A Rowe
Hi Cesar, On 02/11/2008 at 2:19 PM, Cesar Ronchese wrote: I'm running into problems with document deletion. [...] This simply doesn't delete anything from the Index. //see the code sample: //theFieldName was previously stored as Field.Store.YES and Field.Index.TOKENIZED. Term t = new

RE: Rails and lucene

2008-02-19 Thread Steven A Rowe
Hi Cooper Geng, Ferret is a Lucene-inspired search engine for Ruby - maybe that would be useful for you?: http://ferret.davebalmain.com/trac Steve On 02/19/2008 at 2:25 AM, coolgeng coolgeng wrote: Hi guys, Now an idea knock my brain, which I want to integrate the lucene into my

RE: How to index word-pairs and phrases

2008-02-19 Thread Steven A Rowe
\analyzers\src\java\org\apache\lucene\analysis\ngram ?? Does this tokenizer do what I need? thank you, -Ghinwa On Tue, 19 Feb 2008, Steven A Rowe wrote: Mark, The ShingleFilter contrib has not been committed yet - it's still here: https://issues.apache.org/jira/browse/LUCENE-400

RE: How to index word-pairs and phrases

2008-02-19 Thread Steven A Rowe
Mark, The ShingleFilter contrib has not been committed yet - it's still here: https://issues.apache.org/jira/browse/LUCENE-400 Steve On 02/19/2008 at 2:33 AM, markharw00d wrote: Further to Grant's useful background - there is an analyzer specifically for multi-word terms in contrib. See

RE: query question

2008-02-19 Thread Steven A Rowe
Hi C.B., Yonik is referring to a Solr class: http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/analysis/WordDelimiterFilter.java?view=markup You should theoretically be able to use this filter with straight Lucene code, as long as it's on the classpath. (I'm guessing

RE: How to change the scores of the documents

2008-02-20 Thread Steven A Rowe
Hi, On 02/20/2008 at 12:29 PM, sumittyagi wrote: hi i want to rerank the documents obtained from the HITS, how can i edit the scoring formula. Here's a good place to start: http://lucene.apache.org/java/docs/scoring.html Steve

RE: Question: Using Shingle Analyzer NGramAnalyzerWrapper in Lucene

2008-02-26 Thread Steven A Rowe
Hi Stanley, I modernized the files in LUCENE-400 a bit - you can see the details in comments I made on the issue. The results, including all files needed to address the issue, are in the file attached to the issue named LUCENE-400.patch. I can tell you aren't using the modernized version

RE: Specialized XML handling in Lucene

2008-03-11 Thread Steven A Rowe
Hi Eran, see my comments below inline: On 03/11/2008 at 9:23 AM, Eran Sevi wrote: I would like to ask for suggestions of the best design for the following scenario: I have a very large number of XML files (around 1M). Each file contains several sections. Each section contains many elements

RE: Specialized XML handling in Lucene

2008-03-11 Thread Steven A Rowe
Hatcher's and Otis Gospodnetic's excellent book Lucene in Action covers sorting: http://www.manning.com/hatcher2/ Steve On Tue, Mar 11, 2008 at 5:48 PM, Steven A Rowe [EMAIL PROTECTED] wrote: Hi Eran, see my comments below inline: On 03/11/2008 at 9:23 AM, Eran Sevi wrote: I would like

RE: Specialized XML handling in Lucene

2008-03-11 Thread Steven A Rowe
On 03/11/2008 at 11:48 AM, Steven A Rowe wrote: 5 billion docs is within the range that Lucene can handle. I think you should try doc = element and see how well it works. Sorry, Eran, I was dead wrong about this assertion. See this thread for more information: http://www.nabble.com

RE: word position operator?

2008-03-17 Thread Steven A Rowe
Hi Darren, Check out SpanFirstQuery and SpanRegexQuery: http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/spans/SpanFirstQuery.html http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/regex/SpanRegexQuery.html Steve On 03/16/2008 at 8:55 PM, Darren Govoni wrote:

RE: Unicode Tokenizer problem with Registered Trademark Search

2008-04-02 Thread Steven A Rowe
Hi Bruce, On 04/02/2008 at 4:58 PM, [EMAIL PROTECTED] wrote: I am having a problem when searching for certain Unicode characters, such as the Registered Trademark. That's the Unicode character 00AE. It's also a problem searching for a Japanese Yen symbol (Unicode character 00A5). I'm using

RE: Lucene standard analyzer internationalization

2008-04-22 Thread Steven A Rowe
Hi Prashant, On 04/22/2008 at 2:23 PM, Prashant Malik wrote: We have been observing the following problem while tokenizing using lucene's StandardAnalyzer. Tokens that we get is different on different machines. I am suspecting it has something to do with the Locale settings on individual

RE: Lucene standard analyzer internationalization

2008-04-22 Thread Steven A Rowe
the same nfs mounts on both the machines Also we have tried with lucene2.2.0 and 2.3.1. with the same result . also about the actual string u have it right till 2 . 3,4,5 are a single character Thx PM On Tue, Apr 22, 2008 at 12:01 PM, Steven A Rowe [EMAIL PROTECTED] wrote

RE: lucene farsi problem

2008-04-30 Thread Steven A Rowe
Hi Esra, Caveat: I don't speak, read, write, or dream in Farsi - I just know that it mostly shares its orthography with Arabic, and that they are both written and read right-to-left. How are you constructing the queries? Using QueryParser? If so, then I suspect the problem is that you

RE: lucene farsi problem

2008-04-30 Thread Steven A Rowe
On 04/30/2008 at 12:50 PM, Steven A Rowe wrote: Caveat: I don't speak, read, write, or dream in Farsi - I just know that it mostly shares its orthography with Arabic, and that they are both written and read right-to-left. How are you constructing the queries? Using QueryParser? If so

RE: lucene farsi problem

2008-05-01 Thread Steven A Rowe
Hi Esra, Going back to the original problem statement, I see something that looks illogical to me - please correct me if I'm wrong: On Apr 30, 2008, at 3:21 AM, esra wrote: i am using lucene's IndexSearcher to search the given xml by keyword which contains farsi information. while searching

RE: lucene farsi problem

2008-05-02 Thread Steven A Rowe
/2008 at 9:31 AM, esra wrote: Hi Steven, sorry i made a mistake. unicodes are like this: د=U+62F ژ = U+632 and the first letter of ساب ووفر is س = U+633 you can also check them here http://www.unics.uni-hannover.de/nhtcapri/persian-alphabet.html Esra Steven A Rowe wrote

RE: lucene farsi problem

2008-05-02 Thread Steven A Rowe
-hannover.de/nhtcapri/persian-alphabet.html Esra Steven A Rowe wrote: Hi Esra, Going back to the original problem statement, I see something that looks illogical to me - please correct me if I'm wrong: On Apr 30, 2008, at 3:21 AM, esra wrote: i am using

RE: lucene farsi problem

2008-05-02 Thread Steven A Rowe
and the searcher works with unicodes. Esra Steven A Rowe wrote: Hi Esra, You are *still* incorrectly referring to the glyph with three dots over it: On 05/02/2008 at 12:18 PM, esra wrote: yes the correct one is ژ /ze/U+632. ژ is *not* ze/U+632 - it is zhe/U+698. Have you

RE: lucene farsi problem

2008-05-04 Thread Steven A Rowe
and post back about how it works. Thanks, Steve On 05/03/2008 at 8:33 AM, esra wrote: Hi Steven, thanks for your help Esra Steven A Rowe wrote: Hi Esra, I have created an issue for this - see https://issues.apache.org/jira/browse/LUCENE-1279. I'll try to take a crack

RE: lucene farsi problem

2008-05-07 Thread Steven A Rowe
are using fa for farsi and ar for arabic. I have added a little control for the locale parameter in my code and now i can see the correct results. Thank you very much for your help. Esra. Steven A Rowe wrote: Hi Esra, I have attached a patch to LUCENE-1279 containing a new

RE: lucene farsi problem

2008-05-07 Thread Steven A Rowe
Hi PV, On 05/07/2008 at 2:54 AM, PV wrote: Sorry for cross posting, but why the word 'Farsi' instead of 'Persian'? No one says Lucene français or Español, or Deutsch - so why Farsi? Please read the following article, I found it quite enlightening.

RE: lucene farsi problem

2008-05-09 Thread Steven A Rowe
Hi Esra, On 05/07/2008 at 11:49 AM, Steven A Rowe wrote: At Chris Hostetter's suggestion, I am rewriting the patch attached to LUCENE-1279, including the following changes: - Merged the contents of the CollatingRangeQuery class into RangeQuery and RangeFilter - Switched the Locale
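The collation idea behind that patch can be illustrated with plain java.text.Collator: compare range endpoints with a locale-aware Collator instead of raw code-point order. A French example is used below because its ordering is easy to verify; the locale and strings are mine, not from the thread, and the same mechanism applies to Farsi/Arabic.

```java
import java.text.Collator;
import java.util.Locale;

public class CollationDemo {
    // Raw code-point comparison, as plain String.compareTo does it.
    static boolean codePointOrder(String a, String b) {
        return a.compareTo(b) < 0;
    }

    // Locale-aware comparison via a Collator for the given locale.
    static boolean collatedOrder(Locale locale, String a, String b) {
        return Collator.getInstance(locale).compare(a, b) < 0;
    }

    public static void main(String[] args) {
        // "é" sorts after "z" by raw code points (U+00E9 > U+007A) ...
        System.out.println(codePointOrder("é", "z")); // false
        // ... but before "z" under French collation, as a reader expects.
        System.out.println(collatedOrder(Locale.FRENCH, "é", "z")); // true
    }
}
```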

RE: lucene farsi problem

2008-05-11 Thread Steven A Rowe
that ConstantScoreRangeQuery doesn't have the clause limit restriction that RangeQuery has (1024 max clauses, IIRC). Steve On 05/10/2008 at 1:22 PM, esra wrote: Hi Steve, i used the locale as ar and it works fine . again thanks a lot for your help. Esra Steven A Rowe wrote: Hi

RE: indexing images of a document

2008-06-10 Thread Steven A Rowe
Hi Bernd, It's still not clear what you want to do. What will a search look like? On 06/10/2008 at 8:36 AM, Bernd Mueller wrote: I will try to explain what I mean with image stuff. An image in xml-documents is usually an url to the location where the image is stored. Additionally, such an

RE: indexing images of a document

2008-06-10 Thread Steven A Rowe
, with the binary data encoded as Base64 or something similar, you should be able to store and retrieve it as a String. Or maybe you could store a .jar file in a binary field - that would probably be simplest. Steve Steven A Rowe wrote: Hi Bernd, It's still not clear what you want to do. What

RE: huge tii files

2008-06-17 Thread Steven A Rowe
Hi tsuraan, On 06/17/2008 at 2:31 PM, tsuraan wrote: I'm guessing the answer is no, but is there an equivalent to that for lucene-2.2.0? Not exactly equivalent, but: from the apidoc for the 2.3.2 version of setTermInfosIndexDivisor(int)

RE: indexing unsupported mime types using Lucene

2008-06-19 Thread Steven A Rowe
Hi Gaurav, To which mime types are you referring? I can't think of a tool designed for this, but one thing you might try is checking whether the input is compressed/packed, and if so first decompressing/unpacking it, and then using the strings program (available on Linux and Cygwin) to

RE: How to make mutually exclusive lists of results

2008-06-22 Thread Steven A Rowe
Hi Dr. Fish, You could make just a single query with the broadest query possible - e.g. bacon AND country:united states and then iterate over all results, dividing them into your three buckets based on the values of the other two fields. Steve On 06/22/2008 at 12:29 PM, Dr. Fish wrote:
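The query-once-then-bucket approach can be sketched in plain Java. The `Hit` record and the field values below are made up for illustration; real code would read them from each matching Lucene document's stored fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BucketDemo {
    // Illustrative stand-in for a search hit (not a Lucene class).
    record Hit(String id, String category) {}

    // One broad query, then divide its hits into buckets by field value.
    static Map<String, List<Hit>> bucket(List<Hit> results) {
        Map<String, List<Hit>> buckets = new LinkedHashMap<>();
        for (Hit h : results) {
            buckets.computeIfAbsent(h.category(), k -> new ArrayList<>()).add(h);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Hit> results = List.of(
                new Hit("d1", "recipes"),
                new Hit("d2", "producers"),
                new Hit("d3", "recipes"));
        System.out.println(bucket(results).get("recipes").size()); // 2
    }
}
```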

RE: matching sub phrases in user entered query...

2008-07-14 Thread Steven A Rowe
Hi Preetam, On 07/14/2008 at 1:40 PM, Preetam Rao wrote: Is there a query in Lucene which matches sub phrases ? [snip] I was redirected to Shingle filter which is a token filter that spits out n-grams. But it does not seem to be best solution since one does not know in advance what n in

RE: newbie question (for John Griffin) - fixed

2008-07-15 Thread Steven A Rowe
Hi Chris, The PhraseQuery class does no parsing; tokenization is expected to happen before you feed anything to it. So unless you have an index-time analyzer that outputs terms that look like aaa ddd -- that is, terms with embedded spaces -- then attempting to use PhraseQuery or any other

RE: Returned mail: see transcript for details

2008-07-15 Thread Steven A Rowe
Hi Erik, I'm seeing the same problem - here's an excerpt from the headers of a bounce I just got (note the address [EMAIL PROTECTED] in the last couple of Received: headers): Received: from spwiki.spsoftware.com (static61.17.14-87.vsnl.eth.net [61.17.14.87] (may be forged)) for [EMAIL

RE: custom scoring

2008-07-18 Thread Steven A Rowe
Hi Sébastien, Have you looked into the DisjunctionMaxQuery http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/search/DisjunctionMaxQuery.html? From that page: A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum

RE: Bug in CJKTokenizer

2008-07-18 Thread Steven A Rowe
Hi Scott, I think this sounds reasonable, but why not also add LATIN_EXTENDED_B and LATIN_EXTENDED_ADDITIONAL? AFAICT, among other things, these cover some eastern European languages and Vietnamese, respectively. Steve On 07/18/2008 at 5:03 PM, Scott Smith wrote:

RE: Boolean expression for no terms OR matching a wildcard

2008-07-21 Thread Steven A Rowe
Hi Ronald, Caveat - I haven't tested this, but: With a RegexQuery http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/search/regex/RegexQuery.html, I think you can do something like (using your example): +abc*123 -{Regex}(?!abc.*123$) This query would include all documents that have
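The negative-lookahead mechanics can be checked with java.util.regex. Hedged: java.util.regex semantics are not guaranteed to match the Lucene regex contrib's, and the wrapping `.*` is mine (added so `matches()` can consume the whole term); treat this only as an illustration of the pattern itself.

```java
import java.util.regex.Pattern;

public class LookaheadDemo {
    // Negative lookahead from the example above: succeeds only for terms
    // that do NOT start with "abc" and end with "123".
    static final Pattern NOT_ABC123 = Pattern.compile("(?!abc.*123$).*");

    static boolean matches(String term) {
        return NOT_ABC123.matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("abcX123")); // false: excluded by lookahead
        System.out.println(matches("xyz"));     // true
    }
}
```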

RE: Opposite to StopFilter. Anything already implemented out there?

2008-07-22 Thread Steven A Rowe
Hi Martin, On 07/22/2008 at 5:48 AM, mpermar wrote: I want to index some incoming text. In this case what I want to do is just detect keywords in that text. Therefore I want to discard everything that is not in the keywords set. This sounds to me pretty much like the reverse of using stop

RE: Using lucene to search a bunch of keywords?

2008-07-23 Thread Steven A Rowe
Hi Ryan, I'm not sure Lucene's the right tool for this job. I have used regular expressions and ternary search trees in the past to do similar things. Is the set of keywords too large for an in-memory solution like these? If not, consider using a tool like the Perl package Regex::PreSuf

RE: Using lucene to search a bunch of keywords?

2008-07-23 Thread Steven A Rowe
in. Thanks. On Jul 23, 2008, at 3:54 PM, Steven A Rowe wrote: Hi Ryan, I'm not sure Lucene's the right tool for this job. I have used regular expressions and ternary search trees in the past to do similar things. Is the set of keywords too large for an in-memory solution like

RE: Using lucene to search a bunch of keywords?

2008-07-23 Thread Steven A Rowe
On 07/23/2008 at 5:09 PM, Steven A Rowe wrote: Karl Wettin's recently committed ShingleMatrixAnalyzer Oops, ShingleMatrixAnalyzer -> ShingleMatrixFilter. Steve

RE: too many clause exception when using a filter

2008-07-30 Thread Steven A Rowe
Hi René, Since you're constructing the filter from a WildcardQuery or a PrefixQuery, both of which use a BooleanQuery to hold a TermQuery for each matching index term, you'll need to increase the number of clauses a BooleanQuery is allowed to hold, by calling static method

RE: folder path prefix filtering

2008-08-05 Thread Steven A Rowe
Hi Nico, On 08/05/2008 at 9:44 AM, Nico Krijnen wrote: On 5 aug 2008, at 11:11, Karsten F. wrote: Can't you store only the relevant path in an extra lucene field and set the maximum of query-terms to e.g. 2048 ? @Karsten: We did think about simplifying permissions to just top-level

RE: escaping special characters

2008-08-11 Thread Steven A Rowe
On 08/11/2008 at 2:14 PM, Chris Hostetter wrote: Aravind R Yarram wrote: can i escape built in lucene keywords like OR, AND as well? as of the last time i checked: no, they're baked into the grammar. I have not tested this, but I've read somewhere on this list that enclosing OR and AND in

RE: Query to ignore certain phrases

2008-08-12 Thread Steven A Rowe
Hi Jeff, I don't know of a query parser that will allow you to achieve this. However, if you can programmatically construct (at least a component of) your queries, then you may want to check out Lucene's SpanQuery functionality. In particular, using your example, if you combine a

RE: Case Sensitivity

2008-08-13 Thread Steven A Rowe
Hi Dino, StandardAnalyzer incorporates StandardTokenizer, StandardFilter, LowerCaseFilter, and StopFilter. Any index you create using it will only provide case-insensitive matching. Steve On 08/13/2008 at 12:15 PM, Dino Korah wrote: Also would like to highlight the version of Lucene I am

RE: Testing for field existence

2008-08-18 Thread Steven A Rowe
Hi Bill, A simpler suggestion, assuming you need to test for the existence of just one particular field: rather than adding a field containing a list of all indexed fields for a particular document, as Karsten suggested, you could just add a field with a constant value when the field you want
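The constant-marker-field trick can be sketched with a plain map standing in for a document; the field names "price" and "has_price" are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

public class MarkerField {
    // Sketch of the marker-field trick: whenever the optional field is
    // present, also add a constant-valued marker field. An existence
    // query then becomes a simple term match on the marker.
    static Map<String, String> makeDoc(String price) {
        Map<String, String> doc = new HashMap<>();
        if (price != null) {
            doc.put("price", price);
            doc.put("has_price", "yes"); // constant marker value
        }
        return doc;
    }

    static boolean hasPrice(Map<String, String> doc) {
        return "yes".equals(doc.get("has_price"));
    }

    public static void main(String[] args) {
        System.out.println(hasPrice(makeDoc("9.99"))); // true
        System.out.println(hasPrice(makeDoc(null)));   // false
    }
}
```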

RE: Case Sensitivity

2008-08-19 Thread Steven A Rowe
Hi Dino, I think you'd benefit from reading some FAQ answers, like: Why is it important to use the same analyzer type during indexing and search? http://wiki.apache.org/lucene-java/LuceneFAQ#head-0f374b0fe1483c90fe7d6f2c44472d10961ba63c Also, have a look at the AnalysisParalysis wiki page for

RE: EmailAddressAnalyzer TokenStreams

2008-08-20 Thread Steven A Rowe
Hi Dino, The Lucene KeywordTokenizer is about as simple as tokenizers get - it just outputs its entire input as a single token: http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/analysis/KeywordTokenizer.java?revision=687357&view=markup Check out the source code for

RE: Storing special characters in Lucene

2008-08-21 Thread Steven A Rowe
Hola Juan, On 08/21/2008 at 1:16 PM, Juan Pablo Morales wrote: I have an index in Spanish and I use Snowball to stem and analyze and it works perfectly. However, I am running into trouble storing (not indexing, only storing) words that have special characters. That is, I store the special

RE: Lucene sample code and api documentation

2008-08-28 Thread Steven A Rowe
Hi Sithu, On 08/27/2008 at 3:13 PM, Sudarsan, Sithu D. wrote: 2. Where do we look for sample codes? Or detailed tutorials? Lots of good stuff here: http://wiki.apache.org/jakarta-lucene and particularly here (books, articles, presentations, oh my!):

RE: Confused with NGRAM results

2008-08-28 Thread Steven A Rowe
Hi gaz77, Here's a good place to start: http://wiki.apache.org/jakarta-lucene/AnalysisParalysis Steve On 08/28/2008 at 10:52 AM, gaz77 wrote: Hi, I'd appreciate if someone could explain the results I'm getting. I've written a simple custom analyzer that applies the NGramTokenFilter

RE: boost freshness instead of sorting

2008-08-28 Thread Steven A Rowe
Hi Yannis, On 08/28/2008 at 12:12 PM, Yannis Pavlidis wrote: I am trying to boost the freshness of some of our documents in the index using the most efficient way (i.e. if 2 news stories have the same score based on the content then I want to promote the one that was created last) [...]

RE: boost freshness instead of sorting

2008-08-28 Thread Steven A Rowe
. Thanks, Yannis. -Original Message- From: Steven A Rowe [mailto:[EMAIL PROTECTED] Sent: Thu 8/28/2008 10:27 AM To: java-user@lucene.apache.org Subject: RE: boost freshness instead of sorting Hi Yannis, On 08/28/2008 at 12:12 PM, Yannis Pavlidis wrote: I am trying to boost

RE: Beginner: Specific indexing

2008-09-09 Thread Steven A Rowe
Hi Raymond, Check out SinkTokenizer/TeeTokenFilter: http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/TeeTokenFilter.html Look at the unit tests for usage hints:

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+exact matching

2008-09-09 Thread Steven A Rowe
Hi mck, On 09/09/2008 at 12:58 PM, Mck wrote: *ShortVersion* is there a way to make the ShingleFilter perform exact matching via inserting ^ $ begin/end markers? Reading through the mailing list i see how exact matching can be done, a la STFW to myself... So the ShortVersion now

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+exactmatching

2008-09-09 Thread Steven A Rowe
On 09/09/2008 at 4:38 PM, Mck wrote: Looks to me like MultiPhraseQuery is getting in the way. Shingles that begin at the same word are given the same position by ShingleFilter, and Solr's FieldQParserPlugin creates a MultiPhraseQuery when it encounters tokens in a query with the same

RE: RE: Re: Replacing FAST functionality at sesam.no -ShingleFilter+exactmatching

2008-09-10 Thread Steven A Rowe
Hi mck, On 09/10/2008 at 3:55 AM, Mck wrote: probably better to change the one instance of .setPositionIncrement(0) to .setPositionIncrement(1) - that way, MultiPhraseQuery will not be invoked, and the standard disjunction thing should happen. Tried this. As you say i end up with

RE: Understanding/controlling role of Weight in IndexSearcher

2008-09-10 Thread Steven A Rowe
Hi Micah, On 09/09/2008 at 11:57 PM, Micah Jaffe wrote: I'm [...] curious how weights are calculated. [...] thoughts? pointers? best practices? http://lucene.apache.org/java/docs/scoring.html

RE: Re: Replacing FAST functionality at sesam.no-ShingleFilter+exactmatching

2008-09-10 Thread Steven A Rowe
On 09/10/2008 at 12:02 PM, Mck wrote: But this does not return the hits i want. Have you tried submitting the query without quotes? (That's where the PhraseQuery likely comes from.) Yes. It does not work. It returns just the unigrams, again the same behaviour as mentioned earlier.

RE: Problems when changing stoplist file

2008-09-11 Thread Steven A Rowe
Hi Marie, On 09/11/2008 at 4:03 AM, Marie-Christine Plogmann wrote: I am currently using the demo class IndexFiles to index some corpus. I have replaced the Standard by a GermanAnalyzer. Here, indexing works fine. But if i specify a different stopword list that should be used, the

RE: StandardTokenizer and Korean grouping with alphanum

2008-09-22 Thread Steven A Rowe
Hi Daniel, On 09/22/2008 at 12:49 AM, Daniel Noll wrote: I have a question about Korean tokenisation. Currently there is a rule in StandardTokenizerImpl.jflex which looks like this: ALPHANUM = ({LETTER}|{DIGIT}|{KOREAN})+ LUCENE-1126 https://issues.apache.org/jira/browse/LUCENE-1126

RE: ArrayIndexOutOfBoundsException in FastCharStream.readChar

2008-10-06 Thread Steven A Rowe
Hi Edwin, I don't know specifically what's causing the exception you're seeing, but note that in Lucene 2.3.0+, the JavaCC-generated version of StandardTokenizer (where your exception originates) has been replaced with a JFlex-generated version - see

RE: how to get the highlight code for v 2.2.0 (or any prior version)?

2008-10-16 Thread Steven A Rowe
Hi Paul, On 10/16/2008 at 12:00 PM, [EMAIL PROTECTED] wrote: Still a newbie here, sorry: a) I can see how to get a zip/jar of the Lucene v.2.2.0 (http://www.urlstructure.com/apache/lucene/java/archive/) or v.2.3.0 (http://www.urlstructure.com/apache/lucene/java/) b) but none of

RE: How to search in metadata? (filename, document title, cocument creator, ...)

2008-10-20 Thread Steven A Rowe
On 10/20/2008 at 12:41 PM, mil84 wrote: doc.add(new Field(Title, hohoho, Field.Store.YES, Field.Index.TOKENIZED)); [...] 3) Searching in title - it DON'T WORK (I try to find hohoho, and nothing). [...] QueryParser parser = new QueryParser(title, new StandardAnalyzer()); Field names are
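The root cause here is that Lucene matches field names exactly, including case: the document indexes "Title" but the parser queries "title". A plain map shows the same failure mode (illustrative only, not the Lucene API):

```java
import java.util.Map;

public class FieldNameCase {
    // Field indexed as "Title" (capital T), as in the snippet above.
    static final Map<String, String> STORED = Map.of("Title", "hohoho");

    public static void main(String[] args) {
        // A query-time lookup with lowercase "title" finds nothing --
        // Lucene field names are matched exactly, like this map lookup.
        System.out.println(STORED.containsKey("title")); // false
        System.out.println(STORED.containsKey("Title")); // true
    }
}
```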

RE: Question about QueryParser

2008-10-23 Thread Steven A Rowe
Hi James, On 10/23/2008 at 8:30 AM, James liu wrote: public class AnalyzerTest { @Test public void test() throws ParseException { QueryParser parser = new MultiFieldQueryParser(new String[]{title, body}, new StandardAnalyzer()); Query query1 = parser.parse(中文);

RE: example on RegexQuery

2008-10-24 Thread Steven A Rowe
Hi Aashish, On 10/24/2008 at 3:35 AM, Agrawal, Aashish (IT) wrote: I want to use lucene for a simple search engine with regex support. I tried using RegexQuery.. but seems I am missing something. Is there any working example on using RegexQuery?? How about TestRegexQuery?:

RE: example on RegexQuery

2008-10-27 Thread Steven A Rowe
Hi Aashish, On 10/26/2008 at 11:36 PM, Agrawal, Aashish (IT) wrote: I am searching a sample file like below - --- agrawal fdfdf fsdfafasf 3495549584 fsfsfs fsffsf r4e3fdere j4343 - when I search this file with pattern - .*4343* .*[a-z]4343 j4343 or even search for

RE: BoostingTermQuery scoring

2008-11-06 Thread Steven A Rowe
Hi Peter, On 11/06/2008 at 4:25 PM, Peter Keegan wrote: I've discovered another flaw in using this technique: (+contents:petroleum +contents:engineer +contents:refinery) (+boost:petroleum +boost:engineer +boost:refinery) It's possible that the first clause will produce a matching doc and

RE: BigDecimal values

2008-11-20 Thread Steven A Rowe
Hi Sergey, On 11/20/2008 at 9:30 AM, Sergey Kabashnyuk wrote: How can I convert java.math.BigDecimal numbers in to string for its storing in lexicographical order Here's a thoroughly untested idea, cribbing some from o.a.l.document.NumberTools[1]: convert BigDecimals into strings of the
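One way to flesh out that idea is a fixed-scale, fixed-width encoding, so that plain String comparison agrees with numeric order. This is a sketch under stated assumptions (non-negative values, at most 4 decimal places, at most 20 digits after scaling) and is not Lucene's NumberTools; the names are mine.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class BigDecimalKeys {
    static final int SCALE = 4;   // assumed maximum decimal places
    static final int WIDTH = 20;  // assumed maximum digits after scaling

    // Encode a non-negative BigDecimal as a zero-padded digit string
    // whose lexicographic order matches numeric order.
    static String encode(BigDecimal value) {
        if (value.signum() < 0) {
            throw new IllegalArgumentException("sketch handles non-negative values only");
        }
        // Normalize the scale so 1.5 and 1.50 produce identical keys,
        // then zero-pad the unscaled digits to a fixed width.
        String digits = value.setScale(SCALE, RoundingMode.HALF_UP)
                             .unscaledValue().toString();
        if (digits.length() > WIDTH) {
            throw new IllegalArgumentException("value too large for fixed width");
        }
        StringBuilder sb = new StringBuilder(WIDTH);
        for (int i = digits.length(); i < WIDTH; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    public static void main(String[] args) {
        String a = encode(new BigDecimal("9.5"));
        String b = encode(new BigDecimal("10"));
        // Plain string comparison now agrees with numeric order:
        System.out.println(a.compareTo(b) < 0); // true
    }
}
```

Negative values would need an additional sign prefix plus digit complementing; the thread's NumberTools reference is the place to look for the production approach.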

RE: SnowballAnalyzer and AlphaNumeric

2008-12-05 Thread Steven A Rowe
Hi Sam, On 12/04/2008 at 8:21 PM, samd wrote: Where can I get the Lucene source for the Snowball implementation. I need to be able to search for words that are alphanumeric and this does not work with the current snowballanalyzer. Lucene-java's source is available through its revision control

RE: What are the best document edit options?

2008-12-17 Thread Steven A Rowe
Hi Thomas, On 12/17/2008 at 11:52 AM, Thomas J. Buhr wrote: Where can I see how IndexWriter.updateDocument works without getting into Lucene all over again until this important issue is resolved? Is there a sample of its usage for updating specific fields in a given document? The

RE: stuck with Encoded (possibly?) Database entries

2009-01-12 Thread Steven A Rowe
Hi Peter, On 01/12/2009 at 1:43 PM, peter.aisher wrote: ... the contents of the FILE field is the definition. the problem is that the contents of this field is just garbled text. is there any obvious compression technique which might have been used to store this? The text in the files

RE: stuck with Encoded (possibly?) Database entries

2009-01-12 Thread Steven A Rowe
. Carece de significación precisa. Amatar. Asustar. Avenar. a-2. ( Del gr. ἀ-, priv. ). 1. pref. Denota privación o negación. Acromático. Ateísmo. Ante vocal toma la forma an-. Anestesia. Anorexia. Steven A Rowe wrote: Hi Peter, On 01/12/2009 at 1:43 PM, peter.aisher

RE: first time using lucene

2009-01-21 Thread Steven A Rowe
Hi Nitin, Lucene in Action 2nd edition http://www.manning.com/hatcher3/ is a good place to start. If you want free stuff, check out the Lucene wiki Resources page: http://wiki.apache.org/lucene-java/Resources. Also, some basic code on the wiki: http://wiki.apache.org/lucene-java/TheBasics.

RE: Fields with multiple values...

2009-02-11 Thread Steven A Rowe
Hi Dragon Fly, You could split the original document into multiple Lucene Documents, one for each array index, all sharing the same DocID field value. Then your queries just work. But you'd have to do result consolidation, removing duplicate original docs when you get matches at multiple array
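The result-consolidation step can be sketched as a de-duplication pass over the shared DocID values (the IDs below are illustrative):

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class Dedup {
    // Several per-array-index Lucene Documents can share the DocID of the
    // same original document; collapse hits back to unique DocIDs.
    static Set<String> uniqueDocIds(List<String> hitDocIds) {
        // LinkedHashSet keeps first-seen (i.e. ranked) order
        // while dropping duplicate DocIDs.
        return new LinkedHashSet<>(hitDocIds);
    }

    public static void main(String[] args) {
        System.out.println(uniqueDocIds(List.of("doc-7", "doc-3", "doc-7", "doc-9")));
        // [doc-7, doc-3, doc-9]
    }
}
```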

RE: Lucene indexes

2009-02-24 Thread Steven A Rowe
On 2/24/2009 at 5:36 PM, Chris Hostetter wrote: Shingling is (lucene specific?) vernacular for word based ngrams Shingle is not a Lucene-specific term - here's an entry, e.g., from an IBM Glossary of terms for enterprise search at

RE: N-grams with numbers and Shinglefilters

2009-03-02 Thread Steven A Rowe
Hi Raymond, On 3/1/2009, Raymond Balmès wrote: I'm trying to index ( search later) documents that contain tri-grams however they have the following form: string 2 digit 2 digit Does the ShingleFilter work with numbers in the match ? Yes, though it is the tokenizer and previous filters in

RE: N-grams with numbers and Shinglefilters

2009-03-02 Thread Steven A Rowe
Hi Raymond, On 3/2/2009 at 10:09 AM, Raymond Balmès wrote: suppose I have a tri-gram, what I want to do is index the tri-gram string digit1 digit2 as one indexing phrase, and not index each token separately. As long as you don't want any transformation performed on the phrase or its

RE: Confidence scores at search time

2009-03-02 Thread Steven A Rowe
On 3/2/2009 at 4:22 PM, Grant Ingersoll wrote: On Mar 2, 2009, at 2:47 PM, Ken Williams wrote: Also, while perusing the threads you refer to below, I saw a reference to the following link, which seems to have gone dead: https://issues.apache.org/bugzilla/show_bug.cgi?id=31841 Hmm,

RE: Why do range queries work on fields only ?

2009-03-03 Thread Steven A Rowe
Hi Raymond, On 3/3/2009 at 12:04 PM, Raymond Balmès wrote: The range query only works on fields (using a string compare)... is there any reason why it is not possible on the words of the document. The following query [stringa TO stringb] would just give the list of documents which contains

RE: Why do range queries work on fields only ?

2009-03-03 Thread Steven A Rowe
Hi Raymond, On 3/3/2009 at 1:19 PM, Raymond Balmès wrote: On Tue, Mar 3, 2009 at 7:18 PM, Raymond Balmès raymond.bal...@gmail.com wrote: Just a simplified view of my problem : A document contains the terms index01 blabla index02 xxx yyy index03 ... index10. I have the terms indexed in
