RE: StandardTokenizer generation from JFlex grammar

2012-10-04 Thread Steven A Rowe
Hi Phani, Assuming you're using Lucene 3.6.X, see: http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6/lucene/core/src/java/org/apache/lucene/analysis/standard/READ_BEFORE_REGENERATING.txt and

RE: Using stop words with snowball analyzer and shingle filter

2012-09-19 Thread Steven A Rowe
Hi Martin, SnowballAnalyzer was deprecated in Lucene 3.0.3 and will be removed in Lucene 5.0. Looks like you're using Lucene 3.X; here's an (untested) Analyzer, based on the Lucene 3.6 EnglishAnalyzer (except substituting SnowballFilter for PorterStemmer; disabling stopword holes' position

RE: ReferenceManager.maybeRefreshBlocking() should not be declared throwing InterruptedException

2012-07-21 Thread Steven A Rowe
Hi Vitaly, Info here should help you set up snapshot dependencies: http://wiki.apache.org/lucene-java/NightlyBuilds Steve -Original Message- From: Vitaly Funstein [mailto:vfunst...@gmail.com] Sent: Saturday, July 21, 2012 9:22 PM To: java-user@lucene.apache.org Subject: Re:

RE: RAMDirectory and expungeDeletes()/optimize()

2012-07-11 Thread Steven A Rowe
Nabble silently drops content from email sent through their interface on a regular basis. I've told them about it multiple times. My suggestion: find another way to post to this mailing list. -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent:

RE: how to remove the dash

2012-06-25 Thread Steven A Rowe
I added the following to both TestStandardAnalyzer and TestClassicAnalyzer in branches/lucene_solr_3_6/, and it passed in both cases: public void testWhitespaceHyphenWhitespace() throws Exception { BaseTokenStreamTestCase.assertAnalyzesTo(a, "drinks - water", new String[]{"drinks",

RE: [MAVEN] Heads up: build changes

2012-05-09 Thread Steven A Rowe
:53:29 PDT 2011 x86_64 Intel(R) Core(TM) i7-2820QM CPU @ 2.30GHz GenuineIntel GNU/Linux On 08/05/12 11:24, Steven A Rowe wrote: Hi Greg, I don't see that problem - 'ant generate-maven-artifacts' just works for me. I suspect that the XSLT processor included with your JDK does not support

[MAVEN] Heads up: build changes

2012-05-08 Thread Steven A Rowe
If you use the Lucene/Solr Maven POMs to drive the build, I committed a major change last night (see https://issues.apache.org/jira/browse/LUCENE-3948 for more details): * 'ant get-maven-poms' no longer places pom.xml files under the lucene/ and solr/ directories. Instead, they are placed in

RE: [MAVEN] Heads up: build changes

2012-05-08 Thread Steven A Rowe
] at org.apache.tools.ant.Main.startAnt(Main.java:217) [copy] at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280) [copy] at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109) On 08/05/12 10:31, Steven A Rowe wrote: If you use the Lucene/Solr Maven POMs to drive

RE: Highlighter and Shingles...

2012-04-20 Thread Steven A Rowe
Hi Dawn, Can you give an example of a partial match? Steve -Original Message- From: Dawn Zoë Raison [mailto:d...@digitorial.co.uk] Sent: Friday, April 20, 2012 7:59 AM To: java-user@lucene.apache.org Subject: Highlighter and Shingles... Hi, Are there any notes on making the

RE: Two questions on RussianAnalyzer

2012-04-19 Thread Steven A Rowe
Hi Vladimir, The most uncomfortable in new behaviour to me is that in past I used to search by subdomain like bbb.com: and have displayed results with www.bbb.com:, aaa.bbb.com: and so on. Now I have 0 results. About domain names, see my response to a similar question today on

RE: Partial word match

2012-04-09 Thread Steven A Rowe
Hi Hanu, Depending on the nature of the partial word match you're looking for - do you want to only match partial words that match at the beginning of the word? - you should look either at NGramTokenFilter or EdgeNGramTokenFilter:
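The distinction above can be shown without Lucene itself. The sketch below is a hypothetical helper (the class and method names are mine, not Lucene's) that produces the front-anchored prefixes an EdgeNGramTokenFilter would emit for a single token, which is the variant suited to matching only at the beginning of a word:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: mimics the terms a front-anchored edge n-gram filter
// emits for one token (prefixes of length minGram..maxGram).
public class EdgeNGramDemo {
    public static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int len = minGram; len <= maxGram && len <= token.length(); len++) {
            grams.add(token.substring(0, len)); // prefix of length len
        }
        return grams;
    }

    public static void main(String[] args) {
        // "partial" with gram sizes 2..4 -> prefixes pa, par, part
        System.out.println(edgeNGrams("partial", 2, 4));
    }
}
```

A plain NGramTokenFilter, by contrast, emits substrings starting at every position, so it also matches partial words in the middle or at the end.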

RE: HTML tags and Lucene highlighting

2012-04-05 Thread Steven A Rowe
Hi okayndc, What *do* you want? Steve -Original Message- From: okayndc [mailto:bodymo...@gmail.com] Sent: Thursday, April 05, 2012 1:34 PM To: java-user@lucene.apache.org Subject: HTML tags and Lucene highlighting Hello, I currently use Lucene version 3.0...probably need to upgrade

RE: HTML tags and Lucene highlighting

2012-04-05 Thread Steven A Rowe
(in the field configured to use HTMLStripCharFilter, anyway). So HTMLStripCharFilter should do what you want. Steve From: okayndc [mailto:bodymo...@gmail.com] Sent: Thursday, April 05, 2012 3:36 PM To: Steven A Rowe Cc: java-user@lucene.apache.org Subject: Re: HTML tags and Lucene highlighting

RE: Lucene tokenization

2012-03-27 Thread Steven A Rowe
Hi Nilesh, Which version of Lucene are you using? StandardTokenizer behavior changed in v3.1. Steve -Original Message- From: Nilesh Vijaywargiay [mailto:nilesh.vi...@gmail.com] Sent: Tuesday, March 27, 2012 2:04 PM To: java-user@lucene.apache.org Subject: Lucene tokenization I have

RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Steven A Rowe
Hi Ilya, What analyzers are you using at index-time and query-time? My guess is that you're using an analyzer that includes punctuation in the tokens it emits, in which case your index will have things like sentence. and sentence? in it, so querying for sentence will not match. Luke can tell
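The mismatch described above can be seen with a toy sketch (not Lucene code; the class and helper names are mine): an analyzer that keeps punctuation indexes the term "sentence.", which an exact lookup for "sentence" will never hit, while stripping trailing punctuation at both index and query time restores the match.

```java
public class PunctMismatchDemo {
    // naive whitespace tokenization keeps trailing punctuation on the token
    public static String lastWhitespaceToken(String text) {
        String[] parts = text.trim().split("\\s+");
        return parts[parts.length - 1];
    }

    // the same token with trailing punctuation stripped, as a word-aware
    // tokenizer would emit it
    public static String stripTrailingPunct(String token) {
        return token.replaceAll("\\p{Punct}+$", "");
    }

    public static void main(String[] args) {
        String indexed = lastWhitespaceToken("the end of a sentence.");
        System.out.println(indexed);                      // "sentence." - exact match against "sentence" fails
        System.out.println(stripTrailingPunct(indexed));  // "sentence" - matches
    }
}
```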

RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Steven A Rowe
index are UTF8. I am using the standard analyzer for English text and other contributed analyzers for respective foreign texts Thanks, Ilya -Original Message- From: Steven A Rowe [mailto:sar...@syr.edu] Sent: Monday, March 26, 2012 10:59 AM To: java-user@lucene.apache.org Subject

RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Steven A Rowe
On 3/26/2012 at 12:21 PM, Ilya Zavorin wrote: I am not seeing anything suspicious. Here's what I see in the HEX: n.e from pain.electricity: 6E-2E-0D-0A-0D-0A-65 (n-.-CR-LF-CR-LF-e) e.H from sentence.He: 65-2E-0D-0A-48 I agree, standard DOS/Windows line endings. I am pretty sure I am using

RE: What replaces IndexReader.openIfChanged in Lucene 4.0?

2012-03-05 Thread Steven A Rowe
IndexReader.openIfChanged in Lucene 4.0? On Mon, Mar 5, 2012 at 11:07 AM, Steven A Rowe sar...@syr.edu wrote: The second item in the top section in trunk CHANGES.txt (back compat policy changes): Could you guys put this on the web site (or a link to it)? Or try to get it to SEO more prominently? * LUCENE

RE: What replaces IndexReader.openIfChanged in Lucene 4.0?

2012-03-05 Thread Steven A Rowe
You want the lucene-queryparser jar. From trunk MIGRATE.txt: * LUCENE-3283: Lucene's core o.a.l.queryParser QueryParsers have been consolidated into module/queryparser, where other QueryParsers from the codebase will also be placed. The following classes were moved: -

RE: Customizing indexing of large files

2012-02-27 Thread Steven A Rowe
PatternReplaceCharFilter would probably work, or maybe a custom CharFilter? *CharFilter has the advantage of preserving original text offsets, for highlighting. Steve -Original Message- From: Glen Newton [mailto:glen.new...@gmail.com] Sent: Monday, February 27, 2012 12:57 PM To:

RE: StandardAnalyzer and Email Addresses

2012-02-26 Thread Steven A Rowe
UAX29URLEmailTokenizer (see http://goo.gl/evH97). There is no Analyzer available that uses this Tokenizer, but you can define your own one like StandardAnalyzer, but with this class as Tokenizer (not StandardTokenizer). I am not sure why there is no Analyzer implementation already available, maybe Steven Rowe

RE: Can I just add ShingleFilter to my nalayzer used for indexing and searching

2012-02-21 Thread Steven A Rowe
Hi Paul, Lucene QueryParser splits on whitespace and then sends individual words one-by-one to be analyzed. All analysis components that do their work based on more than one word, including ShingleFilter and SynonymFilter, are borked by this. (There is a JIRA issue open for the QueryParser

RE: Maven repository for lucene trunk

2012-02-14 Thread Steven A Rowe
Hi Sudarshan, I think this wiki page has the info you want: http://wiki.apache.org/lucene-java/HowNightlyBuildsAreMade Steve -Original Message- From: sudarsh...@gmail.com [mailto:sudarsh...@gmail.com] On Behalf Of Sudarshan Gaikaiwari Sent: Tuesday, February 14, 2012 10:01 PM To:

RE: Access next token in a stream

2012-02-09 Thread Steven A Rowe
Hi Damerian, One way to handle your scenario is to hold on to the previous token, and only emit a token after you reach at least the second token (or at end-of-stream). Your incrementToken() method could look something like: 1. Get current attributes: input.incrementToken() 2. If previous
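The buffering recipe above can be sketched outside Lucene. This is a minimal stand-in written over a plain Iterator rather than a TokenStream (a real TokenFilter would override incrementToken() and copy attributes instead): hold the previous token, and emit it only once the following token, or end-of-stream, has been seen.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Sketch of the "hold the previous token" pattern: each token is emitted one
// step late, so the filter has already seen what follows it.
public class LookaheadIterator implements Iterator<String> {
    private final Iterator<String> input;
    private String buffered; // the previously read token, not yet emitted

    public LookaheadIterator(Iterator<String> input) {
        this.input = input;
        this.buffered = input.hasNext() ? input.next() : null;
    }

    @Override public boolean hasNext() { return buffered != null; }

    @Override public String next() {
        if (buffered == null) throw new NoSuchElementException();
        String current = buffered;
        // advance the buffer: at this point a real filter could inspect the
        // upcoming token (or notice end-of-stream) before emitting `current`
        buffered = input.hasNext() ? input.next() : null;
        return current;
    }

    public static List<String> drain(Iterator<String> it) {
        List<String> out = new ArrayList<>();
        it.forEachRemaining(out::add);
        return out;
    }
}
```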

RE: Access next token in a stream

2012-02-09 Thread Steven A Rowe
To: java-user@lucene.apache.org Subject: Re: Access next token in a stream On 9/2/2012 8:54 PM, Steven A Rowe wrote: Hi Damerian, One way to handle your scenario is to hold on to the previous token, and only emit a token after you reach at least the second token (or at end

RE: Access next token in a stream

2012-02-09 Thread Steven A Rowe
Message- From: Damerian [mailto:dameria...@gmail.com] Sent: Thursday, February 09, 2012 5:00 PM To: java-user@lucene.apache.org Subject: Re: Access next token in a stream On 9/2/2012 10:51 PM, Steven A Rowe wrote: Damerian, The technique I mentioned would work for you

RE: Analysers for newspaper pages...

2011-11-28 Thread Steven A Rowe
Hi Dawn, I assume that when you refer to the impact of stop words, you're concerned about query-time performance? You should consider the possibility that performance without removing stop words is good enough that you won't have to take any steps to address the issue. That said, there are

RE: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-19 Thread Steven A Rowe
Hi Paul, On 10/19/2011 at 5:26 AM, Paul Taylor wrote: On 18/10/2011 15:25, Steven A Rowe wrote: On 10/18/2011 at 4:57 AM, Paul Taylor wrote: On 18/10/2011 06:19, Steven A Rowe wrote: Another option is to create a char filter that substitutes PUNCT-EXCLAMATION for exclamation points

RE: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-19 Thread Steven A Rowe
Hi Paul, What version of Lucene are you using? The JFlex spec you quote below looks pre-v3.1? Steve -Original Message- From: Paul Taylor [mailto:paul_t...@fastmail.fm] Sent: Wednesday, October 19, 2011 6:50 AM To: Steven A Rowe; java-user@lucene.apache.org 'java- u

RE: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-18 Thread Steven A Rowe
Hi Paul, On 10/18/2011 at 4:57 AM, Paul Taylor wrote: On 18/10/2011 06:19, Steven A Rowe wrote: Another option is to create a char filter that substitutes PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods, etc., Yes that is how I first did it No, I don't think you did

RE: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-17 Thread Steven A Rowe
Hi Paul, You could add a rule to the StandardTokenizer JFlex grammar to handle this case, bypassing its other rules. Another option is to create a char filter that substitutes PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods, etc., but only when the entire input consists

RE: setting MaxFieldLength in indexwriter

2011-09-28 Thread Steven A Rowe
Hi Peyman, The API docs give a hint http://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/index/IndexWriter.html: = Nested Class Summary ... static class IndexWriter.MaxFieldLength Deprecated. use LimitTokenCountAnalyzer instead. =

RE: Enabling indexing of hyphenated terms sans the hyphen

2011-09-19 Thread Steven A Rowe
Hi sbs, Solr's WordDelimiterFilterFactory does what you want. You can see a description of its function here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory. WordDelimiterFilter, the filter class implementing the above factory's functionality, is

RE: 4.0-SNAPSHOT in maven repo via Jenkins?

2011-07-25 Thread Steven A Rowe
Hi Eric, On 7/24/2011 at 3:07 AM, Eric Charles wrote: 0112233445566778 12345678901234567890123456789012345678901234567890123456789012345678901234567890 Jenkins jobs builds lucene trunk with 'mvn --batch-mode --non-recursive

RE: Some question about Lucene

2011-07-10 Thread Steven A Rowe
This slide show is a few years old, but I think it might be a good introduction for you to the differences between the projects: http://www.slideshare.net/dnaber/apache-lucene-searching-the-web-and-everything-else-jazoon07/ Steve -Original Message- From: Ing. Yusniel Hidalgo Delgado

RE: how are built the packages in the maven repository?

2011-07-06 Thread Steven A Rowe
Ant is the official Lucene/Solr build system. Snapshot and release artifacts are produced with Ant. While Maven is capable of producing artifacts, the artifacts produced in this way may not be the same as the official Ant artifacts. For this reason: no, the artifacts should not be built with

RE: Lucene Simple Project

2011-06-18 Thread Steven A Rowe
Hi Hamada, Do you know about the Lucene demo?: http://lucene.apache.org/java/3_2_0/demo.html Steve -Original Message- From: hamadazahera [mailto:hamadazah...@gmail.com] Sent: Saturday, June 18, 2011 9:30 AM To: java-user@lucene.apache.org Subject: Lucene Simple Project Hello

RE: Bug fix to contrib/.../IndexSplitter

2011-06-09 Thread Steven A Rowe
Hi Ivan, You do have rights to submit fixes to Lucene - everyone does! Here's how: http://wiki.apache.org/lucene-java/HowToContribute Please create a patch, create an issue in JIRA, and then attach the patch to the JIRA issue. When you do this, you are asked to state that you grant license

RE: FastVectorHighlighter StringIndexOutofBounds bug

2011-05-22 Thread Steven A Rowe
Hi WeiWei, Thanks for the report. Can you provide a self-contained unit test that triggers the bug? Thanks, Steve -Original Message- From: Weiwei Wang [mailto:ww.wang...@gmail.com] Sent: Monday, May 23, 2011 1:25 AM To: java-user@lucene.apache.org Subject: FastVectorHighlighter

RE: Query Parser, Unary Operators and Multi-Field Query

2011-05-20 Thread Steven A Rowe
Hi Renaud, That's normal behavior, since you have AND as default operator. This is equivalent to placing a + in front of every element of your query. In fact, if you removed the other two +s, you would get the same behavior. I think you'll get what you want by just switching the default

RE: Query Parser, Unary Operators and Multi-Field Query

2011-05-20 Thread Steven A Rowe
Hi Renaud, On 5/20/2011 at 1:58 PM, Renaud Delbru wrote: As said in http://lucidworks.lucidimagination.com/display/LWEUG/Boolean+Operators, if one or more of the terms in a term list has an explicit term operator (+ or - or relational operator) the rest of the terms will be treated as

RE: Lucene 3.3 in Eclipse

2011-05-15 Thread Steven A Rowe
Hi Cheng, Lucene 3.3 does not exist - do you mean branches/branch_3x ? FYI, as of Lucene 3.1, there is an Ant target you can use to setup an Eclipse project for Lucene/Solr - run this from the top level directory of a full source tree (including dev-tools/ directory) checked out from

RE: Lucene 3.3 in Eclipse

2011-05-15 Thread Steven A Rowe
, 2011 10:48 AM To: java-user@lucene.apache.org Cc: Steven A Rowe Subject: RE: Lucene 3.3 in Eclipse Steve, thanks for correction. You are right. The version is 3.0.3 released last Oct. I did place an ant jar in Eclipse, and it does the job to remove some compiling errors. However, it seems

RE: Can I omit ShingleFilter's filler tokens

2011-05-12 Thread Steven A Rowe
A thought: one way to do #1 without modifying ShingleFilter: if there were a StopFilter variant that accepted regular expressions instead of a stopword list, you could configure it with a regex like /_ .*|.* _| _ / (assuming a full match is required, i.e. implicit beginning and end anchors),
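The full-match behavior of that regex can be checked in plain Java. The sketch below (class and method names are mine; this is not the proposed StopFilter variant itself) applies the pattern from the message with String.matches(), which implies the beginning and end anchors, to the bigram output from the example elsewhere in this thread:

```java
import java.util.List;
import java.util.stream.Collectors;

// Drop any shingle the regex flags as containing a "_" filler token,
// keeping only shingles built entirely from real words.
public class FillerShingleFilterDemo {
    // the pattern from the message, applied with full-match semantics
    static final String FILLER = "_ .*|.* _| _ ";

    public static List<String> dropFillerShingles(List<String> shingles) {
        return shingles.stream()
                       .filter(s -> !s.matches(FILLER))
                       .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // bigram output for "one two three four five" with stopword "three"
        List<String> bigrams = List.of("one two", "two _", "_ four", "four five");
        System.out.println(dropFillerShingles(bigrams)); // [one two, four five]
    }
}
```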

RE: Can I omit ShingleFilter's filler tokens

2011-05-12 Thread Steven A Rowe
- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Thursday, May 12, 2011 1:15 PM To: java-user@lucene.apache.org Subject: Re: Can I omit ShingleFilter's filler tokens On Thu, May 12, 2011 at 1:03 PM, Steven A Rowe sar...@syr.edu wrote: A thought: one way to do #1 without modifying

RE: Can I omit ShingleFilter's filler tokens

2011-05-11 Thread Steven A Rowe
Hi Bill, I can think of two possible interpretations of removing filler tokens: 1. Don't create shingles across stopwords, e.g. for text one two three four five and stopword three, bigrams only, you'd get (one two, four five), instead of the current (one two, two _, _ four, four five). 2.
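The two outputs described for interpretation #1 versus the current behavior can be reproduced with a small sketch (plain Java, not ShingleFilter; the helper names are mine) for "one two three four five" with stopword "three", bigrams only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class ShingleBehaviorDemo {
    // current behavior: a stopword position becomes a "_" filler token,
    // then every adjacent pair forms a bigram
    public static List<String> bigramsWithFiller(List<String> tokens, Set<String> stop) {
        List<String> t = new ArrayList<>();
        for (String tok : tokens) t.add(stop.contains(tok) ? "_" : tok);
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < t.size(); i++) out.add(t.get(i) + " " + t.get(i + 1));
        return out;
    }

    // interpretation #1: never build a shingle across a stopword
    public static List<String> bigramsNoCross(List<String> tokens, Set<String> stop) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            if (!stop.contains(tokens.get(i)) && !stop.contains(tokens.get(i + 1))) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        return out;
    }
}
```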

RE: Can I omit ShingleFilter's filler tokens

2011-05-11 Thread Steven A Rowe
, Steven A Rowe sar...@syr.edu wrote: Hi Bill, I can think of two possible interpretations of removing filler tokens: 1. Don't create shingles across stopwords, e.g. for text one two three four five and stopword three, bigrams only, you'd get (one two, four five), instead of the current (one

RE: Lucene 3.0.3 with debug information

2011-04-29 Thread Steven A Rowe
Hi Paul, What did you find about Luke that's buggy? Bug reports are very useful; please contribute in this way. The official Lucene 3.0.3 distribution jars were compiled using the -g cmdline argument to javac - by default, though, only line number and source file information is generated.

RE: Lucene 3.0.3 with debug information

2011-04-29 Thread Steven A Rowe
Hi Paul, On 4/29/2011 at 4:14 PM, Paul Taylor wrote: On 29/04/2011 16:03, Steven A Rowe wrote: What did you find about Luke that's buggy? Bug reports are very useful; please contribute in this way. Please see previous post, in summary mistake on my part. Okay... Which previous post? I

RE: Lucene 3.0.3 with debug information

2011-04-29 Thread Steven A Rowe
Thanks Dawid. – Steve From: dawid.we...@gmail.com [mailto:dawid.we...@gmail.com] On Behalf Of Dawid Weiss Sent: Friday, April 29, 2011 4:45 PM To: java-user@lucene.apache.org Cc: Steven A Rowe Subject: Lucene 3.0.3 with debug information This is the e-mail you're looking for, Steven (it wasn't

RE: lucene 3.0.3 | QueryParser | MultiFieldQueryParser

2011-04-27 Thread Steven A Rowe
Ranjit, The problem is definitely the analyzer you are passing to QueryParser or MultiFieldQueryParser, and not the parser itself. The following tests succeed using KeywordAnalyzer, which is a pass-through analyzer (the output is the same as the input): public void testSharpQP() throws

RE: lucene 3.0.3 | QueryParser | MultiFieldQueryParser

2011-04-26 Thread Steven A Rowe
Hi Ranjit, I suspect the problem is not QueryParser, since the TERM definition includes the '#' character (from http://svn.apache.org/viewvc/lucene/java/tags/lucene_3_0_3/src/java/org/apache/lucene/queryParser/QueryParser.jj?view=markup#l1136): | <#_TERM_START_CHAR: ( ~[" ", "\t", "\n", "\r",

RE: lucene 3.0.3 | searching problem with *.docx file

2011-04-12 Thread Steven A Rowe
Hi Ranjit, Do you know about Luke? It will let you see what's in your index, and much more: http://code.google.com/p/luke/ Steve -Original Message- From: Ranjit Kumar [mailto:ranjit.ku...@otssolutions.com] Sent: Tuesday, April 12, 2011 9:05 AM To:

RE: Lucene 3.1

2011-04-05 Thread Steven A Rowe
Hi Tanuj, Can you be more specific? What file did you download? (Lucene 3.1 has three downloadable packages: -src.tar.gz, .tar.gz, and .zip.) What did you expect to find that is not there? (Some examples would help.) Steve -Original Message- From: Tanuj Jain

RE: word + ngram tokenization

2011-04-05 Thread Steven A Rowe
Hi Shambhu, ShingleFilter will construct word n-grams: http://lucene.apache.org/java/3_1_0/api/contrib-analyzers/org/apache/lucene/analysis/shingle/ShingleFilter.html Steve -Original Message- From: sham singh [mailto:shamsing...@gmail.com] Sent: Tuesday, April 05, 2011 5:53 PM To:

RE: lucene-snowball 3.1.0 packages are missing?

2011-04-03 Thread Steven A Rowe
Hi Alex, From Lucene contrib CHANGES.html http://lucene.apache.org/java/3_1_0/changes/Contrib-Changes.html#3.1.0.changes_in_backwards_compatibility_policy: 3. LUCENE-2226: Moved contrib/snowball functionality into contrib/analyzers. Be sure to remove any old obselete

RE: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?

2011-01-18 Thread Steven A Rowe
[x] ASF Mirrors (linked in our release announcements or via the Lucene website) [x] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.) [x] I/we build them from source via an SVN/Git checkout.

RE: lucene-based log searcher?

2011-01-13 Thread Steven A Rowe
Hi Paul, I saw this yesterday, but haven't tried it myself: http://karussell.wordpress.com/2010/10/27/feeding-solr-with-its-own-logs/ The author has a project called Sogger - Solr + Logger? - that can read various forms of logs. Steve -Original Message- From: Paul Libbrecht

RE: Re: Scale up design

2010-12-22 Thread Steven A Rowe
On 12/22/2010 at 2:38 AM, Ganesh wrote: Any other tips targeting 64 bit? If memory usage is an issue, you might consider using HotSpot's compressed oops option: http://wikis.sun.com/display/HotSpotInternals/CompressedOops

RE: Analyzer

2010-11-29 Thread Steven A Rowe
Hi Manjula, It's not terribly clear what you're doing here - I got lost in your description of your (two? or maybe four?) classes. Sometimes things are easier to understand if you provide more concrete detail. I suspect that you could benefit from reading the book Lucene in Action, 2nd

RE: IndexWriters and write locks

2010-11-10 Thread Steven A Rowe
NFS[1] != NTFS[2] [1] NFS: http://en.wikipedia.org/wiki/Network_File_System_%28protocol%29 [2] NTFS: http://en.wikipedia.org/wiki/NTFS -Original Message- From: Pulkit Singhal [mailto:pulkitsing...@gmail.com] Sent: Wednesday, November 10, 2010 2:55 PM To: java-user@lucene.apache.org

RE: Use of hyphens in StandardAnalyzer

2010-10-24 Thread Steven A Rowe
Hi Martin, StandardTokenizer and -Analyzer have been changed, as of future version 3.1 (the next release) to support the Unicode segmentation rules in UAX#29. My (untested) guess is that your hyphenated word will be kept as a single token if you set the version to 3.1 or higher in the

RE: Use of hyphens in StandardAnalyzer

2010-10-24 Thread Steven A Rowe
for a StandardAnalyzer has Version_30 as its highest value. Do you know when 3.1 is due? -Original Message- From: Steven A Rowe [mailto:sar...@syr.edu] Sent: 24 Oct 2010 21:31 To: java-user@lucene.apache.org Subject: RE: Use of hyphens in StandardAnalyzer Hi Martin, StandardTokenizer

RE: Issue with sentence specific search

2010-10-07 Thread Steven A Rowe
Hi Sirish, StandardTokenizer does not produce a token from '#', as you suspected. Something that fits the word definition, but which won't ever be encountered in your documents, is what you should use for the delimiter - something like a1b2c3c2b1a . Sentence boundary handling is clunky in

RE: Issue with sentence specific search

2010-10-06 Thread Steven A Rowe
Hi Sirish, I think I understand within sentence phrase search - you want the entire phrase to be within a single sentence. But can you give an example of non sentence specific phrase search? It's not clear to me how useful such capability would be. Steve -Original Message- From:

RE: Issue with sentence specific search

2010-10-06 Thread Steven A Rowe
Hi Sirish, Have you looked at SpanQuery's yet?: http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/spans/package-summary.html See also this Lucid Imagination blog post by Mark Miller: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/ One common technique,

RE: Updating documents with fields that aren't stored

2010-10-04 Thread Steven A Rowe
This is not a defect: http://wiki.apache.org/lucene-java/LuceneFAQ#Does_Lucene_allow_searching_and_indexing_simultaneously.3F. -Original Message- From: Justin [mailto:cry...@yahoo.com] Sent: Monday, October 04, 2010 2:03 PM To: java-user@lucene.apache.org Subject: Updating documents

RE: Updating documents with fields that aren't stored

2010-10-04 Thread Steven A Rowe
: Steven A Rowe sar...@syr.edu To: java-user@lucene.apache.org java-user@lucene.apache.org Sent: Mon, October 4, 2010 1:05:36 PM Subject: RE: Updating documents with fields that aren't stored This is not a defect: http://wiki.apache.org/lucene- java/LuceneFAQ

RE: Hierarchical Fields

2010-09-15 Thread Steven A Rowe
Hi Iam, Can you say why you don't like the proposed solution? Also, the example of the scoring you're looking for doesn't appear to be hierarchical in nature - can you illustrate the relationship between the tokens in [token1, token2, token3]? Also, why do you want token1 to contribute

RE: Hierarchical Fields

2010-09-15 Thread Steven A Rowe
, 2010 at 12:52 PM, Steven A Rowe sar...@syr.edu wrote: Hi Iam, Can you say why you don't like the proposed solution? Also, the example of the scoring you're looking for doesn't appear to be hierarchical in nature - can you give illustrate the relationship between the tokens in [token1

RE: Search results include results with excluded terms

2010-08-16 Thread Steven A Rowe
Hi Christoph, There could be several things going on, but it's difficult to tell without more information. Since excluded terms require a non-empty set from which to remove documents at the same boolean clause level, you could try something like title:(*:* -Datei*) avl, or -title:Datei*

RE: Search results include results with excluded terms

2010-08-16 Thread Steven A Rowe
Oops, setLowercaseExpandedTerms() is an instance method, not static. I wrote: QueryParser has a static method setLowercaseExpandedTerms() that you can call to turn on automatic pre-expansion query term downcasing:

RE: InverseWildcardQuery

2010-07-30 Thread Steven A Rowe
Hi Justin, [...] *:* AND -myfield:foo*. If my document contains myfield:foobar and myfield:dog, the document would be thrown out because of the first field. I want to keep the document because the second field does not match. I'm assuming that you mistakenly used the same field name above

RE: InverseWildcardQuery

2010-07-30 Thread Steven A Rowe
Hi Justin, Unfortunately the suffix requires a wildcard as well in our case. There are a limited number of prefixes though (10ish), so perhaps we could combine them all into one query. We'd still need some sort of InverseWildcardQuery implementation. use another analyzer so you don't need

RE: InverseWildcardQuery

2010-07-30 Thread Steven A Rowe
Hi Justin, an example: PerFieldAnalyzerWrapper analyzers = new PerFieldAnalyzerWrapper(new KeywordAnalyzer()); // "myfield" defaults to KeywordAnalyzer analyzers.addAnalyzer("content", new SnowballAnalyzer(luceneVersion, "English")); // analyzers affects the indexed field value

RE: InverseWildcardQuery

2010-07-30 Thread Steven A Rowe
you want what Lucene already does, but that's clearly not true Hmmm, let's pretend that contents field in my example wasn't analyzed at index time. The unstemmed form of terms will be indexed. But if I query with a stemmed form or use QueryParser with the SnowballAnalyzer, I'm not going

RE: ShingleFilter failing with more terms than index phrase

2010-07-13 Thread Steven A Rowe
Hi Ethan, You'll probably get better answers about Solr specific stuff on the solr-u...@a.l.o list. Check out PositionFilterFactory - it may address your issue: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory Steve -Original Message- From:

RE: URL Tokenization

2010-06-24 Thread Steven A Rowe
. Thanks, Sudha On Wed, Jun 23, 2010 at 12:21 PM, Steven A Rowe sar...@syr.edu wrote: Hi Sudha, There is such a tokenizer, named NewStandardTokenizer, in the most recent patch on the following JIRA issue: https://issues.apache.org/jira/browse/LUCENE-2167 It keeps (HTTP

RE: URL Tokenization

2010-06-23 Thread Steven A Rowe
Hi Sudha, There is such a tokenizer, named NewStandardTokenizer, in the most recent patch on the following JIRA issue: https://issues.apache.org/jira/browse/LUCENE-2167 It keeps (HTTP(S), FTP, and FILE) URLs together as single tokens, and e-mails too, in accordance with the relevant IETF

RE: search hits not returned until I stop and restart application

2010-06-21 Thread Steven A Rowe
Hi Andy, From the API docs for IndexWriter http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/index/IndexWriter.html: [D]ocuments are added with addDocument and removed with deleteDocuments(Term) or deleteDocuments(Query). A document can be updated with

RE: search hits not returned until I stop and restart application

2010-06-21 Thread Steven A Rowe
Andy, I think batching commits either by time or number of documents is common. Do you know about NRT (Near Realtime Search)?: http://wiki.apache.org/lucene-java/NearRealtimeSearch. Using IndexWriter.getReader(), you can avoid commits altogether, as well as reducing update-search latency.

RE: Reverse Searching

2010-05-17 Thread Steven A Rowe
Hi Siraj, Lucene's MemoryIndex can be used to serve this purpose. From http://lucene.apache.org/java/3_0_1/api/contrib-memory/org/apache/lucene/index/memory/MemoryIndex.html: [T]his class targets fulltext search of huge numbers of queries over comparatively small transient

RE: Reverse Searching

2010-05-17 Thread Steven A Rowe
dilemma is, I might have upto 100,000 queries to run against it. Do you think this route will give me results in reasonable amount of time, i.e. in a few seconds? thanks -siraj On 5/17/2010 5:21 PM, Steven A Rowe wrote: Hi Siraj, Lucene's MemoryIndex can be used to serve this purpose

RE: PrefixQuery and special characters

2010-04-14 Thread Steven A Rowe
Hi Franz, The likely problem is that you're using an index-time analyzer that strips out the parentheses. StandardAnalyzer, for example, does this; WhitespaceAnalyzer does not. Remember that hits are the result of matches between index-analyzed terms and query-analyzed terms. Except in the

RE: Lucene query with long strings

2010-03-23 Thread Steven A Rowe
Hi Aaron, Your false positives comments point to a mismatch between what you're currently asking Lucene for (any document matching any one of the terms in the query) and what you want (only fully correct matches). You need to identify the terms of the query that MUST match and tell Lucene

RE: Increase number of available positions?

2010-03-17 Thread Steven A Rowe
Hi Rene, On 03/17/2010 at 11:17 AM, Rene Hackl-Sommer wrote: <SpanNot fieldName="MyField"> <Include> <!-- Gets all the matching spans within L_2 boundaries and includes them --> <SpanNot> <Include> <SpanNear slop="2147483647" inOrder="false"> <SpanTerm>t293</SpanTerm> <SpanTerm>t4979</SpanTerm> </SpanNear>

RE: Increase number of available positions?

2010-03-15 Thread Steven A Rowe
Hi Rene, Why can't you use a different field for each of the Level_X's, i.e. MyLevel1Field, MyLevel2Field, MyLevel3Field? On 03/15/2010 at 9:59 AM, Rene Hackl-Sommer wrote: Search in MyField: Terms T1 and T2 on Level_2 and T3, T4, and T5 on Level_3, which should both be in the same

RE: Increase number of available positions?

2010-03-15 Thread Steven A Rowe
Hi Rene, Have you seen SpanNotQuery?: http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/search/spans/SpanNotQuery.html For a document that looks like: <Level_1 id="1"> <Level_2 id="1"> <Level_3 id="1">T1 T2 T3</Level_3> <Level_3 id="2">T4 T5 T6</Level_3> <Level_3 id="3">T7 T8 T9</Level_3>

RE: Searching Subversion comments:

2010-03-08 Thread Steven A Rowe
Hi Erick, On 03/08/2010 at 3:48 PM, Erick Erickson wrote: Is there any convenient way to, say, find all the files associated with patch ? I realize one can (hopefully) get this information from JIRA, but... This is a subset of the problem of searching Subversion comments. I know of two

RE: Reverse Search

2010-03-01 Thread Steven A Rowe
Hi Mark, On 03/01/2010 at 3:35 PM, Mark Ferguson wrote: I will be processing short bits of text (Tweets for example), and need to search them to see if they contain certain terms. You might consider, instead of performing reverse search, just querying all of your locations against one document at a

RE: Match span of capitalized words

2010-02-05 Thread Steven A Rowe
Hi Max, On 02/05/2010 at 10:18 AM, Grant Ingersoll wrote: On Feb 3, 2010, at 8:57 PM, Max Lynch wrote: Hi, I would like to do a search for Microsoft Windows as a span, but not match if words before or after Microsoft Windows are upper cased. For example, I want this to match: another

RE: Unexpected Query Results

2010-02-04 Thread Steven A Rowe
Hi Jamie, Since phrase query terms aren't analyzed, you're getting exact matches for terms было and время, but when you search for them individually, they are analyzed, and it is the analyzed query terms that fail to match against the indexed terms. Sounds to me like your index-time and

RE: Analyzer for stripping non alpha-numeric characters?

2010-02-04 Thread Steven A Rowe
Hi Jason, Solr's PatternReplaceFilter(ts, "\\P{Alnum}+$", "", false) should work, chained after an appropriate tokenizer. Steve On 02/04/2010 at 12:18 PM, Jason Rutherglen wrote: Is there an analyzer that easily strips non alpha-numeric from the end of a token?

RE: Unexpected Query Results

2010-02-04 Thread Steven A Rowe
On 02/04/2010 at 3:24 PM, Chris Hostetter wrote: : Since phrase query terms aren't analyzed, you're getting exact : matches quoted phrase passed to the QueryParser are analyzed -- but they are analyzed as complete strings, so Analyzers that treat whitespace special may produce different

RE: combine query score with external score

2010-01-28 Thread Steven A Rowe
Hi Dennis, You should check out payloads (arbitrary per-index-term byte[] arrays), which can be used to encode values which are then incorporated into documents' scores, by overriding Similarity.scorePayload():

RE: RangeFilter

2010-01-13 Thread Steven A Rowe
Hi AlexElba, The problem is that Lucene only knows how to handle character strings, not numbers. Lexicographically, "3" > "10", so you get the expected results (nothing). The standard thing to do is transform your numbers into strings that sort as you want them to. E.g., you can left-pad the
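The left-padding trick above is a one-liner in plain Java. This sketch (the class and method names are mine, for illustration) pads non-negative numbers to a fixed width so that lexicographic string order agrees with numeric order; note that padding alone does not handle negative numbers, which need extra treatment.

```java
public class PadDemo {
    // left-pad to a fixed width so lexicographic order matches numeric order
    public static String pad(long n, int width) {
        return String.format("%0" + width + "d", n);
    }

    public static void main(String[] args) {
        System.out.println("3".compareTo("10") > 0);             // true: plain strings sort wrong
        System.out.println(pad(3, 5).compareTo(pad(10, 5)) < 0); // true: padded strings sort right
    }
}
```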

RE: RangeFilter

2010-01-13 Thread Steven A Rowe
Hi AlexElba, Did you completely re-index? If you did, then there is some other problem - can you share (more of) your code? Do you know about Luke? It's an essential tool for Lucene index debugging: http://www.getopt.org/luke/ Steve On 01/13/2010 at 8:34 PM, AlexElba wrote: Hello,

RE: TopFieldDocCollector and v3.0.0

2009-12-08 Thread Steven A Rowe
Hi Uwe, On 12/08/2009 at 9:40 AM, Uwe Schindler wrote: After the move to 3.0, you can (but you must not) further update your code to use generics, which is not really needed but will remove all compiler warnings. This sounds like you're telling people that although they are able to update

RE: Hits and TopDoc

2009-10-20 Thread Steven A Rowe
Hi Nathan, On 10/20/2009 at 5:03 PM, Nathan Howard wrote: This is sort of related to the above question, but I'm trying to update some (now depricated) Java/Lucene code that I've become aware of once we started using 2.4.1 (we were previously using 2.3.2): Hits results =

RE: Hits and TopDoc

2009-10-20 Thread Steven A Rowe
something with current hit -Yonik http://www.lucidimagination.com On Tue, Oct 20, 2009 at 5:27 PM, Steven A Rowe sar...@syr.edu wrote: Hi Nathan, On 10/20/2009 at 5:03 PM, Nathan Howard wrote: This is sort of related to the above question, but I'm trying to update some (now
