How about replacing "-" with some arbitrary character sequence with MappingCharFilter before tokenizer and then restoring that '-' with PatternReplaceFilter after the tokenizer?
May be you can just eat '-' with charFilter so that Lawton-Browne becomes LawtonBrowne. --- On Mon, 10/25/10, Martin O'Shea <app...@dsl.pipex.com> wrote: > From: Martin O'Shea <app...@dsl.pipex.com> > Subject: FW: Use of hyphens in StandardAnalyzer > To: java-user@lucene.apache.org > Date: Monday, October 25, 2010, 12:28 AM > A good suggestion. But I'm using > Lucene 3.0.2 and the constructor for a StandardAnalyzer has > Version_30 as its highest value. Do you know when 3.1 is > due? > > -----Original Message----- > From: Steven A Rowe [mailto:sar...@syr.edu] > Sent: 24 Oct 2010 21 31 > To: java-user@lucene.apache.org > Subject: RE: Use of hyphens in StandardAnalyzer > > Hi Martin, > > StandardTokenizer and -Analyzer have been changed, as of > future version 3.1 (the next release) to support the Unicode > segmentation rules in UAX#29. My (untested) guess is > that your hyphenated word will be kept as a single token if > you set the version to 3.1 or higher in the constructor. > > Steve > > > -----Original Message----- > > From: Martin O'Shea [mailto:app...@dsl.pipex.com] > > Sent: Sunday, October 24, 2010 3:59 PM > > To: java-user@lucene.apache.org > > Subject: Use of hyphens in StandardAnalyzer > > > > Hello > > > > > > > > I have a StandardAnalyzer working which retrieves > words and frequencies > > from > > a single document using a TermVectorMapper which is > populating a HashMap. > > > > > > > > But if I use the following text as a field in my > document, i.e. > > > > > > > > addDoc(w, "lucene Lawton-Browne Lucene"); > > > > > > > > The word frequencies returned in the HashMap are: > > > > > > > > browne 1 > > > > lucene 2 > > > > lawton 1 > > > > > > > > The problem is the words 'lawton' and 'browne'. If > this is an actual > > 'double-barreled' name, can Lucene recognise it as > 'Lawton-Browne' where > > the > > name is actually a single word? > > > > > > > > I've tried combinations of: > > > > > > > > addDoc(w, "lucene \"Lawton-Browne\" Lucene"); > > > > > > > > And single quotes but without success. > > > > > > > > Thanks > > > > > > > > Martin O'Shea. > > > > > > > > > > > > > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org