Hi Steve,

Thanks for the quick reply and for implementing support for URL tokenization. I have another newbie question, this time about applying the patch.
I have the Lucene 3.0.2 source. I downloaded the patch and tried to apply it:

    lucene-3.0.2> patch -p0 < LUCENE-2167.patch

It comes back with this error message:

    ....(output truncated)
    can't find file to patch at input line 13106
    Perhaps you used the wrong -p or --strip option?
    The text leading up to this was:

Looking at that line, the patch is trying to find modules/analysis/common/build.xml, which is not part of the official 3.0.2 source release. Thinking about it, maybe I need to use the latest source (or a nightly build), but I couldn't figure out how to get that. The Hudson link for nightly builds on the Apache Lucene site seems to be broken. Or maybe I have a different problem.

I'd appreciate any help.

Thanks,
Sudha

On Wed, Jun 23, 2010 at 12:21 PM, Steven A Rowe <sar...@syr.edu> wrote:
> Hi Sudha,
>
> There is such a tokenizer, named NewStandardTokenizer, in the most recent
> patch on the following JIRA issue:
>
> https://issues.apache.org/jira/browse/LUCENE-2167
>
> It keeps (HTTP(S), FTP, and FILE) URLs together as single tokens, and
> e-mails too, in accordance with the relevant IETF RFCs.
>
> Steve
>
> > -----Original Message-----
> > From: Sudha Verma [mailto:verma.su...@gmail.com]
> > Sent: Wednesday, June 23, 2010 2:07 PM
> > To: java-user@lucene.apache.org
> > Subject: URL Tokenization
> >
> > Hi,
> >
> > I am new to lucene and I am using Lucene 3.0.2.
> >
> > I am using Lucene to parse text which may contain URLs. I noticed the
> > StandardTokenizer keeps the email addresses in one token, but not the
> > URLs.
> > I also looked at Solr wiki pages, and even though the wiki page for
> > solr.StandardTokenizerFactory says it keeps track of the URL token type -
> > it does not seem to be the case.
> >
> > Is there an Analyzer implementation that can keep the URLs intact into one
> > token? or does anyone have an example of that for Solr or Lucene?
> >
> > Thanks much,
> > Sudha
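
For the "can't find file to patch" error described at the top of this message, one way to see how far a patch and a source tree have diverged is to list the paths the patch targets and check which are absent. The following is only a minimal sketch under stated assumptions: it assumes a unified diff whose target paths appear on "+++ " header lines, and the function names patch_targets and missing_targets are hypothetical helpers, not part of Lucene or of the patch tool. Files that the patch itself creates will also show up as missing, so the output is a hint rather than a verdict.

```python
import os
import re
import tempfile

# Matches the "+++ <path>" header line of a unified diff. The path is taken
# to end at the first tab (svn-style diffs put "(working copy)" after a tab).
# Limitation: an added content line that itself starts with "++ " would also
# match; this sketch does not track hunk boundaries.
TARGET_RE = re.compile(r"^\+\+\+ ([^\t\n]+)")


def patch_targets(patch_text, strip=0):
    """Return target paths named in a unified diff.

    `strip` removes that many leading path components, mirroring the
    meaning of the -p option to patch.
    """
    targets = []
    for line in patch_text.splitlines():
        match = TARGET_RE.match(line)
        if not match:
            continue
        path = match.group(1).strip()
        if path == "/dev/null":  # deletion marker, not a real target
            continue
        targets.append("/".join(path.split("/")[strip:]))
    return targets


def missing_targets(patch_text, root, strip=0):
    """Return the target paths that do not exist under `root`."""
    return [
        path
        for path in patch_targets(patch_text, strip)
        if not os.path.exists(os.path.join(root, path))
    ]


if __name__ == "__main__":
    # Hypothetical two-file patch; only the second path comes from this thread.
    sample = (
        "--- a.txt\t(revision 1)\n"
        "+++ a.txt\t(working copy)\n"
        "@@ -1 +1 @@\n-x\n+y\n"
        "--- modules/analysis/common/build.xml\t(revision 1)\n"
        "+++ modules/analysis/common/build.xml\t(working copy)\n"
        "@@ -1 +1 @@\n-x\n+y\n"
    )
    with tempfile.TemporaryDirectory() as root:
        # Simulate a tree that contains a.txt but has no modules/ directory.
        open(os.path.join(root, "a.txt"), "w").close()
        for path in missing_targets(sample, root):
            print("missing:", path)
```

If many paths under a directory such as modules/ are reported missing, that would be consistent with the patch having been generated against a different branch than the tree it is being applied to, rather than with a wrong -p value.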