Hi Matthew and Narcis, I think that I found the (original) problem.
It looks like all those other terms that I was getting, which looked to me like the octets, weren't the octets :)... When I was doing the doc.add(), there were some other numbers (not IP addresses) in the String that I was passing to doc.add(...).

BTW, I did try Narcis' suggestion, changing to NOT_ANALYZED, before I found my problem, and that appeared to make the entire string that I was passing to doc.add(...) a single term, so when I searched, I didn't get any results. So, I think the original ANALYZED is ok.

Sorry about that!!

Jim

---- Matthew Hall <mh...@informatics.jax.org> wrote:
> I'm a little unclear on how you could be getting both "aa.bb.cc.dd" as a
> term, and then also the octets.
>
> Are you adding the "contents" field into the index multiple times,
> possibly with separate analyzers?
>
> Could you possibly try a test, very simple case?
>
> Just create an index with a single lucene document, with that document's
> contents being "aa.bb.cc.dd" and then take a look at the index via Luke
> again.
>
> When you look at the terms section (It's what comes up by default) you
> SHOULD see only "aa", "bb", "cc", and "dd" as the top (and thusly ONLY
> terms in the index). This could vary depending on your analyzer, as
> some will show an index containing only a single term "aa.bb.cc.dd".
> What I would not expect is an index that would contain both.
>
> Furthermore by making the field not analyzed you will now have a
> trickier time searching for it, as you will need to use a keyword
> analyzer or something similar to search, which if I'm understanding the
> spirit of your problem isn't really something that you want to do.
>
> So, if you could run that test scenario that I've outlined for you I
> think you should be able to have a nice test bed to see what effect
> swapping to different analyzers will have on the data that you are
> trying to index.
> Then, after you have played with that a bit you should
> be able to re-expand your corpus again, and see if the analyzer you have
> chosen continues to stand up.
>
> I.. had thought that StandardAnalyzer already kept IP addresses together
> as a single token, but maybe it's doing something... special and
> interesting and thusly you are seeing the behavior that you are describing.
>
> Matt
>
> oh...@cox.net wrote:
> > Hi,
> >
> > Oh. Ok, thanks! I'll give that a try.
> >
> > Jim
> >
> >
> > ---- "Armasu wrote:
> >
> >> Keyword: Field.Index.NOT_ANALYZED
> >>
> >> -----Original Message-----
> >> From: oh...@cox.net [mailto:oh...@cox.net]
> >> Sent: Thursday, July 30, 2009 4:36 PM
> >> To: java-user@lucene.apache.org
> >> Subject: How to index IP addresses?
> >>
> >> Hi,
> >>
> >> I am trying to index information in some proprietary-formatted files.
> >>
> >> In particular, these files contain some IP addresses in dotted notation,
> >> e.g., aa.bb.cc.dd.
> >>
> >> For my initial test, I have a Document implementation, and after I extract
> >> what I need into a String named "Info", I do:
> >>
> >> doc.add(new Field("contents", Info, Field.Store.YES,
> >> Field.Index.ANALYZED));
> >>
> >> From looking at the resulting index using Luke, it appears that I am
> >> getting terms for the full IP address string (e.g., "aa.bb.cc.dd"), but I
> >> am also getting terms for each octet of each IP address string, e.g.:
> >>
> >> aa
> >> bb
> >> cc
> >> dd
> >>
> >> I'm still just getting started with Lucene, but from the research that
> >> I've done, it seems like Lucene is treating the "." in the dotted notation
> >> strings as "noise". Is that correct?
> >>
> >> If so, is there a way to get it not to do that?
> >>
> >> Thanks,
> >> Jim
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >
> > Amazon Development Center (Romania) S.R.L. registered office: 37 Lazar
> > Street, floor 5, Iasi, Iasi County, Iasi 700049, Romania. Registered in
> > Romania. Registration number J40/12967/2005.
> >
>
> --
> Matthew Hall
> Software Engineer
> Mouse Genome Informatics
> mh...@informatics.jax.org
> (207) 288-6012
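
For anyone reading this thread later, here is a minimal plain-Java sketch of the outcome Jim describes. It makes no Lucene calls and is only a model of the behavior, not Lucene's implementation. The class name, both method names, the sample string, and the tokenizing rule are all hypothetical. The rule assumes an analyzer that keeps dotted addresses together as one token, which is what Matthew recalls of StandardAnalyzer but which this thread does not confirm.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only: no Lucene classes are used here.
class IpTermSketch {

    // Stand-in for an ANALYZED field. ASSUMPTION: the analyzer splits on
    // anything that is not a letter, digit, or dot, so "10.1.2.3" stays whole.
    static List<String> analyzedTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^A-Za-z0-9.]+")) {
            // Drop dots left over from sentence punctuation, e.g. "3." -> "3".
            String trimmed = token.replaceAll("^\\.+|\\.+$", "");
            if (!trimmed.isEmpty()) {
                terms.add(trimmed);
            }
        }
        return terms;
    }

    // Stand-in for a NOT_ANALYZED field: the whole value becomes one term.
    static List<String> notAnalyzedTerms(String text) {
        return Collections.singletonList(text);
    }

    public static void main(String[] args) {
        // Hypothetical field value: one IP address plus unrelated numbers.
        String info = "host 10.1.2.3 port 8080 retries 3";

        List<String> analyzed = analyzedTerms(info);
        List<String> notAnalyzed = notAnalyzedTerms(info);

        System.out.println("analyzed terms:     " + analyzed);
        System.out.println("not-analyzed terms: " + notAnalyzed);

        // An exact-term lookup for the address only matches the analyzed form.
        System.out.println("analyzed has IP:     " + analyzed.contains("10.1.2.3"));
        System.out.println("not-analyzed has IP: " + notAnalyzed.contains("10.1.2.3"));
    }
}
```

In this model the stray "3" and "8080" show up as terms of their own, which is how unrelated numbers in the string can be mistaken for octets in a term listing. The not-analyzed form holds a single term equal to the whole string, so a lookup for just the address finds nothing.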