Re: How to index IP addresses?

ohaya Thu, 30 Jul 2009 07:38:12 -0700

Hi Matthew and Narcis,

I think that I found the (original) problem.


It looks like all those other terms that I was getting, which looked to me like 
the octets, weren't the octets :)...

When I was doing the doc.add(), there were some other numbers (not IP 
addresses) in the String that I was passing to doc.add(...).

BTW, I did try Narcis' suggestion, changing to NOT_ANALYZED, before I found my 
problem. That appeared to index the entire string that I was passing to 
doc.add(...) as a single term, so when I searched, I didn't get any results.

So, I think the original ANALYZED is ok.
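
To illustrate what I think happened, here is a rough sketch in plain Java. It 
is not Lucene code: the class name, the method names, and the sample string are 
all made up for illustration, and a real analyzer does more than split on 
whitespace. It only shows why the extra numbers turned up as terms, and why 
treating the whole string as one term found nothing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for the two indexing modes; this is NOT Lucene code.
public class TermSketch {

    // Roughly what an analyzed field gives you: one term per
    // whitespace-separated token (real analyzers do more than this).
    public static List<String> analyzedTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Roughly what a not-analyzed field gives you: the whole string as one term.
    public static List<String> notAnalyzedTerms(String text) {
        return Arrays.asList(text);
    }

    public static void main(String[] args) {
        // Made-up sample: an address plus other numbers in the same String.
        String info = "10.1.2.3 port 8080 count 42";
        String query = "10.1.2.3";

        // The address is one of several terms, next to "8080" and "42".
        System.out.println(analyzedTerms(info).contains(query));    // true

        // The only term is the full string, so an exact lookup misses.
        System.out.println(notAnalyzedTerms(info).contains(query)); // false
    }
}
```

With the whole string as the only term, a search for just the address can only 
match if the query equals that entire string.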

Sorry about that!!

Jim




---- Matthew Hall <mh...@informatics.jax.org> wrote: 
> I'm a little unclear on how you could be getting both "aa.bb.cc.dd" as a 
> term, and then also the octets.
> 
> Are you adding the "contents" field into the index multiple times, 
> possibly with separate analyzers?
> 
> Could you possibly try a test, very simple case?
> 
> Just create an index with a single Lucene document, with that document's 
> contents being "aa.bb.cc.dd", and then take a look at the index via Luke 
> again.
> 
> When you look at the terms section (it's what comes up by default) you 
> SHOULD see only "aa", "bb", "cc", and "dd" as the top (and thus the ONLY) 
> terms in the index.  This could vary depending on your analyzer, as 
> some will produce an index containing only a single term, "aa.bb.cc.dd".  
> What I would not expect is an index that contains both.
> 
> Furthermore, by making the field not analyzed you will now have a 
> trickier time searching for it, as you will need to use a keyword 
> analyzer or something similar to search, which, if I'm understanding the 
> spirit of your problem, isn't really something that you want to do.
> 
> So, if you could run the test scenario that I've outlined, I 
> think you should have a nice test bed to see what effect 
> swapping to different analyzers has on the data that you are 
> trying to index.  Then, after you have played with that a bit, you should 
> be able to re-expand your corpus and see if the analyzer you have 
> chosen continues to stand up.
> 
> I... had thought that StandardAnalyzer already kept IP addresses together 
> as a single token, but maybe it's doing something... special and 
> interesting, and thus you are seeing the behavior that you are describing.
> 
> Matt
> 
> oh...@cox.net wrote:
> > Hi,
> >
> > Oh.  Ok, thanks!  I'll give that a try.
> >
> > Jim
> >
> >
> > ---- "Armasu wrote: 
> >   
> >> Keyword: Field.Index.NOT_ANALYZED
> >>
> >> -----Original Message-----
> >> From: oh...@cox.net [mailto:oh...@cox.net] 
> >> Sent: Thursday, July 30, 2009 4:36 PM
> >> To: java-user@lucene.apache.org
> >> Subject: How to index IP addresses?
> >>
> >> Hi,
> >>
> >> I am trying to index information in some proprietary-formatted files.  
> >>
> >> In particular, these files contain some IP addresses in dotted notation, 
> >> e.g., aa.bb.cc.dd.
> >>
> >> For my initial test, I have a Document implementation, and after I extract 
> >> what I need into a String named "Info", I do:
> >>
> >> doc.add(new Field("contents", Info, Field.Store.YES, 
> >> Field.Index.ANALYZED));
> >>
> >> From looking at the resulting index using Luke, it appears that I am 
> >> getting terms for the full IP address string (e.g., "aa.bb.cc.dd"), but I 
> >> am also getting terms for each octet of each IP address string, e.g.:
> >>
> >> aa
> >> bb
> >> cc
> >> dd
> >>
> >> I'm still just getting started with Lucene, but from the research that 
> >> I've done, it seems like Lucene is treating the "." in the dotted notation 
> >> strings as "noise".  Is that correct?
> >>
> >> If so, is there a way to get it not to do that?
> >>
> >> Thanks,
> >> Jim
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >>
> >>     
> >
> 
> 
> -- 
> Matthew Hall
> Software Engineer
> Mouse Genome Informatics
> mh...@informatics.jax.org
> (207) 288-6012
> 
> 


