I am following this thread because I have a similar issue to deal with in my upcoming development. Howie, thanks for your insights; I think this may solve my problem.

I am trying to index Title 26 of the US Code:
http://www.access.gpo.gov/uscode/title26/title26.html

The problem is that I don't want the search engine's users to have to go crazy trying to find a particular code section. Generally, users cite the code in this format: 26USC1, which translates to Title 26, Section 1. Fortunately, the government puts the citation at the top of each page, e.g. [CITE: 26USC1]. See the top of this page:
http://frwebgate.access.gpo.gov/cgi-bin/getdoc.cgi?dbname=browse_usc&doc id=Cite:+26USC1

My goal is to parse that citation out so that users can search on it. So would I do something like:

1. parse out the citation
2. metadata.put(<citation>, <citation>);

?

Thanks for your help on this.

-----Original Message-----
From: Raghavendra Prabhu [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 09, 2006 2:53 AM
To: [email protected]
Subject: Re: writing a metadata content tag

Hi Howie

That is what i am looking at it
But as you said generalize for all requirements including intranet requirement
I am better off doing what u said

Rgds
Prabu

On 3/9/06, Howie Wang <[EMAIL PROTECTED]> wrote:
>
> >What i want to do is i should add some header info in parse-filter
> >which will be used by index-filter to add my own nature of the new
> >FIELD
> >
> >Rgds
> >Prabhu
>
> I would recommend doing it at the index phase if possible. If the end
> goal is to have it searchable from the index, ask if you really need
> to have the information at the parsing stage. If you decide you want
> to tweak your keywords, it's easy to re-index. If you do it at the
> parsing stage, it will take twice as long since you have to re-parse
> and then re-index. Plus re-parsing is not complicated, but involves
> kind of a hack with renaming a bunch of directories.
>
> One reason to do your analysis at parse time is that it's easier to
> get the entire page contents like HTML tags in case you need that for
> categorization.
> If you don't need this stuff, you probably don't need to categorize
> at the parsing phase.
>
> If you really want to do it at parse time, it's not difficult. Take a
> look at parse-html. You can use the metadata object to store your
> category. Look in HtmlParseFilter.java in getParse. Just do:
>
> metadata.put("myfield", "sports");
>
> In your index filter, you can then do a metadata.get to get your
> category and then index it.
>
> Howie

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
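
For the citation question above (parse out the citation, then put it in the metadata), here is a minimal sketch of the parsing step in plain Java. It assumes the citation always appears in the page text in the bracketed form [CITE: 26USC1]. The class name CitationExtractor, the field name "citation", and the plain Map standing in for the Nutch metadata object are all made up for illustration and are not part of Nutch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: pulls "26USC1" out of "[CITE: 26USC1]".
class CitationExtractor {

    // Assumed citation layout: "[CITE: <title>USC<section>]".
    // Group 1 is the title number, group 2 the section (digits, letters, hyphens).
    private static final Pattern CITE =
        Pattern.compile("\\[CITE:\\s*(\\d+)\\s*USC\\s*([0-9A-Za-z-]+)\\s*\\]");

    // Returns the normalized citation (e.g. "26USC1"), or null if none is found.
    static String extractCitation(String pageText) {
        Matcher m = CITE.matcher(pageText);
        return m.find() ? m.group(1) + "USC" + m.group(2) : null;
    }

    public static void main(String[] args) {
        String page = "<html><body>[CITE: 26USC1] ...</body></html>";

        // Stand-in for the metadata object mentioned in the thread (assumption).
        Map<String, String> metadata = new HashMap<>();

        String citation = extractCitation(page);
        if (citation != null) {
            // Fixed field name as the key, parsed citation as the value,
            // mirroring the metadata.put("myfield", "sports") example.
            metadata.put("citation", citation);
        }
        System.out.println(metadata);
    }
}
```

Note that the sketch uses a fixed field name as the key and the citation as the value, following the metadata.put("myfield", "sports") pattern quoted above, rather than using the citation for both arguments. Whether that fits depends on how the index filter looks the value up.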
