Richard.

> So would I do something like
>
> 1. parse out the citation
> 2. metadata.put(<citation>, <citation>);

Yes, I think that is the way to proceed. Then go on to implement the
indexing and query filters, all as described in the WritingPlugin
tutorial:

http://wiki.apache.org/nutch/WritingPluginExample

Rgrds, Thomas

> Thanks for your help on this.
>
> -----Original Message-----
> From: Raghavendra Prabhu [mailto:[EMAIL PROTECTED]
> Sent: Thursday, March 09, 2006 2:53 AM
> To: [email protected]
> Subject: Re: writing a metadata content tag
>
> Hi Howie
>
> That is what I am looking at
>
> But as you said generalize for all requirements including intranet
> requirement
>
> I am better off doing what you said
>
> Rgds
> Prabu
>
> On 3/9/06, Howie Wang <[EMAIL PROTECTED]> wrote:
> >
> > >What I want to do is add some header info in parse-filter
> > >which will be used by index-filter to add my own nature of the new
> > >FIELD
> > >
> > >Rgds
> > >Prabhu
> >
> > I would recommend doing it at the index phase if possible. If the end
> > goal is to have it searchable from the index, ask if you really need
> > to have the information at the parsing stage. If you decide you want
> > to tweak your keywords, it's easy to re-index. If you do it at the
> > parsing stage, it will take twice as long since you have to re-parse
> > and then re-index. Plus re-parsing is not complicated, but involves
> > kind of a hack with renaming a bunch of directories.
> >
> > One reason to do your analysis at parse time is that it's easier to
> > get the entire page contents like HTML tags in case you need that for
> > categorization. If you don't need this stuff, you probably don't need
> > to categorize at the parsing phase.
> >
> > If you really want to do it at parse time, it's not difficult. Take a
> > look at parse-html. You can use the metadata object to store your
> > category. Look in HtmlParseFilter.java in getParse. Just do:
> >
> >     metadata.put("myfield", "sports");
> >
> > In your index filter, you can then do a metadata.get to get your
> > category and then index it.
> >
> > Howie
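
For anyone following along, the handoff Howie describes (store a value
at parse time, read it back in the index filter) could be sketched
roughly as below. This is not Nutch code: a plain java.util.Map stands
in for the metadata object, and the class name, the categorize helper,
and the keyword rule are all hypothetical, made up for illustration.
The real plugin interfaces are the ones covered in the WritingPlugin
tutorial.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class MetadataHandoffSketch {

    // Field name borrowed from the "myfield" example in the thread.
    static final String FIELD = "myfield";

    // Parse phase: stand-in for the body of a parse filter.
    // ASSUMPTION: a plain Map replaces Nutch's metadata object here.
    static void parsePhase(String pageText, Map<String, String> metadata) {
        String category = categorize(pageText);
        if (category != null) {
            metadata.put(FIELD, category);
        }
    }

    // Hypothetical categorizer: a trivial keyword check, for illustration only.
    static String categorize(String pageText) {
        String lower = pageText.toLowerCase(Locale.ROOT);
        return lower.contains("football") ? "sports" : null;
    }

    // Index phase: stand-in for an index filter. It reads the value stored
    // at parse time and copies it into the fields that would be indexed.
    static Map<String, String> indexPhase(Map<String, String> metadata) {
        Map<String, String> indexDoc = new HashMap<>();
        String category = metadata.get(FIELD);
        if (category != null) {
            indexDoc.put(FIELD, category);
        }
        return indexDoc;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        parsePhase("Football results from the weekend", metadata);
        System.out.println(indexPhase(metadata));
    }
}
```

Note the null checks: a page that never got a category simply ends up
with no such field in the index, rather than an empty one.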
