I am following this thread because I have a similar issue to deal with in
my upcoming development work.  Howie, thanks for your insights into this; I
think they may solve my problem.

I am trying to index Title 26 of the US Code
http://www.access.gpo.gov/uscode/title26/title26.html

The problem is that I don't want the search engine's users to have to go
crazy trying to find a particular code section.

Generally, users cite the code in this format: 26USC1,
which translates to Title 26, Section 1.

Fortunately, the government puts the citation on the top of each page
[CITE: 26USC1]
See:
http://frwebgate.access.gpo.gov/cgi-bin/getdoc.cgi?dbname=browse_usc&docid=Cite:+26USC1
(the citation is at the top of the page)

My goal is to parse that citation out and make it so that I can let
users search on the citation.

So would I do something like 

1. parse out the citation
2. metadata.put(<citation>, <citation>);

?
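
A minimal sketch of those two steps, assuming the marker always looks like
"[CITE: 26USC1]" and that the section part contains only letters, digits,
and hyphens. CitationExtractor, storeCitation, and the "citation" field
name are hypothetical names for illustration, and a plain java.util.Map
stands in for the Nutch metadata object, so this is not Nutch's actual API.
It stores the citation under a fixed field name rather than using the
citation as its own key:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Nutch.
final class CitationExtractor {

    // Assumed marker format: "[CITE: 26USC1]". The capture group holds
    // the citation itself, e.g. "26USC1".
    private static final Pattern CITE =
        Pattern.compile("\\[CITE:\\s*(\\d+USC[0-9A-Za-z-]+)\\]");

    /** Returns the first citation found in the page text, or null if none. */
    static String extractCitation(String pageText) {
        Matcher m = CITE.matcher(pageText);
        return m.find() ? m.group(1) : null;
    }

    /**
     * Stores the citation under a fixed field name. The Map is only a
     * stand-in for whatever metadata object the parse step provides.
     */
    static void storeCitation(Map<String, String> metadata, String pageText) {
        String citation = extractCitation(pageText);
        if (citation != null) {
            metadata.put("citation", citation);
        }
    }
}
```

With a fixed field name, the index side could look the value up with
metadata.get("citation") without knowing the citation in advance.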

Thanks for your help on this.


-----Original Message-----
From: Raghavendra Prabhu [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 09, 2006 2:53 AM
To: [email protected]
Subject: Re: writing a metadata content tag


Hi Howie

That is what I am looking at.

But, as you said, to generalize for all requirements, including the
intranet requirement, I am better off doing what you said.

Rgds
Prabu


On 3/9/06, Howie Wang <[EMAIL PROTECTED]> wrote:
>
> >What I want to do is add some header info in parse-filter, which
> >index-filter will then use to add a new FIELD of my own
> >
> >Rgds
> >Prabhu
>
> I would recommend doing it at the index phase if possible. If the end 
> goal is to have it searchable from the index, ask if you really need 
> to have the information at the parsing stage. If you decide you want 
> to tweak your keywords, it's easy to re-index. If you do it at the 
> parsing stage, it will take twice as long since you have to re-parse 
> and then re-index. Plus re-parsing is not complicated, but involves 
> kind of a hack with renaming a bunch of directories.
>
> One reason to do your analysis at parse time is that it's easier to 
> get the entire page contents like HTML tags in case you need that for 
> categorization. If you don't need this stuff, you probably don't need 
> to categorize at the parsing phase.
>
> If you really want to do it at parse time, it's not difficult. Take a 
> look at parse-html. You can use the metadata object to store your 
> category. Look in HtmlParseFilter.java in getParse. Just do:
>
> metadata.put("myfield", "sports");
>
> In your index filter, you can then do a metadata.get to get your 
> category and then index it.
>
> Howie
>
>
>



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general