[ 
http://issues.apache.org/jira/browse/NUTCH-271?page=comments#action_12412523 ] 

Gal Nitzan commented on NUTCH-271:
----------------------------------


Hi Stefan,

Indeed, 0.8 has not reached a 1.0 release yet, but it is stable and we are 
using it in production.

On the whole Nutch is great and does the job right. There is a lot of tweaking 
involved, but once you get the whole thing configured to your liking there is 
not much to change afterwards.

In terms of plugin development, I do not think Java is that far from PHP, so I 
do not think you would have a hard time there. The plugins are usually pretty 
small pieces of code, since most of the job is already done by Nutch.

For example, say you want to check a certain rule and, based on that rule, add 
some information to the index so you can later search your index based on that 
tag.

The way to go about it would be to develop a parse filter plugin. This plugin 
is called during the parse phase, which usually happens right after fetching 
unless disabled in the configuration.
The plugin has one interface method, filter, which gets the URL, the content 
and a parse object containing a metadata object, for every page fetched. There 
you can put an implementation that adds a metadata tag whenever the URL of the 
fetched page matches some criteria.
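
A rough sketch of that parse-filter logic, using stand-in types rather than the 
real Nutch classes. SketchParse, UrlTagParseFilter, the "branch" key and the 
prefix-to-tag map are all made up for illustration; the actual plugin interface 
in your Nutch version will have a different signature.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for the parse object and its metadata; NOT a Nutch class.
class SketchParse {
    final Map<String, String> metadata = new LinkedHashMap<>();
}

// Hypothetical parse filter: tags a page when its URL starts with a known prefix.
class UrlTagParseFilter {
    static final String TAG_KEY = "branch"; // assumed metadata key

    private final Map<String, String> prefixToTag;

    UrlTagParseFilter(Map<String, String> prefixToTag) {
        this.prefixToTag = prefixToTag;
    }

    // Called once per fetched page with the URL, the content and the parse object.
    SketchParse filter(String url, String content, SketchParse parse) {
        for (Map.Entry<String, String> rule : prefixToTag.entrySet()) {
            if (url.startsWith(rule.getKey())) {
                parse.metadata.put(TAG_KEY, rule.getValue());
                break; // first matching rule wins
            }
        }
        return parse;
    }
}
```

With a rule mapping "http://www.example1.com/something1/" to "companybranch1", 
every page below that URL ends up with branch=companybranch1 in its metadata, 
and pages that match no rule are left untouched.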

Then you would add an indexing plugin that takes that metadata and stores it 
in your index as a new field.
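
As a sketch only, the indexing step boils down to copying one metadata value 
into a document field. SketchDocument and BranchIndexingFilter are hypothetical 
stand-ins, and the "branch" key and field name are assumptions; the real 
indexing filter receives Nutch's own document and parse types.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the index document; NOT a Nutch or Lucene class.
class SketchDocument {
    final Map<String, String> fields = new HashMap<>();
}

// Hypothetical indexing filter: copies the parse-time tag into an index field.
class BranchIndexingFilter {
    static final String META_KEY = "branch";   // key assumed to be set at parse time
    static final String FIELD_NAME = "branch"; // assumed name of the new index field

    SketchDocument filter(SketchDocument doc, Map<String, String> parseMetadata) {
        String tag = parseMetadata.get(META_KEY);
        if (tag != null) {
            doc.fields.put(FIELD_NAME, tag); // only tagged pages get the field
        }
        return doc;
    }
}
```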

The last thing to do is write a query plugin that will enable you to search the 
index based on the field you added in your indexing phase.
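
To illustrate what the query side has to do, here is a small stand-alone sketch 
that restricts results to one or more values of the new field. The clause 
syntax "branch:value1,value2" and the class BranchQuerySketch are invented for 
this example, and documents are plain maps; a real query plugin would hook into 
Nutch's query handling instead.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical query helper; the clause syntax below is an assumption.
class BranchQuerySketch {
    static final String FIELD_NAME = "branch"; // assumed index field name

    // Returns the wanted values of a clause like "branch:a,b",
    // or an empty set if the clause is not about this field.
    static Set<String> parseClause(String clause) {
        String prefix = FIELD_NAME + ":";
        Set<String> wanted = new HashSet<>();
        if (clause.startsWith(prefix)) {
            for (String value : clause.substring(prefix.length()).split(",")) {
                if (!value.isEmpty()) {
                    wanted.add(value);
                }
            }
        }
        return wanted;
    }

    // Keeps only the documents whose field value is one of the wanted values.
    static List<Map<String, String>> search(List<Map<String, String>> docs, String clause) {
        Set<String> wanted = parseClause(clause);
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (wanted.contains(doc.get(FIELD_NAME))) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

Searching with "branch:companybranch1,companybranch3" would then cover the 
"across 1 and 3" case from the original request.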

HTH.

Gal.

This kind of question should be sent to the user list rather than Jira.

> Meta-data per URL/site/section
> ------------------------------
>
>          Key: NUTCH-271
>          URL: http://issues.apache.org/jira/browse/NUTCH-271
>      Project: Nutch
>         Type: New Feature

>     Versions: 0.7.2
>     Reporter: Stefan Neufeind

>
> We have the need to index sites and attach additional meta-data-tags to them. 
> Afaik this is not yet possible, or is there a "workaround" I don't see? What 
> I think of is using meta-tags per start-url, only indexing content below that 
> URL, and have the ability to limit searches upon those meta-tags. E.g.
> http://www.example1.com/something1/   -> meta-tag "companybranch1"
> http://www.example2.com/something2/   -> meta-tag "companybranch2"
> http://www.example3.com/something3/   -> meta-tag "companybranch1"
> http://www.example4.com/something4/   -> meta-tag "companybranch3"
> search for everything in companybranch1 or across 1 and 3 or similar

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


