You're right that it's not well documented, but it's actually
pretty simple to do.
You need to write an indexing filter and a query filter. For the indexing
filter, I would copy the index-more plugin, and change names, dirs, and
build files appropriately. The main thing you'll change is the filter
method:
public Document filter(Document doc, Parse parse, FetcherOutput fo)
In it, you can add your own fields. To add a new category with value
"puppies", it will look something like this:
doc.add(new Field("category", "puppies", false, true, false));
See the Lucene Field constructor API for more info on the booleans
(store, index, tokenize).
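To put that line in context, here is a minimal sketch of what the filter body could look like. So that it compiles on its own, Document and Field below are small stand-ins for the Lucene classes, not the real ones. I'm assuming the five-argument Field constructor takes name, value, store, index, tokenize in that order. CategoryIndexingFilter is a hypothetical name, and the Parse and FetcherOutput arguments are left out because the sketch doesn't use them. In the real plugin you would import the Nutch and Lucene types instead.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for org.apache.lucene.document.Field (assumed argument order:
// name, value, store, index, tokenize).
class Field {
    final String name;
    final String value;
    final boolean stored;
    final boolean indexed;
    final boolean tokenized;

    Field(String name, String value, boolean store, boolean index, boolean token) {
        this.name = name;
        this.value = value;
        this.stored = store;
        this.indexed = index;
        this.tokenized = token;
    }
}

// Stand-in for org.apache.lucene.document.Document.
class Document {
    private final List<Field> fields = new ArrayList<>();

    void add(Field field) {
        fields.add(field);
    }

    // Returns the first field with the given name, or null if there is none.
    Field getField(String name) {
        for (Field f : fields) {
            if (f.name.equals(name)) {
                return f;
            }
        }
        return null;
    }
}

// Hypothetical filter; the real method also takes Parse and FetcherOutput.
class CategoryIndexingFilter {
    Document filter(Document doc) {
        // Not stored, indexed, not tokenized: searchable as one exact term.
        doc.add(new Field("category", "puppies", false, true, false));
        return doc;
    }
}
```

With store set to false you can search on the category but not read it back from a hit. Pass true as the third argument if you also want to display it in results.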
That's pretty much it for indexing. To search for this, you need to create
a query filter. I would copy the query-site plugin. Again change file names,
directories, and build files as needed. The main Java file is very simple:
just change the string in the line with "super". Instead of:
super("site");
You would have
super("category");
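For reference, the whole class could end up about this small. This is only a sketch: the RawFieldQueryFilter below is a stand-in I wrote so the example compiles by itself (I'm assuming the real Nutch base class of that name takes the field name in its constructor, as the "super" line suggests), and the handles() method is a hypothetical helper that exists only to show the effect; it is not part of Nutch. CategoryQueryFilter is likewise just a suggested name.

```java
// Stand-in for Nutch's RawFieldQueryFilter; only the constructor shape
// is meant to mirror the real class.
abstract class RawFieldQueryFilter {
    private final String field;

    protected RawFieldQueryFilter(String field) {
        this.field = field;
    }

    // Hypothetical helper (not in Nutch): true when a "field:term" clause
    // names the field this filter was constructed with.
    boolean handles(String clause) {
        return clause.startsWith(field + ":");
    }
}

// The copied query-site class with the field name changed.
class CategoryQueryFilter extends RawFieldQueryFilter {
    CategoryQueryFilter() {
        super("category");
    }
}
```

In this sketch a clause such as category:puppies is claimed by the new filter, while site:example.com is left to the original query-site plugin.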
Make sure that you list your new index-category and query-category
plugins in the plugin.includes property of your nutch-default.xml file.
Don't forget to check that the updated file is in your WEB-INF/classes
directory too.
HTH,
Howie
You can't do it unless you write a plugin to parse a custom meta tag called
category.
I'm trying to do something like this now, but the plugin documentation is
horrible.
Lourival Júnior wrote:
Hi Ernesto!
I know what you mean. Sometimes I get no answers either. Unfortunately, I'm
new to Nutch and Lucene and I can't help you. Keep trying, the community
will help you :).
On 8/22/06, Ernesto De Santis <[EMAIL PROTECTED]> wrote:
Hi All
Please, can somebody answer my questions?
I'm a Nutch beginner, so I hope my questions/doubts are easy... ;)
If my email is wrong, tell me. Or confirm whether I'm on the right track.
Thanks a lot!
Ernesto.
Ernesto De Santis wrote:
> Hi
>
> I'm new to Nutch; I started yesterday.
> But I have experience with Lucene.
>
> I have some questions for you Nutch experts... ;)
>
> I want to split my page results into categories, to filter them or to
> show them separately.
> This is my approach:
>
> *crawl/index*
>
> I want to index an extra field.
> For that, I need to write my own plugin to implement my custom logic.
> Then I configure my plugin in conf/nutch-site.xml.
>
> To develop my plugin, I see that I need to implement the Configurable
> <http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/conf/Configurable.html>,
> IndexingFilter
> <http://lucene.apache.org/nutch/apidocs-0.8/org/apache/nutch/indexer/IndexingFilter.html>,
> and Pluggable
> <http://lucene.apache.org/nutch/apidocs-0.8/org/apache/nutch/plugin/Pluggable.html>
> interfaces.
>
> Then I add the category value to the Document instance as a field.
>
> *search*
>
> Here I have a doubt. One way is to set a required term on the Nutch query:
>
> query.addRequiredTerm(myCategory, "category");
>
> I see that Nutch uses QueryFilters too, but I can't see how to hook
> them into my query.
>
> *miscellaneous*
>
> Lucene has a rich query hierarchy, but I don't see it in Nutch. I don't
> see BooleanQuery, TermQuery, etc. Is the Query class the only place to
> build the query in Nutch?
>
> The Lucene searcher has a way to separate the query from the filters.
> Query conditions affect the rank, and filters don't. How does Nutch
> separate them?
>
> *documentation*
>
> I read the documentation on the Nutch site, the tutorial, the wiki, the
> presentations, and the today.java.net article:
> http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
> and part 2 too.
>
> A lot of details aren't covered there. Does anybody know of more detailed
> documentation?
>
> Thanks a lot.
> Ernesto.
>
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general