On 2010-01-11 13:18, Erlend Garåsen wrote:

First of all: I didn't know about the list archive, so sorry for not
searching that resource before I sent a new post.

MilleBii wrote:
For lastModified, just enable the index-more and query-more plugins;
they will do the job for you.

Unfortunately not. Our pages include Dublin core metadata which has a
Norwegian name.

For other metadata, search the mailing list; it's explained many times
how to do it.

I found several posts concerning metadata, but for me, one question is
still unanswered: Do I really have to create a lot of new classes/xml
files in order to store the content of just two metadata fields? I have
not managed to parse the content of the lastModified metadata after I
tried to rewrite the HtmlParser class. So I tried to add hard-coded
metadata values in HtmlParser like this instead:
entry.getValue().getData().getParseMeta().set("dato.endret", "01.01.2008");

My modified MoreIndexingFilter managed to pick up the hard-coded values,
and the dates were successfully stored in my Solr index after running
the solrindex option.

This means that it is not necessary to write a new MoreIndexingFilter
class, but I'm still unsure about the HtmlParser class since I haven't
managed to parse the content of the metadata.

You can of course hack your way through HtmlParser and add/remove/modify as you see fit - it's straightforward, and you will likely get the result that you want.

However, as MilleBii suggests, the preferred way to do this would be to write a plugin. The reason is the cost of long-term maintenance - if you ever want to sync up your locally modified version of Nutch with a newer public release, your hacked copy of HtmlParser won't merge nicely, whereas if you put your code in a separate plugin then it might. Another reason is configurability - if you put this code in a separate plugin, you can easily turn it on/off, but if it sits in HtmlParser this would be more difficult to do.
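
To keep the extraction logic independent of HtmlParser, it can live in a small helper that a plugin calls. The following is only a rough, self-contained sketch and uses no Nutch classes: the class name MetaDateSketch and both method names are made up for illustration, the regex assumes the name attribute comes before content (a real plugin would more likely walk the parsed DOM), and the dd.MM.yyyy format is assumed from the "01.01.2008" example above. The output is the ISO 8601 UTC form that Solr date fields expect.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: pulls one <meta> value out of
// raw HTML and converts a dd.MM.yyyy date into a Solr-style date string.
public class MetaDateSketch {

    // Assumed input format, based on the "01.01.2008" example.
    private static final DateTimeFormatter NORWEGIAN =
            DateTimeFormatter.ofPattern("dd.MM.yyyy");

    // Returns the content attribute of <meta name="..." content="...">.
    // Assumption: the name attribute appears before content.
    static Optional<String> extractMeta(String html, String name) {
        Pattern p = Pattern.compile(
                "<meta\\s+name=[\"']" + Pattern.quote(name)
                        + "[\"']\\s+content=[\"']([^\"']*)[\"']",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Converts e.g. "01.01.2008" to "2008-01-01T00:00:00Z".
    // Throws DateTimeParseException if the value does not match.
    static String toSolrDate(String value) {
        return LocalDate.parse(value, NORWEGIAN) + "T00:00:00Z";
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<meta name=\"dato.endret\" content=\"01.01.2008\">"
                + "</head><body></body></html>";
        extractMeta(html, "dato.endret")
                .map(MetaDateSketch::toSolrDate)
                .ifPresent(System.out::println);
    }
}
```

Inside a plugin, the converted value could then be stored under the same "dato.endret" key that the hard-coded test used, so the indexing filter that already picks it up would not need to change.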


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
