I managed to "hack" HtmlParser by modifying the class HTMLMetaProcessor. Now I'm able to parse my metadata.

I agree with you. I will write my own plugin later. At the moment I'm only interested in finding out whether we can start using Solr/Nutch instead of paying A LOT for a Fast/Ultraseek/Omnifind license. Parsing our metadata was one of our many requirements.

Now I have learned a little more by changing the code of the existing parsers/indexers. I guess you will all hear more from me, though about a different problem. :)

Erlend


Andrzej Bialecki wrote:
On 2010-01-11 13:18, Erlend Garåsen wrote:

First of all: I didn't know about the list archive, so sorry for not
searching that resource before I sent a new post.

MilleBii wrote:
For lastModified, just enable the index|query-more plugins; they will do
the job for you.

Unfortunately not. Our pages include Dublin Core metadata with a
Norwegian name.

For other metadata, search the mailing list; it's explained many times how to
do it.

I found several posts concerning metadata, but for me, one question is
still unanswered: Do I really have to create a lot of new classes/xml
files in order to store the content of just two metadata fields? I have not
managed to parse the content of the lastModified metadata after I tried
to rewrite the HtmlParser class. So I tried to add hard coded metadata
values in HtmlParser like this instead:
entry.getValue().getData().getParseMeta().set("dato.endret", "01.01.2008");

My modified MoreIndexingFilter managed to pick up the hard coded values,
and the dates were successfully stored into my Solr Index after running
the solrindex option.

This means that it is not necessary to write a new MoreIndexingFilter
class, but I'm still unsure about the HtmlParser class since I haven't
managed to parse the content of the metadata.

You can of course hack your way through HtmlParser and add/remove/modify as you see fit - it's straightforward and likely you will get the result that you want.

However, as MilleBii suggests, the preferred way to do this would be to write a plugin. The reason is the cost of long-term maintenance: if you ever want to sync your locally modified version of Nutch with a newer public release, your hacked copy of HtmlParser won't merge nicely, whereas code kept in a separate plugin is much more likely to. Another reason is configurability: if you put this code in a separate plugin, you can easily turn it on/off, but if it sits in HtmlParser this would be more difficult to do.
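
The extraction logic that such a plugin would wrap can be sketched on its own, independent of Nutch. The following is only a rough illustration using the plain JDK, not Nutch's plugin API: the class and method names (MetaDateExtractor, metaContent, lastModified) are hypothetical, the dd.MM.yyyy date layout is an assumption inferred from the sample value "01.01.2008" above, and the regular expression assumes double-quoted attributes with name before content. A real plugin would more likely read the value from the parsed document than from raw HTML.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: reads one <meta> value from raw HTML.
final class MetaDateExtractor {

    // Assumed date layout (day.month.year), inferred from the sample "01.01.2008".
    private static final DateTimeFormatter NORWEGIAN_DATE =
            DateTimeFormatter.ofPattern("dd.MM.uuuu");

    private MetaDateExtractor() {}

    // Returns the content attribute of the first <meta> tag with the given name.
    // Simplification: expects name before content and double-quoted attributes.
    static Optional<String> metaContent(String html, String metaName) {
        Pattern pattern = Pattern.compile(
                "<meta\\s+name=\"" + Pattern.quote(metaName)
                        + "\"\\s+content=\"([^\"]*)\"",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    // Parses the "dato.endret" meta value into a date; empty if the tag is absent.
    static Optional<LocalDate> lastModified(String html) {
        return metaContent(html, "dato.endret")
                .map(value -> LocalDate.parse(value.trim(), NORWEGIAN_DATE));
    }
}
```

Keeping this logic in a small class of its own means it can be tested without running a crawl, and moved into a plugin later without touching HtmlParser.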




--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
