Re: Adding additional metadata

Erlend Garåsen Mon, 11 Jan 2010 08:19:03 -0800

I managed to "hack" HtmlParser by modifying the class HTMLMetaProcessor. Now I'm able to parse my metadata.

I agree with you. I will write my own plugin later. At the moment I'm only interested in finding out whether it is possible to start using Solr/Nutch instead of paying A LOT for a Fast/Ultraseek/Omnifind license. Parsing our metadata was one of our many requirements.

Now I have learned a little bit more by changing the code of the existing parsers/indexers. I guess you will all hear more from me, but then related to another problem. :)


Erlend


Andrzej Bialecki wrote:

On 2010-01-11 13:18, Erlend Garåsen wrote:
First of all: I didn't know about the list archive, so sorry for not
searching that resource before I sent a new post.

MilleBii wrote:
For lastModified, just enable the index|query-more plugins; they will do
the job for you.
Unfortunately not. Our pages include Dublin core metadata which has a
Norwegian name.
For other meta, search the mailing list; it's explained many times how to
do it.
I found several posts concerning metadata, but for me, one question is
still unanswered: Do I really have to create a lot of new classes/xml
files in order to store the content of just two metadata? I have not
managed to parse the content of the lastModified metadata after I tried
to rewrite the HtmlParser class. So I tried to add hard coded metadata
values in HtmlParser like this instead:
entry.getValue().getData().getParseMeta().set("dato.endret","01.01.2008");
My modified MoreIndexingFilter managed to pick up the hard coded values,
and the dates were successfully stored into my Solr Index after running
the solrindex option.

This means that it is not necessary to write a new MoreIndexingFilter
class, but I'm still unsure about the HtmlParser class since I haven't
managed to parse the content of the metadata.
You can of course hack your way through HtmlParser and add/remove/modify as you see fit - it's straightforward and likely you will get the result that you want.
However, as MilleBii suggests, the preferred way to do this would be to write a plugin. The reason is the cost of long-term maintenance - if you ever want to sync up your locally modified version of Nutch with a newer public release, your hacked copy of HtmlParser won't merge nicely, whereas if you put your code in a separate plugin then it might. Another reason is configurability - if you put this code in a separate plugin, you can easily turn it on/off, but if it sits in HtmlParser this would be more difficult to do.
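
For illustration only, here is a minimal sketch of the extraction step that such a plugin would have to perform. It uses plain JDK classes and no Nutch classes, so it is not the HtmlParser or plugin API. The class name MetaTagSketch and its two methods are hypothetical, the regular expression assumes the name attribute comes before content, and the dd.MM.yyyy date format is an assumption based on the hard-coded "01.01.2008" value above.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the extraction step of a parse plugin.
// It deliberately uses no Nutch classes.
class MetaTagSketch {

    // Matches <meta name="..." content="..."> with the attributes in that order
    // (an assumption; real pages may order or quote attributes differently).
    private static final Pattern META = Pattern.compile(
        "<meta\\s+name=\"([^\"]+)\"\\s+content=\"([^\"]*)\"[^>]*>",
        Pattern.CASE_INSENSITIVE);

    // Assumed format of the date metadata (day.month.year).
    private static final DateTimeFormatter SOURCE_DATE =
        DateTimeFormatter.ofPattern("dd.MM.yyyy");

    // Collects every name/content pair found in the HTML, in document order.
    static Map<String, String> extractMeta(String html) {
        Map<String, String> meta = new LinkedHashMap<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            meta.put(m.group(1), m.group(2));
        }
        return meta;
    }

    // Converts a dd.MM.yyyy value to ISO yyyy-MM-dd before indexing.
    static String toIsoDate(String value) {
        return LocalDate.parse(value, SOURCE_DATE).toString();
    }

    public static void main(String[] args) {
        String html = "<html><head>"
            + "<meta name=\"dato.endret\" content=\"01.01.2008\">"
            + "</head><body></body></html>";
        Map<String, String> meta = extractMeta(html);
        System.out.println(toIsoDate(meta.get("dato.endret")));
    }
}
```

In a real plugin the extracted pairs would be copied into the parse metadata (as in the getParseMeta().set(...) line quoted above) rather than returned in a Map.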



--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
