Hi Erik,

Stefan - thanks for the reply. I'm still digesting Nutch and how to work with it at a basic level, but it does make sense to allow metadata to tag along with fetches - I certainly don't know enough yet to say whether your patch fits into the long-term vision of Nutch or not.
Well, I would be interested to hear about the long-term vision of Nutch as well. :)


I've started writing a custom RDF parser plugin that will take the URL and simply add it to Kowari (letting Kowari actually parse it and ingest it). But I'm feeling like this might not be the best approach.
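
Roughly what I have so far looks like the sketch below - just a rough sketch, and the Kowari class and method names (ItqlInterpreterBean, executeUpdate) are my assumption from the iTQL docs; I've left out the actual Nutch Parser plugin wiring around it since that depends on the Nutch version:

import org.kowari.itql.ItqlInterpreterBean;

/**
 * Meant to be called from the getParse() method of my RDF parse plugin:
 * instead of parsing the document in Nutch, it asks Kowari to load the
 * URL itself. Class/method names are assumptions, not tested code.
 */
public class KowariLoader {

  // Placeholder model URI - substitute your own Kowari server/model.
  private static final String MODEL = "rmi://localhost/server1#rdf";

  private final ItqlInterpreterBean itql = new ItqlInterpreterBean();

  public void load(String url) throws Exception {
    // iTQL "load" makes Kowari fetch, parse and ingest the RDF document,
    // so Nutch never has to understand the RDF at all.
    itql.executeUpdate("load <" + url + "> into <" + MODEL + ">;");
  }
}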

At what stage would it make the most sense to ingest RDF into an external system? Is parsing the most logical stage?

Further on this topic, I'm curious about indexing multiple "documents" per .rdf file fetched - for instance, one document per RDF "resource".
You've hit another problem that I have seen for some time. For example, I see this problem with RSS parsing, image search, or any other case where you have multiple logical documents per physical document (XML feed, HTML page).

Is this currently possible with a plugin?
NO! As far as I know, you can only have one document per URL.
If not, what would it take to do something like this? Maybe this approach doesn't even make sense within Nutch - I'm just exploring my architectural options.
I solved a similar problem with the following steps.
I fetched, but did not parse at fetch time.
In the next step I read the unparsed content from the segment, used my own parser, and indexed the parsed content directly. Besides this, I wrote out a text file with the extracted URLs; these URLs were merged back into the webdb at the end. It worked, but it was no more than a prototype, and I kept asking myself whether it makes sense to use Nutch for such a task. Anyway, I would be very happy to see a patch that allows extracting multiple documents from one source (this would help to implement better RSS or image search); however, I think that is a very tricky issue.
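
To give you an idea, the prototype had roughly this shape - everything here apart from java.io is a hypothetical stand-in for the real segment reader, my parser and the Lucene indexing, so read it as pseudo code for the steps above:

import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.List;

public class MultiDocPrototype {

  interface SegmentSource { List<RawRecord> readUnparsed(); }   // raw, unparsed fetcher output
  interface MyParser { List<LogicalDoc> parse(RawRecord rec); } // one physical doc -> many logical docs
  interface DirectIndexer { void index(LogicalDoc doc); }       // writes straight into the Lucene index

  static class RawRecord { String url; byte[] content; }
  static class LogicalDoc { String id; String text; List<String> outlinks; }

  public static void run(SegmentSource segment, MyParser parser,
                         DirectIndexer indexer, String urlDumpFile) throws Exception {
    PrintWriter urlDump = new PrintWriter(new FileWriter(urlDumpFile));
    try {
      for (RawRecord rec : segment.readUnparsed()) {
        for (LogicalDoc doc : parser.parse(rec)) {
          indexer.index(doc);            // indexed directly, bypassing Nutch's indexer
          for (String link : doc.outlinks) {
            urlDump.println(link);       // dumped to a text file, merged back into the webdb later
          }
        }
      }
    } finally {
      urlDump.close();
    }
  }
}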

HTH
Stefan
