Hi Erik,
Stefan - thanks for the reply. I'm still digesting Nutch and how
to work with it at a basic level but it does make sense to allow
metadata to tag along with fetches - I certainly don't know enough
yet to say whether your patch fits into the long-term vision of
Nutch or not yet.
Well, I would be interested to hear about the long-term vision of
Nutch as well. :)
I've started writing a custom RDF parser plugin that will take the
URL and simply add it to Kowari (letting Kowari actually parse it
and ingest it). But I'm feeling like this might not be the best
approach.
At what stage would make the most sense for ingesting RDF into an
external system? Is parsing the most logical stage?
Further on this topic, I'm curious about indexing multiple
"documents" per .rdf file fetched - for instance, one document per
RDF "resource".
You have hit another problem that I have been seeing for some time.
For example, I see it with RSS parsing, image search, or any other
case where you have multiple logical documents per physical document
(XML feed, HTML page).
Is this currently possible with a plugin?
No! As far as I know, you can only have one document per URL.
If not, what would it take to do something like this? Maybe this
approach doesn't even make sense in the Nutch sense - I'm just
exploring my architectural options.
I solved a similar problem with the following steps.
I fetch, but I do not parse at fetch time.
In the next step I read the unparsed content from the segment, used
my own parser, and directly indexed the content I had parsed. Besides
this, I wrote a text file with the extracted URLs. These URLs were
merged back into the webdb at the end.
It worked, but it was no more than a prototype, and in the end I was
asking myself whether it makes sense to use Nutch for such a task.
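To make the idea a bit more concrete, here is a rough sketch of the
"own parser" step in plain Python. It is not Nutch code and uses no
Nutch API: the function name split_feed, the "#item-N" key scheme and
the sample feed are all made up for illustration. It takes the raw
content of one fetched RSS feed and returns several logical documents
plus the list of extracted URLs that would be merged back later.

```python
import xml.etree.ElementTree as ET


def split_feed(source_url, raw_xml):
    """Split one fetched RSS 2.0 feed into several logical documents.

    Returns (documents, outlinks). Hypothetical helper, not a Nutch API.
    """
    root = ET.fromstring(raw_xml)
    documents = []
    outlinks = []
    for position, item in enumerate(root.iter("item")):
        link = (item.findtext("link") or "").strip()
        documents.append({
            # Assumed key scheme: source URL plus item position, so that
            # several documents can share one physical (fetched) URL.
            "id": f"{source_url}#item-{position}",
            "source": source_url,
            "title": (item.findtext("title") or "").strip(),
            "text": (item.findtext("description") or "").strip(),
            "link": link,
        })
        if link:
            # Collected separately, like the text file of extracted URLs.
            outlinks.append(link)
    return documents, outlinks


# Made-up sample standing in for unparsed content read from a segment.
SAMPLE = """<rss version="2.0"><channel><title>Example</title>
<item><title>First</title><link>http://example.org/a</link>
<description>Alpha</description></item>
<item><title>Second</title><link>http://example.org/b</link>
<description>Beta</description></item>
</channel></rss>"""

docs, links = split_feed("http://example.org/feed.xml", SAMPLE)
```

Each entry in docs would then go to the indexer as its own document,
and links would be written out for the later merge. How such keys
could fit Nutch's one-document-per-URL model is exactly the open
question.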
Anyway, I would be very happy to see a patch that allows extracting
multiple documents from one source (this would help to implement a
better RSS or image search); however, I think that is a very tricky
issue.
HTH
Stefan