On 24 Nov 2005, at 23:49, Chris Mattmann wrote:
Dublin Core may be good for the semantic web, but not for content
storage.
I completely disagree with that.
Me too.
In fact, I think many people would disagree with that. Dublin Core is
a "standard" metadata model for electronic resources. It is by no means
the entire spectrum of metadata that could be stored for electronic
content. However, rather than creating your own "author" field, or
"content creator", or "document creator", or whatever you want to call
it, I think it would be nice to provide the DC metadata, because at
least it is well known and provides interoperability with other content
storage systems. Check out DSpace from MIT. Check out ISO-11179 registry
systems. Check out the ISO standard OAIS reference model for archiving
systems. Each of these systems has recognized that standard metadata is
an important concern in any content management system.
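To make the interoperability point concrete, here is a minimal sketch of
folding ad hoc field names into Dublin Core element names. The alias
table and the function name to_dublin_core are hypothetical, made up for
illustration; only the fifteen element names come from the Dublin Core
element set. Nothing here is Nutch code.

```python
# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# Hypothetical aliases: home-grown field names mapped to a DC element.
DC_ALIASES = {
    "author": "creator",
    "content creator": "creator",
    "document creator": "creator",
}


def to_dublin_core(metadata):
    """Split a metadata dict into (dc_fields, other_fields).

    Keys that match a DC element, directly or through an alias, are
    stored under "dc:<element>" as a list of values. Everything else is
    passed through unchanged, since DC is not the whole spectrum.
    """
    dc, other = {}, {}
    for key, value in metadata.items():
        name = key.strip().lower()
        name = DC_ALIASES.get(name, name)
        if name in DC_ELEMENTS:
            dc.setdefault("dc:" + name, []).append(value)
        else:
            other[key] = value
    return dc, other
```

Fields that have no DC equivalent are kept as they are, so a parser can
expose the well-known elements without losing anything specific to the
source format.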
Further along these lines... Nutch's instigation had a bit to do with
Google's dominance, and look where Google is headed now! Semantic
web, oh my! Google Base is currently just scratching the surface of
where they'll head. Nutch could certainly be used in this sort of
space. I was heading that way, but have backed off for now in favor of
something much simpler to begin with: using Nutch to crawl library
archives with RDF data backing the web pages, pointed to by <link> tags
in the <head> section. That RDF is dumped into a powerful triplestore
(Kowari), with the goal of blending structured RDF queries with
full-text queries.
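For readers who have not seen the <link>-in-<head> convention, here is a
rough sketch of the discovery step using only the Python standard
library. It is not Nutch plugin code. The names RdfLinkFinder and
find_rdf_links are hypothetical, and matching on
type="application/rdf+xml" is an assumption about how the archive pages
are marked up. Loading the result into a triplestore is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RdfLinkFinder(HTMLParser):
    """Collect hrefs of <link> tags in <head> that advertise RDF/XML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.in_head = False
        self.rdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head:
            attributes = dict(attrs)
            link_type = (attributes.get("type") or "").lower()
            href = attributes.get("href")
            # Assumption: the RDF document is flagged by its MIME type.
            if link_type == "application/rdf+xml" and href:
                # Resolve relative hrefs against the page URL.
                self.rdf_urls.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_rdf_links(html, base_url):
    """Return absolute URLs of RDF documents linked from the page head."""
    finder = RdfLinkFinder(base_url)
    finder.feed(html)
    return finder.rdf_urls
```

Each URL returned would then be fetched and its triples loaded into the
store alongside the full-text index entry for the page.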
I strongly suspect that there will be more efforts to tweak Nutch
into the semantic web space. I'd be surprised otherwise.
The magic word is minimalism.
So I vote against this suggestion!
Stefan
In general, this proposal represents a step forward in being able to
parse generic XML content in Nutch, which is a very challenging problem.
Thanks for your suggestions; however, I think that our proposal would
help Nutch to move forward in being able to handle generic forms of XML
markup content.
Stefan - please don't inhibit innovation. Just because you don't
agree with the approach, let them have the freedom to prove it out
with encouragement, not negativity. Plugins can be turned off, and
if it isn't acceptable to be in the core then so be it, it doesn't
even have to be an officially supported plugin. But I, for one,
would like to encourage them to continue on with their XML efforts
and see where it leads.
RDF, microformats, triplestores, structured querying, faceted
browsing.... these are the things I need, with of course full-text
search, and this is the direction Google is headed in a major way.
Full-text is great and all, but it's only part of the story, and a
crude one in many respects. :) Scraping HTML for "meaning"... insanity.
Erik
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers