On 24 Nov 2005, at 23:49, Chris Mattmann wrote:
Dublin Core may be good for the semantic web, but not for content storage.

I completely disagree with that.

Me too.

In fact, I think many people would disagree with that. Dublin Core is a "standard" metadata model for electronic resources. It is by no means the entire spectrum of metadata that could be stored for electronic content. However, rather than creating your own "author" field, or "content creator", or "document creator", or whatever you want to call it, I think it would be nice to provide the DC metadata, because at least it is well known and provides interoperability with other content storage systems. Check out DSpace from MIT. Check out ISO-11179 registry systems. Check out the ISO standard OAIS reference model for archiving systems. Each of these systems has recognized that standard metadata is an important concern in any content management system.
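
To make the "author" vs. "content creator" vs. "document creator" point concrete, here is a rough sketch of folding ad-hoc field names onto Dublin Core element names. The alias table and the to_dublin_core helper are made up for illustration and are not part of Nutch or of the proposal; only the element names (creator, title, subject) come from the Dublin Core element set.

```python
# Hypothetical alias table: ad-hoc field names -> Dublin Core element names.
# The aliases are illustrative assumptions; the targets are real DC elements.
DC_ALIASES = {
    "author": "creator",
    "content creator": "creator",
    "document creator": "creator",
    "headline": "title",
    "keywords": "subject",
}


def to_dublin_core(record):
    """Regroup a flat metadata record under dc:<element> keys.

    Fields with no known alias are kept under their original name,
    so nothing is lost when a source uses fields outside the table.
    """
    out = {}
    for key, value in record.items():
        element = DC_ALIASES.get(key.strip().lower())
        name = f"dc:{element}" if element else key
        # Several source fields may map to the same element, so collect lists.
        out.setdefault(name, []).append(value)
    return out
```

With this, two sources that call the same thing "Author" and "document creator" both end up under dc:creator, which is the interoperability argument in a nutshell.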

Further along these lines... Nutch's instigation had a bit to do with Google's dominance, and look where Google is headed now! Semantic web, oh my! Google Base currently is just scratching the surface of where they'll head. Nutch could certainly be used in this sort of space. I was, but currently backed off for something much simpler to begin with, using Nutch to crawl library archives with RDF data backing the web pages, pointed to by <link> tags in the <head> section. That RDF is dumped into a powerful triplestore (Kowari), with the goal of blending structured RDF queries with full-text queries.
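
For anyone curious what "pointed to by <link> tags in the <head> section" might look like in practice, below is a minimal sketch using only the Python standard library. It is not the actual crawler code and not a Nutch plugin. It assumes the pages advertise their RDF with type="application/rdf+xml" on the <link> tag, which is a common convention but an assumption here; RdfLinkFinder and find_rdf_links are names invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RdfLinkFinder(HTMLParser):
    """Collect hrefs of <link> tags inside <head> that point at RDF/XML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.in_head = False
        self.rdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head:
            a = dict(attrs)
            # Assumption: RDF is flagged by its MIME type on the link tag.
            is_rdf = (a.get("type") or "").lower() == "application/rdf+xml"
            if is_rdf and a.get("href"):
                # Resolve relative hrefs against the page URL.
                self.rdf_urls.append(urljoin(self.base_url, a["href"]))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_rdf_links(html, base_url):
    """Return absolute URLs of RDF documents linked from the page head."""
    finder = RdfLinkFinder(base_url)
    finder.feed(html)
    return finder.rdf_urls
```

The URLs returned would then be fetched and loaded into the triplestore; that loading step is specific to the store and is left out here.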

I strongly suspect that there will be more efforts to tweak Nutch into the semantic web space. I'd be surprised otherwise.

The magic word is minimalism.
So I vote against this suggestion!
Stefan

In general, this proposal represents a step forward in being able to parse generic XML content in Nutch, which is a very challenging problem. Thanks for your suggestions; however, I think that our proposal would help Nutch move forward in being able to handle generic forms of XML markup content.

Stefan - please don't inhibit innovation. Just because you don't agree with the approach, let them have the freedom to prove it out with encouragement, not negativity. Plugins can be turned off, and if it isn't acceptable to be in the core then so be it, it doesn't even have to be an officially supported plugin. But I, for one, would like to encourage them to continue on with their XML efforts and see where it leads.

RDF, microformats, triplestores, structured querying, faceted browsing.... these are the things I need, with of course full-text search, and this is the direction Google is headed in a major way. Full-text is great and all, but it's only part of the story, and a crude one in many respects. :) Scraping HTML for "meaning"... insanity.

        Erik




_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
