Re: [Nutch-dev] RE: [proposal] Generic Markup Language Parser

Erik Hatcher Fri, 25 Nov 2005 02:33:19 -0800


On 24 Nov 2005, at 23:49, Chris Mattmann wrote:

Dublin Core may be good for the semantic web, but not for content storage.
I completely disagree with that.


Me too.

In fact, I think many people would disagree with that. Dublin Core is a
"standard" metadata model for electronic resources. It is by no means the
entire spectrum of metadata that could be stored for electronic content.
However, rather than creating your own "author" field, or "content
creator", or "document creator", or whatever you want to call it, I think
it would be nice to provide the DC metadata, because at least it is well
known and provides interoperability with other content storage systems.
Check out DSpace from MIT. Check out ISO-11179 registry systems. Check out
the ISO standard OAIS reference model for archiving systems. Each of these
systems has recognized that standard metadata is an important concern in
any content management system.

Further along these lines... Nutch's instigation had a bit to do with Google's dominance, and look where Google is headed now! Semantic web, oh my! Google Base currently is just scratching the surface of where they'll head. Nutch could certainly be used in this sort of space. I was, but currently backed off for something much simpler to begin with, using Nutch to crawl library archives with RDF data backing the web pages, pointed to by <link> tags in the <head> section. That RDF is dumped into a powerful triplestore (Kowari), with the goal of blending structured RDF queries with full-text queries.

I strongly suspect that there will be more efforts to tweak Nutch into the semantic web space. I'd be surprised otherwise.
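
To make the crawl setup described above concrete, here is a minimal sketch of discovering the RDF documents that a page points to from <link> tags in its <head>. It is illustrative only and is not Nutch code: the class name RdfLinkFinder, the function find_rdf_links, and the choice to match on type="application/rdf+xml" are all assumptions, and the real pages may advertise their RDF differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RdfLinkFinder(HTMLParser):
    """Collect hrefs of <link> tags inside <head> that advertise RDF.

    Hypothetical helper: matching on the MIME type below is an assumed
    convention, not something stated in the original message.
    """

    RDF_TYPE = "application/rdf+xml"

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.in_head = False
        self.rdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head:
            attr = dict(attrs)
            link_type = (attr.get("type") or "").lower()
            if link_type == self.RDF_TYPE and attr.get("href"):
                # Resolve relative hrefs against the page URL.
                self.rdf_urls.append(urljoin(self.base_url, attr["href"]))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_rdf_links(html, base_url):
    """Return absolute URLs of RDF documents linked from the page's <head>."""
    finder = RdfLinkFinder(base_url)
    finder.feed(html)
    return finder.rdf_urls
```

The URLs returned by a step like this would then be fetched and loaded into the triplestore, while the page text itself goes to the full-text index.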

The magic word is minimalism.
So I vote against this suggestion!
Stefan
In general, this proposal represents a step forward in being able to parse
generic XML content in Nutch, which is a very challenging problem. Thanks
for your suggestions; however, I think that our proposal would help Nutch to
move forward in being able to handle generic forms of XML markup content.

Stefan - please don't inhibit innovation. Just because you don't agree with the approach, let them have the freedom to prove it out with encouragement, not negativity. Plugins can be turned off, and if it isn't acceptable to be in the core then so be it, it doesn't even have to be an officially supported plugin. But I, for one, would like to encourage them to continue on with their XML efforts and see where it leads.

RDF, microformats, triplestores, structured querying, faceted browsing.... these are the things I need, with of course full-text search, and this is the direction Google is headed in a major way. Full-text is great and all, but it's only part of the story, and a crude one in many respects. :) Scraping HTML for "meaning"... insanity.


        Erik




