Andrzej, thanks so much. It's great that Nutch follows the HEAD <link>, since that is the preferred place for autodiscovery of RDF/OWL data. The type attribute of the <link> tag can be set to "application/owl+xml" or "application/rdf+xml" so that the Nutch crawler knows the linked resource contains RDF/OWL content.
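For reference, the autodiscovery pattern I have in mind looks like this (the href and title values are just illustrative):

  <head>
    <link rel="alternate" type="application/rdf+xml"
          title="RDF metadata" href="http://example.org/data/catalog" />
    <link rel="alternate" type="application/owl+xml"
          title="OWL ontology" href="http://example.org/ontology" />
  </head>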
A related question: if I want Nutch to fetch only RDF/OWL files, is it possible to generate the fetch list from URLs whose type is "application/owl+xml" or "application/rdf+xml"? Filtering by file extension does not always work, because the resource URL may not have an extension like ".rdf". If Nutch kept the application type of each <link> item it finds, that type could be used later when selecting URLs for the fetch list. I plan to use Nutch to crawl specifically for RDF/OWL files and then parse them into Lucene documents for storage in a Lucene index. This Lucene index of semantic data will be searched from the same Nutch search interface.
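To make the idea concrete, here is a rough sketch of the selection logic I mean. It is entirely hypothetical -- Nutch does not currently record the link type, so the linkType parameter and the class itself are made up for illustration:

  import java.util.Arrays;
  import java.util.HashSet;
  import java.util.Set;

  // Hypothetical: assumes the HTML parser has recorded the type
  // attribute of each <link> element alongside the outlink URL.
  public class RdfLinkSelector {
    private static final Set RDF_TYPES = new HashSet(Arrays.asList(
        new String[] { "application/rdf+xml", "application/owl+xml" }));

    public boolean shouldFetch(String url, String linkType) {
      if (linkType == null) {
        // No type was recorded -- fall back to the (unreliable)
        // file-extension test.
        return url.endsWith(".rdf") || url.endsWith(".owl");
      }
      return RDF_TYPES.contains(linkType.toLowerCase());
    }
  }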
Thanks,
AJ

On 6/16/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
AJ Chen wrote:
> I'm about to use nutch to crawl semantic data. Links to semantic data
> files (RDF, OWL, etc.) can be placed in two places: (1) HEAD <link>;
> (2) BODY <a href...>. Does the nutch crawler follow the HEAD <link>?

Yes. Please see parse-html/..../DOMContentUtils.java for details.

> I'm also creating a semantic data publishing tool. I would appreciate
> any suggestion regarding the best way to make RDF files visible to the
> nutch crawler.

Well, Nutch is certainly not a competitor to an RDF triple-store ;) It
may be used to collect RDF files, and then the map-reduce jobs can be
used to massively process these files to annotate large numbers of
target resources (e.g. add metadata to pages in the crawldb). You could
also load them into a triple store and use that to annotate resources
in Nutch, to provide a better searching experience (e.g. searching by
concept, by semantic relationships, finding similar concepts in other
ontologies, etc).

In the end, the model that Nutch supports best is the Lucene model,
which is an unordered bag of documents with multiple fields
(properties). If you can translate your required model into this, then
you're all set.

Nutch/Hadoop also provides a scalable processing framework, which is
quite useful for enhancing the existing data with data from external
sources (e.g. databases, triple stores, ontologies, semantic nets and
such). In some cases, when this external infrastructure is efficient
enough, it's possible to combine it on the fly (I have successfully
used this approach with WordNet, Wikipedia and DMOZ); in other cases
you will need to do some batch pre-processing to make this external
metadata available as part of Nutch documents. Again, the framework of
map/reduce and DFS is very useful for that (and I have used this
approach too, even with the same data as above).

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
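P.S. To make the "unordered bag of documents with fields" translation
above concrete, here is roughly how I plan to turn each fetched RDF/OWL
file into a Lucene document. This is just a sketch: it assumes Jena for
RDF parsing and the Lucene 1.9+ Field API, and the one-field-per-predicate
naming scheme is only one possible choice:

  import java.io.InputStream;
  import com.hp.hpl.jena.rdf.model.Model;
  import com.hp.hpl.jena.rdf.model.ModelFactory;
  import com.hp.hpl.jena.rdf.model.Statement;
  import com.hp.hpl.jena.rdf.model.StmtIterator;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  public class RdfIndexer {
    // Flatten all triples of one RDF/OWL file into a single Lucene
    // document: one field per statement, named by the predicate URI
    // and valued by the object.
    public Document toLuceneDoc(InputStream in, String url) {
      Model model = ModelFactory.createDefaultModel();
      model.read(in, url);  // second argument is the base URI

      Document doc = new Document();
      doc.add(new Field("url", url,
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
      for (StmtIterator it = model.listStatements(); it.hasNext(); ) {
        Statement stmt = it.nextStatement();
        doc.add(new Field(stmt.getPredicate().getURI(),
                          stmt.getObject().toString(),
                          Field.Store.YES, Field.Index.TOKENIZED));
      }
      return doc;
    }
  }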
