[
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982827#comment-13982827
]
Sebastian Nagel commented on NUTCH-1129:
----------------------------------------
Hi [~lewismc], not yet. But I had a look at the patch. Looks good in general!
Some comments:
* The dependency on the any23 jar is also declared in ivy/ivy.xml. Is a global
dependency required? We recently had a discussion about that topic
[@user|http://mail-archives.apache.org/mod_mbox/nutch-user/201404.mbox/%3C535615BA.3050601%40raytion.com%3E].
* All extracted triples end up in a single multi-valued field, with each
triple represented as a string. That is not an optimal representation for
the two possible use cases (are there more?): extracting and indexing
key-value pairs as structured content (cf.
[@dev|http://mail-archives.apache.org/mod_mbox/nutch-dev/201204.mbox/%3C4F8DEC5B.8070705%40googlemail.com%3E]),
and indexing into a triple store (as a new indexer back-end).
* Similarly: isn't there a more efficient way to pass triples from the parse
step to the indexing filter than tab-separated in one huge string? (There may
be many triples in one document!)
The latter two points are by no means blockers. But we should think about
evolving the plugin to make it really usable.
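To illustrate the key-value use case from the second point, here is a minimal sketch in plain Java. It is only an assumption about how triples might be grouped, not what the patch does: the {{TripleFields}} and {{Triple}} names are hypothetical and belong to neither Nutch nor Any23. Each predicate becomes a field name and its objects become that field's values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Nutch or Any23. */
final class TripleFields {

    /** Minimal subject/predicate/object holder (hypothetical). */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /**
     * Groups object values by predicate, keeping the order in which
     * predicates and values were first seen. Subjects are dropped here,
     * which is acceptable only if one document maps to one index entry.
     */
    static Map<String, List<String>> byPredicate(List<Triple> triples) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (Triple t : triples) {
            fields.computeIfAbsent(t.predicate, k -> new ArrayList<>())
                  .add(t.object);
        }
        return fields;
    }
}
```

Each map entry could then be added as its own multi-valued field instead of one opaque string per triple. A triple-store back-end would still need the full triples, subjects included.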
> Any23 Nutch plugin
> ------------------
>
> Key: NUTCH-1129
> URL: https://issues.apache.org/jira/browse/NUTCH-1129
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 2.3, 1.9
>
> Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin
> which extracts RDF data from HTTP and file resources. Although, as of writing,
> Any23 is not part of the ASF, the project is working towards integration into
> the Apache Incubator. Once the project proves its value, this would be an
> excellent addition to the Nutch 1.X codebase.
--
This message was sent by Atlassian JIRA
(v6.2#6252)