[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

Sebastian Nagel (JIRA) Mon, 28 Apr 2014 01:16:20 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982827#comment-13982827 ]


Sebastian Nagel commented on NUTCH-1129:
----------------------------------------

Hi [~lewismc], not yet. But I had a look at the patch. Looks good, in general! 
Some comments:
* The dependency on the any23 jar is also declared in ivy/ivy.xml. Is a global 
dependency required? We recently had a discussion about that topic 
[@user|http://mail-archives.apache.org/mod_mbox/nutch-user/201404.mbox/%3C535615BA.3050601%40raytion.com%3E].
* All extracted triples are finally stored in one multi-valued field, each 
triple represented as a string. That's not an optimal representation for 
two (are there more?) possible use cases: extracting and indexing key-value 
pairs as structured content (cf. 
[@dev|http://mail-archives.apache.org/mod_mbox/nutch-dev/201204.mbox/%3C4F8DEC5B.8070705%40googlemail.com%3E]),
 and indexing into some triple store (as a new indexer back-end).
* Similarly: isn't there a more efficient way to pass triples from parse to 
the indexing filter than tab-separated in one huge string? (There may be many 
triples in one document!)

The latter two points aren't blockers by any means. But we should think about 
evolving the plugin to make it really usable.
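
To make the second and third points more concrete, here is a rough sketch of 
the two representations in plain Java. It is only an illustration: the class 
TripleFields, the nested Triple type, and both method names are hypothetical 
and are not part of the patch, of Nutch, or of Any23. It assumes triples are 
available as simple subject/predicate/object strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not Nutch or Any23 API.
public class TripleFields {

    /** One RDF statement, with all three parts kept as plain strings. */
    public static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        public Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** The representation discussed above: one tab-separated string per triple. */
    public static String toTabSeparated(Triple t) {
        return t.subject + "\t" + t.predicate + "\t" + t.object;
    }

    /**
     * Alternative: group object values by predicate, so that each predicate
     * could become its own multi-valued key-value field.
     */
    public static Map<String, List<String>> groupByPredicate(List<Triple> triples) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (Triple t : triples) {
            fields.computeIfAbsent(t.predicate, k -> new ArrayList<>()).add(t.object);
        }
        return fields;
    }

    public static void main(String[] args) {
        List<Triple> triples = List.of(
            new Triple("http://example.org/page", "http://purl.org/dc/terms/title", "Example page"),
            new Triple("http://example.org/page", "http://purl.org/dc/terms/creator", "Alice"),
            new Triple("http://example.org/page", "http://purl.org/dc/terms/creator", "Bob"));

        // Print one line per predicate with all of its object values.
        groupByPredicate(triples).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Grouping by predicate drops the subject, which may be acceptable when all 
triples describe the crawled page itself; a triple store back-end would 
need the full statements instead.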

> Any23 Nutch plugin
> ------------------
>
>                 Key: NUTCH-1129
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1129
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 2.3, 1.9
>
>         Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin 
> which extracts RDF data from HTTP and file resources. Although, as of writing, 
> Any23 is not part of the ASF, the project is working towards integration into 
> the Apache Incubator. Once the project proves its value, this would be an 
> excellent addition to the Nutch 1.X codebase. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
