[ https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-1129:
----------------------------------------
Attachment: NUTCH-1129.patch
First pass at this for 2.x HEAD.
The patch includes some tests covering RDFa and Microdata extraction.
I've documented the patch everywhere I could to make the Any23 functionality as
clear as possible.
For those wanting to test out this patch, please set logging to DEBUG and you
will see an extractor report in your logs. This is useful for seeing
which Any23 extractors were activated and used, how many triples were
extracted, and how long the job took.
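
A minimal sketch of such a logging change, assuming the stock log4j 1.x
properties file under conf/ is in use. The Nutch logger name below is an
assumption, not taken from the patch; substitute the package the plugin
classes actually live in.

```
# conf/log4j.properties (sketch; logger names are assumptions)
# Hypothetical package for the new plugin: check the patch for the real one.
log4j.logger.org.apache.nutch.parse.any23=DEBUG
# Any23 library classes, assuming the org.apache.any23 package namespace.
log4j.logger.org.apache.any23=DEBUG
```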
There are some cons which I would like to address. Right now, by default, we
(the Any23 code base) print out a rather bulky configuration message, which is
really unappealing as far as logging goes. I need to find a way of turning this
off. It may be possible through configuration, but I may also need to add a
switch down in Any23 for it.
So anyway, here is a first pass. If you are able to comment, that would be great.
Thanks
> Any23 Nutch plugin
> ------------------
>
> Key: NUTCH-1129
> URL: https://issues.apache.org/jira/browse/NUTCH-1129
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 2.3, 1.9
>
> Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin
> which extracts RDF data from HTTP and file resources. Although, as of writing,
> Any23 is not part of the ASF, the project is working towards integration into
> the Apache Incubator. Once the project proves its value, this would be an
> excellent addition to the Nutch 1.X codebase.
--
This message was sent by Atlassian JIRA
(v6.2#6252)