[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1129:
----------------------------------------

    Attachment: NUTCH-1129.patch

First pass at this for 2.x HEAD.
The patch includes some tests covering RDFa and Microdata extraction.
I've documented the patch wherever I could to make the Any23 functionality as
clear as possible.

For those wanting to test this patch, please set logging to debug and you
will see an extractor report in your logs. It shows which Any23 extractors
were activated and used, how many triples were extracted and how long the
job took.

Some cons which I would like to address. Right now, by default, we (the Any23
code base) print a rather bulky configuration message, which is unappealing
as far as logging goes. I need to find a way of turning this off. It may be
possible through configuration, but I may also need to add a switch down in
Any23 for it.

So anyway, here is a first pass. If you are able to comment, that would be great.
Thanks 

> Any23 Nutch plugin
> ------------------
>
>                 Key: NUTCH-1129
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1129
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 2.3, 1.9
>
>         Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin 
> which extracts RDF data from HTTP and file resources. Although at the time of
> writing Any23 is not part of the ASF, the project is working towards integration into
> the Apache Incubator. Once the project proves its value, this would be an 
> excellent addition to the Nutch 1.X codebase. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
