[ https://issues.apache.org/jira/browse/ANY23-168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13948664#comment-13948664 ]

Lewis John McGibbney commented on ANY23-168:
--------------------------------------------

I've been trying to set the boolean property 'any23.extraction.head.meta' 
to true, as advised in our documentation [0], as follows:
{code:title=MyCode.java|borderStyle=solid}
    Any23 runner = new Any23();
    DocumentSource source = runner.createDocumentSource("file:" + fileURIString);
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    TripleHandler handler = new JSONWriter(baos);
    final ExtractionParameters extractionParameters = ExtractionParameters.newDefault();
    extractionParameters.setFlag("any23.extraction.head.meta", true);
    try {
      runner.extract(extractionParameters, source, handler);
    } catch (ExtractionException e) {
      e.printStackTrace();
    } finally {
      handler.close();
    }
{code}
but so far, when I am debugging the code, it seems that the 
SingleDocumentExtraction class is NOT registering HTMLMetaExtractor for 
potential extraction.

If you get any further with this, please update this thread; it would be 
excellent to get this issue sorted out.
Thanks

[0] http://any23.apache.org/configuration.html

> RDFa properties in <meta> elements not picked up
> ------------------------------------------------
>
>                 Key: ANY23-168
>                 URL: https://issues.apache.org/jira/browse/ANY23-168
>             Project: Apache Any23
>          Issue Type: Bug
>            Reporter: Ruben Verborgh
>              Labels: meta-tags, rdfa
>             Fix For: 1.0.0
>
>
> RDFa annotations in <meta> elements are not picked up:
> http://ruben.verborgh.org/tmp/dctitle-test.html
> http://any23.org/any23/?uri=http%3A%2F%2Fruben.verborgh.org%2Ftmp%2Fdctitle-test.html
> The Structured Data Testing Tool finds them:
> http://www.google.com/webmasters/tools/richsnippets?q=http%3A%2F%2Fruben.verborgh.org%2Ftmp%2Fdctitle-test.html
> Additionally, I wonder whether it's a good idea to drop the dcterms:title 
> property extracted from <title> if an actual dc:title property is present. 
> This allows for more meaningful titles, for instance:
>     <title>HTML Title – Website Name</title>
>     <meta property="dc:title" content="DC Title"/>
> This would make it possible to overcome the common situation where the HTML 
> <title> also contains the website name etc., and so is not suited for a 
> "clean" dc:title. I would thus say that an actual dc:title has precedence 
> over an implied dc:title from <title>.
> Furthermore, I'm confused by the double appearance of
> <http://ruben.verborgh.org/tmp/dctitle-test.html> dcterms:title "HTML Title – 
> Website Name" .
> AND
> <http://ruben.verborgh.org/tmp/dctitle-test.html> 
> <http://www.w3.org/1999/xhtml/microdata#item> 
> _:nodecfcd208495d565ef66e7dff9f98764da ;
>       dcterms:title "HTML Title – Website Name" .
> Should the page itself AND some blank node have this dcterms:title? (And what 
> happens if the <meta> tags are parsed?)
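
For illustration only, the precedence rule proposed in the description above (an explicit dc:title from <meta> wins over the title implied by <title>) could be sketched roughly as follows. This is not Any23 code: the class TitlePrecedence and the method chooseTitle are hypothetical names, and the sketch assumes both candidate values have already been extracted as plain strings.

```java
import java.util.Optional;

/** Hypothetical helper for illustration; not part of Any23. */
final class TitlePrecedence {

    private TitlePrecedence() {}

    /**
     * Picks the title to emit for a page.
     * An explicit, non-blank dc:title from a <meta> element takes precedence;
     * otherwise the text of the HTML <title> element is used as a fallback.
     */
    static Optional<String> chooseTitle(String htmlTitle, String metaDcTitle) {
        if (metaDcTitle != null && !metaDcTitle.trim().isEmpty()) {
            return Optional.of(metaDcTitle.trim());
        }
        if (htmlTitle != null && !htmlTitle.trim().isEmpty()) {
            return Optional.of(htmlTitle.trim());
        }
        // Neither source provided a usable title.
        return Optional.empty();
    }
}
```

With the example markup from the description, chooseTitle would be given the <title> text and "DC Title" and would return "DC Title"; if the <meta> element were absent (null), it would fall back to the <title> text.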



