[jira] [Commented] (ANY23-227) not extracting opengraph rdfa

stephane corlosquet (JIRA) Sat, 02 Aug 2014 19:11:28 -0700

    [ 
https://issues.apache.org/jira/browse/ANY23-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14083838#comment-14083838
 ]


stephane corlosquet commented on ANY23-227:
-------------------------------------------

This happens because http://www.last.fm/music/Bread contains some unclosed meta 
elements. These are valid HTML5, but the HTML parser used in any23 seems to 
choke as soon as it reaches any of these lines:
{code}
    <meta charset="utf-8">
...
        <meta name="apple-itunes-app" content="app-id=585235199">
{code}

and ignores the rest of the content of the head element. In fact, if you 
download that page and move any og: meta element above that first meta element, 
it is extracted properly on my local machine (using the rover CLI).

I tried using the [web UI of semargl|http://semarglproject.org/demo-rdfa.html] 
(the RDFa parser used in any23), and it's able to extract the og data from 
http://www.last.fm/music/Bread without any problem. Aren't we using semargl's 
HTML parser? I wonder if any23 uses its own HTML parser. Pinging [~p_ansell] 
who worked on the integration between any23 and semargl.

It seems another element that the parser doesn't like is:
{code}
        <!--[if IE]><![endif]-->
{code}
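
For anyone who wants to reproduce this without fetching the live page, here is 
a minimal sketch of a test document that combines the constructs mentioned in 
this comment (unclosed meta elements and an IE conditional comment) followed by 
og: meta elements. It is checked with Python's standard-library html.parser, 
which is not the parser used by any23 or semargl; it only illustrates that a 
tolerant parser can get past these lines. The og: values in SAMPLE and the 
names OpenGraphCollector and extract_opengraph are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical minimal page: the structure mirrors the constructs discussed
# in this comment, but the og: values are invented for illustration.
SAMPLE = """<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <!--[if IE]><![endif]-->
    <meta name="apple-itunes-app" content="app-id=585235199">
    <meta property="og:title" content="Bread">
    <meta property="og:type" content="music.musician">
</head>
<body></body>
</html>
"""


class OpenGraphCollector(HTMLParser):
    """Collects og:* properties from meta elements, closed or unclosed."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Called for <meta ...> and, via the default handle_startendtag,
        # also for the self-closing form <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        if prop.startswith("og:"):
            self.properties[prop] = attributes.get("content")


def extract_opengraph(html):
    """Return a dict mapping og:* property names to their content values."""
    collector = OpenGraphCollector()
    collector.feed(html)
    collector.close()
    return collector.properties


if __name__ == "__main__":
    print(extract_opengraph(SAMPLE))
```

Saving SAMPLE to a file and running any23 against it, then removing the 
suspect lines one at a time, should help narrow down which construct stops the 
extraction.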

> not extracting opengraph rdfa
> -----------------------------
>
>                 Key: ANY23-227
>                 URL: https://issues.apache.org/jira/browse/ANY23-227
>             Project: Apache Any23
>          Issue Type: Bug
>    Affects Versions: 1.0
>            Reporter: hadar
>
> unable to extract opengraph data using any23 default settings.
> example page.
> http://www.last.fm/music/Bread



--
This message was sent by Atlassian JIRA
(v6.2#6252)
