[
https://issues.apache.org/jira/browse/ANY23-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14083838#comment-14083838
]
stephane corlosquet commented on ANY23-227:
-------------------------------------------
This happens because http://www.last.fm/music/Bread contains some unclosed HTML
meta elements. These are valid HTML5, but the HTML parser used in any23 seems
to choke as soon as it reaches any of these lines:
{code}
<meta charset="utf-8">
...
<meta name="apple-itunes-app" content="app-id=585235199">
{code}
and ignores the rest of the content of the head element. In fact, if you
download that page and move any og: meta element above that first meta element,
it is extracted properly on my local machine (using the rover CLI).
I tried using the [web UI of semargl|http://semarglproject.org/demo-rdfa.html]
(the RDFa parser used in any23), and it is able to extract the og data from
http://www.last.fm/music/Bread without any problem. Aren't we using semargl's
HTML parser? I wonder if any23 uses its own HTML parser. Pinging [~p_ansell],
who worked on the integration between any23 and semargl.
Another construct the parser does not seem to like is the IE conditional comment:
{code}
<!--[if IE]><![endif]-->
{code}
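For anyone who wants to check the expected behaviour outside any23, here is a minimal sketch using Python's standard html.parser rather than the any23 or semargl code paths. The sample head is a hypothetical reduction of the page, the og: values are placeholders, and the names OpenGraphCollector and extract_open_graph are made up for this illustration. It only shows that a lenient parser keeps reading past unclosed meta elements and the IE conditional comment; it says nothing about where any23 actually stops.

```python
from html.parser import HTMLParser

# Hypothetical reduction of the page head. The og: values are placeholders,
# not data copied from the real last.fm page.
SAMPLE_HEAD = """<html><head>
<!--[if IE]><![endif]-->
<meta charset="utf-8">
<meta name="apple-itunes-app" content="app-id=585235199">
<meta property="og:title" content="Example title">
<meta property="og:type" content="website">
</head><body></body></html>"""


class OpenGraphCollector(HTMLParser):
    """Collect og:* properties from meta elements, whether closed or not."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Unclosed <meta ...> tags arrive here like any other start tag.
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        if prop.startswith("og:"):
            self.properties[prop] = attributes.get("content")


def extract_open_graph(html):
    """Return a dict of og:* property names to their content values."""
    collector = OpenGraphCollector()
    collector.feed(html)
    collector.close()
    return collector.properties


if __name__ == "__main__":
    print(extract_open_graph(SAMPLE_HEAD))
```

If any23 returned the same two properties for an equivalent input, the unclosed meta elements and the conditional comment could be ruled out as the cause.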
> not extracting opengraph rdfa
> -----------------------------
>
> Key: ANY23-227
> URL: https://issues.apache.org/jira/browse/ANY23-227
> Project: Apache Any23
> Issue Type: Bug
> Affects Versions: 1.0
> Reporter: hadar
>
> unable to extract opengraph data using any23 default settings.
> example page.
> http://www.last.fm/music/Bread