[ 
https://issues.apache.org/jira/browse/ANY23-67?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621045#comment-16621045
 ] 

Hans Brende commented on ANY23-67:
----------------------------------

[~lewismc] Yes. 

One of the reasons I'm interested in getting the currently incorrect algorithm 
fixed is that later versions of the algorithm appear to have removed the 
(original) substeps 5.2.1 through 5.2.4, and these substeps are creating a lot 
of extraction noise. E.g., in the combined test crawls I've run, the predicate 
{{http://www.w3.org/1999/xhtml/vocab#ALTERNATE-STYLESHEET}} from the microdata 
extractor (substep 5.2.2.7) occurred 2,876,844 times, making it the third most 
frequent predicate produced. The most frequent was 
{{http://www.w3.org/1999/02/22-rdf-syntax-ns#type}}, occurring 5,745,237 times, 
and the second was {{http://www.w3.org/1999/xhtml/vocab#stylesheet}} from the 
{{html-rdfa11}} extractor, occurring 3,147,382 times. So rather than 
implementing an additional "blocking" triple handler, it might be better in the 
long term to simply fix the underlying algorithm.
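
For illustration only, here is a rough sketch of the two things discussed 
above: counting how often each predicate occurs, and the kind of "blocking" 
filter that fixing the algorithm would make unnecessary. It is standalone Java 
and does not use the Any23 API; the class {{PredicateNoiseSketch}}, its nested 
{{Triple}} type, and both method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these types belong to the Any23 API.
final class PredicateNoiseSketch {

    /** Minimal stand-in for an extracted RDF statement. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** Counts how often each predicate URI occurs, in first-seen order. */
    static Map<String, Long> predicateFrequencies(List<Triple> triples) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (Triple t : triples) {
            counts.merge(t.predicate, 1L, Long::sum);
        }
        return counts;
    }

    /** Returns only the triples whose predicate is not in the blocked set. */
    static List<Triple> dropBlocked(List<Triple> triples, Set<String> blocked) {
        List<Triple> kept = new ArrayList<>();
        for (Triple t : triples) {
            if (!blocked.contains(t.predicate)) {
                kept.add(t);
            }
        }
        return kept;
    }
}
```

A filter like {{dropBlocked}} only hides the noise after it has been 
generated, which is why fixing the algorithm itself looks like the better 
long-term option.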

Another reason I'm interested in getting the algorithm fixed is that it 
generates redundant triples. E.g., extracting the title of the document 
(substep 5.2.1) is already handled by our {{html-head-title}} extractor. 
Generating xhtml vocab triples (substep 5.2.2.8) appears to be already handled 
by our {{html-rdfa11}} extractor.

Also, it appears that our existing URI generation algorithm for predicates is 
wrong.

Thoughts?

> Microdata extraction using obsolete RDF conversion scheme
> ---------------------------------------------------------
>
>                 Key: ANY23-67
>                 URL: https://issues.apache.org/jira/browse/ANY23-67
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: microdata
>    Affects Versions: 0.7.0
>            Reporter: Hannes Mühleisen
>            Priority: Major
>             Fix For: 2.3
>
>
> There is now a more-or-less final Microdata to RDF algorithm published [1] 
> that differs from the one in the current, official HTML5 draft [2], which 
> Ian Hickson has publicly revoked. However, according to a comment in its 
> source code, Any23's extractor still uses the old scheme from [2], i.e., 
> exactly the algorithm that Ian Hickson rescinded. Unfortunately, the official 
> working drafts have not been updated for a very long time, but if you look at 
> the editor's draft [3], you will see that that section has been entirely 
> removed. Instead, a Semantic Web Interest Group task force discussed the 
> issues, and [1] is the result of that discussion. It would be nice if this 
> were reflected in Any23 in the future.
> [Condensed from an E-Mail conversation with Ivan Herman]
> [1] http://www.w3.org/TR/microdata-rdf/
> [2] http://www.w3.org/TR/microdata/#rdf
> [3] http://dev.w3.org/html5/md/Overview.html



