[
https://issues.apache.org/jira/browse/ANY23-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13565867#comment-13565867
]
Lewis John McGibbney commented on ANY23-137:
--------------------------------------------
Hi Lev,
I've come across another issue with the existing html-rdfa11 extractor
implementation and have attached the file.
For reference, here are the log report and output.
{code}
<response>
<extractors>
<extractor>html-head-title</extractor>
<extractor>html-mf-hcard</extractor>
<extractor>html-mf-adr</extractor>
<extractor>html-rdfa11</extractor>
</extractors>
<report>
<message/>
<error/>
<issueReport>
<extractorIssues extractor="html-rdfa11">
<issue level="Warning" row="202" col="30">Error while processing node [/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[1]/DIV[2]/DIV[1]/DIV[1]/P[1]/SPAN[1]/A[1]] : 'Cannot map prefix 'width''</issue>
<issue level="Warning" row="204" col="30">Error while processing node [/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[1]/DIV[2]/DIV[1]/DIV[1]/P[2]/SPAN[1]/A[1]] : 'Cannot map prefix 'width''</issue>
<issue level="Warning" row="208" col="30">Error while processing node [/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[1]/DIV[2]/DIV[1]/DIV[2]/P[1]/SPAN[1]/A[1]] : 'Cannot map prefix 'width''</issue>
</extractorIssues>
</issueReport>
<validationReport>
<errors>
</errors>
<ruleActivations>
</ruleActivations>
<issues>
</issues>
</validationReport>
</report>
<data>
# OUTPUT FORMAT: Turtle (mimeTypes=text/turtle, application/x-turtle; ext=ttl)
# BEGIN: ExtractionContext(urn:x-any23:html-mf-adr:root-extraction-result-id:http://stanford.edu/)
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
# BEGIN: ExtractionContext(urn:x-any23:html-mf-adr:1:http://stanford.edu/)
# BEGIN: ExtractionContext(urn:x-any23:html-head-title:root-extraction-result-id:http://stanford.edu/)
@prefix dcterms: <http://purl.org/dc/terms/> .
<http://stanford.edu/> dcterms:title "Stanford University"@en .
_:noded01df813432682e65b842257f3757e9 a vcard:Address ;
  vcard:locality "450 Serra Mall, Stanford" ;
  vcard:region "CA" ;
  vcard:postal-code "94305" .
# BEGIN: ExtractionContext(urn:x-any23:html-mf-hcard:root-extraction-result-id:http://stanford.edu/)
_:node68324ba1f68fb1712ae267fe33274 vcard:fn "Stanford University" ;
  vcard:n _:node17eprgndbx338343 .
_:node17eprgndbx338343 a vcard:Name ;
  vcard:given-name "Stanford" ;
  vcard:family-name "University" .
_:node68324ba1f68fb1712ae267fe33274 vcard:org _:node17eprgndbx338344 .
_:node17eprgndbx338344 a vcard:Organization ;
  vcard:organization-name "Stanford University" .
_:node68324ba1f68fb1712ae267fe33274 vcard:adr _:noded01df813432682e65b842257f3757e9 ;
  vcard:tel <tel:(650)%20723-2300> .
# BEGIN: ExtractionContext(urn:x-any23:html-mf-hcard:1:http://stanford.edu/)
_:node68324ba1f68fb1712ae267fe33274 a vcard:VCard .
# BEGIN: ExtractionContext(urn:x-any23:html-rdfa11:root-extraction-result-id:http://stanford.edu/)
<http://stanford.edu/> <http://stanford.edu/alternate> <http://news.stanford.edu/rss/index.xml> .
<http://stanford.edu/css/layout.css?v=3.0> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> .
<http://stanford.edu/css/homepage.css?v=3.1> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> .
<http://stanford.edu/css/jquery.fancybox.css?v=2.0.5> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> .
<http://stanford.edu/css/mobile.css> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> .
<https://fonts.googleapis.com/css?family=Source+Sans+Pro:300,400,600,700> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> .
<https://fonts.googleapis.com/css?family=Crimson+Text:400,600,700> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> .
# END: ExtractionContext(urn:x-any23:html-mf-adr:1:http://stanford.edu/)
# END: ExtractionContext(urn:x-any23:html-mf-adr:root-extraction-result-id:http://stanford.edu/)
# END: ExtractionContext(urn:x-any23:html-rdfa11:root-extraction-result-id:http://stanford.edu/)
# END: ExtractionContext(urn:x-any23:html-head-title:root-extraction-result-id:http://stanford.edu/)
# END: ExtractionContext(urn:x-any23:html-mf-hcard:root-extraction-result-id:http://stanford.edu/)
# END: ExtractionContext(urn:x-any23:html-mf-hcard:1:http://stanford.edu/)
</data></response>
{code}
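In case it helps with triage, here is a rough sketch of how the warnings in a report like the one above could be listed with a few lines of Python. It is only an illustration, not part of Any23: it assumes the `<report>` element has been copied out on its own as well-formed XML (the `<data>` section is left out because the Turtle in it is not escaped here), and the names `REPORT` and `list_issues` are made up for this example. Only the standard library is used.

```python
import xml.etree.ElementTree as ET

# The first two issues of the <report> element from the output above,
# copied out as a standalone fragment (an assumption of this sketch).
REPORT = """<report><message/><error/><issueReport>
<extractorIssues extractor="html-rdfa11">
<issue level="Warning" row="202" col="30">Error while
processing node
[/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[1]/DIV[2]/DIV[1]/DIV[1]/P[1]/SPAN[1]/A[1]]
: 'Cannot map prefix 'width''</issue>
<issue level="Warning" row="204" col="30">Error while processing node
[/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[1]/DIV[2]/DIV[1]/DIV[1]/P[2]/SPAN[1]/A[1]]
: 'Cannot map prefix 'width''</issue>
</extractorIssues></issueReport></report>"""


def list_issues(report_xml):
    """Return (extractor, level, row, col, message) tuples from a report fragment."""
    root = ET.fromstring(report_xml)
    issues = []
    for group in root.iter("extractorIssues"):
        extractor = group.get("extractor")
        for issue in group.iter("issue"):
            # Collapse whitespace so hard-wrapped messages read as one line.
            message = " ".join((issue.text or "").split())
            issues.append((extractor, issue.get("level"),
                           int(issue.get("row")), int(issue.get("col")), message))
    return issues


for extractor, level, row, col, message in list_issues(REPORT):
    print(f"{extractor} {level} {row}:{col} {message}")
```

Each printed line carries the extractor name, the level, and the row and column from the report, which makes it easier to match a warning to the offending line in the attached HTML.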
> RDFa parser implementation proposal
> -----------------------------------
>
> Key: ANY23-137
> URL: https://issues.apache.org/jira/browse/ANY23-137
> Project: Apache Any23
> Issue Type: Improvement
> Components: core
> Affects Versions: 0.8.0
> Reporter: Lev Khomich
> Priority: Minor
> Fix For: 0.8.0
>
> Attachments: rdfa-extractor-proposal.patch
>
>
> As a follow-up to the discussion in [1],
> I've implemented another RDFa extractor for Any23 (0.7.1).
> The proposed code depends on the semargl project [2]. It isn't published in
> Maven Central, so I didn't change any POMs.
> I'm still not quite sure about the class name (the related ones are already
> taken), so feel free to rename it. See the attachments for a patch with the
> extractor and tests.
> [1] http://mail-archives.apache.org/mod_mbox/any23-dev/201212.mbox/browser
> [2] http://semarglproject.org