Hi Folks,
I was recently messing around with the Web Data Commons (WDC) data with the
aim of setting up a Jena TDB store of the entire most recent WDC dataset.
When attempting to import a subset of the WDC N-Quads into TDB2, I kept
running into errors like the following:

[2018-10-03 22:57:24] Fuseki     ERROR [line: 102379, col: 115] Illegal character in IRI (codepoint 0x7C, '|'): <http://www.hsamuel.co.uk/l/gifts/category[|]...>
[2018-10-03 22:57:26] Fuseki     INFO  [10] 400 Parse error: [line: 102379, col: 115] Illegal character in IRI (codepoint 0x7C, '|'): <http://www.hsamuel.co.uk/l/gifts/category[|]...> (6.786 s)

The quad is as follows

_:genid2d8089eee9237845cab8ffa262694474f12db3 <http://schema.org/item> <http://www.hsamuel.co.uk/l/gifts/category|ladies%20accessories/> <http://www.hsamuel.co.uk/webstore/d/2821133/happy+40th+anniversary+champagne+flutes/> .

As you can see above, the '|' (vertical bar) character in the object IRI is
what triggers the error on the TDB2 side.
When I ran this through the any23.org service with the validate+fix, report
and annotate parameters set to true, I got the following report:

<?xml version="1.0" encoding="UTF-8" ?>
<response>
<extractors>
<extractor>rdf-nq</extractor>
</extractors>
<report>
<message/>
<error/>
<issueReport>
<extractorIssues extractor="rdf-nq">
<issue level="ERROR" row="1" col="-1">Unexpected character U+7C at
index 41: 
http://www.hsamuel.co.uk/l/gifts/category|ladies%20accessories/</issue>
</extractorIssues>
</issueReport>
<validationReport>
<issues>
</issues>
<ruleActivations>
</ruleActivations>
<errors>
</errors>
</validationReport>
</report>
<data>
<![CDATA[
[ {
  "@graph" : [ {
    "@id" : "http://www.hsamuel.co.uk/l/gifts/category|ladies%20accessories/",
    "http://schema.org/name" : [ {
      "@value" : "Ladies Accessories"
    } ]
  } ],
  "@id" : "http://www.hsamuel.co.uk/webstore/d/2821133/happy+40th+anniversary+champagne+flutes/"
} ]]]>
</data>
</response>

I thought the Subject would be fixed, with the vertical bar replaced by its
percent-encoded form (%7C), but this doesn't seem to be the case.
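
To make the expected replacement concrete, here is a minimal Python sketch of
a pre-processing step that percent-encodes the offending character before the
data is loaded. This is not something Any23 or Jena provides; the names
encode_illegal and fix_line are hypothetical, the character set covers only a
handful of characters the N-Quads IRIREF grammar rejects, and the regex
assumes that no literal on the line contains '<...>' text.

```python
import re

# Assumption: a small subset of the characters that may not appear
# unescaped inside <...> in N-Quads. '%' is deliberately left alone so
# that existing escapes such as %20 are not double-encoded.
ILLEGAL = set('{}|^`')

# Naive IRI matcher: anything between '<' and the next '>'.
# It would also match '<...>' inside a literal, which this sketch ignores.
IRI_TOKEN = re.compile(r'<([^<>]*)>')


def encode_illegal(iri: str) -> str:
    """Percent-encode each character of `iri` that is in ILLEGAL."""
    return ''.join(
        '%{:02X}'.format(ord(ch)) if ch in ILLEGAL else ch
        for ch in iri
    )


def fix_line(line: str) -> str:
    """Apply encode_illegal to every <...> token on one N-Quads line."""
    return IRI_TOKEN.sub(
        lambda m: '<' + encode_illegal(m.group(1)) + '>', line
    )
```

Applied to the quad above, fix_line would turn category|ladies%20accessories/
into category%7Cladies%20accessories/ and leave the other terms unchanged.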
Lewis

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
