Re: [Virtuoso-users] Not Giving up on HTML5/RDFa Import

2016-01-14 Thread Kingsley Idehen
On 1/14/16 11:17 AM, Haag, Jason wrote:
> Hi All,
>
> I'm back again evaluating Virtuoso for the HTML5/RDFa crawling
> capability. We are considering moving to the Universal Server from VOS
> if I can ever prove to my team that it will be a good choice for
> sponging and crawling HTML5/RDFa files.

It most certainly is.

> I have been testing this feature periodically over the past several
> months with no luck. I appreciate the support and feedback so far, but
> I haven't made any progress. Some previous posts/inquiries I made on
> this topic are available
> here: http://sourceforge.net/p/virtuoso/mailman/message/34507072/ and
> here: 
> http://sourceforge.net/p/virtuoso/mailman/virtuoso-users/thread/CAHjqjnLo7-hiA30neYBsbGm93HeXe%3DHrda5rZPGS%3Dwm%2B08ZvBw%40mail.gmail.com/#msg34525370
>
> I would really like to use the conductor interface to regularly
> schedule the import of several graph IRIs that contain RDFa and check
> the triples for any additions on a daily basis. I recently upgraded the
> installation to VOS 7.2.3 and still can't seem to get the RDFa data to
> populate the data store.

Why don't you approach this matter as follows:

[1] Use the live instance at http://linkeddata.uriburner.com to import
your target data sources
[2] Compare that with what's happening on your local instance.

> After I run the import from the queue, any time I query the virtuoso
> database there is no data from the RDFa datasets that I have imported
> through conductor. I must be doing something wrong or missing an
> important step somewhere. However, if I load these exact RDFa IRIs
> with the isql-v function (DB.DBA.RDF_LOAD_RDFA), the triples load
> successfully.
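
A quick way to check whether a Conductor import populated a graph is to count the triples in that one graph instead of listing every graph. The sketch below only builds the request URL. The endpoint address (a local `/sparql` endpoint on port 8890) and the idea that the graph IRI equals the imported document URL are assumptions, not something confirmed in this thread.

```python
from urllib.parse import urlencode

# Assumption: a local Virtuoso SPARQL endpoint; adjust host and port as needed.
SPARQL_ENDPOINT = "http://localhost:8890/sparql"


def graph_count_query(graph_iri: str) -> str:
    """Return a SPARQL query that counts the triples in one named graph."""
    return (
        "SELECT (COUNT(*) AS ?triples) "
        f"WHERE {{ GRAPH <{graph_iri}> {{ ?s ?p ?o }} }}"
    )


def graph_count_url(graph_iri: str, endpoint: str = SPARQL_ENDPOINT) -> str:
    """Build a GET URL for the count query, asking for JSON results."""
    params = {
        "query": graph_count_query(graph_iri),
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    # Assumption: the import stores triples in a graph named after the target URL.
    print(graph_count_url("http://xapi.vocab.pub/datasets/adl/verbs"))
```

Running the same count before and after an import, and again after a DB.DBA.RDF_LOAD_RDFA load, should show which path actually writes triples.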

Yes, so there is something amiss in your setup. Your import/crawl jobs
should include directives for invoking the sponger cartridge for HTML docs.
>
> Here's a summary of what I've done and discovered so far:
>
> 1) Installed VOS 7.2.3 successfully
> 2) Read some of the newly updated documentation, which is excellent by
> the way
> 3) Checked/updated sponger privileges per this guidance for securing
> the
> endpoint: 
> http://docs.openlinksw.com/virtuoso/rdfsparql.html#rdfsupportedprotocolendpointuri
> 4) Installed cartridges_dav.vad from commercial version (for sponger
> cartridges):
> http://opldownload.s3.amazonaws.com/uda/vad-packages/7.2/cartridges_dav.vad
> 5) Checked and configured xHTML /  aka HTML5 (and variants) cartridge
> under “extractor cartridges” with the following settings (per advice
> from the mailing list/forums): 
>
> Pattern: (application/xhtml.xml)|(text|application)/.*(html|xml)
> fallback-mode=no
> rdfa=yes
> reify_html5md=1
> reify_rdfa=0

For now (while you are troubleshooting), also use: reify_rdfa=1
> reify_jsonld=0
> reify_all_grddl=0
> passthrough_mode=yes
> loose=yes
> reify_html=0
> reify_html_misc=0
> reify_turtle=no
>
>
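
To see which content types the cartridge pattern quoted above would accept, it can be tried against a few MIME strings. This is a rough sketch: it assumes the pattern is applied to the MIME type with unanchored, search-style matching, which may not be exactly how Virtuoso evaluates it.

```python
import re

# Pattern copied verbatim from the cartridge settings quoted above.
CARTRIDGE_PATTERN = r"(application/xhtml.xml)|(text|application)/.*(html|xml)"


def pattern_matches(content_type: str) -> bool:
    """Return True if the pattern is found anywhere in the content type.

    Assumption: unanchored search semantics; the server's own matching
    rules may differ.
    """
    return re.search(CARTRIDGE_PATTERN, content_type) is not None


if __name__ == "__main__":
    for ct in ("text/html", "application/xhtml+xml",
               "application/rdf+xml", "image/png"):
        print(ct, pattern_matches(ct))
```

Under these assumptions both `text/html` and `application/xhtml+xml` match, so the pattern itself is unlikely to be what keeps the cartridge from firing.
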
> I also tried this basic configuration as well:
> add-html-meta=no
> get-feeds=no
> rdfa=yes
> fallback-mode=no
> reify_html=no
> reify_html_misc=no
> reify_html5md=no
> reify_rdfa=no
> reify_jsonld=no
> reify_turtle=no
> reify_all_grddl=no
> passthrough_mode=no
> loose=no
>
> 6) Created a content import for the HTML5/RDFa document using
> conductor with the following options:
>
> Target URL: http://xapi.vocab.pub/datasets/adl/verbs
> login owner: dba
> checked the following
>
>   * store documents locally
>   * run sponger
>   * store metadata (selected xHTML aka HTML5 and variants)
>

Go to: https://www.pinterest.com/pin/389561436498376210/ -- this shows
your content through the lens of our OSDS browser extension.

>
> 7) Ran the import, but 0/1 pages/sites were retrieved. I looked up
> the error, which was: "XM003: XML parser detected an error: ERROR : Tag
> nesting error: name 'head' of end tag does not match the name 'link'
> of start tag at line 19 column 108 at line 20 column 9 of source text
>  ---^"
> 8) This appears to be a validation error from the parser looking for
> closing tags on the <meta> and <link> elements. It appears the content
> import isn't checking my doctype declaration. HTML5 doesn't need to
> close the <meta> or <link> elements, whereas XHTML does.
> 9) Updated the HTML5 to close the meta and link tags to work around
> this to see if the error would go away. It did!
> 10) Created a new import targeting the updated HTML5 with closing tags.
> This time there were no errors, and 1 site was retrieved successfully
> (http://xapi.vocab.pub/datasets/adl/verbs)
> 11) Checked to see if the named graph and triples populated the
> database. Nothing there.
> SPARQL SELECT DISTINCT ?g WHERE {GRAPH ?g { ?s ?p ?o . }}
>
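
The XM003 error in steps 7 to 9 is what any strict XML parser reports when it is given HTML5 with unclosed void elements. The sketch below reproduces the effect with Python's standard library only; the sample markup is made up for illustration, and nothing here shows how Virtuoso's own parser is configured.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical HTML5 document: <meta> and <link> are void elements, left unclosed.
HTML5_DOC = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<link rel="stylesheet" href="style.css">
<title>Verbs</title>
</head>
<body><p>ok</p></body>
</html>"""

# Same document with the void elements self-closed, XHTML style.
XHTML_STYLE_DOC = (
    HTML5_DOC.replace('"utf-8">', '"utf-8" />').replace('.css">', '.css" />')
)


def parses_as_xml(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Lenient HTML parser that records every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


if __name__ == "__main__":
    print("HTML5 as XML:", parses_as_xml(HTML5_DOC))
    print("self-closed as XML:", parses_as_xml(XHTML_STYLE_DOC))
    collector = TagCollector()
    collector.feed(HTML5_DOC)
    print("HTML parser saw:", collector.tags)
```

The unclosed version fails strict XML parsing while an HTML-aware parser reads it without complaint, which matches the behaviour described in steps 8 and 9.
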
> Here are some strange things I noticed that could be causing issues.
> Not sure if anyone can explain what's happening here.
>
>   * Even though the content type is text/html and explicitly defined
> as such in the HTML metatag, the file is being stored in webdav as
> the "application/xhtml+xml" content type
>

That's fine. Virtuoso is doing that.

>   * Even though I assigned dba as 

Re: [Virtuoso-users] Not Giving up on HTML5/RDFa Import

2016-01-14 Thread Haag, Jason
Thanks for the response Kingsley. I will try some of these other tips you
provided.  I have a few questions for clarification:

1) When you say to use the live instance of URI burner, do you mean that
the import URI would be
http://linkeddata.uriburner.com/about/html/http/xapi.vocab.pub/datasets/adl/verbs
for the content import?
2) How would I compare URI burner results with my virtuoso instance when
nothing is being imported using the conductor interface? There is
nothing to compare.
3) Is there any chance you could try importing this URI on your instance or
a test instance to see if you get the same results? If not, perhaps it is
unique to my setup.
4) I haven't tried the DET folder via webdav before. Is this the latest
documentation on that approach?
http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtTipsAndTricksControlDefineGraphWithSpongeOption
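
For reference, the URL form asked about in question 1 can be derived mechanically from a target document URL. This is a sketch that simply mirrors the example URL quoted in that question; whether that is the form intended for the content import is exactly what the question asks, so treat the helper name and the mapping as assumptions.

```python
from urllib.parse import urlsplit

# Prefix taken from the example URL in question 1.
URIBURNER_ABOUT = "http://linkeddata.uriburner.com/about/html"


def uriburner_about_url(target: str) -> str:
    """Map a document URL to the /about/html/<scheme>/<host><path> form.

    Hypothetical helper: only scheme, host, and path are handled.
    """
    parts = urlsplit(target)
    return f"{URIBURNER_ABOUT}/{parts.scheme}/{parts.netloc}{parts.path}"


if __name__ == "__main__":
    print(uriburner_about_url("http://xapi.vocab.pub/datasets/adl/verbs"))
```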

Thank you.
