Thanks for the response Kingsley. I will try some of these other tips you
provided.  I have a few questions for clarification:

1) When you say to use the live instance of URI burner, do you mean that
the import URI would be
http://linkeddata.uriburner.com/about/html/http/xapi.vocab.pub/datasets/adl/verbs
for the content import?
2) How would I compare URI burner results with my virtuoso instance when
nothing is being imported using the conductor interface? There is
nothing to compare.
3) Is there any chance you could try importing this URI on your instance or
a test instance to see if you get the same results? If not, perhaps it is
unique to my setup.
4) I haven't tried the DET folder via webdav before. Is this the latest
documentation on that approach?
http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtTipsAndTricksControlDefineGraphWithSpongeOption
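
To make question 1 concrete, here is a small Python sketch of the URL
pattern I am asking about. It is only a sketch: the helper name is made up,
and the pattern (/about/html/<scheme>/<host><path>) is inferred from the one
example URL above, not taken from URI Burner documentation. Query strings
and fragments are deliberately ignored.

```python
from urllib.parse import urlsplit

# Base path assumed from the example URL in question 1.
URIBURNER_ABOUT = "http://linkeddata.uriburner.com/about/html"


def uriburner_about_url(target_url: str) -> str:
    """Build a URI Burner description-page URL for an absolute target URL.

    Hypothetical helper: it follows the pattern
    <base>/<scheme>/<host><path> seen in the example above.
    """
    parts = urlsplit(target_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {target_url!r}")
    return f"{URIBURNER_ABOUT}/{parts.scheme}/{parts.netloc}{parts.path}"


if __name__ == "__main__":
    print(uriburner_about_url("http://xapi.vocab.pub/datasets/adl/verbs"))
```

If the pattern is right, the output for my target URL should be the same as
the URL quoted in question 1.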

Thank you.

On Thu, Jan 14, 2016 at 12:46 PM, Kingsley Idehen <kide...@openlinksw.com>
wrote:

> On 1/14/16 11:17 AM, Haag, Jason wrote:
>
> Hi All,
>
> I'm back again evaluating Virtuoso for the HTML5/RDFa crawling capability.
> We are considering moving to the Universal Server from VOS if I can ever
> prove to my team that it will be a good choice for sponging and crawling
> HTML5/RDFa files.
>
>
> It most certainly is.
>
> I have been testing this feature periodically over the past several months
> with no luck. I appreciate the support and feedback so far, but I haven't
> made any progress. Some previous posts/inquiries I made on this topic are
> available here:
> http://sourceforge.net/p/virtuoso/mailman/message/34507072/ and here:
> http://sourceforge.net/p/virtuoso/mailman/virtuoso-users/thread/CAHjqjnLo7-hiA30neYBsbGm93HeXe%3DHrda5rZPGS%3Dwm%2B08ZvBw%40mail.gmail.com/#msg34525370
>
> I would really like to use the conductor interface to regularly schedule
> the import of several graph IRIs that contain RDFa and check the triples for
> any additions on a daily basis. I recently upgraded the installation to VOS
> 7.2.3 and still can't seem to get the RDFa data to populate the data store.
>
>
> Why don't you approach this matter as follows:
>
> [1] Use the live instance at http://linkeddata.uriburner.com to import
> your target data sources
> [2] Compare that with what's happening on your local instance.
>
> After I run the import from the queue, anytime I query the virtuoso database
> there is no data from my RDFa datasets that I have imported through
> conductor. I must be doing something wrong or missing an important step
> somewhere. However, if I use these same exact RDFa IRIs using the isql-v
> function (DB.DBA.RDF_LOAD_RDFA) the triples load successfully.
>
>
> Yes, so there is something amiss in your setup. Your import/crawl jobs
> should include directives for invoking the sponger cartridge for HTML docs.
>
>
> Here's a summary of what I've done and discovered so far:
>
> 1) Installed VOS 7.2.3 successfully
> 2) Read some of the newly updated documentation, which is excellent by the
> way
> 3) Checked/updated sponger privileges per this guidance for securing the
> endpoint:
> http://docs.openlinksw.com/virtuoso/rdfsparql.html#rdfsupportedprotocolendpointuri
> 4) Installed cartridges_dav.vad from commercial version (for sponger
> cartridges):
> http://opldownload.s3.amazonaws.com/uda/vad-packages/7.2/cartridges_dav.vad
> 5) Checked and configured xHTML /  aka HTML5 (and variants) cartridge
> under “extractor cartridges” with the following settings (per advice from
> the mailing list/forums):
>
> Pattern: (application/xhtml.xml)|(text|application)/.*(html|xml)
> fallback-mode=no
> rdfa=yes
> reify_html5md=1
> reify_rdfa=0
>
>
> For now (while you are troubleshooting), also use: reify_rdfa=1
>
> reify_jsonld=0
> reify_all_grddl=0
> passthrough_mode=yes
> loose=yes
> reify_html=0
> reify_html_misc=0
> reify_turtle=no
>
>
> I also tried this basic configuration as well:
> add-html-meta=no
> get-feeds=no
> rdfa=yes
> fallback-mode=no
> reify_html=no
> reify_html_misc=no
> reify_html5md=no
> reify_rdfa=no
> reify_jsonld=no
> reify_turtle=no
> reify_all_grddl=no
> passthrough_mode=no
> loose=no
>
> 6) Created a content import for the HTML5/RDFa document using conductor
> with the following options:
>
> Target URL: http://xapi.vocab.pub/datasets/adl/verbs
> login owner: dba
> checked the following
>
>    - store documents locally
>    - run sponger
>    - store metadata (selected xHTML aka HTML5 and variants)
>
>
> Go to: https://www.pinterest.com/pin/389561436498376210/ -- this shows
> your content via the lenses of our OSDS browser extension
> <http://osds.openlinksw.com/>.
>
>
> 7) Ran the import; 0/1 pages/sites were retrieved. I looked up the
> error, which was: "XM003: XML parser detected an error: ERROR : Tag nesting
> error: name 'head' of end tag does not match the name 'link' of start tag
> at line 19 column 108 at line 20 column 9 of source text </head> -------^"
> 8) This appears to be a validation error looking for closing tags of the
> <meta> and <link> elements. It appears the content import isn't checking my
> doctype declaration.  HTML5 doesn't need to close the <meta> or <link>
> elements whereas xhtml does.
> 9) Updated the HTML5 to close the meta and link tags to work around this
> to see if the error would go away. It did!
> 10) Created a new import targeting the updated HTML5 with closing tags.
> This time, no errors and 1 site was retrieved successfully
> (http://xapi.vocab.pub/datasets/adl/verbs)
> 11) Checked to see if the named graph and triples populated the database.
> Nothing there.
> SPARQL SELECT DISTINCT ?g WHERE {GRAPH ?g { ?s ?p ?o . }}
>
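The parser behaviour described in steps 7 to 9 can be reproduced outside
Virtuoso. The Python sketch below is only an illustration: it uses Python's
standard-library parsers, not Virtuoso's, and the sample markup is a
hypothetical stand-in for the real page. It shows that a strict XML parser
rejects HTML5 void elements left unclosed, accepts the same markup once they
are self-closed, and that an HTML parser reads the original without
complaint.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample (not the actual page): HTML5 head with unclosed
# <meta> and <link> void elements, which is valid HTML5 but not XML.
HTML5_DOC = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<link rel="stylesheet" href="style.css">
</head>
<body><p>ok</p></body>
</html>"""

# Same sample with the void elements self-closed, as in the step 9 workaround.
XHTML_STYLE_DOC = (
    HTML5_DOC.replace('"utf-8">', '"utf-8"/>').replace('.css">', '.css"/>')
)


def is_well_formed_xml(text: str) -> bool:
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Collect start-tag names using a lenient HTML parser."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_start_tags(text: str) -> list:
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print("HTML5 as XML:", is_well_formed_xml(HTML5_DOC))
    print("Self-closed as XML:", is_well_formed_xml(XHTML_STYLE_DOC))
    print("HTML parser tags:", html_start_tags(HTML5_DOC))
```

This would be consistent with the content import handing the document to an
XML parser regardless of the doctype, though that is a guess about the cause
rather than something confirmed here.
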
> Here are some strange things I noticed that could be causing issues. Not
> sure if anyone can explain what's happening here.
>
>    - Even though the content type is text/html and explicitly defined as
>    such in the HTML metatag, the file is being stored in webdav as the
>    "application/xhtml+xml" content type
>
>
> That's fine. Virtuoso is doing that.
>
>
>    - Even though I assigned dba as content owner, it is assigning dav as
>    content owner
>
>
> Yes, since 'dav' is the super-user in the Web Content Storage aspect of
> Virtuoso.
>
>
>    - After the import queue is run, two files are created and stored in
>    DAV/home/dba/rdf_sink even though I select the option to store a single
>    file: (verbs and urn_dav_home_dba_rdf_sink.RDF). If I access the verbs file
>    in webdav it renders the html that was imported. If I click
>    the urn_dav_home_dba_rdf_sink.RDF it is not available. Note: the verbs file
>    is being stored as "application/xhtml+xml" content type and
>    the urn_dav_home_dba_rdf_sink.RDF is being stored as text/xml in webdav.
>
>
> These folders shouldn't have anything to do with your import job,
> certainly not at this stage.
>
>
> After all of this I decided to check and see if I could load the
> HTML5/RDFa document using isql-v:
>
> SQL> DB.DBA.RDF_LOAD_RDFA (
>   http_get('http://xapi.vocab.pub/datasets/adl/verbs/'),
>   'http://xapi.vocab.pub/datasets/adl/verbs/#',
>   'http://xapi.vocab.pub/datasets/adl/verbs');
>
> This worked and the graph and triples are in the database. However, for
> team collaboration it would be helpful for others to see the stored imports
> and crawler jobs in the conductor interface rather than strictly relying on
> isql-v to populate the data store.
>
>
> Yes.
>
> You can even make a DET folder type that's mapped to a target named graph
> with the option to invoke the HTML sponger cartridge. Net effect: a
> so-called Data Lake of documents (variety of content formats) imported from
> the Web (or any internal HTTP network) and also passed through the sponger,
> which deposits output into a designated named graph.
>
> You can access these files via your browser or any WebDAV client (i.e.,
> mount to any native OS via its WebDAV support). You can share URIs via
> copy & paste or the "share feature" that exists in your browser etc.
>
> Am I doing something wrong in trying to get HTML5/RDFa content to import
> using conductor? I feel like I might be missing an important step that is
> preventing it from working. Thanks in advance.
>
>
> Somewhere something has gone wrong or this is an undiscovered VOS edition
> quirk.
>
> We are leaning towards removing these features from VOS as it's best suited
> as a dedicated store for data represented as SQL Tables or RDF Property
> Graphs.  Thus, you are really going to be much better off using the
> commercial edition.
>
>
> Kingsley
>
>
> Regards,
>
> J Haag
>
>
>
>
>
>
>
> _______________________________________________
> Virtuoso-users mailing list
> Virtuoso-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/virtuoso-users
>
>
>
> --
> Regards,
>
> Kingsley Idehen       
> Founder & CEO
> OpenLink Software
> Company Web: http://www.openlinksw.com
> Personal Weblog 1: http://kidehen.blogspot.com
> Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
> Twitter Profile: https://twitter.com/kidehen
> Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
> LinkedIn Profile: http://www.linkedin.com/in/kidehen
> Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this
>
>