Hi Pat,

On Wed, Apr 29, 2015 at 12:19 PM, <[email protected]> wrote:

> I’ve just started trying to use Any23 programmatically from Java, and it
> looks great.

Great. Did you get it working? I just saw your other thread.

> The documentation has sample code [1], but that code doesn’t seem to work
> properly for me (and has a small typo on line 1 (‘Apache’ appears twice for
> some reason, and needs to be removed).

Can you please log an issue in the issue tracker and document this? We will then go and fix it.
https://issues.apache.org/jira/browse/ANY23

> Hitting the example webpage (
> http://www.rentalinrome.com/semanticloft/semanticloft.htm) from the
> command-line using Rover works fine [2], but using the sample code gives me
> no triples [3].

Mmmm. OK, I have no idea. Which version of Any23 are you using? Have you tried stepping through it in a debugger? It is only around 10 lines of code.

> So my questions are:
>
> 1. Any idea why isn’t the sample code isn’t outputting any triples
> for me?

No. I tried this a while back. None of the functionality behind these APIs has changed in a while, so it should still be OK.

> 2. The sample code won’t crawl from the webpage I provide. It just
> scans that one page, right?.

No, it doesn't crawl. Essentially, HTTPDocumentSource simply wraps an HTTPClient and uses it to make the client request. So it doesn't crawl the 'site' as such; it only requests the one page you give it.

> So I guess I need to use Rover somehow from my Java code - so is there a
> code sample for crawling a website given just the entrypoint (e.g.
> ‘http://schema.org’)? With code to show how to configure the MaxPages and
> MaxDepth, too?

Well, the Any23 source does have a crawler plugin which comes out of the box. Please navigate to the plugins directory if you are using the source code from GitHub or .git

> 3. [BONUS QUESTION!] How come when I use Rover to hit ‘obvious’ markup
> websites, like ‘google.com’ (5 triples), ‘Schema.org’ (2 triples) or
> ‘bbc.co.uk’ (8 triples) I get so very few descriptive triples?
> I was expecting lots of triples, with lots of links to further information, etc.
> Shouldn’t these sites by exemplary examples of structured markup…?!.

Well, it's a bit of a paradox that efforts such as schema.org don't embed much markup in their own homepages, but that is sometimes the case! On the other hand, you can get much more content from the articles themselves:

lmcgibbn@LMC-032857 /usr/local/any23/core/target/appassembler(master) $ ./bin/any23 rover -f ntriples http://www.bbc.com/news/world-europe-32517447
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
------------------------------------------------------------------------
Apache Any23 :: rover
------------------------------------------------------------------------
[Fatal Error] :97:521: The entity name must immediately follow the '&' in the entity reference.
_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/NewsArticle> .
_:b0 <http://schema.org/datePublished> "2015-04-29T17:33:00+01:00"^^<http://schema.org/Date> .
_:b0 <http://schema.org/description> "A Russian spacecraft delivering supplies to the International Space Station is out of control and will fall back to Earth, unnamed officials say."^^<http://www.w3.org/2001/XMLSchema#string> .
_:b0 <http://schema.org/headline> "Russian spacecraft Progress M-27M 'out of control'"^^<http://www.w3.org/2001/XMLSchema#string> .
_:b0 <http://schema.org/image> <http://ichef.bbci.co.uk/news/560/media/images/82647000/jpg/_82647139_82647137.jpg> .
_:b0 <http://schema.org/publisher> _:b1 .
_:b0 <http://schema.org/url> <http://www.bbc.com/news/world-europe-32517447> .
_:b1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Organization> .
_:b1 <http://schema.org/logo> <http://www.bbc.co.uk/news/special/2015/newsspec_10857/bbc_news_logo.png?cb=1> .
_:b1 <http://schema.org/name> "BBC News"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://www.bbc.com/news/world-europe-32517447> <http://vocab.sindice.net/any23#CPS_AUDIENCE> "US"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://vocab.sindice.net/any23#twitter:description> "A Russian spacecraft delivering supplies to the International Space Station is out of control and will fall back to Earth, unnamed officials say."@en .
<http://www.bbc.com/news/world-europe-32517447> <http://vocab.sindice.net/any23#x-audience> "US"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://vocab.sindice.net/any23#twitter:card> "summary_large_image"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://vocab.sindice.net/any23#twitter:site> "@BBCWorld"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://vocab.sindice.net/any23#apple-mobile-web-app-title> "BBC News"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://vocab.sindice.net/any23#twitter:title> "Russian spacecraft Progress M-27M 'out of control' - BBC News"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://vocab.sindice.net/any23#x-country> "us"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://vocab.sindice.net/any23#description> "A Russian spacecraft delivering supplies to the International Space Station is out of control and will fall back to Earth, unnamed officials say."@en .
<http://www.bbc.com/news/world-europe-32517447> <http://vocab.sindice.net/any23#robots> "NOODP,NOYDIR"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://vocab.sindice.net/any23#msapplication-TileImage> "http://static.bbci.co.uk/news/1.67.0293/windows-eight-icon-144x144.png"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://vocab.sindice.net/any23#msapplication-TileColor> "#CC0101"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://vocab.sindice.net/any23#twitter:creator> "@BBCWorld"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://vocab.sindice.net/any23#twitter:image:src> "http://ichef.bbci.co.uk/news/560/media/images/82647000/jpg/_82647139_82647137.jpg"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://vocab.sindice.net/any23#application-name> "BBC News"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://vocab.sindice.net/any23#twitter:domain> "www.bbc.com"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://vocab.sindice.net/any23#mobile-web-app-capable> "yes"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://vocab.sindice.net/any23#viewport> "width=device-width, initial-scale=1.0"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://vocab.sindice.net/any23#theme-color> "#cc0101"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://purl.org/dc/terms/title> "Russian spacecraft Progress M-27M 'out of control' - BBC News" .
<http://www.bbc.com/news/world-europe-32517447> <http://www.w3.org/1999/xhtml/vocab#alternate> <http://www.bbc.co.uk/news/world-europe-32517447> .
<http://www.bbc.co.uk/news/world-europe-32517447> <http://www.w3.org/1999/xhtml/vocab#alternate> <http://www.bbc.com/news/world-europe-32517447> .
<http://www.bbc.com/news/world-europe-32517447> <http://ogp.me/ns#title> "Russian spacecraft Progress M-27M 'out of control' - BBC News"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://ogp.me/ns#type> "article"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://ogp.me/ns#description> "A Russian spacecraft delivering supplies to the International Space Station is out of control and will fall back to Earth, unnamed officials say."@en .
<http://www.bbc.com/news/world-europe-32517447> <http://ogp.me/ns#site_name> "BBC News"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://ogp.me/ns#locale> "en_GB"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://ogp.me/ns#article:author> "BBC News"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://ogp.me/ns#article:section> "Europe"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://ogp.me/ns#url> "http://www.bbc.com/news/world-europe-32517447"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://ogp.me/ns#image> "http://ichef.bbci.co.uk/news/1024/media/images/82647000/jpg/_82647139_82647137.jpg"@en .
<http://static.bbci.co.uk/news/1.67.0293/apple-touch-icon.png> <http://www.w3.org/1999/xhtml/vocab#stylesheet> <http://static.bbci.co.uk/frameworks/barlesque/2.83.4/orb/4/style/orb.css> .
------------------------------------------------------------------------
Apache Any23 FAILURE
Execution terminated with errors: Error while parsing RDF document.
Total time: 4s
Finished at: Wed Apr 29 14:00:55 PDT 2015
Final Memory: 126M/480M
------------------------------------------------------------------------

> 4. [FINAL QUESTION] Why does Any23 report [Fatal Error] so often,
> but then seem to continue fine? See my Rover output below at [4] for both
> Google and the BBC.

We changed the behavior to be like this a while back, so that if a particular extractor fails, the entire extraction task does not fail. Additionally, we have been trying to make Any23 as flexible as possible, such that we a) extract very well from the various markup standards, but also b) bake in enough tolerance that extractions don't fail entirely.

Does this make sense?

Thanks
Lewis
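
P.S. In case it helps with question 4, here is a minimal sketch of the general pattern of running several extractors so that one failure does not abort the others. It is only an illustration under my own assumptions and is not the actual Any23 implementation: the class TolerantExtraction, the Report type, the runAll method and the extractor names are all hypothetical, and each "extractor" is just a plain Java Supplier that returns N-Triples lines.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names come from Any23.
public class TolerantExtraction {

    /** Triples from the extractors that succeeded, plus one message per extractor that failed. */
    public static final class Report {
        public final List<String> triples = new ArrayList<>();
        public final List<String> errors = new ArrayList<>();
    }

    /** Runs every extractor in order; a failure in one is recorded and the rest still run. */
    public static Report runAll(Map<String, Supplier<List<String>>> extractors) {
        Report report = new Report();
        for (Map.Entry<String, Supplier<List<String>>> entry : extractors.entrySet()) {
            try {
                report.triples.addAll(entry.getValue().get());
            } catch (RuntimeException ex) {
                // Record the failure instead of rethrowing, so later extractors are not skipped.
                report.errors.add("[Fatal Error] " + entry.getKey() + ": " + ex.getMessage());
            }
        }
        return report;
    }

    public static void main(String[] args) {
        Map<String, Supplier<List<String>>> extractors = new LinkedHashMap<>();
        // Hypothetical extractor that chokes on malformed input.
        extractors.put("extractor-a", () -> {
            throw new IllegalStateException("malformed entity reference");
        });
        // Hypothetical extractor that succeeds and yields one triple.
        extractors.put("extractor-b", () -> Collections.singletonList(
                "_:b0 <http://schema.org/headline> \"example\" ."));

        Report report = runAll(extractors);
        report.errors.forEach(System.err::println);
        report.triples.forEach(System.out::println);
    }
}
```

With this shape the caller still sees every error, but also keeps whatever triples the healthy extractors produced, which is roughly what you are seeing in the Rover output above.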
