Hi Pat,

On Wed, Apr 29, 2015 at 12:19 PM, <[email protected]> wrote:

> I’ve just started trying to use Any23 programmatically from Java, and it
> looks great.
>

Great. Did you get it working? I just saw your other thread.


>
>
> The documentation has sample code [1], but that code doesn’t seem to work
> properly for me (and has a small typo on line 1 (‘Apache’ appears twice for
> some reason, and needs to be removed).
>

Can you please log an issue at the issue tracker and document this? We will
then go and fix it.
https://issues.apache.org/jira/browse/ANY23


> Hitting the example webpage (
> http://www.rentalinrome.com/semanticloft/semanticloft.htm) from the
> command-line using Rover works fine [2], but using the sample code gives me
> no triples [3].
>

Mmmm. OK, I have no idea. Which version of Any23 are you using?
Have you tried it in a debugger? It is only around 10 lines of code.


>
>
> So my questions are:
>
> 1.      Any idea why isn’t the sample code isn’t outputting any triples
> for me?
>

No. I tried this a while back. None of the functionality behind these
APIs has changed in a while, so it should still work.


> 2.      The sample code won’t crawl from the webpage I provide. It just
> scans that one page, right?.
>
No, it does not crawl. Essentially, HTTPDocumentSource simply wraps an
HTTPClient and uses it to make the request. So it doesn't crawl the 'site'
as such, but it does make a request for the one page you give it.


> So I guess I need to use Rover somehow from my Java code - so is there a
> code sample for crawling a website given just the entrypoint (e.g. ‘
> http://schema.org’)? With code to show how to configure the MaxPages and
> MaxDepth, too?
>

Well, the Any23 source does have a crawler plugin which comes out of the
box. Please navigate to the plugins directory if you are using the source
code from GitHub or a Git checkout.
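
For what it's worth, the MaxPages/MaxDepth idea itself is easy to sketch
independently of the plugin. The following is only a generic illustration
in plain Java, not the crawler plugin's API: the class name, the crawl
method and the in-memory link map (which stands in for fetching a page and
collecting its outgoing links) are all made up for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a breadth-first walk bounded by page count and depth.
class BoundedCrawlSketch {

    // A URL paired with its link distance from the entry point.
    private static final class Visit {
        final String url;
        final int depth;

        Visit(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }
    }

    // Visits at most maxPages pages and follows links at most maxDepth hops
    // away from the seed. The 'links' map is an assumption standing in for
    // "fetch the page and collect its outgoing links".
    static List<String> crawl(String seed, Map<String, List<String>> links,
                              int maxPages, int maxDepth) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<Visit> queue = new ArrayDeque<>();
        queue.add(new Visit(seed, 0));
        seen.add(seed);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            Visit current = queue.poll();
            // A real crawler would run the extraction on this page here.
            visited.add(current.url);
            if (current.depth >= maxDepth) {
                continue; // do not follow links beyond the depth limit
            }
            for (String next : links.getOrDefault(current.url, List.of())) {
                if (seen.add(next)) { // skip pages that are already queued
                    queue.add(new Visit(next, current.depth + 1));
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "http://example.org/",
            List.of("http://example.org/a", "http://example.org/b"),
            "http://example.org/a",
            List.of("http://example.org/c"));
        // With maxDepth = 1, /c (two hops from the seed) is never visited.
        System.out.println(crawl("http://example.org/", links, 10, 1));
    }
}
```

In a real setup the visit step would hand each page to Any23 for
extraction; the two limits only decide which pages get that far.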

> 3.      [BONUS QUESTION!] How come when I use Rover to hit ‘obvious’ markup
> websites, like ’google.com’ (5 triples), ‘Schema.org’ (2 triples) or ‘
> bbc.co.uk’ (8 triples) I get so very few descriptive triples? I was
> expecting lots of triples, with lots of links to further information, etc.
> Shouldn’t these sites by exemplary examples of structured markup…?!.
>
Well, it's a bit of a paradox that efforts such as schema.org don't
typically embed much markup in their homepages... but that is often the
case!
On the other hand, you can get much more content from the articles
themselves:

lmcgibbn@LMC-032857 /usr/local/any23/core/target/appassembler(master) $
./bin/any23 rover -f ntriples http://www.bbc.com/news/world-europe-32517447
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
details.

------------------------------------------------------------------------
Apache Any23 :: rover
------------------------------------------------------------------------

[Fatal Error] :97:521: The entity name must immediately follow the '&' in
the entity reference.
_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <
http://schema.org/NewsArticle> .
_:b0 <http://schema.org/datePublished> "2015-04-29T17:33:00+01:00"^^<
http://schema.org/Date> .
_:b0 <http://schema.org/description> "A Russian spacecraft delivering
supplies to the International Space Station is out of control and will fall
back to Earth, unnamed officials say."^^<
http://www.w3.org/2001/XMLSchema#string> .
_:b0 <http://schema.org/headline> "Russian spacecraft Progress M-27M 'out
of control'"^^<http://www.w3.org/2001/XMLSchema#string> .
_:b0 <http://schema.org/image> <
http://ichef.bbci.co.uk/news/560/media/images/82647000/jpg/_82647139_82647137.jpg>
.
_:b0 <http://schema.org/publisher> _:b1 .
_:b0 <http://schema.org/url> <http://www.bbc.com/news/world-europe-32517447>
.
_:b1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <
http://schema.org/Organization> .
_:b1 <http://schema.org/logo> <
http://www.bbc.co.uk/news/special/2015/newsspec_10857/bbc_news_logo.png?cb=1>
.
_:b1 <http://schema.org/name> "BBC News"^^<
http://www.w3.org/2001/XMLSchema#string> .
<http://www.bbc.com/news/world-europe-32517447> <
http://vocab.sindice.net/any23#CPS_AUDIENCE> "US"@en .
<http://www.bbc.com/news/world-europe-32517447> <
http://vocab.sindice.net/any23#twitter:description> "A Russian spacecraft
delivering supplies to the International Space Station is out of control
and will fall back to Earth, unnamed officials say."@en .
<http://www.bbc.com/news/world-europe-32517447> <
http://vocab.sindice.net/any23#x-audience> "US"@en .
<http://www.bbc.com/news/world-europe-32517447> <
http://vocab.sindice.net/any23#twitter:card> "summary_large_image"@en .
<http://www.bbc.com/news/world-europe-32517447> <
http://vocab.sindice.net/any23#twitter:site> "@BBCWorld"@en .
<http://www.bbc.com/news/world-europe-32517447> <
http://vocab.sindice.net/any23#apple-mobile-web-app-title> "BBC News"@en .
<http://www.bbc.com/news/world-europe-32517447> <
http://vocab.sindice.net/any23#twitter:title> "Russian spacecraft Progress
M-27M 'out of control' - BBC News"@en .
<http://www.bbc.com/news/world-europe-32517447> <
http://vocab.sindice.net/any23#x-country> "us"@en .
<http://www.bbc.com/news/world-europe-32517447> <
http://vocab.sindice.net/any23#description> "A Russian spacecraft
delivering supplies to the International Space Station is out of control
and will fall back to Earth, unnamed officials say."@en .
<http://www.bbc.com/news/world-europe-32517447> <
http://vocab.sindice.net/any23#robots> "NOODP,NOYDIR"@en .
<http://www.bbc.com/news/world-europe-32517447> <
http://vocab.sindice.net/any23#msapplication-TileImage> "
http://static.bbci.co.uk/news/1.67.0293/windows-eight-icon-144x144.png"@en .
<http://www.bbc.com/news/world-europe-32517447> <
http://vocab.sindice.net/any23#msapplication-TileColor> "#CC0101"@en .
<http://www.bbc.com/news/world-europe-32517447> <
http://vocab.sindice.net/any23#twitter:creator> "@BBCWorld"@en .
<http://www.bbc.com/news/world-europe-32517447> <
http://vocab.sindice.net/any23#twitter:image:src> "
http://ichef.bbci.co.uk/news/560/media/images/82647000/jpg/_82647139_82647137.jpg"@en
.
<http://www.bbc.com/news/world-europe-32517447> <
http://vocab.sindice.net/any23#application-name> "BBC News"@en .
<http://www.bbc.com/news/world-europe-32517447> <
http://vocab.sindice.net/any23#twitter:domain> "www.bbc.com"@en .
<http://www.bbc.com/news/world-europe-32517447> <
http://vocab.sindice.net/any23#mobile-web-app-capable> "yes"@en .
<http://www.bbc.com/news/world-europe-32517447> <
http://vocab.sindice.net/any23#viewport> "width=device-width,
initial-scale=1.0"@en .
<http://www.bbc.com/news/world-europe-32517447> <
http://vocab.sindice.net/any23#theme-color> "#cc0101"@en .
<http://www.bbc.com/news/world-europe-32517447> <
http://purl.org/dc/terms/title> "Russian spacecraft Progress M-27M 'out of
control' - BBC News" .
<http://www.bbc.com/news/world-europe-32517447> <
http://www.w3.org/1999/xhtml/vocab#alternate> <
http://www.bbc.co.uk/news/world-europe-32517447> .
<http://www.bbc.co.uk/news/world-europe-32517447> <
http://www.w3.org/1999/xhtml/vocab#alternate> <
http://www.bbc.com/news/world-europe-32517447> .
<http://www.bbc.com/news/world-europe-32517447> <http://ogp.me/ns#title>
"Russian spacecraft Progress M-27M 'out of control' - BBC News"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://ogp.me/ns#type>
"article"@en .
<http://www.bbc.com/news/world-europe-32517447> <
http://ogp.me/ns#description> "A Russian spacecraft delivering supplies to
the International Space Station is out of control and will fall back to
Earth, unnamed officials say."@en .
<http://www.bbc.com/news/world-europe-32517447> <http://ogp.me/ns#site_name>
"BBC News"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://ogp.me/ns#locale>
"en_GB"@en .
<http://www.bbc.com/news/world-europe-32517447> <
http://ogp.me/ns#article:author> "BBC News"@en .
<http://www.bbc.com/news/world-europe-32517447> <
http://ogp.me/ns#article:section> "Europe"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://ogp.me/ns#url> "
http://www.bbc.com/news/world-europe-32517447"@en .
<http://www.bbc.com/news/world-europe-32517447> <http://ogp.me/ns#image> "
http://ichef.bbci.co.uk/news/1024/media/images/82647000/jpg/_82647139_82647137.jpg"@en
.
<http://static.bbci.co.uk/news/1.67.0293/apple-touch-icon.png> <
http://www.w3.org/1999/xhtml/vocab#stylesheet> <
http://static.bbci.co.uk/frameworks/barlesque/2.83.4/orb/4/style/orb.css> .

------------------------------------------------------------------------
Apache Any23 FAILURE

Execution terminated with errors: Error while parsing RDF document.

Total time: 4s
Finished at: Wed Apr 29 14:00:55 PDT 2015
Final Memory: 126M/480M
------------------------------------------------------------------------


> 4.      [FINAL QUESTION] Why does Any23 report [Fatal Error] so often,
> but then seem to continue fine? See my Rover output below at [4] for both
> Google and the BBC.
>

We changed the behavior to be like this a while back, so that if a
particular extractor fails, the entire extraction task does not fail.
Additionally, we have been trying to make Any23 as flexible as possible,
such that we a) extract very well from the various markup standards, but
also b) bake in some tolerance so that extractions don't fail entirely.
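
As a rough illustration of that idea (this is not the actual Any23 code;
the class, the method and the two toy extractors are invented for the
example), each extractor runs in its own try/catch, so one failure is
recorded while the others still contribute their triples:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of per-extractor fault isolation.
class TolerantExtractionSketch {

    // Collects whatever was extracted plus a note for each failed extractor.
    static final class Result {
        final List<String> triples = new ArrayList<>();
        final List<String> errors = new ArrayList<>();
    }

    // Runs every extractor against the document. A failing extractor is
    // logged in 'errors' and skipped; it does not abort the whole task.
    static Result extractAll(String document,
            Map<String, Function<String, List<String>>> extractors) {
        Result result = new Result();
        for (Map.Entry<String, Function<String, List<String>>> entry
                : extractors.entrySet()) {
            try {
                result.triples.addAll(entry.getValue().apply(document));
            } catch (RuntimeException ex) {
                result.errors.add(entry.getKey() + ": " + ex.getMessage());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Function<String, List<String>>> extractors =
            new LinkedHashMap<>();
        // Toy extractor that always fails, like a strict parser on bad markup.
        extractors.put("strict-xml", doc -> {
            throw new IllegalStateException("malformed entity reference");
        });
        // Toy extractor that succeeds and emits one made-up triple.
        extractors.put("meta-tags",
            doc -> List.of("<page> <title> \"" + doc + "\" ."));

        Result result = extractAll("BBC News", extractors);
        System.out.println("triples: " + result.triples);
        System.out.println("errors:  " + result.errors);
    }
}
```

That mirrors what you see in the Rover output: an error is reported by
one parser, and the triples from the other extractors are still written.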
Does this make sense?
Thanks
Lewis
