Hi Lewis,

Thanks for the answers!

So today I'm in the office, and I'm having problems getting Any23 to work via 
our proxy. Due to Any23 wrapping HttpClient (confusingly with its own 
'HTTPClient' class!), it doesn't seem so easy to configure the underlying HTTP 
library at all. In fact it also seems very awkward to configure even something 
as simple as the HTTP timeout. I tried passing a 'ModifiableConfiguration' 
instance to the Any23 constructor, but the Any23 code still uses a newly 
instantiated 'DefaultConfiguration' when my code calls 'runner.getHTTPClient()' 
(I'm finding this initialization code very confusing at the moment...).
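
To be clear about what I'm actually trying to set, here is a minimal sketch 
using nothing but the JDK's java.net classes. It is not Any23's 'HTTPClient' 
wrapper and not the HttpClient library's API. The class name, proxy host, port 
and timeout are placeholders I made up for illustration:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;

// Sketch only: plain JDK networking, NOT Any23's HTTPClient wrapper.
final class ProxySettingsSketch {

    // Placeholder values (assumptions), not real settings.
    static final String PROXY_HOST = "proxy.example.com";
    static final int PROXY_PORT = 8080;
    static final int TIMEOUT_MS = 10_000;

    // Describes an HTTP proxy; createUnresolved avoids a DNS lookup here.
    static Proxy httpProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP,
                InetSocketAddress.createUnresolved(host, port));
    }

    // Builds a connection object routed via the proxy with both timeouts set.
    // Nothing is sent over the network until connect() or getInputStream().
    static HttpURLConnection prepare(String url, Proxy proxy, int timeoutMs) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection(proxy);
            conn.setConnectTimeout(timeoutMs);
            conn.setReadTimeout(timeoutMs);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection conn = prepare("http://example.org/",
                httpProxy(PROXY_HOST, PROXY_PORT), TIMEOUT_MS);
        System.out.println("connect timeout (ms): " + conn.getConnectTimeout());
        System.out.println("read timeout (ms): " + conn.getReadTimeout());
    }
}
```

Those two knobs (a proxy host/port and a timeout) are all I need to be able to 
pass through Any23 to whatever HTTP library it uses underneath.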

I was also having problems building Any23 from source on my Windows machine 
(both v1.1 and v1.2-SNAPSHOT). The 'Any23 Core' project seems to have failing 
tests, so to get it to compile I have to run 'mvn clean install -DskipTests' 
(and then the 'Plugins :: Integration Test' project fails, see [1] below). So 
I finally have the Any23 source compiling on my machine (with no tests run!), 
and now I'm about to make code changes to allow me to configure Any23 properly, 
and to set HTTP proxy settings on the underlying HttpClient library...

So lots of teething problems, and therefore I haven't even gotten around to 
trying to get my code to use Rover instead of just extracting from a single 
website entry point.

I'll let you know how I get on...

Cheers,

Pat.


[1]  C:\Installs\Apache\Any23\Any23-1.2-SNAPSHOT>mvn clean install -DskipTests
:
:
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Any23 ....................................... SUCCESS [  4.761 s]
[INFO] Apache Any23 :: Base API ........................... SUCCESS [  1.576 s]
[INFO] Apache Any23 :: Test Resources ..................... SUCCESS [  0.811 s]
[INFO] Apache Any23 :: NQuads Parser and Writer ........... SUCCESS [  0.546 s]
[INFO] Apache Any23 :: CSV Utilities ...................... SUCCESS [  0.265 s]
[INFO] Apache Any23 :: Mime Type Detection ................ SUCCESS [  0.640 s]
[INFO] Apache Any23 :: Encoding Detection ................. SUCCESS [  0.327 s]
[INFO] Apache Any23 :: Core ............................... SUCCESS [  6.926 s]
[INFO] Apache Any23 :: Plugins :: Basic Crawler ........... SUCCESS [  3.174 s]
[INFO] Apache Any23 :: Plugins :: HTML Scraper ............ SUCCESS [  0.625 s]
[INFO] Apache Any23 :: Plugins :: Office Scraper .......... SUCCESS [  0.906 s]
[INFO] Apache Any23 :: Plugins :: Integration Test ........ FAILURE [01:20 min]
[INFO] Apache Any23 :: Service ............................ SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:41 min
[INFO] Finished at: 2015-04-30T15:16:10+01:00
[INFO] Final Memory: 53M/973M
[INFO] ------------------------------------------------------------------------

> -----Original Message-----
> From: Lewis John Mcgibbney [mailto:[email protected]]
> Sent: 29 April 2015 22:08
> To: [email protected]
> Subject: Re: How to use Rover from Java code?
> 
> Hi Pat,
> 
> On Wed, Apr 29, 2015 at 12:19 PM, <[email protected]>
> wrote:
> 
> > I’ve just started trying to use Any23 programmatically from Java, and
> > it looks great.
> >
> 
> Great. Did you get it working? I just saw your other thread.
> 
> 
> >
> >
> > The documentation has sample code [1], but that code doesn’t seem to
> > work properly for me (and has a small typo on line 1: ‘Apache’ appears
> > twice for some reason, and needs to be removed).
> >
> 
> Can you please log an issue at the issue tracker and document this? We will
> then go and fix it.
> https://issues.apache.org/jira/browse/ANY23
> 
> 
> > Hitting the example webpage (
> > http://www.rentalinrome.com/semanticloft/semanticloft.htm) from the
> > command-line using Rover works fine [2], but using the sample code
> > gives me no triples [3].
> >
> 
> Mmmm. OK, I have no idea. Which version of Any23 are you using?
> Have you tried it in a debugger? It is only around 10 lines of code.
> 
> 
> >
> >
> > So my questions are:
> >
> > 1.      Any idea why the sample code isn’t outputting any triples
> > for me?
> >
> 
> No. I tried this a while back. None of the code functionality of these APIs
> has changed in a while, so it is still OK.
> 
> 
> > 2.      The sample code won’t crawl from the webpage I provide. It just
> > scans that one page, right?
> >
> No. Essentially the HTTPDocumentSource simply wraps an HTTPClient and uses
> it to make the client request. So it doesn't crawl the 'site' as such, but
> does make a request for the page.
> 
> 
> > So I guess I need to use Rover somehow from my Java code - so is there
> > a code sample for crawling a website given just the entrypoint (e.g. ‘
> > http://schema.org’)? With code to show how to configure the MaxPages
> > and MaxDepth, too?
> >
> 
> Well, the Any23 source does have a crawler plugin which comes out of the box.
> Please navigate to the plugins directory if you are using the source code
> from GitHub or .git
> 
> > 3.      [BONUS QUESTION!] How come when I use Rover to hit ‘obvious’ markup
> > websites, like ‘google.com’ (5 triples), ‘Schema.org’ (2 triples) or ‘
> > bbc.co.uk’ (8 triples) I get so very few descriptive triples? I was
> > expecting lots of triples, with lots of links to further information, etc.
> > Shouldn’t these sites be exemplary examples of structured markup…?!
> >
> Well, it's a bit of a paradox when efforts such as schema.org don't typically
> embed much markup in their homepages... but occasionally this is the case!
> On the other hand, you can get much more content from articles themselves
> 
> lmcgibbn@LMC-032857 /usr/local/any23/core/target/appassembler(master) $
> ./bin/any23 rover -f ntriples http://www.bbc.com/news/world-europe-
> 32517447
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
> details.
> 
> ------------------------------------------------------------------------
> Apache Any23 :: rover
> ------------------------------------------------------------------------
> 
> [Fatal Error] :97:521: The entity name must immediately follow the '&' in the
> entity reference.
> _:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <
> http://schema.org/NewsArticle> .
> _:b0 <http://schema.org/datePublished> "2015-04-29T17:33:00+01:00"^^<
> http://schema.org/Date> .
> _:b0 <http://schema.org/description> "A Russian spacecraft delivering supplies
> to the International Space Station is out of control and will fall back to 
> Earth,
> unnamed officials say."^^< http://www.w3.org/2001/XMLSchema#string> .
> _:b0 <http://schema.org/headline> "Russian spacecraft Progress M-27M 'out of
> control'"^^<http://www.w3.org/2001/XMLSchema#string> .
> _:b0 <http://schema.org/image> <
> http://ichef.bbci.co.uk/news/560/media/images/82647000/jpg/_82647139_826
> 47137.jpg>
> .
> _:b0 <http://schema.org/publisher> _:b1 .
> _:b0 <http://schema.org/url> <http://www.bbc.com/news/world-europe-
> 32517447>
> .
> _:b1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <
> http://schema.org/Organization> .
> _:b1 <http://schema.org/logo> <
> http://www.bbc.co.uk/news/special/2015/newsspec_10857/bbc_news_logo.pn
> g?cb=1>
> .
> _:b1 <http://schema.org/name> "BBC News"^^<
> http://www.w3.org/2001/XMLSchema#string> .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://vocab.sindice.net/any23#CPS_AUDIENCE> "US"@en .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://vocab.sindice.net/any23#twitter:description> "A Russian spacecraft
> delivering supplies to the International Space Station is out of control and 
> will
> fall back to Earth, unnamed officials say."@en .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://vocab.sindice.net/any23#x-audience> "US"@en .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://vocab.sindice.net/any23#twitter:card> "summary_large_image"@en .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://vocab.sindice.net/any23#twitter:site> "@BBCWorld"@en .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://vocab.sindice.net/any23#apple-mobile-web-app-title> "BBC News"@en .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://vocab.sindice.net/any23#twitter:title> "Russian spacecraft Progress M-
> 27M 'out of control' - BBC News"@en .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://vocab.sindice.net/any23#x-country> "us"@en .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://vocab.sindice.net/any23#description> "A Russian spacecraft delivering
> supplies to the International Space Station is out of control and will fall 
> back to
> Earth, unnamed officials say."@en .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://vocab.sindice.net/any23#robots> "NOODP,NOYDIR"@en .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://vocab.sindice.net/any23#msapplication-TileImage> "
> http://static.bbci.co.uk/news/1.67.0293/windows-eight-icon-144x144.png"@en
> .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://vocab.sindice.net/any23#msapplication-TileColor> "#CC0101"@en .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://vocab.sindice.net/any23#twitter:creator> "@BBCWorld"@en .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://vocab.sindice.net/any23#twitter:image:src> "
> http://ichef.bbci.co.uk/news/560/media/images/82647000/jpg/_82647139_826
> 47137.jpg"@en
> .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://vocab.sindice.net/any23#application-name> "BBC News"@en .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://vocab.sindice.net/any23#twitter:domain> "www.bbc.com"@en .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://vocab.sindice.net/any23#mobile-web-app-capable> "yes"@en .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://vocab.sindice.net/any23#viewport> "width=device-width, initial-
> scale=1.0"@en .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://vocab.sindice.net/any23#theme-color> "#cc0101"@en .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://purl.org/dc/terms/title> "Russian spacecraft Progress M-27M 'out of
> control' - BBC News" .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://www.w3.org/1999/xhtml/vocab#alternate> <
> http://www.bbc.co.uk/news/world-europe-32517447> .
> <http://www.bbc.co.uk/news/world-europe-32517447> <
> http://www.w3.org/1999/xhtml/vocab#alternate> <
> http://www.bbc.com/news/world-europe-32517447> .
> <http://www.bbc.com/news/world-europe-32517447>
> <http://ogp.me/ns#title> "Russian spacecraft Progress M-27M 'out of control' -
> BBC News"@en .
> <http://www.bbc.com/news/world-europe-32517447>
> <http://ogp.me/ns#type> "article"@en .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://ogp.me/ns#description> "A Russian spacecraft delivering supplies to the
> International Space Station is out of control and will fall back to Earth, 
> unnamed
> officials say."@en .
> <http://www.bbc.com/news/world-europe-32517447>
> <http://ogp.me/ns#site_name> "BBC News"@en .
> <http://www.bbc.com/news/world-europe-32517447>
> <http://ogp.me/ns#locale> "en_GB"@en .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://ogp.me/ns#article:author> "BBC News"@en .
> <http://www.bbc.com/news/world-europe-32517447> <
> http://ogp.me/ns#article:section> "Europe"@en .
> <http://www.bbc.com/news/world-europe-32517447> <http://ogp.me/ns#url>
> "
> http://www.bbc.com/news/world-europe-32517447"@en .
> <http://www.bbc.com/news/world-europe-32517447>
> <http://ogp.me/ns#image> "
> http://ichef.bbci.co.uk/news/1024/media/images/82647000/jpg/_82647139_82
> 647137.jpg"@en
> .
> <http://static.bbci.co.uk/news/1.67.0293/apple-touch-icon.png> <
> http://www.w3.org/1999/xhtml/vocab#stylesheet> <
> http://static.bbci.co.uk/frameworks/barlesque/2.83.4/orb/4/style/orb.css> .
> 
> ------------------------------------------------------------------------
> Apache Any23 FAILURE
> 
> Execution terminated with errors: Error while parsing RDF document.
> 
> Total time: 4s
> Finished at: Wed Apr 29 14:00:55 PDT 2015 Final Memory: 126M/480M
> ------------------------------------------------------------------------
> 
> 
> > 4.      [FINAL QUESTION] Why does Any23 report [Fatal Error] so often,
> > but then seem to continue fine? See my Rover output below at [4] for
> > both Google and the BBC.
> >
> 
> We changed the behavior to be like this a while back so that if a particular
> extractor were to fail, the entire extraction task would not fail.
> Additionally, we have been trying to make Any23 as flexible as possible such
> that we a) extract very well from various markup standards, but also b) bake
> in some flexibility such that extractions don't fail entirely.
> Does this make sense?
> Thanks
> Lewis
