Hi Armon,

I think we need to clarify something here.

Any23 parsers extract structured data. They do NOT aim to extract
unstructured text the way a 'traditional' parser does. By structure we do
not mean markup as such, but solely the semantic/structural relationships
between concepts within a given data resource. Within the context of this
thread, we refer (somewhat ambiguously) to resources in one of the
following formats:

 - RDF/XML, Turtle, Notation 3
 - RDFa with the RDFa 1.1 prefix mechanism
 - Microformats: Adr, Geo, hCalendar, hCard, hListing, hResume, hReview,
   License, XFN and Species
 - HTML5 Microdata (such as Schema.org)
 - CSV: Comma Separated Values with separator autodetection

Does this make sense? The Any23 parser is doing its job as it should: a
plain XML file is not one of these formats, so nothing is extracted from it.

Lewis

On Fri, Jun 22, 2012 at 10:26 AM, armon <[email protected]> wrote:
> Hi Lewis,
>
> I even as the xml data in a file, and then command: ./any23 rover @filepath
> ,but it still can't work, finally,I create a simply xml data file to test,
> again nothing retrieved, so I think maybe it is not the url issue, but
> related with parser engine.
>
> Is the any23 0.7 coming, will it meet my particular request? If so, then I
> just get the latest 0.7 and test it again.
>
> thanks for your reply.
>
> All the best!
>
> armon.chen
>
>
>
> On Friday, June 22, 2012 at 5:13 PM, Lewis John Mcgibbney wrote:
>
>> So I suppose there are a couple of options here.
>>
>> On Fri, Jun 22, 2012 at 10:02 AM, armon <[email protected]> wrote:
>> >
>> > but we know that there is some other data in the page that can't be
>> > retrieved, such as the xml data (in the attachment of last email).
>>
>> Yes there is a good bit more content but the parsing implementations
>> within Any23 do not aim to extract content strings... instead the
>> project (parsing anyway) gains its strength from extracting triples
>> and such like.
>>
>> You could quickly fire up a Nutch instance to gather content then use
>> the basic-crawler from Any23 for triples... this is until we implement
>> an Any23 parsing and indexing filter within Nutch which will provide a
>> complete solution to your particular request.
>>
>> You could easily implement the above programmatically which would
>> enable you to fetch page content as well as extract the triples from
>> it separately.
>

--
Lewis
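
To illustrate the distinction between plain XML and a structured format
such as RDF/XML, here is a toy sketch in Python. It is only an
illustration and is not how Any23 is implemented: the function name
simple_triples, the two sample documents, and the example.org URL are all
made up for this example, and it handles only the simplest form of RDF/XML
(rdf:Description elements with literal-valued properties).

```python
import xml.etree.ElementTree as ET

# Namespace that marks a document as RDF/XML.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def simple_triples(xml_text):
    """Toy extractor (hypothetical, not Any23 code).

    Returns (subject, predicate, object) tuples for rdf:Description
    elements whose children carry literal values. Returns an empty list
    when the root element is not rdf:RDF, i.e. for plain XML.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "{%s}RDF" % RDF_NS:
        return []
    triples = []
    for desc in root.findall("{%s}Description" % RDF_NS):
        subject = desc.get("{%s}about" % RDF_NS)
        for prop in desc:
            # ElementTree writes qualified names as "{namespace}local";
            # dropping the braces gives the full predicate URI.
            predicate = prop.tag.replace("{", "").replace("}", "")
            triples.append((subject, predicate, (prop.text or "").strip()))
    return triples


# Plain XML: well-formed, but it states no subject/predicate/object.
plain_xml = "<catalog><book><title>Any23</title></book></catalog>"

# RDF/XML: the same information expressed as a statement about a resource.
rdf_xml = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<rdf:Description rdf:about="http://example.org/book">'
    "<dc:title>Any23</dc:title>"
    "</rdf:Description>"
    "</rdf:RDF>"
)

print(simple_triples(plain_xml))
print(simple_triples(rdf_xml))
```

The first call returns an empty list, which mirrors the "nothing
retrieved" result on a plain XML file; the second returns one triple for
the title of the example resource.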
