Hi Lewis,

I do agree with you, and Any23 does great work. I had assumed it could
handle ordinary XML-structured data as input, but it cannot. That is fine;
I may need to develop a new module to meet my requirement.

If anything I wrote gave the wrong impression of what I meant, I apologize.
I was only asking, sincerely, whether Any23 0.7 will support ordinary XML as
an input format. If not, that is fine, and I will look for another solution.

Thank you very much!

All the best!
armon.chen

On Friday, June 22, 2012 at 5:35 PM, Lewis John Mcgibbney wrote:
> Hi Armon,
>
> I think we need to clarify something here
>
> Any23 parsers extract structured data... the parsers DO NOT aim to
> extract unstructured text like some kind of 'traditional' parser.
> By structure we are not referring to markup as such but instead relate
> solely to semantic/structural relationships between concepts within
> some given data resource.
> Within the context of this thread, we refer (somewhat ambiguously) to
> resources as one of the following formats
>
> RDF/XML, Turtle, Notation 3, RDFa with RDFa1.1 prefix mechanism,
> Microformats: Adr, Geo, hCalendar, hCard, hListing, hResume, hReview,
> License, XFN and Species, HTML5 Microdata: (such as Schema.org
> (http://Schema.org)), CSV:
> Comma Separated Values with separator autodetection.
>
> Does this make sense?
>
> The Any23 parser is doing its job as it should.
>
> Lewis
>
> On Fri, Jun 22, 2012 at 10:26 AM, armon <[email protected] (mailto:[email protected])> wrote:
> > Hi Lewis,
> >
> > I even saved the XML data in a file and then ran the command: ./any23 rover @filepath
> > but it still didn't work. Finally, I created a simple XML data file to test,
> > and again nothing was retrieved, so I think maybe it is not a URL issue but
> > is related to the parser engine.
> >
> > Is Any23 0.7 coming soon, and will it meet my particular request? If so, I
> > will just get the latest 0.7 and test it again.
> >
> > Thanks for your reply.
> >
> > All the best!
> >
> > armon.chen
> >
> > On Friday, June 22, 2012 at 5:13 PM, Lewis John Mcgibbney wrote:
> >
> > > So I suppose there are a couple of options here.
> > >
> > > On Fri, Jun 22, 2012 at 10:02 AM, armon <[email protected] (mailto:[email protected])> wrote:
> > > >
> > > > but we know that there is some other data in the page that can't be
> > > > retrieved, such as the xml data (in the attachment of last email).
> > >
> > > Yes there is a good bit more content but the parsing implementations
> > > within Any23 do not aim to extract content strings... instead the
> > > project (parsing anyway) gains its strength from extracting triples
> > > and such like.
> > >
> > > You could quickly fire up a Nutch instance to gather content then use
> > > the basic-crawler from Any23 for triples... this is until we implement
> > > an Any23 parsing and indexing filter within Nutch which will provide a
> > > complete solution to your particular request.
> > >
> > > You could easily implement the above programmatically which would
> > > enable you to fetch page content as well as extract the triples from
> > > it separately.
>
> --
> Lewis
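
The distinction drawn in this thread, that RDF/XML is an accepted input while arbitrary XML is not, can be illustrated with a small check. The sketch below is not part of Any23 and does not reflect how Any23 detects formats; it is a rough heuristic using only the Python standard library, and the function name `is_rdf_xml` and both sample documents are made up for illustration. It only tests whether the root element is `rdf:RDF`, so it would miss RDF/XML documents that legitimately omit that wrapper.

```python
import xml.etree.ElementTree as ET

# Namespace URI that identifies RDF elements such as rdf:RDF.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def is_rdf_xml(text: str) -> bool:
    """Heuristic: True if the document parses as XML and its root is rdf:RDF."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Not well-formed XML at all.
        return False
    # ElementTree reports namespaced tags as "{namespace-uri}localname".
    return root.tag == f"{{{RDF_NS}}}RDF"


# Hypothetical sample: well-formed XML with no RDF semantics.
plain_xml = "<books><book><title>Dune</title></book></books>"

# Hypothetical sample: the same fact expressed as RDF/XML.
rdf_xml = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<rdf:Description rdf:about="http://example.org/book/1">'
    "<dc:title>Dune</dc:title>"
    "</rdf:Description></rdf:RDF>"
)

print(is_rdf_xml(plain_xml))
print(is_rdf_xml(rdf_xml))
```

A file that fails a check like this would need to be converted to one of the listed formats, or handled by a separate XML-processing step, before a triple extractor could be expected to return anything.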
