So I suppose there are a couple of options here.

On Fri, Jun 22, 2012 at 10:02 AM, armon <[email protected]> wrote:
>
> but we know that there is some other data in the page that can't be
> retrieved, such as the xml data (in the attachment of last email).

Yes, there is a good bit more content in the page, but the parsing implementations within Any23 do not aim to extract content strings. Instead, the project (the parsing side anyway) gets its strength from extracting triples and the like. You could quickly fire up a Nutch instance to gather the content, then use the basic-crawler from Any23 for the triples. That is a stopgap until we implement an Any23 parsing and indexing filter within Nutch, which will provide a complete solution to your particular request. Alternatively, you could easily implement the above programmatically, which would let you fetch the page content and extract the triples from it as two separate steps.
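
To make the programmatic option a bit more concrete, here is a rough sketch of the shape I have in mind. It is in Python purely for brevity (Any23 itself is Java), and it is an illustration only: `fetch`, `extract_text`, `process_page` and the `triple_extractor` argument are all names I made up for this example. `triple_extractor` is a hypothetical stand-in for whatever you use to call Any23; nothing here is Any23's or Nutch's actual API.

```python
from html.parser import HTMLParser
from typing import Callable, Iterable, List, Tuple
from urllib.request import urlopen

# A triple is just (subject, predicate, object) for this sketch.
Triple = Tuple[str, str, str]


class _TextCollector(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Content step: return the page's visible text as one string."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.chunks)


def fetch(url: str) -> str:
    """Fetch step: download the page once and decode it."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def process_page(
    html: str,
    triple_extractor: Callable[[str], Iterable[Triple]],
) -> Tuple[str, List[Triple]]:
    """Run both steps over the same fetched document.

    triple_extractor is a hypothetical hook: plug in whatever wraps
    your Any23 call. It is NOT part of Any23 itself.
    """
    return extract_text(html), list(triple_extractor(html))
```

You would call `fetch(url)` once and pass the result to `process_page`, so the page is downloaded a single time and the content and the triples come out separately. The text extraction here is deliberately naive; in practice Nutch would do that part far better.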
