Re: about the supported input format of any23

Lewis John Mcgibbney Fri, 22 Jun 2012 02:13:46 -0700

So I suppose there are a couple of options here.

On Fri, Jun 22, 2012 at 10:02 AM, armon <[email protected]> wrote:
>
>  but we know that there is some other data in the page that can't be 
> retrieved, such as the xml data (in the attachment of last email).


Yes, there is a good bit more content, but the parsing implementations
within Any23 do not aim to extract content strings. Instead, the
project (the parsing side anyway) gains its strength from extracting
triples and the like.

You could quickly fire up a Nutch instance to gather content, then use
the basic-crawler from Any23 for triples. That is a stopgap until we
implement an Any23 parsing and indexing filter within Nutch, which will
provide a complete solution to your particular request.

Alternatively, you could easily implement the above programmatically,
which would enable you to fetch the page content and extract the
triples from it as two separate steps.
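
As a rough sketch of that programmatic route (an illustration only, not
Any23 or Nutch code): the class ContentAndTriples, its PageResult holder
and the process/httpFetcher methods below are all made-up names. The
fetch step uses the plain JDK HTTP client, and the triple extraction is
left as a pluggable function, on the assumption that you would wire your
Any23 extraction call in there yourself.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical helper for illustration; not part of Any23 or Nutch.
final class ContentAndTriples {

    // Holds the raw page content alongside the triples extracted from it.
    static final class PageResult {
        final String url;
        final String content;
        final List<String> triples;

        PageResult(String url, String content, List<String> triples) {
            this.url = url;
            this.content = content;
            this.triples = triples;
        }
    }

    // Step 1: fetch the page body over HTTP with the JDK client (Java 11+).
    static Function<String, String> httpFetcher() {
        HttpClient client = HttpClient.newHttpClient();
        return url -> {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            try {
                return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while fetching " + url, e);
            }
        };
    }

    // Step 2: hand the already-fetched content to a separate extractor.
    // The extractor is an assumption: any function mapping (url, content)
    // to a list of serialized triples, e.g. a wrapper around Any23.
    static PageResult process(String url,
                              Function<String, String> fetcher,
                              BiFunction<String, String, List<String>> extractor) {
        String content = fetcher.apply(url);
        List<String> triples = extractor.apply(url, content);
        return new PageResult(url, content, triples);
    }
}
```

Calling ContentAndTriples.process(url, ContentAndTriples.httpFetcher(),
yourExtractor) then gives you both the page content and the triples in
one result, while keeping the two steps independent of each other.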
