"The interpretation of this field will differ from connector to connector".
From the above description, it seems the content of entityid depends on
which connector is being used to crawl the web pages.
You're right about the second point, on the entityid column datatype. In
MySQL, which I'm using with ManifoldCF, the datatype of entityid is
LONGTEXT. I was only speaking figuratively, though I've just found out
that I can actually execute the SQL statement. :-)
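For the record, the kind of query that worked for me looked roughly like
the following. This is a sketch only: the repohistory table and entityid
column are named in this thread, but the activitytype column and the
'fetch' activity name are my assumptions, so check the actual schema
before running it.

```sql
-- Sketch only: activitytype and the 'fetch' value are assumptions, not
-- a verified ManifoldCF schema.  entityid is LONGTEXT in MySQL, so avoid
-- joining or building an index on it; filter on the activity type instead.
SELECT entityid
FROM repohistory
WHERE activitytype = 'fetch';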
Cheers,
H.Ozawa
(2011/12/05 10:29), Karl Wright wrote:
Well, the history comes from the repohistory table, yes - but you may
not be able to construct a query with entityid=jobs.id: first, because
that is incorrect (what the entityid field contains depends on the
activity type), and second, because that column is potentially long and
only some kinds of queries can be done against it. Specifically, it
cannot be built into an index on PostgreSQL.
Karl
On Sun, Dec 4, 2011 at 7:50 PM, Hitoshi Ozawa
<[email protected]> wrote:
Is "history" just the set of entries in the "repohistory" table with
entityid = jobs.id?
H.Ozawa
(2011/12/03 1:43), Karl Wright wrote:
The best place to get this from is the simple history. A command-line
utility to dump this information to a text file should be possible
with the currently available interface primitives. If that is how you
want to go, you will need to run ManifoldCF in multiprocess mode.
Alternatively you might want to request the info from the API, but
that's problematic because nobody has implemented report support in
the API as of now.
A final alternative is to get this from the log. There is an [INFO]
level line from the web connector for every fetch, I seem to recall,
and you might be able to use that.
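If you go the log route, something along these lines could pull the URLs
out. A sketch only: the sample [INFO] lines below are invented, since the
exact web-connector log format isn't shown here, so the grep pattern will
need adjusting against a real manifoldcf.log.

```shell
# Illustration only: these [INFO] lines are invented sample data; the
# real web-connector log format may differ, so adapt the pattern.
cat > manifoldcf.log <<'EOF'
2011-12-02 11:18:00 [INFO] Web connector: fetched http://example.com/index.html
2011-12-02 11:18:01 [INFO] Web connector: fetched http://example.com/about.html
2011-12-02 11:18:01 [DEBUG] something else entirely
EOF

# Keep only [INFO] lines, extract anything URL-shaped, and de-duplicate.
grep -F '[INFO]' manifoldcf.log \
  | grep -oE 'https?://[^[:space:]]+' \
  | sort -u
```

The -F on the first grep avoids treating the brackets in [INFO] as a
character class; the second grep does the actual URL extraction.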
Thanks,
Karl
On Fri, Dec 2, 2011 at 11:18 AM, M Kelleher<[email protected]> wrote:
Is it possible to export / download the list of URLs visited during a
crawl job?