Correct, content is not stored in the crawldb.
The crawldb holds each URL and its state (fetched or unfetched, last fetch
time, etc.). The content of the page is held in the segments. Within a
segment, the content folder holds the actual page content, parse data is
the page metadata, and parse text is the actual text of the page after
parsing.
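
As a rough sketch of that split (illustrative only: CRAWLDB_FIELDS, SEGMENT_PARTS and where_is are names made up for this example and are not part of Nutch, and the segment directory names are the usual layout as far as I know):

```python
# Hypothetical model of where Nutch keeps information about a page.
# This is not Nutch code; it only mirrors the description above.

# Per-URL state kept in the crawldb (no page content here).
CRAWLDB_FIELDS = {"url", "status", "last_fetch_time"}

# Per-page data kept inside a segment (assumed directory names).
SEGMENT_PARTS = {
    "content": "actual page content as fetched",
    "parse_data": "page metadata produced by parsing",
    "parse_text": "plain text of the page after parsing",
}


def where_is(item: str) -> str:
    """Return 'crawldb' or 'segments' for a known item name."""
    if item in CRAWLDB_FIELDS:
        return "crawldb"
    if item in SEGMENT_PARTS:
        return "segments"
    raise KeyError(f"unknown item: {item}")


if __name__ == "__main__":
    for name in ("url", "content", "parse_text"):
        print(name, "->", where_is(name))
```

So to get at the page itself you have to read the segment, not the crawldb.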
Dennis
Qi Wu wrote:
Hi All,
I want to know what kind of information about a page is kept in the WebDB.
It seems the content of a page can't be retrieved from the WebDB, only the
MD5 hash of the page contents, and the page contents themselves can only be
retrieved from the segments. Is this right?
Thanks,
Qi