Re: [Nutch-dev] retrieving original html from database

songjue Thu, 26 Apr 2007 23:26:54 -0700

You can try this command: bin/nutch readseg (-dump ... | -get ...).
If you need an API instead of the command line, you may have to adapt
the code in segment/SegmentReader.java. I'm also wondering about this.
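
To make the command-line route concrete, here is a minimal sketch. The
segment name, output directory and URL are hypothetical placeholders, and
the -no* switches are the general options as I recall them from the readseg
usage message, so please run bin/nutch readseg without arguments to confirm
them on your version before relying on this.

```shell
# Sketch only: all paths and URLs used with these helpers are placeholders.
# NUTCH_BIN may be overridden; by default it points at the launcher script
# inside a Nutch installation directory.
: "${NUTCH_BIN:=bin/nutch}"

# Dump only the raw fetched content of one segment into an output directory.
# The -no* switches (assumed option names) skip the other segment parts.
dump_raw_content() {
    segment_dir=$1
    output_dir=$2
    "$NUTCH_BIN" readseg -dump "$segment_dir" "$output_dir" \
        -nofetch -nogenerate -noparse -noparsedata -noparsetext
}

# Print the stored record for a single URL from one segment.
get_url_record() {
    segment_dir=$1
    url=$2
    "$NUTCH_BIN" readseg -get "$segment_dir" "$url"
}

# Example calls (hypothetical segment name and URL):
#   dump_raw_content crawl/segments/20070425224312 segdump
#   get_url_record crawl/segments/20070425224312 http://www.example.com/
```

Wrapping the calls in functions keeps the launcher path in one place, which
helps if you need to loop over many segments.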


BTW, make sure you set the 'http.content.limit' property to -1 to avoid 
content truncation.
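
For reference, a sketch of how that override might look, assuming it goes
inside the <configuration> element of conf/nutch-site.xml (the usual place
for local overrides). Note that it only affects pages fetched after the
change; content that was already truncated in existing segments stays
truncated.

```xml
<!-- Assumed location: conf/nutch-site.xml, inside <configuration> -->
<property>
  <name>http.content.limit</name>
  <!-- -1 disables the length limit on downloaded HTTP content -->
  <value>-1</value>
</property>
```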
 



songjue
2007-04-27



From: Charlie Williams
Sent: 2007-04-25 22:43:12
To: [EMAIL PROTECTED]
Cc:
Subject: retrieving original html from database

I have an index of a bit over 1 million pages from the web. The fetch took
several weeks to complete, since it was mainly over a small set of domains.
Once we had a completed fetch and index, we began trying to work with the
retrieved text, and found that the cached text is just that: flat text. Is
the original HTML cached anywhere that it can be accessed after the initial
fetch? It would be a shame to have to recrawl all those pages. We are using
Nutch 0.8.

Thanks for any help.

-Charlie


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
