BELLINI ADAM wrote:
Hi,
thanks for the advice,
but I guess that when you run the readseg command it will not return the pages
as they are (as if browsed).
I tried it and it returns information about the pages:

Recno:: 0
URL:: http://blabla.com/blabla.jsp

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Aug 31 16:11:26 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 86400 seconds (1 days)
Score: 8.849112E-7
Signature: null
Metadata:

Is there another way to get the source of the page as it would be seen when
browsed? I mean, as if we ran wget?

The above record comes from the <segmentDir>/crawl_parse part of the segment. If you dump the <segmentDir>/content part instead, you will get the original raw content.
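
As a rough sketch of what that could look like on the command line: the segment path, output directory and launcher location below are hypothetical placeholders, and the -no* switches are the ones listed in the readseg usage message of Nutch 1.x as far as I recall, so please check them against the usage output of your own version first.

```shell
#!/bin/sh
# Hypothetical locations: adjust these to your own crawl layout.
SEGMENT="${SEGMENT:-crawl/segments/20090831000000}"
OUTDIR="${OUTDIR:-segdump}"
NUTCH="${NUTCH:-bin/nutch}"

# Dump only the raw fetched content by switching off the other segment parts
# (crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text).
CMD="$NUTCH readseg -dump $SEGMENT $OUTDIR -nofetch -nogenerate -noparse -noparsedata -noparsetext"

# Show the command that would be executed.
echo "$CMD"

# Execute it only when a Nutch launcher is actually present at that path.
# Note: this simple form assumes the paths contain no spaces.
if [ -x "$NUTCH" ]; then
  $CMD
fi
```

The dump should then end up as a plain text file inside the output directory, with a Content:: section for each record holding the page bytes as they were fetched.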

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
