BELLINI ADAM wrote:
Hi,
Thanks for the advice,
but I guess that when you run the readseg command it will not return the pages as they
are (as if browsed).
I tried it and it returns information about the pages:
Recno:: 0
URL:: http://blabla.com/blabla.jsp
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Aug 31 16:11:26 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 86400 seconds (1 days)
Score: 8.849112E-7
Signature: null
Metadata:
Is there another way to get the source of the page as if it were browsed? I
mean, as if we ran wget?
The above record comes from the <segmentDir>/crawl_parse part of the segment. If
you dump the <segmentDir>/content part instead, you will get the original raw content.
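
A minimal sketch of such a dump, assuming the readseg general options of your Nutch version include the usual -no* switches (check the usage text printed by "bin/nutch readseg" with no arguments). The helper name dump_raw_content, the NUTCH_BIN variable and both paths are placeholders for illustration only:

```shell
# Hypothetical helper: dump only the content part of a segment.
# NUTCH_BIN defaults to bin/nutch, which assumes you run this from
# the Nutch installation directory.
dump_raw_content() {
    segment_dir="$1"   # the segment to read, i.e. <segmentDir>
    output_dir="$2"    # directory that receives the text dump
    # Each -no* switch suppresses one other part of the segment, so
    # only the raw fetched content should remain in the output.
    "${NUTCH_BIN:-bin/nutch}" readseg -dump "$segment_dir" "$output_dir" \
        -nofetch -nogenerate -noparse -noparsedata -noparsetext
}
```

Called as dump_raw_content <segmentDir> <outputDir>, this should write a plain-text dump holding the fetched bytes of each page, which is the closest equivalent to what wget would have saved.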
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com