Russell Mayor wrote:
One approach that I looked at a little while ago used the notion that, when
sorted, each URL in a list is likely to share a character sequence at its
start with the previous URL. This is especially true when Nutch is being used
for focused / deep crawling and when the page database is large. The idea was
then to encode the URL of each page in the web database in two parts:
- the number of characters that the URL has in common with the previous page
in the database
- the remaining characters of the URL once the common part has been removed.
i.e., the URL list:
- http://www.awebsite.com/section1/page1.html
- http://www.awebsite.com/section1/page2.html
- http://www.awebsite.com/section2/page3.html
- ...
would be encoded as
- http://www.awebsite.com/section1/page1.html
- 37 2.html
- 31 2/page3.html
- ...
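
For illustration, here is a minimal, self-contained sketch of that front-coding
idea in plain Java. This is not Nutch code; the class and method names are made
up for this example, and the coder simply uses the longest common prefix with
the previous URL, as described above.

import java.util.ArrayList;
import java.util.List;

public class UrlPrefixCoding {

    /** One encoded entry: shared-prefix length plus the remaining characters. */
    public static class Entry {
        final int common;
        final String suffix;
        Entry(int common, String suffix) { this.common = common; this.suffix = suffix; }
        @Override public String toString() { return common + " " + suffix; }
    }

    /** Encode a sorted list of URLs; the first entry has common == 0. */
    public static List<Entry> encode(List<String> sortedUrls) {
        List<Entry> out = new ArrayList<>();
        String prev = "";
        for (String url : sortedUrls) {
            int common = commonPrefixLength(prev, url);
            out.add(new Entry(common, url.substring(common)));
            prev = url;
        }
        return out;
    }

    /** Decode back to the original URLs by reusing the previous decoded URL. */
    public static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Entry e : entries) {
            String url = prev.substring(0, e.common) + e.suffix;
            out.add(url);
            prev = url;
        }
        return out;
    }

    private static int commonPrefixLength(String a, String b) {
        int max = Math.min(a.length(), b.length());
        int i = 0;
        while (i < max && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://www.awebsite.com/section1/page1.html",
            "http://www.awebsite.com/section1/page2.html",
            "http://www.awebsite.com/section2/page3.html");
        // Prints the three entries from the example above:
        // 0 http://www.awebsite.com/section1/page1.html, then 37 2.html, then 31 2/page3.html
        encode(urls).forEach(System.out::println);
    }
}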
I was working on this, and implemented it as a StringListWritable. After
testing, though, I decided not to use it. The space savings are worse than you
might think, because for every string you also need to store an int with its
length, which wastes 4 bytes per entry. Overall I couldn't get better than a
25-28% saving, using Deflater with the default values (5?).
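
To make that overhead concrete, the serialization of such a StringListWritable
might look roughly like the following. This is only a sketch, not the actual
class that was written; it mirrors the write/readFields shape of a Hadoop/Nutch
Writable, and the field layout is an assumption.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class StringListWritableSketch {

    private final List<String> suffixes = new ArrayList<>();

    public void write(DataOutput out) throws IOException {
        out.writeInt(suffixes.size());          // number of entries: 4 bytes
        for (String s : suffixes) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            out.writeInt(bytes.length);         // per-string length: the 4 bytes "wasted" on every entry
            out.write(bytes);
        }
    }

    public void readFields(DataInput in) throws IOException {
        suffixes.clear();
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            suffixes.add(new String(bytes, StandardCharsets.UTF_8));
        }
    }
}

Since the remaining suffixes are often only ten or so characters long, a fixed
four-byte length per entry gives back a noticeable share of what the prefix
removal saved; a variable-length integer encoding of the lengths (the approach
Hadoop's WritableUtils.writeVInt takes) would cut the per-string cost to a
single byte for short suffixes, though it would not change the overall picture
much.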
I got partway through implementing this for Nutch 0.6, but didn't get it
completed. I also heard on the list that the mapred branch of Nutch was going
to have a substantially re-worked web database, so I would have been trying to
hit a moving target with my optimisation. Does it sound suitable for the new
web database (I'm not familiar with the mapred branch of Nutch)?
You will find the mapred version much, much more responsive.
--
Best regards,
Andrzej Bialecki <><
http://www.sigram.com Contact: info at sigram dot com