Out of interest, has anyone ever looked at compressing the data that is
stored in the web database? There is a lot of text in there that could be
stored in a much smaller space than currently if compressed.

I realize that this would increase serialization / de-serialization time,
but would reduce corresponding disk transfer times, so it would be important
to use a fast encoding.

One approach that I looked at a little while ago used the notion that, when
sorted, each url in a list is likely to share a character sequence at its
start with the previous url. This is especially true when nutch is being used
for focussed / deep crawling and when the page database is large. The idea
was then to encode the url of each page in the web database in two parts:
- the number of characters that the url has in common with the previous page
in the database
- the remaining characters from the url once the common part has been
removed.

ie, the url list:
- http://www.awebsite.com/section1/page1.html
- http://www.awebsite.com/section1/page2.html
- http://www.awebsite.com/section2/page3.html
- ...

would be encoded as
- http://www.awebsite.com/section1/page1.html
- 37 2.html
- 31 2/page3.html
- ...
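
To make the idea concrete, here is a rough sketch of the encoding and
decoding steps. The class and method names are my own for illustration, not
anything from the Nutch codebase, and the sketch writes the first entry with
a shared count of 0 rather than special-casing it as an uncompressed url.

import java.util.ArrayList;
import java.util.List;

/** Rough sketch of prefix-coding a sorted url list, as described above.
 *  Names are illustrative only, not from the Nutch code. */
public class UrlPrefixCoder {

    /** Length of the longest common prefix of two strings. */
    static int commonPrefixLength(String a, String b) {
        int max = Math.min(a.length(), b.length());
        int i = 0;
        while (i < max && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    /** Encode each url as "<shared> <suffix>" relative to the previous url.
     *  The first entry gets a shared count of 0, i.e. the full url. */
    static List<String> encode(List<String> sortedUrls) {
        List<String> out = new ArrayList<String>(sortedUrls.size());
        String prev = "";
        for (String url : sortedUrls) {
            int shared = commonPrefixLength(prev, url);
            out.add(shared + " " + url.substring(shared));
            prev = url;
        }
        return out;
    }

    /** Decoding needs the previously decoded url, so access is sequential. */
    static String decode(String prevUrl, String encoded) {
        int space = encoded.indexOf(' ');
        int shared = Integer.parseInt(encoded.substring(0, space));
        return prevUrl.substring(0, shared) + encoded.substring(space + 1);
    }
}

Run over the example list above, encode() produces the three lines shown
(with the first written as "0 http://...").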

Random access into the page database is performed by a coarse seek into
the sorted list of pages, followed by a sequential fine scan. So, for the
above approach to work, it's necessary to ensure that the pages at the
coarse seek points aren't compressed.
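
Here's a sketch of how a lookup could work if every Nth entry (a "restart
point") is left uncompressed. The interval, the in-memory layout and the
names are all assumptions for illustration, not the actual web database
format.

import java.util.List;

/** Sketch of a lookup over a prefix-coded, sorted url list in which every
 *  Nth entry is stored in full. Layout and names are assumptions. */
public class PrefixCodedLookup {
    static final int RESTART_INTERVAL = 16; // every 16th url kept in full

    /** Rebuild a url from "<shared> <suffix>" and the previous decoded url. */
    static String decode(String prevUrl, String encoded) {
        int space = encoded.indexOf(' ');
        int shared = Integer.parseInt(encoded.substring(0, space));
        return prevUrl.substring(0, shared) + encoded.substring(space + 1);
    }

    /** entries[i] is a full url when i % RESTART_INTERVAL == 0, otherwise
     *  "<shared> <suffix>" relative to entry i-1. */
    static boolean contains(List<String> entries, String target) {
        // Coarse seek: find the last restart point whose full url <= target.
        // A real implementation would binary search or seek on disk here.
        int start = 0;
        for (int i = 0; i < entries.size(); i += RESTART_INTERVAL) {
            if (entries.get(i).compareTo(target) <= 0) {
                start = i;
            } else {
                break;
            }
        }
        // Fine scan: decode sequentially within the block until we pass target.
        String prev = entries.get(start);
        if (prev.equals(target)) {
            return true;
        }
        int end = Math.min(entries.size(), start + RESTART_INTERVAL);
        for (int i = start + 1; i < end; i++) {
            prev = decode(prev, entries.get(i));
            int cmp = prev.compareTo(target);
            if (cmp == 0) return true;
            if (cmp > 0) return false; // sorted, so target can't appear later
        }
        return false;
    }
}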

I got part way through implementing this for nutch 0.6, but didn't complete
it. I also heard on the list that the mapred branch of nutch was going to
have a substantially re-worked web database, so I would have been trying to
hit a moving target with my optimisation. Does this approach sound suitable
for the new web database (I'm not familiar with the mapred branch of nutch)?

Russell
