Out of interest, has anyone ever looked at compressing the data that is stored in the web database? There is a lot of text in there that could be stored in much less space than it currently occupies if it were compressed.
I realize that this would increase serialization / de-serialization time, but it would reduce the corresponding disk transfer times, so it would be important to use a fast encoding.

One approach that I looked at a little while ago used the observation that, when sorted, each url in a list is likely to share a character sequence at its start with the previous url. This is especially true when nutch is being used for focussed / deep crawling and when the page database is large. The idea was then to encode the url of each page in the web database in two parts:
- the number of characters that the url has in common with the previous page in the database
- the remaining characters of the url once the common part has been removed.

i.e., the url list:
- http://www.awebsite.com/section1/page1.html
- http://www.awebsite.com/section1/page2.html
- http://www.awebsite.com/section2/page3.html
- ...
would be encoded as
- http://www.awebsite.com/section1/page1.html
- 37 2.html
- 31 2/page3.html
- ...

Random access into the page database is performed by a coarse seek into the sorted list of pages, followed by a sequential fine scan. So, for the above approach to work it's necessary to ensure that the pages at the coarse seek points aren't compressed. A rough sketch of the encoding and of the coarse-seek decoding is appended below.

I got part way through implementing this for nutch 0.6, but didn't get it completed. I also heard on the list that the mapred branch of nutch was going to have a substantially re-worked web database, so I would have been trying to hit a moving target with my optimisation. Does the idea sound suitable for the new web database (I'm not familiar with the mapred branch of nutch)?

Russell
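
In case it helps make the idea concrete, here is a rough sketch of the encoding and of the coarse-seek / fine-scan decoding. It is written as a standalone Java class rather than against the actual WebDB code; the class name UrlFrontCoder, the Entry layout and the restart interval of 16 are just illustrative choices of mine, not anything taken from nutch.

    import java.util.ArrayList;
    import java.util.List;

    /** Rough sketch of front-coding a sorted url list (not the actual WebDB format). */
    public class UrlFrontCoder {

        /** Every RESTART_INTERVAL-th entry is stored uncompressed so that a coarse
         *  seek can land on it and decoding can start from there. */
        static final int RESTART_INTERVAL = 16;

        /** One encoded entry: shared-prefix length plus the remaining suffix. */
        static class Entry {
            final int shared;      // characters in common with the previous url
            final String suffix;   // remaining characters once the common part is removed
            Entry(int shared, String suffix) { this.shared = shared; this.suffix = suffix; }
        }

        /** Encode a sorted list of urls. */
        static List<Entry> encode(List<String> sortedUrls) {
            List<Entry> out = new ArrayList<>();
            String prev = "";
            for (int i = 0; i < sortedUrls.size(); i++) {
                String url = sortedUrls.get(i);
                if (i % RESTART_INTERVAL == 0) {
                    out.add(new Entry(0, url));          // coarse seek point: stored uncompressed
                } else {
                    int shared = commonPrefixLength(prev, url);
                    out.add(new Entry(shared, url.substring(shared)));
                }
                prev = url;
            }
            return out;
        }

        /** Decode one url by scanning forward from the nearest uncompressed entry. */
        static String decode(List<Entry> entries, int index) {
            int restart = (index / RESTART_INTERVAL) * RESTART_INTERVAL;  // coarse seek
            String url = entries.get(restart).suffix;
            for (int i = restart + 1; i <= index; i++) {                  // sequential fine scan
                Entry e = entries.get(i);
                url = url.substring(0, e.shared) + e.suffix;
            }
            return url;
        }

        static int commonPrefixLength(String a, String b) {
            int n = Math.min(a.length(), b.length());
            int i = 0;
            while (i < n && a.charAt(i) == b.charAt(i)) i++;
            return i;
        }

        public static void main(String[] args) {
            List<String> urls = List.of(
                "http://www.awebsite.com/section1/page1.html",
                "http://www.awebsite.com/section1/page2.html",
                "http://www.awebsite.com/section2/page3.html");
            List<Entry> enc = encode(urls);
            for (int i = 0; i < urls.size(); i++) {
                Entry e = enc.get(i);
                System.out.println(e.shared + " " + e.suffix + "  ->  " + decode(enc, i));
            }
        }
    }

Running it on the three urls above prints the shared-prefix counts and suffixes from the example, and reconstructs each full url by scanning forward from the nearest uncompressed entry, which is essentially what the coarse seek + fine scan in the page database would do.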
