Hello Friend, I have a question about how to write a MapReduce job that builds an inverted index for crawled web data.
My problem is this: if I store each page in its own file, the file-id is easy to obtain, but I am afraid that billions of crawled pages stored as small files would be a problem for the Hadoop storage system (HDFS). If I instead pack all the pages into one big file, how do I get the file-id of each page during the MapReduce job?

Thanks in advance!

Regards,
Zhijun
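P.S. To make the question concrete, here is a rough sketch of the mapper I have in mind, assuming the pages are packed into one SequenceFile whose key is the page URL (serving as the doc-id) and whose value is the page text. The class and names below are just placeholders, not working code from my crawler:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper for the inverted index. It assumes the input is a
// SequenceFile of (URL, page text) pairs read via SequenceFileInputFormat,
// so the doc-id simply arrives as the map input key.
public class InvertedIndexMapper extends Mapper<Text, Text, Text, Text> {
    private final Text term = new Text();

    @Override
    protected void map(Text docId, Text page, Context context)
            throws IOException, InterruptedException {
        // Emit (term, doc-id) for every token on the page; a reducer can
        // then collect the doc-ids into a posting list for each term.
        for (String token : page.toString().toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            term.set(token);
            context.write(term, docId);
        }
    }
}

With this layout the per-page id seems to come for free as the map input key, so no separate file-id lookup appears necessary. Does this seem like a reasonable way to recover the file-id?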
