Hello Friend,

I have a question about how to write a MapReduce job that builds an inverted
index over crawled web data.

My problem is this: if I store each page in its own file, the file id is easy
to obtain, but I am afraid that billions of crawled pages stored as small
files will be a problem for the Hadoop storage system (HDFS).

If instead I store all pages in one big file, how can I get a per-page file
id (document id) during the MapReduce job?
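
To make the question concrete, here is a minimal sketch of the kind of
mapper I have in mind, assuming one page per line, TextInputFormat (so the
map key is the record's byte offset in the file), and the new
org.apache.hadoop.mapreduce API; the class name InvertedIndexMapper and the
"path@offset" id scheme are just illustrations, not working code I have:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class InvertedIndexMapper
            extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // With TextInputFormat the key is the byte offset of this
            // record inside its input file, so (file path, offset) could
            // serve as a unique document id even when many pages share
            // one big file.
            FileSplit split = (FileSplit) context.getInputSplit();
            String docId = split.getPath().getName() + "@" + key.get();

            // Emit (term, docId) pairs; a reducer would then collect the
            // posting list for each term.
            for (String term : value.toString().split("\\s+")) {
                if (!term.isEmpty()) {
                    context.write(new Text(term), new Text(docId));
                }
            }
        }
    }

Is something like this (path plus byte offset) a reasonable document id, or
is there a standard way to do it?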

Thanks in advance!

Regards

Zhijun
