Parsing a directory of 300,000 HTML files?

David Balatero Wed, 24 Oct 2007 17:09:59 -0700

I have a corpus of 300,000 raw HTML files that I want to read in and parse using Hadoop. What is the best input file format to use in this case? I want to have access to each page's raw HTML in the mapper, so I can parse from there.

I was thinking of preprocessing all the files, removing the newlines, and putting them in a big <key, value> file:


url1, html with stripped new lines
url2, ....
url3, ....
...
urlN, ....
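For concreteness, the preprocessing step could look roughly like the sketch below. It is only an illustration under stated assumptions: the names `flatten_html`, `build_records`, and `url_for` are hypothetical, the file name stands in for the URL unless a mapping is supplied, and a tab is used as the key/value separator instead of a comma because URLs and HTML can both contain commas.

```python
import os


def flatten_html(html):
    """Replace every line break with a space so a page fits on one line."""
    return " ".join(html.splitlines())


def build_records(html_dir, out_path, url_for=lambda name: name):
    """Write one 'url<TAB>flattened html' line per file; return the count.

    url_for is a hypothetical hook that maps a file name to its URL.
    out_path should live outside html_dir so it is not picked up as input.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for name in sorted(os.listdir(html_dir)):
            path = os.path.join(html_dir, name)
            if not os.path.isfile(path):
                continue  # skip subdirectories
            with open(path, encoding="utf-8", errors="replace") as f:
                html = flatten_html(f.read())
            out.write(f"{url_for(name)}\t{html}\n")
            count += 1
    return count
```

One caveat with this layout: replacing line breaks changes whitespace-sensitive content such as `<pre>` blocks, so the mapper no longer sees the page byte for byte.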

I'd rather not do all this preprocessing, just to wrangle the text into Hadoop. Any other suggestions? What if I just stored the path to the HTML file in a <key, value> file:


url1, path_to_file1
url2, path_to_file2
...
urlN, path_to_fileN

Then in the mapper, I could read each file in from the DFS on the fly. Anyone have any other good ideas? I feel like there's some key function that I'm just stupidly overlooking...
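
The path-based idea can be sketched as a plain function over manifest lines, for example as the body of a streaming-style mapper. This is a rough illustration, not Hadoop API code: `parse_manifest_line`, `map_manifest`, `read_file`, and `parse_page` are all hypothetical names, `read_file` stands in for whatever call actually fetches a file from the DFS, and paths are assumed to contain no commas.

```python
def parse_manifest_line(line):
    """Split a 'url, path' line into (url, path).

    Splits on the last comma because a URL may itself contain commas;
    this assumes the path does not.
    """
    url, path = line.rstrip("\n").rsplit(",", 1)
    return url.strip(), path.strip()


def map_manifest(lines, read_file, parse_page):
    """Yield (url, parse_page(html)) for each non-empty manifest line.

    read_file(path) -> str is a placeholder for the DFS read;
    parse_page(html) is whatever per-page parsing the job needs.
    """
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines in the manifest
        url, path = parse_manifest_line(line)
        yield url, parse_page(read_file(path))
```

The manifest stays tiny with this approach, but each record costs one extra file read, and the mapper is scheduled near the manifest's blocks rather than near the HTML it goes on to fetch.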


Thanks!
David Balatero
