File open time is an issue if you have lots and lots of little files.
If you are doing this analysis once or a few times, then it isn't worth reformatting into a few larger files. If you are likely to do this analysis dozens of times, then opening larger files will probably give you a significant benefit in terms of runtime. If the runtime isn't terribly important, then the filename-per-line approach will work fine. Note that the filename-per-line approach is a great way to do the pre-processing into a few large files, which will then be analyzed faster. Rough sketches of both approaches are below, after your quoted message.

On 10/24/07 5:09 PM, "David Balatero" <[EMAIL PROTECTED]> wrote:

> I have a corpus of 300,000 raw HTML files that I want to read in and
> parse using Hadoop. What is the best input file format to use in this
> case? I want to have access to each page's raw HTML in the mapper, so
> I can parse from there.
>
> I was thinking of preprocessing all the files, removing the new
> lines, and putting them in a big <key, value> file:
>
> url1, html with stripped new lines
> url2, ....
> url3, ....
> ...
> urlN, ....
>
> I'd rather not do all this preprocessing, just to wrangle the text
> into Hadoop. Any other suggestions? What if I just stored the path to
> the HTML file in a <key, value> type
>
> url1, path_to_file1
> url2, path_to_file2
> ...
> urlN, path_to_fileN
>
> Then in the mapper, I could read each file in from the DFS on the
> fly. Anyone have any other good ideas? I feel like there's some key
> function that I'm just stupidly overlooking...
>
> Thanks!
> David Balatero
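Here is a rough, untested sketch of the filename-per-line approach with the old mapred API: the job's input is a text file with one DFS path per line, and the mapper opens each file on the fly and emits (path, raw HTML). The class name and the one-path-per-line input convention are just illustrative assumptions, not anything your setup requires.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class HtmlPathMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private FileSystem fs;

  public void configure(JobConf job) {
    try {
      fs = FileSystem.get(job);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable offset, Text pathLine,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    Path path = new Path(pathLine.toString().trim());

    // Read the whole HTML file from the DFS. Every map() call pays one
    // file open, which is exactly the per-file cost discussed above.
    StringBuilder html = new StringBuilder();
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        html.append(line).append('\n');
      }
    } finally {
      in.close();
    }

    // Key: the file path (stand-in for the URL); value: the raw HTML.
    // You could parse the HTML right here instead of emitting it.
    output.collect(pathLine, new Text(html.toString()));
  }
}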

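And a minimal sketch of the packing step itself, assuming the HTML files sit on the local disk of one machine: walk a directory and append each file into a single SequenceFile on the DFS, keyed by URL (the file name stands in for the URL here). Paths, arguments, and the key convention are all assumptions.

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackHtml {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[1]);   // e.g. /corpus/html.seq on the DFS

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    try {
      // args[0] is a local directory of HTML files.
      for (File f : new File(args[0]).listFiles()) {
        byte[] buf = new byte[(int) f.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(f));
        try {
          in.readFully(buf);
        } finally {
          in.close();
        }
        // Key: the URL (file name as a stand-in); value: the raw HTML.
        writer.append(new Text(f.getName()), new Text(new String(buf, "UTF-8")));
      }
    } finally {
      writer.close();
    }
  }
}

If the files are already on the DFS, you can get the same effect from the path-reading mapper above by setting the job's output format to SequenceFileOutputFormat with Text keys and values, so the packing itself runs as a MapReduce job.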