File open time is an issue if you have lots and lots of little files.
If you are doing this analysis once or a few times, then it isn't worth reformatting into a few larger files. If you are likely to do this analysis dozens of times, then opening larger files will probably give you a significant benefit in terms of runtime. If the runtime isn't terribly important, then the filename-per-line approach will work fine. Note that the filename-per-line approach is a great way to do the pre-processing into a few large files, which will then be analyzed faster. Rough sketches of both approaches are below, after your quoted message.

On 10/24/07 5:09 PM, "David Balatero" <[EMAIL PROTECTED]> wrote:

> I have a corpus of 300,000 raw HTML files that I want to read in and
> parse using Hadoop. What is the best input file format to use in this
> case? I want to have access to each page's raw HTML in the mapper, so
> I can parse from there.
>
> I was thinking of preprocessing all the files, removing the new
> lines, and putting them in a big <key, value> file:
>
> url1, html with stripped new lines
> url2, ....
> url3, ....
> ...
> urlN, ....
>
> I'd rather not do all this preprocessing, just to wrangle the text
> into Hadoop. Any other suggestions? What if I just stored the path to
> the HTML file in a <key, value> type
>
> url1, path_to_file1
> url2, path_to_file2
> ...
> urlN, path_to_fileN
>
> Then in the mapper, I could read each file in from the DFS on the
> fly. Anyone have any other good ideas? I feel like there's some key
> function that I'm just stupidly overlooking...
>
> Thanks!
> David Balatero
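Here is a rough, untested sketch of the filename-per-line approach with the old mapred API: the job's input is a text file with one DFS path per line, and the mapper opens each file on the fly and emits (path, raw HTML). The class name and the one-path-per-line input convention are just illustrative assumptions, not anything your setup requires.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class HtmlPathMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private FileSystem fs;

  public void configure(JobConf job) {
    try {
      fs = FileSystem.get(job);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable offset, Text pathLine,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    Path path = new Path(pathLine.toString().trim());

    // Read the whole HTML file from the DFS. Every map() call pays one
    // file open, which is exactly the per-file cost discussed above.
    StringBuilder html = new StringBuilder();
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        html.append(line).append('\n');
      }
    } finally {
      in.close();
    }

    // Key: the file path (stand-in for the URL); value: the raw HTML.
    // You could parse the HTML right here instead of emitting it.
    output.collect(pathLine, new Text(html.toString()));
  }
}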

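And a minimal sketch of the packing step itself, assuming the HTML files sit on the local disk of one machine: walk a directory and append each file into a single SequenceFile on the DFS, keyed by URL (the file name stands in for the URL here). Paths, arguments, and the key convention are all assumptions.

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackHtml {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[1]);   // e.g. /corpus/html.seq on the DFS

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    try {
      // args[0] is a local directory of HTML files.
      for (File f : new File(args[0]).listFiles()) {
        byte[] buf = new byte[(int) f.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(f));
        try {
          in.readFully(buf);
        } finally {
          in.close();
        }
        // Key: the URL (file name as a stand-in); value: the raw HTML.
        writer.append(new Text(f.getName()), new Text(new String(buf, "UTF-8")));
      }
    } finally {
      writer.close();
    }
  }
}

If the files are already on the DFS, you can get the same effect from the path-reading mapper above by setting the job's output format to SequenceFileOutputFormat with Text keys and values, so the packing itself runs as a MapReduce job.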