That isn't all that many files. At about 1 MB each, you shouldn't see much of a performance hit from reading many files.
You will need a special input format, but it can be very simple. Just extend something like TextInputFormat, replace the record reader, and report the file as unsplittable.

On 11/26/07 8:49 AM, "Peter Thygesen" <[EMAIL PROTECTED]> wrote:

> I would like to run some mapReduce jobs on some xml files I got (aprox.
> 100000 compressed files).
> The XML files are not that big about 1 Mb compressed, each containing
> about 1000 records.
>
> Do I have to write my own InputSplitter? Should I use
> MultiFileInputFormat or StreamInputFormat? Can I use the
> StreamXmlRecordReader, and how? By sub-classing some input class?
>
> The tutorials and examples I've read are all very straight forward
> reading simple text files, but I miss a more complex example, especially
> one that reads xml files ;)
>
> thx.
> Peter
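
To make the "replace the record reader" part of the reply more concrete, here is a minimal sketch in plain Java (no Hadoop dependency) of the logic such a reader might wrap: decompress one whole file and cut it into records between a begin tag and an end tag. Several things here are assumptions, not facts from the thread: the files are gzip-compressed and UTF-8 encoded, each record sits between <record> and </record>, and the class and method names (XmlRecordSketch, readRecords, gzip) are made up for illustration. Reading the file as a single stream is what marking it unsplittable buys you; in Hadoop that is normally done by overriding isSplitable in the FileInputFormat subclass to return false.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class XmlRecordSketch {

    // Reads one whole gzip-compressed file (assumed format) and returns every
    // span that starts with beginTag and ends with endTag, tags included.
    // The tag matching is naive string search, not real XML parsing.
    public static List<String> readRecords(InputStream compressed,
                                           String beginTag,
                                           String endTag) throws IOException {
        String xml;
        try (GZIPInputStream in = new GZIPInputStream(compressed)) {
            // About 1 MB compressed per file, so buffering it all is acceptable here.
            xml = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = xml.indexOf(beginTag, from);
            if (start < 0) {
                break; // no more records
            }
            int end = xml.indexOf(endTag, start + beginTag.length());
            if (end < 0) {
                break; // unterminated record: stop rather than guess
            }
            end += endTag.length();
            records.add(xml.substring(start, end));
            from = end;
        }
        return records;
    }

    // Helper used only to build sample input for the demo below.
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample document with two records.
        byte[] file = gzip("<root><record>a</record><record>b</record></root>");
        List<String> records =
                readRecords(new ByteArrayInputStream(file), "<record>", "</record>");
        for (String r : records) {
            System.out.println(r);
        }
    }
}
```

In a real job each returned string would become one map input value. A begin tag that carries attributes would need looser matching than the exact string search used here.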
