Hi everyone, I am working on a project that takes in data from a large number of text files, and although there are many ways to do it, it is not clear to me which is the best/fastest. I am working on an EC2 cluster of approximately 20 machines.
The data is currently spread across 20k text files (~3 GB total), each of which needs to be treated as a whole (no splits within a file), but I am willing to change the format around if I can get increased speed. Using the regular TextInputFormat adjusted to take in entire files is pretty slow, since each file takes a minimum of about 3 seconds no matter how small it is.

From what I have read, the possible options to proceed with are as follows:

1. Use MultiFileInputSplit. It seems to be designed for this sort of situation, but I have yet to see an implementation of it, or any commentary on its performance increase over the regular input.

2. Read the data in once, output it as a SequenceFile, and use the SequenceFile as input from there on out.

3. Condense the files down to a small number of files (say ~100) and delimit them so each part gets a separate record in the record reader.

If anyone could give me guidance as to what will provide the best performance for this setup, I would greatly appreciate it.

Thanks for your help

--
View this message in context: http://www.nabble.com/Handling-Large-Number-Of-Files%2C-Fastest-Way-tp17297485p17297485.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
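For what it's worth, the packing step in option 3 can be sketched outside of Hadoop. The sketch below (plain Python, with a made-up `DELIMITER` placeholder; the file names are hypothetical) just concatenates many small files into one large file with a record boundary between them, so that a custom record reader on the Hadoop side could hand each original file to a mapper as a single record. It is an illustration of the idea, not Hadoop's actual SequenceFile format.

```python
import os

# Hypothetical record delimiter -- any byte sequence guaranteed not to
# appear in the data would do. A custom RecordReader would split on it.
DELIMITER = b"\n===RECORD-BOUNDARY===\n"

def pack_files(input_paths, output_path):
    """Concatenate many small files into one large file,
    separating whole-file records with DELIMITER."""
    with open(output_path, "wb") as out:
        for i, path in enumerate(input_paths):
            if i > 0:
                out.write(DELIMITER)
            with open(path, "rb") as f:
                out.write(f.read())

def unpack_records(packed_path):
    """Split a packed file back into whole-file records,
    in the same order they were packed."""
    with open(packed_path, "rb") as f:
        return f.read().split(DELIMITER)
```

Grouping the 20k inputs into ~100 such packed files would amortize the per-file startup cost across hundreds of records each, which is the whole point of option 3.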
