Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by yinghe: http://wiki.apache.org/pig/DataGeneratorHadoop

New page:

== Make DataGenerator A Hadoop Job ==

=== Introduction ===
The current data generator runs on a single box and is single-threaded, so its execution time grows linearly with the amount of data to be generated. When the amount of data reaches hundreds of gigabytes, the time required becomes unacceptable; in other words, the application does not scale to large amounts of data. The goal is to generate data in parallel so that the time can be greatly reduced.

=== Algorithm ===
Tuples generated by the data generator can contain fields that are uniformly distributed or Zipf distributed. Both types of fields can be split across multiple processors, with each processor generating a fraction of the total rows: if M rows are to be generated by N processors, each processor generates M/N rows. When the data from all processors are combined, the result should still be uniformly or Zipf distributed.

 * Uniformly distributed fields: A random number is generated with the Java Random class for the specified cardinality; these random numbers are uniformly distributed. For integer and long fields the random number is returned directly as the field value, so if N processors use the same cardinality the combined result is still uniformly distributed. For float, double, and string fields the random number is used as a seed to generate the corresponding float/double/string value, and the same seed must always yield the same value. Across N processors this is achieved by pre-generating a mapping from random numbers to actual float/double/string values; each processor loads this mapping at startup and uses it to generate these fields (see the sketch at the end of this page).
 * Zipf distributed fields: A third-party library is used to generate random numbers between 1 and the cardinality following a Zipf distribution. Given a cardinality, the library generates numbers with a fixed density, so for integer and long fields the number is returned directly as the field value and the data combined from multiple processors keeps the same density distribution. For float, double, and string fields the number is used as a seed to produce a float/double/string value, and the same number must yield the same value on every processor. As above, this is achieved by pre-generating a mapping from random numbers to actual float/double/string values; each processor loads this mapping at startup and uses it to generate these fields.

=== Design ===
The data generator is modified to be a Hadoop job.

 * Command line changes
  * A new option -m specifies the number of mappers to run the job. No reducer is required.
  * Option -f is required to specify the output directory.
  * When DataGenerator runs in Hadoop mode, -e (the seed option) is disabled: with multiple mappers generating data, sharing the same seed would make the mappers produce duplicate data.
 * Sequence of execution
  * The data generator pre-generates a mapping file for each string/double/float field with a Zipf or uniform distribution.
  * It creates a config file containing the data type, length, cardinality, distribution type, and percentage of NULLs for each field, with the name of the mapping file from the previous step appended when one was created. The name of this config file is passed to each mapper through the JobConf.
  * If no input file is configured:
   * Create N input files for the N mappers. Each input file has only one row, which contains the number of rows to be generated by that mapper.
   * Mark the job as having no input file.
  * If an input file is configured:
   * Set the input file as the input path.
   * Mark the job as having an input file.
  * Start the map-reduce job and load the field config. For each field that has a mapping file associated with it, build an internal hash for lookups. When the mapper gets an input tuple, its behavior depends on the input type (see the mapper sketch at the end of this page):
   * If there is no input file, the tuple the mapper receives is the number of rows to be generated, so it generates that many rows.
   * If there is an input file, the tuple the mapper receives is a tuple from the input file, and the mapper appends the generated fields to it.

=== Future Work ===
This implementation is constrained by available memory. For now, we assume that the cardinality of any field that needs a mapping file is less than 2M and that there are no more than 5 such fields; in that case the memory required should be less than 1G for most settings. To support larger cardinalities or more such fields, the DataGenerator would have to generate the data with raw random numbers and then perform an explicit join between the mapping file and the data file.
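The Java sketch below illustrates the mapping-based lookup described in the Algorithm section: a uniformly or Zipf-distributed random number is used only as a key into a pre-generated mapping, so every mapper that shares the same mapping agrees on the value produced for a given key. This is only a sketch under assumed names (MappedFieldGenerator, seedToValue); for simplicity it builds the mapping in memory instead of loading it from the pre-generated mapping file.

{{{
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class MappedFieldGenerator {

    private final Map<Integer, String> seedToValue = new HashMap<Integer, String>();
    private final Random uniform = new Random();   // per-mapper random source
    private final int cardinality;

    public MappedFieldGenerator(int cardinality) {
        this.cardinality = cardinality;
        // In the real job this mapping is pre-generated once, written to a mapping
        // file, and loaded by every mapper at startup; here it is filled in memory
        // so the sketch stays self-contained.
        Random valueGen = new Random(42L);
        for (int seed = 0; seed < cardinality; seed++) {
            seedToValue.put(seed, "str-" + Math.abs(valueGen.nextLong()));
        }
    }

    // Uniformly distributed string field: the uniform random number is only a key
    // into the shared mapping, so mappers that share the same mapping file agree on
    // the value for every key and the combined output stays uniformly distributed.
    public String nextUniformString() {
        int key = uniform.nextInt(cardinality);
        return seedToValue.get(key);
    }

    // Zipf-distributed string field: zipfKey is assumed to come from the third-party
    // Zipf generator and to lie in [0, cardinality).
    public String zipfString(int zipfKey) {
        return seedToValue.get(zipfKey);
    }
}
}}}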
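The sketch below shows how a mapper could handle the two input modes described in the Design section, using the old org.apache.hadoop.mapred API. The class name, the helper generateRow(), and the JobConf key "datagen.has.input.file" are hypothetical and are not taken from the actual DataGenerator code.

{{{
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class DataGenMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, NullWritable, Text> {

    private boolean hasInputFile;

    @Override
    public void configure(JobConf job) {
        // The field config (types, cardinality, distribution, mapping-file names) is
        // assumed to be passed through the JobConf; the flag below marks whether a
        // real input file was configured.
        hasInputFile = job.getBoolean("datagen.has.input.file", false);
        // ... load the field config and build in-memory hashes from the mapping files ...
    }

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<NullWritable, Text> output, Reporter reporter)
            throws IOException {
        if (!hasInputFile) {
            // Each synthetic input file holds a single row: the number of rows this
            // mapper should generate.
            long rows = Long.parseLong(value.toString().trim());
            for (long i = 0; i < rows; i++) {
                output.collect(NullWritable.get(), new Text(generateRow()));
                reporter.progress();
            }
        } else {
            // The value is a tuple from the configured input file; append the newly
            // generated fields to it.
            output.collect(NullWritable.get(),
                    new Text(value.toString() + "\t" + generateRow()));
        }
    }

    // Placeholder for the per-field generation (uniform/Zipf values, NULLs, etc.).
    private String generateRow() {
        return "generated-fields";
    }
}
}}}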
