Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by yinghe: http://wiki.apache.org/pig/DataGeneratorHadoop

New page:

== Make DataGenerator A Hadoop Job ==

=== Introduction ===
The current data generator runs on a single box and is single-threaded, so its execution time grows linearly with the amount of data to be generated. When the amount of data reaches hundreds of gigabytes, the time required becomes unacceptable; in other words, the application does not scale to large amounts of data. The goal is to generate data in parallel so that the time can be greatly reduced.

=== Algorithm ===
Tuples generated by the data generator can contain fields that are uniformly distributed or Zipf distributed. Both types of fields can be split across multiple processors, with each processor generating a fraction of the total rows: if M rows are to be generated by N processors, each processor generates M/N rows. When the data from all processors are combined, the result should still be uniformly or Zipf distributed.

 * Uniformly distributed fields: A random number is generated with the Java Random class for the specified cardinality; these random numbers are uniformly distributed. For integer and long fields the random number is returned directly as the field value, so if N processors use the same cardinality the combined result is still uniformly distributed. For float, double, and string fields the random number is used as a seed to generate the corresponding float/double/string value, and the same seed must always yield the same value. Across N processors this is achieved by pre-generating a mapping from random numbers to actual float/double/string values; each processor loads this mapping at startup and uses it to generate these fields (see the sketch at the end of this page).
 * Zipf distributed fields: A third-party library is used to generate random numbers between 1 and the cardinality following a Zipf distribution. Given a cardinality, the library generates numbers with a fixed density, so for integer and long fields the number is returned directly as the field value and the data combined from multiple processors keeps the same density distribution. For float, double, and string fields the number is used as a seed to produce a float/double/string value, and the same number must yield the same value on every processor. As above, this is achieved by pre-generating a mapping from random numbers to actual float/double/string values; each processor loads this mapping at startup and uses it to generate these fields.

=== Design ===
The data generator is modified to be a Hadoop job.

 * Command line changes
  * A new option -m specifies the number of mappers to run the job. No reducer is required.
  * Option -f is required to specify the output directory.
  * When DataGenerator runs in Hadoop mode, -e (the seed option) is disabled: with multiple mappers generating data, sharing the same seed would make the mappers produce duplicate data.
 * Sequence of execution
  * The data generator pre-generates a mapping file for each string/double/float field with a Zipf or uniform distribution.
  * It creates a config file containing the data type, length, cardinality, distribution type, and percentage of NULLs for each field, with the name of the mapping file from the previous step appended when one was created. The name of this config file is passed to each mapper through the JobConf.
  * If no input file is configured:
   * Create N input files for the N mappers. Each input file has only one row, which contains the number of rows to be generated by that mapper.
   * Mark the job as having no input file.
  * If an input file is configured:
   * Set the input file as the input path.
   * Mark the job as having an input file.
  * Start the map-reduce job and load the field config. For each field that has a mapping file associated with it, build an internal hash for lookups. When the mapper gets an input tuple, its behavior depends on the input type (see the mapper sketch at the end of this page):
   * If there is no input file, the tuple the mapper receives is the number of rows to be generated, so it generates that many rows.
   * If there is an input file, the tuple the mapper receives is a tuple from the input file, and the mapper appends the generated fields to it.

=== Future Work ===
This implementation is constrained by available memory. For now, we assume that the cardinality of any field that needs a mapping file is less than 2M and that there are no more than 5 such fields; in that case the memory required should be less than 1G for most settings. To support larger cardinalities or more such fields, the DataGenerator would have to generate the data with raw random numbers and then perform an explicit join between the mapping file and the data file.
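The Java sketch below illustrates the mapping-based lookup described in the Algorithm section: a uniformly or Zipf-distributed random number is used only as a key into a pre-generated mapping, so every mapper that shares the same mapping agrees on the value produced for a given key. This is only a sketch under assumed names (MappedFieldGenerator, seedToValue); for simplicity it builds the mapping in memory instead of loading it from the pre-generated mapping file.

{{{
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class MappedFieldGenerator {

    private final Map<Integer, String> seedToValue = new HashMap<Integer, String>();
    private final Random uniform = new Random();   // per-mapper random source
    private final int cardinality;

    public MappedFieldGenerator(int cardinality) {
        this.cardinality = cardinality;
        // In the real job this mapping is pre-generated once, written to a mapping
        // file, and loaded by every mapper at startup; here it is filled in memory
        // so the sketch stays self-contained.
        Random valueGen = new Random(42L);
        for (int seed = 0; seed < cardinality; seed++) {
            seedToValue.put(seed, "str-" + Math.abs(valueGen.nextLong()));
        }
    }

    // Uniformly distributed string field: the uniform random number is only a key
    // into the shared mapping, so mappers that share the same mapping file agree on
    // the value for every key and the combined output stays uniformly distributed.
    public String nextUniformString() {
        int key = uniform.nextInt(cardinality);
        return seedToValue.get(key);
    }

    // Zipf-distributed string field: zipfKey is assumed to come from the third-party
    // Zipf generator and to lie in [0, cardinality).
    public String zipfString(int zipfKey) {
        return seedToValue.get(zipfKey);
    }
}
}}}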
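The sketch below shows how a mapper could handle the two input modes described in the Design section, using the old org.apache.hadoop.mapred API. The class name, the helper generateRow(), and the JobConf key "datagen.has.input.file" are hypothetical and are not taken from the actual DataGenerator code.

{{{
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class DataGenMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, NullWritable, Text> {

    private boolean hasInputFile;

    @Override
    public void configure(JobConf job) {
        // The field config (types, cardinality, distribution, mapping-file names) is
        // assumed to be passed through the JobConf; the flag below marks whether a
        // real input file was configured.
        hasInputFile = job.getBoolean("datagen.has.input.file", false);
        // ... load the field config and build in-memory hashes from the mapping files ...
    }

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<NullWritable, Text> output, Reporter reporter)
            throws IOException {
        if (!hasInputFile) {
            // Each synthetic input file holds a single row: the number of rows this
            // mapper should generate.
            long rows = Long.parseLong(value.toString().trim());
            for (long i = 0; i < rows; i++) {
                output.collect(NullWritable.get(), new Text(generateRow()));
                reporter.progress();
            }
        } else {
            // The value is a tuple from the configured input file; append the newly
            // generated fields to it.
            output.collect(NullWritable.get(),
                    new Text(value.toString() + "\t" + generateRow()));
        }
    }

    // Placeholder for the per-field generation (uniform/Zipf values, NULLs, etc.).
    private String generateRow() {
        return "generated-fields";
    }
}
}}}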
