[Pig Wiki] Update of "DataGeneratorHadoop" by yinghe

Apache Wiki Tue, 04 Aug 2009 11:42:05 -0700

The following page has been changed by yinghe:
http://wiki.apache.org/pig/DataGeneratorHadoop

The comment on the change is:
   

------------------------------------------------------------------------------
- 
  == Make DataGenerator A Hadoop Job ==
  
- === Introduction ===
+ === Background ===
+ The data generator produces tuples with a specified number of fields for 
testing purposes. You can configure the datatype, length, cardinality, 
distribution, and percentage of NULL values for each field. The data generator 
generates random values that match the configuration. Two types of 
distribution are supported: uniform and zipf.
+ 
- The current data generator runs on a single box and is single threaded. Its 
execution time is linear to the amount of data to be generated. When the amount 
of data reaches hundreds of gigabytes, the time required becomes unacceptable. 
In other words, this application is not scalable to deal with large amount of 
data. The goal is to be able to generate data in parallel, so the time can be 
greatly reduced.
+ The current implementation runs on a single box and is single threaded. Its 
execution time is linear in the amount of data to be generated. When the amount 
of data reaches hundreds of gigabytes, the time required becomes unacceptable. 
In other words, this application does not scale to large amounts of data. 
+ 
+ The newer implementation can generate data in hadoop mode. You can specify 
the number of mappers, and each mapper only needs to generate a fraction of the 
data. This can greatly reduce the execution time.
  
  === Algorithm ===
  Tuples generated by the data generator can contain fields that are uniformly 
distributed or Zipf distributed. Generation of both types of fields can be 
split across multiple processors, with each processor generating a fraction of 
the total rows. If M rows are to be generated by N processors, then each 
processor generates M/N rows. When the data from all processors are combined, 
the result should still be uniformly distributed or Zipf distributed.
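
The row split described above can be sketched as follows. This is a minimal illustration, not the DataGenerator's actual code: the names `split_rows` and `pooled_counts` are hypothetical, the handling of a remainder when M is not divisible by N is an assumption (the page only states M/N), and the per-worker seeding is purely for a repeatable demo.

```python
import random
from collections import Counter
from typing import List


def split_rows(total_rows: int, num_mappers: int) -> List[int]:
    """Return one row count per mapper; the counts sum to total_rows."""
    if total_rows < 0 or num_mappers <= 0:
        raise ValueError("total_rows must be >= 0 and num_mappers must be > 0")
    base, extra = divmod(total_rows, num_mappers)
    # Assumption: the first `extra` mappers take one extra row each,
    # so no rows are lost when M is not divisible by N.
    return [base + 1 if i < extra else base for i in range(num_mappers)]


def pooled_counts(total_rows: int, num_mappers: int,
                  cardinality: int, seed: int = 0) -> Counter:
    """Simulate each worker drawing uniform values, then pool the results."""
    counts: Counter = Counter()
    for mapper_id, rows in enumerate(split_rows(total_rows, num_mappers)):
        # Hypothetical per-worker seed, used here only to make the demo repeatable.
        rng = random.Random(seed + mapper_id)
        counts.update(rng.randrange(cardinality) for _ in range(rows))
    return counts
```

For example, `split_rows(10, 3)` returns `[4, 3, 3]`. Because every worker draws from the same distribution independently, the pooled values follow that distribution as well.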
@@ -33, +36 @@

           * If there is no input file, the tuple that the mapper receives is 
the number of rows to be generated, so it generates the specified number of 
rows.
           * If there is an input file, the tuple that the mapper receives is a 
tuple from the input file, and the mapper appends the other fields to it.
  
+ 
+ === Usage ===
+ 
+ Define the following environment variables:
+     * $pigjar: pig.jar
+     * $zipfjar:  sdsuLibJKD12.jar
+     * $datagenjar: jar file that contains DataGenerator class
+     * $conf_file: hadoop-site.xml for your cluster
+ 
+ export HADOOP_CLASSPATH=$pigjar:$zipfjar:$datagenjar
+ 
+ hadoop jar -libjars $zipfjar $datagenjar 
org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file [options] 
colspec...
+ 
+     * options:
+         * -m:  number of mappers to run concurrently to generate data. If not 
configured or equal to 0, it runs in local mode.
+         * -e:  seed value for random numbers. Optional in local mode; cannot 
be set if -m is greater than 0.
+         * -f:  output location. In local mode, an optional output file 
(default is stdout); in hadoop mode, a required output directory.
+         * -i:  optional input file from which lines are read.
+         * -r:  number of rows to output, not required if -i is configured
+         * -s:  optional, separator character, default is ^A
+        
+     * colspec Format: 
columntype:average_size:cardinality:distribution_type:percent_null
+         * columntype:
+              * i = int
+              * l = long
+              * f = float
+              * d = double
+              * s = string
+              * m = map
+              * bx = bag of x, where x is a columntype
+         * distribution_type:
+              * u = uniform
+              * z = zipf
+         * average_size: average size for string types
+ 
+ Examples:
+     s:20:16000:z:7 specifies a string field with an average size of 20, a 
cardinality of 16000, a zipf distribution, and about 7% NULL values.
+ 
+     i:1:20:u:0 specifies an int field with a cardinality of 20, a uniform 
distribution, and no NULL values.
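
To make the colspec format concrete, here is a small parsing sketch. It is not the DataGenerator's own parser: the `ColSpec` class and `parse_colspec` function are hypothetical names, and treating the numeric parts as integers (with percent_null between 0 and 100) is an assumption. Only the field order and the type and distribution codes come from the list above.

```python
from dataclasses import dataclass

# Type and distribution codes as listed in the colspec description.
COLUMN_TYPES = {"i": "int", "l": "long", "f": "float",
                "d": "double", "s": "string", "m": "map"}
DISTRIBUTIONS = {"u": "uniform", "z": "zipf"}


@dataclass
class ColSpec:
    column_type: str
    average_size: int
    cardinality: int
    distribution: str
    percent_null: int


def describe_type(code: str) -> str:
    """Expand a columntype code; 'bx' means a bag of columntype x."""
    if code in COLUMN_TYPES:
        return COLUMN_TYPES[code]
    if code.startswith("b") and len(code) > 1:
        return "bag of " + describe_type(code[1:])
    raise ValueError("unknown columntype: " + repr(code))


def parse_colspec(spec: str) -> ColSpec:
    """Parse columntype:average_size:cardinality:distribution_type:percent_null."""
    parts = spec.split(":")
    if len(parts) != 5:
        raise ValueError("expected 5 colon-separated parts: " + repr(spec))
    ctype, size, card, dist, pct = parts
    if dist not in DISTRIBUTIONS:
        raise ValueError("unknown distribution_type: " + repr(dist))
    percent_null = int(pct)  # assumption: an integer percentage
    if not 0 <= percent_null <= 100:
        raise ValueError("percent_null must be between 0 and 100")
    return ColSpec(describe_type(ctype), int(size), int(card),
                   DISTRIBUTIONS[dist], percent_null)
```

Calling `parse_colspec("s:20:16000:z:7")` yields a `ColSpec` for a string column with average size 20, cardinality 16000, a zipf distribution, and 7 percent NULL values.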
+ 
+ To run it locally, besides using hadoop jar without -m, you can also use:
+ 
+ java -cp $pigjar:$zipfjar:$datagenjar 
org.apache.pig.test.utils.datagen.DataGenerator [options] colspec...
+ 
+ 
  === Future Works ===
  This implementation is constrained by available memory.  For now, we assume 
that the cardinality of any field that needs a mapping file is less than 2M, 
and that there are no more than 5 such fields. In this case, the memory 
required should be less than 1G for most settings. 
  To work with larger cardinalities or more string fields, the DataGenerator 
has to generate data with random numbers and then do an explicit join between 
the mapping file and the data file.
