Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by yinghe:
http://wiki.apache.org/pig/DataGeneratorHadoop

The comment on the change is:

------------------------------------------------------------------------------
- == Make DataGenerator A Hadoop Job ==
- === Introduction ===
+ === Background ===
+ The data generator produces tuples with a specified number of fields for testing purposes. You can configure the datatype, length, cardinality, distribution and percentage of NULL values for each field, and the data generator produces random values that match that configuration. Two distributions are supported: uniform and Zipf.
+
- The current data generator runs on a single box and is single threaded. Its execution time is linear in the amount of data to be generated. When the amount of data reaches hundreds of gigabytes, the time required becomes unacceptable. In other words, this application does not scale to large amounts of data. The goal is to be able to generate data in parallel, so the time can be greatly reduced.
+ The current implementation runs on a single box and is single threaded. Its execution time is linear in the amount of data to be generated. When the amount of data reaches hundreds of gigabytes, the time required becomes unacceptable. In other words, this application does not scale to large amounts of data.
+
+ The newer implementation can generate data in hadoop mode. You can specify the number of mappers, and each mapper only needs to generate a fraction of the data. This can greatly reduce the execution time.
  === Algorithm ===
  Tuples generated by the data generator can contain fields that are uniformly distributed or Zipf distributed. Both types of fields can be split across multiple processors, with each processor generating a fraction of the total rows. If M rows are to be generated by N processors, then each processor shall generate M/N rows.
  When the data from all processors are combined, the result should still be uniformly distributed or Zipf distributed.
@@ -33, +36 @@
   * If there is no input file, the tuple that the mapper receives is the number of rows to generate, so the mapper generates that many rows.
   * If there is an input file, the tuple that the mapper receives is a tuple from the input file, and the mapper appends the other fields to it.
+
+ === Usage ===
+
+ Define the following environment variables:
+  * $pigjar: pig.jar
+  * $zipfjar: sdsuLibJKD12.jar
+  * $datagenjar: jar file that contains the DataGenerator class
+  * $conf_file: hadoop-site.xml for your cluster
+
+ export HADOOP_CLASSPATH=$pigjar:$zipfjar:$datagenjar
+
+ hadoop jar -libjars $zipfjar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file [options] colspec...
+
+  * options:
+    * -m: number of mappers to run concurrently to generate data. If not set, or set to 0, the generator runs in local mode.
+    * -e: seed value for random numbers. Optional in local mode; cannot be set if -m is greater than 0.
+    * -f: output location. In local mode, an optional output file (default: stdout); in hadoop mode, the output directory (required).
+    * -i: optional input file from which lines are read.
+    * -r: number of rows to output; not required if -i is set.
+    * -s: optional separator character; default is ^A.
+
+  * colspec format: columntype:average_size:cardinality:distribution_type:percent_null
+    * columntype:
+      * i = int
+      * l = long
+      * f = float
+      * d = double
+      * s = string
+      * m = map
+      * bx = bag of x, where x is a columntype
+    * distribution_type:
+      * u = uniform
+      * z = zipf
+    * average_size: average size for string types
+
+ Examples:
+ s:20:16000:z:7 specifies a string field with an average length of 20 and a cardinality of 16000, Zipf distributed, with about 7% NULL values.
+
+ i:1:20:u:0 specifies an integer field with a cardinality of 20, uniformly distributed, with no NULL values.
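
To make the colspec format concrete, here is a minimal Python sketch that splits a colspec string into its five fields. It is only an illustration of the format described above, not the DataGenerator's own parser: the names `ColSpec`, `describe_type` and `parse_colspec` are hypothetical, and it assumes that average_size, cardinality and percent_null are always integers and that bag types may nest.

```python
from dataclasses import dataclass

# Type and distribution codes as listed in the Usage section.
COLUMN_TYPES = {"i": "int", "l": "long", "f": "float",
                "d": "double", "s": "string", "m": "map"}
DISTRIBUTIONS = {"u": "uniform", "z": "zipf"}


@dataclass
class ColSpec:
    """Hypothetical container for one parsed colspec (not a DataGenerator class)."""
    column_type: str
    average_size: int
    cardinality: int
    distribution: str
    percent_null: int


def describe_type(code: str) -> str:
    """Expand a columntype code; 'bx' is read as a bag of type x."""
    if code in COLUMN_TYPES:
        return COLUMN_TYPES[code]
    if len(code) > 1 and code[0] == "b":
        return "bag of " + describe_type(code[1:])
    raise ValueError(f"unknown column type: {code!r}")


def parse_colspec(spec: str) -> ColSpec:
    """Parse columntype:average_size:cardinality:distribution_type:percent_null."""
    parts = spec.split(":")
    if len(parts) != 5:
        raise ValueError(f"expected 5 colon-separated fields, got {len(parts)}")
    ctype, size, cardinality, dist, pct_null = parts
    if dist not in DISTRIBUTIONS:
        raise ValueError(f"unknown distribution type: {dist!r}")
    # Assumption: the three numeric fields are plain integers.
    return ColSpec(describe_type(ctype), int(size), int(cardinality),
                   DISTRIBUTIONS[dist], int(pct_null))
```

For instance, `parse_colspec("s:20:16000:z:7")` yields a string column with average size 20, cardinality 16000, a zipf distribution and 7 percent NULLs, matching the first example above.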
+
+ To run it locally, besides using hadoop jar without -m, you can also use:
+
+ java -cp $pigjar:$zipfjar:$datagenjar org.apache.pig.test.utils.datagen.DataGenerator [options] colspec...
+
+
  === Future Work ===
  This implementation is constrained by available memory. For now, we assume that the cardinality of a field that needs a mapping file is less than 2M, and that there are no more than 5 such fields. In this case, the memory required should be less than 1G for most settings. To work with larger cardinalities or more string fields, the DataGenerator would have to generate data with random numbers and then do an explicit join between the mapping file and the data file.
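
The explicit join mentioned above could take many forms. As one hedged illustration, the Python sketch below performs a sort-merge join that streams both inputs once, so the mapping never has to be held in memory. It is not part of the DataGenerator, and it assumes things this page does not state: that the mapping file can be read as (integer key, string value) pairs sorted by key, and that the generated data can be sorted by the same key. The function name `merge_join` is hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def merge_join(data_keys: Iterable[int],
               mapping: Iterable[Tuple[int, str]]) -> Iterator[str]:
    """Replace each integer key with its mapped string.

    Assumes both inputs are sorted by key. Each input is consumed once,
    so only one mapping entry is held in memory at a time.
    """
    mapping_iter = iter(mapping)
    current = next(mapping_iter, None)
    for key in data_keys:
        # Advance the mapping until it catches up with the data key.
        while current is not None and current[0] < key:
            current = next(mapping_iter, None)
        if current is None or current[0] != key:
            raise KeyError(f"no mapping entry for key {key}")
        yield current[1]
```

A side effect of this approach is that the output comes out ordered by key rather than in the order the rows were generated, which may or may not matter for a given test data set.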