Ottomata has uploaded a new change for review. https://gerrit.wikimedia.org/r/231782
Change subject: Add separate camus property files for smaller and larger volume webrequest topics
......................................................................

Add separate camus property files for smaller and larger volume webrequest topics

Change-Id: Iecf923d75fc6a93042af7a44960dfc6787f07c36
---
A camus/README.md
A camus/camus.webrequest_maps,misc,mobile.properties
A camus/camus.webrequest_text.properties
A camus/camus.webrequest_upload.properties
4 files changed, 362 insertions(+), 0 deletions(-)


git pull ssh://gerrit.wikimedia.org:29418/analytics/refinery refs/changes/82/231782/1

diff --git a/camus/README.md b/camus/README.md
new file mode 100644
index 0000000..9feb2e0
--- /dev/null
+++ b/camus/README.md
@@ -0,0 +1,14 @@
+# Todo:
+* Move these to Puppet!
+
+# webrequest_*
+The webrequest_maps, webrequest_mobile, and webrequest_misc topics are relatively
+small volume. This Camus job is expected to complete its runs fairly quickly.
+
+webrequest_text and webrequest_upload are both large volume. These are
+run as separate jobs so as not to interfere with each other or with the smaller
+volume webrequest imports.
+
+
+The camus.webrequest.properties file contains all webrequest topics, but as
+of 2015-08-15 it is deprecated.
\ No newline at end of file
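
For context on how these files get used: each properties file drives its own Camus run, so the large text and upload imports cannot starve the smaller topics. A minimal sketch of launching the three jobs independently (the jar path, refinery checkout location, and backgrounding are assumptions; com.linkedin.camus.etl.kafka.CamusJob and its -P properties flag are Camus' stock entry point):

    # Launch each import as its own independent Camus MapReduce job.
    hadoop jar camus.jar com.linkedin.camus.etl.kafka.CamusJob \
        -P /srv/refinery/camus/camus.webrequest_maps,misc,mobile.properties &
    hadoop jar camus.jar com.linkedin.camus.etl.kafka.CamusJob \
        -P /srv/refinery/camus/camus.webrequest_text.properties &
    hadoop jar camus.jar com.linkedin.camus.etl.kafka.CamusJob \
        -P /srv/refinery/camus/camus.webrequest_upload.properties &

kafka.max.pull.minutes.per.task=55 in each file bounds the runtime of a single run, so hourly launches of all three jobs should not pile up on each other.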
diff --git a/camus/camus.webrequest_maps,misc,mobile.properties b/camus/camus.webrequest_maps,misc,mobile.properties
new file mode 100644
index 0000000..69e9ff2
--- /dev/null
+++ b/camus/camus.webrequest_maps,misc,mobile.properties
@@ -0,0 +1,116 @@
+#
+# Camus properties file for consuming the smaller volume webrequest topics
+# (webrequest_maps, webrequest_misc, webrequest_mobile) into HDFS.
+# The consumed data will later have external Hive partitions
+# automatically mapped on top of it.
+#
+
+# Submit this job in the WMF Analytics Cluster's 'essential' queue.
+mapreduce.job.queuename=essential
+
+# Set HDFS umask so that webrequest files and directories created by Camus are not world readable.
+fs.permissions.umask-mode=027
+
+# Final top-level data output directory; a sub-directory will be dynamically created for each topic pulled.
+etl.destination.path=hdfs://analytics-hadoop/wmf/data/raw/webrequest
+# HDFS location where you want to keep execution files, i.e. offsets, error logs, and count files.
+etl.execution.base.path=hdfs://analytics-hadoop/wmf/camus/webrequest
+# Where completed Camus job output directories are kept, usually a sub-dir of the base.path.
+etl.execution.history.path=hdfs://analytics-hadoop/wmf/camus/webrequest/history
+
+# Our timestamps look like 2013-09-20T15:40:17
+camus.message.timestamp.format=yyyy-MM-dd'T'HH:mm:ss
+
+# Use the dt field.
+camus.message.timestamp.field=dt
+
+# Store output into hourly buckets.
+etl.output.file.time.partition.mins=60
+etl.keep.count.files=false
+etl.execution.history.max.of.quota=.8
+
+# Records are delimited by newline.
+etl.output.record.delimiter=\n
+
+# Concrete implementation of the Decoder class to use.
+camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.JsonStringMessageDecoder
+
+# SequenceFileRecordWriterProvider writes the records as Hadoop sequence files
+# so that they can be split even if they are compressed. We Snappy compress these
+# by setting mapreduce.output.fileoutputformat.compress.codec to SnappyCodec
+# in /etc/hadoop/conf/mapred-site.xml.
+etl.record.writer.provider.class=com.linkedin.camus.etl.kafka.common.SequenceFileRecordWriterProvider
+
+# Max Hadoop tasks to use; each task can pull multiple topic partitions.
+# Note: 5*12 = 60 dates from the combined webrequest job, which pulled
+# 5 topics of 12 partitions each.
+mapred.map.tasks=60
+
+# Connection parameters.
+kafka.brokers=analytics1012.eqiad.wmnet:9092,analytics1018.eqiad.wmnet:9092,analytics1021.eqiad.wmnet:9092,analytics1022.eqiad.wmnet:9092
+
+# Max historical time that will be pulled from each partition, based on event timestamp.
+# Note: max.pull.hrs doesn't quite seem to be respected here.
+# This will take some more sleuthing to figure out why, but in our case
+# here it's ok, as we hope never to be this far behind in Kafka messages to
+# consume.
+kafka.max.pull.hrs=168
+# Events with a timestamp older than this will be discarded.
+kafka.max.historical.days=7
+# Max minutes for each mapper to pull messages (-1 means no limit).
+# Let each mapper run for no more than 55 minutes.
+# Camus creates hourly directories, and we don't want a single
+# long-running mapper to keep other Camus jobs from being launched.
+kafka.max.pull.minutes.per.task=55
+
+# If the whitelist has values, only whitelisted topics are pulled; nothing on the blacklist is pulled.
+kafka.blacklist.topics=
+
+# These are the Kafka topics Camus brings to HDFS.
+kafka.whitelist.topics=webrequest_maps,webrequest_mobile,webrequest_misc
+
+# Name of the client as seen by Kafka.
+kafka.client.name=camus-webrequest-00
+
+# Fetch request parameters:
+#kafka.fetch.buffer.size=
+#kafka.fetch.request.correlationid=
+#kafka.fetch.request.max.wait=
+#kafka.fetch.request.min.bytes=
+
+kafka.client.buffer.size=20971520
+kafka.client.so.timeout=60000
+
+
+# Controls the submission of counts to Kafka.
+# Default value is true.
+post.tracking.counts.to.kafka=false
+
+# Stops the mapper from getting inundated with Decoder exceptions for the same topic.
+# Default value is 10.
+max.decoder.exceptions.to.print=5
+
+log4j.configuration=false
+
+# Everything below this point can be ignored for the time being; more documentation will follow down the road.
+##########################
+etl.run.tracking.post=false
+#kafka.monitor.tier=
+kafka.monitor.time.granularity=10
+
+etl.hourly=hourly
+etl.daily=daily
+etl.ignore.schema.errors=false
+
+# WMF relies on the relevant Hadoop properties for this,
+# not Camus' custom properties,
+# i.e. the mapreduce.output.compression* properties.
+# # configure output compression for deflate or snappy. Defaults to deflate.
+# etl.output.codec=deflate
+# etl.deflate.level=6
+# #etl.output.codec=snappy
+
+etl.default.timezone=UTC
+etl.output.file.time.partition.mins=60
+etl.keep.count.files=false
+#etl.counts.path=
+etl.execution.history.max.of.quota=.8
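
To illustrate camus.message.timestamp.field and JsonStringMessageDecoder above: each Kafka message is one JSON object, and Camus buckets it into an hourly directory by parsing the record's dt field with the yyyy-MM-dd'T'HH:mm:ss format. A trimmed, purely illustrative record (real webrequest records carry many more fields):

    {"hostname": "cp1065.eqiad.wmnet", "dt": "2015-08-15T14:23:05",
     "uri_host": "en.wikipedia.org", "http_status": "200"}

Per the comments in the file, a record whose dt is older than kafka.max.historical.days=7 is discarded rather than written to a stale hourly bucket.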
diff --git a/camus/camus.webrequest_text.properties b/camus/camus.webrequest_text.properties
new file mode 100644
index 0000000..19154e9
--- /dev/null
+++ b/camus/camus.webrequest_text.properties
@@ -0,0 +1,116 @@
+#
+# Camus properties file for consuming the large volume webrequest_text
+# topic into HDFS.
+# The consumed data will later have external Hive partitions
+# automatically mapped on top of it.
+#
+
+# Submit this job in the WMF Analytics Cluster's 'essential' queue.
+mapreduce.job.queuename=essential
+
+# Set HDFS umask so that webrequest files and directories created by Camus are not world readable.
+fs.permissions.umask-mode=027
+
+# Final top-level data output directory; a sub-directory will be dynamically created for each topic pulled.
+etl.destination.path=hdfs://analytics-hadoop/wmf/data/raw/webrequest
+# HDFS location where you want to keep execution files, i.e. offsets, error logs, and count files.
+etl.execution.base.path=hdfs://analytics-hadoop/wmf/camus/webrequest
+# Where completed Camus job output directories are kept, usually a sub-dir of the base.path.
+etl.execution.history.path=hdfs://analytics-hadoop/wmf/camus/webrequest/history
+
+# Our timestamps look like 2013-09-20T15:40:17
+camus.message.timestamp.format=yyyy-MM-dd'T'HH:mm:ss
+
+# Use the dt field.
+camus.message.timestamp.field=dt
+
+# Store output into hourly buckets.
+etl.output.file.time.partition.mins=60
+etl.keep.count.files=false
+etl.execution.history.max.of.quota=.8
+
+# Records are delimited by newline.
+etl.output.record.delimiter=\n
+
+# Concrete implementation of the Decoder class to use.
+camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.JsonStringMessageDecoder
+
+# SequenceFileRecordWriterProvider writes the records as Hadoop sequence files
+# so that they can be split even if they are compressed. We Snappy compress these
+# by setting mapreduce.output.fileoutputformat.compress.codec to SnappyCodec
+# in /etc/hadoop/conf/mapred-site.xml.
+etl.record.writer.provider.class=com.linkedin.camus.etl.kafka.common.SequenceFileRecordWriterProvider
+
+# Max Hadoop tasks to use; each task can pull multiple topic partitions.
+# Note: 5*12 = 60 dates from the combined webrequest job, which pulled
+# 5 topics of 12 partitions each.
+mapred.map.tasks=60
+
+# Connection parameters.
+kafka.brokers=analytics1012.eqiad.wmnet:9092,analytics1018.eqiad.wmnet:9092,analytics1021.eqiad.wmnet:9092,analytics1022.eqiad.wmnet:9092
+
+# Max historical time that will be pulled from each partition, based on event timestamp.
+# Note: max.pull.hrs doesn't quite seem to be respected here.
+# This will take some more sleuthing to figure out why, but in our case
+# here it's ok, as we hope never to be this far behind in Kafka messages to
+# consume.
+kafka.max.pull.hrs=168
+# Events with a timestamp older than this will be discarded.
+kafka.max.historical.days=7
+# Max minutes for each mapper to pull messages (-1 means no limit).
+# Let each mapper run for no more than 55 minutes.
+# Camus creates hourly directories, and we don't want a single
+# long-running mapper to keep other Camus jobs from being launched.
+kafka.max.pull.minutes.per.task=55
+
+# If the whitelist has values, only whitelisted topics are pulled; nothing on the blacklist is pulled.
+kafka.blacklist.topics=
+
+# These are the Kafka topics Camus brings to HDFS.
+kafka.whitelist.topics=webrequest_text
+
+# Name of the client as seen by Kafka.
+kafka.client.name=camus-webrequest-00
+
+# Fetch request parameters:
+#kafka.fetch.buffer.size=
+#kafka.fetch.request.correlationid=
+#kafka.fetch.request.max.wait=
+#kafka.fetch.request.min.bytes=
+
+kafka.client.buffer.size=20971520
+kafka.client.so.timeout=60000
+
+
+# Controls the submission of counts to Kafka.
+# Default value is true.
+post.tracking.counts.to.kafka=false
+
+# Stops the mapper from getting inundated with Decoder exceptions for the same topic.
+# Default value is 10.
+max.decoder.exceptions.to.print=5
+
+log4j.configuration=false
+
+# Everything below this point can be ignored for the time being; more documentation will follow down the road.
+##########################
+etl.run.tracking.post=false
+#kafka.monitor.tier=
+kafka.monitor.time.granularity=10
+
+etl.hourly=hourly
+etl.daily=daily
+etl.ignore.schema.errors=false
+
+# WMF relies on the relevant Hadoop properties for this,
+# not Camus' custom properties,
+# i.e. the mapreduce.output.compression* properties.
+# # configure output compression for deflate or snappy. Defaults to deflate.
+# etl.output.codec=deflate
+# etl.deflate.level=6
+# #etl.output.codec=snappy
+
+etl.default.timezone=UTC
+etl.output.file.time.partition.mins=60
+etl.keep.count.files=false
+#etl.counts.path=
+etl.execution.history.max.of.quota=.8
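
Given etl.destination.path and the hourly partitioning above, Camus lands each topic in per-hour directories of Snappy-compressed sequence files. A sketch of the expected layout (the hourly/YYYY/MM/DD/HH convention follows from etl.hourly and the 60-minute partition setting; exact file names vary per run and are not shown):

    /wmf/data/raw/webrequest/
        webrequest_text/hourly/2015/08/15/14/   <- sequence files for one hour
        webrequest_maps/hourly/2015/08/15/14/
        ...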
diff --git a/camus/camus.webrequest_upload.properties b/camus/camus.webrequest_upload.properties
new file mode 100644
index 0000000..c0aa006
--- /dev/null
+++ b/camus/camus.webrequest_upload.properties
@@ -0,0 +1,116 @@
+#
+# Camus properties file for consuming the large volume webrequest_upload
+# topic into HDFS.
+# The consumed data will later have external Hive partitions
+# automatically mapped on top of it.
+#
+
+# Submit this job in the WMF Analytics Cluster's 'essential' queue.
+mapreduce.job.queuename=essential
+
+# Set HDFS umask so that webrequest files and directories created by Camus are not world readable.
+fs.permissions.umask-mode=027
+
+# Final top-level data output directory; a sub-directory will be dynamically created for each topic pulled.
+etl.destination.path=hdfs://analytics-hadoop/wmf/data/raw/webrequest
+# HDFS location where you want to keep execution files, i.e. offsets, error logs, and count files.
+etl.execution.base.path=hdfs://analytics-hadoop/wmf/camus/webrequest
+# Where completed Camus job output directories are kept, usually a sub-dir of the base.path.
+etl.execution.history.path=hdfs://analytics-hadoop/wmf/camus/webrequest/history
+
+# Our timestamps look like 2013-09-20T15:40:17
+camus.message.timestamp.format=yyyy-MM-dd'T'HH:mm:ss
+
+# Use the dt field.
+camus.message.timestamp.field=dt
+
+# Store output into hourly buckets.
+etl.output.file.time.partition.mins=60
+etl.keep.count.files=false
+etl.execution.history.max.of.quota=.8
+
+# Records are delimited by newline.
+etl.output.record.delimiter=\n
+
+# Concrete implementation of the Decoder class to use.
+camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.JsonStringMessageDecoder
+
+# SequenceFileRecordWriterProvider writes the records as Hadoop sequence files
+# so that they can be split even if they are compressed. We Snappy compress these
+# by setting mapreduce.output.fileoutputformat.compress.codec to SnappyCodec
+# in /etc/hadoop/conf/mapred-site.xml.
+etl.record.writer.provider.class=com.linkedin.camus.etl.kafka.common.SequenceFileRecordWriterProvider
+
+# Max Hadoop tasks to use; each task can pull multiple topic partitions.
+# Note: 5*12 = 60 dates from the combined webrequest job, which pulled
+# 5 topics of 12 partitions each.
+mapred.map.tasks=60
+
+# Connection parameters.
+kafka.brokers=analytics1012.eqiad.wmnet:9092,analytics1018.eqiad.wmnet:9092,analytics1021.eqiad.wmnet:9092,analytics1022.eqiad.wmnet:9092
+
+# Max historical time that will be pulled from each partition, based on event timestamp.
+# Note: max.pull.hrs doesn't quite seem to be respected here.
+# This will take some more sleuthing to figure out why, but in our case
+# here it's ok, as we hope never to be this far behind in Kafka messages to
+# consume.
+kafka.max.pull.hrs=168
+# Events with a timestamp older than this will be discarded.
+kafka.max.historical.days=7
+# Max minutes for each mapper to pull messages (-1 means no limit).
+# Let each mapper run for no more than 55 minutes.
+# Camus creates hourly directories, and we don't want a single
+# long-running mapper to keep other Camus jobs from being launched.
+kafka.max.pull.minutes.per.task=55
+
+# If the whitelist has values, only whitelisted topics are pulled; nothing on the blacklist is pulled.
+kafka.blacklist.topics=
+
+# These are the Kafka topics Camus brings to HDFS.
+kafka.whitelist.topics=webrequest_upload
+
+# Name of the client as seen by Kafka.
+kafka.client.name=camus-webrequest-00
+
+# Fetch request parameters:
+#kafka.fetch.buffer.size=
+#kafka.fetch.request.correlationid=
+#kafka.fetch.request.max.wait=
+#kafka.fetch.request.min.bytes=
+
+kafka.client.buffer.size=20971520
+kafka.client.so.timeout=60000
+
+
+# Controls the submission of counts to Kafka.
+# Default value is true.
+post.tracking.counts.to.kafka=false
+
+# Stops the mapper from getting inundated with Decoder exceptions for the same topic.
+# Default value is 10.
+max.decoder.exceptions.to.print=5
+
+log4j.configuration=false
+
+# Everything below this point can be ignored for the time being; more documentation will follow down the road.
+##########################
+etl.run.tracking.post=false
+#kafka.monitor.tier=
+kafka.monitor.time.granularity=10
+
+etl.hourly=hourly
+etl.daily=daily
+etl.ignore.schema.errors=false
+
+# WMF relies on the relevant Hadoop properties for this,
+# not Camus' custom properties,
+# i.e. the mapreduce.output.compression* properties.
+# # configure output compression for deflate or snappy. Defaults to deflate.
+# etl.output.codec=deflate
+# etl.deflate.level=6
+# #etl.output.codec=snappy
+
+etl.default.timezone=UTC
+etl.output.file.time.partition.mins=60
+etl.keep.count.files=false
+#etl.counts.path=
+etl.execution.history.max.of.quota=.8
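
As the header comment in each file says, external Hive partitions are mapped on top of the imported directories afterwards. A hedged sketch of that mapping for one imported hour (the wmf_raw.webrequest table name and its partition columns are assumptions, not part of this change; ALTER TABLE ... ADD PARTITION ... LOCATION is standard Hive DDL):

    hive -e "
      ALTER TABLE wmf_raw.webrequest
      ADD IF NOT EXISTS PARTITION (webrequest_source='text', year=2015, month=8, day=15, hour=14)
      LOCATION '/wmf/data/raw/webrequest/webrequest_text/hourly/2015/08/15/14';
    "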
--
To view, visit https://gerrit.wikimedia.org/r/231782
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Iecf923d75fc6a93042af7a44960dfc6787f07c36
Gerrit-PatchSet: 1
Gerrit-Project: analytics/refinery
Gerrit-Branch: master
Gerrit-Owner: Ottomata <[email protected]>
