Ottomata has uploaded a new change for review. https://gerrit.wikimedia.org/r/231782
Change subject: Add separate camus property files for smaller and larger volume webrequest topics
......................................................................

Add separate camus property files for smaller and larger volume webrequest topics

Change-Id: Iecf923d75fc6a93042af7a44960dfc6787f07c36
---
A camus/README.md
A camus/camus.webrequest_maps,misc,mobile.properties
A camus/camus.webrequest_text.properties
A camus/camus.webrequest_upload.properties
4 files changed, 362 insertions(+), 0 deletions(-)


git pull ssh://gerrit.wikimedia.org:29418/analytics/refinery refs/changes/82/231782/1

diff --git a/camus/README.md b/camus/README.md
new file mode 100644
index 0000000..9feb2e0
--- /dev/null
+++ b/camus/README.md
@@ -0,0 +1,14 @@
+# Todo:
+* Move these to Puppet!
+
+# webrequest_*
+The webrequest_maps, webrequest_mobile, and webrequest_misc topics are relatively
+small volume. This Camus job is expected to complete its runs fairly quickly.
+
+webrequest_text and webrequest_upload are both large volume. These are
+run as separate jobs so as not to interfere with each other or with the smaller
+volume webrequest imports.
+
+
+The camus.webrequest.properties file contains all webrequest topics, but as
+of 2015-08-15 it is deprecated.
\ No newline at end of file
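
For context on how these files get used: each properties file drives its own Camus run, so the large text and upload imports cannot starve the smaller topics. A minimal sketch of launching the three jobs independently (the jar path, refinery checkout location, and backgrounding are assumptions; com.linkedin.camus.etl.kafka.CamusJob and its -P properties flag are Camus' stock entry point):

    # Launch each import as its own independent Camus MapReduce job.
    hadoop jar camus.jar com.linkedin.camus.etl.kafka.CamusJob \
        -P /srv/refinery/camus/camus.webrequest_maps,misc,mobile.properties &
    hadoop jar camus.jar com.linkedin.camus.etl.kafka.CamusJob \
        -P /srv/refinery/camus/camus.webrequest_text.properties &
    hadoop jar camus.jar com.linkedin.camus.etl.kafka.CamusJob \
        -P /srv/refinery/camus/camus.webrequest_upload.properties &

kafka.max.pull.minutes.per.task=55 in each file bounds the runtime of a single run, so hourly launches of all three jobs should not pile up on each other.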
diff --git a/camus/camus.webrequest_maps,misc,mobile.properties b/camus/camus.webrequest_maps,misc,mobile.properties
new file mode 100644
index 0000000..69e9ff2
--- /dev/null
+++ b/camus/camus.webrequest_maps,misc,mobile.properties
@@ -0,0 +1,116 @@
+#
+# Camus properties file for consuming the smaller volume webrequest topics
+# (webrequest_maps, webrequest_misc, webrequest_mobile) into HDFS.
+# The consumed data will later have external Hive partitions
+# automatically mapped on top of it.
+#
+
+# Submit this job in the WMF Analytics Cluster's 'essential' queue.
+mapreduce.job.queuename=essential
+
+# Set HDFS umask so that webrequest files and directories created by Camus are not world readable.
+fs.permissions.umask-mode=027
+
+# Final top-level data output directory; a sub-directory will be dynamically created for each topic pulled.
+etl.destination.path=hdfs://analytics-hadoop/wmf/data/raw/webrequest
+# HDFS location where you want to keep execution files, i.e. offsets, error logs, and count files.
+etl.execution.base.path=hdfs://analytics-hadoop/wmf/camus/webrequest
+# Where completed Camus job output directories are kept, usually a sub-dir of the base.path.
+etl.execution.history.path=hdfs://analytics-hadoop/wmf/camus/webrequest/history
+
+# Our timestamps look like 2013-09-20T15:40:17
+camus.message.timestamp.format=yyyy-MM-dd'T'HH:mm:ss
+
+# Use the dt field.
+camus.message.timestamp.field=dt
+
+# Store output into hourly buckets.
+etl.output.file.time.partition.mins=60
+etl.keep.count.files=false
+etl.execution.history.max.of.quota=.8
+
+# Records are delimited by newline.
+etl.output.record.delimiter=\n
+
+# Concrete implementation of the Decoder class to use.
+camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.JsonStringMessageDecoder
+
+# SequenceFileRecordWriterProvider writes the records as Hadoop sequence files
+# so that they can be split even if they are compressed. We Snappy compress these
+# by setting mapreduce.output.fileoutputformat.compress.codec to SnappyCodec
+# in /etc/hadoop/conf/mapred-site.xml.
+etl.record.writer.provider.class=com.linkedin.camus.etl.kafka.common.SequenceFileRecordWriterProvider
+
+# Max Hadoop tasks to use; each task can pull multiple topic partitions.
+# Note: 5*12 = 60 dates from the combined webrequest job, which pulled
+# 5 topics of 12 partitions each.
+mapred.map.tasks=60
+
+# Connection parameters.
+kafka.brokers=analytics1012.eqiad.wmnet:9092,analytics1018.eqiad.wmnet:9092,analytics1021.eqiad.wmnet:9092,analytics1022.eqiad.wmnet:9092
+
+# Max historical time that will be pulled from each partition, based on event timestamp.
+# Note: max.pull.hrs doesn't quite seem to be respected here.
+# This will take some more sleuthing to figure out why, but in our case
+# here it's ok, as we hope never to be this far behind in Kafka messages to
+# consume.
+kafka.max.pull.hrs=168
+# Events with a timestamp older than this will be discarded.
+kafka.max.historical.days=7
+# Max minutes for each mapper to pull messages (-1 means no limit).
+# Let each mapper run for no more than 55 minutes.
+# Camus creates hourly directories, and we don't want a single
+# long-running mapper to keep other Camus jobs from being launched.
+kafka.max.pull.minutes.per.task=55
+
+# If the whitelist has values, only whitelisted topics are pulled; nothing on the blacklist is pulled.
+kafka.blacklist.topics=
+
+# These are the Kafka topics Camus brings to HDFS.
+kafka.whitelist.topics=webrequest_maps,webrequest_mobile,webrequest_misc
+
+# Name of the client as seen by Kafka.
+kafka.client.name=camus-webrequest-00
+
+# Fetch request parameters:
+#kafka.fetch.buffer.size=
+#kafka.fetch.request.correlationid=
+#kafka.fetch.request.max.wait=
+#kafka.fetch.request.min.bytes=
+
+kafka.client.buffer.size=20971520
+kafka.client.so.timeout=60000
+
+
+# Controls the submission of counts to Kafka.
+# Default value is true.
+post.tracking.counts.to.kafka=false
+
+# Stops the mapper from getting inundated with Decoder exceptions for the same topic.
+# Default value is 10.
+max.decoder.exceptions.to.print=5
+
+log4j.configuration=false
+
+# Everything below this point can be ignored for the time being; more documentation will follow down the road.
+##########################
+etl.run.tracking.post=false
+#kafka.monitor.tier=
+kafka.monitor.time.granularity=10
+
+etl.hourly=hourly
+etl.daily=daily
+etl.ignore.schema.errors=false
+
+# WMF relies on the relevant Hadoop properties for this,
+# not Camus' custom properties,
+# i.e. the mapreduce.output.compression* properties.
+# # configure output compression for deflate or snappy. Defaults to deflate.
+# etl.output.codec=deflate
+# etl.deflate.level=6
+# #etl.output.codec=snappy
+
+etl.default.timezone=UTC
+etl.output.file.time.partition.mins=60
+etl.keep.count.files=false
+#etl.counts.path=
+etl.execution.history.max.of.quota=.8
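
To illustrate camus.message.timestamp.field and JsonStringMessageDecoder above: each Kafka message is one JSON object, and Camus buckets it into an hourly directory by parsing the record's dt field with the yyyy-MM-dd'T'HH:mm:ss format. A trimmed, purely illustrative record (real webrequest records carry many more fields):

    {"hostname": "cp1065.eqiad.wmnet", "dt": "2015-08-15T14:23:05",
     "uri_host": "en.wikipedia.org", "http_status": "200"}

Per the comments in the file, a record whose dt is older than kafka.max.historical.days=7 is discarded rather than written to a stale hourly bucket.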
diff --git a/camus/camus.webrequest_text.properties b/camus/camus.webrequest_text.properties
new file mode 100644
index 0000000..19154e9
--- /dev/null
+++ b/camus/camus.webrequest_text.properties
@@ -0,0 +1,116 @@
+#
+# Camus properties file for consuming the large volume webrequest_text
+# topic into HDFS.
+# The consumed data will later have external Hive partitions
+# automatically mapped on top of it.
+#
+
+# Submit this job in the WMF Analytics Cluster's 'essential' queue.
+mapreduce.job.queuename=essential
+
+# Set HDFS umask so that webrequest files and directories created by Camus are not world readable.
+fs.permissions.umask-mode=027
+
+# Final top-level data output directory; a sub-directory will be dynamically created for each topic pulled.
+etl.destination.path=hdfs://analytics-hadoop/wmf/data/raw/webrequest
+# HDFS location where you want to keep execution files, i.e. offsets, error logs, and count files.
+etl.execution.base.path=hdfs://analytics-hadoop/wmf/camus/webrequest
+# Where completed Camus job output directories are kept, usually a sub-dir of the base.path.
+etl.execution.history.path=hdfs://analytics-hadoop/wmf/camus/webrequest/history
+
+# Our timestamps look like 2013-09-20T15:40:17
+camus.message.timestamp.format=yyyy-MM-dd'T'HH:mm:ss
+
+# Use the dt field.
+camus.message.timestamp.field=dt
+
+# Store output into hourly buckets.
+etl.output.file.time.partition.mins=60
+etl.keep.count.files=false
+etl.execution.history.max.of.quota=.8
+
+# Records are delimited by newline.
+etl.output.record.delimiter=\n
+
+# Concrete implementation of the Decoder class to use.
+camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.JsonStringMessageDecoder
+
+# SequenceFileRecordWriterProvider writes the records as Hadoop sequence files
+# so that they can be split even if they are compressed. We Snappy compress these
+# by setting mapreduce.output.fileoutputformat.compress.codec to SnappyCodec
+# in /etc/hadoop/conf/mapred-site.xml.
+etl.record.writer.provider.class=com.linkedin.camus.etl.kafka.common.SequenceFileRecordWriterProvider
+
+# Max Hadoop tasks to use; each task can pull multiple topic partitions.
+# Note: 5*12 = 60 dates from the combined webrequest job, which pulled
+# 5 topics of 12 partitions each.
+mapred.map.tasks=60
+
+# Connection parameters.
+kafka.brokers=analytics1012.eqiad.wmnet:9092,analytics1018.eqiad.wmnet:9092,analytics1021.eqiad.wmnet:9092,analytics1022.eqiad.wmnet:9092
+
+# Max historical time that will be pulled from each partition, based on event timestamp.
+# Note: max.pull.hrs doesn't quite seem to be respected here.
+# This will take some more sleuthing to figure out why, but in our case
+# here it's ok, as we hope never to be this far behind in Kafka messages to
+# consume.
+kafka.max.pull.hrs=168
+# Events with a timestamp older than this will be discarded.
+kafka.max.historical.days=7
+# Max minutes for each mapper to pull messages (-1 means no limit).
+# Let each mapper run for no more than 55 minutes.
+# Camus creates hourly directories, and we don't want a single
+# long-running mapper to keep other Camus jobs from being launched.
+kafka.max.pull.minutes.per.task=55
+
+# If the whitelist has values, only whitelisted topics are pulled; nothing on the blacklist is pulled.
+kafka.blacklist.topics=
+
+# These are the Kafka topics Camus brings to HDFS.
+kafka.whitelist.topics=webrequest_text
+
+# Name of the client as seen by Kafka.
+kafka.client.name=camus-webrequest-00
+
+# Fetch request parameters:
+#kafka.fetch.buffer.size=
+#kafka.fetch.request.correlationid=
+#kafka.fetch.request.max.wait=
+#kafka.fetch.request.min.bytes=
+
+kafka.client.buffer.size=20971520
+kafka.client.so.timeout=60000
+
+
+# Controls the submission of counts to Kafka.
+# Default value is true.
+post.tracking.counts.to.kafka=false
+
+# Stops the mapper from getting inundated with Decoder exceptions for the same topic.
+# Default value is 10.
+max.decoder.exceptions.to.print=5
+
+log4j.configuration=false
+
+# Everything below this point can be ignored for the time being; more documentation will follow down the road.
+##########################
+etl.run.tracking.post=false
+#kafka.monitor.tier=
+kafka.monitor.time.granularity=10
+
+etl.hourly=hourly
+etl.daily=daily
+etl.ignore.schema.errors=false
+
+# WMF relies on the relevant Hadoop properties for this,
+# not Camus' custom properties,
+# i.e. the mapreduce.output.compression* properties.
+# # configure output compression for deflate or snappy. Defaults to deflate.
+# etl.output.codec=deflate
+# etl.deflate.level=6
+# #etl.output.codec=snappy
+
+etl.default.timezone=UTC
+etl.output.file.time.partition.mins=60
+etl.keep.count.files=false
+#etl.counts.path=
+etl.execution.history.max.of.quota=.8
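
Given etl.destination.path and the hourly partitioning above, Camus lands each topic in per-hour directories of Snappy-compressed sequence files. A sketch of the expected layout (the hourly/YYYY/MM/DD/HH convention follows from etl.hourly and the 60-minute partition setting; exact file names vary per run and are not shown):

    /wmf/data/raw/webrequest/
        webrequest_text/hourly/2015/08/15/14/   <- sequence files for one hour
        webrequest_maps/hourly/2015/08/15/14/
        ...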
diff --git a/camus/camus.webrequest_upload.properties b/camus/camus.webrequest_upload.properties
new file mode 100644
index 0000000..c0aa006
--- /dev/null
+++ b/camus/camus.webrequest_upload.properties
@@ -0,0 +1,116 @@
+#
+# Camus properties file for consuming the large volume webrequest_upload
+# topic into HDFS.
+# The consumed data will later have external Hive partitions
+# automatically mapped on top of it.
+#
+
+# Submit this job in the WMF Analytics Cluster's 'essential' queue.
+mapreduce.job.queuename=essential
+
+# Set HDFS umask so that webrequest files and directories created by Camus are not world readable.
+fs.permissions.umask-mode=027
+
+# Final top-level data output directory; a sub-directory will be dynamically created for each topic pulled.
+etl.destination.path=hdfs://analytics-hadoop/wmf/data/raw/webrequest
+# HDFS location where you want to keep execution files, i.e. offsets, error logs, and count files.
+etl.execution.base.path=hdfs://analytics-hadoop/wmf/camus/webrequest
+# Where completed Camus job output directories are kept, usually a sub-dir of the base.path.
+etl.execution.history.path=hdfs://analytics-hadoop/wmf/camus/webrequest/history
+
+# Our timestamps look like 2013-09-20T15:40:17
+camus.message.timestamp.format=yyyy-MM-dd'T'HH:mm:ss
+
+# Use the dt field.
+camus.message.timestamp.field=dt
+
+# Store output into hourly buckets.
+etl.output.file.time.partition.mins=60
+etl.keep.count.files=false
+etl.execution.history.max.of.quota=.8
+
+# Records are delimited by newline.
+etl.output.record.delimiter=\n
+
+# Concrete implementation of the Decoder class to use.
+camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.JsonStringMessageDecoder
+
+# SequenceFileRecordWriterProvider writes the records as Hadoop sequence files
+# so that they can be split even if they are compressed. We Snappy compress these
+# by setting mapreduce.output.fileoutputformat.compress.codec to SnappyCodec
+# in /etc/hadoop/conf/mapred-site.xml.
+etl.record.writer.provider.class=com.linkedin.camus.etl.kafka.common.SequenceFileRecordWriterProvider
+
+# Max Hadoop tasks to use; each task can pull multiple topic partitions.
+# Note: 5*12 = 60 dates from the combined webrequest job, which pulled
+# 5 topics of 12 partitions each.
+mapred.map.tasks=60
+
+# Connection parameters.
+kafka.brokers=analytics1012.eqiad.wmnet:9092,analytics1018.eqiad.wmnet:9092,analytics1021.eqiad.wmnet:9092,analytics1022.eqiad.wmnet:9092
+
+# Max historical time that will be pulled from each partition, based on event timestamp.
+# Note: max.pull.hrs doesn't quite seem to be respected here.
+# This will take some more sleuthing to figure out why, but in our case
+# here it's ok, as we hope never to be this far behind in Kafka messages to
+# consume.
+kafka.max.pull.hrs=168
+# Events with a timestamp older than this will be discarded.
+kafka.max.historical.days=7
+# Max minutes for each mapper to pull messages (-1 means no limit).
+# Let each mapper run for no more than 55 minutes.
+# Camus creates hourly directories, and we don't want a single
+# long-running mapper to keep other Camus jobs from being launched.
+kafka.max.pull.minutes.per.task=55
+
+# If the whitelist has values, only whitelisted topics are pulled; nothing on the blacklist is pulled.
+kafka.blacklist.topics=
+
+# These are the Kafka topics Camus brings to HDFS.
+kafka.whitelist.topics=webrequest_upload
+
+# Name of the client as seen by Kafka.
+kafka.client.name=camus-webrequest-00
+
+# Fetch request parameters:
+#kafka.fetch.buffer.size=
+#kafka.fetch.request.correlationid=
+#kafka.fetch.request.max.wait=
+#kafka.fetch.request.min.bytes=
+
+kafka.client.buffer.size=20971520
+kafka.client.so.timeout=60000
+
+
+# Controls the submission of counts to Kafka.
+# Default value is true.
+post.tracking.counts.to.kafka=false
+
+# Stops the mapper from getting inundated with Decoder exceptions for the same topic.
+# Default value is 10.
+max.decoder.exceptions.to.print=5
+
+log4j.configuration=false
+
+# Everything below this point can be ignored for the time being; more documentation will follow down the road.
+##########################
+etl.run.tracking.post=false
+#kafka.monitor.tier=
+kafka.monitor.time.granularity=10
+
+etl.hourly=hourly
+etl.daily=daily
+etl.ignore.schema.errors=false
+
+# WMF relies on the relevant Hadoop properties for this,
+# not Camus' custom properties,
+# i.e. the mapreduce.output.compression* properties.
+# # configure output compression for deflate or snappy. Defaults to deflate.
+# etl.output.codec=deflate
+# etl.deflate.level=6
+# #etl.output.codec=snappy
+
+etl.default.timezone=UTC
+etl.output.file.time.partition.mins=60
+etl.keep.count.files=false
+#etl.counts.path=
+etl.execution.history.max.of.quota=.8
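
As the header comment in each file says, external Hive partitions are mapped on top of the imported directories afterwards. A hedged sketch of that mapping for one imported hour (the wmf_raw.webrequest table name and its partition columns are assumptions, not part of this change; ALTER TABLE ... ADD PARTITION ... LOCATION is standard Hive DDL):

    hive -e "
      ALTER TABLE wmf_raw.webrequest
      ADD IF NOT EXISTS PARTITION (webrequest_source='text', year=2015, month=8, day=15, hour=14)
      LOCATION '/wmf/data/raw/webrequest/webrequest_text/hourly/2015/08/15/14';
    "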
--
To view, visit https://gerrit.wikimedia.org/r/231782
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Iecf923d75fc6a93042af7a44960dfc6787f07c36
Gerrit-PatchSet: 1
Gerrit-Project: analytics/refinery
Gerrit-Branch: master
Gerrit-Owner: Ottomata <[email protected]>
