EBernhardson has uploaded a new change for review.
https://gerrit.wikimedia.org/r/265890
Change subject: Correct invalid xml and add some debugging to popularity_score
......................................................................
Correct invalid xml and add some debugging to popularity_score
While trying to deploy this today I found a few errors in
the repository. Additionally, I added a few debug messages to
popularityScore.py to make the output logs slightly more
useful.
Change-Id: Ia5eb7a4bd2e1fdfb920c9ee95ce3cf971a52d707
---
M oozie/popularity_score/popularityScore.py
M oozie/popularity_score/workflow.xml
2 files changed, 7 insertions(+), 2 deletions(-)
git pull ssh://gerrit.wikimedia.org:29418/wikimedia/discovery/analytics
refs/changes/90/265890/1
diff --git a/oozie/popularity_score/popularityScore.py b/oozie/popularity_score/popularityScore.py
index e58013d..5a75c64 100644
--- a/oozie/popularity_score/popularityScore.py
+++ b/oozie/popularity_score/popularityScore.py
@@ -1,4 +1,5 @@
import pyspark
+import pyspark.sql
import pyspark.sql.functions
import pyspark.sql.types
import argparse
@@ -79,6 +80,8 @@
sqlContext = pyspark.sql.SQLContext(sc)
parquetPaths = pageViewHourlyPathList(args.source_dir, args.start_date,
args.end_date)
+ print("loading pageview data from:")
+ print("\t" + "\n\t".join(parquetPaths) + "\n")
dataFrame = sqlContext.parquetFile(*parquetPaths)
aggregated = dataFrame.groupBy(
@@ -98,6 +101,7 @@
pyspark.sql.types.DoubleType(),
)
+ print("Calculating popularity score")
result = aggregated.select(
aggregated.project,
aggregated.page_id,
@@ -110,4 +114,5 @@
deleteHdfsDir(args.output_dir)
# the default spark.sql.shuffle.partitions creates 200 partitions, resulting in 3mb files.
# repartition to achieve result files close to 256mb (our default hdfs block size)
+ print("Writing results to " + args.output_dir)
result.repartition(3).saveAsParquetFile(args.output_dir)
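As a side note, the `repartition(3)` above follows from the arithmetic in the comment. A minimal sketch of that sizing logic, using only the figures the comment itself states (200 partitions at ~3 MB, targeting the 256 MB default HDFS block size):

```python
import math

# Figures taken from the comment in popularityScore.py:
# 200 default shuffle partitions producing ~3 MB files each,
# repartitioned toward the 256 MB default HDFS block size.
total_mb = 200 * 3
target_block_mb = 256
num_partitions = math.ceil(total_mb / target_block_mb)
print(num_partitions)  # 3, matching result.repartition(3)
```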
diff --git a/oozie/popularity_score/workflow.xml b/oozie/popularity_score/workflow.xml
index 5437fc5..85fd8c3 100644
--- a/oozie/popularity_score/workflow.xml
+++ b/oozie/popularity_score/workflow.xml
@@ -24,7 +24,7 @@
<description>hive-site.xml file path in HDFS</description>
</property>
<property>
- <name>hive_lib_path></name>
+ <name>hive_lib_path</name>
<description>Local path to hive jars on executor instances</description>
</property>
@@ -34,7 +34,7 @@
</property>
<property>
<name>spark_number_executors</name>
- <description>Number of executors to run job with</description>>
+ <description>Number of executors to run job with</description>
</property>
<property>
<name>spark_executor_memory</name>
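Worth noting why the stray `>` in `hive_lib_path>` slipped through: a bare `>` inside element text is still well-formed XML, so nothing fails at parse time. The extra character simply becomes part of the property name, and Oozie would resolve the wrong property. A quick standard-library illustration (this snippet is a hypothetical sketch, not part of the change):

```python
import xml.etree.ElementTree as ET

# Before the fix: the stray '>' parses as text content of <name>,
# silently corrupting the property name.
broken = ET.fromstring("<property><name>hive_lib_path></name></property>")
print(broken.findtext("name"))  # 'hive_lib_path>'

# After the fix: the name matches what the coordinator configuration defines.
fixed = ET.fromstring("<property><name>hive_lib_path</name></property>")
print(fixed.findtext("name"))   # 'hive_lib_path'
```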
--
To view, visit https://gerrit.wikimedia.org/r/265890
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ia5eb7a4bd2e1fdfb920c9ee95ce3cf971a52d707
Gerrit-PatchSet: 1
Gerrit-Project: wikimedia/discovery/analytics
Gerrit-Branch: master
Gerrit-Owner: EBernhardson <[email protected]>
_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits