EBernhardson has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/265890

Change subject: Correct invalid xml and add some debugging to popularity_score
......................................................................

Correct invalid xml and add some debugging to popularity_score

While trying to deploy this today I found a few errors in
the repository. Additionally, I added a few debugging messages
to popularityScore.py to make the output logs slightly more
useful.

Change-Id: Ia5eb7a4bd2e1fdfb920c9ee95ce3cf971a52d707
---
M oozie/popularity_score/popularityScore.py
M oozie/popularity_score/workflow.xml
2 files changed, 7 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/wikimedia/discovery/analytics refs/changes/90/265890/1

diff --git a/oozie/popularity_score/popularityScore.py b/oozie/popularity_score/popularityScore.py
index e58013d..5a75c64 100644
--- a/oozie/popularity_score/popularityScore.py
+++ b/oozie/popularity_score/popularityScore.py
@@ -1,4 +1,5 @@
 import pyspark
+import pyspark.sql
 import pyspark.sql.functions
 import pyspark.sql.types
 import argparse
@@ -79,6 +80,8 @@
     sqlContext = pyspark.sql.SQLContext(sc)
 
     parquetPaths = pageViewHourlyPathList(args.source_dir, args.start_date, args.end_date)
+    print("loading pageview data from:")
+    print("\t" + "\n\t".join(parquetPaths) + "\n")
     dataFrame = sqlContext.parquetFile(*parquetPaths)
 
     aggregated = dataFrame.groupBy(
@@ -98,6 +101,7 @@
         pyspark.sql.types.DoubleType(),
     )
 
+    print("Calculating popularity score")
     result = aggregated.select(
         aggregated.project,
         aggregated.page_id,
@@ -110,4 +114,5 @@
     deleteHdfsDir(args.output_dir)
     # the default spark.sql.shuffle.partitions creates 200 partitions, resulting in 3mb files.
     # repartition to achieve result files close to 256mb (our default hdfs block size)
+    print("Writing results to " + args.output_dir)
     result.repartition(3).saveAsParquetFile(args.output_dir)
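The repartition comment above encodes a small piece of sizing arithmetic: 200 default shuffle partitions at roughly 3 MB each is about 600 MB of output, and 600 MB divided by a 256 MB HDFS block rounds up to the 3 partitions passed to repartition(). A minimal sketch of that calculation (illustrative only; `target_partitions` is a made-up name, not part of this patch):

```python
import math

# Illustrative sketch (not from the patch): derive a repartition count from
# an estimated total output size and the HDFS block size, mirroring the
# reasoning in the comment above.
def target_partitions(total_bytes, block_bytes=256 * 1024 * 1024):
    """Smallest partition count whose files fit within ~one HDFS block each."""
    return max(1, math.ceil(total_bytes / block_bytes))

# 200 default shuffle partitions at ~3 MB each is about 600 MB of output:
total_bytes = 200 * 3 * 1024 * 1024
print(target_partitions(total_bytes))  # -> 3, matching repartition(3)
```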
diff --git a/oozie/popularity_score/workflow.xml b/oozie/popularity_score/workflow.xml
index 5437fc5..85fd8c3 100644
--- a/oozie/popularity_score/workflow.xml
+++ b/oozie/popularity_score/workflow.xml
@@ -24,7 +24,7 @@
             <description>hive-site.xml file path in HDFS</description>
         </property>
         <property>
-            <name>hive_lib_path></name>
+            <name>hive_lib_path</name>
             <description>Local path to hive jars on executor instances</description>
         </property>
 
@@ -34,7 +34,7 @@
         </property>
         <property>
             <name>spark_number_executors</name>
-            <description>Number of executors to run job with</description>>
+            <description>Number of executors to run job with</description>
         </property>
         <property>
             <name>spark_executor_memory</name>

-- 
To view, visit https://gerrit.wikimedia.org/r/265890
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ia5eb7a4bd2e1fdfb920c9ee95ce3cf971a52d707
Gerrit-PatchSet: 1
Gerrit-Project: wikimedia/discovery/analytics
Gerrit-Branch: master
Gerrit-Owner: EBernhardson <[email protected]>
