EBernhardson has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/403545 )

Change subject: JVM components to support file-based training
......................................................................

JVM components to support file-based training

An upcoming refactor changes training_pipeline.py from dataframe-based
training to file-based training, where we emit partitioned and
formatted folds/splits to HDFS and load them into training by copying
each to a local file and pointing the C++ code at it.

This is a separate patch so we can release a new version of the
MjoLniR jar. Due to how our CI works, Python cannot test against new
JVM code until it has been released.

The entry points that python will be using are:
* DataWriter.write
* MlrXGBoost.trainWithFiles
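
As a rough sketch of the intended driver-side flow (the "fold" column
name, path format, worker count and hyperparameters below are
illustrative only; df and jsc are assumed to be in scope):

  // Emit one file (plus a .query file) per split/fold/partition to HDFS
  val writer = new DataWriter(jsc)
  val folds = writer.write(df, 10,
    "hdfs:///tmp/mjolnir/%s-fold-%s-part-%d", Some("fold"))
  // Train against the files of a single fold; each partition copies
  // its files to local disk and hands them to xgboost
  val model = MlrXGBoost.trainWithFiles(
    jsc, folds(0), "train",
    Map("objective" -> "rank:ndcg", "eval_metric" -> "ndcg@5"),
    numRounds = 100, earlyStoppingRound = 0)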

Change-Id: Ib5e8cd9d3e87e724f05b5ec0941c140aa5077d71
---
M .gitignore
D jvm/mjolnir.iml
M jvm/pom.xml
A jvm/src/main/scala/ml/dmlc/xgboost4j/scala/spark/MjolnirUtils.scala
A jvm/src/main/scala/org/wikimedia/search/mjolnir/AsLocalFile.scala
A jvm/src/main/scala/org/wikimedia/search/mjolnir/DataWriter.scala
A jvm/src/main/scala/org/wikimedia/search/mjolnir/MlrXGBoost.scala
A jvm/src/test/resources/fixtures/datasets/test.txt
A jvm/src/test/resources/fixtures/datasets/test.txt.query
A jvm/src/test/resources/fixtures/datasets/train.txt
A jvm/src/test/resources/fixtures/datasets/train.txt.query
M jvm/src/test/scala/org/wikimedia/search/mjolnir/PythonUtilsSuite.scala
12 files changed, 749 insertions(+), 163 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/search/MjoLniR refs/changes/45/403545/1

diff --git a/.gitignore b/.gitignore
index f2a9cf7..83930f6 100644
--- a/.gitignore
+++ b/.gitignore
@@ -27,6 +27,7 @@
 # Editor temporary files
 .*.sw[po]
 /jvm/.idea
+/jvm/mjolnir.iml
 
 # Vagrant, and cdh stuff in vagrant
 .vagrant
diff --git a/jvm/mjolnir.iml b/jvm/mjolnir.iml
deleted file mode 100644
index b341014..0000000
--- a/jvm/mjolnir.iml
+++ /dev/null
@@ -1,162 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<module 
org.jetbrains.idea.maven.project.MavenProjectsManager.isMavenModule="true" 
type="JAVA_MODULE" version="4">
-  <component name="NewModuleRootManager" LANGUAGE_LEVEL="JDK_1_7">
-    <output url="file://$MODULE_DIR$/target/classes" />
-    <output-test url="file://$MODULE_DIR$/target/test-classes" />
-    <content url="file://$MODULE_DIR$">
-      <sourceFolder url="file://$MODULE_DIR$/src/main/scala" 
isTestSource="false" />
-      <sourceFolder url="file://$MODULE_DIR$/src/test/scala" 
isTestSource="true" />
-      <sourceFolder url="file://$MODULE_DIR$/src/test/resources" 
type="java-test-resource" />
-      <excludeFolder url="file://$MODULE_DIR$/target" />
-    </content>
-    <orderEntry type="inheritedJdk" />
-    <orderEntry type="sourceFolder" forTests="false" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.scala-lang:scala-compiler:2.11.8" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.scala-lang.modules:scala-xml_2.11:1.0.4" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.scala-lang.modules:scala-parser-combinators_2.11:1.0.4" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.scala-lang:scala-reflect:2.11.8" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.scala-lang:scala-library:2.11.8" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.spark:spark-mllib_2.11:2.1.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.spark:spark-core_2.11:2.1.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.avro:avro-mapred:hadoop2:1.7.7" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.avro:avro-ipc:1.7.7" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.avro:avro:1.7.7" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.avro:avro-ipc:tests:1.7.7" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.codehaus.jackson:jackson-core-asl:1.9.13" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.codehaus.jackson:jackson-mapper-asl:1.9.13" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
com.twitter:chill_2.11:0.8.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
com.esotericsoftware:kryo-shaded:3.0.3" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
com.esotericsoftware:minlog:1.3.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.objenesis:objenesis:2.1" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
com.twitter:chill-java:0.8.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.xbean:xbean-asm5-shaded:4.4" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.hadoop:hadoop-client:2.2.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.hadoop:hadoop-common:2.2.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
commons-cli:commons-cli:1.2" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.commons:commons-math:2.1" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
xmlenc:xmlenc:0.52" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
commons-io:commons-io:2.1" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
commons-lang:commons-lang:2.5" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
commons-configuration:commons-configuration:1.6" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
commons-collections:commons-collections:3.2.1" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
commons-digester:commons-digester:1.8" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
commons-beanutils:commons-beanutils:1.7.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
commons-beanutils:commons-beanutils-core:1.8.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
com.google.protobuf:protobuf-java:2.5.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.hadoop:hadoop-auth:2.2.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.commons:commons-compress:1.4.1" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.tukaani:xz:1.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.hadoop:hadoop-hdfs:2.2.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.mortbay.jetty:jetty-util:6.1.26" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.hadoop:hadoop-mapreduce-client-app:2.2.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.hadoop:hadoop-mapreduce-client-common:2.2.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.hadoop:hadoop-yarn-client:2.2.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
com.google.inject:guice:3.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
javax.inject:javax.inject:1" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
aopalliance:aopalliance:1.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.hadoop:hadoop-yarn-server-common:2.2.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.hadoop:hadoop-mapreduce-client-shuffle:2.2.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.hadoop:hadoop-yarn-api:2.2.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.hadoop:hadoop-mapreduce-client-core:2.2.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.hadoop:hadoop-yarn-common:2.2.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.hadoop:hadoop-mapreduce-client-jobclient:2.2.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.hadoop:hadoop-annotations:2.2.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.spark:spark-launcher_2.11:2.1.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.spark:spark-network-common_2.11:2.1.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.fusesource.leveldbjni:leveldbjni-all:1.8" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.spark:spark-network-shuffle_2.11:2.1.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.spark:spark-unsafe_2.11:2.1.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
net.java.dev.jets3t:jets3t:0.7.1" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
commons-codec:commons-codec:1.3" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
commons-httpclient:commons-httpclient:3.1" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.curator:curator-recipes:2.4.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.curator:curator-framework:2.4.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.curator:curator-client:2.4.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.zookeeper:zookeeper:3.4.5" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
com.google.guava:guava:14.0.1" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
javax.servlet:javax.servlet-api:3.1.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.commons:commons-lang3:3.5" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
com.google.code.findbugs:jsr305:1.3.9" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.slf4j:slf4j-api:1.7.16" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.slf4j:jul-to-slf4j:1.7.16" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.slf4j:jcl-over-slf4j:1.7.16" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
log4j:log4j:1.2.17" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.slf4j:slf4j-log4j12:1.7.16" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
com.ning:compress-lzf:1.0.3" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.xerial.snappy:snappy-java:1.1.2.6" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
net.jpountz.lz4:lz4:1.3.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.roaringbitmap:RoaringBitmap:0.5.11" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
commons-net:commons-net:2.2" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.json4s:json4s-jackson_2.11:3.2.11" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.json4s:json4s-core_2.11:3.2.11" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.json4s:json4s-ast_2.11:3.2.11" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.scala-lang:scalap:2.11.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.glassfish.jersey.core:jersey-client:2.22.2" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
javax.ws.rs:javax.ws.rs-api:2.0.1" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.glassfish.hk2:hk2-api:2.4.0-b34" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.glassfish.hk2:hk2-utils:2.4.0-b34" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.glassfish.hk2.external:aopalliance-repackaged:2.4.0-b34" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.glassfish.hk2.external:javax.inject:2.4.0-b34" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.glassfish.hk2:hk2-locator:2.4.0-b34" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.javassist:javassist:3.18.1-GA" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.glassfish.jersey.core:jersey-common:2.22.2" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
javax.annotation:javax.annotation-api:1.2" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.glassfish.jersey.bundles.repackaged:jersey-guava:2.22.2" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.glassfish.hk2:osgi-resource-locator:1.0.1" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.glassfish.jersey.core:jersey-server:2.22.2" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.glassfish.jersey.media:jersey-media-jaxb:2.22.2" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
javax.validation:validation-api:1.1.0.Final" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.glassfish.jersey.containers:jersey-container-servlet:2.22.2" 
level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.glassfish.jersey.containers:jersey-container-servlet-core:2.22.2" 
level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
io.netty:netty-all:4.0.42.Final" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
io.netty:netty:3.8.0.Final" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
com.clearspring.analytics:stream:2.7.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
io.dropwizard.metrics:metrics-core:3.1.2" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
io.dropwizard.metrics:metrics-jvm:3.1.2" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
io.dropwizard.metrics:metrics-json:3.1.2" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
io.dropwizard.metrics:metrics-graphite:3.1.2" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.ivy:ivy:2.4.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: oro:oro:2.0.8" 
level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
net.razorvine:pyrolite:4.13" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
net.sf.py4j:py4j:0.10.4" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.commons:commons-crypto:1.0.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.spark:spark-streaming_2.11:2.1.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.spark:spark-sql_2.11:2.1.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
com.univocity:univocity-parsers:2.2.1" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.spark:spark-sketch_2.11:2.1.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.spark:spark-catalyst_2.11:2.1.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.codehaus.janino:janino:3.0.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.codehaus.janino:commons-compiler:3.0.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.antlr:antlr4-runtime:4.5.3" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.parquet:parquet-column:1.8.1" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.parquet:parquet-common:1.8.1" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.parquet:parquet-encoding:1.8.1" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.parquet:parquet-hadoop:1.8.1" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.parquet:parquet-format:2.3.0-incubating" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.parquet:parquet-jackson:1.8.1" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.spark:spark-graphx_2.11:2.1.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
com.github.fommil.netlib:core:1.1.2" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
net.sourceforge.f2j:arpack_combined_all:0.1" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.spark:spark-mllib-local_2.11:2.1.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.scalanlp:breeze_2.11:0.12" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.scalanlp:breeze-macros_2.11:0.12" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
net.sf.opencsv:opencsv:2.3" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
com.github.rwl:jtransforms:2.4.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.spire-math:spire_2.11:0.7.4" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.spire-math:spire-macros_2.11:0.7.4" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
com.chuusai:shapeless_2.11:2.0.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.commons:commons-math3:3.4.1" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.jpmml:pmml-model:1.2.15" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.jpmml:pmml-schema:1.2.15" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.apache.spark:spark-tags_2.11:2.1.0" level="project" />
-    <orderEntry type="library" scope="PROVIDED" name="Maven: 
org.spark-project.spark:unused:1.0.0" level="project" />
-    <orderEntry type="library" scope="TEST" name="Maven: 
org.scalatest:scalatest_2.11:3.0.1" level="project" />
-    <orderEntry type="library" scope="TEST" name="Maven: 
org.scalactic:scalactic_2.11:3.0.1" level="project" />
-    <orderEntry type="library" name="Maven: 
com.fasterxml.jackson.module:jackson-module-scala_2.11:2.6.5" level="project" />
-    <orderEntry type="library" name="Maven: 
com.fasterxml.jackson.core:jackson-core:2.6.5" level="project" />
-    <orderEntry type="library" name="Maven: 
com.fasterxml.jackson.core:jackson-annotations:2.6.5" level="project" />
-    <orderEntry type="library" name="Maven: 
com.fasterxml.jackson.core:jackson-databind:2.6.5" level="project" />
-    <orderEntry type="library" name="Maven: 
com.fasterxml.jackson.module:jackson-module-paranamer:2.6.5" level="project" />
-    <orderEntry type="library" name="Maven: 
com.thoughtworks.paranamer:paranamer:2.6" level="project" />
-  </component>
-</module>
\ No newline at end of file
diff --git a/jvm/pom.xml b/jvm/pom.xml
index 479b4ec..d1fdc13 100644
--- a/jvm/pom.xml
+++ b/jvm/pom.xml
@@ -14,7 +14,7 @@
         <spark.version>2.1.0</spark.version>
         <scala.version>2.11.8</scala.version>
         <scala.binary.version>2.11</scala.binary.version>
-        <xgboost.version>0.7-wmf-1</xgboost.version>
+        <xgboost.version>0.8-wmf-1-SNAPSHOT</xgboost.version>
     </properties>
 
     <scm>
@@ -146,6 +146,16 @@
             <artifactId>jackson-module-scala_${scala.binary.version}</artifactId>
             <version>2.6.5</version>
         </dependency>
+        <dependency>
+            <groupId>ml.dmlc</groupId>
+            <artifactId>xgboost4j-spark</artifactId>
+            <version>${xgboost.version}</version>
+        </dependency>
+        <dependency>
+            <groupId>ml.dmlc</groupId>
+            <artifactId>xgboost4j</artifactId>
+            <version>${xgboost.version}</version>
+        </dependency>
     </dependencies>
     <repositories>
         <repository>
diff --git a/jvm/src/main/scala/ml/dmlc/xgboost4j/scala/spark/MjolnirUtils.scala b/jvm/src/main/scala/ml/dmlc/xgboost4j/scala/spark/MjolnirUtils.scala
new file mode 100644
index 0000000..45a00af
--- /dev/null
+++ b/jvm/src/main/scala/ml/dmlc/xgboost4j/scala/spark/MjolnirUtils.scala
@@ -0,0 +1,28 @@
+package ml.dmlc.xgboost4j.scala.spark
+
+import ml.dmlc.xgboost4j.java.IRabitTracker
+import ml.dmlc.xgboost4j.scala.Booster
+import ml.dmlc.xgboost4j.scala.rabit.RabitTracker
+
+/**
+  * Provide access to package-private constructs of xgboost4j-spark
+  */
+object MjolnirUtils {
+  def model(booster: Booster, metrics: Map[String, Array[Float]], trainMatrix: String): XGBoostModel = {
+    // Arbitrarily take an 'other' matrix if available
+    val xgMetrics = metrics.keys.find(!_.equals(trainMatrix)).map{ name => Map(
+      "train" -> metrics(trainMatrix),
+      "test" -> metrics(name)
+    ) }.getOrElse(Map(
+      "train" -> metrics(trainMatrix)
+    ))
+
+    val model = new XGBoostRegressionModel(booster)
+    model.setSummary(XGBoostTrainingSummary(xgMetrics))
+    model
+  }
+
+  def scalaRabitTracker(nWorkers: Int): IRabitTracker = {
+    new RabitTracker(nWorkers)
+  }
+}
diff --git a/jvm/src/main/scala/org/wikimedia/search/mjolnir/AsLocalFile.scala b/jvm/src/main/scala/org/wikimedia/search/mjolnir/AsLocalFile.scala
new file mode 100644
index 0000000..9962b3a
--- /dev/null
+++ b/jvm/src/main/scala/org/wikimedia/search/mjolnir/AsLocalFile.scala
@@ -0,0 +1,82 @@
+package org.wikimedia.search.mjolnir
+
+import java.io.{IOException, ObjectInputStream, ObjectOutputStream}
+import java.nio.file.{Files, Path}
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{Path => HDFSPath}
+import org.apache.spark.{SparkContext, TaskContext}
+import org.apache.spark.broadcast.Broadcast
+
+import scala.util.control.NonFatal
+
+/**
+  * Helper that makes hdfs paths appear local so xgboost can read them
+  *
+  * @param broadcastConfiguration The hadoop configuration to be used on executors
+  *                               to access HDFS.
+  */
+class AsLocalFile(broadcastConfiguration: Broadcast[SerializableConfiguration]) extends Serializable {
+
+  def this(sc: SparkContext) = this(sc.broadcast(new SerializableConfiguration(sc.hadoopConfiguration)))
+
+  // Re-interpret files starting at root as local files
+  private def asHDFSPath(path: String): HDFSPath = if (path.charAt(0) == '/') {
+    new HDFSPath(s"file://$path")
+  } else {
+    new HDFSPath(path)
+  }
+
+  private def copyToLocalFile(src: String, dst: Path): Unit = {
+    val s = asHDFSPath(src)
+    val d = asHDFSPath(dst.toString)
+    s.getFileSystem(broadcastConfiguration.value.value).copyToLocalFile(s, d)
+  }
+
+  /**
+    * Convert a string representing either a remote or local file into
+    * a string representing a local file. If the file had to be copied
+    * locally, delete it after executing the provided block.
+    */
+  def apply[A](path: String)(block: String => A): A = {
+    if (path.startsWith("/")) {
+      block(path)
+    } else if (path.startsWith("file:/")) {
+      block(path.substring("file:".length))
+    } else {
+      val prefix = s"mjolnir-${TaskContext.get.stageId()}-${TaskContext.getPartitionId()}-"
+      val localOutputPath = Files.createTempFile(prefix, ".xgb")
+      localOutputPath.toFile.deleteOnExit()
+      try {
+        copyToLocalFile(path, localOutputPath)
+        block(localOutputPath.toString)
+      } finally {
+        Files.deleteIfExists(localOutputPath)
+      }
+    }
+  }
+}
+
+/**
+  * Makes hadoop configuration serializable as a broadcast variable
+  */
+class SerializableConfiguration(@transient var value: Configuration) extends Serializable {
+  private def writeObject(out: ObjectOutputStream): Unit = tryOrIOException {
+    out.defaultWriteObject()
+    value.write(out)
+  }
+
+  private def readObject(in: ObjectInputStream): Unit = tryOrIOException {
+    value = new Configuration(false)
+    value.readFields(in)
+  }
+
+  private def tryOrIOException[T](block: => T): T = {
+    try {
+      block
+    } catch {
+      case e: IOException => throw e
+      case NonFatal(e) => throw new IOException(e)
+    }
+  }
+}
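
A minimal sketch of the intended use from an executor (the rdd of path
strings and the File length call are illustrative, not part of this
patch):

  val asLocalFile = new AsLocalFile(sc)
  rdd.map { path: String =>
    asLocalFile(path) { localPath =>
      // localPath is on the executor's local disk; if it had to be
      // copied from HDFS it is deleted when this block returns
      new java.io.File(localPath).length
    }
  }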
diff --git a/jvm/src/main/scala/org/wikimedia/search/mjolnir/DataWriter.scala b/jvm/src/main/scala/org/wikimedia/search/mjolnir/DataWriter.scala
new file mode 100644
index 0000000..2a7bad0
--- /dev/null
+++ b/jvm/src/main/scala/org/wikimedia/search/mjolnir/DataWriter.scala
@@ -0,0 +1,195 @@
+package org.wikimedia.search.mjolnir
+
+import java.nio.charset.StandardCharsets
+
+import org.apache.hadoop.fs.{Path => HDFSPath}
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.ml.linalg.Vector
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.{DataFrame, Row}
+
+import scala.collection.parallel.ForkJoinTaskSupport
+import scala.collection.parallel.immutable.ParVector
+import scala.concurrent.forkjoin.ForkJoinPool
+
+/**
+  * Write out a mjolnir dataframe from data_pipeline to hdfs as
+  * txt files readable directly by xgboost/lightgbm. While not
+  * explicitly called out in the return values, there is a matching
+  * path + ".query" file for each output file containing the sequential
+  * query counts needed by xgboost and lightgbm.
+  *
+  * @param broadcastConfiguration Broadcasted hadoop configuration to access HDFS from executors.
+  * @param sparse When true, features with a value of zero are not emitted
+  */
+class DataWriter(
+    broadcastConfiguration: Broadcast[SerializableConfiguration],
+    sparse: Boolean = true
+) extends Serializable {
+
+  // Accepting JavaSparkContext for py4j compatibility
+  def this(sc: JavaSparkContext) = this(sc.broadcast(new SerializableConfiguration(sc.hadoopConfiguration)))
+
+  private def asHDFSPath(path: String): HDFSPath = if (path.charAt(0) == '/') {
+    new HDFSPath(s"file://$path")
+  } else {
+    new HDFSPath(path)
+  }
+
+  // rdd contents for passing between formatPartition and writeOneFold
+  private type OutputRow = (Int, Array[Byte], Array[Byte])
+
+  // Writes out a single partition of training data. One partition
+  // may contain gigabytes of data so this should do as little work
+  // as possible per-row.
+  private def writeOneFold(
+    pathFormatter: (String, Int) => HDFSPath,
+    config: Array[String]
+  )(partitionId: Int, rows: Iterator[OutputRow]): Iterator[Map[String, String]] = {
+    // .toSet.toVector gives us a unique list, but feels like hax
+    val paths = config.toSet.toVector.map { name: String =>
+      name -> pathFormatter(name, partitionId)
+    }
+
+    val writers = paths.map { case (name: String, path: HDFSPath) =>
+      val fs = path.getFileSystem(broadcastConfiguration.value.value)
+      val a = fs.create(path)
+      val b = fs.create(new HDFSPath(path + ".query"))
+      name -> (a, b)
+    }.toMap
+
+    try {
+      for ((fold, line, queryLine) <- rows) {
+        val out = writers(config(fold))
+        out._1.write(line)
+        out._2.write(queryLine)
+      }
+    } finally {
+      writers.values.foreach({ out =>
+        out._1.close()
+        out._2.close()
+      })
+    }
+
+    Iterator(paths.map { case (k, v) => k -> v.toString }.toMap)
+  }
+
+  @transient lazy private val maybeSparsify = if (sparse) {
+    { features: Array[(Double, Int)] => features.filter(_._1 != 0D) }
+  } else {
+    { features: Array[(Double, Int)] => features }
+  }
+
+  private def makeLine(features: Vector, label: Int): String =
+    maybeSparsify(features.toArray.zipWithIndex).map {
+      case (feat, index) => s"${index + 1}:$feat"
+    }.mkString(s"$label ", " ", "\n")
+
+  // Counts sequential queries and emits boundaries
+  private def queryCounter(): (String, Option[String]) => String = {
+    var count = 0;
+    { (query: String, nextQuery: Option[String]) =>
+      count += 1
+      if (nextQuery.exists(_.equals(query))) {
+        // next matches, keep the count going
+        ""
+      } else {
+        // next query is different. zero out
+        val out = s"$count\n"
+        count = 0
+        out
+      }
+    }
+  }
+
+  @transient lazy private val utf8 = StandardCharsets.UTF_8
+
+  private def formatPartition(schema: StructType, foldCol: Option[String])(rows: Iterator[Row]): Iterator[OutputRow] = {
+    val chooseFold = foldCol.map { name =>
+      val index = schema.fieldIndex(name);
+      { row: Row => row.getLong(index).toInt }
+    }.getOrElse({ row: Row => 0 })
+
+    val labelIndex = schema.fieldIndex("label")
+    val featuresIndex = schema.fieldIndex("features")
+    val queryIndex = schema.fieldIndex("query")
+    val makeQueryLine = queryCounter()
+    val it = rows.buffered
+    for (row <- it) yield {
+      val fold = chooseFold(row)
+      val nextQuery = if (it.hasNext) {
+        Some(it.head.getString(queryIndex))
+      } else {
+        None
+      }
+
+      val line = makeLine(row.getAs[Vector](featuresIndex), row.getInt(labelIndex))
+      val queryLine = makeQueryLine(row.getString(queryIndex), nextQuery)
+      (fold, line.getBytes(utf8), queryLine.getBytes(utf8))
+    }
+  }
+
+  // Takes a nullable string and outputs java maps for py4j compatibility
+  def write(df: DataFrame, numWorkers: Int, pathFormat: String, foldCol: String): Array[Array[java.util.Map[String, String]]] = {
+    import collection.JavaConverters._
+    write(df, numWorkers, pathFormat, Option(foldCol)).map(_.map(_.asJava))
+  }
+
+  /**
+    * @param df Output from data_pipeline. Must be repartitioned on query and sorted by query
+    *           within partitions.
+    * @param numWorkers The number of partitions each data file will be emitted as
+    * @param pathFormat Format for hdfs paths. Params are %s: name, %s: fold, %d: partition
+    * @param foldCol Long column to source which fold a row belongs to
+    * @return List of folds, each represented by a list of partitions, each containing a map
+    *         from split name to the hdfs path for that partition.
+    */
+  def write(
+    df: DataFrame,
+    numWorkers: Int,
+    pathFormat: String,
+    foldCol: Option[String]
+  ): Array[Array[Map[String, String]]] = {
+    val rdd = df
+      .rdd.mapPartitions(formatPartition(df.schema, foldCol))
+    val numFolds = foldCol.map { name => df.schema(name).metadata.getLong("num_folds").toInt }.getOrElse(1)
+
+    try {
+      // Materialize the rdd so the parallel tasks coming up share the result. Otherwise
+      // spark can just coalesce all the work above into the minimal number of output
+      // workers and repeat it for each partition it writes out.
+      // TODO: depends on if we have enough nodes to cache the data. On a busy cluster maybe not...
+      rdd.cache()
+      rdd.count()
+
+      val folds = ParVector.range(0, numFolds)
+      folds.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(numFolds))
+      folds.map { fold =>
+        // TODO: It might be nice to just accept Seq[Map[Int, String]] that lists folds;
+        // each map is a config parameter. Pushes naming up the call chain.
+        val config = foldCol.map { name =>
+          // If a fold column was provided assign that fold to test, everything else to train
+          (0 until numFolds).map { x =>
+            if (x == fold) "test" else "train"
+          }.toArray
+          // Otherwise everything goes in an "all" bucket. formatPartition ensures
+          // that all rows report fold of 0L when foldCol is not provided.
+        }.getOrElse(Array("all"))
+
+        // Accepts a name and partition id, returns the path to write out to
+        val pathFormatter: (String, Int) => HDFSPath = foldCol.map { _ =>
+          val foldId = fold.toString;
+          { (name: String, partition: Int) => asHDFSPath(pathFormat.format(name, foldId, partition)) }
+          // If all in one bucket, indicate the fold in the path with 'x'
+        }.getOrElse({ (name, partition) => asHDFSPath(pathFormat.format(name, "x", partition)) })
+
+        rdd.coalesce(numWorkers)
+          .mapPartitionsWithIndex(writeOneFold(pathFormatter, config))
+          .collect()
+      }.toArray
+    } finally {
+      rdd.unpersist()
+    }
+  }
+}
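
To make the output shape concrete: with an illustrative pathFormat of
"hdfs:///tmp/%s-fold-%s-part-%d", a two-fold foldCol and numWorkers = 2,
fold 0 of the returned structure would hold per-partition maps like:

  Map("train" -> "hdfs:///tmp/train-fold-0-part-0",
      "test"  -> "hdfs:///tmp/test-fold-0-part-0")

Each emitted file is svmlight-style text (a label followed by 1-based
feature:value pairs, one row per line), and the matching .query file
holds one count per run of consecutive rows sharing a query, so the
counts in a .query file sum to the number of rows in its data file.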
diff --git a/jvm/src/main/scala/org/wikimedia/search/mjolnir/MlrXGBoost.scala b/jvm/src/main/scala/org/wikimedia/search/mjolnir/MlrXGBoost.scala
new file mode 100644
index 0000000..a98c978
--- /dev/null
+++ b/jvm/src/main/scala/org/wikimedia/search/mjolnir/MlrXGBoost.scala
@@ -0,0 +1,175 @@
+package org.wikimedia.search.mjolnir
+
+import java.io.{ByteArrayInputStream, IOException}
+
+import ml.dmlc.xgboost4j.java.{IRabitTracker, Rabit, RabitTracker => PyRabitTracker}
+import ml.dmlc.xgboost4j.scala.spark.{MjolnirUtils, XGBoostModel}
+import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost}
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.{FutureAction, Partitioner, SparkContext, TaskContext}
+import org.apache.spark.rdd.RDD
+
+/**
+  * Rabit tracker configurations.
+  *
+  * @param workerConnectionTimeout The timeout for all workers to connect to the tracker.
+  *                                Set timeout length to zero to disable timeout.
+  *                                Use a finite, non-zero timeout value to prevent tracker from
+  *                                hanging indefinitely (in milliseconds)
+  *                                (supported by "scala" implementation only.)
+  * @param trackerImpl Choice between "python" or "scala". The former utilizes the Java wrapper of
+  *                    the Python Rabit tracker (in dmlc_core), whereas the latter is implemented
+  *                    in Scala without Python components, and with full support of timeouts.
+  *                    The Scala implementation is currently experimental, use at your own risk.
+  */
+case class TrackerConf(workerConnectionTimeout: Long, trackerImpl: String)
+
+object TrackerConf {
+  def apply(): TrackerConf = TrackerConf(0L, "python")
+}
+
+private class MlrXGBoost(asLocalFile: AsLocalFile) extends Serializable {
+  private def createDMatrices(
+      inputs: Map[String, String]
+  ): Map[String, DMatrix] = inputs.map { case (name, path) =>
+    name -> asLocalFile(path) { localPath: String =>
+      // DMatrix will read the file in on creation, and asLocalFile
+      // will delete the local copy after creation of the DMatrix.
+      new DMatrix(localPath)
+    }
+  }
+
+  private[mjolnir] def buildDistributedBoosters(
+      rdds: RDD[Map[String, String]],
+      trainMatrix: String,
+      params: Map[String, Any],
+      rabitEnv: Option[java.util.Map[String, String]],
+      numRounds: Int,
+      earlyStoppingRound: Int = 0
+  ): RDD[(Array[Byte], Map[String, Array[Float]])] =
+    rdds.mapPartitions({ rows =>
+      // XGBoost refuses to load our binary format if rabit has been
+      // initialized, so we do it early. This makes for the odd situation
+      // where we need to dispose of the matrices before rabit is shut down.
+      val watches = createDMatrices(rows.next())
+      rabitEnv.foreach { env =>
+        env.put("DMLC_TASK_ID", TaskContext.getPartitionId().toString)
+        Rabit.init(env)
+      }
+      try {
+        if (rows.hasNext) {
+          throw new IOException("Expected single row in partition but received more.")
+        }
+        val metrics = Array.fill(watches.size)(new Array[Float](numRounds))
+        val booster = XGBoost.train(
+          watches(trainMatrix), params, numRounds, watches, metrics,
+          earlyStoppingRound = earlyStoppingRound)
+        val bytes = booster.toByteArray
+        booster.dispose
+        Iterator(bytes -> watches.keys.zip(metrics).toMap)
+      } finally {
+        watches.values.foreach(_.delete())
+        rabitEnv.foreach { _ => Rabit.shutdown() }
+      }
+    }).cache()
+}
+
+object MlrXGBoost {
+  private def overrideParamsAccordingToTaskCPUs(sc: SparkContext, params: Map[String, Any]): Map[String, Any] = {
+    val coresPerTask = sc.getConf.getInt("spark.task.cpus", 1)
+    if (params.contains("nthread")) {
+      val nThread = params("nthread").toString.toInt
+      require(nThread <= coresPerTask,
+        s"the nthread configuration ($nThread) must be no larger than " +
+        s"spark.task.cpus ($coresPerTask)")
+      params
+    } else {
+      params + ("nthread" -> coresPerTask)
+    }
+  }
+
+  private def startTracker(nWorkers: Int, trackerConf: TrackerConf): IRabitTracker = {
+    val tracker: IRabitTracker = trackerConf.trackerImpl match {
+      case "scala" => MjolnirUtils.scalaRabitTracker(nWorkers)
+      case "python" => new PyRabitTracker(nWorkers)
+      case _ => new PyRabitTracker(nWorkers)
+    }
+
+    require(tracker.start(trackerConf.workerConnectionTimeout), "FAULT: Failed to start tracker")
+    tracker
+  }
+
+  private def postTrackerReturnProcessing(
+    trackerReturnVal: Int,
+    distributedBoosters: RDD[(Array[Byte], Map[String, Array[Float]])],
+    trainMatrix: String,
+    async: Option[FutureAction[_]]
+  ): XGBoostModel = {
+    if (trackerReturnVal != 0) {
+      async.foreach(_.cancel())
+      throw new Exception("XGBoostModel training failed")
+    }
+    val res = distributedBoosters.first()
+    distributedBoosters.unpersist()
+    val bais = new ByteArrayInputStream(res._1)
+    val booster = XGBoost.loadModel(bais)
+    MjolnirUtils.model(booster, res._2, trainMatrix)
+  }
+
+  def trainWithFiles(
+    jsc: JavaSparkContext,
+    fold: Seq[Map[String, String]],
+    trainMatrix: String,
+    params: Map[String, Any],
+    numRounds: Int,
+    earlyStoppingRound: Int
+  ): XGBoostModel = {
+    // Convert input data, a map of names to paths of worker splits, into
+    // a map per worker containing its individual splits.
+    val baseData = fold.indices.zip(fold)
+    // Distribute that data into one row per partition
+    val rdd = jsc.parallelize(baseData, baseData.length)
+      .partitionBy(new ExactPartitioner(baseData.length, baseData.length))
+      .map(_._2)
+
+    val trainer = new MlrXGBoost(new AsLocalFile(jsc))
+    val overwrittenParams = overrideParamsAccordingToTaskCPUs(jsc, params)
+    if (rdd.getNumPartitions == 1) {
+      // Special case with single worker, doesn't need Rabit
+      val distributedBoosters = trainer.buildDistributedBoosters(
+        rdd, trainMatrix, overwrittenParams, None, numRounds,
+        earlyStoppingRound)
+      distributedBoosters.foreachPartition(() => _)
+      postTrackerReturnProcessing(0, distributedBoosters, trainMatrix, None)
+    } else {
+      val trackerConf = params.get("tracker_conf") match {
+        case None => TrackerConf()
+        case Some(conf: TrackerConf) => conf
+        case _ => throw new IllegalArgumentException(
+          "parameter \"tracker_conf\" must be an instance of TrackerConf.")
+      }
+
+      val tracker = startTracker(baseData.length, trackerConf)
+      try {
+        val distributedBoosters = trainer.buildDistributedBoosters(
+          rdd, trainMatrix, overwrittenParams, Some(tracker.getWorkerEnvs),
+          numRounds, earlyStoppingRound)
+        val async = distributedBoosters.foreachPartitionAsync(() => _)
+        postTrackerReturnProcessing(tracker.waitFor(0L), distributedBoosters, trainMatrix, Some(async))
+      } finally {
+        tracker.stop()
+      }
+    }
+  }
+}
+
+// Necessary to ensure we distribute the training files as 1 file per partition
+private class ExactPartitioner[V](partitions: Int, elements: Int) extends Partitioner {
+  override def numPartitions: Int = partitions
+  override def getPartition(key: Any): Int = {
+    val k = key.asInstanceOf[Int]
+    k * partitions / elements
+  }
+}
+
+
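
The "tracker_conf" entry is plucked straight out of params by
trainWithFiles, so opting in to the experimental Scala tracker looks
roughly like the following (the timeout value is illustrative):

  val params = Map(
    "objective" -> "rank:ndcg",
    "eval_metric" -> "ndcg@5",
    // 60s worker connection timeout, Scala tracker implementation
    "tracker_conf" -> TrackerConf(60000L, "scala")
  )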
diff --git a/jvm/src/test/resources/fixtures/datasets/test.txt b/jvm/src/test/resources/fixtures/datasets/test.txt
new file mode 100644
index 0000000..d3d6afa
--- /dev/null
+++ b/jvm/src/test/resources/fixtures/datasets/test.txt
@@ -0,0 +1,100 @@
+3 1:0.326636 2:0.077529 3:0.371918 4:0.380291 5:0.014513
+0 1:0.972146 2:0.158156 3:0.914378 4:0.862571 5:0.320560
+1 1:0.298956 2:0.364456 3:0.883224 4:0.396381 5:0.793094
+3 1:0.661035 2:0.051392 3:0.516250 4:0.451712 5:0.944109
+0 1:0.864322 2:0.525315 3:0.786576 4:0.989264 5:0.083457
+3 1:0.404920 2:0.319258 3:0.998247 4:0.337145 5:0.418243
+0 1:0.510521 2:0.187725 3:0.330769 4:0.927407 5:0.335392
+3 1:0.814285 2:0.251424 3:0.506982 4:0.762243 5:0.100247
+3 1:0.117167 2:0.819767 3:0.709812 4:0.796283 5:0.540541
+3 1:0.897631 2:0.580800 3:0.026243 4:0.964434 5:0.231140
+2 1:0.391528 2:0.683021 3:0.997876 4:0.568064 5:0.010284
+2 1:0.454713 2:0.288310 3:0.848797 4:0.748149 5:0.522314
+1 1:0.168367 2:0.309259 3:0.664556 4:0.824906 5:0.370062
+0 1:0.650610 2:0.220118 3:0.321737 4:0.612518 5:0.071387
+1 1:0.411739 2:0.803898 3:0.042743 4:0.842187 5:0.391775
+2 1:0.807920 2:0.979387 3:0.307194 4:0.988092 5:0.529102
+2 1:0.755759 2:0.365076 3:0.885247 4:0.565210 5:0.615610
+0 1:0.254644 2:0.437205 3:0.257438 4:0.279958 5:0.175046
+3 1:0.444070 2:0.378370 3:0.610439 4:0.936986 5:0.287545
+0 1:0.967863 2:0.092285 3:0.987719 4:0.792169 5:0.690630
+2 1:0.455959 2:0.806951 3:0.713977 4:0.546367 5:0.950988
+0 1:0.790699 2:0.858631 3:0.611960 4:0.836762 5:0.386819
+0 1:0.240780 2:0.842705 3:0.766013 4:0.317449 5:0.359710
+1 1:0.783596 2:0.017780 3:0.765471 4:0.225174 5:0.000262
+1 1:0.144028 2:0.529792 3:0.203593 4:0.102651 5:0.774630
+1 1:0.886765 2:0.524375 3:0.555517 4:0.347809 5:0.076116
+3 1:0.094031 2:0.956022 3:0.274482 4:0.054051 5:0.856636
+1 1:0.921116 2:0.567716 3:0.406270 4:0.066023 5:0.419928
+1 1:0.768755 2:0.520684 3:0.138103 4:0.003975 5:0.975453
+2 1:0.936436 2:0.675702 3:0.286579 4:0.053526 5:0.297008
+1 1:0.304488 2:0.246547 3:0.547152 4:0.094481 5:0.680314
+1 1:0.058460 2:0.155417 3:0.608897 4:0.400933 5:0.805939
+0 1:0.537291 2:0.740986 3:0.590492 4:0.608429 5:0.549569
+0 1:0.981997 2:0.140494 3:0.061062 4:0.045810 5:0.401585
+2 1:0.369939 2:0.285637 3:0.853111 4:0.225219 5:0.426097
+3 1:0.908313 2:0.569428 3:0.497288 4:0.491016 5:0.006803
+2 1:0.179035 2:0.642798 3:0.625024 4:0.927981 5:0.405290
+1 1:0.239090 2:0.710597 3:0.179604 4:0.371392 5:0.449195
+1 1:0.579927 2:0.728269 3:0.206559 4:0.424931 5:0.713960
+3 1:0.649605 2:0.731400 3:0.631224 4:0.175438 5:0.304366
+2 1:0.669838 2:0.153425 3:0.771762 4:0.553746 5:0.040279
+0 1:0.436897 2:0.473752 3:0.795377 4:0.820149 5:0.190161
+0 1:0.640871 2:0.516977 3:0.592973 4:0.490025 5:0.577195
+2 1:0.997780 2:0.095394 3:0.651330 4:0.749575 5:0.660816
+3 1:0.152661 2:0.704911 3:0.947403 4:0.870216 5:0.798095
+1 1:0.747997 2:0.075223 3:0.939461 4:0.643620 5:0.914241
+2 1:0.133448 2:0.430715 3:0.789519 4:0.610028 5:0.903299
+0 1:0.401159 2:0.373494 3:0.659331 4:0.294644 5:0.629553
+0 1:0.207960 2:0.361213 3:0.911492 4:0.554985 5:0.493094
+1 1:0.057594 2:0.713257 3:0.446922 4:0.420421 5:0.567548
+1 1:0.887591 2:0.980055 3:0.615915 4:0.974575 5:0.590911
+3 1:0.727584 2:0.967858 3:0.491858 4:0.012659 5:0.283389
+3 1:0.021647 2:0.394032 3:0.764637 4:0.590971 5:0.827838
+3 1:0.620332 2:0.833978 3:0.671682 4:0.236934 5:0.858537
+1 1:0.256581 2:0.496616 3:0.825439 4:0.632813 5:0.164283
+3 1:0.442709 2:0.256347 3:0.231125 4:0.436597 5:0.587120
+2 1:0.464535 2:0.337670 3:0.452877 4:0.892539 5:0.001143
+3 1:0.734347 2:0.180984 3:0.214319 4:0.363408 5:0.644541
+2 1:0.787623 2:0.952571 3:0.030398 4:0.049772 5:0.242738
+3 1:0.953805 2:0.961399 3:0.837528 4:0.898172 5:0.862682
+1 1:0.731495 2:0.847731 3:0.557340 4:0.431582 5:0.900491
+2 1:0.726246 2:0.828104 3:0.761575 4:0.919778 5:0.119979
+2 1:0.934568 2:0.153811 3:0.337450 4:0.940157 5:0.040626
+0 1:0.483256 2:0.741818 3:0.447331 4:0.580388 5:0.378437
+2 1:0.350344 2:0.330684 3:0.712829 4:0.787849 5:0.380805
+3 1:0.962389 2:0.991490 3:0.913410 4:0.158044 5:0.180574
+1 1:0.464680 2:0.328260 3:0.684814 4:0.092428 5:0.477978
+3 1:0.717346 2:0.441867 3:0.444311 4:0.014475 5:0.951867
+1 1:0.829431 2:0.888164 3:0.221868 4:0.171388 5:0.296757
+1 1:0.952573 2:0.183402 3:0.666165 4:0.527787 5:0.412143
+2 1:0.229767 2:0.820025 3:0.615662 4:0.820428 5:0.603999
+3 1:0.982424 2:0.041073 3:0.796715 4:0.015130 5:0.543146
+2 1:0.929637 2:0.825426 3:0.727255 4:0.677794 5:0.093995
+2 1:0.253085 2:0.925220 3:0.464014 4:0.905004 5:0.860812
+3 1:0.551381 2:0.561334 3:0.478454 4:0.881930 5:0.117634
+3 1:0.584691 2:0.779480 3:0.264777 4:0.778328 5:0.787311
+2 1:0.047838 2:0.133641 3:0.556090 4:0.565904 5:0.211877
+2 1:0.557917 2:0.261338 3:0.125646 4:0.258342 5:0.632127
+2 1:0.396906 2:0.526099 3:0.922920 4:0.899431 5:0.437990
+2 1:0.133891 2:0.128417 3:0.841312 4:0.671067 5:0.483595
+3 1:0.710280 2:0.183570 3:0.528577 4:0.940885 5:0.543355
+3 1:0.884920 2:0.375900 3:0.851078 4:0.021749 5:0.808034
+0 1:0.998045 2:0.539014 3:0.476142 4:0.714477 5:0.674686
+1 1:0.051806 2:0.881044 3:0.620053 4:0.324929 5:0.658388
+3 1:0.274004 2:0.305911 3:0.293288 4:0.200069 5:0.898098
+0 1:0.856834 2:0.648173 3:0.159354 4:0.792070 5:0.728352
+2 1:0.123342 2:0.957180 3:0.621960 4:0.480828 5:0.102538
+0 1:0.768011 2:0.718725 3:0.152013 4:0.877257 5:0.184134
+3 1:0.043845 2:0.106492 3:0.343883 4:0.476853 5:0.931994
+0 1:0.840979 2:0.724733 3:0.132602 4:0.660006 5:0.172654
+1 1:0.790630 2:0.387497 3:0.146874 4:0.272060 5:0.230415
+0 1:0.483095 2:0.769878 3:0.038515 4:0.773712 5:0.406844
+2 1:0.611834 2:0.054520 3:0.097503 4:0.852499 5:0.350596
+1 1:0.319831 2:0.741203 3:0.445348 4:0.305239 5:0.076114
+1 1:0.064030 2:0.230787 3:0.716318 4:0.777473 5:0.473709
+0 1:0.030735 2:0.086434 3:0.230371 4:0.166334 5:0.729940
+0 1:0.443625 2:0.148763 3:0.777603 4:0.469975 5:0.918963
+1 1:0.011012 2:0.474785 3:0.837294 4:0.976318 5:0.813301
+3 1:0.143489 2:0.537931 3:0.422035 4:0.032870 5:0.770511
+3 1:0.531663 2:0.379978 3:0.993619 4:0.126227 5:0.036455
diff --git a/jvm/src/test/resources/fixtures/datasets/test.txt.query b/jvm/src/test/resources/fixtures/datasets/test.txt.query
new file mode 100644
index 0000000..ea19cda
--- /dev/null
+++ b/jvm/src/test/resources/fixtures/datasets/test.txt.query
@@ -0,0 +1,7 @@
+16
+12
+12
+20
+12
+16
+12
diff --git a/jvm/src/test/resources/fixtures/datasets/train.txt b/jvm/src/test/resources/fixtures/datasets/train.txt
new file mode 100644
index 0000000..c079891
--- /dev/null
+++ b/jvm/src/test/resources/fixtures/datasets/train.txt
@@ -0,0 +1,100 @@
+2 1:0.276857 2:0.309246 3:0.443840 4:0.323302 5:0.251910
+0 1:0.696140 2:0.637534 3:0.229497 4:0.280137 5:0.617490
+2 1:0.413772 2:0.395534 3:0.429219 4:0.603654 5:0.517427
+0 1:0.661208 2:0.568080 3:0.704676 4:0.597076 5:0.923187
+0 1:0.448723 2:0.282579 3:0.554448 4:0.515220 5:0.520206
+0 1:0.690696 2:0.517868 3:0.721864 4:0.683105 5:0.645097
+0 1:0.028015 2:0.758854 3:0.510872 4:0.041371 5:0.938664
+2 1:0.628253 2:0.164138 3:0.174485 4:0.660370 5:0.661200
+3 1:0.156874 2:0.244076 3:0.834069 4:0.816389 5:0.267347
+0 1:0.598002 2:0.961438 3:0.115683 4:0.176997 5:0.802739
+0 1:0.210183 2:0.094221 3:0.315377 4:0.104599 5:0.078312
+2 1:0.155581 2:0.239757 3:0.390496 4:0.448591 5:0.363550
+0 1:0.842447 2:0.449796 3:0.699602 4:0.775102 5:0.551474
+1 1:0.282622 2:0.493173 3:0.308711 4:0.588665 5:0.015217
+1 1:0.849260 2:0.716784 3:0.801057 4:0.702118 5:0.467375
+1 1:0.318216 2:0.502648 3:0.139761 4:0.320343 5:0.677026
+1 1:0.496001 2:0.137648 3:0.962708 4:0.638394 5:0.922043
+0 1:0.113989 2:0.482813 3:0.614473 4:0.398461 5:0.411334
+3 1:0.910096 2:0.742803 3:0.802974 4:0.242235 5:0.741078
+3 1:0.026262 2:0.153817 3:0.866324 4:0.152764 5:0.079376
+2 1:0.866166 2:0.176590 3:0.501546 4:0.550038 5:0.252295
+2 1:0.635001 2:0.421436 3:0.400276 4:0.960437 5:0.958324
+0 1:0.511912 2:0.465321 3:0.391040 4:0.536889 5:0.262131
+3 1:0.061608 2:0.844460 3:0.775436 4:0.940628 5:0.472095
+2 1:0.968002 2:0.142278 3:0.746172 4:0.167117 5:0.888660
+1 1:0.264066 2:0.675206 3:0.761167 4:0.605750 5:0.458382
+3 1:0.848002 2:0.323477 3:0.985868 4:0.032662 5:0.831382
+2 1:0.173251 2:0.098697 3:0.002584 4:0.172282 5:0.467190
+3 1:0.464708 2:0.120618 3:0.178631 4:0.107419 5:0.458566
+3 1:0.085904 2:0.100485 3:0.439993 4:0.253806 5:0.225162
+3 1:0.309424 2:0.219766 3:0.979150 4:0.992466 5:0.781002
+2 1:0.978544 2:0.831527 3:0.924340 4:0.472260 5:0.854699
+1 1:0.838158 2:0.539927 3:0.033527 4:0.939022 5:0.858939
+0 1:0.478958 2:0.035299 3:0.321513 4:0.138026 5:0.344530
+1 1:0.420183 2:0.050261 3:0.168622 4:0.987962 5:0.081127
+3 1:0.529119 2:0.756857 3:0.525388 4:0.906647 5:0.015031
+2 1:0.757490 2:0.697273 3:0.672960 4:0.038027 5:0.567640
+1 1:0.877515 2:0.997738 3:0.140903 4:0.022510 5:0.275767
+1 1:0.707713 2:0.767267 3:0.146578 4:0.177046 5:0.390119
+2 1:0.126545 2:0.338924 3:0.290230 4:0.140717 5:0.193812
+1 1:0.438441 2:0.413887 3:0.715906 4:0.718078 5:0.037032
+1 1:0.394099 2:0.898491 3:0.851562 4:0.854452 5:0.200452
+3 1:0.554398 2:0.927987 3:0.506810 4:0.361222 5:0.339552
+1 1:0.651350 2:0.642931 3:0.992078 4:0.394123 5:0.356481
+2 1:0.323378 2:0.726388 3:0.278780 4:0.873152 5:0.760571
+2 1:0.134356 2:0.635057 3:0.085017 4:0.454603 5:0.756119
+2 1:0.980191 2:0.788963 3:0.817764 4:0.653687 5:0.556327
+0 1:0.008553 2:0.637250 3:0.401720 4:0.094888 5:0.483444
+3 1:0.422236 2:0.459310 3:0.469951 4:0.514213 5:0.806955
+2 1:0.794112 2:0.424444 3:0.341827 4:0.791480 5:0.355004
+1 1:0.650646 2:0.831209 3:0.360482 4:0.730525 5:0.253130
+0 1:0.899328 2:0.843065 3:0.668982 4:0.386870 5:0.767945
+3 1:0.459478 2:0.172103 3:0.115004 4:0.861834 5:0.497636
+3 1:0.502739 2:0.717756 3:0.324993 4:0.854486 5:0.691303
+3 1:0.694347 2:0.975218 3:0.749809 4:0.080061 5:0.216270
+0 1:0.204749 2:0.152912 3:0.364078 4:0.440324 5:0.064159
+0 1:0.731656 2:0.371814 3:0.359377 4:0.176542 5:0.820773
+0 1:0.247573 2:0.992306 3:0.796591 4:0.586521 5:0.681643
+0 1:0.057699 2:0.006356 3:0.445668 4:0.924739 5:0.316846
+2 1:0.542624 2:0.136532 3:0.997968 4:0.995610 5:0.035732
+1 1:0.394307 2:0.992210 3:0.872835 4:0.150469 5:0.909376
+2 1:0.857816 2:0.033190 3:0.880868 4:0.565811 5:0.665156
+1 1:0.942718 2:0.660032 3:0.178236 4:0.182812 5:0.087367
+2 1:0.831711 2:0.563728 3:0.384249 4:0.217499 5:0.294353
+0 1:0.876942 2:0.456118 3:0.673524 4:0.541833 5:0.584563
+3 1:0.576608 2:0.133179 3:0.125881 4:0.989455 5:0.918416
+1 1:0.750385 2:0.262965 3:0.308398 4:0.490595 5:0.058171
+2 1:0.650555 2:0.260526 3:0.103875 4:0.015988 5:0.965965
+0 1:0.256881 2:0.816037 3:0.178294 4:0.502192 5:0.982004
+2 1:0.197509 2:0.074589 3:0.638905 4:0.277399 5:0.396911
+2 1:0.354089 2:0.236236 3:0.093984 4:0.792231 5:0.763705
+1 1:0.616673 2:0.861175 3:0.073603 4:0.534483 5:0.935103
+3 1:0.251242 2:0.234976 3:0.922202 4:0.126030 5:0.462835
+1 1:0.120315 2:0.277710 3:0.911577 4:0.730788 5:0.657184
+2 1:0.106240 2:0.919003 3:0.650803 4:0.052300 5:0.717360
+2 1:0.047706 2:0.206002 3:0.439120 4:0.411402 5:0.153675
+2 1:0.573659 2:0.866770 3:0.863335 4:0.203366 5:0.234077
+1 1:0.035194 2:0.398743 3:0.510657 4:0.028964 5:0.619880
+3 1:0.729706 2:0.133357 3:0.864867 4:0.806695 5:0.681677
+3 1:0.745570 2:0.384475 3:0.758124 4:0.269669 5:0.134004
+2 1:0.625335 2:0.962249 3:0.365400 4:0.905473 5:0.574668
+0 1:0.127824 2:0.574420 3:0.925172 4:0.861584 5:0.651584
+0 1:0.283864 2:0.524622 3:0.822224 4:0.572662 5:0.054927
+0 1:0.634046 2:0.067980 3:0.154102 4:0.392269 5:0.630872
+0 1:0.715898 2:0.914851 3:0.767055 4:0.929095 5:0.545982
+1 1:0.265327 2:0.815794 3:0.787845 4:0.316523 5:0.931774
+3 1:0.989390 2:0.000113 3:0.033949 4:0.230697 5:0.298948
+1 1:0.894869 2:0.368693 3:0.138898 4:0.941772 5:0.320418
+0 1:0.230322 2:0.525224 3:0.071963 4:0.762348 5:0.378543
+1 1:0.615923 2:0.670154 3:0.398374 4:0.978306 5:0.445780
+2 1:0.225093 2:0.960227 3:0.620164 4:0.878473 5:0.018904
+0 1:0.527845 2:0.985475 3:0.661959 4:0.950368 5:0.760041
+3 1:0.153984 2:0.333098 3:0.228054 4:0.466053 5:0.466788
+1 1:0.721555 2:0.533151 3:0.648141 4:0.394710 5:0.959431
+2 1:0.838903 2:0.029601 3:0.941272 4:0.642579 5:0.999925
+3 1:0.704906 2:0.658367 3:0.698945 4:0.545432 5:0.448200
+1 1:0.244696 2:0.192652 3:0.219868 4:0.399418 5:0.217660
+1 1:0.834653 2:0.236075 3:0.051846 4:0.215874 5:0.286917
+0 1:0.391280 2:0.895263 3:0.587608 4:0.880774 5:0.779338
+2 1:0.215979 2:0.890866 3:0.158514 4:0.359504 5:0.023049
diff --git a/jvm/src/test/resources/fixtures/datasets/train.txt.query b/jvm/src/test/resources/fixtures/datasets/train.txt.query
new file mode 100644
index 0000000..52d733e
--- /dev/null
+++ b/jvm/src/test/resources/fixtures/datasets/train.txt.query
@@ -0,0 +1,7 @@
+18
+14
+14
+12
+16
+16
+10
diff --git a/jvm/src/test/scala/org/wikimedia/search/mjolnir/PythonUtilsSuite.scala b/jvm/src/test/scala/org/wikimedia/search/mjolnir/PythonUtilsSuite.scala
index 5fc66c7..6e1979a 100644
--- a/jvm/src/test/scala/org/wikimedia/search/mjolnir/PythonUtilsSuite.scala
+++ b/jvm/src/test/scala/org/wikimedia/search/mjolnir/PythonUtilsSuite.scala
@@ -1,5 +1,9 @@
 package org.wikimedia.search.mjolnir
 
+import java.nio.file.{Files, Paths}
+
+import org.apache.spark.api.java.JavaSparkContext
+
 class PythonUtilsSuite extends SharedSparkContext {
   test("query groups should maintain input order") {
     val sqlContext = spark.sqlContext
@@ -26,4 +30,43 @@
     val res = PythonUtils.calcQueryGroups(df, "queryId")
     assert(res == Seq(Seq(1, 2), Seq(2, 1, 3)))
   }
+
+  test("basic training") {
+    val base_path = { x: String => getClass.getResource(s"fixtures/datasets/$x") }
+    val params = Map(
+      "objective" -> "rank:ndcg",
+      "eval_metric" -> "ndcg@5"
+    )
+    val fold = Seq("train", "test").map { name =>
+      // We need to copy the resource files to the filesystem where xgboost can read them
+      val remote_path = getClass.getResource(s"/fixtures/datasets/$name.txt")
+      val remote_query_path = getClass.getResource(s"/fixtures/datasets/$name.txt.query")
+
+      // Using createTempFile to get a base dir + random name
+      val local_path = Files.createTempFile("mjolnir-test", "")
+      val local_query_path = Paths.get(local_path + ".query")
+
+      Files.delete(local_path)
+      Files.copy(remote_path.openStream(), local_path)
+      Files.copy(remote_query_path.openStream(), local_query_path)
+
+      name -> ("file:" + local_path.toString)
+    }.toMap
+
+    try {
+      val model = MlrXGBoost.trainWithFiles(
+        new JavaSparkContext(spark.sparkContext),
+        Array(fold), "train", params, numRounds = 5,
+        earlyStoppingRound = 0)
+      assert(model.summary.trainObjectiveHistory.length == 5)
+      assert(model.summary.testObjectiveHistory.nonEmpty)
+      assert(model.summary.testObjectiveHistory.get.length == 5)
+      assert(model.booster.getModelDump().length == 5)
+    } finally {
+      fold.values.foreach { local_path =>
+        Files.deleteIfExists(Paths.get(local_path))
+        Files.deleteIfExists(Paths.get(local_path + ".query"))
+      }
+    }
+  }
 }

-- 
To view, visit https://gerrit.wikimedia.org/r/403545
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ib5e8cd9d3e87e724f05b5ec0941c140aa5077d71
Gerrit-PatchSet: 1
Gerrit-Project: search/MjoLniR
Gerrit-Branch: master
Gerrit-Owner: EBernhardson <ebernhard...@wikimedia.org>
