EBernhardson has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/395062 )

Change subject: Add option to train using external memory
......................................................................

Add option to train using external memory

I'm not sure what exactly has changed, but I'm unable to complete a full
round of training on wikis with large (~35M) numbers of observations;
the executors keep getting killed by Spark. I tried increasing memory
overhead from 9G to 12G but it still keeps dying. I'm wary of allocating
even more memory than that, as we are already asking for a significant
percentage of cluster memory.

Take advantage of xgboost's external memory implementation to prevent
the memory explosion. This essentially writes the feature matrix out to
disk and memory-maps it, relying on the kernel's disk cache to keep it
in memory where possible. This is likely a little slower, but still
faster than repeatedly killing executors and restarting training.
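
For reference, in the plain (non-Spark) xgboost Python API the same
external memory mode is enabled by appending a cache-file suffix to the
data path. A minimal sketch, assuming a libsvm-format file named
train.txt (the objective shown is illustrative, not taken from this
patch):

    import xgboost as xgb

    # The '#dtrain.cache' suffix tells xgboost to page the parsed
    # matrix through an on-disk cache instead of holding it in RAM.
    dtrain = xgb.DMatrix('train.txt#dtrain.cache')
    booster = xgb.train({'objective': 'rank:ndcg'}, dtrain,
                        num_boost_round=100)

In the Spark wrapper the equivalent is the use_external_memory flag to
XGBoostModel.trainWithDataFrame, which this change wires through.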

Change-Id: Ie283c1c58d8395054164f1c0157e1a709d1cccc4
---
M example_train.yaml
M mjolnir/test/fixtures/load_config/example_train.expect
M mjolnir/training/xgboost.py
M mjolnir/utilities/training_pipeline.py
4 files changed, 18 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/search/MjoLniR refs/changes/62/395062/1

diff --git a/example_train.yaml b/example_train.yaml
index 5784421..9cfb8a3 100644
--- a/example_train.yaml
+++ b/example_train.yaml
@@ -138,6 +138,7 @@
                     cv-jobs: 22
                     folds: 3
                     final-trees: 100
+                    use-external-memory: yes
 
     medium:
         # 4M to 12M observations per executor.
diff --git a/mjolnir/test/fixtures/load_config/example_train.expect b/mjolnir/test/fixtures/load_config/example_train.expect
index 23e536f..75233f6 100644
--- a/mjolnir/test/fixtures/load_config/example_train.expect
+++ b/mjolnir/test/fixtures/load_config/example_train.expect
@@ -243,6 +243,7 @@
           folds: '3'
           input: hdfs://analytics-hadoop/user/pytest/mjolnir/marker
           output: /home/pytest/training_size/marker_large
+          use-external-memory: 'True'
           workers: '3'
         environment:
           HOME: /home/pytest
diff --git a/mjolnir/training/xgboost.py b/mjolnir/training/xgboost.py
index 6d1f70b..03e8599 100644
--- a/mjolnir/training/xgboost.py
+++ b/mjolnir/training/xgboost.py
@@ -108,7 +108,7 @@
     return retval
 
 
-def train(df, params, num_workers=None):
+def train(df, params, num_workers=None, use_external_memory=False):
     """Train a single xgboost ranking model.
 
     df : pyspark.sql.DataFrame
@@ -168,6 +168,7 @@
     try:
         return XGBoostModel.trainWithDataFrame(df_grouped, params, num_rounds,
                                                num_workers, feature_col='features',
+                                               use_external_memory=use_external_memory,
                                                label_col='label')
     finally:
         if unpersist:
diff --git a/mjolnir/utilities/training_pipeline.py 
b/mjolnir/utilities/training_pipeline.py
index 3ee6bd2..dae13ab 100644
--- a/mjolnir/utilities/training_pipeline.py
+++ b/mjolnir/utilities/training_pipeline.py
@@ -51,7 +51,7 @@
 
 
 def run_pipeline(sc, sqlContext, input_dir, output_dir, wikis, initial_num_trees, final_num_trees,
-                 num_workers, num_cv_jobs, num_folds, test_dir, zero_features):
+                 num_workers, num_cv_jobs, num_folds, test_dir, zero_features, use_external_memory):
     for wiki in wikis:
         print 'Training wiki: %s' % (wiki)
         df_hits_with_features = (
@@ -98,7 +98,8 @@
         df_grouped, j_groups = mjolnir.training.xgboost.prep_training(
             df_hits_with_features, num_workers)
         best_params['groupData'] = j_groups
-        model = mjolnir.training.xgboost.train(df_grouped, best_params)
+        model = mjolnir.training.xgboost.train(
+                df_grouped, best_params, use_external_memory=use_external_memory)
 
         tune_results['metrics']['train'] = model.eval(df_grouped, j_groups)
         df_grouped.unpersist()
@@ -142,6 +143,14 @@
         print 'Wrote xgboost binary model to %s' % (xgb_model_output)
         print ''
 
+def str_to_bool(value):
+    if value.lower() in ['true', 'yes', '1']:
+        return True
+    elif value.lower() in ['false', 'no', '0']:
+        return False
+    else:
+        raise ValueError("Unknown boolean string: " + value)
+
 
 def parse_arguments(argv):
     parser = argparse.ArgumentParser(description='Train XGBoost ranking models')
@@ -168,6 +177,9 @@
         '--initial-trees', dest='initial_num_trees', default=100, type=int,
         help='Number of trees to perform hyperparameter tuning with.  (Default: 100)')
     parser.add_argument(
+        '-e', '--use-external-memory', dest='use_external_memory', default=False,
+        type=str_to_bool, help='Use external memory for the feature matrix')
+    parser.add_argument(
         '--final-trees', dest='final_num_trees', default=None, type=int,
         help='Number of trees in the final ensemble. If not provided the value from '
              + '--initial-trees will be used.  (Default: None)')
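
A side note on the str_to_bool helper added above: argparse's builtin
type=bool would treat any non-empty string as True (bool('false') is
True), which is why the explicit parser is needed. A minimal,
self-contained sketch of how the new flag parses, reusing the helper
from this patch:

    import argparse

    def str_to_bool(value):
        if value.lower() in ['true', 'yes', '1']:
            return True
        elif value.lower() in ['false', 'no', '0']:
            return False
        else:
            raise ValueError("Unknown boolean string: " + value)

    parser = argparse.ArgumentParser()
    parser.add_argument('-e', '--use-external-memory',
                        dest='use_external_memory', default=False,
                        type=str_to_bool)

    # '-e yes' -> True, '-e 0' -> False, flag omitted -> default False
    assert parser.parse_args(['-e', 'yes']).use_external_memory is True
    assert parser.parse_args(['-e', '0']).use_external_memory is False
    assert parser.parse_args([]).use_external_memory is False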

-- 
To view, visit https://gerrit.wikimedia.org/r/395062
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ie283c1c58d8395054164f1c0157e1a709d1cccc4
Gerrit-PatchSet: 1
Gerrit-Project: search/MjoLniR
Gerrit-Branch: master
Gerrit-Owner: EBernhardson <ebernhard...@wikimedia.org>
