[PIO-111] Documentation for `pio batchpredict`

Closes #418
Project: http://git-wip-us.apache.org/repos/asf/incubator-predictionio/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-predictionio/commit/0d2e06d5
Tree: http://git-wip-us.apache.org/repos/asf/incubator-predictionio/tree/0d2e06d5
Diff: http://git-wip-us.apache.org/repos/asf/incubator-predictionio/diff/0d2e06d5

Branch: refs/heads/livedoc
Commit: 0d2e06d54524946ecc1ebfed085f58f36c1100e7
Parents: f752690
Author: Mars Hall <[email protected]>
Authored: Thu Aug 3 11:49:27 2017 -0700
Committer: Donald Szeto <[email protected]>
Committed: Thu Aug 3 11:49:27 2017 -0700

----------------------------------------------------------------------
 docs/manual/Gemfile.lock                      |   8 +-
 docs/manual/data/nav/main.yml                 |   3 +
 docs/manual/source/batchpredict/index.html.md | 148 +++++++++++++++++++++
 docs/manual/source/cli/index.html.md          |   7 +-
 4 files changed, 161 insertions(+), 5 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-predictionio/blob/0d2e06d5/docs/manual/Gemfile.lock
----------------------------------------------------------------------
diff --git a/docs/manual/Gemfile.lock b/docs/manual/Gemfile.lock
index 947840f..ecf6194 100644
--- a/docs/manual/Gemfile.lock
+++ b/docs/manual/Gemfile.lock
@@ -91,7 +91,7 @@ GEM
     kramdown (1.6.0)
     launchy (2.4.3)
       addressable (~> 2.3)
-    libv8 (3.16.14.13)
+    libv8 (3.16.14.19)
     listen (2.10.0)
       celluloid (~> 0.16.0)
       rb-fsevent (>= 0.9.3)
@@ -200,8 +200,8 @@ GEM
       sprockets (~> 2.0)
       tilt (~> 1.1)
     temple (0.7.6)
-    therubyracer (0.12.2)
-      libv8 (~> 3.16.14.0)
+    therubyracer (0.12.3)
+      libv8 (~> 3.16.14.15)
       ref
     thor (0.19.1)
     thread_safe (0.3.5)
@@ -253,4 +253,4 @@ DEPENDENCIES
   wdm (~> 0.1.0)
 
 BUNDLED WITH
-   1.11.2
+   1.15.3

http://git-wip-us.apache.org/repos/asf/incubator-predictionio/blob/0d2e06d5/docs/manual/data/nav/main.yml
----------------------------------------------------------------------
diff --git a/docs/manual/data/nav/main.yml
b/docs/manual/data/nav/main.yml
index 245fbaf..d80d550 100644
--- a/docs/manual/data/nav/main.yml
+++ b/docs/manual/data/nav/main.yml
@@ -62,6 +62,9 @@ root:
         body: 'Engine Command-line Interface'
         url: '/cli/#engine-commands'
       -
+        body: 'Batch Predictions'
+        url: '/batchpredict/'
+      -
         body: 'Monitoring Engine'
         url: '/deploy/monitoring/'
       -

http://git-wip-us.apache.org/repos/asf/incubator-predictionio/blob/0d2e06d5/docs/manual/source/batchpredict/index.html.md
----------------------------------------------------------------------
diff --git a/docs/manual/source/batchpredict/index.html.md b/docs/manual/source/batchpredict/index.html.md
new file mode 100644
index 0000000..38ddb3b
--- /dev/null
+++ b/docs/manual/source/batchpredict/index.html.md
@@ -0,0 +1,148 @@
+---
+title: Batch Predictions
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+##Overview
+Process predictions for many queries using efficient parallelization
+through Spark. Useful for mass auditing of predictions and for
+generating predictions to push into other systems.
+
+Batch predict reads and writes multi-object JSON files similar to the
+[batch import](/datacollection/batchimport/) format. JSON objects are separated
+by newlines and cannot themselves contain unencoded newlines.
+
+##Compatibility
+`pio batchpredict` loads the engine and processes queries exactly like
+`pio deploy`. There is only one additional requirement for engines
+to utilize batch predict:
+
+WARNING: All algorithm classes used in the engine must be
+[serializable](https://www.scala-lang.org/api/2.11.8/index.html#scala.Serializable).
+**This is already true for PredictionIO's base algorithm classes**, but may be broken
+by including non-serializable fields in their constructor. Using the
+[`@transient` annotation](http://fdahms.com/2015/10/14/scala-and-the-transient-lazy-val-pattern/)
+may help in these cases.
+
+This requirement is due to processing the input queries as a
+[Spark RDD](https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds)
+which enables high-performance parallelization, even on a single machine.
+
+##Usage
+
+### `pio batchpredict`
+
+Command to process bulk predictions. Takes the same options as `pio deploy` plus:
+
+### `--input <value>`
+
+Path to file containing queries; a multi-object JSON file with one
+query object per line. Accepts any valid Hadoop file URL.
+
+Default: `batchpredict-input.json`
+
+### `--output <value>`
+
+Path to file to receive results; a multi-object JSON file with one
+object per line, the prediction + original query. Accepts any
+valid Hadoop file URL. Actual output will be written as Hadoop
+partition files in a directory with the output name.
+
+Default: `batchpredict-output.json`
+
+### `--query-partitions <value>`
+
+Configure the concurrency of predictions by setting the number of partitions
+used internally for the RDD of queries. This will directly affect the
+number of resulting `part-*` output files. While setting to `1` may seem
+appealing to get a single output file, this will remove parallelization
+for the batch process, reducing performance and possibly exhausting memory.
+
+Default: number created by Spark context's `textFile` (probably the number
+of cores available on the local machine)
+
+### `--engine-instance-id <value>`
+
+Identifier for the trained instance to use for batch predict.
+
+Default: the latest trained instance.
+
+##Example
+
+###Input
+
+A multi-object JSON file of queries as they would be sent to the engine's
+HTTP Queries API.
+
+NOTE: Read via
+[SparkContext's `textFile`](https://spark.apache.org/docs/latest/rdd-programming-guide.html#external-datasets)
+and so may be a single file or any supported Hadoop format.
+
+File: `batchpredict-input.json`
+
+```json
+{"user":"1"}
+{"user":"2"}
+{"user":"3"}
+{"user":"4"}
+{"user":"5"}
+```
+
+###Execute
+
+```bash
+pio batchpredict \
+  --input batchpredict-input.json \
+  --output batchpredict-output.json
+```
+
+This command will run to completion, aborting if any errors are encountered.
+
+###Output
+
+A multi-object JSON file of predictions + original queries. The predictions
+are JSON objects as they would be returned from the engine's HTTP Queries API.
+
+NOTE: Results are written via Spark RDD's `saveAsTextFile` so each partition
+will be written to its own `part-*` file.
+See [post-processing results](#post-processing-results).
+
+File 1: `batchpredict-output.json/part-00000`
+
+```json
+{"query":{"user":"1"},"prediction":{"itemScores":[{"item":"1","score":33},{"item":"2","score":32}]}}
+{"query":{"user":"3"},"prediction":{"itemScores":[{"item":"2","score":16},{"item":"3","score":12}]}}
+{"query":{"user":"4"},"prediction":{"itemScores":[{"item":"3","score":19},{"item":"1","score":18}]}}
+```
+
+File 2: `batchpredict-output.json/part-00001`
+
+```json
+{"query":{"user":"2"},"prediction":{"itemScores":[{"item":"5","score":55},{"item":"3","score":28}]}}
+{"query":{"user":"5"},"prediction":{"itemScores":[{"item":"1","score":24},{"item":"4","score":14}]}}
+```
+
+###Post-processing Results
+
+After the process exits successfully, the parts may be concatenated into a
+single output file using a command like:
+
+```bash
+cat batchpredict-output.json/part-* > batchpredict-output-all.json
+```

http://git-wip-us.apache.org/repos/asf/incubator-predictionio/blob/0d2e06d5/docs/manual/source/cli/index.html.md
----------------------------------------------------------------------
diff --git a/docs/manual/source/cli/index.html.md b/docs/manual/source/cli/index.html.md
index e4927fd..3735348 100644
--- a/docs/manual/source/cli/index.html.md
+++ b/docs/manual/source/cli/index.html.md
@@ -67,4 +67,9 @@ third-party informational messages.
 
 ```pio train``` Kick off a training using an engine.
 
-```pio deploy``` Deploy an engine as an engine server. If no instance ID is specified, it will deploy the latest instance.
+```pio deploy``` Deploy an engine as an engine server.
+
+```pio batchpredict``` Process bulk predictions using an engine.
+
+For ```deploy``` & ```batchpredict```, if ```--engine-instance-id``` is not
+specified, it will use the latest trained instance.
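
The post-processing step documented in the new batchpredict page (concatenating the `part-*` files) could also be done in a script that checks each line along the way. The following is a minimal sketch in Python, not part of this commit. The function name `merge_parts`, the merged file name, and the check for `query` and `prediction` keys are illustrative assumptions based on the example output shown in the page. The sample records are copied from that example and written to a temporary directory, so nothing here calls `pio` or Spark.

```python
import glob
import json
import os
import tempfile


def merge_parts(output_dir, merged_path):
    """Concatenate part-* files in name order, checking every line is valid JSON.

    Hypothetical helper: the name and the key check are assumptions taken from
    the example output in the batchpredict page, not from PredictionIO itself.
    """
    count = 0
    # glob does not match dotfiles, so hidden checksum files are skipped.
    part_paths = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    with open(merged_path, "w", encoding="utf-8") as merged:
        for part_path in part_paths:
            with open(part_path, encoding="utf-8") as part:
                for line in part:
                    line = line.strip()
                    if not line:
                        continue
                    record = json.loads(line)  # raises ValueError if malformed
                    if "query" not in record or "prediction" not in record:
                        raise ValueError("unexpected record in " + part_path)
                    merged.write(line + "\n")
                    count += 1
    return count


# Sample data taken from the example output in the documentation.
sample_parts = {
    "part-00000": [
        '{"query":{"user":"1"},"prediction":{"itemScores":[{"item":"1","score":33},{"item":"2","score":32}]}}',
        '{"query":{"user":"3"},"prediction":{"itemScores":[{"item":"2","score":16},{"item":"3","score":12}]}}',
        '{"query":{"user":"4"},"prediction":{"itemScores":[{"item":"3","score":19},{"item":"1","score":18}]}}',
    ],
    "part-00001": [
        '{"query":{"user":"2"},"prediction":{"itemScores":[{"item":"5","score":55},{"item":"3","score":28}]}}',
        '{"query":{"user":"5"},"prediction":{"itemScores":[{"item":"1","score":24},{"item":"4","score":14}]}}',
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "batchpredict-output.json")
    os.makedirs(out_dir)
    for name, lines in sample_parts.items():
        with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
            f.write("\n".join(lines) + "\n")

    merged_file = os.path.join(tmp, "batchpredict-output-all.json")
    total = merge_parts(out_dir, merged_file)
    with open(merged_file, encoding="utf-8") as f:
        merged_lines = f.read().splitlines()

# Users appear in part order (1, 3, 4, 2, 5), not input order.
merged_users = [json.loads(line)["query"]["user"] for line in merged_lines]
print(total, merged_users)
```

As with the `cat` command, the merged file follows partition order rather than the order of the input queries, so a consumer that needs the original order would have to sort or join on the `query` field.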
