Updated gitbook for Spark top-k join
Project: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/commit/4909deda
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/tree/4909deda
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/diff/4909deda

Branch: refs/heads/master
Commit: 4909deda546946a66de31195b9a3eaa120382c50
Parents: b2032af
Author: myui <[email protected]>
Authored: Thu Feb 2 11:48:00 2017 +0900
Committer: myui <[email protected]>
Committed: Thu Feb 2 11:48:00 2017 +0900

----------------------------------------------------------------------
 docs/gitbook/SUMMARY.md              |  5 +++++
 docs/gitbook/spark/misc/misc.md      |  0
 docs/gitbook/spark/misc/topk_join.md | 15 ++++++---------
 3 files changed, 11 insertions(+), 9 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/4909deda/docs/gitbook/SUMMARY.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/SUMMARY.md b/docs/gitbook/SUMMARY.md
index 33bb46c..76f7924 100644
--- a/docs/gitbook/SUMMARY.md
+++ b/docs/gitbook/SUMMARY.md
@@ -145,6 +145,11 @@
 
 * [Outlier Detection using Local Outlier Factor (LOF)](anomaly/lof.md)
 
+## Part X - Hivemall on Spark
+
+* [Generic features](spark/misc/misc.md)
+    * [Top-k Join processing](spark/misc/topk_join.md)
+
 ## Part X - External References
 
 * [Hivemall on Apache Spark](https://github.com/maropu/hivemall-spark)

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/4909deda/docs/gitbook/spark/misc/misc.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/spark/misc/misc.md b/docs/gitbook/spark/misc/misc.md
new file mode 100644
index 0000000..e69de29

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/4909deda/docs/gitbook/spark/misc/topk_join.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/spark/misc/topk_join.md b/docs/gitbook/spark/misc/topk_join.md
index 03e0a23..af3351d 100644
--- a/docs/gitbook/spark/misc/topk_join.md
+++ b/docs/gitbook/spark/misc/topk_join.md
@@ -21,13 +21,10 @@
 
 `top_k_join` is much IO-efficient as compared to regular joining + ranking operations because `top_k_join` drops unsatisfied records and writes only top-k records to disks during joins.
 
-<!-- toc -->
-
-# Notice
-
-* `top_k_join` is supported in the DataFrame of Spark v2.1.0 or later.
-* A type of `score` must be ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, or DecimalType.
-* If `k` is less than 0, the order is reverse and `top_k_join` joins the tail-K records of `rightDf`.
+> #### Caution
+> * `top_k_join` is supported in the DataFrame of Spark v2.1.0 or later.
+> * A type of `score` must be ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, or DecimalType.
+> * If `k` is less than 0, the order is reverse and `top_k_join` joins the tail-K records of `rightDf`.
 
 # Usage
 
@@ -61,7 +58,7 @@ For example, we have two tables below;
 In the two tables, the example computes the nearest `position` for `userId` in each `group`.
 The standard way using DataFrame window functions would be as follows:
 
-```
+```scala
 val computeDistanceFunc =
   sqrt(pow(inputDf("x") - masterDf("x"), lit(2.0)) + pow(inputDf("y") - masterDf("y"), lit(2.0)))
 
@@ -76,7 +73,7 @@ leftDf.join(
 
 You can use `top_k_join` as follows:
 
-```
+```scala
 leftDf.top_k_join(
   k = lit(-1),
   right = rightDf,
