Repository: spark
Updated Branches:
  refs/heads/master 56b0f5f4d -> e3bf37fa3


Fix example of tf_idf with minDocFreq

## What changes were proposed in this pull request?

The Python example for tf_idf with the "minDocFreq" parameter is not set up 
properly: the same variable is used to transform the documents both with and 
without "minDocFreq".
The IDF(minDocFreq=2) model is stored in the variable "idfIgnore", but the 
original variable "idf" is then used to transform "tf" instead of "idfIgnore".
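
To illustrate the mix-up without a Spark cluster, here is a minimal pure-Python sketch. It is not the MLlib implementation: the helpers `fit_idf` and `transform` and the toy documents are hypothetical, and the weight `log((m + 1) / (df + 1))` is assumed from the smoothed IDF form described in the MLlib documentation.

```python
import math
from collections import Counter


def fit_idf(docs, min_doc_freq=0):
    """Return {term: idf}; terms in fewer than min_doc_freq docs get 0.0."""
    num_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {
        term: math.log((num_docs + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for term, df in doc_freq.items()
    }


def transform(idf_model, docs):
    """Scale each document's raw term counts by the fitted IDF weights."""
    return [
        {term: count * idf_model[term] for term, count in Counter(doc).items()}
        for doc in docs
    ]


# Made-up data: every term occurs in exactly one document.
docs = [["a", "a", "a"], ["b", "b", "b"], ["c", "c", "c"]]

idf = fit_idf(docs)
idf_ignore = fit_idf(docs, min_doc_freq=2)

tfidf = transform(idf, docs)
buggy_ignore = transform(idf, docs)         # wrong model, as before the patch
fixed_ignore = transform(idf_ignore, docs)  # right model, as after the patch

print(buggy_ignore == tfidf)  # the bug makes both results identical
print(fixed_ignore)
```

Because every toy term has a document frequency of 1, fitting with `min_doc_freq=2` zeroes all weights, which mirrors the shape of the results reported in the test section.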

## How was this patch tested?

Before the fix, the results for "tfidf" and "tfidfIgnore" were the same:
tfidf:
(1048576,[1046921],[3.75828890549])
(1048576,[1046920],[3.75828890549])
(1048576,[1046923],[3.75828890549])
(1048576,[892732],[3.75828890549])
(1048576,[892733],[3.75828890549])
(1048576,[892734],[3.75828890549])
tfidfIgnore:
(1048576,[1046921],[3.75828890549])
(1048576,[1046920],[3.75828890549])
(1048576,[1046923],[3.75828890549])
(1048576,[892732],[3.75828890549])
(1048576,[892733],[3.75828890549])
(1048576,[892734],[3.75828890549])

After the fix, the results are as expected:
tfidf:
(1048576,[1046921],[3.75828890549])
(1048576,[1046920],[3.75828890549])
(1048576,[1046923],[3.75828890549])
(1048576,[892732],[3.75828890549])
(1048576,[892733],[3.75828890549])
(1048576,[892734],[3.75828890549])
tfidfIgnore:
(1048576,[1046921],[0.0])
(1048576,[1046920],[0.0])
(1048576,[1046923],[0.0])
(1048576,[892732],[0.0])
(1048576,[892733],[0.0])
(1048576,[892734],[0.0])
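
As a sanity check on these numbers, the sketch below reproduces them by hand. The six documents and the document frequency of 1 are read off the output above (six vectors, each with a distinct index); the term count of 3 per document and the weight `log((m + 1) / (df + 1))` are assumptions, not something stated in this message.

```python
import math

num_docs = 6    # six vectors are printed above
doc_freq = 1    # each index appears in only one vector
term_count = 3  # assumption: the term occurs three times in its document


def idf_weight(num_docs, doc_freq, min_doc_freq=0):
    """Smoothed IDF; terms below min_doc_freq are weighted 0.0."""
    if doc_freq < min_doc_freq:
        return 0.0
    return math.log((num_docs + 1) / (doc_freq + 1))


tfidf_value = term_count * idf_weight(num_docs, doc_freq)
tfidf_ignore_value = term_count * idf_weight(num_docs, doc_freq, min_doc_freq=2)

print(tfidf_value)         # close to the 3.75828890549 shown for "tfidf"
print(tfidf_ignore_value)  # 0.0, as shown for "tfidfIgnore"
```

Under these assumptions, every term falls below minDocFreq=2, so all "tfidfIgnore" entries are expected to be zero.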

Author: Maxime Rihouey <maxime.riho...@gmail.com>

Closes #15503 from maximerihouey/patch-1.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e3bf37fa
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e3bf37fa
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e3bf37fa

Branch: refs/heads/master
Commit: e3bf37fa3ada43624b2e77bef90ad3d3dbcd8ce1
Parents: 56b0f5f
Author: Maxime Rihouey <maxime.riho...@gmail.com>
Authored: Mon Oct 17 10:56:22 2016 +0100
Committer: Sean Owen <so...@cloudera.com>
Committed: Mon Oct 17 10:56:22 2016 +0100

----------------------------------------------------------------------
 examples/src/main/python/mllib/tf_idf_example.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/e3bf37fa/examples/src/main/python/mllib/tf_idf_example.py
----------------------------------------------------------------------
diff --git a/examples/src/main/python/mllib/tf_idf_example.py b/examples/src/main/python/mllib/tf_idf_example.py
index c4d5333..b66412b 100644
--- a/examples/src/main/python/mllib/tf_idf_example.py
+++ b/examples/src/main/python/mllib/tf_idf_example.py
@@ -43,7 +43,7 @@ if __name__ == "__main__":
     # In such cases, the IDF for these terms is set to 0.
     # This feature can be used by passing the minDocFreq value to the IDF constructor.
     idfIgnore = IDF(minDocFreq=2).fit(tf)
-    tfidfIgnore = idf.transform(tf)
+    tfidfIgnore = idfIgnore.transform(tf)
     # $example off$
 
     print("tfidf:")

