GitHub user tashoyan opened a pull request:

    https://github.com/apache/spark/pull/20578

    [SPARK-23318][ML] FP-growth: WARN FPGrowth: Input data is not cached

    ## What changes were proposed in this pull request?
    
    Cache the RDD of items in ml.FPGrowth before passing it to mllib.FPGrowth.
    Cache only when the user did not cache the input dataset of transactions. This
    fixes the warning about uncached data emitted by mllib.FPGrowth.
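
The conditional caching described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the object `ConditionalCaching` and the helper `withItemsCached` are hypothetical names, and the `MEMORY_AND_DISK` storage level is an assumption. Only `Dataset.storageLevel`, `RDD.persist`, `RDD.unpersist`, and `StorageLevel` are existing Spark APIs.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// Hypothetical helper for illustration only; not part of Spark or of this patch.
object ConditionalCaching {

  /** Runs `body` on `items`, caching `items` only if `transactions` is not already cached. */
  def withItemsCached[T, R](transactions: Dataset[_], items: RDD[T])(body: RDD[T] => R): R = {
    // StorageLevel.NONE means the user did not cache or persist the input dataset.
    val handlePersistence = transactions.storageLevel == StorageLevel.NONE

    // Assumption: MEMORY_AND_DISK is a reasonable level; the patch may choose another.
    if (handlePersistence) items.persist(StorageLevel.MEMORY_AND_DISK)

    try body(items)
    finally {
      // Release only what this helper cached; a user-managed cache is left untouched.
      if (handlePersistence) items.unpersist()
    }
  }
}
```

With a helper like this, `body` would be the call into mllib.FPGrowth. Since `body` returns before the RDD is unpersisted, any result that still depends lazily on `items` would be recomputed from the uncached lineage, so this sketch assumes `body` materializes what it needs.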
    
    ## How was this patch tested?
    
    Manually:
    1. Run ml.FPGrowthExample - warning is there
    2. Apply the fix
    3. Run ml.FPGrowthExample again - no warning anymore

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tashoyan/spark SPARK-23318

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20578.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20578
    
----
commit d17d3fbee84fcb0072d3030f3118ca18ce783e0c
Author: Arseniy Tashoyan <tashoyan@...>
Date:   2018-02-10T21:16:51Z

    [SPARK-23318][ML] Workaround for 'ArrayStoreException: [Ljava.lang.Object'
    when trying to cache the RDD of items.

commit e0eb8519bf09db12f5d5bc426eaf17d6488e05c1
Author: Arseniy Tashoyan <tashoyan@...>
Date:   2018-02-11T15:21:39Z

    [SPARK-23318][ML] Cache the RDD of items if the user did not cache the
    input dataset of transactions. This should eliminate the warning about uncached
    data in mllib.FPGrowth.

commit 374a49c2bf447f3ddfed655f6eda9c8cd5f45285
Author: Arseniy Tashoyan <tashoyan@...>
Date:   2018-02-11T15:23:58Z

    Merge remote-tracking branch 'upstream/master' into SPARK-23318

----

