I'm catching up on some mail and came across this patch. It looks OK
to me, though I'm not too familiar with the nuances of running on
EMR.

I'm unit testing it now, but I wanted to ask: what is the policy on
committing patches delivered via a link?  Should I request a resubmit
as a JIRA attachment before applying this?  If there are no objections
(based on that or otherwise), I'll probably take this patch as my
first commit.

-tom

On Wed, Feb 22, 2012 at 12:18 AM, Matteo Riondato (Created) (JIRA)
<[email protected]> wrote:
> Patch to make PFPGrowth run on Amazon MapReduce (also shows patterns for 
> making other algorithms work in Amazon MapReduce)
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-980
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-980
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6, 0.5, 0.7
>         Environment: Amazon MapReduce
>            Reporter: Matteo Riondato
>             Fix For: 0.7
>
>
> The patch at http://www.cs.brown.edu/~matteo/PFPGrowth.java.diff (against 
> trunk as of Wed Feb 22 00:07:35 EST 2012, revision 1292127) makes it possible 
> to run PFPGrowth on Elastic MapReduce.
>
> The problem was in the way the fList stored in the DistributedCache was 
> accessed. DistributedCache.getCacheFiles(conf) should be reserved for 
> internal use according to the Hadoop API Documentation. The suggested way to 
> access the files in the DistributedCache is through 
> DistributedCache.getLocalCacheFiles(conf) and then through a LocalFileSystem. 
> This is what the patch does. Note that there is a fallback case if we are 
> running PFPGrowth with "-method mapreduce" but locally (e.g. when HADOOP_HOME 
> is not set or MAHOUT_LOCAL is set). In this case, we use 
> DistributedCache.getCacheFiles() as it is done in the unpatched version.
>
> A quick grep in the source tree shows that there are other places where 
> DistributedCache.getCacheFiles(conf) is used. It may be worth checking 
> whether the corresponding algorithms can be made to work in Amazon MapReduce 
> by fixing them in a similar fashion.
>
> The patch was also tested outside Amazon MapReduce and does not change any 
> other functionality.
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA 
> administrators: 
> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>
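
For anyone skimming the thread, here is a rough sketch of the selection logic the report describes: prefer the node-local copies of the cached files, and fall back to the cache URIs when running locally. This is not the code from the patch. To keep it compilable without Hadoop on the classpath, the two arrays are plain stand-ins for the results of DistributedCache.getLocalCacheFiles(conf) and DistributedCache.getCacheFiles(conf); the class name, the method name, the "local:"/"uri:" prefixes, and the null-or-empty test used to detect the local case are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mahout or Hadoop.
final class CacheFileResolver {

    private CacheFileResolver() {
    }

    /**
     * Chooses where to read the cached files from.
     *
     * @param localFiles stand-in for DistributedCache.getLocalCacheFiles(conf);
     *                   assumed to be null or empty when the job runs locally
     * @param cacheUris  stand-in for DistributedCache.getCacheFiles(conf)
     * @return locations tagged "local:" (read through the local file system)
     *         or "uri:" (fallback, read as the unpatched code does)
     */
    static List<String> resolve(String[] localFiles, URI[] cacheUris) {
        List<String> locations = new ArrayList<>();

        // Preferred case: the framework has localized the files on this node.
        if (localFiles != null && localFiles.length > 0) {
            for (String path : localFiles) {
                locations.add("local:" + path);
            }
            return locations;
        }

        // Fallback case: nothing was localized, so use the original URIs.
        if (cacheUris != null) {
            for (URI uri : cacheUris) {
                locations.add("uri:" + uri);
            }
        }
        return locations;
    }
}
```

In a real mapper the two arguments would come from the two DistributedCache calls named above, and the "local:" entries would be opened through the local file system rather than returned as strings.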
