[ 
https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875327#action_12875327
 ] 

Julien Le Dem commented on MAHOUT-106:
--------------------------------------

As an optimization, I would change the unit of processing from a single value 
to a vector holding the values for all z.
tot_z is a controlled value and will always be small enough for the 
vector to fit in memory.

In the pig script that would translate as follows. The same applies to the 
java code.
p_s_z (s:chararray, z:int, p_s_z:float) becomes p_s (s:chararray, p_s:tuple)
p_z_u (u:chararray, z:int, p_z_u:float) becomes p_u (u:chararray, p_u:tuple)

The tuples p_s and p_u are each of size tot_z.
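
To illustrate the change of schema, here is a minimal Java sketch that packs 
per-z rows into one dense vector per key. The class and method names 
(TopicVectorPacker, Row, pack) are hypothetical and not taken from the 
attached patches, and it assumes z is a 0-based index in [0, tot_z).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of plsi-java.patch.
final class TopicVectorPacker {

    /** One row of a per-z relation, e.g. p_s_z (s, z, p_s_z). */
    static final class Row {
        final String key;
        final int z;
        final float p;

        Row(String key, int z, float p) {
            this.key = key;
            this.z = z;
            this.p = p;
        }
    }

    /**
     * Groups rows by key and stores each probability at index z of a
     * vector of length totZ. Assumes z is 0-based; missing z values stay 0.
     */
    static Map<String, float[]> pack(List<Row> rows, int totZ) {
        Map<String, float[]> vectors = new LinkedHashMap<>();
        for (Row r : rows) {
            if (r.z < 0 || r.z >= totZ) {
                throw new IllegalArgumentException("z out of range: " + r.z);
            }
            vectors.computeIfAbsent(r.key, k -> new float[totZ])[r.z] = r.p;
        }
        return vectors;
    }
}
```

With tot_z = 2, the rows (s1, 0, 0.25), (s1, 1, 0.75) collapse into a single 
record s1 -> [0.25, 0.75], so z no longer appears as a column.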

Then you need UDFs that operate on arrays instead of single values. (In 
map reduce, you need to change the reduce code to work on arrays.)
The same applies to all intermediate datasets that carry the z value.
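
As a rough sketch of what "operations on arrays" could look like inside a 
UDF or a reducer, the Java below sums and normalizes vectors element-wise. 
The names (TopicVectorOps, sum, normalize) are hypothetical, and these two 
operations are only assumed examples; the actual EM update in the patches 
may need different ones.

```java
// Hypothetical helper for illustration only; not part of plsi-java.patch.
final class TopicVectorOps {

    /** Element-wise sum of all vectors for one key; each has length totZ. */
    static float[] sum(Iterable<float[]> vectors, int totZ) {
        float[] acc = new float[totZ];
        for (float[] v : vectors) {
            if (v.length != totZ) {
                throw new IllegalArgumentException("expected length " + totZ);
            }
            for (int z = 0; z < totZ; z++) {
                acc[z] += v[z];
            }
        }
        return acc;
    }

    /** Returns a copy scaled to sum to 1; an all-zero vector is returned as is. */
    static float[] normalize(float[] v) {
        float total = 0f;
        for (float x : v) {
            total += x;
        }
        float[] out = v.clone();
        if (total != 0f) {
            for (int z = 0; z < out.length; z++) {
                out[z] /= total;
            }
        }
        return out;
    }
}
```

A reducer keyed on s or u would then receive a handful of vectors per key 
rather than tot_z times as many scalar records keyed on (s, z) or (u, z).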

That would remove several reduce steps from the process and shrink the 
intermediate data, which should improve performance significantly.

> PLSI/EM in pig based on hofmann's ACM 04 paper. 
> ------------------------------------------------
>
>                 Key: MAHOUT-106
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-106
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>         Environment: Pig/Hadoop 
>            Reporter: Prasen Mukherjee
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: plsi-java.patch, plsi_pig.patch
>
>
> Based on the following paper by hofmann : T. Hofmann Latent Semantic Models 
> for Collaborative Filtering In ACM Transactions on Information Systems, 2004, 
>  vol 22(1), pp. 89-115.
