Change Iterable<Cooccurrence> in
org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.SimilarityReducer to
list or array to improve the performance
----------------------------------------------------------------------------------------------------------------------------------------------------------
Key: MAHOUT-467
URL: https://issues.apache.org/jira/browse/MAHOUT-467
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Affects Versions: 0.4
Reporter: Hui Wen Han
Fix For: 0.4
In class AbstractDistributedVectorSimilarity:
  protected int countElements(Iterator<?> iterator) {
    int count = 0;
    while (iterator.hasNext()) {
      count++;
      iterator.next();
    }
    return count;
  }
The method countElements is called repeatedly, but it performs poorly.
If the iterator has a million elements, we have to iterate a million times just
to get the count.
This method is used in many places:
1) DistributedCooccurrenceVectorSimilarity
  public class DistributedCooccurrenceVectorSimilarity extends
      AbstractDistributedVectorSimilarity {
    @Override
    protected double doComputeResult(int rowA, int rowB,
        Iterable<Cooccurrence> cooccurrences, double weightOfVectorA,
        double weightOfVectorB, int numberOfColumns) {
      return countElements(cooccurrences);
    }
  }
One item may be liked by many people; in our system, a single item may be liked
by a hundred thousand users.
Here doComputeResult just returns the number of elements in cooccurrences, but
it has to iterate a hundred thousand times to do so.
If we used a List or an array, we could get the result in one call, because the
size is already known once the List or array has been constructed.
2) DistributedLoglikelihoodVectorSimilarity
3) DistributedTanimotoCoefficientVectorSimilarity
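As a rough illustration of the proposal, the sketch below returns size()
directly when the input happens to be a Collection and falls back to a full
pass only otherwise. It is not Mahout code: the class CountElementsSketch and
its countElements helper are hypothetical names, and whether the reducer can
afford to buffer its values into a list is a separate question that the sketch
does not address.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; not part of Mahout.
class CountElementsSketch {

  // Returns the element count. A Collection already knows its size, so no
  // iteration is needed; any other Iterable is counted with one full pass.
  public static int countElements(Iterable<?> elements) {
    if (elements instanceof Collection) {
      return ((Collection<?>) elements).size();
    }
    int count = 0;
    for (Iterator<?> it = elements.iterator(); it.hasNext(); it.next()) {
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    List<Integer> buffered = new ArrayList<>();
    for (int i = 0; i < 100000; i++) {
      buffered.add(i);
    }
    // Collection path: answered by size() without touching the elements.
    System.out.println(countElements(buffered));

    // A plain Iterable (not a Collection) still needs the full pass.
    Iterable<Integer> streamed = buffered::iterator;
    System.out.println(countElements(streamed));
  }
}
```

Both calls return the same count; only the first avoids walking all of the
elements.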
I ran a test using DistributedCooccurrenceVectorSimilarity:
it took 4.5 hours to run RowSimilarityJob-CooccurrencesMapper-SimilarityReducer.