[
https://issues.apache.org/jira/browse/MAHOUT-468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hui Wen Han updated MAHOUT-468:
-------------------------------
Description:
I ran a test with the following data:
Preference records: 680,194
Distinct users: 23,246
Distinct items: 437,569
SIMILARITY_CLASS_NAME=SIMILARITY_COOCCURRENCE
maybePruneItemUserMatrixPath: 16.50M
weights: 13.80M
pairwiseSimilarity: 18.81G
Job RowSimilarityJob-RowWeightMapper-WeightedOccurrencesPerColumnReducer: took 32 sec
Job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer: took 4.30 hours
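As a rough reading of these numbers, the sketch below derives the average number of preferences per user and per item from the counts above. The class and method names are hypothetical; only the three counts come from the test.

```java
public class DatasetShape {
    // Counts taken from the test reported above.
    static final long PREFERENCES = 680_194L;
    static final long USERS = 23_246L;
    static final long ITEMS = 437_569L;

    /** Average number of preference records per distinct user. */
    static double prefsPerUser() {
        return (double) PREFERENCES / USERS;
    }

    /** Average number of preference records per distinct item. */
    static double prefsPerItem() {
        return (double) PREFERENCES / ITEMS;
    }

    public static void main(String[] args) {
        System.out.printf("prefs per user: %.2f%n", prefsPerUser());
        System.out.printf("prefs per item: %.2f%n", prefsPerItem());
    }
}
```

On these counts a user has roughly 29 preferences on average, while an item has fewer than two.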
I think the reasons may be the following:
1) We use SequenceFileOutputFormat, which means the job can only be run by n mappers or reducers concurrently (n = number of Hadoop nodes).
2) We store redundant information.
For example, the output of CooccurrencesMapper is:
(ItemIndexA,similarity),(ItemIndexA,ItemIndexB,similarity)
3) Some frequently used code, see
https://issues.apache.org/jira/browse/MAHOUT-467
4) Many local variables are allocated inside loops (needs confirmation).
In class DistributedUncenteredZeroAssumingCosineVectorSimilarity:
  @Override
  public double weight(Vector v) {
    double length = 0.0;
    Iterator<Element> elemIterator = v.iterateNonZero();
    while (elemIterator.hasNext()) {
      double value = elemIterator.next().get(); // this one
      length += value * value;
    }
    return Math.sqrt(length);
  }
5) Maybe we need to control the size of the cooccurrences.
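Item 5 could be approached by bounding how many items a single user contributes before pairs are formed. The sketch below is not Mahout code: it assumes that the cooccurrence step emits one record per unordered pair of items that share a user, and the class name, the method names, and the cap parameter maxItemsPerUser are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class CooccurrenceCap {

    /** Number of unordered item pairs produced by a user with n items: n*(n-1)/2. */
    static long pairsFor(long n) {
        return n * (n - 1) / 2;
    }

    /** Keeps at most maxItemsPerUser items of one user (here simply the first ones). */
    static List<Integer> capItems(List<Integer> items, int maxItemsPerUser) {
        int keep = Math.min(items.size(), maxItemsPerUser);
        return new ArrayList<>(items.subList(0, keep));
    }

    /** Emits every unordered pair (a, b) where a comes before b in the list. */
    static List<int[]> pairs(List<Integer> items) {
        List<int[]> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            for (int j = i + 1; j < items.size(); j++) {
                out.add(new int[] {items.get(i), items.get(j)});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A hypothetical heavy user with 1000 items.
        List<Integer> heavyUser = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            heavyUser.add(i);
        }
        System.out.println("uncapped pairs: " + pairsFor(heavyUser.size()));
        System.out.println("capped pairs:   " + pairs(capItems(heavyUser, 100)).size());
    }
}
```

Under that assumption the pair count grows quadratically with the number of items per user, so a few very active users can dominate the output. A cap (or random sampling instead of taking the first items) would trade some accuracy for a bounded output size.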
> Performance of RowSimilarityJob is not good
> -------------------------------------------
>
> Key: MAHOUT-468
> URL: https://issues.apache.org/jira/browse/MAHOUT-468
> Project: Mahout
> Issue Type: Test
> Components: Collaborative Filtering
> Affects Versions: 0.4
> Reporter: Hui Wen Han
> Fix For: 0.4
>
>