yannis ats created MAHOUT-1233:
----------------------------------
Summary: Problem in processing datasets as a single chunk vs many
chunks in HADOOP mode in mostly all the clustering algos
Key: MAHOUT-1233
URL: https://issues.apache.org/jira/browse/MAHOUT-1233
Project: Mahout
Issue Type: Question
Components: Clustering
Affects Versions: 0.7, 0.8
Reporter: yannis ats
Priority: Minor
I am trying to process a dataset, and I do it in two ways:
first as a single chunk (the whole dataset), and second as many
smaller chunks in order to increase the throughput of my machine.
When I run the single-chunk computation the results are fine.
By fine I mean that if the input has 1000 vectors, the output has
1000 vector ids with their cluster ids (I have tried canopy, kmeans and fuzzy
kmeans).
However, when I split the dataset in order to speed up the computation,
strange things happen.
For instance, if the same 1000-vector dataset is split into, for
example, 10 files, the output contains more vector ids (e.g. 1100
vector ids with their corresponding cluster ids).
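One way to narrow this down is to check whether the extra ids are duplicates
of existing ids rather than new ones. Below is a minimal sketch in Python,
assuming the clustering output has already been dumped to text and reduced to
one `vector_id<TAB>cluster_id` pair per line. That line format and the name
`summarize_ids` are assumptions made for illustration; they are not what
clusterdump or seqdumper emit as-is.

```python
from collections import Counter


def summarize_ids(lines):
    """Count total and distinct vector ids.

    Assumes each non-empty line looks like 'vector_id<TAB>cluster_id'
    (a hypothetical, pre-processed format).
    Returns (total_rows, distinct_ids, duplicates), where duplicates maps
    each repeated vector id to the number of times it appears.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        vector_id = line.split("\t", 1)[0]
        counts[vector_id] += 1
    duplicates = {vid: n for vid, n in counts.items() if n > 1}
    return sum(counts.values()), len(counts), duplicates
```

If the total is 1100 but the distinct count is 1000, the extra rows are
repeated ids, which would suggest that some vectors are being read or written
more than once instead of new vectors appearing.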
The question is: am I doing something wrong in the process?
Is there a problem in clusterdump and seqdumper when the input is in many files?
While Mahout is running, the on-screen output
reports that it processed the correct number of vectors.
Am I missing something?
As input I use Weka vectors transformed to mvc.
I have tried this in v0.7 and the v0.8 snapshot.
Thank you in advance for your time.