yannis ats created MAHOUT-1233:
----------------------------------

             Summary: Inconsistent results when processing a dataset as a single chunk vs. many chunks in Hadoop mode across most clustering algorithms
                 Key: MAHOUT-1233
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1233
             Project: Mahout
          Issue Type: Question
          Components: Clustering
    Affects Versions: 0.7, 0.8
            Reporter: yannis ats
            Priority: Minor


I am trying to process a dataset in two ways: first as a single chunk (the whole dataset in one file), and second split into many smaller chunks in order to increase the throughput of my machine.
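
For illustration, the two runs look roughly like this (the paths, distance measure, cluster count, and iteration count are placeholders; the only difference between the two runs is the input layout):

    # Single-chunk run: one SequenceFile containing all 1000 vectors.
    bin/mahout kmeans \
      -i input-single/vectors.seq \
      -c initial-clusters -k 10 -ow \
      -o output-single \
      -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
      -x 20 -cl

    # Chunked run: the same vectors split across 10 SequenceFiles in a directory.
    bin/mahout kmeans \
      -i input-chunks/ \
      -c initial-clusters -k 10 -ow \
      -o output-chunks \
      -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
      -x 20 -cl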
The problem is that the single-chunk computation gives fine results, and by fine I mean that if the input contains 1000 vectors, I get 1000 vector IDs with their cluster IDs in the output (I have tried Canopy, k-means, and fuzzy k-means).
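
I check these counts by dumping the clustered points and counting records, along these lines (the part-file path is an example; seqdumper prints one "Key: ... Value: ..." line per record):

    bin/mahout seqdumper -i output-single/clusteredPoints/part-m-00000 \
      | grep -c '^Key:'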
However, when I split the dataset in order to speed up the computation, strange phenomena occur. For instance, the same dataset of 1000 vectors, split into e.g. 10 files, yields more vector IDs in the output (e.g. 1100 vector IDs with their corresponding cluster IDs).
The question is, am I doing something wrong in the process? Is there a problem in clusterdump and seqdumper when the input is spread over many files? I have observed that while Mahout is performing the computations, the console reports that the correct number of vectors was processed. Am I missing something?
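
For completeness, this is roughly how I dump the results of the chunked run (the output paths are placeholders, and clusters-*-final is whichever final-iteration directory k-means produced):

    bin/mahout clusterdump \
      -i output-chunks/clusters-*-final \
      -p output-chunks/clusteredPoints \
      -o clusterdump-chunks.txt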
As input I use Weka vectors that have been transformed to mvc format.
I have tried this with v0.7 and the v0.8 snapshot.

Thank you in advance for your time.

