Pat Ferrel created MAHOUT-1028:
----------------------------------
Summary: seq2sparse n-gram weighting creates malformed vectors
which crashes kmeans
Key: MAHOUT-1028
URL: https://issues.apache.org/jira/browse/MAHOUT-1028
Project: Mahout
Issue Type: Bug
Components: Clustering
Affects Versions: 0.7
Environment: using trunk snapshot about June 1.
Reporter: Pat Ferrel
Fix For: 0.7
I think I found the root cause, but I'm not sure what needs fixing.
I took out n-gram generation and the vector now looks like this:
Key: https://farfetchers.com/category/collections/source/brice-berard:
Value:
https://farfetchers.com/category/collections/source/brice-berard:{701:0.5484552974788475,1876:0.6020428878306935,3620:0.5802940184767269}
This works in clustering.
It doesn't seem like a malformed vector should crash clustering (it apparently
doesn't in Mahout 0.6), but something in seq2sparse's n-gram weighting does
appear to produce one.
I'll file a JIRA.
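
For reference, the failure chain described in the quoted messages below (a vector with no usable values, NaN probabilities, then maxValueIndex() returning -1) can be reproduced in isolation. What follows is a minimal, self-contained Java sketch, not Mahout's code. The class NanSelectSketch and its methods are hypothetical stand-ins. It assumes the malformed vector behaves like one with no non-zero entries, that the cosine distance has no guard against a zero norm, that pdf = 1 / (1 + distance) (consistent with "distance 1 gives pdf 0.5" in Jeff's message), and that the max-index scan starts at -1 and uses a strict greater-than comparison.

```java
import java.util.Arrays;

public class NanSelectSketch {

    // Cosine distance without a zero-norm guard (assumption, not Mahout's implementation).
    public static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // For an all-zero vector this evaluates 0.0 / 0.0, which is NaN.
        return 1.0 - dot / Math.sqrt(normA * normB);
    }

    // Assumed pdf shape: a distance of 1 gives 0.5, a distance of 0 gives 1.
    public static double pdf(double distance) {
        return 1.0 / (1.0 + distance);
    }

    // Index of the largest element, or -1 if no element compares greater than
    // the running maximum. Every comparison against NaN is false.
    public static int maxValueIndex(double[] values) {
        int best = -1;
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < values.length; i++) {
            if (values[i] > max) {
                max = values[i];
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] emptyDoc = {0.0, 0.0, 0.0};                 // stand-in for the malformed vector
        double[][] centers = {{1.0, 0.0, 2.0}, {0.0, 3.0, 1.0}}; // made-up cluster centers
        double[] probabilities = new double[centers.length];
        for (int k = 0; k < centers.length; k++) {
            probabilities[k] = pdf(cosineDistance(emptyDoc, centers[k]));
        }
        System.out.println(Arrays.toString(probabilities)); // prints [NaN, NaN]
        System.out.println(maxValueIndex(probabilities));   // prints -1
    }
}
```

Under these assumptions, any code that uses the returned index to set an element of a 20-element vector would fail with an out-of-range index of -1, which matches the IndexException in the trace. Whether the real code path fails for the same reason is not confirmed here.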
On 6/5/12 11:48 AM, Pat Ferrel wrote:
> Using seqdumper on the TFIDF vectors, that vector is indeed in the list:
> Key: https://farfetchers.com/category/collections/source/brice-berard:
> Value: https://farfetchers.com/category/collections/source/brice-berard:{
>
> Looking in the seqfiles we find the document in part-00005 of 10 in no
> particular part of the file.
> Key: https://farfetchers.com/category/collections/source/brice-berard:
> Value: ::Title::
> Brice Berard | FarFetchers.com
> Blog Posts
>
> On the chance that this originates in seq2sparse, I'll try changing options
> until the vector looks different, then try clustering again.
>
> On 6/5/12 10:43 AM, Pat Ferrel wrote:
>> I'm not completely sure what I'm looking at but...
>>
>> In iterateSeq on iteration #1 of processing vectors/tfidf-vectors it reads
>> vector = "https://farfetchers.com/category/collections/source/brice-berard:{"
>>
>> It's a named vector where the URL is the name and the value is "{", which
>> looks wrong. When that vector is classified to get probabilities, the result is
>>
>> probabilities =
>> "{0:NaN,1:NaN,2:NaN,3:NaN,4:NaN,5:NaN,6:NaN,7:NaN,8:NaN,9:NaN,10:NaN,11:NaN,12:NaN,13:NaN,14:NaN,15:NaN,16:NaN,17:NaN,18:NaN,19:NaN}"
>>
>> That makes probabilities.maxValueIndex() return -1, and everything dies.
>>
>> vector looks wrong, doesn't it? Truncated?
>>
>> I went back to try the same on Mahout 0.6, but iterateSeq does not get called
>> there even though I used -xm sequential on both runs. I can't see
>> kmeans-clusters/clusters-0 being created on Mahout 0.6 either. Is that part
>> of the refactoring?
>>
>> On 6/4/12 3:07 PM, Pat Ferrel wrote:
>>> Some things to try:
>>> - Have you verified the contents of your input vectors actually have data
>>> in them?
>>> * YES, from the other email you know that the data works fine in 0.6
>>> - Can you run the cluster dumper on the b3/kmeans-clusters/clusters-0
>>> contents?
>>> * YES, It is attached from trunk's clusterdump after the failure of kmeans,
>>> of course. A simple data set fortunately.
>>> - Is it possible to run the sequential version (-xm sequential)? If it is
>>> you could run it in a debugger to gain more insight.
>>> * YES, will report back.
>>>
>>> On 6/4/12 2:19 PM, Jeff Eastman wrote:
>>>> It looks like the probabilities vector returned by
>>>> AbstractClusteringPolicy.classify() has no non-zero elements. In this
>>>> case, AbstractClusteringPolicy.select()'s call to
>>>> AbstractVector.maxValueIndex() is returning -1 and that is causing the
>>>> exception.
>>>>
>>>> How could this happen? I'm not exactly sure, but consider that the
>>>> probabilities vector is calculated in AbstractClusteringPolicy.classify()
>>>> by calling DistanceMeasureCluster.pdf() on each of the prior clusters in
>>>> b3/kmeans-clusters/clusters-0. With a CosineDistanceMeasure I don't see
>>>> how this could ever return zero. Certainly, some of your vectors will
>>>> match the prior cluster centers exactly (they were sampled from the input)
>>>> and those values would return pdf==1. Even if the cosine distance was 1
>>>> the pdf would be 0.5.
>>>>
>>>> Some things to try:
>>>> - Have you verified the contents of your input vectors actually have data
>>>> in them?
>>>> - Can you run the cluster dumper on the b3/kmeans-clusters/clusters-0
>>>> contents?
>>>> - Is it possible to run the sequential version (-xm sequential)? If it is
>>>> you could run it in a debugger to gain more insight.
>>>>
>>>> Jeff
>>>>
>>>> On 6/4/12 12:05 PM, Pat Ferrel wrote:
>>>>> Running kmeans from the CLI on several trunk versions, I get an error I
>>>>> don't understand. When the job died,
>>>>> b3/canopy-centroids/clusters-0-final contained the random-seeds file
>>>>> generated by the kmeans driver, and b3/kmeans-clusters/clusters-0 had
>>>>> several part files, but b3/kmeans-clusters/clusters-1 was empty. Following
>>>>> the code from the trace doesn't make much sense to me.
>>>>>
>>>>> Command line:
>>>>> mahout kmeans
>>>>> -i b3/vectors/tfidf-vectors/
>>>>> -k 20
>>>>> -c b3/canopy-centroids/clusters-0-final
>>>>> -cl
>>>>> -o b3/kmeans-clusters
>>>>> -ow
>>>>> -cd 0.01
>>>>> -x 30
>>>>> -dm org.apache.mahout.common.distance.CosineDistanceMeasure
>>>>>
>>>>> Error:
>>>>> 12/06/04 07:55:03 INFO common.AbstractJob: Command line arguments:
>>>>> {--clustering=null, --clusters=[b3/canopy-centroids/clusters-0-final],
>>>>> --convergenceDelta=[0.01],
>>>>> --distanceMeasure=[org.apache.mahout.common.distance.CosineDistanceMeasure],
>>>>> --endPhase=[2147483647], --input=[b3/vectors/tfidf-vectors/],
>>>>> --maxIter=[30], --method=[mapreduce], --numClusters=[20],
>>>>> --output=[b3/kmeans-clusters], --overwrite=null, --startPhase=[0],
>>>>> --tempDir=[temp]}
>>>>> 2012-06-04 07:55:03.752 java[67308:1903] Unable to load realm info from
>>>>> SCDynamicStore
>>>>> 12/06/04 07:55:03 INFO common.HadoopUtil: Deleting
>>>>> b3/canopy-centroids/clusters-0-final
>>>>> 12/06/04 07:55:04 WARN util.NativeCodeLoader: Unable to load
>>>>> native-hadoop library for your platform... using builtin-java classes
>>>>> where applicable
>>>>> 12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new compressor
>>>>> 12/06/04 07:55:04 INFO kmeans.RandomSeedGenerator: Wrote 20 vectors to
>>>>> b3/canopy-centroids/clusters-0-final/part-randomSeed
>>>>> 12/06/04 07:55:04 INFO kmeans.KMeansDriver: Input:
>>>>> b3/vectors/tfidf-vectors Clusters In:
>>>>> b3/canopy-centroids/clusters-0-final/part-randomSeed Out:
>>>>> b3/kmeans-clusters Distance:
>>>>> org.apache.mahout.common.distance.CosineDistanceMeasure
>>>>> 12/06/04 07:55:04 INFO kmeans.KMeansDriver: convergence: 0.01 max
>>>>> Iterations: 30 num Reduce Tasks: org.apache.mahout.math.VectorWritable
>>>>> Input Vectors: {}
>>>>> 12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new decompressor
>>>>> Cluster Iterator running iteration 1 over priorPath:
>>>>> b3/kmeans-clusters/clusters-0
>>>>> 12/06/04 07:55:05 INFO input.FileInputFormat: Total input paths to
>>>>> process : 1
>>>>> 12/06/04 07:55:05 INFO mapred.JobClient: Running job: job_local_0001
>>>>> 12/06/04 07:55:06 INFO mapred.MapTask: io.sort.mb = 100
>>>>> 12/06/04 07:55:08 INFO mapred.MapTask: data buffer = 79691776/99614720
>>>>> 12/06/04 07:55:08 INFO mapred.MapTask: record buffer = 262144/327680
>>>>> 12/06/04 07:55:08 INFO mapred.JobClient: map 0% reduce 0%
>>>>> 12/06/04 07:55:09 WARN mapred.LocalJobRunner: job_local_0001
>>>>> org.apache.mahout.math.IndexException: Index -1 is outside allowable range of [0,20)
>>>>> at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:439)
>>>>> at org.apache.mahout.clustering.iterator.AbstractClusteringPolicy.select(AbstractClusteringPolicy.java:44)
>>>>> at org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:52)
>>>>> at org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:18)
>>>>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>>>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>>>> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>>>>> 12/06/04 07:55:09 INFO mapred.JobClient: Job complete: job_local_0001
>>>>> 12/06/04 07:55:09 INFO mapred.JobClient: Counters: 0
>>>>> Exception in thread "main" java.lang.InterruptedException: Cluster Iteration 1 failed processing b3/kmeans-clusters/clusters-1
>>>>> at org.apache.mahout.clustering.iterator.ClusterIterator.iterateMR(ClusterIterator.java:186)
>>>>> at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:229)
>>>>> at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:149)
>>>>> at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:108)
>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>> at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:49)
>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>>>> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>>>>
>>>>