[jira] [Commented] (MAHOUT-1028) seq2sparse n-gram weighting creates malformed vectors which crashes kmeans

Pat Ferrel (JIRA) Sat, 09 Jun 2012 08:20:44 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292359#comment-13292359
 ]


Pat Ferrel commented on MAHOUT-1028:
------------------------------------

Yes, these are the settings I use on a much larger crawl; here I'm using them 
on a small one for quick experimentation. My understanding was that a very 
large ml (minLLR) would create fewer n-grams, keeping only the really important 
ones, which is what I want. There are always the individual words, which should 
be used in the vector when the n-grams don't pass the ml test. Or so I thought.
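
As a rough illustration of the expectation described above, here is a minimal plain-Java sketch. It is not Mahout's code and says nothing about what seq2sparse actually does: the class name, the keepTerms helper, and the LLR scores are all hypothetical, made up only to show "unigrams always survive, n-grams survive only above the threshold".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Mahout.
final class MinLlrExpectation {

    // Keep every unigram; keep an n-gram only if its LLR score reaches minLLR.
    static List<String> keepTerms(List<String> unigrams,
                                  Map<String, Double> ngramLlr,
                                  double minLLR) {
        List<String> kept = new ArrayList<>(unigrams);
        for (Map.Entry<String, Double> e : ngramLlr.entrySet()) {
            if (e.getValue() >= minLLR) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> unigrams = Arrays.asList("brice", "berard", "blog");
        Map<String, Double> ngramLlr = new LinkedHashMap<>();
        ngramLlr.put("brice berard", 42.0); // made-up score
        ngramLlr.put("berard blog", 3.5);   // made-up score

        // Moderate threshold: one n-gram passes.
        System.out.println(keepTerms(unigrams, ngramLlr, 10.0));
        // Very large threshold: no n-gram passes, the unigrams remain.
        System.out.println(keepTerms(unigrams, ngramLlr, 1000.0));
    }
}
```

Under this reading, raising the threshold can only shrink the term list down to the unigrams, never to an empty list.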

In short, a large ml means fewer terms in the vector, but I thought never fewer 
than the number of words. So a large ml should never create a zero vector. Am I 
wrong?
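
For reference, the crash quoted below is consistent with an empty (all-zero) vector reaching the classifier. The following plain-Java sketch is not Mahout's implementation. It assumes pdf = 1 / (1 + distance), which is only inferred from Jeff's remarks that an exact match gives pdf 1 and a cosine distance of 1 gives pdf 0.5, and it assumes a max-index search that starts at -1. All names here are hypothetical.

```java
// Hypothetical illustration only; a stand-in for the real Mahout classes.
final class EmptyVectorDemo {

    // Cosine distance: 1 - (a . b) / (|a| * |b|). For a zero vector this is 0/0.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Assumed form of the pdf: 1 for distance 0, 0.5 for distance 1.
    static double pdf(double distance) {
        return 1.0 / (1.0 + distance);
    }

    // Index of the largest value, or -1 if no element compares greater
    // than negative infinity (every comparison with NaN is false).
    static int maxValueIndex(double[] values) {
        int result = -1;
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < values.length; i++) {
            if (values[i] > max) {
                max = values[i];
                result = i;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] centers = {{1.0, 0.0}, {0.0, 1.0}};
        double[] empty = {0.0, 0.0}; // stands in for the "{" vector

        double[] probabilities = new double[centers.length];
        for (int i = 0; i < centers.length; i++) {
            probabilities[i] = pdf(cosineDistance(empty, centers[i]));
        }
        System.out.println(java.util.Arrays.toString(probabilities));
        System.out.println(maxValueIndex(probabilities));
    }
}
```

With a zero-length input the norm product is 0, the distance is NaN, every pdf is NaN, and the max-index search never updates its initial -1, which matches the "Index -1 is outside allowable range" exception in the trace.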


                
> seq2sparse n-gram weighting creates malformed vectors which crashes kmeans
> --------------------------------------------------------------------------
>
>                 Key: MAHOUT-1028
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1028
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.7
>         Environment: using trunk snapshot about June 1. 
>            Reporter: Pat Ferrel
>             Fix For: 0.7
>
>         Attachments: dmp
>
>
> I think I found the root cause but am not sure what needs fixing.
> I took out n-gram generation and the vector now looks like this:
> Key: https://farfetchers.com/category/collections/source/brice-berard:
> Value: 
> https://farfetchers.com/category/collections/source/brice-berard:{701:0.5484552974788475,1876:0.6020428878306935,3620:0.5802940184767269}
> This works in clustering.
> It doesn't seem like a malformed vector should crash clustering (it 
> apparently doesn't in mahout 0.6) but it looks like something in seq2sparse's 
> n-gram weighting does cause a malformed vector.
> I'll file a JIRA
> On 6/5/12 11:48 AM, Pat Ferrel wrote:
> > Using seqdumper on the TFIDF vectors, that vector is indeed in the list
> > Key: https://farfetchers.com/category/collections/source/brice-berard:
> > Value: https://farfetchers.com/category/collections/source/brice-berard:{
> >
> > Looking in the seqfiles we find the document in part-00005 of 10 in no 
> > particular part of the file.
> > Key: https://farfetchers.com/category/collections/source/brice-berard:
> > Value: ::Title::
> > Brice Berard | FarFetchers.com
> > Blog Posts
> >
> > On the chance that this originates in seq2sparse, I'll try changing options 
> > until the vector looks different, and then try clustering again.
> >
> > On 6/5/12 10:43 AM, Pat Ferrel wrote:
> >> I'm not completely sure what I'm looking at but...
> >>
> >> In iterateSeq on iteration #1  of processing vectors/tfidf-vectors it reads
> >> vector = 
> >> "https://farfetchers.com/category/collections/source/brice-berard:{";
> >>
> >> It's a named vector where the URL is the name and the value is "{", which 
> >> looks wrong. When that is classified to get a probability, it gets
> >>
> >> probabilities = 
> >> "{0:NaN,1:NaN,2:NaN,3:NaN,4:NaN,5:NaN,6:NaN,7:NaN,8:NaN,9:NaN,10:NaN,11:NaN,12:NaN,13:NaN,14:NaN,15:NaN,16:NaN,17:NaN,18:NaN,19:NaN}"
> >>
> >> That causes the probabilities.maxValueIndex() = -1 and everything dies.
> >>
> >> vector looks wrong, doesn't it? Truncated?
> >>
> >> I went back to try the same on mahout 0.6 but iterateSeq does not get 
> >> called though I used -xm sequential on both runs. I can't see 
> >> kmeans-clusters/clusters-0 being created on mahout 0.6 either. Is that 
> >> part of the refactoring?
> >>
> >> On 6/4/12 3:07 PM, Pat Ferrel wrote:
> >>> Some things to try:
> >>> - Have you verified the contents of your input vectors actually have data 
> >>> in them?
> >>> * YES, from the other email you know that the data works fine in 0.6
> >>> - Can you run the cluster dumper on the b3/kmeans-clusters/clusters-0 
> >>> contents?
> >>> * YES, It is attached from trunk's clusterdump after the failure of 
> >>> kmeans, of course. A simple data set fortunately.
> >>> - Is it possible to run the sequential version (-xm sequential)? If it is 
> >>> you could run it in a debugger to gain more insight.
> >>> * YES, will report back.
> >>>
> >>> On 6/4/12 2:19 PM, Jeff Eastman wrote:
> >>>> It looks like the probabilities vector returned by 
> >>>> AbstractClusteringPolicy.classify() has no non-zero elements. In this 
> >>>> case, AbstractClusteringPolicy.select()'s call to 
> >>>> AbstractVector.maxValueIndex() is returning -1 and that is causing the 
> >>>> exception.
> >>>>
> >>>> How could this happen? I'm not exactly sure, but consider that the 
> >>>> probabilities vector is calculated in 
> >>>> AbstractClusteringPolicy.classify() by calling 
> >>>> DistanceMeasureCluster.pdf() on each of the prior clusters in 
> >>>> b3/kmeans-clusters/clusters-0. With a CosineDistanceMeasure I don't see 
> >>>> how this could ever return zero. Certainly, some of your vectors will 
> >>>> match the prior cluster centers exactly (they were sampled from the 
> >>>> input) and those values would return pdf==1. Even if the cosine distance 
> >>>> was 1 the pdf would be 0.5.
> >>>>
> >>>> Some things to try:
> >>>> - Have you verified the contents of your input vectors actually have 
> >>>> data in them?
> >>>> - Can you run the cluster dumper on the b3/kmeans-clusters/clusters-0 
> >>>> contents?
> >>>> - Is it possible to run the sequential version (-xm sequential)? If it 
> >>>> is you could run it in a debugger to gain more insight.
> >>>>
> >>>> Jeff
> >>>>
> >>>> On 6/4/12 12:05 PM, Pat Ferrel wrote:
> >>>>> Running kmeans from the CLI on several trunk versions, I get an error I 
> >>>>> don't understand. When the job died the 
> >>>>> b3/canopy-centroids/clusters-0-final contained the random-seeds file 
> >>>>> generated by the kmeans driver and the b3/kmeans-clusters/clusters-0 
> >>>>> had several part files but b3/kmeans-clusters/clusters-1 was empty. 
> >>>>> When I look through the code from the trace it doesn't make much sense.
> >>>>>
> >>>>> Command line:
> >>>>> mahout kmeans
> >>>>>   -i b3/vectors/tfidf-vectors/
> >>>>>   -k 20
> >>>>>   -c b3/canopy-centroids/clusters-0-final
> >>>>>   -cl
> >>>>>   -o b3/kmeans-clusters
> >>>>>   -ow
> >>>>>   -cd 0.01
> >>>>>   -x 30
> >>>>>   -dm org.apache.mahout.common.distance.CosineDistanceMeasure
> >>>>>
> >>>>> Error:
> >>>>> 12/06/04 07:55:03 INFO common.AbstractJob: Command line arguments: 
> >>>>> {--clustering=null, --clusters=[b3/canopy-centroids/clusters-0-final], 
> >>>>> --convergenceDelta=[0.01], 
> >>>>> --distanceMeasure=[org.apache.mahout.common.distance.CosineDistanceMeasure],
> >>>>>  --endPhase=[2147483647], --input=[b3/vectors/tfidf-vectors/], 
> >>>>> --maxIter=[30], --method=[mapreduce], --numClusters=[20], 
> >>>>> --output=[b3/kmeans-clusters], --overwrite=null, --startPhase=[0], 
> >>>>> --tempDir=[temp]}
> >>>>> 2012-06-04 07:55:03.752 java[67308:1903] Unable to load realm info from 
> >>>>> SCDynamicStore
> >>>>> 12/06/04 07:55:03 INFO common.HadoopUtil: Deleting 
> >>>>> b3/canopy-centroids/clusters-0-final
> >>>>> 12/06/04 07:55:04 WARN util.NativeCodeLoader: Unable to load 
> >>>>> native-hadoop library for your platform... using builtin-java classes 
> >>>>> where applicable
> >>>>> 12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new compressor
> >>>>> 12/06/04 07:55:04 INFO kmeans.RandomSeedGenerator: Wrote 20 vectors to 
> >>>>> b3/canopy-centroids/clusters-0-final/part-randomSeed
> >>>>> 12/06/04 07:55:04 INFO kmeans.KMeansDriver: Input: 
> >>>>> b3/vectors/tfidf-vectors Clusters In: 
> >>>>> b3/canopy-centroids/clusters-0-final/part-randomSeed Out: 
> >>>>> b3/kmeans-clusters Distance: 
> >>>>> org.apache.mahout.common.distance.CosineDistanceMeasure
> >>>>> 12/06/04 07:55:04 INFO kmeans.KMeansDriver: convergence: 0.01 max 
> >>>>> Iterations: 30 num Reduce Tasks: org.apache.mahout.math.VectorWritable 
> >>>>> Input Vectors: {}
> >>>>> 12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new decompressor
> >>>>> Cluster Iterator running iteration 1 over priorPath: 
> >>>>> b3/kmeans-clusters/clusters-0
> >>>>> 12/06/04 07:55:05 INFO input.FileInputFormat: Total input paths to 
> >>>>> process : 1
> >>>>> 12/06/04 07:55:05 INFO mapred.JobClient: Running job: job_local_0001
> >>>>> 12/06/04 07:55:06 INFO mapred.MapTask: io.sort.mb = 100
> >>>>> 12/06/04 07:55:08 INFO mapred.MapTask: data buffer = 79691776/99614720
> >>>>> 12/06/04 07:55:08 INFO mapred.MapTask: record buffer = 262144/327680
> >>>>> 12/06/04 07:55:08 INFO mapred.JobClient:  map 0% reduce 0%
> >>>>> 12/06/04 07:55:09 WARN mapred.LocalJobRunner: job_local_0001
> >>>>> org.apache.mahout.math.IndexException: Index -1 is outside allowable 
> >>>>> range of [0,20)
> >>>>>     at 
> >>>>> org.apache.mahout.math.AbstractVector.set(AbstractVector.java:439)
> >>>>>     at 
> >>>>> org.apache.mahout.clustering.iterator.AbstractClusteringPolicy.select(AbstractClusteringPolicy.java:44)
> >>>>>     at 
> >>>>> org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:52)
> >>>>>     at 
> >>>>> org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:18)
> >>>>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >>>>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> >>>>>     at 
> >>>>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> >>>>> 12/06/04 07:55:09 INFO mapred.JobClient: Job complete: job_local_0001
> >>>>> 12/06/04 07:55:09 INFO mapred.JobClient: Counters: 0
> >>>>> Exception in thread "main" java.lang.InterruptedException: Cluster 
> >>>>> Iteration 1 failed processing b3/kmeans-clusters/clusters-1
> >>>>>     at 
> >>>>> org.apache.mahout.clustering.iterator.ClusterIterator.iterateMR(ClusterIterator.java:186)
> >>>>>     at 
> >>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:229)
> >>>>>     at 
> >>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:149)
> >>>>>     at 
> >>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:108)
> >>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>>>     at 
> >>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:49)
> >>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>>>     at 
> >>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>>>     at 
> >>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
> >>>>>     at 
> >>>>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >>>>>     at 
> >>>>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >>>>>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
