Hello!

I have about 1.3M vectors generated with the lucene.vector utility that I then try to cluster into 550 clusters. Everything seems fine and clustering starts, but after about an hour I get:
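For context, the pipeline is roughly the following (a sketch only: the index path, field names, output path, and k-means parameters other than -k 550 are placeholders, not the exact values I use):

```shell
# Export term vectors from the Lucene index (field/id names are placeholders):
mahout lucene.vector \
  --dir /path/to/lucene/index \
  --field body --idField id \
  --dictOut /tmp/dict.txt \
  --output /lms/apps/data/mahout/mahout_rus_938K_en_410K/mahout_vectors

# Cluster the ~1.3M vectors into 550 clusters
# (maxIter/convergenceDelta below are illustrative values):
mahout kmeans \
  -i /lms/apps/data/mahout/mahout_rus_938K_en_410K/mahout_vectors \
  -c /lms/apps/data/mahout/mahout_rus_938K_en_410K/centroid-rndm-seeds \
  -o /lms/apps/data/mahout/mahout_rus_938K_en_410K/kmeans-output \
  -k 550 -x 10 -cd 0.01 -cl
```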


12/05/10 18:26:50 INFO fs.FSInputChecker: Found checksum error: b[196, 708]=6b8e184f4f7c8900d812ade6a7269429bc3520746f5df7387257558a3f02c4af3f6e4bf05ef2676a88b54c86409399375df28bb28abe47df012c891e771ff57264883d88f7ec0db1d1fb3581b00ab0438df7de297763f4b9005cdef9eda32b3715e0ed015bf2609ad5e8c18f5f3500e921f5fd856cebdc96173080e6cbeac5f4957eb3b9d0a72d31bf8d9ca8c0c4d7204092fd8269aad260d5b007a0f9d4d59a7ebb1291588a00346187d1a72b23b4d26804a0f7587d8cb32f4aeda0224086528c9ac617b7ce850888c3ef2fa24e61f5cb45ce26e9c6057b57fa53e950266946e5b1ca5135e1a79b804e3bd2d5b57f0d321b5e535dd76e3a754c40c66b00066bcd9991778af3add0314e476bc96e959aa80ea831e1a295c024e578dbdb4a0448538b0e5138482541c718e65bf967a5542a338b218617b6588db0ff0a66e443f1bcbfc8667e3b90f10e809da4bc33da59a34a1452ca2a85dd1edc17d57c6834f325e97b4a23b7b06abb18db4fdd7b01e5dd9ce265654b544423b473cf2efcd52ac905ac07603b19b653e952c3c2ab20baee4b5b82bb7ef4c86b085d14f284c3d106529c25e0a80f69b12368a52405c0ee3ecd7be8bd1dbf148410ab4e9c32068926f9755ac919f5344df12dba241601888fd565afef29088e4c458044251ee5db4bd7b2613b4049ed95d10fd5ceabf2856eebd476f5ea595564062340ead4fe6f
org.apache.hadoop.fs.ChecksumException: Checksum error: file:/lms/apps/data/mahout/mahout_rus_938K_en_410K/mahout_vectors at 677131776
    at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
    at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
    at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
    at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
    at java.io.DataInputStream.readFully(Unknown Source)
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2062)
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:68)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
12/05/10 18:26:50 WARN mapred.LocalJobRunner: job_local_0001
org.apache.hadoop.fs.ChecksumException: Checksum error: file:/lms/apps/data/mahout/mahout_rus_938K_en_410K/mahout_vectors at 677131776
    at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
    at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
    at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
    at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
    at java.io.DataInputStream.readFully(Unknown Source)
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2062)
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:68)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
12/05/10 18:26:50 INFO mapred.JobClient: Job complete: job_local_0001
12/05/10 18:26:50 INFO mapred.JobClient: Counters: 11
12/05/10 18:26:50 INFO mapred.JobClient:   FileSystemCounters
12/05/10 18:26:50 INFO mapred.JobClient:     FILE_BYTES_READ=78797420248
12/05/10 18:26:50 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1141852988
12/05/10 18:26:50 INFO mapred.JobClient:   File Input Format Counters
12/05/10 18:26:50 INFO mapred.JobClient:     Bytes Read=682074112
12/05/10 18:26:50 INFO mapred.JobClient:   Map-Reduce Framework
12/05/10 18:26:50 INFO mapred.JobClient:     Map output materialized bytes=100785641
12/05/10 18:26:50 INFO mapred.JobClient:     Combine output records=2182
12/05/10 18:26:50 INFO mapred.JobClient:     Map input records=236549
12/05/10 18:26:50 INFO mapred.JobClient:     Spilled Records=2182
12/05/10 18:26:50 INFO mapred.JobClient:     Map output bytes=1444355648
12/05/10 18:26:50 INFO mapred.JobClient:     Combine input records=234975
12/05/10 18:26:50 INFO mapred.JobClient:     Map output records=236548
12/05/10 18:26:50 INFO mapred.JobClient:     SPLIT_RAW_BYTES=2730
Exception in thread "main" java.lang.InterruptedException: K-Means Iteration failed processing /lms/apps/data/mahout/mahout_rus_938K_en_410K/centroid-rndm-seeds/part-randomSeed
    at org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:371)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:316)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:239)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:154)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:112)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:61)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)



I tried Mahout 0.5 and 0.6. I didn't encounter such problems on smaller collections (~400K vectors, Mahout 0.5).

Do you have any insight into what's going on, and what are possible ways to solve the problem?
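In case it helps narrow things down, here are two local checks I could run (paths are taken from the log above; the dump output path is a placeholder, and the seqdumper flags are those of Mahout 0.6 and may differ in other versions):

```shell
# Stream every record through the same SequenceFile reader path to confirm
# the failure reproduces at the same offset (677131776):
mahout seqdumper \
  -s /lms/apps/data/mahout/mahout_rus_938K_en_410K/mahout_vectors \
  -o /tmp/vectors-dump.txt

# On the local filesystem, Hadoop's ChecksumFileSystem verifies a hidden
# .crc sidecar file next to the data. If the vectors themselves are intact
# and only the sidecar is stale, removing it makes LocalFileSystem skip
# verification. Regenerating the vectors is the safer fix if the data
# itself may be corrupt.
ls -l /lms/apps/data/mahout/mahout_rus_938K_en_410K/.mahout_vectors.crc
```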

