It actually looks more like a segment's index is trashed.
Try using the following patch to identify the troubled segment, then re-index it.
Doug
Jason Boss wrote:
Any fixes for this or do I start over with a new database?
Jason
[EMAIL PROTECTED] nutch-nightly]# bin/nutch dedup segments dedup.tmp
040921 214812 Clearing old deletions in segments/20040829092114/index
040921 214812 Clearing old deletions in segments/20040829122947/index
040921 214812 Clearing old deletions in segments/20040829124357/index
040921 214812 Clearing old deletions in segments/20040829130541/index
040921 214813 Clearing old deletions in segments/20040829212107/index
040921 214813 Clearing old deletions in segments/20040829225928/index
040921 214813 Clearing old deletions in segments/20040830042947/index
040921 214813 Clearing old deletions in segments/20040830043001/index
040921 214813 Clearing old deletions in segments/20040830065943/index
040921 214813 Clearing old deletions in segments/20040830111830/index
040921 214816 Reading url hashes...
040921 214816 loading file:/root/nutch-nightly/conf/nutch-default.xml
040921 214816 loading file:/root/nutch-nightly/conf/nutch-site.xml
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 12243, Size: 12
        at java.util.ArrayList.RangeCheck(ArrayList.java:507)
        at java.util.ArrayList.get(ArrayList.java:324)
        at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:155)
        at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:66)
        at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:237)
        at net.nutch.indexer.DeleteDuplicates.computeHashes(DeleteDuplicates.java:182)
        at net.nutch.indexer.DeleteDuplicates.deleteUrlDuplicates(DeleteDuplicates.java:149)
        at net.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:264)
[EMAIL PROTECTED] nutch-nightly]#
_______________________________________________
Nutch-general mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-general
Index: src/java/net/nutch/indexer/DeleteDuplicates.java
===================================================================
RCS file: /cvsroot/nutch/nutch/src/java/net/nutch/indexer/DeleteDuplicates.java,v
retrieving revision 1.14
diff -u -r1.14 DeleteDuplicates.java
--- src/java/net/nutch/indexer/DeleteDuplicates.java 8 Sep 2004 16:29:12 -0000 1.14
+++ src/java/net/nutch/indexer/DeleteDuplicates.java 21 Sep 2004 16:44:25 -0000
@@ -204,6 +204,7 @@
       try {
         for (int index = 0; index < readers.length; index++) {
           IndexReader reader = readers[index];
+          LOG.info(" processing index in: " + reader.directory());
           int readerMax = reader.maxDoc();
           indexedDoc.index = index;
           for (int doc = 0; doc < readerMax; doc++) {
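
The idea behind the patch (announce each index before reading it, so the last name logged before the exception is the damaged one) can also be sketched on its own. The code below is only an illustration and assumes nothing about Nutch or Lucene internals: SegmentIndex, SegmentScan, findBroken and fake are hypothetical names made up for this example, and the in-memory fake index merely imitates a reader that throws on one document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a per-segment index reader; not a Lucene or Nutch type.
interface SegmentIndex {
    String name();            // where the index lives, e.g. its directory
    int maxDoc();             // number of document slots in the index
    String readDoc(int doc);  // may throw if the stored data is corrupt
}

class SegmentScan {
    // Reads every document of every index and returns the names of the
    // indexes in which a read threw, instead of aborting on the first failure.
    static List<String> findBroken(List<SegmentIndex> indexes) {
        List<String> broken = new ArrayList<>();
        for (SegmentIndex index : indexes) {
            // Same purpose as the LOG.info line in the patch.
            System.out.println("processing index in: " + index.name());
            try {
                for (int doc = 0; doc < index.maxDoc(); doc++) {
                    index.readDoc(doc);
                }
            } catch (RuntimeException e) {
                broken.add(index.name());
            }
        }
        return broken;
    }

    // In-memory fake for illustration: reading document badDoc throws.
    // Pass a negative badDoc for an index that reads cleanly.
    static SegmentIndex fake(String name, int maxDoc, int badDoc) {
        return new SegmentIndex() {
            public String name() { return name; }
            public int maxDoc() { return maxDoc; }
            public String readDoc(int doc) {
                if (doc == badDoc) {
                    throw new IndexOutOfBoundsException("simulated corrupt document " + doc);
                }
                return "doc-" + doc;
            }
        };
    }

    public static void main(String[] args) {
        List<SegmentIndex> indexes = new ArrayList<>();
        indexes.add(fake("segments/A/index", 3, -1)); // healthy
        indexes.add(fake("segments/B/index", 3, 1));  // fails on document 1
        System.out.println("broken: " + findBroken(indexes));
    }
}
```

Unlike the patch, this sketch catches the exception and keeps going, so it would report every damaged index in one pass rather than stopping at the first. Whether that is preferable to the one-line logging change is a judgment call.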
