Hi,
here is a shell script that reproduces the problem.
We noticed that after dedup the merged index contains fewer documents than the original index.

Number of Documents in
Original Index: 42
Dedup Index: 17

Could there be a mistake in the script, or in the process itself?

Regards,
Marko.

Here is the script. Before you run it, please delete the folders indexes, segments, tmp_index (if they exist) and urls.


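#!/bin/bash
# bash is required for the C-style for loop below

NUTCH=$HOME/nutch
DB=$NUTCH/db

SEGMENT=$NUTCH/segments
INDEX=$NUTCH/indexes

TMP_INDEX=$NUTCH/tmp_index

# Seed the web db with a single start URL
mkdir urls
echo 'http://www.apache.org' > urls/links.txt

$NUTCH/bin/nutch inject $DB urls

# Run two generate/fetch/updatedb rounds
REPEATS=2
for ((a=1; a <= REPEATS ; a++))
do
    $NUTCH/bin/nutch generate $DB $SEGMENT
    # the newest segment is the one that was just generated
    s1=`ls -d $SEGMENT/2* | tail -1`
    $NUTCH/bin/nutch fetch $s1
    $NUTCH/bin/nutch updatedb $DB $s1
done

# Index all segments (s1 is left unquoted so that each segment is its own argument)
s1=`ls -d $SEGMENT/2*`
$NUTCH/bin/nutch index $TMP_INDEX $DB $s1

# Copy one index part so that the merge input contains duplicate documents
s1=`ls -d $TMP_INDEX/part-* | tail -1`
cp -r $s1 $TMP_INDEX/copyOfIndex

# Merge all parts into one index, then remove the duplicates
$NUTCH/bin/nutch merge $INDEX/index $TMP_INDEX
$NUTCH/bin/nutch dedup $INDEX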
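
The cleanup requested before the script could be sketched roughly as follows. This is only a convenience sketch: the function name clean_previous_run and its two parameters are made up for illustration and are not part of Nutch. It assumes the same $HOME/nutch layout as the script and that urls was created in the directory the script was started from. The db folder is left alone, since it is not in the list of folders to delete.

```shell
#!/bin/bash
# Hypothetical helper (not part of Nutch): removes the output of a previous run.
#   $1 = Nutch directory (defaults to $HOME/nutch, as in the script)
#   $2 = directory in which the script created "urls" (defaults to .)
clean_previous_run() {
    local nutch_dir="${1:-$HOME/nutch}"
    local work_dir="${2:-.}"
    # -f keeps rm quiet when one of the folders does not exist
    rm -rf "$nutch_dir/indexes" \
           "$nutch_dir/segments" \
           "$nutch_dir/tmp_index" \
           "$work_dir/urls"
}

# Usage (run from the directory that holds "urls"):
#   clean_previous_run
```

Nothing is deleted until the function is actually called, so the paths can be checked first.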













On 25.10.2005 at 16:28, Stefan Groschupf wrote:

Hi Doug,

I copy a working index and merge the original and the copy together. Then I run dedup over the merged index. Shouldn't the dedup tool remove the duplicates in the merged index?
Thanks,
Stefan


On 24.10.2005 at 21:25, Doug Cutting wrote:


It works for me. It currently only deletes md5 duplicates, but url duplicates are currently handled elsewhere in the mapred branch. What problems did you see?
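
To illustrate what deleting md5 duplicates means in general, here is a minimal sketch that keeps one file per distinct MD5 content hash. It is not how the Nutch dedup tool is implemented; the function name unique_by_md5 is made up, and the sketch assumes GNU md5sum is available and that the paths contain no whitespace. If the comparison is on content hashes, entries with identical content collapse into one regardless of their names.

```shell
#!/bin/bash
# Illustration only (not Nutch code): print the first path seen for every
# distinct MD5 content hash. Nothing is deleted.
# Assumes GNU md5sum and paths without whitespace.
unique_by_md5() {
    # md5sum prints "<hash>  <path>"; awk prints a line only the first
    # time its hash (field 1) is seen
    md5sum "$@" | awk '!seen[$1]++ { print $2 }'
}

# Usage:
#   unique_by_md5 page1.txt page2.txt page3.txt
```

With three files of which two have identical content, only two paths are printed.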

Doug

Stefan Groschupf wrote:


Hi,
what is the status of the dedup tool in the mapreduce branch?
The javadoc mentions that the second part isn't implemented, but that the indexer will take care of this issue anyway. However, I tried the tool and it looks like it does not work correctly.
Thanks for a comment.
Stefan











_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
