Hi,

It seems you need to clarify what you mean by "modified document". Which case is it?

Case 1: you dump the crawled pages from a Nutch segment and do what you like with them. If this is your case, you need to decide which action you want:
I. Modify the documents and then ask Nutch to crawl the modified directory?
II. Modify the documents, write them back to the segment (the crawl DB), then do the indexing?

Case 2: keep track of document updates. In this case, if you keep re-crawling based on the same crawl DB (you only need to tune the re-crawl interval), Nutch will do the update for you.

Hope it helps :)
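For case 2, the re-fetch interval can be tuned in conf/nutch-site.xml. A minimal sketch, assuming Nutch 0.9, where the interval property is given in days (the value 7 below is just an example, not a recommendation):

```xml
<!-- conf/nutch-site.xml (example value): pages become eligible for
     re-fetching 7 days after their last fetch -->
<property>
  <name>db.default.fetch.interval</name>
  <value>7</value>
</property>
```

On the next generate/fetch/updatedb cycle, pages whose interval has elapsed are selected again, so the index can be rebuilt with the updated content.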
Jean-Christophe Alleman wrote:
>
> Hi,
>
> Forget what I said, this works fine! It's morning and I'm still not awake :-D
>
> I just wanted to know whether it is possible to re-index modified documents, or re-index documents which are already in the database?
>
> Thanks in advance!
>
> Jisay
>
>> Hi Susam Pal, and thanks for your help!
>>
>> The solution you gave me doesn't work... I still get an error with Hadoop... And if I download an older version of the API, will this patch work? I have Nutch 0.9 and I don't know whether the patch will work if I compile against an older Hadoop API. If it will, where can I find an older version of the Hadoop API?
>>
>> Thanks in advance for your help,
>>
>> Jisay
>>
>>> I am not sure, but it seems this is because of an older version of Hadoop. I don't have older versions of Nutch or Hadoop with me to confirm this. Just try omitting the second argument in fs.listPaths(indexes, HadoopFSUtil.getPassAllFilter()) and see if it compiles.
>>>
>>> I guess fs.listPaths(indexes) should work, since I can find such a method (though it is deprecated now) in the latest Hadoop API.
>>>
>>> Regards,
>>> Susam Pal
>>>
>>> On Tue, Mar 18, 2008 at 9:09 PM, Jean-Christophe Alleman wrote:
>>>>
>>>> Thanks for your reply, Susam Pal!
>>>>
>>>> I ran ant and got an error I can't resolve...
>>>> Look at this:
>>>>
>>>> debian:~/nutch-0.9# ant
>>>> Buildfile: build.xml
>>>>
>>>> init:
>>>>   [unjar] Expanding: /root/nutch-0.9/lib/hadoop-0.12.2-core.jar into /root/nutch-0.9/build/hadoop
>>>>   [untar] Expanding: /root/nutch-0.9/build/hadoop/bin.tgz into /root/nutch-0.9/bin
>>>>   [unjar] Expanding: /root/nutch-0.9/lib/hadoop-0.12.2-core.jar into /root/nutch-0.9/build
>>>>
>>>> compile-core:
>>>>   [javac] Compiling 133 source files to /root/nutch-0.9/build/classes
>>>>   [javac] /root/nutch-0.9/src/java/org/apache/nutch/crawl/Crawl.java:150: cannot find symbol
>>>>   [javac] symbol  : variable HadoopFSUtil
>>>>   [javac] location: class org.apache.nutch.crawl.Crawl
>>>>   [javac]       merger.merge(fs.listPaths(indexes, HadoopFSUtil.getPassAllFilter()),
>>>>   [javac]                                          ^
>>>>   [javac] Note: Some input files use or override a deprecated API.
>>>>   [javac] Note: Recompile with -Xlint:deprecation for details.
>>>>   [javac] Note: Some input files use unchecked or unsafe operations.
>>>>   [javac] Note: Recompile with -Xlint:unchecked for details.
>>>>   [javac] 1 error
>>>>
>>>> BUILD FAILED
>>>> /root/nutch-0.9/build.xml:106: Compile failed; see the compiler error output for details.
>>>>
>>>> Total time: 8 seconds
>>>>
>>>> I have already corrected 3 errors but I can't correct this one... I don't know what HadoopFSUtil is, so I can't fix the error... Please help me,
>>>>
>>>> Thanks for your help!
>>>>
>>>> Jisay
>>>>
>>>>> The patch was generated for the Nutch 1.0 development version, which is currently in trunk, so it cannot patch your older version cleanly.
>>>>>
>>>>> I also see that you are using NUTCH-601v0.3.patch. However, NUTCH-601v1.0.patch is the recommended patch. If this patch fails, you can make the modifications manually.
>>>>> This patch is extremely simple: if you open it in a text editor, you will find that 3 lines have been removed from the original source code (indicated by leading minus signs) and 11 new lines have been added (indicated by plus signs). You have to make these changes manually in your Nutch 0.9 source directory.
>>>>>
>>>>> Once you make the changes, just build the project again with ant and you will be ready to recrawl.
>>>>>
>>>>> Regards,
>>>>> Susam Pal
>>>>>
>>>>> On Tue, Mar 18, 2008 at 7:12 PM, Jean-Christophe Alleman wrote:
>>>>>>
>>>>>> Hi, I'm interested in this patch but I can't apply it. I run into some problems when I try to patch...
>>>>>>
>>>>>> Here is what I do:
>>>>>>
>>>>>> debian:~/patch# patch -p0
>>>>>> can't find file to patch at input line 5
>>>>>> Perhaps you used the wrong -p or --strip option?
>>>>>> The text leading up to this was:
>>>>>> --------------------------
>>>>>> |Index: src/java/org/apache/nutch/crawl/Crawl.java
>>>>>> |===================================================================
>>>>>> |--- src/java/org/apache/nutch/crawl/Crawl.java (revision 628119)
>>>>>> |+++ src/java/org/apache/nutch/crawl/Crawl.java (working copy)
>>>>>> --------------------------
>>>>>> File to patch: /root/nutch-0.9/src/java/org/apache/nutch/crawl/Crawl.java
>>>>>> patching file /root/nutch-0.9/src/java/org/apache/nutch/crawl/Crawl.java
>>>>>> Reversed (or previously applied) patch detected! Assume -R? [n] y
>>>>>> Hunk #2 FAILED at 100.
>>>>>> Hunk #3 FAILED at 131.
>>>>>> 2 out of 3 hunks FAILED -- saving rejects to file /root/nutch-0.9/src/java/org/apache/nutch/crawl/Crawl.java.rej
>>>>>>
>>>>>> Can you please help me? It's the first time I've applied a patch. Please help me!
>>>>>>
>>>>>> Thanks in advance,
>>>>>>
>>>>>> Jisay
>>>>>>
>>>>>>> The recrawl patch in https://issues.apache.org/jira/browse/NUTCH-601 got committed today. So if you check out the latest trunk, you can recrawl without deleting the crawl directory.
>>>>>>>
>>>>>>> However, if you are using an older version, you may use the script at:
>>>>>>> http://wiki.apache.org/nutch/Crawl
>>>>>>>
>>>>>>> Regards,
>>>>>>> Susam Pal
>>>>>>>
>>>>>>> On Fri, Mar 14, 2008 at 3:48 AM, Bradford Stephens wrote:
>>>>>>>> Greetings,
>>>>>>>>
>>>>>>>> A coworker and I are experimenting with Nutch in anticipation of a pretty large rollout at our company. However, we seem to be stuck on something -- after the crawler is finished, we can't manually re-crawl into the same directory/index! It says "Directory already exists" when we try to initiate a new crawl. Any ideas?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Bradford

--
View this message in context: http://www.nabble.com/Recrawling-without-deleting-crawl-directory-tp16039970p16235138.html
Sent from the Nutch - User mailing list archive at Nabble.com.