Hi,

It seems you need to clarify what you mean by "modified document". Which case is it?

Case 1: you dump the crawled pages from a Nutch segment and do what you like with them. If this is your case, you need to decide which action you want:
I. Modify the documents and then ask Nutch to crawl the modified directory?
II. Modify the documents, write them back to the segment (the crawl DB), then do the indexing?

Case 2: keep track of document updates. In this case, if you keep re-crawling based on the same crawl DB (you only need to tune the re-crawl interval), Nutch will do the update for you.

Hope it helps :)
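For case 2, the re-fetch interval can be tuned in conf/nutch-site.xml. A minimal sketch, assuming Nutch 0.9, where the interval property is given in days (the value 7 below is just an example, not a recommendation):

```xml
<!-- conf/nutch-site.xml (example value): pages become eligible for
     re-fetching 7 days after their last fetch -->
<property>
  <name>db.default.fetch.interval</name>
  <value>7</value>
</property>
```

On the next generate/fetch/updatedb cycle, pages whose interval has elapsed are selected again, so the index can be rebuilt with the updated content.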
Jean-Christophe Alleman wrote:
>
> Hi,
>
> Forget what I said, this works fine! It's morning and I'm still not awake :-D
>
> I just wanted to know whether it is possible to re-index modified documents, or re-index documents which are already in the database?
>
> Thanks in advance!
>
> Jisay
>
>> Hi Susam Pal, and thanks for your help!
>>
>> The solution you gave me doesn't work... I still get an error with Hadoop... And if I download an older version of the API, will this patch work? I have Nutch 0.9 and I don't know whether the patch will work if I compile against an older Hadoop API. If it will, where can I find an older version of the Hadoop API?
>>
>> Thanks in advance for your help,
>>
>> Jisay
>>
>>> I am not sure, but it seems this is because of an older version of Hadoop. I don't have older versions of Nutch or Hadoop with me to confirm this. Just try omitting the second argument in fs.listPaths(indexes, HadoopFSUtil.getPassAllFilter()) and see if it compiles.
>>>
>>> I guess fs.listPaths(indexes) should work, since I can find such a method (though it is deprecated now) in the latest Hadoop API.
>>>
>>> Regards,
>>> Susam Pal
>>>
>>> On Tue, Mar 18, 2008 at 9:09 PM, Jean-Christophe Alleman wrote:
>>>>
>>>> Thanks for your reply, Susam Pal!
>>>>
>>>> I ran ant and got an error I can't resolve...
>>>> Look at this:
>>>>
>>>> debian:~/nutch-0.9# ant
>>>> Buildfile: build.xml
>>>>
>>>> init:
>>>>   [unjar] Expanding: /root/nutch-0.9/lib/hadoop-0.12.2-core.jar into /root/nutch-0.9/build/hadoop
>>>>   [untar] Expanding: /root/nutch-0.9/build/hadoop/bin.tgz into /root/nutch-0.9/bin
>>>>   [unjar] Expanding: /root/nutch-0.9/lib/hadoop-0.12.2-core.jar into /root/nutch-0.9/build
>>>>
>>>> compile-core:
>>>>   [javac] Compiling 133 source files to /root/nutch-0.9/build/classes
>>>>   [javac] /root/nutch-0.9/src/java/org/apache/nutch/crawl/Crawl.java:150: cannot find symbol
>>>>   [javac] symbol  : variable HadoopFSUtil
>>>>   [javac] location: class org.apache.nutch.crawl.Crawl
>>>>   [javac]       merger.merge(fs.listPaths(indexes, HadoopFSUtil.getPassAllFilter()),
>>>>   [javac]                                          ^
>>>>   [javac] Note: Some input files use or override a deprecated API.
>>>>   [javac] Note: Recompile with -Xlint:deprecation for details.
>>>>   [javac] Note: Some input files use unchecked or unsafe operations.
>>>>   [javac] Note: Recompile with -Xlint:unchecked for details.
>>>>   [javac] 1 error
>>>>
>>>> BUILD FAILED
>>>> /root/nutch-0.9/build.xml:106: Compile failed; see the compiler error output for details.
>>>>
>>>> Total time: 8 seconds
>>>>
>>>> I have already corrected 3 errors but I can't correct this one... I don't know what HadoopFSUtil is, so I can't fix the error... Please help me,
>>>>
>>>> Thanks for your help!
>>>>
>>>> Jisay
>>>>
>>>>> The patch was generated for the Nutch 1.0 development version, which is currently in trunk, so it cannot patch your older version cleanly.
>>>>>
>>>>> I also see that you are using NUTCH-601v0.3.patch. However, NUTCH-601v1.0.patch is the recommended patch. If this patch fails, you can make the modifications manually.
>>>>> This patch is extremely simple: if you open it in a text editor, you will find that 3 lines have been removed from the original source code (indicated by leading minus signs) and 11 new lines have been added (indicated by plus signs). You have to make these changes manually in your Nutch 0.9 source directory.
>>>>>
>>>>> Once you make the changes, just build the project again with ant and you will be ready to recrawl.
>>>>>
>>>>> Regards,
>>>>> Susam Pal
>>>>>
>>>>> On Tue, Mar 18, 2008 at 7:12 PM, Jean-Christophe Alleman wrote:
>>>>>>
>>>>>> Hi, I'm interested in this patch but I can't apply it. I run into some problems when I try to patch...
>>>>>>
>>>>>> Here is what I do:
>>>>>>
>>>>>> debian:~/patch# patch -p0
>>>>>> can't find file to patch at input line 5
>>>>>> Perhaps you used the wrong -p or --strip option?
>>>>>> The text leading up to this was:
>>>>>> --------------------------
>>>>>> |Index: src/java/org/apache/nutch/crawl/Crawl.java
>>>>>> |===================================================================
>>>>>> |--- src/java/org/apache/nutch/crawl/Crawl.java (revision 628119)
>>>>>> |+++ src/java/org/apache/nutch/crawl/Crawl.java (working copy)
>>>>>> --------------------------
>>>>>> File to patch: /root/nutch-0.9/src/java/org/apache/nutch/crawl/Crawl.java
>>>>>> patching file /root/nutch-0.9/src/java/org/apache/nutch/crawl/Crawl.java
>>>>>> Reversed (or previously applied) patch detected! Assume -R? [n] y
>>>>>> Hunk #2 FAILED at 100.
>>>>>> Hunk #3 FAILED at 131.
>>>>>> 2 out of 3 hunks FAILED -- saving rejects to file /root/nutch-0.9/src/java/org/apache/nutch/crawl/Crawl.java.rej
>>>>>>
>>>>>> Can you please help me? It's the first time I've applied a patch. Please help me!
>>>>>>
>>>>>> Thanks in advance,
>>>>>>
>>>>>> Jisay
>>>>>>
>>>>>>> The recrawl patch in https://issues.apache.org/jira/browse/NUTCH-601 got committed today. So if you check out the latest trunk, you can recrawl without deleting the crawl directory.
>>>>>>>
>>>>>>> However, if you are using an older version, you may use the script at:
>>>>>>> http://wiki.apache.org/nutch/Crawl
>>>>>>>
>>>>>>> Regards,
>>>>>>> Susam Pal
>>>>>>>
>>>>>>> On Fri, Mar 14, 2008 at 3:48 AM, Bradford Stephens wrote:
>>>>>>>> Greetings,
>>>>>>>>
>>>>>>>> A coworker and I are experimenting with Nutch in anticipation of a pretty large rollout at our company. However, we seem to be stuck on something -- after the crawler is finished, we can't manually re-crawl into the same directory/index! It says "Directory already exists" when we try to initiate a new crawl. Any ideas?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Bradford

--
View this message in context: http://www.nabble.com/Recrawling-without-deleting-crawl-directory-tp16039970p16235138.html
Sent from the Nutch - User mailing list archive at Nabble.com.