I tried that, and all I get are exceptions thrown because the "crawl"
folder is not set up correctly.
I ran...
bin/nutch crawl urls -dir crawl -threads 2 -depth 3 -topN 50
bin/nutch org.apache.nutch.searcher.NutchBean fort
Total hits: 0
Since it did not return any results, I tried running your recrawl script and
here is the output...
(I hardcoded the directories in the shell script for ease of use)
~/bin/recrawl.sh
started at 20060808_180949
reindexing 1 of 1 (max_num_pages each = 1000)
java.io.IOException: Input directory
C:/cygwin/home/fred/nutch-0.8/crawl/segments/20060808180951/crawl_fetch in
local is invalid.
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:62)
at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:116)
Exception in thread "main" java.io.IOException: Input directory
C:/cygwin/home/fred/nutch-0.8/crawl/segments/20060808180951/parse_data in
local is invalid.
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:212)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:316)
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:627)
at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:675)
Exception in thread "main" cp: cannot stat `crawl/mergesegs_dir/*': No such file or directory
java.io.IOException: Input directory
C:/cygwin/home/fred/nutch-0.8/crawl/segments/20060808180951/crawl_fetch in
local is invalid.
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:296)
at org.apache.nutch.indexer.Indexer.main(Indexer.java:313)
Exception in thread "main" java.io.IOException: Input directory
C:/cygwin/home/fred/nutch-0.8/crawl/newindexes in local is invalid.
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:326)
at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:365)
Exception in thread "main" ***Removing old segment directories that are no
longer in use. If any of these error out it is not a problem, just used for
clean up.
Removing Segment: crawl/segments/20060808180601
Total hits: 0
finished at 20060808_181108
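In case it helps narrow this down, here is the kind of check I can run against the segment the jobs complain about, to see which of the expected parts are actually there. The path is just my local cygwin layout, and the part names are what I would expect a fetched and parsed 0.8 segment to contain:

# check one segment for the sub-directories the recrawl jobs expect
SEG=~/nutch-0.8/crawl/segments/20060808180951
for part in content crawl_generate crawl_fetch crawl_parse parse_data parse_text; do
  if [ -d "$SEG/$part" ]; then echo "ok:      $part"; else echo "MISSING: $part"; fi
done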
P.S. I am running this on Windows XP.
-----Original Message-----
From: Matthew Holt [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 08, 2006 2:56 PM
To: [email protected]
Subject: Re: [Fwd: Re: 0.8 Recrawl script updated]
It's not needed. You use the bin/nutch script to generate the initial crawl.
Details here:
http://lucene.apache.org/nutch/tutorial8.html#Intranet+Crawling
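Roughly, from memory (so double-check against the tutorial page above, and note the domain below is just a placeholder), the intranet crawl boils down to something like:

# seed list (the file name under urls/ is arbitrary)
mkdir urls
echo 'http://intranet.example.com/' > urls/seed.txt

# limit the crawl to your site in conf/crawl-urlfilter.txt, e.g.
#   +^http://([a-z0-9]*\.)*intranet.example.com/

# one-shot crawl: builds the crawldb, segments, linkdb and index under ./crawl
bin/nutch crawl urls -dir crawl -threads 2 -depth 3 -topN 50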
Fred Tyre wrote:
> First of all, thanks for the recrawl script.
> I believe it will save me a few headaches.
>
> Secondly, is there a reason that there isn't a crawl script posted on the
> FAQ?
>
> As far as I can tell, you could take your recrawl script and add in the
> following line after you set up the crawl subdirectories:
> $FT_NUTCH_BIN/nutch crawl urls -dir $crawl_dir -threads 2 -depth 3 -topN 50
>
> Obviously, the threads, depth and topN could be parameters as well.
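> Something like this rough, untested sketch is what I had in mind (the bin
> path is just my local layout):
>
> # crawl.sh - wrapper around the initial crawl; threads/depth/topN as params
> FT_NUTCH_BIN=~/nutch-0.8/bin
> crawl_dir=crawl
> threads=${1:-2}
> depth=${2:-3}
> topn=${3:-50}
> $FT_NUTCH_BIN/nutch crawl urls -dir $crawl_dir -threads $threads -depth $depth -topN $topn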
>
> Thanks again.
>
> -----Original Message-----
> From: Matthew Holt [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, August 08, 2006 2:00 PM
> To: [email protected]; [email protected]
> Subject: Re: [Fwd: Re: 0.8 Recrawl script updated]
>
>
> Since it wasn't really clear whether my script approached the problem of
> deleting segments correctly, I refactored it so it generates the new
> segments, merges them into one, and then deletes the "new" segments. It is
> not as efficient disk-space-wise, but it still removes a large number of
> the segments that are not referenced by anything because they have not
> been indexed yet.
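> In shell terms the refactored flow is roughly the following (a simplified,
> from-memory sketch; the full script on the wiki also rebuilds the linkdb,
> re-indexes and dedups afterwards):
>
> # fetch $depth rounds of new segments
> for i in `seq 1 $depth`; do
>   bin/nutch generate crawl/crawldb crawl/segments -topN 1000
>   segment=`ls -d crawl/segments/* | tail -1`
>   bin/nutch fetch $segment
>   bin/nutch updatedb crawl/crawldb $segment
> done
> # merge into a single segment, swap it in, and remove the ones it replaced
> bin/nutch mergesegs crawl/mergesegs_dir -dir crawl/segments
> rm -rf crawl/segments/*
> cp -r crawl/mergesegs_dir/* crawl/segments/
> rm -rf crawl/mergesegs_dir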
>
> I updated the wiki again. Unless any more clarification is needed regarding
> the issue, hopefully I won't have to bombard your inbox with any more
> emails about this.
>
> Matt
>
> Lukas Vlcek wrote:
>
>> Hi again,
>>
>> I just found related discussion here:
>> http://www.nabble.com/NullPointException-tf2045994r1.html
>>
>> I think these guys are discussing a similar problem, and if I understood
>> the conclusion correctly, the only solution right now is to write some
>> code to test which segments are used in the index and which are not.
>>
>> Regards,
>> Lukas
>>
>> On 8/4/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
>>
>>> Matthew,
>>>
>>> In fact I didn't realize you were doing the merge step (sorry about that),
>>> but frankly I don't know exactly how merging works, whether this
>>> strategy would work over the long term, and whether it is a universal
>>> approach across all the situations that can occur during crawling
>>> (-topN, frozen threads, unavailable pages, a crawl that dies, etc.).
>>> Maybe it is the correct path; I would appreciate it if anybody could
>>> answer this question precisely.
>>>
>>> Thanks,
>>> Lukas
>>>
>>> On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>>>
>>>> If anyone doesn't mind taking a look...
>>>>
>>>>
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Matthew Holt <[EMAIL PROTECTED]>
>>>> To: [email protected]
>>>> Date: Fri, 04 Aug 2006 10:07:57 -0400
>>>> Subject: Re: 0.8 Recrawl script updated
>>>> Lukas,
>>>> Thanks for your e-mail. I assumed I could drop the $depth number of
>>>> oldest segments because I first merged them all into one segment (which
>>>> I don't drop). Am I incorrect in my assumption, and can this cause
>>>> problems in the future? If so, I'll go back to the original version
>>>> of my script, where I kept all the segments without merging. However, it
>>>> just seemed like, if that is the case, it will become a problem after
>>>> enough recrawls due to the large number of segments being kept.
>>>>
>>>> Thanks,
>>>> Matt
>>>>
>>>> Lukas Vlcek wrote:
>>>>
>>>>> Hi Matthew,
>>>>>
>>>>> I am curious about one thing. How do you know you can just drop the
>>>>> $depth oldest segments at the end? I haven't studied the nutch code
>>>>> regarding this topic yet, but I thought a segment could be dropped
>>>>> only once you are sure that all of its content has already been
>>>>> crawled in some newer segment (which should be checked somehow via
>>>>> some function/script, which hasn't been implemented yet, to my
>>>>> knowledge).
>>>>>
>>>>> Also, I don't think this question has been discussed on the dev/user
>>>>> lists in detail yet, so I just wanted to ask your opinion. The
>>>>> situation could get even more complicated if people add a -topN
>>>>> parameter to the script (which can happen because some might prefer
>>>>> crawling in ten smaller batches rather than two huge crawls for
>>>>> various technical reasons).
>>>>>
>>>>> Anyway, never mind if you don't want to bother with my silly question
>>>>> :-)
>>>>>
>>>>> Regards,
>>>>> Lukas
>>>>>
>>>>> On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Last email regarding this script. I found a bug in it that is sporadic
>>>>>> (I think it only affected different setups). However, since it would be
>>>>>> a problem sometimes, I refactored the script. I'd suggest you redownload
>>>>>> the script if you are using it.
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>> Matthew Holt wrote:
>>>>>>
>>>>>>> I'm currently pretty busy at work. If I have time, I'll do it later.
>>>>>>>
>>>>>>> The version 0.8 recrawl script has a working version online now. I
>>>>>>> temporarily modified it on the website yesterday when I ran into some
>>>>>>> problems, but I further tested it and the actual working code is
>>>>>>> modified now. So if you got it off the web site any time yesterday, I
>>>>>>> would redownload the script.
>>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>> Lourival JĂșnior wrote:
>>>>>>>
>>>>>>>> Hi Matthew!
>>>>>>>>
>>>>>>>> Could you update the script to version 0.7.2 with the same
>>>>>>>> functionality? I wrote a script that does this, but it doesn't work
>>>>>>>> very well...
>>>>>>>>
>>>>>>>> Regards!
>>>>>>>>
>>>>>>>> On 8/2/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>>>>>>>>
>>>>>>>>> Just letting everyone know that I updated the recrawl script on the
>>>>>>>>> Wiki. It now merges the created segments, then deletes the old segs
>>>>>>>>> to prevent a lot of unneeded data remaining/growing on the hard
>>>>>>>>> drive.
>>>>>>>>> Matt
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>
> http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03
>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>
>>>>
>>>>
>
>
>