Hi Siddharth,

I don't think Nutch stores embedded images in the segment data. You will see images in cached pages in the Nutch search web application only if the image URLs are absolute, in which case the images are fetched from the web. You can see what Nutch stores in a segment for a given URL with:
bin/nutch readseg -get crawl/segments/20071218150307/ http://www.xy.com/

Tomislav

PS: put in your own segment path and URL :)

On Sat, 2008-03-08 at 07:07 +0530, Siddharth Jha wrote:
> Hello everyone
>
> I am having a problem implementing cached text for our Nutch search
> engine. By cached text I mean the ability to store only the text of a
> website, without any embedded images or CSS files. I am not able to get
> this done. I think this is because I need to filter out these images
> while Nutch is indexing.
>
> Any help would be really appreciated. I had to reply to this message
> since my new posts are not going through successfully on the mailing
> list.
>
> Thanks
> Siddharth
>
> > Date: Fri, 7 Mar 2008 18:10:13 +0000
> > From: [EMAIL PROTECTED]
> > To: [email protected]
> > Subject: Re: merging indexes with nutch
> >
> > Thanks Tomislav, it worked beautifully.
> >
> > The other solution I found is that the index was not read by
> > Nutch because the index.done file was not created (as mentioned in
> > http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_merge), so it
> > seems that manually adding an empty index.done to that folder would
> > do the job as well.
> >
> > Cheers
> >
> > On Wed, Mar 5, 2008 at 6:11 PM, Tomislav Poljak <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > > try this:
> > > bin/nutch merge crawl/index crawl/indexes crawl/indexes1
> > >
> > > where crawl/index (not indexes) should be created by merge, and
> > > crawl/indexes and crawl/indexes1 are the existing indexes to merge.
> > > The Nutch search web application will use the merged index from
> > > crawl/index, and you should see this in the web application log:
> > >
> > > 2007-09-09 20:30:58,949 INFO searcher.NutchBean - creating new bean
> > > 2007-09-09 20:30:59,128 INFO searcher.NutchBean - opening merged index in /home/nutch/test/trunk/crawl/index
> > >
> > > Hope this helps,
> > >
> > > Tomislav
> > >
> > > On Tue, 2008-03-04 at 21:09 +0000, Boris Lau wrote:
> > > > Hi all,
> > > >
> > > > I am having a problem getting my merged index to be searched by
> > > > Nutch.
> > > >
> > > > I have used the "bin/nutch merge" command to merge 2 indexes into
> > > > one, but the Nutch web app is not able to search the merged index
> > > > (it always returns 0 items). I have examined the index in Luke and
> > > > everything seems sane (correct number of merged documents, segment
> > > > references are correct, etc.). It is just that the web app gives 0
> > > > results.
> > > >
> > > > Is there something that I am missing? Any advice on how I would
> > > > debug it?
> > > >
> > > > Many thanks
> > > > boris
> > > >
> > > > p.s. would anybody have a recommendation for an alternative way
> > > > of examining an index other than Luke (e.g. a command line
> > > > interface)? Java AWT is painfully slow....
