RE: merging indexes with nutch

Tomislav Poljak Sat, 08 Mar 2008 07:01:52 -0800

Hi Siddharth,
I don't think Nutch stores embedded images in the segment data (you
will see images in cached pages in the Nutch search web application only if
the image URLs are absolute, in which case the images are fetched from the web). You
can see what Nutch stores in a segment for a given URL with:


bin/nutch readseg -get crawl/segments/20071218150307/ http://www.xy.com/

Tomislav

PS: substitute your own segment path and URL :)
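
A minimal sketch of narrowing that output to the parsed text only. It assumes your Nutch version's readseg accepts the general options -nocontent, -nofetch, -nogenerate, -noparse and -noparsedata (check the usage message printed by "bin/nutch readseg" with no arguments). The function name readseg_text_cmd is made up for illustration, and the segment path and URL are the same placeholders as above.

```shell
#!/bin/sh
# Placeholders from the example above; substitute your own segment and URL.
SEGMENT="crawl/segments/20071218150307/"
URL="http://www.xy.com/"

# Build a readseg command that skips every segment part except parse_text.
# Assumption: these -no* general options exist in your Nutch version.
readseg_text_cmd() {
    # $1 = segment directory, $2 = URL to look up
    printf '%s\n' "bin/nutch readseg -get $1 $2 -nocontent -nofetch -nogenerate -noparse -noparsedata"
}

CMD=$(readseg_text_cmd "$SEGMENT" "$URL")
echo "$CMD"

# Run the command only when a Nutch install is present in this directory.
if [ -x bin/nutch ]; then
    $CMD
fi
```

If the output shows only plain text for the page, the images are not in the segment and there is nothing to filter at indexing time.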


On Sat, 2008-03-08 at 07:07 +0530, Siddharth Jha wrote:
> Hello everyone
> 
> I am having a problem implementing cached text for our 
> nutch search engine. By cached text I mean the ability to store only the text of 
> a website, without any embedded images or CSS files. I have not been able to get this 
> working. I think this is because I need to filter out these 
> images at the time indexing happens in nutch.
> 
> Any help would be really appreciated. I had to reply to this message since 
> my new posts are not going through successfully on the mailing list.
> 
> Thanks
> Siddharth
> 
> > Date: Fri, 7 Mar 2008 18:10:13 +0000
> > From: [EMAIL PROTECTED]
> > To: [email protected]
> > Subject: Re: merging indexes with nutch
> > 
> > Thanks Tomislav, it worked beautifully.
> > 
> > I also found another explanation: the index was not read by
> > nutch because the index.done file had not been created (as mentioned in
> > http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_merge), so it
> > seems that manually adding an empty index.done to that folder would do
> > the job as well.
> > 
> > Cheers
> > 
> > On Wed, Mar 5, 2008 at 6:11 PM, Tomislav Poljak <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >  try this:
> > >  bin/nutch merge crawl/index crawl/indexes crawl/indexes1
> > >
> > >  where crawl/index (not indexes) is the directory that merge creates, and
> > >  crawl/indexes and crawl/indexes1 are the existing indexes to merge. The Nutch
> > >  search web application will use the merged index from crawl/index, and you
> > >  should see this in the web application log:
> > >
> > >  2007-09-09 20:30:58,949 INFO  searcher.NutchBean - creating new bean
> > >  2007-09-09 20:30:59,128 INFO  searcher.NutchBean - opening merged index in /home/nutch/test/trunk/crawl/index
> > >
> > >  Hope this helps,
> > >
> > >  Tomislav
> > >
> > >
> > >
> > >
> > >  On Tue, 2008-03-04 at 21:09 +0000, Boris Lau wrote:
> > >  > Hi all,
> > >  >
> > >  > I am having a problem with trying to get my merged index to be
> > >  > searched by nutch.
> > >  >
> > >  > I have used the "bin/nutch merge" command to merge 2 indexes into one, but
> > >  > the nutch web-app is not able to search the merged index (it always
> > >  > returns 0 items).  I have examined the index in Luke and everything
> > >  > seems sane (correct number of merged documents,
> > >  > segment references are correct, etc.).  It is just that the webapp
> > >  > gives 0 results.
> > >  >
> > >  > Is there something that I am missing?  Any advice on how I would debug it?
> > >  >
> > >  > Many thanks
> > >  > boris
> > >  >
> > >  > p.s. would anybody have a recommendation for an alternative way of
> > >  > examining the index other than Luke (e.g. a command line interface)?
> > >  > java awt is painfully slow....
> > >
> > >
> 

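On the index.done point in the quoted thread, here is a small sketch for checking whether a merged index directory contains that marker file. The paths crawl/index, crawl/indexes and crawl/indexes1 come from the thread. The helper name has_done_marker is made up for illustration, and whether your Nutch version actually requires the marker is not confirmed here.

```shell
#!/bin/sh
# Merged index directory as named in the quoted thread; adjust to your layout.
MERGED="crawl/index"

# Succeed when the given directory contains an index.done marker file.
has_done_marker() {
    [ -f "$1/index.done" ]
}

# Merge only when Nutch is present and the output directory does not exist yet.
if [ -x bin/nutch ] && [ ! -d "$MERGED" ]; then
    bin/nutch merge "$MERGED" crawl/indexes crawl/indexes1
fi

if has_done_marker "$MERGED"; then
    STATUS="marker present"
else
    STATUS="marker missing"
fi
echo "$MERGED: $STATUS"
```

If the marker is reported missing and the web application returns 0 results, the thread suggests that creating an empty index.done in that folder may be enough.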