hi Marcus,

[please respond to the list, and not to my email, so that we keep everybody in the loop ;-) ]

Hi Renaud, thanks for your answer!

Just so I get it clear: the index dir that the crawl puts the segment files into is just a plain Lucene dir, like the one IndexWriter creates, true|false?
yep, just try to open it up with Luke.
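Or, if you'd rather check programmatically, something like this should do it (a minimal sketch, assuming the Lucene 2.x-era API; that the Lucene index lives under "crawl/index" is an assumption about your crawl layout):

import org.apache.lucene.index.IndexReader;

// quick sanity check: open the crawl output as a plain Lucene index
// ("crawl/index" is an assumed path, adjust to your layout)
public class IndexCheck {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("crawl/index");
    System.out.println("docs in index: " + reader.numDocs());
    reader.close();
  }
}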

So what you're saying is that I should have something like two dirs, named e.g. crawl & crawl_tmp?

Something like this?

1. Do the crawl into "crawl_tmp"
2. Rename/move "crawl_tmp" to "crawl"
3. Notify my LuceneSearcher (which points to "crawl") to reinit
yes, that should work.
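In code, the swap plus reinit could look roughly like this (just a sketch, all names are made up; note that File.renameTo only works reliably when both dirs are on the same filesystem):

import java.io.File;
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;

// hypothetical swap: promote the fresh index, then reopen the searcher
public class IndexSwapper {
  private IndexSearcher searcher;

  public synchronized void swap() throws IOException {
    File live = new File("crawl");
    File fresh = new File("crawl_tmp");
    File old = new File("crawl_old");

    if (live.exists() && !live.renameTo(old)) {
      throw new IOException("could not move the live index aside");
    }
    if (!fresh.renameTo(live)) {
      throw new IOException("could not promote the new index");
    }
    IndexSearcher previous = searcher;
    searcher = new IndexSearcher("crawl/index"); // reinit on the new dir
    if (previous != null) {
      previous.close(); // only safe once no running searches still use it
    }
    // clean up "crawl_old" afterwards
  }
}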

I save all RSS content into my DB, since the items are linked to another table named "Site", which the users of our service enter info about their site into. So I can link them by id rather than by the harder-to-interpret domain/URL. Lucene is just called from the save/update methods in the responsible DAO. I can be persuaded to change my mind :)
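The DAO hook is roughly this (a sketch with made-up field names; the "feeds" index path is just for illustration, and it assumes the Lucene 2.x API):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// called from the DAO's save/update methods; links the Lucene doc
// to the "Site" row by id instead of by domain/URL
public void indexFeedItem(long siteId, String title, String content) throws Exception {
  // false = append to an existing index (pass true to create it)
  IndexWriter writer = new IndexWriter("feeds", new StandardAnalyzer(), false);
  try {
    Document doc = new Document();
    doc.add(new Field("siteId", String.valueOf(siteId),
                      Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
    doc.add(new Field("content", content, Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);
  } finally {
    writer.close();
  }
}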

I fully understand that your crawler does a much better/faster job than my "ThreadedJobPool", but in the RSS case I think I need to do it myself.
whatever works for you :-)
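If you do end up searching your RSS index and the Nutch index together, Lucene's MultiSearcher should cover the "MultiIndex" case you mentioned; a rough sketch (the paths and the "content" field name are assumptions):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;

// search the Nutch crawl index and the RSS index in one go
public class CombinedSearch {
  public static void main(String[] args) throws Exception {
    Searchable[] searchers = {
      new IndexSearcher("crawl/index"), // Nutch crawl output
      new IndexSearcher("feeds")        // your own RSS index
    };
    MultiSearcher multi = new MultiSearcher(searchers);
    QueryParser parser = new QueryParser("content", new StandardAnalyzer());
    Hits hits = multi.search(parser.parse("canon eos 40d"));
    System.out.println("total hits: " + hits.length());
    multi.close();
  }
}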

Oh, and I have another question: what code generates the summaries in the search results? I would like to look at that "HtmlParser". I'm writing my own, which counts how many <h1>, <h2>, <p>, etc. tags there are on a page, tries to use the <h1> as the title (falling back to <title>), uses the <p> tags as the summary, and cleansed <div> content as a fallback.
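Roughly, the sketch I have in mind (regex-based, so it assumes reasonably well-formed HTML; all names here are illustrative, not the actual Nutch HtmlParser):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// heuristic: prefer <h1> as title, fall back to <title>;
// concatenate <p> blocks as the summary
public class HeuristicSummarizer {
  private static final Pattern H1 = Pattern.compile("<h1[^>]*>(.*?)</h1>",
      Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
  private static final Pattern TITLE = Pattern.compile("<title[^>]*>(.*?)</title>",
      Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
  private static final Pattern P = Pattern.compile("<p[^>]*>(.*?)</p>",
      Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

  public String title(String html) {
    Matcher m = H1.matcher(html);
    if (m.find()) return strip(m.group(1));
    m = TITLE.matcher(html);
    return m.find() ? strip(m.group(1)) : "";
  }

  public String summary(String html, int maxChars) {
    StringBuilder sb = new StringBuilder();
    Matcher m = P.matcher(html);
    while (m.find() && sb.length() < maxChars) {
      sb.append(strip(m.group(1))).append(' ');
    }
    return sb.toString().trim();
  }

  // drop any nested tags and collapse whitespace
  private String strip(String s) {
    return s.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
  }
}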
check http://lucene.apache.org/nutch/apidocs/org/apache/nutch/summary/lucene/package-summary.html

or tweak it in NutchBean:

public Summary[] getSummary(HitDetails[] hits, Query query)
    throws IOException {
  return summarizer.getSummary(hits, query);
}


HTH,
Renaud


The bastards at Google have a really nice summarizer (surprise); Yahoo's is not as good. Look below:

http://www.google.se/search?hl=sv&as_qdr=all&q=site%3Ahttp%3A%2F%2Fgadgets.fosfor.se%2Fcanon-eos-40d&btnG=S%C3%B6k&meta=

Go to the URL http://gadgets.fosfor.se/canon-eos-40d/ and look at the content; the damn summarizer is spot on. Probably one reason is that he has Google Ads, which "wrap" the contextual content and make it really easy to find.

What are your thoughts regarding the summaries?


Thanks again.

Kindly

//Marcus




On 8/7/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:

    hi Marcus,
    > Hi.
    >
    > I am building (yet another) crawler, parsing and indexing the html
    > files crawled with Lucene. Then I came to think about it. Stupido!
    > Why aren't you using Nutch instead!
    >
    > My use case is something like this.
    >
    > 100-1000 domains with an average depth of 3 to 5, I think. If I miss
    > some pages it is not the end of the world, so a tradeoff between
    > depth and crawl speed is taken.
    > All URLs must be crawled at least once a day, and it must be
    > crontabbed.
    >
    > I would like to have one Lucene dir which is optimized after each
    > reindexing, not one dir per crawl, so I need to create something like
    > the recrawl script which is published on the Wiki.
    >
    Not sure I understand: why don't you just throw away the old index
    once you have successfully created the new one (since you have to
    re-crawl the whole content daily)?
    > I would prefer to search the content myself by creating an
    > IndexSearcher. This is because I already index a whole lot of RSS
    > feeds, so I would like to do a "MultiIndex" search; I think that
    > will be hard to do without doing it yourself.
    >
    Or you could index the feeds with Nutch, too. There's a plugin for
    RSS...
    > I noticed the WAR file, but I would prefer to create the templates
    > myself.
    >
    Actually, the WAR is just a starter; you will have to implement your
    layout in the JSPs anyway.

    HTH,
    Renaud


