hi Marcus,

[please respond to the list, and not to my email, so that we keep everybody in the loop ;-) ]

Hi Renaud, thanks for your answer!

Just so I get it clear: the index dir that the crawl puts the segment files into is just a plain Lucene dir, like the one IndexWriter creates, true|false?
yep, just try to open it up with Luke.
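Or, if you'd rather check programmatically, something like this should do it (a minimal sketch, assuming the Lucene 2.x-era API; that the Lucene index lives under "crawl/index" is an assumption about your crawl layout):

import org.apache.lucene.index.IndexReader;

// quick sanity check: open the crawl output as a plain Lucene index
// ("crawl/index" is an assumed path, adjust to your layout)
public class IndexCheck {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("crawl/index");
    System.out.println("docs in index: " + reader.numDocs());
    reader.close();
  }
}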

So what you're saying is that I should have something like two dirs, named e.g. crawl & crawl_tmp?

Something like this?

1. Do the crawl into "crawl_tmp"
2. Rename/move "crawl_tmp" to "crawl"
3. Notify my LuceneSearcher (which points to "crawl") to reinit
yes, that should work.
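In code, the swap plus reinit could look roughly like this (just a sketch, all names are made up; note that File.renameTo only works reliably when both dirs are on the same filesystem):

import java.io.File;
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;

// hypothetical swap: promote the fresh index, then reopen the searcher
public class IndexSwapper {
  private IndexSearcher searcher;

  public synchronized void swap() throws IOException {
    File live = new File("crawl");
    File fresh = new File("crawl_tmp");
    File old = new File("crawl_old");

    if (live.exists() && !live.renameTo(old)) {
      throw new IOException("could not move the live index aside");
    }
    if (!fresh.renameTo(live)) {
      throw new IOException("could not promote the new index");
    }
    IndexSearcher previous = searcher;
    searcher = new IndexSearcher("crawl/index"); // reinit on the new dir
    if (previous != null) {
      previous.close(); // only safe once no running searches still use it
    }
    // clean up "crawl_old" afterwards
  }
}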

I save all RSS content into my DB, since the items are linked to another table named "Site", which the users of our service enter info about their site into. So I can link them by id rather than by the harder-to-interpret domain/URL. Lucene is just called from the save/update methods in the responsible DAO. I can be persuaded to change my mind :)
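The DAO hook is roughly this (a sketch with made-up field names; the "feeds" index path is just for illustration, and it assumes the Lucene 2.x API):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// called from the DAO's save/update methods; links the Lucene doc
// to the "Site" row by id instead of by domain/URL
public void indexFeedItem(long siteId, String title, String content) throws Exception {
  // false = append to an existing index (pass true to create it)
  IndexWriter writer = new IndexWriter("feeds", new StandardAnalyzer(), false);
  try {
    Document doc = new Document();
    doc.add(new Field("siteId", String.valueOf(siteId),
                      Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
    doc.add(new Field("content", content, Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);
  } finally {
    writer.close();
  }
}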

I fully understand that your crawler does a much better/faster job than my "ThreadedJobPool", but in the RSS case I think I need to do it myself.
whatever works for you :-)
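If you do end up searching your RSS index and the Nutch index together, Lucene's MultiSearcher should cover the "MultiIndex" case you mentioned; a rough sketch (the paths and the "content" field name are assumptions):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;

// search the Nutch crawl index and the RSS index in one go
public class CombinedSearch {
  public static void main(String[] args) throws Exception {
    Searchable[] searchers = {
      new IndexSearcher("crawl/index"), // Nutch crawl output
      new IndexSearcher("feeds")        // your own RSS index
    };
    MultiSearcher multi = new MultiSearcher(searchers);
    QueryParser parser = new QueryParser("content", new StandardAnalyzer());
    Hits hits = multi.search(parser.parse("canon eos 40d"));
    System.out.println("total hits: " + hits.length());
    multi.close();
  }
}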

Oh, and I have another question: what code generates the summaries in the search results? I would like to look at that "HtmlParser". I'm writing my own, which counts how many <h1>, <h2>, <p>, etc. tags there are on a page, tries to use the <h1> as the title (falling back to <title>), uses the <p> tags as the summary, and cleansed <div> content as a fallback.
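Roughly, the sketch I have in mind (regex-based, so it assumes reasonably well-formed HTML; all names here are illustrative, not the actual Nutch HtmlParser):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// heuristic: prefer <h1> as title, fall back to <title>;
// concatenate <p> blocks as the summary
public class HeuristicSummarizer {
  private static final Pattern H1 = Pattern.compile("<h1[^>]*>(.*?)</h1>",
      Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
  private static final Pattern TITLE = Pattern.compile("<title[^>]*>(.*?)</title>",
      Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
  private static final Pattern P = Pattern.compile("<p[^>]*>(.*?)</p>",
      Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

  public String title(String html) {
    Matcher m = H1.matcher(html);
    if (m.find()) return strip(m.group(1));
    m = TITLE.matcher(html);
    return m.find() ? strip(m.group(1)) : "";
  }

  public String summary(String html, int maxChars) {
    StringBuilder sb = new StringBuilder();
    Matcher m = P.matcher(html);
    while (m.find() && sb.length() < maxChars) {
      sb.append(strip(m.group(1))).append(' ');
    }
    return sb.toString().trim();
  }

  // drop any nested tags and collapse whitespace
  private String strip(String s) {
    return s.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
  }
}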
check http://lucene.apache.org/nutch/apidocs/org/apache/nutch/summary/lucene/package-summary.html

or tweak it in NutchBean:

public Summary[] getSummary(HitDetails[] hits, Query query)
    throws IOException {
  return summarizer.getSummary(hits, query);
}


HTH,
Renaud


The bastards at Google have a really nice summarizer (surprise); Yahoo's is not as good. Look below:

http://www.google.se/search?hl=sv&as_qdr=all&q=site%3Ahttp%3A%2F%2Fgadgets.fosfor.se%2Fcanon-eos-40d&btnG=S%C3%B6k&meta=

Go to the URL http://gadgets.fosfor.se/canon-eos-40d/ and look at the content; the damn summarizer is spot on. Probably one reason is that he has Google Ads, which "wrap" the contextual content and make it really easy to find.

What are your thoughts regarding the summaries?


Thanks again.

Kindly

//Marcus




On 8/7/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:

    hi Marcus,
    > Hi.
    >
    > I am building (yet another) crawler, parsing and indexing the html
    > files crawled with Lucene. Then I came to think about it. Stupido!
    > Why aren't you using Nutch instead!
    >
    > My use case is something like this.
    >
    > 100-1000 domains with an average depth of 3 to 5, I think. If I miss
    > some pages it is not the end of the world, so a tradeoff between
    > depth and crawl speed is taken.
    > All URLs must be crawled at least once a day, and it must be
    > crontabbed.
    >
    > I would like to have one Lucene dir which is optimized after each
    > reindexing, not one dir per crawl, so I need to create something like
    > the recrawl script which is published on the Wiki.
    >
    Not sure I understand: why don't you just throw away the old index
    once you have successfully created the new one (since you have to
    re-crawl the whole content daily)?
    > I would prefer to search the content myself by creating an
    > IndexSearcher. This is because I already index a whole lot of RSS
    > feeds, so I would like to do a "MultiIndex" search; I think that
    > will be hard to do without doing it yourself.
    >
    Or you could index the feeds with Nutch, too. There's a plugin for
    RSS...
    > I noticed the WAR file, but I would prefer to create the templates
    > myself.
    >
    Actually, the WAR is just a starter; you will have to implement your
    layout in the JSPs anyway.

    HTH,
    Renaud


