hi Marcus,
[please respond to the list, and not to my email, so that we keep
everybody in the loop ;-) ]
Hi Renaud, thanks for your answer!
Just so I have it clear: the index dir that the crawl puts the segment
files in is just a plain Lucene dir, like the one IndexWriter creates,
true|false?
yep, just try to open it up with Luke.
So what you're saying is that I should have something like two dirs,
named e.g. crawl & crawl_tmp.
Something like this?
1. Do the crawl to "crawl_tmp"
2. Rename / move "crawl_tmp" to "crawl"
3. Notify my LuceneSearcher, which points to "crawl", to reinit
yes, that should work.
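In code, steps 2 and 3 could look roughly like this (an untested
sketch; the method names are mine, and I'm assuming the Lucene index
ends up under crawl/index as usual):

import java.io.File;
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;

// inside whatever class manages your searcher:
IndexSearcher swapAndReopen(File tmpDir, File liveDir) throws IOException {
  delete(liveDir);                        // 2a. remove the old "crawl"
  if (!tmpDir.renameTo(liveDir))          // 2b. move "crawl_tmp" -> "crawl"
    throw new IOException("could not rename " + tmpDir);
  return new IndexSearcher(               // 3.  re-open on the new index
      new File(liveDir, "index").getPath());
}

void delete(File f) {                     // naive recursive delete
  File[] kids = f.listFiles();
  if (kids != null) for (File k : kids) delete(k);
  f.delete();
}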
I save all RSS content into my DB, since the items are connected to
another table named "Site" into which the users of our service enter
info about their site. That way I can link them by id rather than by
the harder-to-interpret domain/url.
Lucene is just called from my save/update methods in the responsible
DAO. I can, with some persuasion, change my mind :)
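Roughly like this (a stripped-down sketch; FeedItem, db, and writer
are illustrative stand-ins, not my actual classes):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public void save(FeedItem item) throws IOException {
  db.persist(item);                       // DB first, keyed by Site id
  Document doc = new Document();
  doc.add(new Field("id", String.valueOf(item.getId()),
      Field.Store.YES, Field.Index.UN_TOKENIZED));
  doc.add(new Field("content", item.getContent(),
      Field.Store.YES, Field.Index.TOKENIZED));
  // updateDocument = delete-then-add, so save and update share this path
  writer.updateDocument(new Term("id", String.valueOf(item.getId())), doc);
}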
I fully understand that your crawler does a much better/faster job
than my "ThreadedJobPool", but in the RSS case I think I need to do it
myself.
whatever works for you :-)
Oh, and I have another question: what code part generates the
summaries in the search result? I would like to look at that
"HtmlParser". I'm writing my own, which determines how many <h1>, <h2>,
<p>, etc. tags there are on a page and tries to use the headings as the
title (falling back to <title>), with <p> as the summary and cleansed
<div>s as a fallback.
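In miniature, the title part looks something like this (a toy regex
sketch only; the real parser counts the tags and cleans things up
properly):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String extractTitle(String html) {
  Matcher h1 = Pattern.compile("(?is)<h1[^>]*>(.*?)</h1>").matcher(html);
  if (h1.find()) return h1.group(1).trim();     // prefer the first <h1>
  Matcher t = Pattern.compile("(?is)<title[^>]*>(.*?)</title>").matcher(html);
  return t.find() ? t.group(1).trim() : "";     // fall back to <title>
}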
check
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/summary/lucene/package-summary.html
or tweak it in NutchBean:
public Summary[] getSummary(HitDetails[] hits, Query query)
    throws IOException {
  return summarizer.getSummary(hits, query);
}
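For context, search.jsp drives the bean roughly like this (from memory
of the 0.9-era API; double-check the exact signatures in your version):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.*;
import org.apache.nutch.util.NutchConfiguration;

Configuration conf = NutchConfiguration.create();
NutchBean bean = new NutchBean(conf);
Query query = Query.parse("canon eos", conf);
Hits hits = bean.search(query, 10);                    // top 10 hits
HitDetails[] details = bean.getDetails(hits.getHits(0, 10));
Summary[] summaries = bean.getSummary(details, query); // the snippets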
HTH,
Renaud
The bastards at Google have a really nice summarizer (surprise);
Yahoo's is not as good. Look below:
http://www.google.se/search?hl=sv&as_qdr=all&q=site%3Ahttp%3A%2F%2Fgadgets.fosfor.se%2Fcanon-eos-40d&btnG=S%C3%B6k&meta=
Go to the URL http://gadgets.fosfor.se/canon-eos-40d/ and look at the
content; the damn summarizer is spot on. One probable reason is that
the page has Google Ads, which "wrap" the contextual content and make
it really easy to find.
What are your thoughts regarding the summaries?
Thanks again.
Kindly
//Marcus
On 8/7/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:
hi Marcus,
> Hi.
>
> I am building (yet another) crawler, parsing and indexing the crawled
> HTML files with Lucene. Then I came to think about it. Stupido! Why
> aren't you using Nutch instead!
>
> My use case is something like this.
>
> 100-1000 domains with an average depth of 3 to 5, I think. If I miss
> some pages it is not the end of the world, so a tradeoff between
> depth and crawl speed is acceptable.
> All urls must be crawled at least once a day, via crontab.
>
> I would like to have one Lucene dir which is optimized after each
> reindexing, not one dir per crawl, so I need to create something like
> the recrawl script which is published on the Wiki.
>
Not sure I understand: why don't you just throw away the old index
once you have successfully created the new one (since you have to
re-crawl the whole content daily)?
> I would prefer to search the content myself by creating an
> IndexSearcher; this is because I already index a whole lot of RSS
> feeds, so I would like to do a "MultiIndex" search, which I think
> will be hard to do without doing it yourself.
>
Or you could index the feeds with Nutch, too. There's a plugin for
RSS...
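And if you do keep your own feed index, a plain Lucene MultiSearcher
over both indexes would be one way to do the "MultiIndex" search (a
sketch; the paths are made up):

import org.apache.lucene.search.*;

Searchable[] both = {
    new IndexSearcher("crawl/index"),   // the Nutch-built index
    new IndexSearcher("rss_index")      // your own RSS index
};
MultiSearcher searcher = new MultiSearcher(both);
Hits hits = searcher.search(query);     // query: any Lucene Query;
                                        // one search across both indexes

One caveat: the field names and analyzers need to line up across the
two indexes for the same query to make sense in both.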
> I noticed the WAR file, but I would prefer to create the templates
> myself.
>
Actually, the WAR is just a starter; you will have to implement your
own layout in the JSPs anyway.
HTH,
Renaud