On Thursday 14 July 2011 15:03:34 Julien Nioche wrote:
> Have been thinking about this again. We could make it so that the indexer
> does not necessarily require a linkDB: some people are not particularly
> interested in getting the anchors. At the moment you have to have a linkDB.
> 
> This would make it a bit simpler (and quicker) to index within a crawl
> iteration. Any thoughts on this?

It still requires the CrawlDB, right? Or are you suggesting we index without 
mapping through the CrawlDB?
And at which point during the crawl cycle? The fetcher with parsing enabled? 
In that case, do we need URL filtering and normalizing in the parse job?
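
For reference, the per-iteration sequence Julien describes further down 
would look roughly like this as a script (an untested sketch; the paths, 
-topN and the Solr URL are made up for the example):

  #!/bin/bash
  # One crawl iteration, with indexing at the end. Paths are examples only.
  CRAWL=crawl
  SOLR=http://localhost:8983/solr/

  bin/nutch generate $CRAWL/crawldb $CRAWL/segments -topN 1000
  # pick up the segment that generate just created
  SEGMENT=`ls -d $CRAWL/segments/* | tail -1`

  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb $CRAWL/crawldb $SEGMENT
  bin/nutch invertlinks $CRAWL/linkdb $SEGMENT
  bin/nutch solrindex $SOLR $CRAWL/crawldb $CRAWL/linkdb $SEGMENT

Dropping the invertlinks step and the linkdb argument is exactly where your 
proposal would make this simpler and quicker.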

Anyway, take care of memory. The indexer can fill up your heap real quick, 
even with smaller add buffers. And of course handle indexing failures: it's 
not uncommon for requests to time out. There is also the problem of an 
unhappily timed commit, which currently stops all indexing to Solr (there's 
an open issue for this, as I remember).
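
If heap becomes a problem, shrinking the add buffer helps a bit: if I 
remember correctly the property is solr.commit.size, and since the indexer 
runs through ToolRunner you should be able to pass it on the command line 
(double-check against your version, I haven't verified this):

  bin/nutch solrindex -D solr.commit.size=100 \
    http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*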

> 
> On 12 July 2011 18:23, Markus Jelsma <[email protected]> wrote:
> > > Thanks for the responses :)
> > > 
> > > So the size of the segments, I guess, would then determine the
> > > latency between crawling and indexing.
> > 
> > The size of your crawldb may matter even more in some cases. If your
> > segment has just one file and your crawldb many millions of records, the
> > indexing takes forever.
> > 
> > > My colleague and I will look more into the scripts to see how the
> > > diffs get pushed to Solr.
> > > 
> > > Thanks again
> > > 
> > > M
> > > 
> > > 
> > > On Tue, Jul 12, 2011 at 6:12 PM, lewis john mcgibbney <[email protected]> wrote:
> > > > To add to Julien's comments, there was a contribution made by
> > > > Gabriele a while ago which addressed this issue (however, I have not
> > > > used his scripts extensively). They might be worth a look; try the
> > > > link below:
> > > > http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script
> > 
> > > > On Tue, Jul 12, 2011 at 2:15 PM, Julien Nioche <[email protected]> wrote:
> > > >> Hi Matthew,
> > > >> 
> > > >> This is usually achieved by writing a script containing the
> > > >> individual Nutch commands (as opposed to calling 'nutch crawl') and
> > > >> indexing at the end of a generate-fetch-parse-update-linkdb
> > > >> sequence. You don't need any plugins for that.
> > > >> 
> > > >> HTH
> > > >> 
> > > >> Julien
> > > >> 
> > > >> On 12 July 2011 13:35, Matthew Painter <[email protected]> wrote:
> > > >>> Hi all,
> > > >>> 
> > > >>> I was wondering about the feasibility of creating a plugin for
> > > >>> Nutch that creates a Solr update command and adds it to a queue
> > > >>> for indexing right after a page is parsed, rather than when
> > > >>> crawling has finished.
> > > >>> 
> > > >>> This would allow you to do "real-time" indexing while crawling.
> > > >>> 
> > > >>> Drawback: you would not be able to use the link graph to provide
> > > >>> relevance information.
> > > >>> 
> > > >>> I'm wondering what your initial thoughts are on this?
> > > >>> 
> > > >>> Thanks :)
> > > >> 
> > > >> --
> > > >> Open Source Solutions for Text Engineering
> > > >> 
> > > >> http://digitalpebble.blogspot.com/
> > > >> http://www.digitalpebble.com
> > > > 
> > > > --
> > > > Lewis

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
