This is what I was thinking also previously :) It would seem sensible to have the option. I definitely have use cases where the links are not important.
On Thu, Jul 14, 2011 at 2:03 PM, Julien Nioche <[email protected]> wrote:
> Have been thinking about this again. We could make it so that the indexer
> does not necessarily require a linkDB: some people are not particularly
> interested in getting the anchors. At the moment you have to have a linkDB.
>
> This would make it a bit simpler (and quicker) to index within a crawl
> iteration. Any thoughts on this?
>
> On 12 July 2011 18:23, Markus Jelsma <[email protected]> wrote:
>
>> > Thanks for the responses :)
>> >
>> > So the size of the segments, then, I guess would determine the latency
>> > between crawling and indexing.
>>
>> The size of your crawldb may matter even more in some cases. If your
>> segment has just one file and your crawldb many millions, the indexing
>> takes forever.
>>
>> > My colleague and I will look more into the scripts to see how the diffs
>> > get pushed to Solr.
>> >
>> > Thanks again
>> >
>> > M
>> >
>> > On Tue, Jul 12, 2011 at 6:12 PM, lewis john mcgibbney
>> > <[email protected]> wrote:
>> > > To add to Julien's comments, there was a contribution made by Gabriele
>> > > a while ago which addressed this issue (however, I have not used his
>> > > scripts extensively). They might be worth a look. Try the link below:
>> > >
>> > > http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script
>> > >
>> > > On Tue, Jul 12, 2011 at 2:15 PM, Julien Nioche
>> > > <[email protected]> wrote:
>> > >> Hi Matthew,
>> > >>
>> > >> This is usually achieved by writing a script containing the individual
>> > >> Nutch commands (as opposed to calling 'nutch crawl') and indexing at
>> > >> the end of a generate-fetch-parse-update-linkdb sequence. You don't
>> > >> need any plugins for that.
>> > >>
>> > >> HTH
>> > >>
>> > >> Julien
>> > >>
>> > >> On 12 July 2011 13:35, Matthew Painter <[email protected]> wrote:
>> > >>> Hi all,
>> > >>>
>> > >>> I was wondering about the feasibility of creating a plugin for Nutch
>> > >>> that creates a Solr update command and adds it to a queue for
>> > >>> indexing after it first parses the page, rather than when crawling
>> > >>> has finished.
>> > >>>
>> > >>> This would allow you to do "real-time" indexing when crawling.
>> > >>>
>> > >>> Drawbacks: Not able to use the graph to give relevancy information.
>> > >>>
>> > >>> Wondering what initial thoughts are about this?
>> > >>>
>> > >>> Thanks :)
>> > >>
>> > >> --
>> > >> Open Source Solutions for Text Engineering
>> > >>
>> > >> http://digitalpebble.blogspot.com/
>> > >> http://www.digitalpebble.com
>> > >
>> > > --
>> > > Lewis
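
For reference, the approach Julien describes in the quoted thread (a script that runs the individual Nutch commands and indexes at the end of each generate-fetch-parse-update-linkdb sequence) could look roughly like the sketch below. It only builds the command lines for one iteration and does not run anything. The helper name `iteration_commands`, all paths, the segment name, and the Solr URL are made up for illustration, and the argument order for each `bin/nutch` subcommand is an assumption based on the Nutch 1.x command line, so check it against the usage output of your own install.

```python
from typing import List


def iteration_commands(nutch: str, crawldb: str, linkdb: str,
                       segments_dir: str, segment: str,
                       solr_url: str) -> List[List[str]]:
    """Build the command lines for one crawl iteration, indexing last.

    Hypothetical helper: argument order per subcommand is assumed from
    the Nutch 1.x CLI and should be verified against `bin/nutch <cmd>`.
    In a real run, `segment` is only known after `generate` has created
    it under `segments_dir`; it is passed in here to keep the sketch pure.
    """
    return [
        [nutch, "generate", crawldb, segments_dir],   # select URLs to fetch
        [nutch, "fetch", segment],                    # fetch the new segment
        [nutch, "parse", segment],                    # parse fetched content
        [nutch, "updatedb", crawldb, segment],        # fold results into crawldb
        [nutch, "invertlinks", linkdb, segment],      # update the linkdb (anchors)
        # Index at the end of the iteration; the linkdb is still passed,
        # since the thread says the indexer currently requires one.
        [nutch, "solrindex", solr_url, crawldb, linkdb, segment],
    ]


if __name__ == "__main__":
    # All values below are placeholders, not taken from the thread.
    for cmd in iteration_commands(
        "bin/nutch", "crawl/crawldb", "crawl/linkdb", "crawl/segments",
        "crawl/segments/20110712133500", "http://localhost:8983/solr",
    ):
        print(" ".join(cmd))
```

Each list could be handed to `subprocess.run(cmd, check=True)` in order, so that a failing step stops the iteration before anything is pushed to Solr. If the indexer were changed so that the linkdb is optional, as proposed above, the `invertlinks` step could be dropped for setups that do not need anchors.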

