This is what I was thinking previously as well :)

It would seem sensible to have the option. I definitely have use cases where
the links are not important.
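
For reference, the current indexer call needs a linkDB, e.g. (the Solr
URL, paths and segment name below are just placeholders):

  bin/nutch solrindex http://localhost:8983/solr crawl/crawldb \
      crawl/linkdb crawl/segments/20110714120000

whereas the option discussed below would allow leaving the linkDB out,
something like (hypothetical form, this does not work today):

  bin/nutch solrindex http://localhost:8983/solr crawl/crawldb \
      crawl/segments/20110714120000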


On Thu, Jul 14, 2011 at 2:03 PM, Julien Nioche <[email protected]> wrote:

> Have been thinking about this again. We could make it so that the indexer
> does not necessarily require a linkDB: some people are not particularly
> interested in getting the anchors. At the moment you have to have a linkDB.
>
> This would make it a bit simpler (and quicker) to index within a crawl
> iteration. Any thoughts on this?
>
>
> On 12 July 2011 18:23, Markus Jelsma <[email protected]> wrote:
>
>>
>> > Thanks for the responses :)
>> >
>> > So the size of the segments, I guess, would then determine the latency
>> > between crawling and indexing.
>>
>> The size of your crawldb may matter even more in some cases. If your
>> segment has just one file and your crawldb many millions of records, the
>> indexing takes forever.
>>
>> >
>> > My colleague and I will look further into the scripts to see how the
>> > diffs get pushed to Solr.
>> >
>> > Thanks again
>> >
>> > M
>> >
>> >
>> > On Tue, Jul 12, 2011 at 6:12 PM, lewis john mcgibbney <[email protected]> wrote:
>> > > To add to Julien's comments, a contribution made by Gabriele a while
>> > > ago addressed this issue (however, I have not used his scripts
>> > > extensively). They might be worth a look; try the link below:
>> > >
>> > > http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script
>> > >
>> > > On Tue, Jul 12, 2011 at 2:15 PM, Julien Nioche <[email protected]> wrote:
>> > >> Hi Matthew,
>> > >>
>> > >> This is usually achieved by writing a script containing the
>> > >> individual Nutch commands (as opposed to calling 'nutch crawl') and
>> > >> indexing at the end of a generate-fetch-parse-update-linkdb sequence.
>> > >> You don't need any plugins for that.
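>> > >>
>> > >> Roughly, such a script boils down to a sequence like the one below
>> > >> (the paths, Solr URL and -topN value are just placeholders, not a
>> > >> tested script):
>> > >>
>> > >>   # seed the crawldb (only needed the first time)
>> > >>   bin/nutch inject crawl/crawldb urls
>> > >>
>> > >>   # one generate-fetch-parse-update-linkdb iteration
>> > >>   bin/nutch generate crawl/crawldb crawl/segments -topN 1000
>> > >>   SEGMENT=`ls -d crawl/segments/* | tail -1`
>> > >>   bin/nutch fetch $SEGMENT
>> > >>   bin/nutch parse $SEGMENT
>> > >>   bin/nutch updatedb crawl/crawldb $SEGMENT
>> > >>   bin/nutch invertlinks crawl/linkdb -dir crawl/segments
>> > >>
>> > >>   # index this iteration's segment at the end
>> > >>   bin/nutch solrindex http://localhost:8983/solr crawl/crawldb \
>> > >>       crawl/linkdb $SEGMENT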
>> > >>
>> > >> HTH
>> > >>
>> > >> Julien
>> > >>
>> > >> On 12 July 2011 13:35, Matthew Painter <[email protected]> wrote:
>> > >>> Hi all,
>> > >>>
>> > >>> I was wondering about the feasibility of creating a plugin for Nutch
>> > >>> that creates a Solr update command and adds it to a queue for
>> > >>> indexing right after the page is parsed, rather than when crawling
>> > >>> has finished.
>> > >>>
>> > >>> This would allow you to do "real-time" indexing when crawling.
>> > >>>
>> > >>> Drawback: you would not be able to use the link graph for relevancy
>> > >>> information.
>> > >>>
>> > >>> What are your initial thoughts on this?
>> > >>>
>> > >>> Thanks :)
>> > >>
>> > >> --
>> > >> Open Source Solutions for Text Engineering
>> > >>
>> > >> http://digitalpebble.blogspot.com/
>> > >> http://www.digitalpebble.com
>> > >
>> > > --
>> > > Lewis
>>
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>
