[
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943880#comment-14943880
]
Chris A. Mattmann commented on NUTCH-2132:
------------------------------------------
Hey Julien, yeah to be honest we thought about using LogStash and log files
too. Some folks on my team even started some work on it (e.g., see
https://github.com/kwhitehall; she had done some odds-and-ends scripts for
this). I think, though, that Sujen was trying to figure out an approach that
directly publishes what's going on, rather than pushing info into a log and
grepping it back out (IOW, a more formalized approach).
That said, yes, LogStash has backends and so forth, and is configurable. If
someone has the time and energy to implement that, great. We tried, but didn't
get too far, mostly b/c we didn't want it to be a log parsing interface. I
think the RabbitMQ approach is a lot better. To your point about tying us into
that: if you look at the proposed patch from Sujen, it allows swapping of
the queue implementation behind an interface (we could use RabbitMQ,
ActiveMQ, or Kafka; so far there is one implementation, for RabbitMQ).
I also think it would be more modular to isolate the RabbitMQ jar stuff. But
this is a first cut, so you have what you have :) We'll try and make it better.
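To illustrate the "queue implementation behind an interface" idea, here is a
minimal sketch in Java. It is not code from the attached patch: the names
PubSubSketch, FetchEvent, EventPublisher and InMemoryPublisher are all
hypothetical, and an in-memory publisher stands in for a broker-backed one
(RabbitMQ, ActiveMQ, Kafka) so the example runs without extra jars.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class PubSubSketch {

    /** Hypothetical event record; field names are assumptions, not from the patch. */
    static final class FetchEvent {
        final String type; // e.g. "fetch-start" or "fetch-end"
        final String url;

        FetchEvent(String type, String url) {
            this.type = type;
            this.url = url;
        }
    }

    /** Hypothetical publisher interface: the fetcher would depend only on this. */
    interface EventPublisher {
        void publish(FetchEvent event);
    }

    /** In-memory stand-in for a broker-backed implementation. */
    static final class InMemoryPublisher implements EventPublisher {
        private final List<Consumer<FetchEvent>> subscribers = new ArrayList<>();

        void subscribe(Consumer<FetchEvent> subscriber) {
            subscribers.add(subscriber);
        }

        @Override
        public void publish(FetchEvent event) {
            // Deliver the event synchronously to every registered subscriber.
            for (Consumer<FetchEvent> subscriber : subscribers) {
                subscriber.accept(event);
            }
        }
    }

    /** Simulates fetching one URL, emitting a start and an end event. */
    static void fetch(String url, EventPublisher publisher) {
        publisher.publish(new FetchEvent("fetch-start", url));
        // ... the actual fetch would happen here ...
        publisher.publish(new FetchEvent("fetch-end", url));
    }

    public static void main(String[] args) {
        InMemoryPublisher publisher = new InMemoryPublisher();
        publisher.subscribe(e -> System.out.println(e.type + " " + e.url));
        fetch("http://example.org/", publisher);
    }
}
```

Because fetch() only sees EventPublisher, switching brokers would mean adding
another implementation of that interface rather than touching the fetcher,
which is also where the broker-specific jars could be isolated.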
> Publisher/Subscriber model for Nutch to emit events
> ----------------------------------------------------
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher, REST_api
> Reporter: Sujen Shah
> Labels: memex
> Fix For: 1.11
>
> Attachments: NUTCH-2132.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events
> (e.g., Fetcher events like fetch-start, fetch-end, and a fetch report which
> may contain data like the outlinks of the currently fetched URL, score, etc.).
> A consumer of this functionality could use this data to generate real-time
> visualizations and statistics of the crawl without having to wait for
> the fetch round to finish.
> The REST API could contain an endpoint which would respond with a URL to
> which a client could subscribe to get the fetcher events.