[
https://issues.apache.org/jira/browse/NUTCH-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210241#comment-15210241
]
Lewis John McGibbney commented on NUTCH-2005:
---------------------------------------------
[[email protected]], please check out the issue description which I've
updated. You should begin producing your proposal ASAP.
You can see previous proposals for guidance at
https://wiki.apache.org/nutch/GoogleSummerOfCode#A2015
If you have any issues then please let me know.
> Implement HTrace'ing in Nutch
> -----------------------------
>
> Key: NUTCH-2005
> URL: https://issues.apache.org/jira/browse/NUTCH-2005
> Project: Nutch
> Issue Type: New Feature
> Components: build
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Labels: gsoc2016
> Fix For: 2.4
>
>
> Recent developments within the tracing community have brought projects like
> Apache HTrace (Incubating) into the Apache Incubator, opening up the
> possibility of utilizing tracing logic to better understand distributed
> applications, systems and systems-of-systems. As many will know, tracing
> involves a specialized use of logging to record information about a program’s
> execution. Although many use cases involve the use of tracing within
> distributed systems such as Hadoop and databases, few tracing experiments
> have been carried out within the field of large scale, distributed Web search.
> This issue will combine comprehensive tracing mechanisms in Apache HTrace
> (Incubating) with the scalable, flexible crawling architecture presented by
> Apache Nutch 2.X.
> As essentially every job (Inject, Generate, Fetch, Parse, UpdateDB, etc.) in
> Nutch 2.X interacts with a stack of complex underlying components (known as
> the search stack), comprehensive tracing would provide insight into system
> performance, latency, etc.
> Every job (a class which extends NutchTool and implements Tool) within Nutch
> 2.X therefore needs to be analyzed for suitability and appropriateness for
> tracing. Once this is understood, a ranked list of tools should be produced,
> ranked by how well suited each tool is to tracing. I would suggest that
> FetcherJob be at the top, as it enables us to trace not only the
> HTTPSocketConnections but also the writing of data through Gora -->
> DataStore.
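
To make the idea of tracing a job concrete, here is a minimal sketch of
span-based tracing around a fetch-style job. It deliberately does not use the
HTrace API: TraceSketch, Span, Scope, runFetch and the span names are all
hypothetical stand-ins, and the two inner stages are empty placeholders for the
HTTP fetch and the write through Gora. It assumes single-threaded use and that
scopes are closed in the reverse order they were opened.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for a tracing library; this is NOT the HTrace API. */
final class TraceSketch {

    /** One timed unit of work, linked to the span that was open when it began. */
    static final class Span {
        final String name;
        final String parent; // null for a root span
        final long startNanos;
        long endNanos;

        Span(String name, String parent, long startNanos) {
            this.name = name;
            this.parent = parent;
            this.startNanos = startNanos;
        }

        long durationNanos() {
            return endNanos - startNanos;
        }
    }

    /** Handle returned when a span is opened; closing it finishes the span. */
    interface Scope extends AutoCloseable {
        @Override
        void close(); // narrowed so callers need no catch block
    }

    private final Deque<Span> open = new ArrayDeque<>();
    private final List<Span> finished = new ArrayList<>();

    /** Opens a span whose parent is the innermost span still open, if any. */
    Scope newScope(String name) {
        Span parent = open.peek();
        Span span = new Span(name, parent == null ? null : parent.name, System.nanoTime());
        open.push(span);
        return () -> {
            span.endNanos = System.nanoTime();
            open.pop(); // assumes LIFO closing, which try-with-resources gives us
            finished.add(span);
        };
    }

    /** Spans in the order they finished (inner spans before their parent). */
    List<Span> finished() {
        return finished;
    }

    /** Hypothetical job: the outer span covers the job, inner spans its stages. */
    static void runFetch(TraceSketch tracer) {
        try (Scope job = tracer.newScope("FetcherJob")) {
            try (Scope http = tracer.newScope("http-fetch")) {
                // placeholder: open the connection and read the response
            }
            try (Scope store = tracer.newScope("datastore-write")) {
                // placeholder: persist the fetched content through the data store
            }
        }
    }

    public static void main(String[] args) {
        TraceSketch tracer = new TraceSketch();
        runFetch(tracer);
        for (Span s : tracer.finished()) {
            System.out.println(s.name + " parent=" + s.parent + " nanos=" + s.durationNanos());
        }
    }
}
```

Each finished span carries its parent's name and a duration, which is the kind
of per-stage latency information the ranking of tools would be based on. A real
integration would replace this class with the tracing library's own scopes and
span receivers.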
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)