[
https://issues.apache.org/jira/browse/NUTCH-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lewis John McGibbney updated NUTCH-2005:
----------------------------------------
Description:
Recent developments within the tracing community have brought projects like
Apache HTrace (Incubating) into the Apache Incubator, opening up the
possibility of using tracing logic to better understand distributed
applications, systems and systems-of-systems. As many will know, tracing
involves a specialized use of logging to record information about a program’s
execution. Although many use cases involve tracing within distributed systems
such as Hadoop and databases, few tracing experiments have been carried out in
the field of large-scale, distributed Web search.
This issue will combine comprehensive tracing mechanisms in Apache HTrace
(Incubating) with the scalable, flexible crawling architecture presented by
Apache Nutch 2.X.
As essentially every job (Inject, Generate, Fetch, Parse, UpdateDB, etc.) in
Nutch 2.X interacts with a stack of complex underlying components (known as the
search stack), comprehensive tracing would provide insight into system
performance, latency, etc.
Every job (a class which extends NutchTool and implements Tool) within Nutch
2.X therefore needs to be analyzed for its suitability for tracing. Once this
is understood, a ranked list of tools should be produced, ordered by how well
suited each tool is to tracing. I would suggest that FetcherJob be at the top,
as it enables us to trace not only the HTTPSocketConnections but also the
writing of data through Gora --> DataStore.
was:
Recent developments within the tracing community have brought projects like
Apache HTrace (Incubating) into the Apache Incubator, opening up the
possibility of using tracing logic to better understand distributed
applications, systems and systems-of-systems. As many will know, tracing
involves a specialized use of logging to record information about a program’s
execution. Although many use cases involve tracing within distributed systems
such as Hadoop and databases, few tracing experiments have been carried out in
the field of large-scale, distributed Web search.
This issue will combine comprehensive tracing mechanisms in Apache HTrace
(Incubating) with the scalable, flexible crawling architecture presented by
Apache Nutch 2.X. Key takeaways from this presentation are development and
implementation, tracing guidance for your web search stack and future work in
this area.
> Implement HTrace'ing in Nutch
> -----------------------------
>
> Key: NUTCH-2005
> URL: https://issues.apache.org/jira/browse/NUTCH-2005
> Project: Nutch
> Issue Type: New Feature
> Components: build
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Labels: gsoc2016
> Fix For: 2.4
>
>
> Recent developments within the tracing community have brought projects like
> Apache HTrace (Incubating) into the Apache Incubator, opening up the
> possibility of using tracing logic to better understand distributed
> applications, systems and systems-of-systems. As many will know, tracing
> involves a specialized use of logging to record information about a program’s
> execution. Although many use cases involve tracing within distributed
> systems such as Hadoop and databases, few tracing experiments have been
> carried out in the field of large-scale, distributed Web search.
> This issue will combine comprehensive tracing mechanisms in Apache HTrace
> (Incubating) with the scalable, flexible crawling architecture presented by
> Apache Nutch 2.X.
> As essentially every job (Inject, Generate, Fetch, Parse, UpdateDB, etc.) in
> Nutch 2.X interacts with a stack of complex underlying components (known as
> the search stack), comprehensive tracing would provide insight into system
> performance, latency, etc.
> Every job (a class which extends NutchTool and implements Tool) within Nutch
> 2.X therefore needs to be analyzed for its suitability for tracing. Once this
> is understood, a ranked list of tools should be produced, ordered by how well
> suited each tool is to tracing. I would suggest that FetcherJob be at the
> top, as it enables us to trace not only the HTTPSocketConnections but also
> the writing of data through Gora --> DataStore.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)