[jira] [Updated] (NUTCH-2005) Implement HTrace'ing in Nutch

Lewis John McGibbney (JIRA) Thu, 24 Mar 2016 06:46:27 -0700

     [ https://issues.apache.org/jira/browse/NUTCH-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Lewis John McGibbney updated NUTCH-2005:
----------------------------------------
    Description: 
Recent developments within the tracing community have brought projects like 
Apache HTrace (Incubating) into the Apache Incubator, opening up the possibility 
of using tracing to better understand distributed applications, systems and 
systems-of-systems. As many will know, tracing involves a specialized use of 
logging to record information about a program’s execution. Although tracing is 
widely used within distributed systems such as Hadoop and databases, few 
tracing experiments have been carried out in the field of large-scale, 
distributed Web search. 
This issue will combine the comprehensive tracing mechanisms in Apache HTrace 
(Incubating) with the scalable, flexible crawling architecture presented by 
Apache Nutch 2.X.
As essentially every job (Inject, Generate, Fetch, Parse, UpdateDB, etc.) in 
Nutch 2.X interacts with a stack of complex underlying components (known as the 
search stack), comprehensive tracing would provide insight into system 
performance, latency, etc. 
Every job (a class which extends NutchTool and implements Tool) within Nutch 
2.X therefore needs to be analyzed for its suitability for tracing. Once this 
is understood, a ranked list of tools should be produced, with the ranking 
based upon which tools are most suited to tracing. I would suggest that 
FetcherJob be at the top, as it enables us to trace not only the 
HTTPSocketConnections but also the writing of data through Gora --> DataStore. 



  was:
Recent developments within the tracing community have brought projects like 
Apache HTrace (Incubating) into the Apache Incubator opening up the possibility 
of utilizing tracing logic to better understand distributed applications, 
systems and systems-of-systems. As many will know, tracing involves a 
specialized use of logging to record information about a program’s execution. 
Although many use cases involve the use of tracing within distributed systems 
such as Hadoop and databases, few tracing experiments belong within the field 
of large scale, distributed Web search. 
This issue will combine comprehensive tracing mechanisms in Apache HTrace 
(Incubating) with the scalable, flexible crawling architecture presented by 
Apache Nutch 2.X. Key takeaways from this presentation are development and 
implementation, tracing guidance for your web search stack and future work in 
this area. 




> Implement HTrace'ing in Nutch
> -----------------------------
>
>                 Key: NUTCH-2005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: build
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>              Labels: gsoc2016
>             Fix For: 2.4
>
>
> Recent developments within the tracing community have brought projects like 
> Apache HTrace (Incubating) into the Apache Incubator, opening up the 
> possibility of using tracing to better understand distributed applications, 
> systems and systems-of-systems. As many will know, tracing involves a 
> specialized use of logging to record information about a program’s 
> execution. Although tracing is widely used within distributed systems such as 
> Hadoop and databases, few tracing experiments have been carried out in the 
> field of large-scale, distributed Web search. 
> This issue will combine the comprehensive tracing mechanisms in Apache HTrace 
> (Incubating) with the scalable, flexible crawling architecture presented by 
> Apache Nutch 2.X.
> As essentially every job (Inject, Generate, Fetch, Parse, UpdateDB, etc.) in 
> Nutch 2.X interacts with a stack of complex underlying components (known as 
> the search stack), comprehensive tracing would provide insight into system 
> performance, latency, etc. 
> Every job (a class which extends NutchTool and implements Tool) within Nutch 
> 2.X therefore needs to be analyzed for its suitability for tracing. Once this 
> is understood, a ranked list of tools should be produced, with the ranking 
> based upon which tools are most suited to tracing. I would suggest that 
> FetcherJob be at the top, as it enables us to trace not only the 
> HTTPSocketConnections but also the writing of data through Gora --> 
> DataStore. 
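
To make the FetcherJob suggestion above concrete, here is a minimal sketch of 
what nested trace spans around the fetch and the DataStore write might look 
like. It deliberately does not use the HTrace API: SpanSketch, 
FetchTraceSketch, fetch() and store() are hypothetical stand-ins invented for 
illustration, and the sketch is single-threaded, so a real integration would 
look different.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a tracing span; this is NOT the HTrace API. */
final class SpanSketch implements AutoCloseable {
    // Spans that have been closed, in the order they finished.
    static final List<SpanSketch> FINISHED = new ArrayList<>();
    // Innermost open span (assumption: single-threaded use only).
    private static SpanSketch current;

    final String description;
    final SpanSketch parent;
    final long startNanos;
    long endNanos;

    private SpanSketch(String description) {
        this.description = description;
        this.parent = current;
        this.startNanos = System.nanoTime();
    }

    /** Opens a span as a child of whatever span is currently open. */
    static SpanSketch start(String description) {
        SpanSketch span = new SpanSketch(description);
        current = span;
        return span;
    }

    /** Records the end time and makes the parent the current span again. */
    @Override
    public void close() {
        endNanos = System.nanoTime();
        current = parent;
        FINISHED.add(this);
    }
}

public class FetchTraceSketch {
    // Placeholder for the HTTP fetch; a real job would open a connection here.
    static String fetch(String url) {
        try (SpanSketch span = SpanSketch.start("http-fetch")) {
            return "<html>" + url + "</html>";
        }
    }

    // Placeholder for the write through Gora to the DataStore.
    static void store(String url, String content) {
        try (SpanSketch span = SpanSketch.start("datastore-write")) {
            // Intentionally empty: nothing is persisted in this sketch.
        }
    }

    /** Wraps both steps in one parent span, as a traced job might. */
    static void tracedFetch(String url) {
        try (SpanSketch job = SpanSketch.start("FetcherJob")) {
            String content = fetch(url);
            store(url, content);
        }
    }

    public static void main(String[] args) {
        tracedFetch("http://example.org/");
        for (SpanSketch s : SpanSketch.FINISHED) {
            String parent = (s.parent == null) ? "-" : s.parent.description;
            System.out.println(s.description + " parent=" + parent
                    + " nanos=" + (s.endNanos - s.startNanos));
        }
    }
}
```

Child spans finish before their parent, so the two inner steps are recorded 
first and each points back at the enclosing job span. That parent link is 
what would let a trace viewer attribute time to the HTTP fetch versus the 
DataStore write.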



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
