Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "MohitBagde" page has been changed by MohitBagde:
https://wiki.apache.org/nutch/MohitBagde

New page:
##language:en
MohitBagde/Gsoc2015ProjectProposal

'''GSoC 2015 – Project Proposal for “NUTCH-1936 - GSoC 2015 - Move Nutch to Hadoop 2.X”'''

'''Name''': Mohit Bagde (Email: <<MailTo(bagde AT usc DOT edu)>>)

'''University''': University of Southern California

'''Mentor Name: '''Lewis John McGibbney

'''Abstract: '''The main goal of this project is to port the entire Apache
Nutch trunk codebase to the new Hadoop 2.X API [1]. This involves overhauling
all the main jobs of a Nutch crawl, viz. Injecting, Fetching, Parsing,
De-duplication and Indexing, moving them from the current Hadoop 1.X codebase
to the new Hadoop MRv2 API.

'''Content: '''

'''Motivation for the Project'''

My name is Mohit Bagde. I am a graduate student currently pursuing my Master’s
degree in Computer Science at the University of Southern California. I am
strongly self-motivated, both by my interest in this field and by the wish to
do valuable, guided research in Data Informatics, Information Retrieval and
Database Systems. I am aware that R group and Google expect very high standards
from their students. For my part, I can assure you of hard work and
consistency. I believe that my enthusiasm will enable me to meet those
expectations.

In my current semester, I have taken CS572 Information Retrieval and Search
Engines under Prof. Chris Mattmann and have worked on Nutch 1.X as part of the
first assignment [2], which involved crawling with Nutch, integrating it with
Tika and subsequently developing a Nutch plugin. I have also taken INF 550
under Prof. Seon Kim, where I have written Map Reduce programs that run over
HDFS. Both subjects meet in the JIRA issue NUTCH-1936 [1], which is about
porting Nutch to Hadoop 2.X. I enjoyed working with Nutch and found the entire
experience very instructive. I would like to continue to develop and contribute
to Nutch in any way possible.

'''Apache Nutch – About and Workflow'''

Apache Nutch is a highly scalable and robust web crawler that is also polite:
it obeys the robots.txt rules of the websites that it crawls [3]. Nutch,
developed by Doug Cutting (who also created Lucene and Hadoop), now has two
separate codebases, namely 1.X and 2.X. Nutch is written in Java and makes use
of various plugin modules that allow developers to implement their own parsers,
deduplication algorithms and indexer interfaces.

Apache Hadoop began as a Lucene sub-project [4] that was derived from Apache
Nutch, as Nutch required significant processing power to perform multi-machine
web crawling and indexing. This came about in the form of !MapReduce tasks and
the HDFS system. Nutch runs on a Hadoop cluster that scales well to the order
of ~100 machines. However, a user can also run Nutch on a single local machine
by configuring Hadoop to run in standalone or pseudo-distributed mode [5]. This
runs the same jobs as a cluster deployment, although without the processing
power of a cluster of machines.

The primary advantage of Apache Nutch lies in its customizable pluggable
interfaces, or “plugins” [6] as they are termed. The figure below shows a
simplified architecture of how Nutch performs its crawling and indexing.

{{http://www.atlantbh.com/wp-content/uploads/2012/03/Apache-Nutch-Flowchart-e1331646794565.png||height="348",width="678"}}

'''Figure 1 – Apache Nutch Flowchart [7]'''

From the few simple crawls I have run, the overall process is as follows:

1. Initially, we create a directory containing the files that list the seed
URLs. (URL files). Then we run the inject command to inject the seed URLs into
the crawlDB. The crawlDB stores metadata about the crawled URLs. (URL DB)

2. Then, the generate command is run to create a new segment containing a
fetch list, that is, the URLs in the crawlDB that are due to be fetched (for
example, those still marked db_unfetched). (SEGMENT)

3. The fetch command acquires the content of the URLs on the fetch list and
stores it in the segment directory created by generate. A crawl takes anywhere
from a few minutes (2-3 rounds of crawling) to days (30-40 rounds of crawling),
depending on the number of rounds that the crawler is run for. Crawls can be
optimized by modifying certain parameters in nutch-site.xml, which overrides
the nutch-default configuration. [8] (CONTENT)

4. Nutch is integrated with Apache Tika, a general framework of parsers [9],
to extract the content and resulting metadata from the URLs that have been
fetched. The parse command is run to parse the content of the fetched pages;
the HTML parser also strips the HTML markup from the fetched documents. The
crawlDB is then updated with the results (the updatedb command). The same
update is what lets a re-crawl incorporate changes made to the data of
previously crawled URLs. (PARSED_DATA)

5. Finally, before indexing the parsed data with Apache Solr, Nutch performs
an inversion of the links. This follows the paradigm that “it is not of
interest to account for the number of outgoing links, instead, we should
account for the number of inbound links”. This is quite similar to how Google
PageRank works and is important for the scoring function. The inverted links
are saved in the linkdb, which stores, for each URL fetched by Nutch, the
inbound links known for it. (INVERTLINK DB)

6. The last step is optional in Nutch as of the 1.10 trunk. One can index the
final 6 directories that are created (as shown in the directory structure). The
SolrIndex can be used along with Lucene’s library. After installing Solr [10],
one can go to localhost:8983/solr/admin (if on Jetty) and query the data that
has been crawled via Nutch.
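
The steps above can be pictured with a small in-memory model. The following
Java sketch is only an illustration and is not Nutch code: the class name
CrawlRoundSketch, its method names, the two-value Status enum and the fake
"web" map are all assumptions made for this example. Real Nutch runs each step
as a Hadoop job over files on HDFS and tracks more statuses than shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy in-memory model of one crawl round; hypothetical, not Nutch code. */
final class CrawlRoundSketch {
    enum Status { UNFETCHED, FETCHED }

    // Stand-in for the crawlDB: URL -> status.
    final Map<String, Status> crawlDb = new LinkedHashMap<>();
    // Stand-in for the linkdb: URL -> URLs that link to it.
    final Map<String, List<String>> linkDb = new TreeMap<>();

    /** inject: add seed URLs to the crawlDB as unfetched. */
    void inject(List<String> seeds) {
        for (String url : seeds) {
            crawlDb.putIfAbsent(url, Status.UNFETCHED);
        }
    }

    /** generate: build a fetch list of at most topN URLs still due for fetching. */
    List<String> generate(int topN) {
        List<String> fetchList = new ArrayList<>();
        for (Map.Entry<String, Status> e : crawlDb.entrySet()) {
            if (e.getValue() == Status.UNFETCHED && fetchList.size() < topN) {
                fetchList.add(e.getKey());
            }
        }
        return fetchList;
    }

    /** fetch + parse: look up each URL's outlinks in a fake "web" map. */
    Map<String, List<String>> fetchAndParse(List<String> fetchList,
                                            Map<String, List<String>> web) {
        Map<String, List<String>> outlinks = new LinkedHashMap<>();
        for (String url : fetchList) {
            outlinks.put(url, web.getOrDefault(url, Collections.<String>emptyList()));
        }
        return outlinks;
    }

    /** updatedb: mark fetched URLs and add newly discovered ones as unfetched. */
    void updateDb(Map<String, List<String>> outlinks) {
        for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
            crawlDb.put(e.getKey(), Status.FETCHED);
            for (String target : e.getValue()) {
                crawlDb.putIfAbsent(target, Status.UNFETCHED);
            }
        }
    }

    /** invertlinks: turn source -> targets into target -> sources. */
    void invertLinks(Map<String, List<String>> outlinks) {
        for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
            for (String target : e.getValue()) {
                linkDb.computeIfAbsent(target, k -> new ArrayList<>()).add(e.getKey());
            }
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new LinkedHashMap<>();
        web.put("http://a.example/", Arrays.asList("http://b.example/", "http://c.example/"));
        web.put("http://b.example/", Arrays.asList("http://c.example/"));

        CrawlRoundSketch crawl = new CrawlRoundSketch();
        crawl.inject(Arrays.asList("http://a.example/", "http://b.example/"));
        List<String> segment = crawl.generate(10);
        Map<String, List<String>> parsed = crawl.fetchAndParse(segment, web);
        crawl.updateDb(parsed);
        crawl.invertLinks(parsed);
        System.out.println("crawlDb = " + crawl.crawlDb);
        System.out.println("linkDb  = " + crawl.linkDb);
    }
}
```

In this model a second call to generate would return only the newly discovered
URL, which is the sense in which each crawl round builds on the previous one.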

'''Apache Hadoop – 1.X vs 2.X Issues'''

The current version of Apache Hadoop is 2.6. At the time Apache Nutch was
implemented, however, it ran on Hadoop 1.X. The main goal of this project is to
migrate Nutch from Hadoop 1.X to 2.X. Before moving on to how we migrate Nutch,
we must first understand what the key differences are between 1.X and 2.X and
what changes must be incorporated to ensure both binary and source
compatibility between the two versions for the Nutch trunk.
||'''Difference''' ||'''Hadoop 1.X''' ||'''Hadoop 2.X''' ||
||'''Number of nodes''' ||~4,000 nodes per cluster ||~10,000 nodes per cluster ||
||'''Running Time''' ||O(#nodes in cluster) ||O(cluster size) ||
||'''Namespace Config''' ||Only 1 namespace node ||Multiple namespaces for managing HDFS ||
||'''Application support''' ||Only able to run Map and reduce jobs, that are static ||Able to run any java apps that can integrate with Hadoop ||
||'''Efficiency''' ||Bottleneck lies in the JobTracker for both resource management and taskTracker task scheduling ||Uses YARN (Yet Another Resource Negotiator) to perform effective cluster management ||




 . '''Table 1.1 – Key difference in Hadoop 1.X and 2.X [11]'''

Although this table does not highlight all the differences between the two
codebases, it is a good starting point for exploring what changes must be made
to Apache Nutch’s jobs to port it to 2.X. In Apache Hadoop 2.x, resource
management has been moved into Apache Hadoop YARN, a general purpose,
distributed application management framework, while Apache Hadoop MapReduce
(aka MRv2) remains a pure distributed computation framework.

So the crux of the project is to ensure binary and source compatibility of the
Nutch code that uses the old '''mapred''' APIs. Binary compatibility means that
applications built against the MRv1 '''mapred''' APIs can run directly on YARN
without recompilation, merely by pointing them to an Apache Hadoop 2.x cluster
via configuration. The same cannot be guaranteed for applications that use the
'''mapreduce''' APIs, as these APIs have evolved a lot since MRv1. Where the
'''mapreduce''' APIs break binary compatibility, source compatibility is still
ensured. In other words, users should recompile applications that use the
'''mapreduce''' APIs against the MRv2 jars. One notable binary incompatibility
involves Counter and CounterGroup.
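
To make the porting task more concrete, the sketch below mirrors the shape of
the two API styles using stand-in types. None of these types are Hadoop
classes: Collector, Context, LegacyStyleMapper and ContextStyleMapper are
hypothetical names invented for this example. They are modelled on the fact
that the old org.apache.hadoop.mapred.Mapper is an interface whose map()
receives an OutputCollector (and a Reporter, omitted here), while the new
org.apache.hadoop.mapreduce.Mapper is a class whose map() receives a single
Context. The host-counting job is likewise only an assumed example, not a
Nutch job.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

/** Stand-in types only: they mimic the shape of the two API styles, not Hadoop itself. */
final class MapperPortingSketch {
    /** Mimics the old style: map() pushes pairs into a separate collector argument. */
    interface Collector<K, V> { void collect(K key, V value); }

    interface LegacyStyleMapper<K1, V1, K2, V2> {
        void map(K1 key, V1 value, Collector<K2, V2> output);
    }

    /** Mimics the new style: map() writes through a single context object. */
    interface Context<K, V> { void write(K key, V value); }

    abstract static class ContextStyleMapper<K1, V1, K2, V2> {
        abstract void map(K1 key, V1 value, Context<K2, V2> context);
    }

    /** A toy job in the old-style shape: emit (host, 1) for each URL. */
    static final class HostCountLegacy
            implements LegacyStyleMapper<String, String, String, Integer> {
        @Override
        public void map(String url, String content, Collector<String, Integer> output) {
            output.collect(URI.create(url).getHost(), 1);
        }
    }

    /** The same job in the new-style shape: the base type and the emit call change. */
    static final class HostCountPorted
            extends ContextStyleMapper<String, String, String, Integer> {
        @Override
        void map(String url, String content, Context<String, Integer> context) {
            context.write(URI.create(url).getHost(), 1);
        }
    }

    /** Runs both mappers over one record and returns what each emitted. */
    static List<String> runBoth(String url) {
        List<String> emitted = new ArrayList<>();
        new HostCountLegacy().map(url, "", (k, v) -> emitted.add("legacy:" + k + "=" + v));
        new HostCountPorted().map(url, "", (k, v) -> emitted.add("ported:" + k + "=" + v));
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(runBoth("http://nutch.apache.org/index.html"));
    }
}
```

The mapping logic itself is unchanged between the two versions; what moves is
the type the job extends and the object it emits through, which is why such a
port is mostly mechanical but touches every job.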

In general, MRv2 is able to ensure satisfactory compatibility with MRv1 
applications. However, due to some improvements and code re-factorings, a few 
APIs have been rendered backward-incompatible. The first quarter of the project 
will deal with identifying what APIs are being utilized by Nutch that have been 
rendered backward incompatible and discussing these issues as part of the 
NUTCH-1219 JIRA ticket [12]. I have already begun to look into this and will 
continue to do so as the project proceeds.

'''Project Plan'''

I believe that the project can be completed in a total of 4 quarters with each 
quarter being an increment over its predecessor. I propose the following plan 
for completing the project along with a timeline attached for each quarter, 
including deadlines and report submissions.
||'''Quarter''' ||'''Dates''' ||'''Content''' ||
||'''Q1''' ||03/27/2015 ||Initial Project Proposal draft deadline ||
||'''Q1''' ||03/28/2015 to 04/15/2015 ||Work done during this phase will mostly be oriented towards studying the documentation of Apache Nutch and Hadoop. I will also start drafting the first report at this time. This phase will be fairly long, as I will have to read, understand and discuss the documentation with my mentor on a regular basis. Various issues like 2.X binary and source compatibilities, YARN configurations and identification of mapred API issues in Nutch trunk are to be addressed [13]. ||
||'''Q1''' ||04/16/2015 to 04/23/2015 ||In this short period of time, I will work on crawling with Nutch and identifying potential incompatibilities when it is deployed over Hadoop. I have already used Nutch as part of my coursework in college and I already know how to crawl, inject, fetch, parse and index documents. I also have a Cloudera Distribution [14] (CDH) of Hadoop and HBase along with other HDFS framework technologies (like Sqoop and Flume) and can easily test the Map and Reduce tasks that are being used in Nutch at this point. ||
||'''Q1''' ||04/24/2015 to 05/01/2015 ||Study break, as I have my final exams on 30^th^ April and 1^st^ May and will require some time to prepare for them. ||
||'''Q1''' ||05/02/2015 to 05/22/2015 ||I will submit a rough draft of the first report to the mentor and discuss issues that have been solved and issues that have not yet been tackled. Additional reading and resources that have to be understood, and related work done in this area, will be addressed during this time frame. I will also begin work on resolving JIRA issues NUTCH-1936 and NUTCH-1219. ||
||'''Q1''' ||05/23/2015 ||Deadline for the submission of First report ||
||'''Q2''' ||05/24/2015 to 06/10/2015 ||Coding work to begin on the issues handled and discussed with the mentor in the Quarter 1 phase of the project. Will need to keep track of any new bugs and fixes that pop up during this phase. Second report rough draft to be completed. ||
||'''Q2''' ||06/11/2015 to 06/20/2015 ||First half of coding work completion deadline and discussion of second report with project mentor. Also, NUTCH-1219 to be resolved by the end of this period. ||
||'''Q3''' ||06/21/2015 ||Submission of Second report incorporating NUTCH-1219 resolution ||
||'''Q3''' ||06/22/2015 to 07/02/2015 ||Remainder of coding work and issues in second report to be completed. Discussion with mentor on handling of the Mid-term evaluation to be done. ||
||'''Q3''' ||07/03/2015 ||Mid-term evaluation done by Google. ||
||'''Q4''' ||07/04/2015 to 07/23/2015 ||Based on the evaluation, existing work to be modified following suggestions by the mentor. Coding work second half completion deadline to be addressed as well. ||
||'''Q4''' ||07/24/2015 to 08/02/2015 ||Testing work to be carried out on the system during this phase of implementation. Existing code must be tested to handle edge cases, exceptions and incompatibility issues addressed in the first and second reports. ||
||'''Q4''' ||08/03/2015 to 08/10/2015 ||Documentation to be written for all phases step by step and discussed with mentor. FAQ section and bug fixes to be documented as well. Final changes to be made that will increase brevity and readability. ||
||'''Q4''' ||08/11/2015 to 08/16/2015 ||Proof-checking and preparation for the Final Report submission deadline. Final discussions with mentor for any loose ends and patches to be done to the report. ||
||'''Q4''' ||08/17/2015 to 08/20/2015 ||Suggested 'pencils down' date. Take a week to scrub code, write tests, improve documentation, etc. ||
||'''Q4''' ||08/21/2015 ||Firm pencils down date and submission of report. ||





'''References'''

[1] https://issues.apache.org/jira/browse/NUTCH-1936

[2] http://sunset.usc.edu/classes/cs572_2015/CS572_HW_NUTCH_POLAR.pdf

[3] https://wiki.apache.org/nutch/

[4] http://hadoop.apache.org/docs/stable/

[5] https://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial

[6] https://wiki.apache.org/nutch/AboutPlugins

[7] http://www.atlantbh.com/wp-content/uploads/2012/03/Apache-Nutch-Flowchart-e1331646794565.png

[8] https://wiki.apache.org/nutch/OptimizingCrawls

[9] https://tika.apache.org/1.7/gettingstarted.html

[10] http://lucene.apache.org/solr/quickstart.html

[11] http://www.slideshare.net/RommelGarcia2/hadoop-1x-vs-2

[12] https://issues.apache.org/jira/browse/NUTCH-1219

[13] http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html

[14] http://www.cloudera.com/content/cloudera/en/about/hadoop-and-big-data.html
