Junqiang Zhang created NUTCH-2675:
-------------------------------------

             Summary: Give parsers the capability to read and write CrawlDatum
                 Key: NUTCH-2675
                 URL: https://issues.apache.org/jira/browse/NUTCH-2675
             Project: Nutch
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.15
            Reporter: Junqiang Zhang
             Fix For: 1.15


Parsers are called inside org.apache.nutch.parse.ParseSegment,
(Line 127 for version 1.15)        parseResult = parseUtil.parse(content);

and inside org.apache.nutch.fetcher.FetcherThread.
(Line 640 for version 1.15)            parseResult = this.parseUtil.parse(content);



The current version of Nutch does not give parsers the capability to access 
CrawlDatum. If users want to customize the parsing process using some metadata 
of CrawlDatum, it is difficult to read the required metadata. 

On the other hand, if users want to save metadata generated during parsing, the 
metadata can only be saved as parseMeta of org.apache.nutch.parse.ParseData. 
Only the parseMeta entries selected by db.parsemeta.to.crawldb in nutch-site.xml 
are then added to CrawlDatum, inside org.apache.nutch.parse.ParseOutputFormat and 
org.apache.nutch.crawl.CrawlDbReducer. If parsers had direct access to 
CrawlDatum, they could add the metadata generated during parsing to CrawlDatum 
directly.




I use Nutch to fetch and parse web pages. To read the required metadata from 
CrawlDatum during parsing, I use the following workaround.

(1) During web page fetching, inside 
org.apache.nutch.protocol.http.api.HttpBase of the lib-http plugin, read the 
required metadata from CrawlDatum and save it, together with the Headers 
metadata of org.apache.nutch.net.protocols.Response, to the metadata of 
org.apache.nutch.protocol.Content. This can be done at line 334 of the code 
by replacing "response.getHeaders()" with a new metadata object containing both 
the required metadata from CrawlDatum and the Headers metadata.

The code that needs to be modified inside 
org.apache.nutch.protocol.http.api.HttpBase of the lib-http plugin is:
(Line 332 for version 1.15)      Content c = new Content(u.toString(), u.toString(),
(Line 333 for version 1.15)           (content == null ? EMPTY_CONTENT : content),
(Line 334 for version 1.15)           response.getHeader("Content-Type"), response.getHeaders(), mimeTypes);
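
The merge described in step (1) could look roughly like the sketch below. It is only an illustration: plain java.util maps stand in for the response headers and the CrawlDatum metadata, and the class name, the "crawldatum." key prefix, and the "site.category" key are all hypothetical, not part of Nutch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in sketch: plain maps model the response headers and the CrawlDatum
// metadata. In Nutch these would be a Metadata object and a CrawlDatum.
class MergedMetadataSketch {

  // Hypothetical prefix that keeps CrawlDatum keys from colliding with headers.
  static final String DATUM_PREFIX = "crawldatum.";

  // Returns a new map holding every header plus the selected CrawlDatum entries.
  // The input maps are left unchanged; keys missing from the datum are skipped.
  static Map<String, String> merge(Map<String, String> headers,
      Map<String, String> datumMeta, String... wantedKeys) {
    Map<String, String> merged = new LinkedHashMap<>(headers);
    for (String key : wantedKeys) {
      String value = datumMeta.get(key);
      if (value != null) {
        merged.put(DATUM_PREFIX + key, value);
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<String, String> headers = new LinkedHashMap<>();
    headers.put("Content-Type", "text/html");
    Map<String, String> datumMeta = new LinkedHashMap<>();
    datumMeta.put("site.category", "news");
    // The merged map is what would be passed in place of response.getHeaders().
    System.out.println(merge(headers, datumMeta, "site.category"));
  }
}
```

Prefixing the copied keys is a design choice of this sketch, so that a CrawlDatum entry can never overwrite an HTTP header of the same name.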

(2) During HTML page parsing, inside org.apache.nutch.parse.html.HtmlParser of 
the parse-html plugin, read the required metadata from the metadata of 
org.apache.nutch.protocol.Content, and customize the parsing process using the 
required metadata.
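
On the parser side, step (2) amounts to a lookup followed by a branch. A minimal sketch, again with a plain map standing in for the Content metadata; the key name and the parsing modes are made up for illustration and do not exist in Nutch:

```java
import java.util.Map;

// Stand-in sketch: a plain map models the metadata of Content as seen by the
// parser after the fetch-time copy described in step (1).
class ParserSideSketch {

  // Hypothetical key under which the CrawlDatum value was copied.
  static final String CATEGORY_KEY = "crawldatum.site.category";

  // Picks a (hypothetical) parsing mode from the copied value.
  // Falls back to "default" when the value is absent or unrecognized.
  static String chooseMode(Map<String, String> contentMeta) {
    String category = contentMeta.get(CATEGORY_KEY);
    if ("news".equals(category)) {
      return "article";
    }
    return "default";
  }

  public static void main(String[] args) {
    System.out.println(chooseMode(Map.of(CATEGORY_KEY, "news")));
    System.out.println(chooseMode(Map.of()));
  }
}
```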




If parsers had direct access to CrawlDatum, the above workaround would not be 
needed. To give parsers the capability to directly read and write CrawlDatum, I 
would like to suggest adding a new method "public ParseResult parse(Content 
content, CrawlDatum datum)" to org.apache.nutch.parse.ParseUtil in future 
versions of Nutch.
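
To make the suggestion concrete, here is a rough sketch of how the two entry points could sit side by side. The nested types only mimic the shape of Content, CrawlDatum, ParseResult and ParseUtil; they are not the real Nutch classes. The DatumAwareParser interface and the choice to pass an empty datum from the one-argument method are assumptions of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in sketch: these nested types only mimic the shape of the Nutch
// classes named in this issue; they are not the real implementations.
class ParseOverloadSketch {

  static class Content {
    final String url;
    Content(String url) { this.url = url; }
  }

  static class CrawlDatum {
    final Map<String, String> metaData = new HashMap<>();
  }

  static class ParseResult {
    final String summary;
    ParseResult(String summary) { this.summary = summary; }
  }

  // Hypothetical parser contract that receives the datum with the content.
  interface DatumAwareParser {
    ParseResult getParse(Content content, CrawlDatum datum);
  }

  static class ParseUtil {
    private final DatumAwareParser parser;
    ParseUtil(DatumAwareParser parser) { this.parser = parser; }

    // Existing entry point: callers that have no datum keep working.
    // Passing an empty datum here is an assumption of this sketch.
    ParseResult parse(Content content) {
      return parse(content, new CrawlDatum());
    }

    // Proposed entry point: the parser can read and write the datum.
    ParseResult parse(Content content, CrawlDatum datum) {
      return parser.getParse(content, datum);
    }
  }

  public static void main(String[] args) {
    ParseUtil util = new ParseUtil((content, datum) -> {
      String lang = datum.metaData.getOrDefault("lang", "unknown"); // read
      datum.metaData.put("parsed.by", "sketch");                    // write
      return new ParseResult(content.url + " [" + lang + "]");
    });
    CrawlDatum datum = new CrawlDatum();
    datum.metaData.put("lang", "en");
    System.out.println(util.parse(new Content("http://example.com/"), datum).summary);
    System.out.println(datum.metaData.get("parsed.by"));
  }
}
```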

To stay compatible with 1.15 and earlier versions, I would like to suggest 
adding a new configuration property to nutch-default.xml. By default, the 
property would keep using the current method "public ParseResult parse(Content 
content)". Users who want to use "public ParseResult parse(Content content, 
CrawlDatum datum)" can change the property in nutch-site.xml.
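
The switch itself could be as small as one boolean lookup. In the sketch below, java.util.Properties stands in for the Nutch configuration, and the property name "parser.pass.crawldatum" is purely hypothetical; no such property exists in nutch-default.xml today.

```java
import java.util.Properties;

// Stand-in sketch: Properties models the configuration loaded from
// nutch-default.xml and nutch-site.xml.
class ParseDispatchSketch {

  // Hypothetical property name; it defaults to false so that existing
  // setups keep calling parse(Content).
  static final String PASS_DATUM_KEY = "parser.pass.crawldatum";

  static boolean passDatum(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty(PASS_DATUM_KEY, "false"));
  }

  // Names the method a caller such as ParseSegment would invoke.
  static String dispatch(Properties conf) {
    return passDatum(conf) ? "parse(content, datum)" : "parse(content)";
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    System.out.println(dispatch(conf));
    conf.setProperty(PASS_DATUM_KEY, "true");
    System.out.println(dispatch(conf));
  }
}
```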



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
