[
https://issues.apache.org/jira/browse/NUTCH-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-2675.
------------------------------------
Resolution: Won't Do
Fix Version/s: (was: 1.16)
Resolving as "won't do". [~aquaticwater], please reopen in case the solution
using a scoring filter is not appropriate. Thanks!
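For reference, the scoring-filter route could look roughly like the sketch
below. Nutch 1.x scoring filters have a
passScoreBeforeParsing(Text url, CrawlDatum datum, Content content) hook that
runs before parsing and sees both objects, so it can copy CrawlDatum metadata
into the Content metadata that parsers already receive. The sketch is
self-contained and does not use the Nutch classes: plain maps stand in for the
two metadata containers, and the class name, the "datum." prefix, and the
"depth" key are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

public class DatumToContentSketch {

  // Hypothetical prefix marking keys that were copied from the CrawlDatum.
  static final String PREFIX = "datum.";

  /**
   * Models what a scoring filter could do in passScoreBeforeParsing:
   * copy the selected CrawlDatum metadata entries into the Content metadata.
   * Keys that are absent from the datum are skipped.
   */
  static void copyBeforeParsing(Map<String, String> datumMeta,
      Map<String, String> contentMeta, String... keys) {
    for (String key : keys) {
      String value = datumMeta.get(key);
      if (value != null) {
        contentMeta.put(PREFIX + key, value);
      }
    }
  }

  /** Models a parser that only ever sees the Content metadata. */
  static String parse(Map<String, String> contentMeta) {
    String depth = contentMeta.get(PREFIX + "depth");
    return depth == null ? "default parsing" : "parsing with depth=" + depth;
  }

  public static void main(String[] args) {
    Map<String, String> datumMeta = new HashMap<>();
    datumMeta.put("depth", "2"); // assumed key, not a Nutch constant

    Map<String, String> contentMeta = new HashMap<>();
    contentMeta.put("Content-Type", "text/html");

    copyBeforeParsing(datumMeta, contentMeta, "depth");
    System.out.println(parse(contentMeta));
  }
}
```

In a real plugin the copy step would live in the scoring filter and the read
step in the parser, so neither ParseSegment nor FetcherThread has to change.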
> Give parsers the capability to read and write CrawlDatum
> --------------------------------------------------------
>
> Key: NUTCH-2675
> URL: https://issues.apache.org/jira/browse/NUTCH-2675
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.15
> Reporter: Junqiang Zhang
> Priority: Minor
>
> Parsers are called inside org.apache.nutch.parse.ParseSegment,
> (Line 127 for version 1.15) parseResult = parseUtil.parse(content);
> and inside org.apache.nutch.fetcher.FetcherThread,
> (Line 640 for version 1.15) parseResult = this.parseUtil.parse(content);
> The current version of Nutch does not give parsers the capability to access
> CrawlDatum. If users want to customize the parsing process using some
> metadata of CrawlDatum, it is difficult to read the required metadata.
> On the other hand, if users want to save metadata generated during parsing,
> the metadata can only be saved as parseMeta of
> org.apache.nutch.parse.ParseData, and only the parseMeta entries selected by
> db.parsemeta.to.crawldb in nutch-site.xml are added to CrawlDatum, inside
> org.apache.nutch.parse.ParseOutputFormat and
> org.apache.nutch.crawl.CrawlDbReducer. If parsers had direct access to
> CrawlDatum, the metadata generated during parsing could be added to
> CrawlDatum directly by the parsers.
> I use Nutch to fetch and parse web pages. To read the required metadata from
> CrawlDatum during parsing, I use the following workaround.
> (1) During web page fetching, inside
> org.apache.nutch.protocol.http.api.HttpBase of lib-http plugin, read the
> required metadata from CrawlDatum, and save the required metadata together
> with the Headers metadata of org.apache.nutch.net.protocols.Response to the
> metadata of org.apache.nutch.protocol.Content. This can be done at line 334
> of the code by replacing "response.getHeaders()" with a new metadata object
> containing both the required metadata from CrawlDatum and the Headers metadata.
> The code that needs to be modified inside
> org.apache.nutch.protocol.http.api.HttpBase of the lib-http plugin is:
> (Line 332 for version 1.15) Content c = new Content(u.toString(), u.toString(),
> (Line 333 for version 1.15) (content == null ? EMPTY_CONTENT : content),
> (Line 334 for version 1.15) response.getHeader("Content-Type"), response.getHeaders(), mimeTypes);
> (2) During HTML page parsing, inside org.apache.nutch.parse.html.HtmlParser
> of the parse-html plugin, read the required metadata from the metadata of
> org.apache.nutch.protocol.Content, and customize the parsing process using
> the required metadata.
> If parsers had direct access to CrawlDatum, the above workaround would not
> be needed. To give parsers the capability to directly read and write
> CrawlDatum, I would like to suggest adding a new method "public ParseResult
> parse(Content content, CrawlDatum datum)" to
> org.apache.nutch.parse.ParseUtil in future versions of Nutch.
> To stay compatible with the current 1.15 and previous versions, I would like
> to suggest adding a new configuration property to nutch-default.xml. By
> default, the property would select the current method "public ParseResult
> parse(Content content)". Users who want to use "public ParseResult
> parse(Content content, CrawlDatum datum)" can change the property in
> nutch-site.xml.
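
The proposal above (a second parse method guarded by a configuration property)
can be sketched as follows. This is only an illustration under stated
assumptions, not code from Nutch: Content, CrawlDatum, ParseResult, and
ParseUtil here are minimal stand-ins for the real classes, a plain map stands
in for the Hadoop Configuration, and the property name
"parser.pass.crawldatum" as well as the "lang.hint" and "parsed.length"
metadata keys are hypothetical, since the issue does not name any.

```java
import java.util.HashMap;
import java.util.Map;

public class ParseWithDatumSketch {

  // Hypothetical property name; the issue does not name one.
  static final String USE_DATUM_KEY = "parser.pass.crawldatum";

  /** Minimal stand-in for org.apache.nutch.protocol.Content. */
  static class Content {
    final String url;
    final String text;
    Content(String url, String text) { this.url = url; this.text = text; }
  }

  /** Minimal stand-in for org.apache.nutch.crawl.CrawlDatum (metadata only). */
  static class CrawlDatum {
    final Map<String, String> metaData = new HashMap<>();
  }

  /** Minimal stand-in for org.apache.nutch.parse.ParseResult. */
  static class ParseResult {
    final String text;
    final Map<String, String> parseMeta = new HashMap<>();
    ParseResult(String text) { this.text = text; }
  }

  static class ParseUtil {
    private final boolean passDatum;

    ParseUtil(Map<String, String> conf) {
      // Defaulting to "false" keeps the existing behaviour.
      passDatum = Boolean.parseBoolean(conf.getOrDefault(USE_DATUM_KEY, "false"));
    }

    /** Existing entry point: the parser sees only the content. */
    ParseResult parse(Content content) {
      return new ParseResult(content.text.trim());
    }

    /** Proposed entry point: the parser may also read and write the datum. */
    ParseResult parse(Content content, CrawlDatum datum) {
      ParseResult result = parse(content);
      if (!passDatum || datum == null) {
        return result; // property off: same outcome as parse(content)
      }
      // Read: use datum metadata to customize the parse result.
      String hint = datum.metaData.get("lang.hint");
      if (hint != null) {
        result.parseMeta.put("lang", hint);
      }
      // Write: store a value produced during parsing directly in the datum.
      datum.metaData.put("parsed.length", String.valueOf(result.text.length()));
      return result;
    }
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put(USE_DATUM_KEY, "true"); // what a user would set in nutch-site.xml

    CrawlDatum datum = new CrawlDatum();
    datum.metaData.put("lang.hint", "de");

    ParseResult r = new ParseUtil(conf)
        .parse(new Content("http://example.org/", "  hello "), datum);
    System.out.println(r.parseMeta + " " + datum.metaData);
  }
}
```

With the property left at its default, the two-argument method behaves like
the one-argument method and leaves the datum untouched, which is the
backward-compatible behaviour the proposal asks for.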
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)