[ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485986 ]
Andrzej Bialecki commented on NUTCH-466:
-----------------------------------------

Minor nit: MapFile requires that the key is a WritableComparable.

I'm not sure I understand the last part of your comment. There may be many parts that use the same key/value classes in MapFiles. I think the API should select the part by name (String) or some other ID, with a map of byte IDs to directory names (this is to avoid excessive overhead during RPC).

Regarding the implementing classes - I think we should use the plugin model, with a registry of segment parts that are active for the current configuration.

> Flexible segment format
> -----------------------
>
>                 Key: NUTCH-466
>                 URL: https://issues.apache.org/jira/browse/NUTCH-466
>             Project: Nutch
>          Issue Type: Improvement
>          Components: searcher
>    Affects Versions: 1.0.0
>            Reporter: Andrzej Bialecki
>         Assigned To: Andrzej Bialecki
>
> In many situations it is necessary to store more data associated with pages
> than is possible now with the current segment format. Quite often this is
> binary data. There are two common workarounds for this: one is to use
> per-page metadata, either in Content or ParseData, the other is to use an
> external independent database using page IDs as foreign keys.
> Currently segments can consist of the following predefined parts: content,
> crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I
> propose a third option, which is a natural extension of this existing segment
> format, i.e. to introduce the ability to add arbitrarily named segment
> "parts", with the only requirement that they should be MapFiles that store
> Writable keys and values. Alternatively, we could define a
> SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
> Existing segment API and searcher API (NutchBean, DistributedSearch
> Client/Server) should be extended to handle such arbitrary parts.
> Example applications:
> * storing HTML previews of non-HTML pages, such as PDF, PS and Office
> documents
> * storing pre-tokenized version of plain text for faster snippet generation
> * storing linguistically tagged text for sophisticated data mining
> * storing image thumbnails
> etc, etc ...
> I'm going to prepare a patchset shortly. Any comments and suggestions are
> welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
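
The comment's suggestion of selecting a segment part by name, with a map of byte IDs to directory names to keep RPC overhead low, could be sketched roughly as follows. This is only an illustration under stated assumptions: `SegmentPartRegistry` and its methods are hypothetical names, not part of Nutch or of any patch attached to NUTCH-466, and the "html_preview" part name is made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical registry (not an actual Nutch class) that maps compact
 * byte IDs to segment part directory names, so that an RPC request can
 * carry one byte instead of the full part name.
 */
final class SegmentPartRegistry {
    private final Map<Byte, String> idToName = new HashMap<>();
    private final Map<String, Byte> nameToId = new HashMap<>();

    /** Registers a part name and returns its ID; a repeated name keeps its existing ID. */
    byte register(String partName) {
        Byte existing = nameToId.get(partName);
        if (existing != null) {
            return existing;
        }
        if (idToName.size() >= 256) {
            throw new IllegalStateException("no free byte IDs left");
        }
        // IDs are assigned in registration order; values above 127 wrap to
        // negative bytes but remain unique.
        byte id = (byte) idToName.size();
        idToName.put(id, partName);
        nameToId.put(partName, id);
        return id;
    }

    /** Resolves a byte ID back to the part's directory name. */
    String nameOf(byte id) {
        String name = idToName.get(id);
        if (name == null) {
            throw new IllegalArgumentException("unknown part ID: " + id);
        }
        return name;
    }

    /** Looks up the byte ID for a registered part name. */
    byte idOf(String partName) {
        Byte id = nameToId.get(partName);
        if (id == null) {
            throw new IllegalArgumentException("unknown part: " + partName);
        }
        return id;
    }
}
```

In such a design the client and server would need to agree on the same registration order (for example by deriving it from the active configuration), since the IDs are only meaningful relative to a shared registry.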