[ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Work on NUTCH-466 stopped by Andrzej Bialecki.

> Flexible segment format
> -----------------------
>
>                 Key: NUTCH-466
>                 URL: https://issues.apache.org/jira/browse/NUTCH-466
>             Project: Nutch
>          Issue Type: Improvement
>          Components: searcher
>    Affects Versions: 1.0.0
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>         Attachments: ParseFilters.java, segmentparts.patch
>
>
> In many situations it is necessary to store more data associated with pages
> than is possible with the current segment format. Quite often this is
> binary data. There are two common workarounds for this: one is to use
> per-page metadata, either in Content or ParseData; the other is to use an
> external independent database using page IDs as foreign keys.
>
> Currently segments can consist of the following predefined parts: content,
> crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I
> propose a third option, which is a natural extension of this existing segment
> format, i.e. to introduce the ability to add arbitrarily named segment
> "parts", with the only requirement that they should be MapFiles that store
> Writable keys and values. Alternatively, we could define a
> SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
>
> Existing segment API and searcher API (NutchBean, DistributedSearch
> Client/Server) should be extended to handle such arbitrary parts.
>
> Example applications:
> * storing HTML previews of non-HTML pages, such as PDF, PS and Office
>   documents
> * storing pre-tokenized versions of plain text for faster snippet generation
> * storing linguistically tagged text for sophisticated data mining
> * storing image thumbnails
> etc.
>
> I'm going to prepare a patchset shortly. Any comments and suggestions are
> welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
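
To make the proposed layout concrete, here is a minimal sketch that models a segment as a directory whose subdirectories are its parts, and separates the six predefined parts named in the description from any arbitrarily named ones. It is an illustration only, not Nutch code and not the contents of segmentparts.patch: the function names `list_segment_parts` and `demo`, the part names `thumbnails` and `html_preview`, and the assumption that each part is one subdirectory of the segment are all hypothetical. It also does not check that a part is actually a MapFile with Writable keys and values, which is the one requirement the proposal states.

```python
import os
import tempfile

# The six predefined part names listed in the issue description.
PREDEFINED_PARTS = frozenset({
    "content", "crawl_fetch", "crawl_generate",
    "crawl_parse", "parse_text", "parse_data",
})


def list_segment_parts(segment_dir):
    """Return (predefined, custom) part names found under segment_dir.

    Assumption: every immediate subdirectory of the segment is one part.
    Plain files are ignored. Both lists are sorted by name.
    """
    predefined, custom = [], []
    for name in sorted(os.listdir(segment_dir)):
        if not os.path.isdir(os.path.join(segment_dir, name)):
            continue
        if name in PREDEFINED_PARTS:
            predefined.append(name)
        else:
            custom.append(name)
    return predefined, custom


def demo():
    """Build a throwaway segment with two custom parts and classify them."""
    with tempfile.TemporaryDirectory() as segment_dir:
        # "thumbnails" and "html_preview" are hypothetical part names taken
        # from the example applications in the description.
        for part in ("content", "parse_text", "thumbnails", "html_preview"):
            os.mkdir(os.path.join(segment_dir, part))
        return list_segment_parts(segment_dir)


if __name__ == "__main__":
    predefined, custom = demo()
    print("predefined:", predefined)
    print("custom:", custom)
```

Under this model, existing tools that only know the predefined names would keep working unchanged, while a reader that understands a custom part could look it up by name.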