[ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486003 ]
Andrzej Bialecki commented on NUTCH-466:
-----------------------------------------

> I thought that the map will be from class names to directory names.

Well, then you would have to pass the whole class name in an RPC call. I think we should come up with a way that uses at most one byte to select the right part.

> Do you think that we should also move HitDetailer, HitSummarizer, HitContent
> and Searcher to this plugin system

Yes, that was my plan, the same way we did it with indexing plugins, although I intend to create a separate issue regarding the use of separate index / page / summary servers, to avoid complicating this patch too much.

> Flexible segment format
> -----------------------
>
>                 Key: NUTCH-466
>                 URL: https://issues.apache.org/jira/browse/NUTCH-466
>             Project: Nutch
>          Issue Type: Improvement
>          Components: searcher
>    Affects Versions: 1.0.0
>            Reporter: Andrzej Bialecki
>         Assigned To: Andrzej Bialecki
>
> In many situations it is necessary to store more data associated with pages
> than is possible now with the current segment format. Quite often this is
> binary data. There are two common workarounds for this: one is to use
> per-page metadata, either in Content or ParseData; the other is to use an
> external independent database using page IDs as foreign keys.
> Currently segments can consist of the following predefined parts: content,
> crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I
> propose a third option, which is a natural extension of this existing segment
> format, i.e. to introduce the ability to add arbitrarily named segment
> "parts", with the only requirement that they should be MapFiles that store
> Writable keys and values. Alternatively, we could define a
> SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
> The existing segment API and searcher API (NutchBean, DistributedSearch
> Client/Server) should be extended to handle such arbitrary parts.
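To illustrate the one-byte part selection discussed in the comment, together with the arbitrarily named parts from the description, here is a minimal sketch. It is not taken from the patch: the class SegmentPartRegistry and its methods are hypothetical names, and it assumes that each part is identified by its directory name and that both ends of an RPC call register parts in the same order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not part of Nutch: maps segment part directory names to one-byte ids. */
final class SegmentPartRegistry {
    // Index in this list is the unsigned value of the part id.
    private final List<String> namesById = new ArrayList<>();
    private final Map<String, Byte> idsByName = new HashMap<>();

    SegmentPartRegistry() {
        // The predefined parts named in the issue description get the first ids.
        String[] predefined = {
            "content", "crawl_fetch", "crawl_generate",
            "crawl_parse", "parse_text", "parse_data"
        };
        for (String name : predefined) {
            register(name);
        }
    }

    /** Registers a part directory name and returns its id; re-registering returns the same id. */
    byte register(String dirName) {
        Byte existing = idsByName.get(dirName);
        if (existing != null) {
            return existing;
        }
        if (namesById.size() >= 256) {
            throw new IllegalStateException("one byte can address at most 256 parts");
        }
        byte id = (byte) namesById.size();
        namesById.add(dirName);
        idsByName.put(dirName, id);
        return id;
    }

    /** Resolves an id received over the wire back to the part directory name. */
    String dirName(byte id) {
        int index = id & 0xFF; // treat the byte as unsigned
        if (index >= namesById.size()) {
            throw new IllegalArgumentException("unknown part id: " + index);
        }
        return namesById.get(index);
    }
}
```

With such a registry a client would send only the byte returned by register("thumbnails") instead of a full class name, and the server would call dirName(id) to find the directory to open. How ids would actually be agreed between client and server is left open here.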
> Example applications:
> * storing HTML previews of non-HTML pages, such as PDF, PS and Office
> documents
> * storing pre-tokenized version of plain text for faster snippet generation
> * storing linguistically tagged text for sophisticated data mining
> * storing image thumbnails
> etc, etc ...
> I'm going to prepare a patchset shortly. Any comments and suggestions are
> welcome.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers