Flexible segment format
-----------------------

                 Key: NUTCH-466
                 URL: https://issues.apache.org/jira/browse/NUTCH-466
             Project: Nutch
          Issue Type: Improvement
          Components: searcher
    Affects Versions: 1.0.0
            Reporter: Andrzej Bialecki 
         Assigned To: Andrzej Bialecki 


In many situations it is necessary to store more data associated with pages 
than the current segment format allows. Quite often this is binary data. There 
are two common workarounds: one is to use per-page metadata, either in Content 
or ParseData; the other is to use an external, independent database with page 
IDs as foreign keys.

Currently segments can consist of the following predefined parts: content, 
crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I propose 
a third option, which is a natural extension of this existing segment format: 
the ability to add arbitrarily named segment "parts", with the only 
requirement being that they are MapFiles that store Writable keys and values. 
Alternatively, we could define a SegmentPart.Writer/Reader to accommodate even 
more sophisticated scenarios.
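
To make the SegmentPart.Writer/Reader alternative more concrete, here is a 
minimal sketch of what such an abstraction might look like. Everything in it 
is an assumption for illustration: the interface shape, the method names and 
the InMemoryPart class are hypothetical and are not part of Nutch or of any 
patch. An in-memory sorted map stands in for a MapFile, and String keys and 
byte[] values stand in for Writable keys and values so that the example runs 
without Hadoop on the classpath. Like a MapFile writer, the stand-in expects 
keys to be appended in ascending order.

```java
import java.util.TreeMap;

/** Hypothetical sketch only; none of these types exist in Nutch. */
public class SegmentPartSketch {

    /** A named segment part, e.g. "thumbnails" (assumed interface). */
    interface SegmentPart {
        String name();
        Writer writer();
        Reader reader();

        interface Writer {
            void append(String key, byte[] value);
        }

        interface Reader {
            /** Returns the stored value, or null if the key is absent. */
            byte[] get(String key);
        }
    }

    /** In-memory stand-in for a MapFile: sorted keys, binary values. */
    static final class InMemoryPart implements SegmentPart {
        private final String name;
        private final TreeMap<String, byte[]> entries = new TreeMap<>();

        InMemoryPart(String name) {
            this.name = name;
        }

        @Override
        public String name() {
            return name;
        }

        @Override
        public SegmentPart.Writer writer() {
            return (key, value) -> {
                // Reject keys that are not strictly greater than the last one.
                if (!entries.isEmpty() && entries.lastKey().compareTo(key) >= 0) {
                    throw new IllegalArgumentException("key out of order: " + key);
                }
                entries.put(key, value.clone());
            };
        }

        @Override
        public SegmentPart.Reader reader() {
            return key -> {
                byte[] value = entries.get(key);
                return value == null ? null : value.clone();
            };
        }
    }

    public static void main(String[] args) {
        SegmentPart thumbs = new InMemoryPart("thumbnails");
        SegmentPart.Writer writer = thumbs.writer();
        writer.append("http://example.com/a.pdf", new byte[] {1, 2, 3});
        writer.append("http://example.com/b.pdf", new byte[] {4, 5});

        byte[] hit = thumbs.reader().get("http://example.com/a.pdf");
        System.out.println(thumbs.name() + ": " + hit.length + " bytes");
        // prints "thumbnails: 3 bytes"
    }
}
```

A real implementation would presumably write to the segment directory and 
use Writable types; the point here is only that a part is identified by name 
and exposes keyed writes and lookups.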

The existing segment API and searcher API (NutchBean, DistributedSearch 
Client/Server) should be extended to handle such arbitrary parts.
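
As a rough illustration of what "handling arbitrary parts" could involve, the 
sketch below lists the subdirectories of a segment directory and reports those 
that are not among the six predefined parts. The class and method names 
(SegmentPartLister, customParts) are hypothetical, and the example assumes a 
segment on the local filesystem, read with java.nio rather than through the 
Hadoop FileSystem API that real segments would use.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch only; not part of the Nutch API. */
public class SegmentPartLister {

    /** The predefined segment parts named in this issue. */
    static final Set<String> PREDEFINED = Set.of(
            "content", "crawl_fetch", "crawl_generate",
            "crawl_parse", "parse_text", "parse_data");

    /** Returns the names of subdirectories that are not predefined parts. */
    static SortedSet<String> customParts(Path segmentDir) throws IOException {
        try (Stream<Path> children = Files.list(segmentDir)) {
            return children
                    .filter(Files::isDirectory)
                    .map(p -> p.getFileName().toString())
                    .filter(name -> !PREDEFINED.contains(name))
                    .collect(Collectors.toCollection(TreeSet::new));
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway segment layout with two extra parts.
        Path segment = Files.createTempDirectory("segment");
        for (String part : new String[] {
                "content", "parse_text", "thumbnails", "previews"}) {
            Files.createDirectory(segment.resolve(part));
        }
        System.out.println(customParts(segment));
        // prints "[previews, thumbnails]"
    }
}
```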

Example applications:

* storing HTML previews of non-HTML pages, such as PDF, PS and Office documents
* storing a pre-tokenized version of plain text for faster snippet generation
* storing linguistically tagged text for sophisticated data mining
* storing image thumbnails

etc.

I'm going to prepare a patchset shortly. Any comments and suggestions are 
welcome.


