[Nutch-dev] [jira] Commented: (NUTCH-466) Flexible segment format

Enis Soztutar (JIRA) Mon, 02 Apr 2007 05:23:12 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485996 ]


Enis Soztutar commented on NUTCH-466:
-------------------------------------

>> There may be many parts that use the same key/value classes in MapFiles.

Yes, you are right. I hadn't considered several parts having the same 
classes. 

>> I think the API should select the part by name (String) or some other ID, 
>> with a map of byte ID-s to directory names

I had assumed that the map would be from class names to directory names. 
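
To make the difference concrete, here is a minimal sketch of a registry keyed 
by part name. It is only an illustration, not code from Nutch or from the 
pending patchset: the class SegmentPartRegistry, its methods, and the example 
parts "html_preview" and "thumbnail" (with their key/value classes) are all 
assumptions. It shows that two parts can share the same key/value classes, so 
a map keyed by class name could not tell them apart, while a map keyed by name 
can.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a name-keyed segment part registry (not Nutch API). */
public class SegmentPartRegistry {

    /** Describes one segment part; every field here is illustrative. */
    public static final class PartInfo {
        final String name;       // lookup key, e.g. "thumbnail"
        final String directory;  // subdirectory inside the segment
        final String keyClass;   // class name of the Writable key
        final String valueClass; // class name of the Writable value

        PartInfo(String name, String directory, String keyClass, String valueClass) {
            this.name = name;
            this.directory = directory;
            this.keyClass = keyClass;
            this.valueClass = valueClass;
        }
    }

    // Part name -> part description, kept in registration order.
    private final Map<String, PartInfo> partsByName = new LinkedHashMap<>();

    /** Registers a part under a unique name; duplicate names are rejected. */
    public void register(String name, String directory, String keyClass, String valueClass) {
        if (partsByName.containsKey(name)) {
            throw new IllegalArgumentException("Duplicate segment part: " + name);
        }
        partsByName.put(name, new PartInfo(name, directory, keyClass, valueClass));
    }

    /** Resolves a part name to its directory name. */
    public String directoryFor(String name) {
        PartInfo info = partsByName.get(name);
        if (info == null) {
            throw new IllegalArgumentException("Unknown segment part: " + name);
        }
        return info.directory;
    }

    /** Lists the names of all parts that store the given key/value classes. */
    public List<String> partsWithClasses(String keyClass, String valueClass) {
        List<String> names = new ArrayList<>();
        for (PartInfo info : partsByName.values()) {
            if (info.keyClass.equals(keyClass) && info.valueClass.equals(valueClass)) {
                names.add(info.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        SegmentPartRegistry registry = new SegmentPartRegistry();
        // Assumed example parts: both store the same key/value classes.
        String key = "org.apache.hadoop.io.Text";
        String value = "org.apache.hadoop.io.BytesWritable";
        registry.register("html_preview", "html_preview", key, value);
        registry.register("thumbnail", "thumbs", key, value);

        // A class-name lookup is ambiguous; a name lookup is not.
        System.out.println(registry.partsWithClasses(key, value));
        System.out.println(registry.directoryFor("thumbnail"));
    }
}
```

The same structure would work with a byte ID as the key instead of a String; 
the point is only that the key must be something other than the class pair.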

>>I think we should use the plugin model, with a registry of segment parts that 
>>are active for the current configuration

Do you think that we should also move HitDetailer, HitSummarizer, HitContent 
and Searcher to this plugin system? And should we break up the multiple 
responsibilities in NutchBean and DistributedSearch$Client, and allow for 
separate index and segment servers? 

> Flexible segment format
> -----------------------
>
>                 Key: NUTCH-466
>                 URL: https://issues.apache.org/jira/browse/NUTCH-466
>             Project: Nutch
>          Issue Type: Improvement
>          Components: searcher
>    Affects Versions: 1.0.0
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>
> In many situations it is necessary to store more data associated with pages 
> than is possible now with the current segment format. Quite often this is 
> binary data. There are two common workarounds for this: one is to use 
> per-page metadata, either in Content or ParseData, the other is to use an 
> external independent database using page ID-s as foreign keys.
> Currently segments can consist of the following predefined parts: content, 
> crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I 
> propose a third option, which is a natural extension of this existing segment 
> format, i.e. to introduce the ability to add arbitrarily named segment 
> "parts", with the only requirement that they should be MapFile-s that store 
> Writable keys and values. Alternatively, we could define a 
> SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
> Existing segment API and searcher API (NutchBean, DistributedSearch 
> Client/Server) should be extended to handle such arbitrary parts.
> Example applications:
> * storing HTML previews of non-HTML pages, such as PDF, PS and Office 
> documents
> * storing pre-tokenized version of plain text for faster snippet generation
> * storing linguistically tagged text for sophisticated data mining
> * storing image thumbnails
> etc, etc ...
> I'm going to prepare a patchset shortly. Any comments and suggestions are 
> welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


