[Nutch-dev] [jira] Updated: (NUTCH-466) Flexible segment format

Andrzej Bialecki (JIRA) Thu, 31 May 2007 11:42:42 -0700

     [ 
https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Andrzej Bialecki  updated NUTCH-466:
------------------------------------

    Attachment: segmentparts.patch

This patch contains the following modifications and additions:

* a new extension point, ParseFilter, to post-process results of parsing just 
before they are written out.

* related chained filter facade, ParseFilters.

* Parse / ParseImpl changes to support passing of arbitrary data records 
"tagged" by Text keys.

* ParseOutputFormat changes to store such arbitrary data in MapFile-s named 
after the Text keys.

* NutchBean and DistributedSearch protocol changes to support retrieving 
records from named segment parts.

* PdfParser changes to support storing of HTML preview for PDF files.

Please review and comment.

> Flexible segment format
> -----------------------
>
>                 Key: NUTCH-466
>                 URL: https://issues.apache.org/jira/browse/NUTCH-466
>             Project: Nutch
>          Issue Type: Improvement
>          Components: searcher
>    Affects Versions: 1.0.0
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>         Attachments: segmentparts.patch
>
>
> In many situations it is necessary to store more data associated with pages 
> than it's possible now with the current segment format. Quite often it's a 
> binary data. There are two common workarounds for this: one is to use 
> per-page metadata, either in Content or ParseData, the other is to use an 
> external independent database using page ID-s as foreign keys.
> Currently segments can consist of the following predefined parts: content, 
> crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I 
> propose a third option, which is a natural extension of this existing segment 
> format, i.e. to introduce the ability to add arbitrarily named segment 
> "parts", with the only requirement that they should be MapFile-s that store 
> Writable keys and values. Alternatively, we could define a 
> SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
> Existing segment API and searcher API (NutchBean, DistributedSearch 
> Client/Server) should be extended to handle such arbitrary parts.
> Example applications:
> * storing HTML previews of non-HTML pages, such as PDF, PS and Office 
> documents
> * storing pre-tokenized version of plain text for faster snippet generation
> * storing linguistically tagged text for sophisticated data mining
> * storing image thumbnails
> etc, etc ...
> I'm going to prepare a patchset shortly. Any comments and suggestions are 
> welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers

[Nutch-dev] [jira] Updated: (NUTCH-466) Flexible segment format

Reply via email to