[
https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486003
]
Andrzej Bialecki commented on NUTCH-466:
-----------------------------------------
> I thought that the map will be from class names to directory names.
Well, then you would have to pass the whole class name in an RPC call - I think
we should come up with a way that uses at most one byte to select the right
part.
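
One way the one-byte selector could work, sketched under assumptions: a small
registry assigns each part name an id in registration order, and only that id
travels in the RPC call. PartRegistry, register and nameOf are hypothetical
names, not existing Nutch classes, and the sketch assumes client and server
register the same names in the same order (for example from shared
configuration).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical registry (not a Nutch API): maps segment part names to one-byte ids. */
final class PartRegistry {
    private final Map<String, Byte> idsByName = new HashMap<>();
    private final List<String> namesById = new ArrayList<>();

    /** Registers a part name and returns its id; a known name keeps its existing id. */
    public synchronized byte register(String name) {
        Byte existing = idsByName.get(name);
        if (existing != null) {
            return existing;
        }
        if (namesById.size() >= 256) {
            throw new IllegalStateException("a one-byte id allows at most 256 parts");
        }
        // Ids are assigned in registration order; values above 127 wrap to negative bytes.
        byte id = (byte) namesById.size();
        idsByName.put(name, id);
        namesById.add(name);
        return id;
    }

    /** Resolves an id received over the wire back to the part name. */
    public synchronized String nameOf(byte id) {
        int index = id & 0xFF; // treat the byte as unsigned
        if (index >= namesById.size()) {
            throw new IllegalArgumentException("unknown part id: " + index);
        }
        return namesById.get(index);
    }
}
```

The caller would send the byte returned by register("thumbnails") and the
server would turn it back into a directory name with nameOf, so the class name
never has to be serialized per request.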
> Do you think that we should also move HitDetailer, HitSummarizer, HitContent
> and Searcher to this plugin system?
Yes, that was my plan - the same way we did it with indexing plugins - although
I intend to create a separate issue regarding the use of separate index / page
/ summary servers, to avoid complicating this patch too much.
> Flexible segment format
> -----------------------
>
> Key: NUTCH-466
> URL: https://issues.apache.org/jira/browse/NUTCH-466
> Project: Nutch
> Issue Type: Improvement
> Components: searcher
> Affects Versions: 1.0.0
> Reporter: Andrzej Bialecki
> Assigned To: Andrzej Bialecki
>
> In many situations it is necessary to store more data associated with pages
> than is possible with the current segment format. Quite often this is binary
> data. There are two common workarounds for this: one is to use per-page
> metadata, either in Content or ParseData; the other is to use an external
> independent database using page IDs as foreign keys.
> Currently segments can consist of the following predefined parts: content,
> crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I
> propose a third option, which is a natural extension of this existing segment
> format, i.e. to introduce the ability to add arbitrarily named segment
> "parts", with the only requirement that they should be MapFile-s that store
> Writable keys and values. Alternatively, we could define a
> SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
> Existing segment API and searcher API (NutchBean, DistributedSearch
> Client/Server) should be extended to handle such arbitrary parts.
> Example applications:
> * storing HTML previews of non-HTML pages, such as PDF, PS and Office
> documents
> * storing a pre-tokenized version of plain text for faster snippet generation
> * storing linguistically tagged text for sophisticated data mining
> * storing image thumbnails
> etc, etc ...
> I'm going to prepare a patchset shortly. Any comments and suggestions are
> welcome.
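
The SegmentPart.Writer/Reader alternative mentioned in the description could
look roughly like the sketch below. Only the names SegmentPart, Writer and
Reader come from the issue text; the method signatures, the generic
parameters and the InMemoryPart class are assumptions made for illustration.
InMemoryPart stands in for a MapFile-backed part and, as a simplification,
requires keys to be appended in strictly ascending order.

```java
import java.io.IOException;
import java.util.TreeMap;

/** Hypothetical shape for the SegmentPart idea; not an actual Nutch API. */
final class SegmentPart {
    private SegmentPart() {}

    /** Writes key/value records into one named part of a segment. */
    interface Writer<K, V> extends AutoCloseable {
        void append(K key, V value) throws IOException;
        @Override
        void close() throws IOException;
    }

    /** Looks up the record stored for a key in one named part of a segment. */
    interface Reader<K, V> extends AutoCloseable {
        V get(K key) throws IOException;
        @Override
        void close() throws IOException;
    }

    /** In-memory stand-in for a file-backed part, used here only to show the contract. */
    static final class InMemoryPart<K extends Comparable<K>, V>
            implements Writer<K, V>, Reader<K, V> {
        private final TreeMap<K, V> entries = new TreeMap<>();

        @Override
        public void append(K key, V value) {
            // Assumed constraint: sorted, append-only writes, as with a sorted file.
            if (!entries.isEmpty() && entries.lastKey().compareTo(key) >= 0) {
                throw new IllegalArgumentException("keys must be appended in ascending order");
            }
            entries.put(key, value);
        }

        @Override
        public V get(K key) {
            return entries.get(key); // null when the part has no record for this key
        }

        @Override
        public void close() {
            // Nothing to release for the in-memory variant.
        }
    }
}
```

A thumbnail part, for instance, would append (page URL, image bytes) pairs at
parse time, and the searcher side would call get with the page URL when
rendering a hit.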
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers