[
https://issues.apache.org/jira/browse/NUTCH-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990521#comment-13990521
]
Daniel Kugel commented on NUTCH-1622:
-------------------------------------
I don't have any strong feelings about where this code should live, so feel
free to move it around. :-)
To my understanding, the content should be parsed only in the parsing phase, so
any metadata that is extracted should be extracted at that stage.
Are you suggesting that the DbUpdate code should parse the content again?
Metadata extraction seems like a parser feature, because the parser is the only
component that should read ("parse") the content, and it seems reasonable to
have both metadata-aware and metadata-ignorant parsers.
When adding a metadata element, the parser is the only component that knows
what type of data it is currently parsing.
Perhaps we can add some form of hook methods or plugins for the parsers
themselves, to control what to do with each element they encounter: to decide
whether it is metadata and, if so, what to do with it. I agree it seems
complicated, but on the other hand, what else is eligible to parse content
other than the parser?
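To make the hook idea a bit more concrete, here is a rough sketch of what I
have in mind. All of the names below (MetaOutlink, OutlinkHook,
HookedLinkParser) are made up for illustration and are not existing Nutch
classes; the real thing would have to fit the existing parser and Outlink code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an outlink that carries metadata (not Nutch's Outlink).
final class MetaOutlink {
    final String toUrl;
    final String anchor;
    final Map<String, String> metadata = new LinkedHashMap<>();

    MetaOutlink(String toUrl, String anchor) {
        this.toUrl = toUrl;
        this.anchor = anchor;
    }
}

// Hypothetical hook: the parser calls it for every link element it encounters.
interface OutlinkHook {
    // Returns the metadata to attach to the outlink; an empty map means "nothing to add".
    Map<String, String> onLink(String elementName, Map<String, String> attributes);
}

// A metadata-ignorant parser is simply one with no hooks registered.
final class HookedLinkParser {
    private final List<OutlinkHook> hooks = new ArrayList<>();

    void register(OutlinkHook hook) {
        hooks.add(hook);
    }

    // Called while the parser walks the document, so the content is read only once.
    MetaOutlink handleLink(String elementName, Map<String, String> attributes, String anchor) {
        MetaOutlink link = new MetaOutlink(attributes.get("href"), anchor);
        for (OutlinkHook hook : hooks) {
            link.metadata.putAll(hook.onLink(elementName, attributes));
        }
        return link;
    }
}
```

With something like this, the decision about what counts as metadata stays
inside the parsing phase, and DbUpdate only has to carry along whatever ends up
on the outlink.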
> Create Outlinks with metadata
> -----------------------------
>
> Key: NUTCH-1622
> URL: https://issues.apache.org/jira/browse/NUTCH-1622
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.7, 2.2.1
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Fix For: 1.8, 2.4
>
> Attachments: NUTCH-1622-2.x.patch, NUTCH-1622.patch
>
>
> Having the possibility to specify metadata when creating an outlink is
> extremely useful, as it allows information to be passed from a source page to
> the pages it links to. We use that routinely within our custom parsers in
> combination with the url-meta plugin.