[
https://issues.apache.org/jira/browse/NUTCH-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174064#comment-14174064
]
Albinscode commented on NUTCH-1870:
-----------------------------------
Hello Julien,
1. Yes, I'll regenerate one patch soon. If Sebastian has nothing more to add,
I'll provide it at the end of the week.
2. You're totally right. I've removed some additional @author tags, but yeah, I
would say it's hard to give away my baby :)
3. It is a good point. I've created a specific indexer that indexes all the
metadata declared in the XSL file used for the transformation. It spares users
from specifying a second time, in the global Nutch conf, which metadata to
index, since this is already specified in the XSL file. It is really a matter
of philosophy. If you find it redundant, and think it is clearer to explicitly
list the metadata to extract in the global conf, we can remove it.
4. This is another good point ;) A very interesting approach. We could for
example specify the rule method attribute (with value "url" or "field"). I'll
add it to my TODO file!
Thanks a lot for all these remarks!
> Generic xsl parser plugin
> -------------------------
>
> Key: NUTCH-1870
> URL: https://issues.apache.org/jira/browse/NUTCH-1870
> Project: Nutch
> Issue Type: New Feature
> Components: indexer, parser
> Affects Versions: 1.9
> Reporter: Albinscode
> Fix For: 1.10
>
> Attachments: xsl-parse-plugin.patch
>
>
> The aim of this plugin is to use XSLT to extract metadata from HTML DOM
> structures.
> | Your Data | --> | Parse-html plugin or TIKA plugin | --> | DOM structure | --> | XSLT plugin |
>
>
> The main advantages are:
> - You won't have to write any Java code, only XSLT and configuration
> - It can process a DOM structure from a DocumentFragment (see NekoHTML and
> TagSoup)
> - It is HtmlParseFilter-compatible and can be plugged in like any other
> plugin (parse-js, parse-swf, etc.)
> This topic has been discussed on
> http://www.mail-archive.com/dev%40nutch.apache.org/msg15257.html
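
As a rough illustration of the pipeline described above (DOM structure --> XSLT), here is a minimal standalone sketch using only the JDK's javax.xml.transform API. It is not the code from the attached patch: the class name XsltMetadataSketch, the extract method, the stylesheet and the "title=" output format are all assumptions made for this example, and it parses well-formed XHTML directly instead of going through parse-html or Tika.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch, not part of Nutch or of the attached patch.
public class XsltMetadataSketch {

    // Assumed stylesheet: emits a single "title=<value>" line as plain text.
    static final String XSL =
        "<xsl:stylesheet version='1.0' "
      + "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "title=<xsl:value-of select='/html/head/title'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Parses the markup into a DOM, then applies the stylesheet to that DOM.
    static String extract(String xhtml, String xsl) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // recommended when feeding a DOM to XSLT
        Document dom = dbf.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));

        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xsl)));

        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String page =
            "<html><head><title>Hello Nutch</title></head><body/></html>";
        System.out.println(extract(page, XSL));
    }
}
```

In this sketch the extraction rule lives entirely in the stylesheet, so changing which metadata is extracted means editing the XSL only, which is the "no Java code" advantage listed in the description.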
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)