[ 
https://issues.apache.org/jira/browse/NUTCH-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaidul Islam updated NUTCH-2389:
--------------------------------
    Description: 
As far as I know, Nutch 1.x and 2.x currently have no feature for extracting 
exact content from specific websites. For my current project I've developed a 
plugin, {{parse-jsoup}}, that uses Jsoup to extract precise content during 
site-specific crawling, driven by a detailed XML configuration (field name, CSS 
selector, attribute, extraction rules, data type, default value, etc.).

Please let me know whether this feature seems relevant and is not already 
present in Nutch. I also plan to port it to Nutch 1.x.

  was:
As far as I know, Nutch 1.x and 2.x currently have no feature for extracting 
exact content from specific websites. For my current project I've developed a 
plugin, {{parse-jsoup}}, that uses Jsoup to extract precise content during 
site-specific crawling, driven by a detailed XML configuration.

Please let me know whether this feature seems relevant and is not already 
present in Nutch. I also plan to port it to Nutch 1.x.


> Precise data parsing using Jsoup CSS selectors
> ----------------------------------------------
>
>                 Key: NUTCH-2389
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2389
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 2.3
>            Reporter: Kaidul Islam
>            Assignee: Kaidul Islam
>             Fix For: 2.4
>
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> As far as I know, Nutch 1.x and 2.x currently have no feature for extracting 
> exact content from specific websites. For my current project I've developed a 
> plugin, {{parse-jsoup}}, that uses Jsoup to extract precise content during 
> site-specific crawling, driven by a detailed XML configuration (field name, 
> CSS selector, attribute, extraction rules, data type, default value, etc.).
> Please let me know whether this feature seems relevant and is not already 
> present in Nutch. I also plan to port it to Nutch 1.x.
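
For readers unfamiliar with this kind of setup, the per-field configuration 
described above might look roughly like the sketch below. This is only an 
illustration under assumptions: the element and attribute names (fields, field, 
css-selector, data-type, default-value), the class JsoupFieldRules and the 
sample selectors are all hypothetical and are not taken from the parse-jsoup 
plugin. The sketch covers only reading the rules and applying the data type and 
default value; evaluating the CSS selector against a page would be left to 
Jsoup and is not shown.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class JsoupFieldRules {

    /** One extraction rule. All attribute names are hypothetical. */
    public static final class FieldRule {
        public final String name, selector, attribute, dataType, defaultValue;

        FieldRule(Element e) {
            // getAttribute returns "" when the attribute is absent.
            name = e.getAttribute("name");
            selector = e.getAttribute("css-selector");
            attribute = e.getAttribute("attribute");
            dataType = e.getAttribute("data-type");
            defaultValue = e.getAttribute("default-value");
        }
    }

    /** Hypothetical sample configuration for a single product site. */
    public static final String SAMPLE_CONFIG =
        "<fields>"
      + "<field name=\"title\" css-selector=\"h1.product-title\""
      + " data-type=\"string\" default-value=\"\"/>"
      + "<field name=\"price\" css-selector=\"span.price\""
      + " attribute=\"data-amount\" data-type=\"double\" default-value=\"0.0\"/>"
      + "</fields>";

    /** Reads every field element of the configuration into a FieldRule. */
    public static List<FieldRule> parse(String xml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
            .getDocumentElement();
        NodeList nodes = root.getElementsByTagName("field");
        List<FieldRule> rules = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            rules.add(new FieldRule((Element) nodes.item(i)));
        }
        return rules;
    }

    /**
     * Applies the rule's default value and data type to a raw extracted string.
     * A malformed number is not handled here and raises NumberFormatException.
     */
    public static Object convert(FieldRule rule, String raw) {
        String value = (raw == null || raw.trim().isEmpty())
            ? rule.defaultValue : raw.trim();
        switch (rule.dataType) {
            case "integer": return Integer.valueOf(value);
            case "double":  return Double.valueOf(value);
            case "boolean": return Boolean.valueOf(value);
            default:        return value; // treat anything else as a string
        }
    }

    public static void main(String[] args) throws Exception {
        for (FieldRule rule : parse(SAMPLE_CONFIG)) {
            System.out.println(rule.name + " <- " + rule.selector
                + " (" + rule.dataType + ")");
        }
    }
}
```

In a real parse plugin, the string passed to convert would be the text or 
attribute value of the first element matching the rule's selector, with the 
default value used when nothing matches.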



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
