[jira] [Commented] (NUTCH-2389) Precise data parsing using Jsoup CSS selectors

ASF GitHub Bot (JIRA) Wed, 07 Jun 2017 04:33:32 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040736#comment-16040736
 ]


ASF GitHub Bot commented on NUTCH-2389:
---------------------------------------

kaidul opened a new pull request #192: NUTCH-2389 Precise data extractor 
implemented for 2.x
URL: https://github.com/apache/nutch/pull/192
 
 
   A per-page precise data extractor based on the jsoup CSS-selector API and 
configurable through an XML file. This pull request implements a parse filter 
plugin and a complementary indexing filter plugin, plus support for defining 
custom normalizers on specific extracted fields. I've tested this module 
successfully on a large project of my own, and unit tests are included as well.
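
   For readers unfamiliar with jsoup, the following is only a minimal sketch 
of the general idea (selector-driven field extraction), not the code from this 
pull request. The class `SelectorExtractionSketch`, the `FieldRule` type and 
the sample field names are hypothetical; in the plugin the rules come from an 
XML file, whose format is not reproduced here. Only `Jsoup.parse`, 
`Document.select`, `Elements.first`, `Element.text` and `Element.attr` are 
real jsoup calls.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorExtractionSketch {

  /** Hypothetical rule: field name, CSS selector, optional attribute, default value. */
  static final class FieldRule {
    final String name;
    final String cssSelector;
    final String attribute;     // null means "use the element's text"
    final String defaultValue;  // used when nothing matches or the value is empty

    FieldRule(String name, String cssSelector, String attribute, String defaultValue) {
      this.name = name;
      this.cssSelector = cssSelector;
      this.attribute = attribute;
      this.defaultValue = defaultValue;
    }
  }

  /** Applies each rule to the parsed page and returns field name -> extracted value. */
  static Map<String, String> extract(String html, List<FieldRule> rules) {
    Document doc = Jsoup.parse(html);
    Map<String, String> fields = new LinkedHashMap<>();
    for (FieldRule rule : rules) {
      // first() returns null when the selector matches no element
      Element match = doc.select(rule.cssSelector).first();
      String value = rule.defaultValue;
      if (match != null) {
        String raw = (rule.attribute == null) ? match.text() : match.attr(rule.attribute);
        // Trimming stands in for a "normalizer" step; real normalizers may do more.
        if (!raw.trim().isEmpty()) {
          value = raw.trim();
        }
      }
      fields.put(rule.name, value);
    }
    return fields;
  }

  public static void main(String[] args) {
    String html = "<html><body><h1 class=\"title\"> Sample product </h1>"
        + "<a id=\"buy\" href=\"/cart?id=42\">Buy</a></body></html>";
    List<FieldRule> rules = Arrays.asList(
        new FieldRule("title", "h1.title", null, ""),
        new FieldRule("buyLink", "a#buy", "href", ""),
        new FieldRule("price", "span.price", null, "unknown"));
    System.out.println(extract(html, rules));
  }
}
```

   In this sketch a missing element falls back to the rule's default value 
instead of failing, which mirrors the "default-value" option mentioned in the 
issue description.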
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Precise data parsing using Jsoup CSS selectors
> ----------------------------------------------
>
>                 Key: NUTCH-2389
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2389
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 2.3
>            Reporter: Kaidul Islam
>            Assignee: Kaidul Islam
>             Fix For: 2.4
>
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> As far as I know, neither Nutch 1.x nor 2.x currently has a feature to 
> extract/parse exact content from specific websites. I've developed a plugin, 
> {{parse-jsoup}}, using Jsoup for my current project. It extracts precise 
> content for site-specific crawling based on a detailed XML configuration 
> (field name, CSS selector, attribute, extraction rules, data type, default 
> value, etc.).
> Please let me know if this feature seems relevant and is not already present 
> in Nutch. I also plan to port it to Nutch 1.x.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
