[ https://issues.apache.org/jira/browse/NUTCH-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026412#comment-16026412 ]
Kaidul Islam commented on NUTCH-2389:
-------------------------------------
Hi [~lewismc], I ended up writing a {{ParseFilter}} plugin, {{parse-jsoup}},
which extracts content from pages whose URLs match a configured pattern. It
uses the jsoup API (selection by CSS selector and by attribute) and is driven
by an XML configuration file. A sample XML configuration file:
{code:xml}
<page>
  <!-- Jsoup selection can be applied on those webpages which will match
       this url pattern -->
  <url-regex>^https?://(?:www\.)?youtu(?:\.be/|be\.com/watch\?v=)(?:[a-zA-Z0-9_-]{11}).*$</url-regex>
  <!-- Fields to parse -->
  <fields>
    <field name="title">
      <css-selector>#eow-title</css-selector>
      <default-value>foobar</default-value>
    </field>
    <field name="description">
      <css-selector>#watch-description-text p#eow-description</css-selector>
    </field>
    <field name="uploadTime">
      <css-selector>.watch-time-text</css-selector>
    </field>
    <field name="likeCount">
      <css-selector>.like-button-renderer-like-button.like-button-renderer-like-button-unclicked span.yt-uix-button-content</css-selector>
    </field>
    <field name="dislikeCount">
      <css-selector>.like-button-renderer-dislike-button.like-button-renderer-dislike-button-unclicked span.yt-uix-button-content</css-selector>
    </field>
    <field name="viewCount">
      <css-selector>.watch-view-count</css-selector>
    </field>
    <field name="subscriberCount">
      <css-selector>.yt-subscriber-count</css-selector>
    </field>
    <field name="publisherName">
      <css-selector>.yt-user-info a</css-selector>
    </field>
    <field name="publisherChannel">
      <css-selector>.yt-user-info a</css-selector>
      <attribute>abs:href</attribute>
    </field>
    <field name="publisherStatus">
      <css-selector>.yt-user-info span</css-selector>
      <attribute>aria-label</attribute>
    </field>
    <field name="category">
      <css-selector>.watch-extras-section :nth-child(1) a</css-selector>
    </field>
  </fields>
</page> <!-- End of page -->
{code}
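To make the configuration format concrete, here is a minimal sketch of how such
a file could be loaded and how the {{url-regex}} gate could be applied. It is
not the plugin's actual code: the class {{PageRuleSketch}} and its methods are
hypothetical, it uses only the JDK's DOM parser and {{java.util.regex}}, and it
reads only the {{url-regex}}, field {{name}} and {{css-selector}} elements.
Collapsing whitespace inside a selector is also an assumption, made so that a
selector wrapped across two lines reads as a single descendant selector.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of Nutch or of parse-jsoup.
class PageRuleSketch {
    final Pattern urlPattern;
    // Field name -> CSS selector, in the order the fields appear in the file.
    final Map<String, String> selectors = new LinkedHashMap<>();

    PageRuleSketch(String urlRegex) {
        this.urlPattern = Pattern.compile(urlRegex);
    }

    // Parses one <page> element given as an XML string.
    static PageRuleSketch parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element page = doc.getDocumentElement();
            PageRuleSketch rule = new PageRuleSketch(childText(page, "url-regex"));
            NodeList fields = page.getElementsByTagName("field");
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                // Assumption: runs of whitespace (including line breaks) in a
                // selector are equivalent to a single space.
                String selector = childText(field, "css-selector").replaceAll("\\s+", " ");
                rule.selectors.put(field.getAttribute("name"), selector);
            }
            return rule;
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid page configuration", e);
        }
    }

    // Trimmed text of the first descendant element with the given tag, or "".
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    // True when the whole URL matches the configured pattern.
    boolean appliesTo(String url) {
        return urlPattern.matcher(url).matches();
    }

    public static void main(String[] args) {
        String xml =
            "<page>"
            + "<url-regex>^https?://(?:www\\.)?youtu(?:\\.be/|be\\.com/watch\\?v=)"
            + "(?:[a-zA-Z0-9_-]{11}).*$</url-regex>"
            + "<fields>"
            + "<field name=\"title\"><css-selector>#eow-title</css-selector></field>"
            + "<field name=\"description\"><css-selector>#watch-description-text\n"
            + "p#eow-description</css-selector></field>"
            + "</fields>"
            + "</page>";
        PageRuleSketch rule = parse(xml);
        System.out.println(rule.appliesTo("https://www.youtube.com/watch?v=dQw4w9WgXcQ"));
        System.out.println(rule.appliesTo("https://example.com/watch?v=dQw4w9WgXcQ"));
        System.out.println(rule.selectors);
    }
}
```

In a real filter, each selector would then be handed to jsoup for pages where
{{appliesTo}} returns true; that step is left out here so the sketch has no
dependency beyond the JDK.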
As {{parse-metatags}} does, I store the extracted values in
{{Map<CharSequence, ByteBuffer> metadata}}, with {{jsoup_}} as a key prefix. To
index this data, I use an {{IndexingFilter}} plugin similar to
{{index-metadata}}, which indexes the entries whose keys start with the
{{jsoup_}} prefix.
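The prefix convention can be sketched as follows. This is an illustration
only, not the plugin's code: {{JsoupMetadataSketch}}, {{store}} and
{{indexable}} are hypothetical names, plain {{String}} keys stand in for
whatever {{CharSequence}} implementation the real metadata map holds, values
are assumed to be UTF-8 text, and stripping the prefix when building index
fields is an assumption rather than something stated above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of Nutch or of parse-jsoup.
class JsoupMetadataSketch {
    static final String PREFIX = "jsoup_";

    // Parse side: store one extracted value under a prefixed key.
    static void store(Map<CharSequence, ByteBuffer> metadata, String field, String value) {
        metadata.put(PREFIX + field, ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
    }

    // Indexing side: pick out only the prefixed entries and decode them.
    static Map<String, String> indexable(Map<CharSequence, ByteBuffer> metadata) {
        Map<String, String> fields = new TreeMap<>();
        for (Map.Entry<CharSequence, ByteBuffer> entry : metadata.entrySet()) {
            String key = entry.getKey().toString();
            if (!key.startsWith(PREFIX)) {
                continue; // entry written by some other plugin
            }
            // duplicate() leaves the position of the stored buffer untouched.
            String value = StandardCharsets.UTF_8.decode(entry.getValue().duplicate()).toString();
            // Assumption: the index field name is the key without the prefix.
            fields.put(key.substring(PREFIX.length()), value);
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<CharSequence, ByteBuffer> metadata = new HashMap<>();
        store(metadata, "title", "Some video");
        store(metadata, "viewCount", "1234");
        metadata.put("other_key", ByteBuffer.wrap(new byte[] {1}));
        System.out.println(indexable(metadata));
    }
}
```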
This suited my requirements at work, where I was building a training dataset
and knowledge base of 10M youtube.com videos for an NLP-based project, but I am
not sure how well it fits the general case.
Also, as far as I can see, a similar plugin was proposed earlier in NUTCH-978.
Judging from its comments it was fairly controversial, and the issue was
eventually closed. Please let me know your opinion about this plugin. I have my
own doubt about it: should it be a parse-filter or a parser plugin?
Thanks!
> Precise data parsing using Jsoup CSS selectors
> ----------------------------------------------
>
> Key: NUTCH-2389
> URL: https://issues.apache.org/jira/browse/NUTCH-2389
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Affects Versions: 2.3
> Reporter: Kaidul Islam
> Assignee: Kaidul Islam
> Fix For: 2.4
>
> Original Estimate: 0.05h
> Remaining Estimate: 0.05h
>
> As far as I know, neither Nutch 1.x nor 2.x currently has a feature to
> extract/parse exact content for specific websites. For my current project I
> have developed a plugin, {{parse-jsoup}}, that uses Jsoup to extract precise
> content for site-specific crawling, driven by a detailed XML configuration
> (field name, CSS selector, attribute, extraction rules, data type, default
> value, etc.). Please let me know whether this feature seems relevant and is
> not already present in Nutch. I also plan to port it to Nutch 1.x.