[ 
https://issues.apache.org/jira/browse/NUTCH-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026412#comment-16026412
 ] 

Kaidul Islam commented on NUTCH-2389:
-------------------------------------

Hi [~lewismc], I ended up writing a {{ParseFilter}} plugin, {{parse-jsoup}}, 
which extracts content from pages whose URLs match a configured regex pattern. 
It uses the jsoup API (selection by CSS selector and by attribute) and is 
driven by an XML configuration file. A sample configuration file:

{code:xml}
<page>
        <!-- Jsoup selection is applied to web pages whose URL matches this pattern -->
        <url-regex>^https?://(?:www\.)?youtu(?:\.be/|be\.com/watch\?v=)(?:[a-zA-Z0-9_-]{11}).*$</url-regex>

        <!-- Fields to parse -->
        <fields>
                <field name="title">
                        <css-selector>#eow-title</css-selector>
                        <default-value>foobar</default-value>
                </field>
                <field name="description">
                        <css-selector>#watch-description-text p#eow-description</css-selector>
                </field>
                <field name="uploadTime">
                        <css-selector>.watch-time-text</css-selector>
                </field>
                <field name="likeCount">
                        <css-selector>.like-button-renderer-like-button.like-button-renderer-like-button-unclicked span.yt-uix-button-content</css-selector>
                </field>
                <field name="dislikeCount">
                        <css-selector>.like-button-renderer-dislike-button.like-button-renderer-dislike-button-unclicked span.yt-uix-button-content</css-selector>
                </field>
                <field name="viewCount">
                        <css-selector>.watch-view-count</css-selector>
                </field>
                <field name="subscriberCount">
                        <css-selector>.yt-subscriber-count</css-selector>
                </field>
                <field name="publisherName">
                        <css-selector>.yt-user-info a</css-selector>
                </field>
                <field name="publisherChannel">
                        <css-selector>.yt-user-info a</css-selector>
                        <attribute>abs:href</attribute>
                </field>
                <field name="publisherStatus">
                        <css-selector>.yt-user-info span</css-selector>
                        <attribute>aria-label</attribute>
                </field>
                <field name="category">
                        <css-selector>.watch-extras-section :nth-child(1) a</css-selector>
                </field>
        </fields>
</page> <!-- End of page -->
{code}
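
To make the structure of this file concrete, here is a rough sketch of how 
such a configuration could be loaded and matched against a URL. It is not the 
plugin's actual code: the class {{PageConfigSketch}}, the nested 
{{FieldRule}} and the method names are hypothetical, and only the element 
names ({{url-regex}}, {{field}}, {{css-selector}}, {{attribute}}, 
{{default-value}}) come from the sample above. It uses only the JDK's XML 
parser and {{java.util.regex}}; the jsoup selection step itself is left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: loads one <page> block and decides whether it applies to a URL.
class PageConfigSketch {

    /** One <field> entry; attribute and defaultValue are null when the element is absent. */
    static final class FieldRule {
        final String name;
        final String cssSelector;
        final String attribute;
        final String defaultValue;

        FieldRule(String name, String cssSelector, String attribute, String defaultValue) {
            this.name = name;
            this.cssSelector = cssSelector;
            this.attribute = attribute;
            this.defaultValue = defaultValue;
        }
    }

    final Pattern urlPattern;
    final List<FieldRule> fields = new ArrayList<>();

    private PageConfigSketch(Element page) {
        urlPattern = Pattern.compile(childText(page, "url-regex"));
        NodeList nodes = page.getElementsByTagName("field");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element f = (Element) nodes.item(i);
            fields.add(new FieldRule(
                    f.getAttribute("name"),
                    childText(f, "css-selector"),
                    childText(f, "attribute"),
                    childText(f, "default-value")));
        }
    }

    /** Parses the XML text of a single <page> element. */
    static PageConfigSketch parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return new PageConfigSketch(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid page configuration", e);
        }
    }

    /** Trimmed text of the first descendant element with the given tag, or null if none. */
    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    /** True when the whole URL matches the configured <url-regex>. */
    boolean appliesTo(String url) {
        return urlPattern.matcher(url).matches();
    }

    public static void main(String[] args) {
        String xml = "<page>"
                + "<url-regex>^https?://(?:www\\.)?youtu(?:\\.be/|be\\.com/watch\\?v=)"
                + "(?:[a-zA-Z0-9_-]{11}).*$</url-regex>"
                + "<fields>"
                + "<field name=\"title\"><css-selector>#eow-title</css-selector>"
                + "<default-value>foobar</default-value></field>"
                + "<field name=\"publisherChannel\"><css-selector>.yt-user-info a</css-selector>"
                + "<attribute>abs:href</attribute></field>"
                + "</fields></page>";
        PageConfigSketch config = parse(xml);
        // "abcdefghijk" is a placeholder 11-character video id, not a real video.
        System.out.println(config.appliesTo("https://www.youtube.com/watch?v=abcdefghijk"));
        for (FieldRule rule : config.fields) {
            System.out.println(rule.name + " -> " + rule.cssSelector
                    + " (attribute=" + rule.attribute + ", default=" + rule.defaultValue + ")");
        }
    }
}
```

In a real filter, each {{FieldRule}} would then be applied to the parsed page 
with jsoup, falling back to the default value when the selector matches 
nothing.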

As with {{parse-metatags}}, I put the extracted values into 
{{Map<CharSequence, ByteBuffer> metadata}}, adding {{jsoup_}} as a key prefix. 
To index this data, I use an {{IndexingFilter}} plugin similar to 
{{index-metadata}}, which indexes the entries whose keys start with 
{{jsoup_}}.
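
The prefix convention can be sketched roughly as follows. This is an 
illustration only, not the plugin's code: the class {{JsoupMetadataSketch}} 
and its two methods are hypothetical, plain {{String}} keys stand in for 
whatever {{CharSequence}} type Nutch actually stores, and stripping the 
prefix before indexing is my assumption for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the "jsoup_" prefix convention on a metadata map.
class JsoupMetadataSketch {

    static final String PREFIX = "jsoup_";

    /** Parse side: store an extracted value under the prefixed key, UTF-8 encoded. */
    static void putField(Map<CharSequence, ByteBuffer> metadata, String field, String value) {
        metadata.put(PREFIX + field, ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
    }

    /** Index side: collect only the prefixed entries, decoded back to strings. */
    static Map<String, String> indexableFields(Map<CharSequence, ByteBuffer> metadata) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<CharSequence, ByteBuffer> entry : metadata.entrySet()) {
            String key = entry.getKey().toString();
            if (key.startsWith(PREFIX)) {
                // duplicate() keeps the stored buffer's position untouched while decoding.
                String value = StandardCharsets.UTF_8.decode(entry.getValue().duplicate()).toString();
                // Assumption: the prefix is dropped to form the index field name.
                out.put(key.substring(PREFIX.length()), value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<CharSequence, ByteBuffer> metadata = new HashMap<>();
        putField(metadata, "title", "Some video title");
        // An entry written by some other plugin; it should be ignored when indexing.
        metadata.put("meta_keywords", ByteBuffer.wrap("a,b".getBytes(StandardCharsets.UTF_8)));
        System.out.println(indexableFields(metadata));
    }
}
```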

This suited my requirements at work, where I was building a training dataset 
and knowledge base of 10M youtube.com videos for an NLP-based project. I am 
not sure how well it fits the general case, though.

Also, as far as I can see, a similar plugin was proposed earlier in NUTCH-978. 
Judging from its comments, that proposal was fairly controversial, and the 
issue was eventually closed. Please let me know your opinion about this 
plugin. I am unsure about one point myself: should it be a parse filter or a 
parser plugin?

Thanks!

> Precise data parsing using Jsoup CSS selectors
> ----------------------------------------------
>
>                 Key: NUTCH-2389
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2389
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 2.3
>            Reporter: Kaidul Islam
>            Assignee: Kaidul Islam
>             Fix For: 2.4
>
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> As far as I know, neither Nutch 1.x nor 2.x currently has a feature to 
> extract precise content from specific websites. For my current project I 
> have developed a plugin, {{parse-jsoup}}, that uses Jsoup to extract precise 
> content for site-specific crawling, driven by a detailed XML configuration 
> (field name, CSS selector, attribute, extraction rules, data type, default 
> value, etc.). Please let me know whether this feature seems relevant and is 
> not already present in Nutch. I also plan to port it to Nutch 1.x.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
