[jira] [Commented] (NUTCH-1749) Optionally exclude title from content field

ASF GitHub Bot (Jira) Wed, 04 Sep 2019 04:10:36 -0700


    [ 
https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922383#comment-16922383
 ]


ASF GitHub Bot commented on NUTCH-1749:
---------------------------------------

jorgelbg commented on pull request #285: fix for NUTCH-1749 contributed by steeveb972
URL: https://github.com/apache/nutch/pull/285#discussion_r320663834
 
 

 ##########
 File path: build.xml
 ##########
 @@ -1002,7 +1002,7 @@
 
   <!-- target: ant-eclipse-download   =================================== -->
   <target name="ant-eclipse-download" description="--> downloads the ant-eclipse binary.">
-    <get src="http://downloads.sourceforge.net/project/ant-eclipse/ant-eclipse/1.0/ant-eclipse-1.0.bin.tar.bz2"
+    <get src="http://freefr.dl.sourceforge.net/project/ant-eclipse/ant-eclipse/1.0/ant-eclipse-1.0.bin.tar.bz2"
 
 Review comment:
   I don't see a good reason to change this URL.
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Optionally exclude title from content field
> -------------------------------------------
>
>                 Key: NUTCH-1749
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1749
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.7
>            Reporter: Greg Padiasek
>            Priority: Major
>             Fix For: 1.16
>
>         Attachments: DOMContentUtils.patch
>
>
> The HTML parser plugin inserts the document title into the document content.
> Since the title alone can be retrieved via DOMContentUtils.getTitle() and the
> content via DOMContentUtils.getText(), there is no need to duplicate the title
> in the content. When the title is included in the content, it becomes
> difficult or impossible to extract the document body without the title. This
> matters when a user wants to index or display the body and title separately.
> Attached is a patch that prevents the HTML parser plugin from including the
> title in the document content.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
