[jira] [Commented] (NUTCH-977) SolrMappingReader uses hardcoded configuration parameter name for mapping file
[ https://issues.apache.org/jira/browse/NUTCH-977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016324#comment-13016324 ]

Markus Jelsma commented on NUTCH-977:
---
Any objections, devs? Everything is working fine with these patches.

SolrMappingReader uses hardcoded configuration parameter name for mapping file
---
Key: NUTCH-977
URL: https://issues.apache.org/jira/browse/NUTCH-977
Project: Nutch
Issue Type: Bug
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.3, 2.0
Attachments: NUTCH-977-1.3.patch, NUTCH-977-trunk.patch

Because the SolrMappingReader uses a hard-coded value for the name of the mapping file configuration parameter, it actually works. It should rely on SolrConstants instead of using a hard-coded value.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-976) SolrIndex constants in wrong namespace (or prefix)
[ https://issues.apache.org/jira/browse/NUTCH-976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016323#comment-13016323 ]

Markus Jelsma commented on NUTCH-976:
---
Any objections, devs? Everything is working fine with these patches.

SolrIndex constants in wrong namespace (or prefix)
---
Key: NUTCH-976
URL: https://issues.apache.org/jira/browse/NUTCH-976
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.3, 2.0
Attachments: NUTCH-976-1.3-trunk.patch

The shipped nutch-default.xml configuration file uses "solrindex." as the namespace for configuration parameters, but the namespace (or prefix) in SolrConstants is "solr" instead. It should be "solrindex.".
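To make the two reports above concrete, here is a minimal sketch of why a prefix mismatch or a hard-coded key goes unnoticed. It is not Nutch's code: `java.util.Properties` stands in for the real configuration object, and the class name, the key `solrindex.mapping.file` and the default file name are assumptions chosen for illustration only.

```java
import java.util.Properties;

/** Illustrative only: all names here are assumptions, not Nutch's actual code. */
final class SolrConfigKeys {
    // Prefix used by the shipped configuration file, according to NUTCH-976.
    static final String PREFIX = "solrindex.";
    // Hypothetical key and default for the mapping file parameter.
    static final String MAPPING_FILE = PREFIX + "mapping.file";
    static final String DEFAULT_MAPPING_FILE = "solrindex-mapping.xml";

    private SolrConfigKeys() {}

    /** Looks up the mapping file through the shared constant instead of a literal. */
    static String mappingFile(Properties conf) {
        return conf.getProperty(MAPPING_FILE, DEFAULT_MAPPING_FILE);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("solrindex.mapping.file", "custom-mapping.xml");

        // A lookup under the mismatched "solr." prefix silently falls back to the default.
        System.out.println(conf.getProperty("solr.mapping.file", DEFAULT_MAPPING_FILE));
        // A lookup through the constant finds the configured value.
        System.out.println(mappingFile(conf));
    }
}
```

Routing every read through one constant means the prefix only has to be corrected in one place, which is the direction both issues point in.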
[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ammar Shadiq updated NUTCH-978:
---
Attachment: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf

Proposal for Google Summer of Code 2011
http://www.google-melange.com/gsoc/homepage/google/gsoc2011
I haven't found a mentor yet :-(

[GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
---
Key: NUTCH-978
URL: https://issues.apache.org/jira/browse/NUTCH-978
Project: Nutch
Issue Type: New Feature
Components: parser
Affects Versions: 1.2
Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Labels: gsoc
Fix For: 2.0
Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf
Original Estimate: 1680h
Remaining Estimate: 1680h

Nutch uses the parse-html plugin to parse web pages. It processes the content of a web page by removing HTML tags and components such as JavaScript and CSS, leaving the extracted text to be stored in the index. By default, Nutch cannot select particular elements of an HTML page, such as certain tags, certain content, or some part of the page.

An HTML page has a tree-like XML structure, with HTML tags as its branches and text as its nodes. These branches and nodes can be extracted using XPath. XPath lets us select a particular branch or node of an XML document, so it can be used to extract specific information and treat it differently based on its content and the user's requirements. Furthermore, a web domain such as a news website usually uses the same HTML structure for the information on its pages. That shared structure can be parsed with the same XPath query to retrieve the same content elements. All of the XPath queries for selecting various content can be stored in an XPath configuration file.

Nutch is meant to handle a variety of web sources, and not all pages retrieved from those sources share the same HTML structure, so they have to be treated differently using the correct XPath configuration. The correct XPath configuration can be selected automatically with a regex, by matching the URL of the web page against the valid URL pattern for that configuration. This automatic mechanism lets Nutch users process a variety of web pages and keep only the information they want, making the index more accurate and its content more flexible.

The components for this idea have been tested on Nutch 1.2 for selecting certain elements on various news websites for the purpose of document clustering. This includes a Configuration Editor application built using the NetBeans 6.9 Application Framework, though it still needs some debugging.
http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip
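The mechanism described in the proposal (pick an XPath configuration by matching the page URL against a regex, then run that XPath query on the page) could be sketched roughly as follows. This is not the proposed plugin: the class name, the URL patterns, the queries and the sample markup are all invented for illustration, and the sketch assumes well-formed markup, whereas real-world HTML would first need a tolerant parser.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Illustrative sketch only; every name and pattern here is hypothetical. */
final class XPathExtractionSketch {
    // Hypothetical configuration: URL pattern -> XPath query for the headline.
    static final Map<Pattern, String> CONFIGS = new LinkedHashMap<>();
    static {
        CONFIGS.put(Pattern.compile("^https?://news\\.example\\.com/.*"),
                "//h1[@class='headline']");
        CONFIGS.put(Pattern.compile("^https?://blog\\.example\\.org/.*"),
                "//div[@id='post']/h2");
    }

    private XPathExtractionSketch() {}

    /** Returns the query of the first configuration whose pattern matches the URL, or null. */
    static String selectXPath(String url) {
        for (Map.Entry<Pattern, String> entry : CONFIGS.entrySet()) {
            if (entry.getKey().matcher(url).matches()) {
                return entry.getValue();
            }
        }
        return null;
    }

    /** Evaluates the query against well-formed markup and returns the trimmed text of the match. */
    static String extract(String xhtml, String query) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(query, doc).trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + query, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><h1 class='headline'>Sample headline</h1>"
                + "<p>Body text.</p></body></html>";
        String query = selectXPath("http://news.example.com/2011/04/story.html");
        System.out.println(extract(page, query));
    }
}
```

Keeping the patterns in an ordered map makes the first matching configuration win, which is one simple way to resolve overlapping URL patterns; the proposal itself does not say how such conflicts would be handled.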
[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ammar Shadiq updated NUTCH-978:
---
Priority: Minor (was: Major)

[GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
---
Key: NUTCH-978
URL: https://issues.apache.org/jira/browse/NUTCH-978
Project: Nutch
Issue Type: New Feature
Components: parser
Affects Versions: 1.2
Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Priority: Minor
Labels: gsoc
Fix For: 2.0
Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf
Original Estimate: 1680h
Remaining Estimate: 1680h
[jira] [Commented] (NUTCH-967) Upgrade to Tika 0.9
[ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016465#comment-13016465 ]

Gabriele Kahlout commented on NUTCH-967:
---
Julien, why doesn't your patch modify the tika-parse plugin.xml to use tika-parsers-0.9 instead of tika-parsers-0.7? Trying to do so, I get an exception (for both HTML and PDFs):

Exception in thread main java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
    at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:177)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:163)

It's enough to set it back to 0.7 to have it work. This is not an issue with HTML only but also with PDFs.

Upgrade to Tika 0.9
---
Key: NUTCH-967
URL: https://issues.apache.org/jira/browse/NUTCH-967
Project: Nutch
Issue Type: Task
Components: parser
Affects Versions: 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Julien Nioche
Fix For: 1.3, 2.0
Attachments: NUTCH-967-1.3.patch
[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ammar Shadiq updated NUTCH-978:
---
Attachment: app_screenshoot_url_regex_filter.png
            app_screenshoot_source_view.png
            app_screenshoot_configuration_result_anchor.png
            app_screenshoot_configuration_result.png

[GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
---
Key: NUTCH-978
URL: https://issues.apache.org/jira/browse/NUTCH-978
Project: Nutch
Issue Type: New Feature
Components: parser
Affects Versions: 1.2
Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Priority: Minor
Labels: gsoc2011, mentor
Fix For: 2.0
Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png
Original Estimate: 1680h
Remaining Estimate: 1680h
Build failed in Jenkins: Nutch-trunk #1449
See https://hudson.apache.org/hudson/job/Nutch-trunk/1449/

--
[...truncated 1009 lines...]
A    src/plugin/subcollection/src/java/org/apache/nutch/collection
A    src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A    src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A    src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A    src/plugin/subcollection/src/java/org/apache/nutch/indexer
A    src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A    src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A    src/plugin/subcollection/README.txt
A    src/plugin/subcollection/plugin.xml
A    src/plugin/subcollection/build.xml
A    src/plugin/index-more
A    src/plugin/index-more/ivy.xml
A    src/plugin/index-more/src
A    src/plugin/index-more/src/test
A    src/plugin/index-more/src/test/org
A    src/plugin/index-more/src/test/org/apache
A    src/plugin/index-more/src/test/org/apache/nutch
A    src/plugin/index-more/src/test/org/apache/nutch/indexer
A    src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A    src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A    src/plugin/index-more/src/java
A    src/plugin/index-more/src/java/org
A    src/plugin/index-more/src/java/org/apache
A    src/plugin/index-more/src/java/org/apache/nutch
A    src/plugin/index-more/src/java/org/apache/nutch/indexer
A    src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A    src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A    src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A    src/plugin/index-more/plugin.xml
A    src/plugin/index-more/build.xml
AU   src/plugin/plugin.dtd
A    src/plugin/parse-ext
A    src/plugin/parse-ext/ivy.xml
A    src/plugin/parse-ext/src
A    src/plugin/parse-ext/src/test
A    src/plugin/parse-ext/src/test/org
A    src/plugin/parse-ext/src/test/org/apache
A    src/plugin/parse-ext/src/test/org/apache/nutch
A    src/plugin/parse-ext/src/test/org/apache/nutch/parse
A    src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A    src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A    src/plugin/parse-ext/src/java
A    src/plugin/parse-ext/src/java/org
A    src/plugin/parse-ext/src/java/org/apache
A    src/plugin/parse-ext/src/java/org/apache/nutch
A    src/plugin/parse-ext/src/java/org/apache/nutch/parse
A    src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A    src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A    src/plugin/parse-ext/plugin.xml
A    src/plugin/parse-ext/build.xml
A    src/plugin/parse-ext/command
A    src/plugin/urlnormalizer-pass
A    src/plugin/urlnormalizer-pass/ivy.xml
A    src/plugin/urlnormalizer-pass/src
A    src/plugin/urlnormalizer-pass/src/test
A    src/plugin/urlnormalizer-pass/src/test/org
A    src/plugin/urlnormalizer-pass/src/test/org/apache
A    src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A    src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A    src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A    src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU   src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A    src/plugin/urlnormalizer-pass/src/java
A    src/plugin/urlnormalizer-pass/src/java/org
A    src/plugin/urlnormalizer-pass/src/java/org/apache
A    src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A    src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A    src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A    src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU   src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU   src/plugin/urlnormalizer-pass/plugin.xml
AU   src/plugin/urlnormalizer-pass/build.xml
A    src/plugin/parse-html
A    src/plugin/parse-html/ivy.xml
A    src/plugin/parse-html/lib
A    src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A    src/plugin/parse-html/src
A    src/plugin/parse-html/src/test
A    src/plugin/parse-html/src/test/org
A    src/plugin/parse-html/src/test/org/apache
A    src/plugin/parse-html/src/test/org/apache/nutch
A    src/plugin/parse-html/src/test/org/apache/nutch/parse
A