[jira] [Created] (NUTCH-1288) Generator should not generate filter and not found and denied and gone and permanently moved pages
Generator should not generate filter and not found and denied and gone and permanently moved pages
--
Key: NUTCH-1288
URL: https://issues.apache.org/jira/browse/NUTCH-1288
Project: Nutch
Issue Type: Bug
Components: fetcher, generator
Affects Versions: 1.4
Reporter: behnam nikbakht

The Generator should not generate filtered, not-found, denied, gone, or permanently-moved pages. In the shouldFetch method of AbstractFetchSchedule, the CrawlDatum must be checked against special fetch states such as "not found" so that those URLs are not generated again. We could add a status to CrawlDatum that marks invalid URLs and set this status during fetch.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1288) Generator should not generate filter and not found and denied and gone and permanently moved pages
[ https://issues.apache.org/jira/browse/NUTCH-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

behnam nikbakht updated NUTCH-1288:
-----------------------------------
Attachment: NUTCH-1288.patch
[jira] [Resolved] (NUTCH-1288) Generator should not generate filter and not found and denied and gone and permanently moved pages
[ https://issues.apache.org/jira/browse/NUTCH-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-1288.
----------------------------------
Resolution: Invalid

This is not the right way to do it. If you don't want to re-try such pages, then implement a custom fetch schedule - don't hack AbstractFetchSchedule as the patch does. Hardcoding the schedule policy forces people to use Nutch the way you want to use it, which is not a good idea. Moreover, your patch removes useful information about the status of a page and replaces it with a more generic (and dubious) value.
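For anyone who wants the custom-schedule route Julien describes, a minimal sketch could look like the class below. It assumes the Nutch 1.4 fetch-schedule API (AbstractFetchSchedule and the CrawlDatum status constants); the class name is invented, and a real deployment would register it through the db.fetch.schedule.class property.

    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.AbstractFetchSchedule;
    import org.apache.nutch.crawl.CrawlDatum;

    // Hypothetical example class, not shipped with Nutch.
    public class SkipDeadPagesFetchSchedule extends AbstractFetchSchedule {

      @Override
      public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
        byte status = datum.getStatus();
        // Never re-generate pages recorded as gone or permanently redirected.
        if (status == CrawlDatum.STATUS_DB_GONE
            || status == CrawlDatum.STATUS_DB_REDIR_PERM) {
          return false;
        }
        // Defer to the default interval-based decision for everything else.
        return super.shouldFetch(url, datum, curTime);
      }
    }

This keeps the page-status information in the CrawlDb intact and leaves the default scheduling policy untouched for users who do want to re-try such pages.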
[jira] [Commented] (NUTCH-1281) tika parser not work properly with unwanted file types that passed from filters in nutch
[ https://issues.apache.org/jira/browse/NUTCH-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212502#comment-13212502 ]

Julien Nioche commented on NUTCH-1281:
--------------------------------------

Behnam, I suppose that you are seeing this issue when using the Crawl class but not when using a script. The reason for this is that the timeout mechanism prevents the parser from getting locked by files which have been truncated or which put the underlying parser library in a spin. When using the Crawl class, these runaway threads are not cleared; they accumulate and take all the memory that is left. The Crawl class is planned to be replaced by a shell script, which will remove this issue and allow people to modify the process easily (and make the pipeline easier to understand).

Or are you seeing this when using the Parse command in a script? Again, the timeout mechanism should prevent the parser from crashing.

Now, if the aim is to prevent the Tika plugin from processing certain types, a better approach would be to filter the docs prior to parsing based on their mime types, which we can now access from the crawldb metadata. The trouble is that the URLFilters consider only the string of a URL and not any metadata. We could change the API of URLFilters? What other metadata would we take into account for filtering? Another approach would be to filter based on the content type in ParseUtil - so that it is used not only for Tika but for any other parser - and have a blacklist of mime types that would not be parsed. Any thoughts?

tika parser not work properly with unwanted file types that passed from filters in nutch
--
Key: NUTCH-1281
URL: https://issues.apache.org/jira/browse/NUTCH-1281
Project: Nutch
Issue Type: Improvement
Components: parser
Reporter: behnam nikbakht

When parse-plugins.xml contains this mapping:

  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>

all files that pass the URL filters are referred to Tika, but for some file types, like .flv, the Tika parser hangs and causes the parse job to fail. If such files pass regex-urlfilter and the other filters, the parse job fails. For this problem I suggest adding a property listing valid file types, and using code like this in TikaParser.java:

  public ParseResult getParse(Content content) {
    String mimeType = content.getContentType();
+   String[] validTypes = new String[] { "application/pdf", "application/x-tika-msoffice",
+       "application/x-tika-ooxml", "application/vnd.oasis.opendocument.text", "text/plain",
+       "application/rtf", "application/rss+xml", "application/x-bzip2", "application/x-gzip",
+       "application/x-javascript", "application/javascript", "text/javascript",
+       "application/x-shockwave-flash", "application/zip", "text/xml", "application/xml" };
+   boolean valid = false;
+   for (int k = 0; k < validTypes.length; k++) {
+     if (validTypes[k].compareTo(mimeType.toLowerCase()) == 0)
+       valid = true;
+   }
+   if (!valid)
+     return new ParseStatus(ParseStatus.NOTPARSED,
+         "Can't parse for unwanted filetype " + mimeType)
+         .getEmptyParseResult(content.getUrl(), getConf());
    URL base;
    ...
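If the ParseUtil blacklist route were taken, the core check might look like this minimal, self-contained sketch. Everything here is hypothetical: the class name, its wiring into ParseUtil, and the example blocked type are illustrations rather than actual Nutch code.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Locale;
    import java.util.Set;

    // Hypothetical helper illustrating a mime-type blacklist check.
    public class MimeTypeBlacklist {

      private final Set<String> blocked;

      public MimeTypeBlacklist(String... types) {
        blocked = new HashSet<String>(Arrays.asList(types));
      }

      // Returns true if a document with this content type should be parsed.
      public boolean shouldParse(String contentType) {
        if (contentType == null) return true;
        // Normalize e.g. "video/x-flv; charset=..." down to the bare mime type.
        String mime = contentType.split(";")[0].trim().toLowerCase(Locale.ROOT);
        return !blocked.contains(mime);
      }

      public static void main(String[] args) {
        MimeTypeBlacklist blacklist = new MimeTypeBlacklist("video/x-flv");
        System.out.println(blacklist.shouldParse("video/x-flv"));              // false
        System.out.println(blacklist.shouldParse("text/html; charset=utf-8")); // true
      }
    }

The integration point would be just before a document is dispatched to a parser: consult shouldParse(content.getContentType()) and record a not-parsed status instead of invoking the parser when it returns false.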
[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212523#comment-13212523 ]

Ammar Shadiq commented on NUTCH-978:
------------------------------------

Hi Lewis,

Since the proposal was not accepted, I used my summer to work on my undergrad thesis. I graduated from college recently and my time has freed up, so I'd love to help, and it would be awesome if we could collaborate.

thanks,
Ammar

[GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
--
Key: NUTCH-978
URL: https://issues.apache.org/jira/browse/NUTCH-978
Project: Nutch
Issue Type: New Feature
Components: parser
Affects Versions: 1.2
Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Assignee: Chris A. Mattmann
Priority: Minor
Labels: gsoc2011, mentor
Fix For: nutchgora
Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, for_GSoc.zip
Original Estimate: 1,680h
Remaining Estimate: 1,680h

Nutch uses the parse-html plugin to parse web pages: it processes the contents of the page by removing HTML tags and components like JavaScript and CSS, leaving the extracted text to be stored in the index. By default Nutch has no capability to select specific atomic elements of an HTML page: certain tags, certain content, some part of the page, etc. An HTML page has a tree-like XML structure, with HTML tags as branches and text as nodes, and those branches and nodes can be extracted using XPath. XPath allows us to select a particular branch or node of an XML document and therefore to extract specific pieces of information and treat them differently based on their content and the user's requirements.

Furthermore, a web domain such as a news website usually uses the same HTML structure for all of its pages, so the same XPath query can retrieve the same content elements across the site. All of the XPath queries for selecting various content can be stored in an XPath configuration file. Nutch targets many different web sources, and pages retrieved from different sources do not share the same HTML structure, so each has to be treated with the correct XPath configuration. The correct configuration can be selected automatically by matching the page URL against the regex of valid URL patterns for that configuration. This mechanism lets Nutch users process varied web pages and keep only the information they want, making the index more accurate and its content more flexible.

The component for this idea has been tested on Nutch 1.2 for selecting certain elements of various news websites for the purpose of document clustering. It includes a Configuration Editor application built with the NetBeans 6.9 Application Framework, though it needs some debugging. http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip
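To make the XPath mechanism concrete, here is a small self-contained sketch of the extraction step. The sample markup and expressions are invented for illustration; a real plugin would first run the fetched HTML through an HTML-tolerant parser (such as the NekoHTML or TagSoup parsers Nutch's parse-html plugin uses) to obtain a DOM, since javax.xml.parsers only accepts well-formed XML.

    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;

    public class XPathExtractDemo {
      public static void main(String[] args) throws Exception {
        // Trivial well-formed XHTML standing in for a cleaned-up news page.
        String xhtml = "<html><body>"
            + "<h1 class=\"headline\">Some title</h1>"
            + "<div id=\"story\"><p>First paragraph.</p><p>Second.</p></div>"
            + "</body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // One expression per content element, as an XPath configuration file would hold.
        String title = xpath.evaluate("//h1[@class='headline']", doc);
        NodeList paras = (NodeList) xpath.evaluate(
            "//div[@id='story']/p", doc, XPathConstants.NODESET);

        System.out.println("title: " + title);
        for (int i = 0; i < paras.getLength(); i++) {
          System.out.println("para: " + paras.item(i).getTextContent());
        }
      }
    }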
[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212542#comment-13212542 ]

Ammar Shadiq commented on NUTCH-978:
------------------------------------

I'll send you an email.
[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212570#comment-13212570 ]

Chris A. Mattmann commented on NUTCH-978:
-----------------------------------------

Guys, I think it's fine to keep the conversation on list; in fact, I'd favor it, unless there is a specific reason to take it off list?
[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212576#comment-13212576 ]

Lewis John McGibbney commented on NUTCH-978:
--------------------------------------------

No bother Chris. So far the questions that have been asked are:

1. Provide a quick run-down of the issue, summarizing all of the above.
2. What were the motivations, purpose and technical challenges encountered whilst working on it?
3. Why did the issue drop away, and what do you think is required to get it back on track and possibly into the codebase?
[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212582#comment-13212582 ]

Lewis John McGibbney commented on NUTCH-978:
--------------------------------------------

Replies:

1 & 2. The main motivation for this issue was processing the news documents required for my undergrad thesis on clustering Bahasa Indonesia news text; it needed a preprocessing step to extract the title, news content, date, and related news links separately.

2. The biggest technical challenge for me was processing the web page so that it could be parsed as an XML document and queried with XPath.

3. The issue dropped away because with a small tweak I could get it working for just my thesis requirements. I haven't tested it with web pages other than the ones I used for my thesis, so I don't think it's anywhere near finished yet. And since the proposal was not accepted as a GSoC project, I lost the motivation to continue working on this issue and decided to work on my thesis instead.

related issue: https://issues.apache.org/jira/browse/NUTCH-185
[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212584#comment-13212584 ]

Lewis John McGibbney commented on NUTCH-978:
--------------------------------------------

Generally speaking the plugin sounds really useful. The only problem I see is that it is very specific: for something to be integrated into the codebase we usually need to make it specific enough to address some given task fully, in a well-defined and well-justified manner, but also general enough to be used in many different contexts. This increases usability and user feedback as well as engagement.

4. With regards to the biggest technical challenge being the processing of web pages, how far did you get with this? Were you able to process them with enough precision to satisfy your requirements?

5. How were you querying it with XPath? You cannot query with XPath, but instead with XQuery. Do you maybe mean that XPath enabled you to navigate the document and address various parts of it?

6. OK, I understand why it has crumbled slightly, but I think if the code is there it would be a huge waste if we didn't try to revive it, possibly getting it integrated into the codebase, or maybe getting it added as a contrib component without shipping it within the core codebase if the former is not a viable option.

I've had a look at NUTCH-185, but I think we can discard it, as it was addressed a very long time ago and is already integrated into the codebase. I was referring more to Jira issues which are currently open, which we could maybe merge or combine to give this a more general and possibly better-justified argument for inclusion in the codebase... what do you think? Does NUTCH-585 fit this?
[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212605#comment-13212605 ]

Ammar Shadiq commented on NUTCH-978:
------------------------------------

4. With regards to the biggest technical challenge being the processing of web pages, how far did you get with this? Were you able to process them with enough precision to satisfy your requirements?

I got it working for my text-clustering algorithm; application screenshots are provided here: http://www.facebook.com/media/set/?set=a.2075564646205.124550.1157621543&type=3&l=7313965254. Yes, it's quite satisfactory.

5. How were you querying it with XPath? You cannot query with XPath, but instead with XQuery. Do you maybe mean that XPath enabled you to navigate the document and address various parts of it?

In my understanding there are three ways to query an XML document: XPath, XQuery and XSLT; I'm sorry if I got that wrong. For navigating the various parts of the page I use a Java HTML parse listener extending HTMLEditorKit.ParserCallback and then display the structure in the editor application (something like Chromium's "inspect element"); this makes the web page structure visible and thus makes the XPath expressions easier to write.

6. OK, I understand why it has crumbled slightly, but I think if the code is there it would be a huge waste if we didn't try to revive it, possibly getting it integrated into the codebase, or maybe getting it added as a contrib component without shipping it within the core codebase if the former is not a viable option.

I totally agree.

As for NUTCH-585, I think the idea is different: NUTCH-585 tries to block certain parts of a page, whereas this idea retrieves only certain parts and, in addition, stores them in specific Lucene fields (I haven't looked into the Solr implementation yet), thus automatically discarding the rest.
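The URL-regex-to-configuration selection described in the issue could be sketched as below. The class, its method names, and the example pattern are hypothetical; in the actual component the mapping would be loaded from the XPath configuration file rather than registered in code.

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.regex.Pattern;

    // Hypothetical illustration of picking an XPath configuration by URL.
    public class XPathConfigSelector {

      // Maps a URL pattern to the XPath used for that site's main content.
      private final Map<Pattern, String> contentXPathByUrl =
          new LinkedHashMap<Pattern, String>();

      public void register(String urlRegex, String contentXPath) {
        contentXPathByUrl.put(Pattern.compile(urlRegex), contentXPath);
      }

      // Returns the XPath for the first matching URL pattern, or null if none match.
      public String select(String url) {
        for (Map.Entry<Pattern, String> e : contentXPathByUrl.entrySet()) {
          if (e.getKey().matcher(url).find()) {
            return e.getValue();
          }
        }
        return null;
      }

      public static void main(String[] args) {
        XPathConfigSelector selector = new XPathConfigSelector();
        selector.register("^https?://www\\.example-news\\.com/.*",
            "//div[@id='article-body']//p");
        System.out.println(selector.select("http://www.example-news.com/2012/some-story"));
      }
    }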
I think I found a bug -- multiple_values_encountered_for_non_multiValued_field_title
So I've been getting the error multiple_values_encountered_for_non_multiValued_field_title every once in a while when I try to run solrindex. I can now say that this is being caused by the index-more plugin (MoreIndexingFilter.java):

  private NutchDocument resetTitle(NutchDocument doc, ParseData data, String url) {
    String contentDisposition = data.getMeta(Metadata.CONTENT_DISPOSITION);
    if (contentDisposition == null)
      return doc;
    for (int i = 0; i < patterns.length; i++) {
      Matcher matcher = patterns[i].matcher(contentDisposition);
      if (matcher.find()) {
        doc.add("title", matcher.group(1));
        break;
      }
    }
    return doc;
  }

The problem here is that in my case this function is not resetting the title but just adding a new one. It seems the original idea was that if CONTENT_DISPOSITION exists, the document will not have had a title set by other plugins (namely index-basic). Unfortunately this is not always the case, as you can see by running this command:

  bin/nutch indexchecker http://www.2modern.com/site/gift-registry.html

What I get (the relevant part) is:

  tstamp : Tue Feb 21 13:18:13 PST 2012
  type : text/html
  type : text
  type : html
  date : Tue Feb 21 13:18:13 PST 2012
  url : http://www.2modern.com/site/gift-registry.html
  content : 2Modern Gift Registry Modern Furniture Lighting items in cart 0 checkout Returning 2Modern cu
  user_ranking : 25.0
  title : 2Modern Gift Registry
  title : gift-registry.html
  plutoz_ranking : 10.0
  categories : Furniture Home
  contentLength : 12924

As you can see, there are 2 titles. I think it would be very easy to fix: just check whether a title already exists before setting the file name as the title:

  if (contentDisposition == null || null != doc.getField("title"))
    return doc;

or, if the substitution must happen in the presence of CONTENT_DISPOSITION, at least remove the old one:

  if (matcher.find()) {
    doc.remove("title");
    doc.add("title", matcher.group(1));
    break;
  }

Now, that being said, the real problem here is why NutchDocument doesn't observe the schema.xml file and always assumes that all fields are multi-valued:

  public void add(String name, Object value) {
    NutchField field = fields.get(name);
    if (field == null) {
      field = new NutchField(value);
      fields.put(name, field);
    } else {
      field.add(value);
    }
  }

--
Kaveh Minooie
www.plutoz.com
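On the closing question, a schema-aware add() is straightforward to sketch. The class below is a hypothetical illustration, not the actual NutchDocument: the set of single-valued field names is hard-coded, whereas a real fix would derive it from the fields declared multiValued="false" in schema.xml.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical sketch: add() overwrites single-valued fields, appends otherwise.
    public class SchemaAwareDocumentDemo {

      // Stand-in for what schema.xml declares as multiValued="false".
      private static final Set<String> SINGLE_VALUED =
          new HashSet<String>(Arrays.asList("title", "url", "tstamp"));

      private final Map<String, List<Object>> fields =
          new LinkedHashMap<String, List<Object>>();

      public void add(String name, Object value) {
        List<Object> values = fields.get(name);
        if (values == null) {
          values = new ArrayList<Object>();
          fields.put(name, values);
        } else if (SINGLE_VALUED.contains(name)) {
          values.clear(); // single-valued field: the last value wins
        }
        values.add(value);
      }

      public static void main(String[] args) {
        SchemaAwareDocumentDemo doc = new SchemaAwareDocumentDemo();
        doc.add("title", "2Modern Gift Registry");
        doc.add("title", "gift-registry.html"); // replaces instead of duplicating
        doc.add("type", "text/html");
        doc.add("type", "text");
        System.out.println(doc.fields); // {title=[gift-registry.html], type=[text/html, text]}
      }
    }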
Build failed in Jenkins: Nutch-nutchgora #169
See https://builds.apache.org/job/Nutch-nutchgora/169/
--
[...truncated 2636 lines...]
    [javac] Note: Recompile with -Xlint:unchecked for details.
jar:
      [jar] Building jar: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-suffix/urlfilter-suffix.jar
deps-test:
deploy:
     [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-suffix
copy-generated-lib:
     [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-suffix
init:
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/classes
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/test
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-validator
init-plugin:
deps-jar:
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/ivy/ivysettings.xml
compile:
     [echo] Compiling plugin: urlfilter-validator
    [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/classes
jar:
      [jar] Building jar: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/urlfilter-validator.jar
deps-test:
deploy:
     [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-validator
copy-generated-lib:
     [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-validator
init:
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/classes
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/test
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlnormalizer-basic
init-plugin:
deps-jar:
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/ivy/ivysettings.xml
compile:
     [echo] Compiling plugin: urlnormalizer-basic
    [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/classes
jar:
      [jar] Building jar: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/urlnormalizer-basic.jar
deps-test:
deploy:
     [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlnormalizer-basic
copy-generated-lib:
     [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlnormalizer-basic
init:
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-pass
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-pass/classes
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-pass/test
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlnormalizer-pass
init-plugin:
deps-jar:
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file =
Jenkins build is back to normal : nutch-trunk-maven #161
See https://builds.apache.org/job/nutch-trunk-maven/161/