[jira] [Created] (NUTCH-1288) Generator should not generate filtered, not-found, denied, gone, or permanently-moved pages

2012-02-21 Thread behnam nikbakht (Created) (JIRA)
Generator should not generate filtered, not-found, denied, gone, or 
permanently-moved pages
--

 Key: NUTCH-1288
 URL: https://issues.apache.org/jira/browse/NUTCH-1288
 Project: Nutch
  Issue Type: Bug
  Components: fetcher, generator
Affects Versions: 1.4
Reporter: behnam nikbakht


Generator should not generate filtered, not-found, denied, gone, or 
permanently-moved pages.
In the shouldFetch method of AbstractFetchSchedule, the CrawlDatum must be 
checked against special fetch states such as "not found", so that such pages 
are not generated again. To do this, we can add a status to CrawlDatum that 
indicates invalid URLs, and set this status during fetch.
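
As a rough sketch of the idea (the class name and wiring here are 
illustrative, not the attached patch), a custom fetch schedule could refuse 
to re-generate pages whose last fetch ended in a terminal state:

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.AbstractFetchSchedule;
import org.apache.nutch.crawl.CrawlDatum;

// Hypothetical sketch: skip URLs marked as gone in the crawldb so the
// Generator never selects them again.
public class SkipTerminalFetchSchedule extends AbstractFetchSchedule {
  @Override
  public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
    if (datum.getStatus() == CrawlDatum.STATUS_DB_GONE) {
      return false; // permanently failed; do not generate again
    }
    return super.shouldFetch(url, datum, curTime);
  }
}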





[jira] [Updated] (NUTCH-1288) Generator should not generate filtered, not-found, denied, gone, or permanently-moved pages

2012-02-21 Thread behnam nikbakht (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

behnam nikbakht updated NUTCH-1288:
---

Attachment: NUTCH-1288.patch






[jira] [Resolved] (NUTCH-1288) Generator should not generate filtered, not-found, denied, gone, or permanently-moved pages

2012-02-21 Thread Julien Nioche (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-1288.
--

Resolution: Invalid

This is not the right way to do it. If you don't want to re-try such pages, 
implement a custom fetch schedule; don't hack AbstractFetchSchedule as you 
do. Hardcoding the schedule policy forces people to use Nutch the way you 
want to use it, which is not a good idea. Moreover, your patch removes useful 
information about the status of a page in favor of a more generic (and 
dubious) value.
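
For reference, a custom schedule can be plugged in through configuration 
rather than by patching core code; a minimal nutch-site.xml sketch (the 
class name is hypothetical):

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.example.SkipTerminalFetchSchedule</value>
  <description>Custom FetchSchedule implementation to use.</description>
</property>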






[jira] [Commented] (NUTCH-1281) tika parser does not work properly with unwanted file types that pass through the filters in nutch

2012-02-21 Thread Julien Nioche (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13212502#comment-13212502
 ] 

Julien Nioche commented on NUTCH-1281:
--

Behnam,

I suppose that you are seeing this issue when using the Crawl class but not 
when using a script. The reason is that the timeout mechanism prevents the 
parser from getting locked on files which have been truncated or which put 
the underlying parser library into a spin. When using the Crawl class, these 
runaway threads are not cleared; they accumulate and take all the memory 
left. The Crawl class is planned to be replaced by a shell script, which will 
remove this issue, allow people to modify the process easily, and make the 
pipeline easier to understand.

Or are you seeing this when using the Parse command in a script? Again, the 
timeout mechanism should prevent the parser from crashing.

Now, if the goal is to prevent the Tika plugin from processing certain types, 
a better approach would be to filter the docs prior to parsing based on their 
MIME types, which we can now access from the crawldb metadata. The trouble is 
that the URLFilters consider only the string of a URL and not any metadata. 
Should we change the API of the URLFilters? What other metadata would we take 
into account for filtering?

Another approach would be to filter based on the content type in ParseUtil, 
so that it applies not only to Tika but to any other parser, with a blacklist 
of MIME types that would not be parsed.

Any thoughts?
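
A minimal sketch of that last idea, assuming a hypothetical 
"parse.mime.blacklist" property (ParseUtil's real flow differs):

import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;

// Hypothetical helper: decide, before dispatching to any parser, whether a
// document's MIME type is on a configurable blacklist.
public class MimeBlacklist {
  private final Set<String> blocked = new HashSet<String>();

  public MimeBlacklist(Configuration conf) {
    for (String t : conf.getStrings("parse.mime.blacklist", new String[0])) {
      blocked.add(t.trim().toLowerCase());
    }
  }

  public boolean isBlocked(String contentType) {
    return contentType != null && blocked.contains(contentType.toLowerCase());
  }
}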





 tika parser does not work properly with unwanted file types that pass 
 through the filters in nutch
 

 Key: NUTCH-1281
 URL: https://issues.apache.org/jira/browse/NUTCH-1281
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Reporter: behnam nikbakht

 when in parse-plugins.xml this mapping is set:
 <mimeType name="*">
   <plugin id="parse-tika" />
 </mimeType>
 all unwanted files that pass all of the filters are referred to Tika.
 But for some file types, like .flv, the Tika parser has problems and hangs, 
 causing the parse job to fail. If such file types pass regex-urlfilter and 
 the other filters, the parse job fails.
 For this problem I suggest adding a property for the valid file types, and 
 using code like this in TikaParser.java:
 public ParseResult getParse(Content content) {
   String mimeType = content.getContentType();
 + String[] validTypes = new String[] {
 +     "application/pdf", "application/x-tika-msoffice",
 +     "application/x-tika-ooxml", "application/vnd.oasis.opendocument.text",
 +     "text/plain", "application/rtf", "application/rss+xml",
 +     "application/x-bzip2", "application/x-gzip",
 +     "application/x-javascript", "application/javascript",
 +     "text/javascript", "application/x-shockwave-flash", "application/zip",
 +     "text/xml", "application/xml" };
 + boolean valid = false;
 + for (int k = 0; k < validTypes.length; k++) {
 +   if (validTypes[k].compareTo(mimeType.toLowerCase()) == 0)
 +     valid = true;
 + }
 + if (!valid)
 +   return new ParseStatus(ParseStatus.NOTPARSED,
 +       "Can't parse for unwanted filetype " + mimeType)
 +       .getEmptyParseResult(content.getUrl(), getConf());
 
   URL base;





[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain elements of a web page during HTML page parsing.

2012-02-21 Thread Ammar Shadiq (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13212523#comment-13212523
 ] 

Ammar Shadiq commented on NUTCH-978:


Hi Lewis,

Since the proposal was not accepted, I'm using my summer time to work on my 
undergrad thesis. I graduated from college recently and my time has freed 
up, so I'd love to help; it would be awesome if we could collaborate.

thanks,
Ammar

 [GSoC 2011] A Plugin for extracting certain elements of a web page during 
 HTML page parsing.
 ---

 Key: NUTCH-978
 URL: https://issues.apache.org/jira/browse/NUTCH-978
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.2
 Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: gsoc2011, mentor
 Fix For: nutchgora

 Attachments: 
 [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, 
 app_guardian_ivory_coast_news_exmpl.png, 
 app_screenshoot_configuration_result.png, 
 app_screenshoot_configuration_result_anchor.png, 
 app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, 
 for_GSoc.zip

   Original Estimate: 1,680h
  Remaining Estimate: 1,680h

 Nutch uses the parse-html plugin to parse web pages: it processes the 
 contents of a web page by removing HTML tags and components like JavaScript 
 and CSS, leaving the extracted text to be stored in the index. By default, 
 Nutch does not have the capability to select certain atomic elements of an 
 HTML page: certain tags, certain content, some part of the page, etc.
 An HTML page has a tree-like XML structure, with HTML tags as its branches 
 and text as its nodes. These branches and nodes can be extracted using 
 XPath, which allows us to select a certain branch or node of an XML document 
 and therefore extract certain information and treat it differently based on 
 its content and the user's requirements. Furthermore, a web domain such as a 
 news website usually uses the same HTML code structure to store the 
 information on its pages, so the same XPath query can parse that structure 
 and retrieve the same content elements across pages. All of the XPath 
 queries for selecting the various content can be stored in an XPath 
 configuration file.
 Nutch is aimed at various web sources, and not all pages retrieved from 
 those sources share the same HTML code structure, so they have to be treated 
 differently using the correct XPath configuration. Selecting the correct 
 XPath configuration can be done automatically, using a regex to match the 
 URL of the web page against the valid URL pattern for that XPath 
 configuration.
 This automatic mechanism allows a Nutch user to process various web pages 
 and keep only the information the user wants, making the index more accurate 
 and its content more flexible.
 The component for this idea has been tested on Nutch 1.2 for selecting 
 certain elements of various news websites for the purpose of document 
 clustering. It includes a configuration editor application built with the 
 NetBeans 6.9 application framework, though it needs some debugging.
 http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip
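
As a rough illustration of the mechanism described above (not the attached 
plugin code), once a page has been tidied into well-formed XML, a configured 
XPath expression can pull out a single element:

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathExtractExample {
  public static void main(String[] args) throws Exception {
    String xhtml = "<html><body><h1 class='headline'>Example title</h1></body></html>";
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(new StringReader(xhtml)));
    // In the proposed plugin, the expression would come from a per-site
    // XPath configuration file, selected by matching the URL with a regex.
    XPath xpath = XPathFactory.newInstance().newXPath();
    String title = (String) xpath.evaluate("//h1[@class='headline']", doc,
        XPathConstants.STRING);
    System.out.println(title); // prints "Example title"
  }
}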





[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain elements of a web page during HTML page parsing.

2012-02-21 Thread Ammar Shadiq (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13212542#comment-13212542
 ] 

Ammar Shadiq commented on NUTCH-978:


I'll send you an email.






[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain elements of a web page during HTML page parsing.

2012-02-21 Thread Chris A. Mattmann (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13212570#comment-13212570
 ] 

Chris A. Mattmann commented on NUTCH-978:
-

Guys, I think it's fine to keep the conversation on the list; in fact, I'd 
favor that unless there is a specific reason to take it off-list.






[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain elements of a web page during HTML page parsing.

2012-02-21 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13212576#comment-13212576
 ] 

Lewis John McGibbney commented on NUTCH-978:


No bother Chris. So far, the questions that have been asked are:
1. Provide a quick rundown of the issue, summarizing all of the above.
2. What were the motivations, purpose, and technical challenges encountered 
while working on it?
3. Why did the issue drop away, and what do you think is required to get it 
back on track and possibly into the codebase?






[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain elements of a web page during HTML page parsing.

2012-02-21 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13212582#comment-13212582
 ] 

Lewis John McGibbney commented on NUTCH-978:


Replies:

1 & 2. The main motivation for this issue was processing the news documents 
required for my undergrad thesis on clustering Bahasa Indonesia news text; 
it needed a preprocessing step to extract the title, news content, date, and 
related news links separately.

2. The biggest technical challenge for me was processing the web page so 
that it could be parsed as an XML document and queried with XPath.

3. The issue dropped away because, with a small tweak, I could get it 
working for just my thesis requirements. I haven't tested it with web pages 
other than the ones I used for my thesis, so I think it's not nearly 
finished yet. And since the proposal was not accepted as a GSoC project, I 
lost the motivation to continue working on this issue and decided to work on 
my thesis instead.

related issue: https://issues.apache.org/jira/browse/NUTCH-185






[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain elements of a web page during HTML page parsing.

2012-02-21 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13212584#comment-13212584
 ] 

Lewis John McGibbney commented on NUTCH-978:


Generally speaking, the plugin sounds really useful. The only problem I see 
is that it is very specific. For it to be integrated into the codebase, we 
usually need to make it specific enough to address some given task fully, in 
a well-defined and well-justified manner, but also general enough to be used 
in many different contexts. This increases usability and user feedback as 
well as engagement.

4. With regard to the biggest technical challenge being the processing of 
web pages, how far did you get with this? Were you able to process them with 
enough precision to satisfy your requirements?

5. How were you querying it with XPath? You cannot query with XPath, but 
instead with XQuery. Do you maybe mean that XPath enabled you to navigate 
the document and address various parts of it?

6. OK, I understand why it has crumbled slightly, but I think if the code is 
there it would be a huge waste if we didn't try to revive it, possibly 
getting it integrated into the codebase, or maybe adding it as a contrib 
component without shipping it in the core codebase if the former is not a 
viable option.

I've had a look at NUTCH-185, but I think we can discard it, as it was 
addressed a very long time ago and is already integrated into the codebase. 
I was referring more to Jira issues which are currently open, which we could 
maybe merge or combine to give this a more general and possibly 
better-justified argument for inclusion in the codebase... what do you 
think? Does NUTCH-585 fit this?



[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain elements of a web page during HTML page parsing.

2012-02-21 Thread Ammar Shadiq (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13212605#comment-13212605
 ] 

Ammar Shadiq commented on NUTCH-978:


 4. With regard to the biggest technical challenge being the processing of 
 web pages, how far did you get with this? Were you able to process them 
 with enough precision to satisfy your requirements?

I got it to work for my text clustering algorithm; application screenshots 
are provided here: 
http://www.facebook.com/media/set/?set=a.2075564646205.124550.1157621543&type=3&l=7313965254
Yes, it's quite satisfactory.

 5. How were you querying it with XPath? You cannot query with XPath, but 
 instead with XQuery. Do you maybe mean that XPath enabled you to navigate 
 the document and address various parts of it?

In my understanding there are three ways to query an XML document: XPath, 
XQuery, and XSLT; I'm sorry if I got that wrong. For navigating the various 
parts of the page I use a Java HTML parse listener extending 
HTMLEditorKit.ParserCallback and then display the result in the editor 
application (something like Chromium's Inspect Element); this makes the web 
page structure visible and thus makes the XPath expression easier to write.
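
For reference, a tiny sketch along the lines described here (the HTML and 
the printing are illustrative): walking a page's tags with an 
HTMLEditorKit.ParserCallback to expose its structure:

import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class TagTreeExample {
  public static void main(String[] args) throws Exception {
    String html = "<html><body><h1>Title</h1><p>Body text</p></body></html>";
    HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
      private int depth = 0;

      @Override
      public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        // Indentation shows nesting, like an "inspect element" tree.
        for (int i = 0; i < depth; i++) System.out.print("  ");
        System.out.println("<" + t + ">");
        depth++;
      }

      @Override
      public void handleEndTag(HTML.Tag t, int pos) {
        depth--;
      }
    };
    new ParserDelegator().parse(new StringReader(html), callback, true);
  }
}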

 6. OK, I understand why it has crumbled slightly, but I think if the code 
 is there it would be a huge waste if we didn't try to revive it, possibly 
 getting it integrated into the codebase, or maybe adding it as a contrib 
 component without shipping it in the core codebase if the former is not a 
 viable option.

I totally agree.

As for NUTCH-585, I think the difference is that NUTCH-585 tries to block 
certain parts of a page, whereas this idea retrieves only certain parts and, 
in addition, stores them in specific Lucene fields (I haven't looked into 
the Solr implementation yet), thus automatically discarding the rest.


I think I found a bug -- multiple_values_encountered_for_non_multiValued_field_title

2012-02-21 Thread kaveh minooie
So I've been getting this error, 
multiple_values_encountered_for_non_multiValued_field_title, every once 
in a while when running solrindex. I can now say that it is caused by the 
index-more plugin (MoreIndexingFilter.java):


private NutchDocument resetTitle(NutchDocument doc, ParseData data,
    String url) {

  String contentDisposition = data.getMeta(Metadata.CONTENT_DISPOSITION);
  if (contentDisposition == null)
    return doc;

  for (int i = 0; i < patterns.length; i++) {
    Matcher matcher = patterns[i].matcher(contentDisposition);
    if (matcher.find()) {
      doc.add("title", matcher.group(1));
      break;
    }
  }
  return doc;
}


The problem here is that, in my case, this function is not resetting the 
title but just adding a new one. It seems the original idea was that if 
CONTENT_DISPOSITION exists, the document will not have had a title set by 
other plugins (namely index-basic). Unfortunately, this is not always the 
case, as you can see by running this command:


bin/nutch indexchecker http://www.2modern.com/site/gift-registry.html

What I get (the relevant part) is:


tstamp :Tue Feb 21 13:18:13 PST 2012
type :  text/html
type :  text
type :  html
date :  Tue Feb 21 13:18:13 PST 2012
url :   http://www.2modern.com/site/gift-registry.html
content :	2Modern Gift Registry  Modern Furniture  Lighting items 
in cart 0 checkout Returning 2Modern cu

user_ranking :  25.0
title : 2Modern Gift Registry
title : gift-registry.html
plutoz_ranking :10.0
categories :Furniture Home
contentLength : 12924

As you can see, there are two titles. I think it would be very easy to fix: 
just check whether a title already exists before setting the name of the 
file as the title:


if (contentDisposition == null || null != doc.getField("title"))
  return doc;


Or, if the substitution must happen in the presence of CONTENT_DISPOSITION, 
at least remove the old one first:


if (matcher.find()) {
  doc.removeField("title");
  doc.add("title", matcher.group(1));
  break;
}


Now, that being said, the real problem here is: why does NutchDocument not 
observe the schema.xml file, and why does it always assume that all fields 
are multi-valued?


public void add(String name, Object value) {
  NutchField field = fields.get(name);
  if (field == null) {
    field = new NutchField(value);
    fields.put(name, field);
  } else {
    field.add(value);   // <--- always appends, even for single-valued fields
  }
}
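
A sketch of one possible adjustment (the hardcoded field set is invented; a 
real fix would presumably derive it from schema.xml, as suggested above):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical variant of NutchDocument.add: overwrite rather than append
// for fields known to be single-valued.
private static final Set<String> SINGLE_VALUED =
    new HashSet<String>(Arrays.asList("title"));

public void add(String name, Object value) {
  NutchField field = fields.get(name);
  if (field == null || SINGLE_VALUED.contains(name)) {
    fields.put(name, new NutchField(value)); // replace, never accumulate
  } else {
    field.add(value);
  }
}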

--
Kaveh Minooie

www.plutoz.com


Build failed in Jenkins: Nutch-nutchgora #169

2012-02-21 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-nutchgora/169/

--
[...truncated 2636 lines...]
[javac] Note: Recompile with -Xlint:unchecked for details.

jar:
  [jar] Building jar: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-suffix/urlfilter-suffix.jar

deps-test:

deploy:
 [copy] Copying 1 file to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-suffix

copy-generated-lib:
 [copy] Copying 1 file to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-suffix

init:
[mkdir] Created dir: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator
[mkdir] Created dir: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/classes
[mkdir] Created dir: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/test
[mkdir] Created dir: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-validator

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlfilter-validator
[javac] 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/src/plugin/build-plugin.xml:117:
 warning: 'includeantruntime' was not set, defaulting to 
build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 1 source file to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/classes

jar:
  [jar] Building jar: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/urlfilter-validator.jar

deps-test:

deploy:
 [copy] Copying 1 file to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-validator

copy-generated-lib:
 [copy] Copying 1 file to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-validator

init:
[mkdir] Created dir: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic
[mkdir] Created dir: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/classes
[mkdir] Created dir: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/test
[mkdir] Created dir: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlnormalizer-basic

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-basic
[javac] 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/src/plugin/build-plugin.xml:117:
 warning: 'includeantruntime' was not set, defaulting to 
build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 1 source file to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/classes

jar:
  [jar] Building jar: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/urlnormalizer-basic.jar

deps-test:

deploy:
 [copy] Copying 1 file to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlnormalizer-basic

copy-generated-lib:
 [copy] Copying 1 file to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlnormalizer-basic

init:
[mkdir] Created dir: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-pass
[mkdir] Created dir: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-pass/classes
[mkdir] Created dir: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-pass/test
[mkdir] Created dir: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlnormalizer-pass

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 

Jenkins build is back to normal : nutch-trunk-maven #161

2012-02-21 Thread Apache Jenkins Server
See https://builds.apache.org/job/nutch-trunk-maven/161/