[jira] [Resolved] (NUTCH-967) Upgrade to Tika 0.9

2011-04-08 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-967.
-

Resolution: Fixed

trunk : Committed revision 1090181
1.3 : Committed revision 1090182



 Upgrade to Tika 0.9
 ---

 Key: NUTCH-967
 URL: https://issues.apache.org/jira/browse/NUTCH-967
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Julien Nioche
 Fix For: 1.3, 2.0

 Attachments: NUTCH-967-1.3-2.patch, NUTCH-967-1.3-3.patch, 
 NUTCH-967-1.3.patch




--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

2011-04-08 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017378#comment-13017378
 ] 

Julien Nioche commented on NUTCH-978:
-

Can you please explain how your proposal differs from the HTMLParseFilter 
mechanism that Nutch already has?

 [GSoC 2011] A Plugin for extracting certain element of a web page on html 
 page parsing.
 ---

 Key: NUTCH-978
 URL: https://issues.apache.org/jira/browse/NUTCH-978
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.2
 Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: gsoc2011, mentor
 Fix For: 2.0

 Attachments: 
 [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, 
 app_screenshoot_configuration_result.png, 
 app_screenshoot_configuration_result_anchor.png, 
 app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png

   Original Estimate: 1680h
  Remaining Estimate: 1680h

 Nutch uses the parse-html plugin to parse web pages: it processes the contents 
 of a web page by removing HTML tags and components like JavaScript and CSS, 
 leaving the extracted text to be stored in the index. By default Nutch has no 
 capability to select particular atomic elements of an HTML page, such as 
 certain tags, certain content, or some part of the page.
 An HTML page has a tree-like XML structure with HTML tags as its branches and 
 text as its nodes. These branches and nodes can be extracted using XPath, which 
 allows us to select a particular branch or node of an XML document and can 
 therefore be used to extract specific information and treat it differently 
 based on its content and the user's requirements. Furthermore, a web domain 
 such as a news website usually uses the same HTML code structure for storing 
 the information on its pages, so that structure can be parsed with the same 
 XPath query to retrieve the same content elements. All of the XPath queries for 
 selecting the various pieces of content can be stored in an XPath configuration 
 file.
 Since Nutch targets a variety of web sources, and not all pages retrieved from 
 those sources share the same HTML structure, each page has to be treated with 
 the correct XPath configuration. Selecting the correct configuration can be 
 done automatically with a regex, by matching the URL of the page against the 
 URL patterns that are valid for each XPath configuration.
 This automatic mechanism allows a Nutch user to process a variety of web pages 
 and keep only the information the user wants, making the index more accurate 
 and its content more flexible.
 The component for this idea has been tested on Nutch 1.2, selecting certain 
 elements of various news websites for the purpose of document clustering. It 
 includes a configuration editor application built with the NetBeans 6.9 
 Application Framework, though it still needs some debugging.
 http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip
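The URL-regex-to-XPath routing described above can be sketched with the plain JDK. This is my own illustration, not code from the proposal; the URL pattern, XPath query, and sample page are hypothetical stand-ins for what would live in the proposed XPath configuration file:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathScraper {
    // Hypothetical config: URL pattern -> XPath query for the wanted element.
    static final Map<Pattern, String> CONFIGS = new LinkedHashMap<>();
    static {
        CONFIGS.put(Pattern.compile("^https?://news\\.example\\.com/.*"),
                    "//div[@id='article']");
    }

    static String extract(String url, String xhtml) {
        for (Map.Entry<Pattern, String> e : CONFIGS.entrySet()) {
            if (!e.getKey().matcher(url).matches()) continue; // config is for another site
            try {
                // Parse the (well-formed) page and evaluate the matching XPath.
                Document doc = DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder()
                        .parse(new ByteArrayInputStream(
                                xhtml.getBytes(StandardCharsets.UTF_8)));
                return XPathFactory.newInstance().newXPath()
                        .evaluate(e.getValue(), doc).trim();
            } catch (Exception ex) {
                throw new RuntimeException(ex);
            }
        }
        return null; // no config matched: fall back to normal parsing
    }

    public static void main(String[] args) {
        String page = "<html><body><div id='article'>Story text</div>"
                    + "<div id='ads'>Buy now</div></body></html>";
        System.out.println(extract("http://news.example.com/a/1", page));
    }
}
```

A real implementation would of course run on DOM trees produced by the HTML parser rather than on well-formed XML strings.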

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-977) SolrMappingReader uses hardcoded configuration parameter name for mapping file

2011-04-08 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017379#comment-13017379
 ] 

Julien Nioche commented on NUTCH-977:
-

Shouldn't MAPPING_FILE be added to SolrConstants as well?

 SolrMappingReader uses hardcoded configuration parameter name for mapping file
 --

 Key: NUTCH-977
 URL: https://issues.apache.org/jira/browse/NUTCH-977
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.3, 2.0

 Attachments: NUTCH-977-1.3.patch, NUTCH-977-trunk.patch


 Because SolrMappingReader uses a hard-coded value for the name of the mapping 
 file configuration parameter, it happens to work anyway. It should rely on 
 SolrConstants instead of using a hard-coded value.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-976) SolrIndex constants in wrong namespace (or prefix)

2011-04-08 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017382#comment-13017382
 ] 

Julien Nioche commented on NUTCH-976:
-

What about changing the name of the param in the default config instead? I 
suppose it has been named like this to reflect the name of the mapping file 
(solrindex-mapping.xml). SOLR is not used for anything else but indexing so 
using 'solrindex.' is a bit redundant. Not that it really matters mind you...



 SolrIndex constants in wrong namespace (or prefix)
 --

 Key: NUTCH-976
 URL: https://issues.apache.org/jira/browse/NUTCH-976
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.3, 2.0

 Attachments: NUTCH-976-1.3-trunk.patch


 The shipped nutch-default.xml configuration file uses solrindex. as namespace 
 for configuration parameters but the namespace (or prefix) in SolrConstants 
 is solr instead. It should be solrindex.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-975) Fix missing/wrong headers in source files

2011-04-08 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017384#comment-13017384
 ] 

Julien Nioche commented on NUTCH-975:
-

Thanks Markus. Isn't there a tool that we could use to automatically check the 
headers? I think I saw something similar being used with other projects. Would 
save the hassle of doing it manually for the trunk

 Fix missing/wrong headers in source files
 -

 Key: NUTCH-975
 URL: https://issues.apache.org/jira/browse/NUTCH-975
 Project: Nutch
  Issue Type: Task
Affects Versions: 1.3, 2.0
Reporter: Markus Jelsma
Priority: Blocker
 Fix For: 1.3, 2.0

 Attachments: NUTCH-975-1.3.patch


 It seems several source files still do not contain the proper ASL headers. 
 This includes older core in 1.3 (indexer.NutchField etc) and recent code in 
 2.0 (API for instance). This should be fixed (yet again). So if you spot one 
 ;)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-977) SolrMappingReader uses hardcoded configuration parameter name for mapping file

2011-04-08 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017387#comment-13017387
 ] 

Markus Jelsma commented on NUTCH-977:
-

It was added, but https://issues.apache.org/jira/browse/NUTCH-976 seems to 
contain an old patch; I'll update the patch.


 SolrMappingReader uses hardcoded configuration parameter name for mapping file
 --

 Key: NUTCH-977
 URL: https://issues.apache.org/jira/browse/NUTCH-977
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.3, 2.0

 Attachments: NUTCH-977-1.3.patch, NUTCH-977-trunk.patch


 Because SolrMappingReader uses a hard-coded value for the name of the mapping 
 file configuration parameter, it happens to work anyway. It should rely on 
 SolrConstants instead of using a hard-coded value.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-897) Subcollection requires blacklist element

2011-04-08 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017386#comment-13017386
 ] 

Julien Nioche commented on NUTCH-897:
-

Nitpick : What about calling *collection.getElementsByTagName(TAG_BLACKLIST)* 
only once?
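A minimal illustration of the suggested pattern, using my own class and method names rather than Subcollection's real code: query the tag list once, then guard on its length, so a missing (or empty) blacklist element no longer triggers a NullPointerException.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NodeList;

public class BlacklistDemo {
    // Call getElementsByTagName once, keep the NodeList, and guard on length
    // instead of assuming the element exists.
    static String readBlacklist(String subcollectionXml) {
        try {
            NodeList nodes = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            subcollectionXml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement()
                    .getElementsByTagName("blacklist");
            if (nodes.getLength() == 0) {
                return ""; // missing <blacklist> means "empty", not an error
            }
            return nodes.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // No <blacklist> element at all: no NullPointerException, just "".
        System.out.println("[" + readBlacklist(
            "<subcollection><whitelist>http://a/</whitelist></subcollection>") + "]");
    }
}
```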

 Subcollection requires blacklist element
 

 Key: NUTCH-897
 URL: https://issues.apache.org/jira/browse/NUTCH-897
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.3, 2.0

 Attachments: NUTCH-897.patch


 This is a very minor issue in Subcollection.java: it throws an error if the 
 (empty) blacklist element is omitted. I think it should either not fail when 
 the blacklist element is omitted, or throw a decent error message saying that 
 the blacklist element is required. The following exception gets thrown if the 
 blacklist element is omitted in a subcollection block:
 2010-09-06 13:32:30,438 INFO  collection.CollectionManager - Instantiating CollectionManager
 2010-09-06 13:32:30,438 INFO  collection.CollectionManager - initializing CollectionManager
 2010-09-06 13:32:30,451 INFO  collection.CollectionManager - file has1 elements
 2010-09-06 13:32:30,456 WARN  collection.CollectionManager - Error occured:java.lang.NullPointerException
 2010-09-06 13:32:30,469 WARN  collection.CollectionManager - java.lang.NullPointerException
 2010-09-06 13:32:30,470 WARN  collection.CollectionManager - at org.apache.nutch.collection.Subcollection.initialize(Subcollection.java:173)
 2010-09-06 13:32:30,470 WARN  collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.parse(CollectionManager.java:98)
 2010-09-06 13:32:30,470 WARN  collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.init(CollectionManager.java:75)
 2010-09-06 13:32:30,470 WARN  collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.init(CollectionManager.java:56)
 2010-09-06 13:32:30,471 WARN  collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.getCollectionManager(CollectionManager.java:115)
 2010-09-06 13:32:30,471 WARN  collection.CollectionManager - at org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter.addSubCollectionField(SubcollectionIndexingFilter.java:65)
 2010-09-06 13:32:30,471 WARN  collection.CollectionManager - at org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter.filter(SubcollectionIndexingFilter.java:71)
 2010-09-06 13:32:30,471 WARN  collection.CollectionManager - at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:109)
 2010-09-06 13:32:30,471 WARN  collection.CollectionManager - at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:134)
 2010-09-06 13:32:30,472 WARN  collection.CollectionManager - at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
 2010-09-06 13:32:30,472 WARN  collection.CollectionManager - at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
 2010-09-06 13:32:30,472 WARN  collection.CollectionManager - at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
 2010-09-06 13:32:30,472 WARN  collection.CollectionManager - at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-976) SolrIndex constants in wrong namespace (or prefix)

2011-04-08 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-976:


Attachment: NUTCH-976-1.3-1.patch

Correct patch

 SolrIndex constants in wrong namespace (or prefix)
 --

 Key: NUTCH-976
 URL: https://issues.apache.org/jira/browse/NUTCH-976
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.3, 2.0

 Attachments: NUTCH-976-1.3-1.patch, NUTCH-976-1.3-trunk.patch


 The shipped nutch-default.xml configuration file uses solrindex. as namespace 
 for configuration parameters but the namespace (or prefix) in SolrConstants 
 is solr instead. It should be solrindex.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-976) SolrIndex constants in wrong namespace (or prefix)

2011-04-08 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017389#comment-13017389
 ] 

Markus Jelsma commented on NUTCH-976:
-

Yes, I thought about that too, but changing the namespace to solr would break 
existing configurations that rely on solrindex.* params. Usually one would set 
commit.size to prevent OOM errors in Nutch.

 SolrIndex constants in wrong namespace (or prefix)
 --

 Key: NUTCH-976
 URL: https://issues.apache.org/jira/browse/NUTCH-976
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.3, 2.0

 Attachments: NUTCH-976-1.3-1.patch, NUTCH-976-1.3-trunk.patch


 The shipped nutch-default.xml configuration file uses solrindex. as namespace 
 for configuration parameters but the namespace (or prefix) in SolrConstants 
 is solr instead. It should be solrindex.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-944) Increase the number of elements to look for URLs and add the ability to specify multiple attributes by elements

2011-04-08 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-944:


Affects Version/s: (was: 1.3)
Fix Version/s: (was: 1.3)
   2.0

Moved out of 1.3. We need to review this patch thoroughly and check that it 
does not generate noisy URLs, but this definitely looks like a good contribution.

 Increase the number of elements to look for URLs and add the ability to 
 specify multiple attributes by elements
 ---

 Key: NUTCH-944
 URL: https://issues.apache.org/jira/browse/NUTCH-944
 Project: Nutch
  Issue Type: Improvement
  Components: parser
 Environment: GNU/Linux Fedora 12
Reporter: Jean-Francois Gingras
Priority: Minor
 Fix For: 2.0

 Attachments: DOMContentUtils.java.path-1.0, 
 DOMContentUtils.java.path-1.3


 Here is a patch for DOMContentUtils.java that increases the number of elements 
 to look for URLs in. It also adds the ability to specify multiple attributes 
 per element, for example:
 linkParams.put("frame", new LinkParams("frame", "longdesc,src", 0));
 linkParams.put("object", new LinkParams("object", "classid,codebase,data,usemap", 0));
 linkParams.put("video", new LinkParams("video", "poster,src", 0)); // HTML 5
 I have a patch for release-1.0 and branch-1.3.
 I would love to hear your comments about this.
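To illustrate the lookup such a table enables, here is a small self-contained sketch; the class and method names are mine, not from the patch, and DOMContentUtils' real LinkParams type carries more fields:

```java
import java.util.HashMap;
import java.util.Map;

public class LinkParamsDemo {
    // Hypothetical simplification of DOMContentUtils' table:
    // element name -> comma-separated attributes that may hold a URL.
    static final Map<String, String> LINK_ATTRS = new HashMap<>();
    static {
        LINK_ATTRS.put("frame", "longdesc,src");
        LINK_ATTRS.put("object", "classid,codebase,data,usemap");
        LINK_ATTRS.put("video", "poster,src"); // HTML 5
    }

    // True if this element/attribute pair should be scanned for outlinks.
    static boolean mayHoldUrl(String element, String attribute) {
        String attrs = LINK_ATTRS.get(element.toLowerCase());
        if (attrs == null) return false;
        for (String a : attrs.split(",")) {
            if (a.equals(attribute.toLowerCase())) return true;
        }
        return false;
    }
}
```

During DOM traversal the parser would call `mayHoldUrl` for each attribute it encounters and collect the matching values as outlink candidates.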

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (NUTCH-972) Mergedb doesn't merge with empty directory, as is the case with merge (for indexes)

2011-04-08 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-972.
-

Resolution: Fixed

Committed revision 1090199.

Thanks Gabriele. In the future could you use 'svn diff' to generate patches? 
See [http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer] for best practices

 Mergedb doesn't merge with empty directory, as is the case with merge (for 
 indexes)
 ---

 Key: NUTCH-972
 URL: https://issues.apache.org/jira/browse/NUTCH-972
 Project: Nutch
  Issue Type: Bug
  Components: storage
Affects Versions: 1.2
Reporter: Gabriele Kahlout
Priority: Minor
  Labels: patch
 Fix For: 1.3

 Attachments: check_empty.diff


 Just an issue of unexpected behavior. This series of commands works with 
 bin/nutch merge to merge indexes but not with crawldb:
 allcrawldb=crawl/allcrawldb
 temp_crawldb=crawl/temp_crawldb
 merge_dbs="$it_crawldb $allcrawldb"
 # if [[ ! -d $allcrawldb ]]
 # then
 #   merge_dbs=$it_crawldb
 # fi
 # uncomment the above and mergedb will work fine.
 bin/nutch mergedb $temp_crawldb $merge_dbs
 rm -r $it_crawldb $allcrawldb crawl/segments crawl/linkdb
 mv $temp_crawldb $allcrawldb
 This is the exception that occurs:
 bin/nutch mergedb crawl/temp_crawldb crawl/crawldb crawl/allcrawldb
 CrawlDb merge: starting at 2011-03-27 10:13:06
 Adding crawl/crawldb
 Adding crawl/allcrawldb
 CrawlDb merge: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/simpatico/nutch-1.2/crawl/allcrawldb/current
   at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
   at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
   at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
   at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
   at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
   at org.apache.nutch.crawl.CrawlDbMerger.merge(CrawlDbMerger.java:126)
   at org.apache.nutch.crawl.CrawlDbMerger.run(CrawlDbMerger.java:187)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at org.apache.nutch.crawl.CrawlDbMerger.main(CrawlDbMerger.java:159)
 Besides the scripting workaround, I've attached a patch which skips adding the 
 empty folder to the collection of dbs to merge. I've also added a log of which 
 dbs actually get added, consistent with the merge interface.
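The idea in the attached patch can be sketched as follows. This is my own simplified illustration (class and method names invented), not the patch itself: a crawldb is usable only if it has a 'current' subdirectory, so anything else is skipped instead of being handed to the merge job.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class MergeDirs {
    // Keep only crawldb directories that contain a 'current' subdirectory;
    // log both the added and the skipped ones, like the index merge does.
    static List<File> selectMergeable(File... dbs) {
        List<File> mergeable = new ArrayList<>();
        for (File db : dbs) {
            if (new File(db, "current").isDirectory()) {
                System.out.println("Adding " + db);
                mergeable.add(db);
            } else {
                System.out.println("Skipping empty or missing db " + db);
            }
        }
        return mergeable;
    }
}
```

With this filter in place, merging against a not-yet-existing `crawl/allcrawldb` degrades to a merge of the remaining dbs instead of an InvalidInputException.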

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-976) SolrIndex constants in wrong namespace (or prefix)

2011-04-08 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017401#comment-13017401
 ] 

Julien Nioche commented on NUTCH-976:
-

Apart from 'solrindex.mapping.file' all the other params (including 
commit.size) rely on the existing 'solr.' prefix; changing the namespace *will* 
break them for sure.

Better to rename 'solrindex.mapping.file' so that it uses the same prefix as 
the existing params
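If the param is renamed rather than the constants changed, the fix would be a small edit in nutch-default.xml. A hypothetical sketch, since the final property name was not settled in this exchange:

```xml
<!-- hypothetical rename: align the mapping-file param with the existing 'solr.' prefix -->
<property>
  <name>solr.mapping.file</name>
  <value>solrindex-mapping.xml</value>
</property>
```

Existing site configurations overriding the old `solrindex.mapping.file` name would still need to be updated by hand.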

 SolrIndex constants in wrong namespace (or prefix)
 --

 Key: NUTCH-976
 URL: https://issues.apache.org/jira/browse/NUTCH-976
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.3, 2.0

 Attachments: NUTCH-976-1.3-1.patch, NUTCH-976-1.3-trunk.patch


 The shipped nutch-default.xml configuration file uses solrindex. as namespace 
 for configuration parameters but the namespace (or prefix) in SolrConstants 
 is solr instead. It should be solrindex.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-963) Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls)

2011-04-08 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017403#comment-13017403
 ] 

Julien Nioche commented on NUTCH-963:
-

Shall we create a new issue to track the progress of solrclean on the trunk? 
I'd like to release 1.3 soon and this issue will look open until we do it on 
trunk, which might take some time

 Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 
 urls)
 -

 Key: NUTCH-963
 URL: https://issues.apache.org/jira/browse/NUTCH-963
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 2.0
Reporter: Claudio Martella
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.3, 2.0

 Attachments: NUTCH-963-command-and-log4j.patch, Solr404Deleter.java, 
 SolrClean.java


 When issuing recrawls it can happen that certain urls have expired (i.e. URLs 
 that don't exist anymore and return 404).
 This patch creates a new command in the indexer that scans the crawldb 
 looking for these urls and issues delete commands to SOLR.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (NUTCH-963) Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls)

2011-04-08 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-963.
-

   Resolution: Fixed
Fix Version/s: (was: 2.0)

 Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 
 urls)
 -

 Key: NUTCH-963
 URL: https://issues.apache.org/jira/browse/NUTCH-963
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 2.0
Reporter: Claudio Martella
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.3

 Attachments: NUTCH-963-command-and-log4j.patch, Solr404Deleter.java, 
 SolrClean.java


 When issuing recrawls it can happen that certain urls have expired (i.e. URLs 
 that don't exist anymore and return 404).
 This patch creates a new command in the indexer that scans the crawldb 
 looking for these urls and issues delete commands to SOLR.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-963) Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls)

2011-04-08 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017405#comment-13017405
 ] 

Markus Jelsma commented on NUTCH-963:
-

Yes!

 Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 
 urls)
 -

 Key: NUTCH-963
 URL: https://issues.apache.org/jira/browse/NUTCH-963
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 2.0
Reporter: Claudio Martella
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.3

 Attachments: NUTCH-963-command-and-log4j.patch, Solr404Deleter.java, 
 SolrClean.java


 When issuing recrawls it can happen that certain urls have expired (i.e. URLs 
 that don't exist anymore and return 404).
 This patch creates a new command in the indexer that scans the crawldb 
 looking for these urls and issues delete commands to SOLR.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-897) Subcollection requires blacklist element

2011-04-08 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017406#comment-13017406
 ] 

Markus Jelsma commented on NUTCH-897:
-

Yes, importing NodeList is less lazy. Updated in patch.

 Subcollection requires blacklist element
 

 Key: NUTCH-897
 URL: https://issues.apache.org/jira/browse/NUTCH-897
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.3, 2.0

 Attachments: NUTCH-897-1.patch, NUTCH-897.patch


 This is a very minor issue in Subcollection.java: it throws an error if the 
 (empty) blacklist element is omitted. I think it should either not fail when 
 the blacklist element is omitted, or throw a decent error message saying that 
 the blacklist element is required. The following exception gets thrown if the 
 blacklist element is omitted in a subcollection block:
 2010-09-06 13:32:30,438 INFO  collection.CollectionManager - Instantiating CollectionManager
 2010-09-06 13:32:30,438 INFO  collection.CollectionManager - initializing CollectionManager
 2010-09-06 13:32:30,451 INFO  collection.CollectionManager - file has1 elements
 2010-09-06 13:32:30,456 WARN  collection.CollectionManager - Error occured:java.lang.NullPointerException
 2010-09-06 13:32:30,469 WARN  collection.CollectionManager - java.lang.NullPointerException
 2010-09-06 13:32:30,470 WARN  collection.CollectionManager - at org.apache.nutch.collection.Subcollection.initialize(Subcollection.java:173)
 2010-09-06 13:32:30,470 WARN  collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.parse(CollectionManager.java:98)
 2010-09-06 13:32:30,470 WARN  collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.init(CollectionManager.java:75)
 2010-09-06 13:32:30,470 WARN  collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.init(CollectionManager.java:56)
 2010-09-06 13:32:30,471 WARN  collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.getCollectionManager(CollectionManager.java:115)
 2010-09-06 13:32:30,471 WARN  collection.CollectionManager - at org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter.addSubCollectionField(SubcollectionIndexingFilter.java:65)
 2010-09-06 13:32:30,471 WARN  collection.CollectionManager - at org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter.filter(SubcollectionIndexingFilter.java:71)
 2010-09-06 13:32:30,471 WARN  collection.CollectionManager - at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:109)
 2010-09-06 13:32:30,471 WARN  collection.CollectionManager - at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:134)
 2010-09-06 13:32:30,472 WARN  collection.CollectionManager - at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
 2010-09-06 13:32:30,472 WARN  collection.CollectionManager - at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
 2010-09-06 13:32:30,472 WARN  collection.CollectionManager - at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
 2010-09-06 13:32:30,472 WARN  collection.CollectionManager - at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-897) Subcollection requires blacklist element

2011-04-08 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017410#comment-13017410
 ] 

Julien Nioche commented on NUTCH-897:
-

Looks good to me

 Subcollection requires blacklist element
 

 Key: NUTCH-897
 URL: https://issues.apache.org/jira/browse/NUTCH-897
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.3, 2.0

 Attachments: NUTCH-897-1.patch, NUTCH-897.patch


 This is a very minor issue in Subcollection.java: it throws an error if the 
 (empty) blacklist element is omitted. I think it should either not fail when 
 the blacklist element is omitted, or throw a decent error message saying that 
 the blacklist element is required. The following exception gets thrown if the 
 blacklist element is omitted in a subcollection block:
 2010-09-06 13:32:30,438 INFO  collection.CollectionManager - Instantiating CollectionManager
 2010-09-06 13:32:30,438 INFO  collection.CollectionManager - initializing CollectionManager
 2010-09-06 13:32:30,451 INFO  collection.CollectionManager - file has1 elements
 2010-09-06 13:32:30,456 WARN  collection.CollectionManager - Error occured:java.lang.NullPointerException
 2010-09-06 13:32:30,469 WARN  collection.CollectionManager - java.lang.NullPointerException
 2010-09-06 13:32:30,470 WARN  collection.CollectionManager - at org.apache.nutch.collection.Subcollection.initialize(Subcollection.java:173)
 2010-09-06 13:32:30,470 WARN  collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.parse(CollectionManager.java:98)
 2010-09-06 13:32:30,470 WARN  collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.init(CollectionManager.java:75)
 2010-09-06 13:32:30,470 WARN  collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.init(CollectionManager.java:56)
 2010-09-06 13:32:30,471 WARN  collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.getCollectionManager(CollectionManager.java:115)
 2010-09-06 13:32:30,471 WARN  collection.CollectionManager - at org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter.addSubCollectionField(SubcollectionIndexingFilter.java:65)
 2010-09-06 13:32:30,471 WARN  collection.CollectionManager - at org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter.filter(SubcollectionIndexingFilter.java:71)
 2010-09-06 13:32:30,471 WARN  collection.CollectionManager - at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:109)
 2010-09-06 13:32:30,471 WARN  collection.CollectionManager - at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:134)
 2010-09-06 13:32:30,472 WARN  collection.CollectionManager - at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
 2010-09-06 13:32:30,472 WARN  collection.CollectionManager - at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
 2010-09-06 13:32:30,472 WARN  collection.CollectionManager - at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
 2010-09-06 13:32:30,472 WARN  collection.CollectionManager - at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


All solr* commands fail in 1.3

2011-04-08 Thread Markus Jelsma

Hi devs,

Since today I've noticed that all solr* commands fail in a similar fashion:


SolrDeleteDuplicates: starting at 2011-04-08 14:17:44
SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr
Exception in thread "main" java.lang.IllegalAccessError: tried to access field org.slf4j.impl.StaticLoggerBinder.SINGLETON from class org.slf4j.LoggerFactory
        at org.slf4j.LoggerFactory.staticInitialize(LoggerFactory.java:83)
        at org.slf4j.LoggerFactory.<clinit>(LoggerFactory.java:73)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.<clinit>(CommonsHttpSolrServer.java:78)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:188)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:358)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:370)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:375)



The error can be a bit different between commands but they always end 
up with:


Exception in thread "main" java.lang.IllegalAccessError: tried to access
field org.slf4j.impl.StaticLoggerBinder.SINGLETON from class
org.slf4j.LoggerFactory
        at org.slf4j.LoggerFactory.staticInitialize(LoggerFactory.java:83)
        at org.slf4j.LoggerFactory.<clinit>(LoggerFactory.java:73)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.<clinit>(CommonsHttpSolrServer.java:78)


This happens in the current 1.3 revision, but also in a revision
(1079765) a month old and a revision (1062728) from 2011-01-24. I've no
idea what's causing the issue, but it might have something to do with my
removing ~/.ivy2 yesterday; since then everything is being downloaded
again. If you cannot reproduce it, I'm quite sure that removing the contents
of .ivy2 and doing a fresh svn export will make your Solr commands fail. I
cannot compile trunk at the moment because of Gora, and I cannot compile
Gora because of some other dependency, and haven't gotten around to fixing
that yet.
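
A quick way to confirm this kind of api/binding mismatch is to list the slf4j jars that Ivy actually pulled in and compare the version numbers. The snippet below demonstrates the check against a throwaway directory with made-up jar names; in a real tree you would point `find` at ~/.ivy2 and the build's lib directory instead:

```shell
# Mimic a lib dir containing a mismatched slf4j api/binding pair
# (jar names are illustrative only), then list and sort them.
libdir=$(mktemp -d)
touch "$libdir/slf4j-api-1.5.5.jar" "$libdir/slf4j-log4j12-1.5.11.jar"
find "$libdir" -name 'slf4j-*.jar' -printf '%f\n' | sort
# Two different version numbers in the output means the binding and
# slf4j-api are out of step.
```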


Any thoughts?

Cheers,


Re: All solr* commands fail in 1.3

2011-04-08 Thread Julien Nioche
See http://www.slf4j.org/faq.html#IllegalAccessError

This error is caused by the static initializer of the LoggerFactory class
 attempting to directly access the SINGLETON field of
 org.slf4j.impl.StaticLoggerBinder. While this was allowed in SLF4J 1.5.5
 and earlier, in 1.5.6 and later the SINGLETON field has been marked as
 private access.

 If you get the exception shown above, then you are using an older version
 of slf4j-api, e.g. 1.4.3, with a new version of a slf4j binding, e.g. 1.5.6.
 Typically, this occurs when your Maven *pom.xml* file incorporates
 hibernate 3.3.0 which declares a dependency on slf4j-api version 1.4.2. If
 your *pom.xml* declares a dependency on an slf4j binding, say
 slf4j-log4j12 version 1.5.6, then you will get illegal access errors.

'ant report' shows slf4j-api version 1.5.5 (pulled in by Solr), but our
ivy.xml lists slf4j-log4j12 version 1.5.11, so we should either revert
slf4j-log4j12 to 1.5.5 or set slf4j-api to 1.5.11.
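
Either fix amounts to pinning both artifacts to the same revision in ivy.xml. A sketch of what the aligned declarations could look like (the fragment is illustrative, not Nutch's actual ivy.xml; 1.5.11 is just one of the two options above):

```xml
<!-- Hypothetical ivy.xml fragment: keep slf4j-api and the binding
     at the same revision so the binding's StaticLoggerBinder matches
     what LoggerFactory expects. -->
<dependency org="org.slf4j" name="slf4j-api" rev="1.5.11"/>
<dependency org="org.slf4j" name="slf4j-log4j12" rev="1.5.11"/>
```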

Julien



On 8 April 2011 13:44, Markus Jelsma markus.jel...@openindex.io wrote:

 Hi devs,

 Since today i noticed that all solr* commands fail in a similar fashion:

 SolrDeleteDuplicates: starting at 2011-04-08 14:17:44
 SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr
 Exception in thread "main" java.lang.IllegalAccessError: tried to access
 field org.slf4j.impl.StaticLoggerBinder.SINGLETON from class
 org.slf4j.LoggerFactory
at org.slf4j.LoggerFactory.staticInitialize(LoggerFactory.java:83)
at org.slf4j.LoggerFactory.<clinit>(LoggerFactory.java:73)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.<clinit>(CommonsHttpSolrServer.java:78)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:188)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:358)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:370)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:375)


 The error can be a bit different between commands but they always end up
 with:

 Exception in thread "main" java.lang.IllegalAccessError: tried to access
 field org.slf4j.impl.StaticLoggerBinder.SINGLETON from class
 org.slf4j.LoggerFactory
at org.slf4j.LoggerFactory.staticInitialize(LoggerFactory.java:83)
at org.slf4j.LoggerFactory.<clinit>(LoggerFactory.java:73)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.<clinit>(CommonsHttpSolrServer.java:78)

 This happens in the current 1.3 revision but also in a revision (1079765) a
 month old and a revision (1062728) of 2011-01-24. I've no idea what's
 causing the issue but it might have something to do with me removing ~/.ivy2
 yesterday. Since then all stuff is being downloaded again. If you cannot
 reproduce then i'm quite sure that removing stuff in .ivy2 and a fresh svn
 export will make your Solr commands fail. I cannot compile trunk at the
 moment because of Gora and i cannot compile Gora because of some other
 dependency and haven't come to fixing that for now.

 Any thoughts?

 Cheers,




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


Re: All solr* commands fail in 1.3

2011-04-08 Thread Markus Jelsma
I'll open a ticket and take a look at the issue Monday or so (unless someone
beats me to it).

Do you have an explanation for why I only noticed the error after removing
~/.ivy2?

 See http://www.slf4j.org/faq.html#IllegalAccessError
 
 This error is caused by the static initializer of the LoggerFactory class

  attempting to directly access the SINGLETON field of
  org.slf4j.impl.StaticLoggerBinder. While this was allowed in SLF4J 1.5.5
  and earlier, in 1.5.6 and later the SINGLETON field has been marked as
  private access.

  If you get the exception shown above, then you are using an older version
  of slf4j-api, e.g. 1.4.3, with a new version of a slf4j binding, e.g.
  1.5.6. Typically, this occurs when your Maven *pom.xml* file
  incorporates hibernate 3.3.0 which declares a dependency on slf4j-api
  version 1.4.2. If your *pom.xml* declares a dependency on an slf4j
  binding, say
  slf4j-log4j12 version 1.5.6, then you will get illegal access errors.

 'ant report' shows slf4j-api version 1.5.5 (pulled in by Solr), but our
 ivy.xml lists slf4j-log4j12 version 1.5.11, so we should either revert
 slf4j-log4j12 to 1.5.5 or set slf4j-api to 1.5.11.
 
 Julien
 
 On 8 April 2011 13:44, Markus Jelsma markus.jel...@openindex.io wrote:
  Hi devs,
  
  Since today i noticed that all solr* commands fail in a similar fashion:
  
  SolrDeleteDuplicates: starting at 2011-04-08 14:17:44
  SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr
  Exception in thread "main" java.lang.IllegalAccessError: tried to access
  field org.slf4j.impl.StaticLoggerBinder.SINGLETON from class
  org.slf4j.LoggerFactory
 at org.slf4j.LoggerFactory.staticInitialize(LoggerFactory.java:83)
 at org.slf4j.LoggerFactory.<clinit>(LoggerFactory.java:73)
 at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.<clinit>(CommonsHttpSolrServer.java:78)
 at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:188)
 at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
 at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
 at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:358)
 at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:370)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:375)
  
  
  The error can be a bit different between commands but they always end up
  with:
  
  Exception in thread "main" java.lang.IllegalAccessError: tried to access
  field org.slf4j.impl.StaticLoggerBinder.SINGLETON from class
  org.slf4j.LoggerFactory
 at org.slf4j.LoggerFactory.staticInitialize(LoggerFactory.java:83)
 at org.slf4j.LoggerFactory.<clinit>(LoggerFactory.java:73)
 at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.<clinit>(CommonsHttpSolrServer.java:78)
  
  This happens in the current 1.3 revision but also in a revision (1079765)
  a month old and a revision (1062728) of 2011-01-24. I've no idea what's
  causing the issue but it might have something to do with me removing
  ~/.ivy2 yesterday. Since then all stuff is being downloaded again. If
  you cannot reproduce then i'm quite sure that removing stuff in .ivy2
  and a fresh svn export will make your Solr commands fail. I cannot
  compile trunk at the moment because of Gora and i cannot compile Gora
  because of some other dependency and haven't come to fixing that for
  now.
  
  Any thoughts?
  
  Cheers,


GORA dependency and build failures

2011-04-08 Thread Otis Gospodnetic
Hi,

Just curious - is the plan to wait for the GORA 0.1 release to get published 
somewhere (not familiar with Ivy, so I'm not sure where things need to get 
published), and then that will automatically fix the failing build?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



Re: GORA dependency and build failures

2011-04-08 Thread Julien Nioche
Yep. 0.1 has been released and the artifacts should be available soon

On Friday, 8 April 2011, Otis Gospodnetic ogjunk-nu...@yahoo.com wrote:
 Hi,

 Just curious - is the plan to wait for the GORA 0.1 release to get published
 somewhere (not familiar with Ivy, so I'm not sure where things need to get
 published), and then that will automatically fix the failing build?

 Thanks,
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

2011-04-08 Thread Ammar Shadiq (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ammar Shadiq updated NUTCH-978:
---

Attachment: Screenshot.png

 [GSoC 2011] A Plugin for extracting certain element of a web page on html 
 page parsing.
 ---

 Key: NUTCH-978
 URL: https://issues.apache.org/jira/browse/NUTCH-978
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.2
 Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: gsoc2011, mentor
 Fix For: 2.0

 Attachments: Screenshot.png, 
 [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, 
 app_screenshoot_configuration_result.png, 
 app_screenshoot_configuration_result_anchor.png, 
 app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png

   Original Estimate: 1680h
  Remaining Estimate: 1680h

 Nutch uses the parse-html plugin to parse web pages: it processes the contents of 
 the page by removing HTML tags and components like JavaScript and CSS, 
 leaving the extracted text to be stored in the index. By default, Nutch 
 doesn't have the capability to select certain atomic elements of an HTML page, 
 such as particular tags, particular content, or some part of the page.
 An HTML page has a tree-like XML structure, with HTML tags as its branches and text 
 as its nodes. These branches and nodes can be extracted using XPath, which 
 allows us to select a certain branch or node of an XML document and therefore can 
 be used to extract certain information and treat it differently based on its 
 content and the user's requirements. Furthermore, a web domain such as a news website 
 usually uses the same HTML code structure for storing the information on its 
 pages. This shared structure can be parsed with the same XPath 
 query to retrieve the same content elements. All of the XPath 
 queries for selecting the various content can be stored in an XPath configuration 
 file.
 Nutch is intended for a variety of web sources, and not all pages 
 retrieved from those sources share the same HTML structure, so they 
 have to be treated differently using the correct XPath configuration. The 
 selection of the correct XPath configuration can be done automatically 
 with a regex, by matching the URL of the page against the valid URL pattern for 
 that XPath configuration.
 This automatic mechanism allows a Nutch user to process a variety of web pages 
 and keep only the information the user wants, making the index 
 more accurate and its content more flexible.
 The component implementing this idea has been tested on Nutch 1.2 for selecting 
 certain elements of various news websites for the purpose of document 
 clustering. It includes a Configuration Editor application built with the 
 NetBeans 6.9 Application Framework, though it needs some debugging.
 http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip
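
The per-site extraction described above can be sketched with the JDK's built-in XPath support. This is a minimal illustration, not the proposed plugin: the HTML snippet, class name, and XPath query are made up, and a real fetched page would first need an HTML-to-DOM cleanup step (e.g. TagSoup or NekoHTML) since arbitrary HTML is rarely well-formed XML.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathExtractDemo {
    // Extract the headline from a well-formed page using an XPath query
    // that, in the proposed plugin, would come from a per-site
    // configuration file selected by matching the page URL with a regex.
    static String extractHeadline(String html) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes("UTF-8")));
        // evaluate() returns the string value of the first matching node.
        return XPathFactory.newInstance().newXPath()
                .evaluate("//div[@class='headline']", doc);
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body>"
                + "<div class=\"headline\">Ivory Coast update</div>"
                + "<div class=\"story\">Article body text.</div>"
                + "</body></html>";
        System.out.println(extractHeadline(html));  // prints: Ivory Coast update
    }
}
```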



[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

2011-04-08 Thread Ammar Shadiq (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ammar Shadiq updated NUTCH-978:
---

Attachment: (was: Screenshot.png)

 [GSoC 2011] A Plugin for extracting certain element of a web page on html 
 page parsing.
 ---

 Key: NUTCH-978
 URL: https://issues.apache.org/jira/browse/NUTCH-978
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.2
 Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: gsoc2011, mentor
 Fix For: 2.0

 Attachments: 
 [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, 
 app_guardian_ivory_coast_news_exmpl.png, 
 app_screenshoot_configuration_result.png, 
 app_screenshoot_configuration_result_anchor.png, 
 app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png

   Original Estimate: 1680h
  Remaining Estimate: 1680h




Build failed in Jenkins: Nutch-trunk #1451

2011-04-08 Thread Apache Hudson Server
See https://hudson.apache.org/hudson/job/Nutch-trunk/1451/changes

Changes:

[jnioche] NUTCH-967 Upgraded Tika to version 0.9 + changes version name for GORA

--
[...truncated 1012 lines...]
A 
src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A src/plugin/subcollection/src/java/org/apache/nutch/indexer
A 
src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A 
src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A src/plugin/subcollection/README.txt
A src/plugin/subcollection/plugin.xml
A src/plugin/subcollection/build.xml
A src/plugin/index-more
A src/plugin/index-more/ivy.xml
A src/plugin/index-more/src
A src/plugin/index-more/src/test
A src/plugin/index-more/src/test/org
A src/plugin/index-more/src/test/org/apache
A src/plugin/index-more/src/test/org/apache/nutch
A src/plugin/index-more/src/test/org/apache/nutch/indexer
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A 
src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A src/plugin/index-more/src/java
A src/plugin/index-more/src/java/org
A src/plugin/index-more/src/java/org/apache
A src/plugin/index-more/src/java/org/apache/nutch
A src/plugin/index-more/src/java/org/apache/nutch/indexer
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A src/plugin/index-more/plugin.xml
A src/plugin/index-more/build.xml
AU    src/plugin/plugin.dtd
A src/plugin/parse-ext
A src/plugin/parse-ext/ivy.xml
A src/plugin/parse-ext/src
A src/plugin/parse-ext/src/test
A src/plugin/parse-ext/src/test/org
A src/plugin/parse-ext/src/test/org/apache
A src/plugin/parse-ext/src/test/org/apache/nutch
A src/plugin/parse-ext/src/test/org/apache/nutch/parse
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A 
src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A src/plugin/parse-ext/src/java
A src/plugin/parse-ext/src/java/org
A src/plugin/parse-ext/src/java/org/apache
A src/plugin/parse-ext/src/java/org/apache/nutch
A src/plugin/parse-ext/src/java/org/apache/nutch/parse
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A 
src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A src/plugin/parse-ext/plugin.xml
A src/plugin/parse-ext/build.xml
A src/plugin/parse-ext/command
A src/plugin/urlnormalizer-pass
A src/plugin/urlnormalizer-pass/ivy.xml
A src/plugin/urlnormalizer-pass/src
A src/plugin/urlnormalizer-pass/src/test
A src/plugin/urlnormalizer-pass/src/test/org
A src/plugin/urlnormalizer-pass/src/test/org/apache
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A 
src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A 
src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU
src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A src/plugin/urlnormalizer-pass/src/java
A src/plugin/urlnormalizer-pass/src/java/org
A src/plugin/urlnormalizer-pass/src/java/org/apache
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A 
src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A 
src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU
src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU    src/plugin/urlnormalizer-pass/plugin.xml
AU    src/plugin/urlnormalizer-pass/build.xml
A src/plugin/parse-html
A src/plugin/parse-html/ivy.xml
A src/plugin/parse-html/lib
A src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A src/plugin/parse-html/src
A src/plugin/parse-html/src/test
A src/plugin/parse-html/src/test/org
A src/plugin/parse-html/src/test/org/apache
A src/plugin/parse-html/src/test/org/apache/nutch
A src/plugin/parse-html/src/test/org/apache/nutch/parse
A src/plugin/parse-html/src/test/org/apache/nutch/parse/html
A 
src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestRobotsMetaProcessor.java
A