[jira] [Resolved] (NUTCH-967) Upgrade to Tika 0.9
[ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-967.

Resolution: Fixed
trunk: Committed revision 1090181
1.3: Committed revision 1090182

Upgrade to Tika 0.9
-------------------

Key: NUTCH-967
URL: https://issues.apache.org/jira/browse/NUTCH-967
Project: Nutch
Issue Type: Task
Components: parser
Affects Versions: 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Julien Nioche
Fix For: 1.3, 2.0
Attachments: NUTCH-967-1.3-2.patch, NUTCH-967-1.3-3.patch, NUTCH-967-1.3.patch

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017378#comment-13017378 ]

Julien Nioche commented on NUTCH-978:

Can you please explain how your proposal differs from the HTMLParseFilter mechanism that Nutch already has?

[GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
---------------------------------------------------------------------------------------

Key: NUTCH-978
URL: https://issues.apache.org/jira/browse/NUTCH-978
Project: Nutch
Issue Type: New Feature
Components: parser
Affects Versions: 1.2
Environment: Ubuntu Linux 10.10; JDK 1.6; NetBeans 6.9
Reporter: Ammar Shadiq
Assignee: Chris A. Mattmann
Priority: Minor
Labels: gsoc2011, mentor
Fix For: 2.0
Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png
Original Estimate: 1680h
Remaining Estimate: 1680h

Nutch uses the parse-html plugin to parse web pages: it processes the contents of a page by removing HTML tags and components such as JavaScript and CSS, leaving the extracted text to be stored in the index. By default, Nutch has no capability to select individual elements of an HTML page, such as particular tags, particular content, or a specific part of the page. An HTML page has a tree-like XML structure, with HTML tags as branches and text as nodes, and these branches and nodes can be extracted using XPath. XPath lets us select a particular branch or node of an XML document, and therefore extract specific pieces of information and treat them differently based on their content and the user's requirements. Furthermore, the pages of a single web domain, such as a news website, usually share the same HTML structure, so the same XPath query can retrieve the same content element from each of them.

All the XPath queries for selecting the various pieces of content could be stored in an XPath configuration file. Since Nutch targets many different web sources, and pages from different sources do not share the same HTML structure, each page has to be treated with the correct XPath configuration. The correct configuration can be selected automatically by matching the URL of the web page against a regex registered for that configuration. This mechanism lets a Nutch user process a wide variety of web pages while keeping only the information the user wants, making the index more accurate and its content more flexible. The components behind this idea have been tested on Nutch 1.2, selecting elements from various news websites for document clustering. They include a configuration editor application built with the NetBeans 6.9 Application Framework, though it still needs some debugging. http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip
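The XPath-selection idea above can be illustrated with the JDK's built-in XPath support. This is only a sketch of the concept, not the proposal's actual code: the class name, the sample markup, and the query are invented here, and a real plugin would load per-site queries from the configuration file the proposal describes (and reuse the DOM the Nutch parser already built).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XPathExtractDemo {

  /** Return the text of the first node matching the query, or null if nothing matches. */
  static String extract(String xhtml, String query) throws Exception {
    // Parse the (well-formed) page into a DOM.
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new InputSource(new StringReader(xhtml)));
    XPath xpath = XPathFactory.newInstance().newXPath();
    Node hit = (Node) xpath.evaluate(query, doc, XPathConstants.NODE);
    return hit == null ? null : hit.getTextContent();
  }

  public static void main(String[] args) throws Exception {
    String page = "<html><body>"
        + "<div class='headline'>Example headline</div>"
        + "<div class='body'>Example story text</div>"
        + "</body></html>";
    // The query would normally come from the per-site XPath configuration file,
    // chosen by matching the page URL against that configuration's regex.
    System.out.println(extract(page, "//div[@class='headline']"));
  }
}
```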
[jira] [Commented] (NUTCH-977) SolrMappingReader uses hardcoded configuration parameter name for mapping file
[ https://issues.apache.org/jira/browse/NUTCH-977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017379#comment-13017379 ]

Julien Nioche commented on NUTCH-977:

Shouldn't MAPPING_FILE be added to SolrConstants as well?

SolrMappingReader uses hardcoded configuration parameter name for mapping file
------------------------------------------------------------------------------

Key: NUTCH-977
URL: https://issues.apache.org/jira/browse/NUTCH-977
Project: Nutch
Issue Type: Bug
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.3, 2.0
Attachments: NUTCH-977-1.3.patch, NUTCH-977-trunk.patch

SolrMappingReader uses a hard-coded value for the name of the mapping-file configuration parameter; although it works, it should rely on SolrConstants instead of the hard-coded value.
[jira] [Commented] (NUTCH-976) SolrIndex constants in wrong namespace (or prefix)
[ https://issues.apache.org/jira/browse/NUTCH-976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017382#comment-13017382 ]

Julien Nioche commented on NUTCH-976:

What about changing the name of the param in the default config instead? I suppose it has been named like this to reflect the name of the mapping file (solrindex-mapping.xml). SOLR is not used for anything else but indexing, so using 'solrindex.' is a bit redundant. Not that it really matters, mind you...

SolrIndex constants in wrong namespace (or prefix)
--------------------------------------------------

Key: NUTCH-976
URL: https://issues.apache.org/jira/browse/NUTCH-976
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.3, 2.0
Attachments: NUTCH-976-1.3-trunk.patch

The shipped nutch-default.xml configuration file uses 'solrindex.' as the namespace for configuration parameters, but the namespace (or prefix) in SolrConstants is 'solr.' instead. It should be 'solrindex.'.
[jira] [Commented] (NUTCH-975) Fix missing/wrong headers in source files
[ https://issues.apache.org/jira/browse/NUTCH-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017384#comment-13017384 ]

Julien Nioche commented on NUTCH-975:

Thanks Markus. Isn't there a tool that we could use to automatically check the headers? I think I saw something similar being used on other projects. It would save the hassle of doing it manually for the trunk.

Fix missing/wrong headers in source files
-----------------------------------------

Key: NUTCH-975
URL: https://issues.apache.org/jira/browse/NUTCH-975
Project: Nutch
Issue Type: Task
Affects Versions: 1.3, 2.0
Reporter: Markus Jelsma
Priority: Blocker
Fix For: 1.3, 2.0
Attachments: NUTCH-975-1.3.patch

It seems several source files still do not contain the proper ASL headers. This includes older core code in 1.3 (indexer.NutchField etc.) and recent code in 2.0 (the API, for instance). This should be fixed (yet again). So if you spot one ;)
[jira] [Commented] (NUTCH-977) SolrMappingReader uses hardcoded configuration parameter name for mapping file
[ https://issues.apache.org/jira/browse/NUTCH-977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017387#comment-13017387 ]

Markus Jelsma commented on NUTCH-977:

It was added, but https://issues.apache.org/jira/browse/NUTCH-976 seems to contain an old patch; I'll update the patch.

SolrMappingReader uses hardcoded configuration parameter name for mapping file
------------------------------------------------------------------------------

Key: NUTCH-977
URL: https://issues.apache.org/jira/browse/NUTCH-977
Project: Nutch
Issue Type: Bug
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.3, 2.0
Attachments: NUTCH-977-1.3.patch, NUTCH-977-trunk.patch

SolrMappingReader uses a hard-coded value for the name of the mapping-file configuration parameter; although it works, it should rely on SolrConstants instead of the hard-coded value.
[jira] [Commented] (NUTCH-897) Subcollection requires blacklist element
[ https://issues.apache.org/jira/browse/NUTCH-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017386#comment-13017386 ]

Julien Nioche commented on NUTCH-897:

Nitpick: what about calling collection.getElementsByTagName(TAG_BLACKLIST) only once?

Subcollection requires blacklist element
----------------------------------------

Key: NUTCH-897
URL: https://issues.apache.org/jira/browse/NUTCH-897
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
Fix For: 1.3, 2.0
Attachments: NUTCH-897.patch

This is a very minor issue in Subcollection.java. It throws an error if the (empty) blacklist element is omitted. I think it should either not silently fail in the case of an omitted blacklist element, or throw a decent error message saying that the blacklist element is required. The following exception gets thrown if the blacklist element is omitted in a subcollection block:

2010-09-06 13:32:30,438 INFO collection.CollectionManager - Instantiating CollectionManager
2010-09-06 13:32:30,438 INFO collection.CollectionManager - initializing CollectionManager
2010-09-06 13:32:30,451 INFO collection.CollectionManager - file has1 elements
2010-09-06 13:32:30,456 WARN collection.CollectionManager - Error occured:java.lang.NullPointerException
2010-09-06 13:32:30,469 WARN collection.CollectionManager - java.lang.NullPointerException
2010-09-06 13:32:30,470 WARN collection.CollectionManager - at org.apache.nutch.collection.Subcollection.initialize(Subcollection.java:173)
2010-09-06 13:32:30,470 WARN collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.parse(CollectionManager.java:98)
2010-09-06 13:32:30,470 WARN collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.init(CollectionManager.java:75)
2010-09-06 13:32:30,470 WARN collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.init(CollectionManager.java:56)
2010-09-06 13:32:30,471 WARN collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.getCollectionManager(CollectionManager.java:115)
2010-09-06 13:32:30,471 WARN collection.CollectionManager - at org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter.addSubCollectionField(SubcollectionIndexingFilter.java:65)
2010-09-06 13:32:30,471 WARN collection.CollectionManager - at org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter.filter(SubcollectionIndexingFilter.java:71)
2010-09-06 13:32:30,471 WARN collection.CollectionManager - at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:109)
2010-09-06 13:32:30,471 WARN collection.CollectionManager - at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:134)
2010-09-06 13:32:30,472 WARN collection.CollectionManager - at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
2010-09-06 13:32:30,472 WARN collection.CollectionManager - at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
2010-09-06 13:32:30,472 WARN collection.CollectionManager - at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
2010-09-06 13:32:30,472 WARN collection.CollectionManager - at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
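The nitpick above can be sketched as follows. This is a hypothetical version of the initialization logic, not the actual Subcollection.java code: it calls getElementsByTagName(TAG_BLACKLIST) once, reuses the NodeList for both the presence check and the lookup, and tolerates an omitted blacklist element instead of throwing a NullPointerException.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class BlacklistInitDemo {
  static final String TAG_BLACKLIST = "blacklist";

  /** Read the blacklist text from a subcollection document, tolerating its absence. */
  static String readBlacklist(String subcollectionXml) throws Exception {
    Element collection = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new InputSource(new StringReader(subcollectionXml)))
        .getDocumentElement();
    // Call getElementsByTagName once and reuse the NodeList for both
    // the presence check and the item lookup.
    NodeList nodes = collection.getElementsByTagName(TAG_BLACKLIST);
    if (nodes.getLength() == 0) {
      return null; // omitted blacklist element: no NullPointerException
    }
    return nodes.item(0).getTextContent().trim();
  }

  public static void main(String[] args) throws Exception {
    // Omitted blacklist is now tolerated:
    System.out.println(readBlacklist("<subcollection><name>test</name></subcollection>"));
    // Present blacklist is read as before:
    System.out.println(readBlacklist(
        "<subcollection><blacklist>http://example.com/</blacklist></subcollection>"));
  }
}
```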
[jira] [Updated] (NUTCH-976) SolrIndex constants in wrong namespace (or prefix)
[ https://issues.apache.org/jira/browse/NUTCH-976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-976:

Attachment: NUTCH-976-1.3-1.patch

Correct patch

SolrIndex constants in wrong namespace (or prefix)
--------------------------------------------------

Key: NUTCH-976
URL: https://issues.apache.org/jira/browse/NUTCH-976
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.3, 2.0
Attachments: NUTCH-976-1.3-1.patch, NUTCH-976-1.3-trunk.patch

The shipped nutch-default.xml configuration file uses 'solrindex.' as the namespace for configuration parameters, but the namespace (or prefix) in SolrConstants is 'solr.' instead. It should be 'solrindex.'.
[jira] [Commented] (NUTCH-976) SolrIndex constants in wrong namespace (or prefix)
[ https://issues.apache.org/jira/browse/NUTCH-976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017389#comment-13017389 ]

Markus Jelsma commented on NUTCH-976:

Yes, I thought about that too, but changing the namespace to solr would break existing configurations that rely on solrindex.* params. Usually one would set commit.size to prevent OOM errors in Nutch.

SolrIndex constants in wrong namespace (or prefix)
--------------------------------------------------

Key: NUTCH-976
URL: https://issues.apache.org/jira/browse/NUTCH-976
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.3, 2.0
Attachments: NUTCH-976-1.3-1.patch, NUTCH-976-1.3-trunk.patch

The shipped nutch-default.xml configuration file uses 'solrindex.' as the namespace for configuration parameters, but the namespace (or prefix) in SolrConstants is 'solr.' instead. It should be 'solrindex.'.
[jira] [Updated] (NUTCH-944) Increase the number of elements to look for URLs and add the ability to specify multiple attributes by elements
[ https://issues.apache.org/jira/browse/NUTCH-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-944:

Affects Version/s: (was: 1.3)
Fix Version/s: (was: 1.3) 2.0

Moved out of 1.3. We need to review this patch thoroughly and check that it does not generate noisy URLs, but this definitely looks like a good contribution.

Increase the number of elements to look for URLs and add the ability to specify multiple attributes by elements
---------------------------------------------------------------------------------------------------------------

Key: NUTCH-944
URL: https://issues.apache.org/jira/browse/NUTCH-944
Project: Nutch
Issue Type: Improvement
Components: parser
Environment: GNU/Linux Fedora 12
Reporter: Jean-Francois Gingras
Priority: Minor
Fix For: 2.0
Attachments: DOMContentUtils.java.path-1.0, DOMContentUtils.java.path-1.3

Here is a patch for DOMContentUtils.java that increases the number of elements in which to look for URLs. It also adds the ability to specify multiple attributes per element, for example:

linkParams.put("frame", new LinkParams("frame", "longdesc,src", 0));
linkParams.put("object", new LinkParams("object", "classid,codebase,data,usemap", 0));
linkParams.put("video", new LinkParams("video", "poster,src", 0)); // HTML 5

I have a patch for release-1.0 and branch-1.3. I would love to hear your comments about this.
[jira] [Resolved] (NUTCH-972) Mergedb doesn't merge with empty directory, as is the case with merge (for indexes)
[ https://issues.apache.org/jira/browse/NUTCH-972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-972.

Resolution: Fixed

Committed revision 1090199. Thanks Gabriele. In the future, could you use 'svn diff' to generate patches? See http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer for best practices.

Mergedb doesn't merge with empty directory, as is the case with merge (for indexes)
-----------------------------------------------------------------------------------

Key: NUTCH-972
URL: https://issues.apache.org/jira/browse/NUTCH-972
Project: Nutch
Issue Type: Bug
Components: storage
Affects Versions: 1.2
Reporter: Gabriele Kahlout
Priority: Minor
Labels: patch
Fix For: 1.3
Attachments: check_empty.diff

Just an issue of unexpected behavior. This series of commands works with bin/nutch merge (to merge indexes) but not with crawldb:

allcrawldb=crawl/allcrawldb
temp_crawldb=crawl/temp_crawldb
merge_dbs="$it_crawldb $allcrawldb"
# if [[ ! -d $allcrawldb ]]
# then
#   merge_dbs="$it_crawldb"
# fi
# uncomment the above and mergedb will work fine.
bin/nutch mergedb $temp_crawldb $merge_dbs
rm -r $it_crawldb $allcrawldb crawl/segments crawl/linkdb
mv $temp_crawldb $allcrawldb

This is the exception that occurs:

bin/nutch mergedb crawl/temp_crawldb crawl/crawldb crawl/allcrawldb
CrawlDb merge: starting at 2011-03-27 10:13:06
Adding crawl/crawldb
Adding crawl/allcrawldb
CrawlDb merge: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/simpatico/nutch-1.2/crawl/allcrawldb/current
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.crawl.CrawlDbMerger.merge(CrawlDbMerger.java:126)
at org.apache.nutch.crawl.CrawlDbMerger.run(CrawlDbMerger.java:187)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.CrawlDbMerger.main(CrawlDbMerger.java:159)

Besides the scripting workaround, I've attached a patch which skips adding the empty folder to the collection of dbs to merge. I've also added a log of which dbs actually get added, consistent with the merge interface.
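The behaviour the patch describes can be sketched like this. The class and method names are hypothetical (the real fix lives in the attached check_empty.diff against CrawlDbMerger): any db path without an existing 'current' directory is skipped with a log line, instead of being handed to the Hadoop job and triggering the InvalidInputException above.

```java
import java.io.File;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

public class MergeInputFilterDemo {

  /** Keep only the dbs whose 'current' sub-directory exists; log what happens to each. */
  static List<String> filterDbs(List<String> dbs) {
    List<String> kept = new ArrayList<>();
    for (String db : dbs) {
      if (new File(db, "current").isDirectory()) {
        System.out.println("Adding " + db);
        kept.add(db);
      } else {
        // Previously this path went straight into the job and failed there.
        System.out.println("Skipping " + db + ": no 'current' directory");
      }
    }
    return kept;
  }

  public static void main(String[] args) throws Exception {
    // One real db layout and one missing one, mimicking the failing command above.
    File real = Files.createTempDirectory("crawldb").toFile();
    new File(real, "current").mkdir();
    List<String> kept = filterDbs(List.of(real.getPath(), "crawl/allcrawldb"));
    System.out.println("dbs to merge: " + kept.size());
  }
}
```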
[jira] [Commented] (NUTCH-976) SolrIndex constants in wrong namespace (or prefix)
[ https://issues.apache.org/jira/browse/NUTCH-976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017401#comment-13017401 ]

Julien Nioche commented on NUTCH-976:

Apart from 'solrindex.mapping.file', all the other params (including commit.size) rely on the existing 'solr.' prefix; changing the namespace *will* break them for sure. Better to rename 'solrindex.mapping.file' so that it uses the same prefix as the existing params.

SolrIndex constants in wrong namespace (or prefix)
--------------------------------------------------

Key: NUTCH-976
URL: https://issues.apache.org/jira/browse/NUTCH-976
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.3, 2.0
Attachments: NUTCH-976-1.3-1.patch, NUTCH-976-1.3-trunk.patch

The shipped nutch-default.xml configuration file uses 'solrindex.' as the namespace for configuration parameters, but the namespace (or prefix) in SolrConstants is 'solr.' instead. It should be 'solrindex.'.
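The single-prefix convention under discussion can be sketched as follows. The constant names here are hypothetical stand-ins for Nutch's SolrConstants, and java.util.Properties stands in for Hadoop's Configuration; the point is only that every lookup goes through one prefixed constant, so no caller hard-codes a parameter name.

```java
import java.util.Properties;

public class SolrParamPrefixDemo {
  // Hypothetical stand-ins for Nutch's SolrConstants; only the naming convention matters.
  static final String SOLR_PREFIX = "solr.";
  static final String MAPPING_FILE = SOLR_PREFIX + "mapping.file";
  static final String COMMIT_SIZE = SOLR_PREFIX + "commit.size";

  public static void main(String[] args) {
    // java.util.Properties stands in for Hadoop's Configuration here.
    Properties conf = new Properties();
    conf.setProperty(MAPPING_FILE, "solrindex-mapping.xml");
    conf.setProperty(COMMIT_SIZE, "1000");
    // Callers read through the constant, never through a literal string,
    // so renaming the prefix is a one-line change.
    System.out.println(conf.getProperty(MAPPING_FILE));
  }
}
```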
[jira] [Commented] (NUTCH-963) Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls)
[ https://issues.apache.org/jira/browse/NUTCH-963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017403#comment-13017403 ]

Julien Nioche commented on NUTCH-963:

Shall we create a new issue to track the progress of solrclean on the trunk? I'd like to release 1.3 soon, and this issue will look open until we do it on trunk, which might take some time.

Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls)
---------------------------------------------------------------------------------

Key: NUTCH-963
URL: https://issues.apache.org/jira/browse/NUTCH-963
Project: Nutch
Issue Type: New Feature
Components: indexer
Affects Versions: 2.0
Reporter: Claudio Martella
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.3, 2.0
Attachments: NUTCH-963-command-and-log4j.patch, Solr404Deleter.java, SolrClean.java

When issuing recrawls it can happen that certain URLs have expired (i.e. URLs that don't exist anymore and return 404). This patch creates a new command in the indexer that scans the crawldb looking for these URLs and issues delete commands to Solr.
[jira] [Resolved] (NUTCH-963) Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls)
[ https://issues.apache.org/jira/browse/NUTCH-963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-963.

Resolution: Fixed
Fix Version/s: (was: 2.0)

Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls)
---------------------------------------------------------------------------------

Key: NUTCH-963
URL: https://issues.apache.org/jira/browse/NUTCH-963
Project: Nutch
Issue Type: New Feature
Components: indexer
Affects Versions: 2.0
Reporter: Claudio Martella
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.3
Attachments: NUTCH-963-command-and-log4j.patch, Solr404Deleter.java, SolrClean.java

When issuing recrawls it can happen that certain URLs have expired (i.e. URLs that don't exist anymore and return 404). This patch creates a new command in the indexer that scans the crawldb looking for these URLs and issues delete commands to Solr.
[jira] [Commented] (NUTCH-963) Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls)
[ https://issues.apache.org/jira/browse/NUTCH-963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017405#comment-13017405 ]

Markus Jelsma commented on NUTCH-963:

Yes!

Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls)
---------------------------------------------------------------------------------

Key: NUTCH-963
URL: https://issues.apache.org/jira/browse/NUTCH-963
Project: Nutch
Issue Type: New Feature
Components: indexer
Affects Versions: 2.0
Reporter: Claudio Martella
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.3
Attachments: NUTCH-963-command-and-log4j.patch, Solr404Deleter.java, SolrClean.java

When issuing recrawls it can happen that certain URLs have expired (i.e. URLs that don't exist anymore and return 404). This patch creates a new command in the indexer that scans the crawldb looking for these URLs and issues delete commands to Solr.
[jira] [Commented] (NUTCH-897) Subcollection requires blacklist element
[ https://issues.apache.org/jira/browse/NUTCH-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017406#comment-13017406 ]

Markus Jelsma commented on NUTCH-897:

Yes, importing NodeList is less lazy. Updated in patch.

Subcollection requires blacklist element
----------------------------------------

Key: NUTCH-897
URL: https://issues.apache.org/jira/browse/NUTCH-897
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
Fix For: 1.3, 2.0
Attachments: NUTCH-897-1.patch, NUTCH-897.patch

This is a very minor issue in Subcollection.java. It throws an error if the (empty) blacklist element is omitted. I think it should either not silently fail in the case of an omitted blacklist element, or throw a decent error message saying that the blacklist element is required. The following exception gets thrown if the blacklist element is omitted in a subcollection block:

2010-09-06 13:32:30,438 INFO collection.CollectionManager - Instantiating CollectionManager
2010-09-06 13:32:30,438 INFO collection.CollectionManager - initializing CollectionManager
2010-09-06 13:32:30,451 INFO collection.CollectionManager - file has1 elements
2010-09-06 13:32:30,456 WARN collection.CollectionManager - Error occured:java.lang.NullPointerException
2010-09-06 13:32:30,469 WARN collection.CollectionManager - java.lang.NullPointerException
2010-09-06 13:32:30,470 WARN collection.CollectionManager - at org.apache.nutch.collection.Subcollection.initialize(Subcollection.java:173)
2010-09-06 13:32:30,470 WARN collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.parse(CollectionManager.java:98)
2010-09-06 13:32:30,470 WARN collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.init(CollectionManager.java:75)
2010-09-06 13:32:30,470 WARN collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.init(CollectionManager.java:56)
2010-09-06 13:32:30,471 WARN collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.getCollectionManager(CollectionManager.java:115)
2010-09-06 13:32:30,471 WARN collection.CollectionManager - at org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter.addSubCollectionField(SubcollectionIndexingFilter.java:65)
2010-09-06 13:32:30,471 WARN collection.CollectionManager - at org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter.filter(SubcollectionIndexingFilter.java:71)
2010-09-06 13:32:30,471 WARN collection.CollectionManager - at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:109)
2010-09-06 13:32:30,471 WARN collection.CollectionManager - at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:134)
2010-09-06 13:32:30,472 WARN collection.CollectionManager - at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
2010-09-06 13:32:30,472 WARN collection.CollectionManager - at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
2010-09-06 13:32:30,472 WARN collection.CollectionManager - at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
2010-09-06 13:32:30,472 WARN collection.CollectionManager - at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
[jira] [Commented] (NUTCH-897) Subcollection requires blacklist element
[ https://issues.apache.org/jira/browse/NUTCH-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017410#comment-13017410 ]

Julien Nioche commented on NUTCH-897:

Looks good to me

Subcollection requires blacklist element
----------------------------------------

Key: NUTCH-897
URL: https://issues.apache.org/jira/browse/NUTCH-897
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
Fix For: 1.3, 2.0
Attachments: NUTCH-897-1.patch, NUTCH-897.patch

This is a very minor issue in Subcollection.java. It throws an error if the (empty) blacklist element is omitted. I think it should either not silently fail in the case of an omitted blacklist element, or throw a decent error message saying that the blacklist element is required. The following exception gets thrown if the blacklist element is omitted in a subcollection block:

2010-09-06 13:32:30,438 INFO collection.CollectionManager - Instantiating CollectionManager
2010-09-06 13:32:30,438 INFO collection.CollectionManager - initializing CollectionManager
2010-09-06 13:32:30,451 INFO collection.CollectionManager - file has1 elements
2010-09-06 13:32:30,456 WARN collection.CollectionManager - Error occured:java.lang.NullPointerException
2010-09-06 13:32:30,469 WARN collection.CollectionManager - java.lang.NullPointerException
2010-09-06 13:32:30,470 WARN collection.CollectionManager - at org.apache.nutch.collection.Subcollection.initialize(Subcollection.java:173)
2010-09-06 13:32:30,470 WARN collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.parse(CollectionManager.java:98)
2010-09-06 13:32:30,470 WARN collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.init(CollectionManager.java:75)
2010-09-06 13:32:30,470 WARN collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.init(CollectionManager.java:56)
2010-09-06 13:32:30,471 WARN collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.getCollectionManager(CollectionManager.java:115)
2010-09-06 13:32:30,471 WARN collection.CollectionManager - at org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter.addSubCollectionField(SubcollectionIndexingFilter.java:65)
2010-09-06 13:32:30,471 WARN collection.CollectionManager - at org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter.filter(SubcollectionIndexingFilter.java:71)
2010-09-06 13:32:30,471 WARN collection.CollectionManager - at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:109)
2010-09-06 13:32:30,471 WARN collection.CollectionManager - at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:134)
2010-09-06 13:32:30,472 WARN collection.CollectionManager - at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
2010-09-06 13:32:30,472 WARN collection.CollectionManager - at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
2010-09-06 13:32:30,472 WARN collection.CollectionManager - at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
2010-09-06 13:32:30,472 WARN collection.CollectionManager - at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
All solr* commands fail in 1.3
Hi devs,

Since today I noticed that all solr* commands fail in a similar fashion:

SolrDeleteDuplicates: starting at 2011-04-08 14:17:44
SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr
Exception in thread "main" java.lang.IllegalAccessError: tried to access field org.slf4j.impl.StaticLoggerBinder.SINGLETON from class org.slf4j.LoggerFactory
at org.slf4j.LoggerFactory.staticInitialize(LoggerFactory.java:83)
at org.slf4j.LoggerFactory.<clinit>(LoggerFactory.java:73)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.<clinit>(CommonsHttpSolrServer.java:78)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:188)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:358)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:370)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:375)

The error can differ a bit between commands, but they always end up with:

Exception in thread "main" java.lang.IllegalAccessError: tried to access field org.slf4j.impl.StaticLoggerBinder.SINGLETON from class org.slf4j.LoggerFactory
at org.slf4j.LoggerFactory.staticInitialize(LoggerFactory.java:83)
at org.slf4j.LoggerFactory.<clinit>(LoggerFactory.java:73)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.<clinit>(CommonsHttpSolrServer.java:78)

This happens in the current 1.3 revision, but also in a revision (1079765) a month old and a revision (1062728) from 2011-01-24. I've no idea what's causing the issue, but it might have something to do with me removing ~/.ivy2 yesterday; since then, everything is being downloaded again. If you cannot reproduce this, then I'm quite sure that removing the contents of .ivy2 plus a fresh svn export will make your Solr commands fail too.

I cannot compile trunk at the moment because of Gora, and I cannot compile Gora because of some other dependency, and I haven't got round to fixing that yet.

Any thoughts?

Cheers,
Re: All solr* commands fail in 1.3
See http://www.slf4j.org/faq.html#IllegalAccessError

"This error is caused by the static initializer of the LoggerFactory class attempting to directly access the SINGLETON field of org.slf4j.impl.StaticLoggerBinder. While this was allowed in SLF4J 1.5.5 and earlier, in 1.5.6 and later the SINGLETON field has been marked as private access. If you get the exception shown above, then you are using an older version of slf4j-api, e.g. 1.4.3, with a new version of an slf4j binding, e.g. 1.5.6. Typically, this occurs when your Maven pom.xml file incorporates hibernate 3.3.0, which declares a dependency on slf4j-api version 1.4.2. If your pom.xml declares a dependency on an slf4j binding, say slf4j-log4j12 version 1.5.6, then you will get illegal access errors."

'ant report' shows slf4j-api version 1.5.5 (from SOLR), but our ivy.xml lists slf4j-log4j12 version 1.5.11, so we should either revert slf4j-log4j12 to 1.5.5 or set slf4j-api to 1.5.11.

Julien

On 8 April 2011 13:44, Markus Jelsma markus.jel...@openindex.io wrote:
[quoted message trimmed]

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
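Either way, the fix amounts to making both slf4j artifacts resolve at one and the same revision in ivy.xml. A hypothetical sketch of the aligned declarations (the org/name coordinates are the standard slf4j ones; the conf mapping and surrounding layout are illustrative, not copied from Nutch's actual ivy.xml):

```xml
<!-- Illustrative ivy.xml fragment: keep the slf4j API and the log4j binding at
     the same revision so the binding never references members that the other
     revision of the API has made private. -->
<dependency org="org.slf4j" name="slf4j-api" rev="1.5.11" conf="*->default"/>
<dependency org="org.slf4j" name="slf4j-log4j12" rev="1.5.11" conf="*->default"/>
```

Running 'ant report' again after such a change should show both artifacts at the same revision in the resolution report.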
Re: All solr* commands fail in 1.3
I'll open a ticket and take a look at the issue on Monday or so (unless someone beats me to it). Do you have an explanation for why I only noticed the error after removing ~/.ivy2?

[quoted message trimmed]
GORA dependency and build failures
Hi, Just curious - is the plan to wait for the GORA 0.1 release to get published somewhere (not familiar with Ivy, so I'm not sure where things need to get published), and then that will automatically fix the failing build? Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
Re: GORA dependency and build failures
Yep. 0.1 has been released and the artifacts should be available soon.

On Friday, 8 April 2011, Otis Gospodnetic ogjunk-nu...@yahoo.com wrote:
[quoted message trimmed]

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
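Once the 0.1 artifacts reach a resolvable repository, the build picks them up through an ordinary Ivy dependency declaration. The coordinates below are illustrative only (the actual org/module names are whatever Gora publishes under), but they show the shape of the change:

```xml
<!-- Illustrative only: depend on the released Gora artifact rather than an
     unpublished snapshot, so a clean ~/.ivy2 can resolve the build. -->
<dependency org="org.apache.gora" name="gora-core" rev="0.1"/>
```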
[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ammar Shadiq updated NUTCH-978:
---
Attachment: Screenshot.png

[GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Key: NUTCH-978
URL: https://issues.apache.org/jira/browse/NUTCH-978
Project: Nutch
Issue Type: New Feature
Components: parser
Affects Versions: 1.2
Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Assignee: Chris A. Mattmann
Priority: Minor
Labels: gsoc2011, mentor
Fix For: 2.0
Attachments: Screenshot.png, [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png
Original Estimate: 1680h
Remaining Estimate: 1680h

Nutch uses the parse-html plugin to parse web pages: it processes the contents of a page by removing HTML tags and components such as JavaScript and CSS, leaving the extracted text to be stored in the index. By default, Nutch has no capability to select particular atomic elements of an HTML page: certain tags, certain content, some part of the page, and so on. An HTML page has a tree-like XML structure, with HTML tags as its branches and text as its nodes, and these branches and nodes can be extracted using XPath. XPath allows us to select a particular branch or node of an XML document, and can therefore be used to extract specific pieces of information and treat each differently based on its content and the user's requirements. Furthermore, pages within one web domain, such as a news website, usually share the same HTML structure, so the same XPath query can be applied across them to retrieve the same content elements. All of the XPath queries for selecting various content can be stored in an XPath configuration file.

Nutch is meant for a variety of web sources, and not all pages retrieved from those sources share the same HTML structure, so each has to be treated with the correct XPath configuration. The correct configuration can be selected automatically by matching the URL of the page, via regex, against the URL patterns that are valid for that configuration. This automatic mechanism lets a Nutch user process a variety of pages and keep only the information the user wants, making the index more accurate and its content more flexible. The components for this idea have been tested on Nutch 1.2, selecting elements from various news websites for the purpose of document clustering. This includes a configuration editor application built with the NetBeans 6.9 Application Framework, though it still needs some debugging. http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
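The mechanism the proposal describes, per-site XPath queries selected by URL regex, can be sketched in plain Java with javax.xml.xpath. Everything here is hypothetical: the class name, the example pattern, and the inline page are invented for illustration, and a real plugin would parse tag-soup HTML (e.g. via TagSoup, as parse-html does) rather than requiring well-formed markup as this sketch does.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathScraperSketch {

    // Hypothetical stand-in for the proposal's "XPath Configuration File":
    // a URL pattern mapped to the XPath query to run on matching pages.
    static final Map<Pattern, String> CONFIG = new LinkedHashMap<>();
    static {
        CONFIG.put(Pattern.compile(".*example-news\\.com/.*"),
                   "//div[@class='headline']");
    }

    // Return the XPath for the first pattern matching the page URL, or null.
    static String selectXPath(String url) {
        for (Map.Entry<Pattern, String> e : CONFIG.entrySet()) {
            if (e.getKey().matcher(url).matches()) return e.getValue();
        }
        return null;
    }

    // Apply the selected XPath to a well-formed (XHTML-like) page and
    // return the string value of the first matching node.
    static String extract(String url, String xhtml) throws Exception {
        String expr = selectXPath(url);
        if (expr == null) return null;
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        xhtml.getBytes(StandardCharsets.UTF_8)));
        return (String) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.STRING);
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div class='headline'>Ivory Coast update</div>"
                + "</body></html>";
        // Prints "Ivory Coast update"
        System.out.println(extract("http://example-news.com/article/1", page));
    }
}
```

The LinkedHashMap keeps pattern evaluation order deterministic, so more specific URL patterns can be listed before catch-alls, which is the behaviour the automatic configuration selection described above would need.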
[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ammar Shadiq updated NUTCH-978:
---
Attachment: (was: Screenshot.png)

[GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Key: NUTCH-978
URL: https://issues.apache.org/jira/browse/NUTCH-978
Project: Nutch
Issue Type: New Feature
Components: parser
Affects Versions: 1.2
Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Assignee: Chris A. Mattmann
Priority: Minor
Labels: gsoc2011, mentor
Fix For: 2.0
Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png
Original Estimate: 1680h
Remaining Estimate: 1680h

[issue description repeated in original notification trimmed]
Build failed in Jenkins: Nutch-trunk #1451
See https://hudson.apache.org/hudson/job/Nutch-trunk/1451/changes

Changes:

[jnioche] NUTCH-967 Upgraded Tika to version 0.9 + changes version name for GORA

--
[...truncated 1012 lines...]