[jira] Created: (NUTCH-951) Backport changes from 2.0 into 1.3
Backport changes from 2.0 into 1.3
----------------------------------

Key: NUTCH-951
URL: https://issues.apache.org/jira/browse/NUTCH-951
Project: Nutch
Issue Type: Task
Affects Versions: 1.3
Reporter: Julien Nioche
Priority: Blocker
Fix For: 1.3

I've compared the changes from 2.0 with 1.3 and found the following differences (excluding anything specific to 2.0/GORA):

* NUTCH-564 External parser supports encoding attribute (Antony Bowesman, mattmann)
* NUTCH-714 Need a SFTP and SCP Protocol Handler (Sanjoy Ghosh, mattmann)
* NUTCH-825 Publish nutch artifacts to central maven repository (mattmann)
* NUTCH-851 Port logging to slf4j (jnioche)
* NUTCH-861 Renamed HTMLParseFilter into ParseFilter
* NUTCH-872 Change the default fetcher.parse to FALSE (ab).
* NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab)
* NUTCH-880 REST API for Nutch (ab)
* NUTCH-883 Remove unused parameters from nutch-default.xml (jnioche)
* NUTCH-884 FetcherJob should run more reduce tasks than default (ab)
* NUTCH-886 A .gitignore file for Nutch (dogacan)
* NUTCH-894 Move statistical language identification from indexing to parsing step
* NUTCH-921 Reduce dependency of Nutch on config files (ab)
* NUTCH-930 Remove remaining dependencies on Lucene API (ab)
* NUTCH-931 Simple admin API to fetch status and stop the service (ab)
* NUTCH-932 Bulk REST API to retrieve crawl results as JSON (ab)
* NUTCH-936 LanguageIdentifier should not set empty lang field on NutchDocument (Markus Jelsma via jnioche)

Let's go through this and decide what to port to 1.3.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
Backport to 1.3 (was: Release planning)
I've finished porting the changes from 1.2 which were missing in 1.3 and were not related to the Lucene indexing or search:

- NUTCH-878 ScoringFilters should not override the injected score
- NUTCH-901 Make index-more plug-in configurable (Markus Jelsma via mattmann)
- NUTCH-905 Configurable file protocol parent directory crawling (Thorsten Scherler, mattmann, ab)
- NUTCH-855 ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing (Scott Gonyea via mattmann)
- NUTCH-716 Make subcollection index filed multivalued (Dmitry Lihachev via jnioche)

I've compared the changes from 2.0 with 1.3 and found the following differences (excluding anything specific to 2.0/GORA):

- *NUTCH-564 External parser supports encoding attribute (Antony Bowesman, mattmann)*
- NUTCH-714 Need a SFTP and SCP Protocol Handler (Sanjoy Ghosh, mattmann)
- *NUTCH-825 Publish nutch artifacts to central maven repository (mattmann)*
- NUTCH-851 Port logging to slf4j (jnioche)
- NUTCH-861 Renamed HTMLParseFilter into ParseFilter
- *NUTCH-872 Change the default fetcher.parse to FALSE (ab).*
- *NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab)*
- NUTCH-880 REST API for Nutch (ab)
- *NUTCH-883 Remove unused parameters from nutch-default.xml (jnioche)*
- *NUTCH-884 FetcherJob should run more reduce tasks than default (ab)*
- *NUTCH-886 A .gitignore file for Nutch (dogacan)*
- *NUTCH-894 Move statistical language identification from indexing to parsing step*
- *NUTCH-921 Reduce dependency of Nutch on config files (ab)*
- *NUTCH-930 Remove remaining dependencies on Lucene API (ab)*
- NUTCH-931 Simple admin API to fetch status and stop the service (ab)
- NUTCH-932 Bulk REST API to retrieve crawl results as JSON (ab)

I've created a new issue at https://issues.apache.org/jira/browse/NUTCH-951 to track this. I'd be in favour of porting only the things that are not new functionality; I have put them in bold above. Any thoughts on this?

Julien

On 4 January 2011 21:44, Julien Nioche lists.digitalpeb...@gmail.com wrote:

+1 from me. Today I committed a bunch of patches which were in 1.2 but not in 1.3 (just one last one to do), but I haven't compared with 2.0 yet.

Having a release based on 1.3 would be great, as it would be a nice transition towards 2.0 (delegated indexing/search, dependency management with Ivy, separation between local and remote deployment, removal of redundant plugins, etc.).

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

On 4 January 2011 20:27, Andrzej Bialecki a...@getopt.org wrote:

Hi users and devs,

As you probably know, there are currently two active lines of development for Nutch:

* Nutch trunk, a.k.a. Nutch 2.0: this is based on a completely redesigned storage layer that uses Apache Gora, which in turn can use various storage implementations such as HBase, Cassandra, and MySQL. This branch is still largely experimental and unstable, but work is progressing, and at the current pace I think a release should be possible within the next ~6 months. Another important addition on this branch is a REST API that allows using Nutch as a black-box crawling service.

* Nutch branch-1.3: this started as a snapshot of Nutch trunk just before merging with nutchbase (i.e. switching to Gora as a storage layer). This branch is still largely similar to the previous versions of Nutch, and uses Hadoop MapFile/SequenceFile and segments. As compared with release 1.2 it does NOT ship with any search infrastructure, because all search functionality has been delegated to Solr (via SolrIndexer). This is BTW also true of Nutch trunk.

Regarding branch-1.2 (which is a maintenance branch after release 1.2), there have been pretty much no updates there, if any. Nutch committer resources are very limited (when it comes to active committers), so I don't expect any maintenance release from this branch to happen...

I think that, considering the relatively remote release date for Nutch 2.0, it would make sense to roll out a 1.3 release based on branch-1.3, after making sure that all critical patches from trunk have been merged in there.

What do you think?

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
[jira] Updated: (NUTCH-951) Backport changes from 2.0 into 1.3
[ https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-951:
-------------------------------

Description:

I've compared the changes from 2.0 with 1.3 and found the following differences (excluding anything specific to 2.0/GORA):

* NUTCH-564 External parser supports encoding attribute (Antony Bowesman, mattmann)
* NUTCH-714 Need a SFTP and SCP Protocol Handler (Sanjoy Ghosh, mattmann)
* NUTCH-825 Publish nutch artifacts to central maven repository (mattmann)
* NUTCH-851 Port logging to slf4j (jnioche)
* NUTCH-861 Renamed HTMLParseFilter into ParseFilter
* NUTCH-872 Change the default fetcher.parse to FALSE (ab).
* NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab)
* NUTCH-880 REST API for Nutch (ab)
* NUTCH-883 Remove unused parameters from nutch-default.xml (jnioche)
* NUTCH-884 FetcherJob should run more reduce tasks than default (ab)
* NUTCH-886 A .gitignore file for Nutch (dogacan)
* NUTCH-894 Move statistical language identification from indexing to parsing step
* NUTCH-921 Reduce dependency of Nutch on config files (ab)
* NUTCH-930 Remove remaining dependencies on Lucene API (ab)
* NUTCH-931 Simple admin API to fetch status and stop the service (ab)
* NUTCH-932 Bulk REST API to retrieve crawl results as JSON (ab)

Let's go through this and decide what to port to 1.3.

was:

I've compared the changes from 2.0 with 1.3 and found the following differences (excluding anything specific to 2.0/GORA):

* NUTCH-564 External parser supports encoding attribute (Antony Bowesman, mattmann)
* NUTCH-714 Need a SFTP and SCP Protocol Handler (Sanjoy Ghosh, mattmann)
* NUTCH-825 Publish nutch artifacts to central maven repository (mattmann)
* NUTCH-851 Port logging to slf4j (jnioche)
* NUTCH-861 Renamed HTMLParseFilter into ParseFilter
* NUTCH-872 Change the default fetcher.parse to FALSE (ab).
* NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab)
* NUTCH-880 REST API for Nutch (ab)
* NUTCH-883 Remove unused parameters from nutch-default.xml (jnioche)
* NUTCH-884 FetcherJob should run more reduce tasks than default (ab)
* NUTCH-886 A .gitignore file for Nutch (dogacan)
* NUTCH-894 Move statistical language identification from indexing to parsing step
* NUTCH-921 Reduce dependency of Nutch on config files (ab)
* NUTCH-930 Remove remaining dependencies on Lucene API (ab)
* NUTCH-931 Simple admin API to fetch status and stop the service (ab)
* NUTCH-932 Bulk REST API to retrieve crawl results as JSON (ab)
* NUTCH-936 LanguageIdentifier should not set empty lang field on NutchDocument (Markus Jelsma via jnioche)

Let's go through this and decide what to port to 1.3.
[jira] Commented: (NUTCH-951) Backport changes from 2.0 into 1.3
[ https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977819#action_12977819 ]

Julien Nioche commented on NUTCH-951:
-------------------------------------

ported NUTCH-883 to 1.3 in rev 1055503
[jira] Commented: (NUTCH-951) Backport changes from 2.0 into 1.3
[ https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977824#action_12977824 ]

Julien Nioche commented on NUTCH-951:
-------------------------------------

ported NUTCH-886 to 1.3 in rev 1055512
[jira] Commented: (NUTCH-951) Backport changes from 2.0 into 1.3
[ https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977828#action_12977828 ]

Julien Nioche commented on NUTCH-951:
-------------------------------------

NUTCH-894 was written for 2.0 and would need some effort to backport to 1.3; I suggest that we leave it there.

The list of things that IMHO are worth porting to 1.3 is now:

* NUTCH-564 External parser supports encoding attribute (Antony Bowesman, mattmann)
* NUTCH-825 Publish nutch artifacts to central maven repository (mattmann)
* NUTCH-872 Change the default fetcher.parse to FALSE (ab).
* NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab)
* NUTCH-884 FetcherJob should run more reduce tasks than default (ab)
* NUTCH-921 Reduce dependency of Nutch on config files (ab)

Any volunteers?
[jira] Resolved: (NUTCH-912) MoreIndexingFilter does not parse docx and xlsx date formats
[ https://issues.apache.org/jira/browse/NUTCH-912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-912.
---------------------------------

Resolution: Fixed

branch 1.3: NUTCH-912 added to CHANGES.txt in rev 105551
trunk: committed rev 1055518

MoreIndexingFilter does not parse docx and xlsx date formats
------------------------------------------------------------

Key: NUTCH-912
URL: https://issues.apache.org/jira/browse/NUTCH-912
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.2, 1.3, 2.0
Reporter: Erlend Garåsen
Assignee: Markus Jelsma
Fix For: 1.3, 2.0
Attachments: NUTCH-912-v12-1.patch, NUTCH-912-v12-1.patch, NUTCH-912-v13-1.patch

The following error occurs in hadoop.log when MoreIndexingFilter tries to parse dates from MS Office formats:

2010-10-08 13:56:32,555 WARN more.MoreIndexingFilter - http://ridder.uio.no/test1.xlsx: can't parse erroneous date: 2010-10-08T13:55:54Z

This problem affects docx and xlsx formats, but probably the other XML-based MS Office formats as well.
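The rejected value in the NUTCH-912 log line, 2010-10-08T13:55:54Z, is an ISO 8601 timestamp in UTC. As a rough illustration only (Python rather than the plugin's Java, and not the committed patch), a parser that walks a list of candidate formats accepts it once a matching pattern is in the list. The function name `parse_date` and the contents of `CANDIDATE_FORMATS` are assumptions made for this sketch.

```python
from datetime import datetime, timezone

# Hypothetical candidate formats; the real plugin keeps its own list.
CANDIDATE_FORMATS = [
    "%a, %d %b %Y %H:%M:%S GMT",  # HTTP-style date
    "%Y-%m-%dT%H:%M:%SZ",         # ISO 8601 in UTC, the shape seen in the log
]


def parse_date(text):
    """Return a timezone-aware UTC datetime, or None if no format matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # try the next candidate format
        return parsed.replace(tzinfo=timezone.utc)
    return None
```

Returning None instead of raising lets the caller log a warning and move on, which is in the spirit of the WARN line quoted in the report.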
Re: [jira] Commented: (NUTCH-951) Backport changes from 2.0 into 1.3
Hey Julien,

I'll take care of the ones with my name on them below (NUTCH-564 and NUTCH-825).

Cheers,
Chris

On Jan 5, 2011, at 8:36 AM, Julien Nioche (JIRA) wrote:

[ https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977828#action_12977828 ]

Julien Nioche commented on NUTCH-951:
-------------------------------------

NUTCH-894 : has been written for 2.0 and would need some effort to backport to 1.3 I suggest that we leave it there.

The list of things that IMHO are worth porting to 1.3 are now

* NUTCH-564 External parser supports encoding attribute (Antony Bowesman, mattmann)
* NUTCH-825 Publish nutch artifacts to central maven repository (mattmann)
* NUTCH-872 Change the default fetcher.parse to FALSE (ab).
* NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab)
* NUTCH-884 FetcherJob should run more reduce tasks than default (ab)
* NUTCH-921 Reduce dependency of Nutch on config files (ab)

Any volunteers?

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
[jira] Commented: (NUTCH-951) Backport changes from 2.0 into 1.3
[ https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977854#action_12977854 ]

Mattmann, Chris A (388J) commented on NUTCH-951:
------------------------------------------------

Hey Julien,

I'll take care of the ones with my name on them below (NUTCH-564 and NUTCH-825).

Cheers,
Chris
[jira] Commented: (NUTCH-950) Content-Length limit, URL filter and few minor issues
[ https://issues.apache.org/jira/browse/NUTCH-950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977934#action_12977934 ]

Julien Nioche commented on NUTCH-950:
-------------------------------------

Committed 1055604 in 1.3:

NUTCH-950 DomainURLFilter throws NPE on bogus urls (Alexis Detreglode via jnioche)

Will commit for 2.0 later and review the other submissions.

Content-Length limit, URL filter and few minor issues
-----------------------------------------------------

Key: NUTCH-950
URL: https://issues.apache.org/jira/browse/NUTCH-950
Project: Nutch
Issue Type: Bug
Affects Versions: 2.0
Reporter: Alexis
Attachments: nutch1.patch, nutch2.patch, nutch3.patch, nutch4.patch

1. crawl command (nutch1.patch)
The class was renamed to Crawler but the references to it were not updated.

2. URL filter (nutch2.patch)
This avoids an NPE on bogus URLs whose host does not have a suffix.

3. Content-Length limit (nutch3.patch)
This is related to NUTCH-899. The patch prevents the entire flush operation on the Gora datastore from crashing because the MySQL blob limit was exceeded by a few bytes. Both the protocol-http and protocol-httpclient plugins were problematic.

4. Ivy configuration (nutch4.patch)
- Change xercesImpl and restlet versions. These 2 version changes are required. The first one currently makes a JUnit test crash, the second one is missing in the default Maven repository.
- Add gora-hbase, and zookeeper, which is an HBase dependency. Add MySQL connector. These jars are necessary to run Gora with HBase or MySQL datastores (more a suggestion than a requirement here).
- Add com.jcraft/jsch, which is a protocol-sftp plugin dependency.
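Item 2 of the NUTCH-950 description concerns hosts for which no domain suffix can be found. The sketch below is a loose Python illustration of that kind of guard, not the Java patch itself: the suffix lookup returns None for such hosts, and the filter checks for None before using the result. `KNOWN_SUFFIXES`, `ALLOWED`, `domain_suffix` and `filter_url` are hypothetical names, and rejecting the URL in the no-suffix case is an assumption, since the thread does not say what the patched filter returns.

```python
from urllib.parse import urlparse

# Hypothetical data for illustration only.
KNOWN_SUFFIXES = {"com", "org", "co.uk"}
ALLOWED = {"apache.org"}


def domain_suffix(host):
    """Return the longest known suffix of host, or None if there is none."""
    labels = host.lower().split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in KNOWN_SUFFIXES:
            return candidate
    return None


def filter_url(url):
    """Return url if its domain is allowed, else None."""
    host = urlparse(url).hostname
    if host is None:
        return None
    suffix = domain_suffix(host)
    if suffix is None:
        # Bogus host without a recognisable suffix: reject instead of
        # dereferencing a missing value (assumed behaviour).
        return None
    prefix = host.lower()[: -len(suffix)].rstrip(".")
    if not prefix:
        return None  # the host is a bare suffix such as "org"
    domain = prefix.split(".")[-1] + "." + suffix
    return url if domain in ALLOWED else None
```

With this data, `filter_url("http://issues.apache.org/jira")` passes the URL through, while `filter_url("http://bogus/")` is rejected without raising.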
[jira] Issue Comment Edited: (NUTCH-950) Content-Length limit, URL filter and few minor issues
[ https://issues.apache.org/jira/browse/NUTCH-950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977934#action_12977934 ]

Julien Nioche edited comment on NUTCH-950 at 1/5/11 2:56 PM:
-------------------------------------------------------------

Committed revision 1055604 in 1.3
Committed revision 1055608 for trunk

{panel}
NUTCH-950 DomainURLFilter throws NPE on bogus urls (Alexis Detreglode via jnioche)
{panel}

will review the other submissions later

was (Author: jnioche):

Committed 1055604 in 1.3

NUTCH-950 DomainURLFilter throws NPE on bogus urls (Alexis Detreglode via jnioche)

will commit for 2.0 later and review the other submissions
[jira] Updated: (NUTCH-952) fix outlink which started with '?' in html parser
[ https://issues.apache.org/jira/browse/NUTCH-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stondet updated NUTCH-952:
-------------------------

Attachment: NUTCH-952.patch

fix outlink which started with '?'

fix outlink which started with '?' in html parser
-------------------------------------------------

Key: NUTCH-952
URL: https://issues.apache.org/jira/browse/NUTCH-952
Project: Nutch
Issue Type: Bug
Components: parser
Reporter: Stondet
Attachments: NUTCH-952.patch

<a href="?w=ruby%20on%20rails&ty=c&sd=0">ruby on rails</a>
(a snippet from http://bbs.soso.com/search?ty=c&sd=0&w=rails)

Outlink parsed from the above link: http://bbs.soso.com/?w=ruby%20on%20rails&ty=c&sd=0
but the expected outlink is: http://bbs.soso.com/search?w=ruby%20on%20rails&ty=c&sd=0
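The outlink expected in NUTCH-952 matches RFC 3986 reference resolution, where a reference consisting only of a query keeps the base path and replaces just the query. The following is a small Python sketch of that rule using the standard library, not the attached patch; `resolve_outlink` is a hypothetical helper and the query strings are shortened from the ones in the report.

```python
from urllib.parse import urljoin


def resolve_outlink(base_url, href):
    """Resolve href against base_url with the standard library's RFC 3986 rules."""
    return urljoin(base_url, href.strip())


# Shortened, illustrative version of the case in the report.
base = "http://bbs.soso.com/search?w=rails"
outlink = resolve_outlink(base, "?w=ruby%20on%20rails")
print(outlink)  # http://bbs.soso.com/search?w=ruby%20on%20rails
```

The "/search" path segment survives, which is the behaviour the reporter expected; a resolver that treats "?..." like a relative path would drop it and produce "http://bbs.soso.com/?w=..." instead.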
Build failed in Hudson: Nutch-trunk #1359
See https://hudson.apache.org/hudson/job/Nutch-trunk/1359/changes

Changes:

[jnioche] NUTCH-950 DomainURLFilter throws NPE on bogus urls
[jnioche] NUTCH-935 basicurlnormalizer removes unnecessary /./ in URLs
[jnioche] NUTCH-912 MoreIndexingFilter does not parse docx and xlsx date formats

------------------------------------------
[...truncated 1006 lines...]
A    src/plugin/subcollection/src/java/org/apache/nutch
A    src/plugin/subcollection/src/java/org/apache/nutch/collection
A    src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A    src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A    src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A    src/plugin/subcollection/src/java/org/apache/nutch/indexer
A    src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A    src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A    src/plugin/subcollection/README.txt
A    src/plugin/subcollection/plugin.xml
A    src/plugin/subcollection/build.xml
A    src/plugin/index-more
A    src/plugin/index-more/ivy.xml
A    src/plugin/index-more/src
A    src/plugin/index-more/src/test
A    src/plugin/index-more/src/test/org
A    src/plugin/index-more/src/test/org/apache
A    src/plugin/index-more/src/test/org/apache/nutch
A    src/plugin/index-more/src/test/org/apache/nutch/indexer
A    src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A    src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A    src/plugin/index-more/src/java
A    src/plugin/index-more/src/java/org
A    src/plugin/index-more/src/java/org/apache
A    src/plugin/index-more/src/java/org/apache/nutch
A    src/plugin/index-more/src/java/org/apache/nutch/indexer
A    src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A    src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A    src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A    src/plugin/index-more/plugin.xml
A    src/plugin/index-more/build.xml
AU   src/plugin/plugin.dtd
A    src/plugin/parse-ext
A    src/plugin/parse-ext/ivy.xml
A    src/plugin/parse-ext/src
A    src/plugin/parse-ext/src/test
A    src/plugin/parse-ext/src/test/org
A    src/plugin/parse-ext/src/test/org/apache
A    src/plugin/parse-ext/src/test/org/apache/nutch
A    src/plugin/parse-ext/src/test/org/apache/nutch/parse
A    src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A    src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A    src/plugin/parse-ext/src/java
A    src/plugin/parse-ext/src/java/org
A    src/plugin/parse-ext/src/java/org/apache
A    src/plugin/parse-ext/src/java/org/apache/nutch
A    src/plugin/parse-ext/src/java/org/apache/nutch/parse
A    src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A    src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A    src/plugin/parse-ext/plugin.xml
A    src/plugin/parse-ext/build.xml
A    src/plugin/parse-ext/command
A    src/plugin/urlnormalizer-pass
A    src/plugin/urlnormalizer-pass/ivy.xml
A    src/plugin/urlnormalizer-pass/src
A    src/plugin/urlnormalizer-pass/src/test
A    src/plugin/urlnormalizer-pass/src/test/org
A    src/plugin/urlnormalizer-pass/src/test/org/apache
A    src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A    src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A    src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A    src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU   src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A    src/plugin/urlnormalizer-pass/src/java
A    src/plugin/urlnormalizer-pass/src/java/org
A    src/plugin/urlnormalizer-pass/src/java/org/apache
A    src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A    src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A    src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A    src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU   src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU   src/plugin/urlnormalizer-pass/plugin.xml
AU   src/plugin/urlnormalizer-pass/build.xml
A    src/plugin/parse-html
A    src/plugin/parse-html/ivy.xml
A    src/plugin/parse-html/lib
A    src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A