[jira] Created: (NUTCH-951) Backport changes from 2.0 into 1.3

2011-01-05 Thread Julien Nioche (JIRA)
Backport changes from 2.0 into 1.3
--

 Key: NUTCH-951
 URL: https://issues.apache.org/jira/browse/NUTCH-951
 Project: Nutch
  Issue Type: Task
Affects Versions: 1.3
Reporter: Julien Nioche
Priority: Blocker
 Fix For: 1.3


I've compared the changes from 2.0 with 1.3 and found the following differences 
(excluding anything specific to 2.0/GORA)

*  NUTCH-564 External parser supports encoding attribute (Antony Bowesman, 
mattmann)
*  NUTCH-714 Need a SFTP and SCP Protocol Handler (Sanjoy Ghosh, mattmann)
*  NUTCH-825 Publish nutch artifacts to central maven repository (mattmann)
*  NUTCH-851 Port logging to slf4j (jnioche)
*  NUTCH-861 Renamed HTMLParseFilter into ParseFilter
*  NUTCH-872 Change the default fetcher.parse to FALSE (ab).
*  NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab)
*  NUTCH-880 REST API for Nutch (ab)
*  NUTCH-883 Remove unused parameters from nutch-default.xml (jnioche)
*  NUTCH-884 FetcherJob should run more reduce tasks than default (ab)
*  NUTCH-886 A .gitignore file for Nutch (dogacan)
*  NUTCH-894 Move statistical language identification from indexing to 
parsing step
*  NUTCH-921 Reduce dependency of Nutch on config files (ab)
*  NUTCH-930 Remove remaining dependencies on Lucene API (ab)
*  NUTCH-931 Simple admin API to fetch status and stop the service (ab)
*  NUTCH-932 Bulk REST API to retrieve crawl results as JSON (ab)
*  NUTCH-936 LanguageIdentifier should not set empty lang field on 
NutchDocument (Markus Jelsma via jnioche)

Let's go through this and decide what to port to 1.3

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Backport to 1.3 (was: Release planning)

2011-01-05 Thread Julien Nioche
I've finished porting the changes from 1.2 which were missing in 1.3 and
were not related to the Lucene indexing or search

   - NUTCH-878 ScoringFilters should not override the injected score
   - NUTCH-901 Make index-more plug-in configurable (Markus Jelsma via
   mattmann)
   - NUTCH-905 Configurable file protocol parent directory crawling
   (Thorsten Scherler, mattmann, ab)
   - NUTCH-855 ScoringFilter and IndexingFilter: To allow for the
   propagation of URL Metatags and their subsequent indexing (Scott Gonyea via
   mattmann)
   - NUTCH-716 Make subcollection index filed multivalued (Dmitry Lihachev
   via jnioche)

I've compared the changes from 2.0 with 1.3 and found the following
differences (excluding anything specific to 2.0/GORA)

   - * NUTCH-564 External parser supports encoding attribute (Antony
   Bowesman, mattmann)*
   -  NUTCH-714 Need a SFTP and SCP Protocol Handler (Sanjoy Ghosh,
   mattmann)
   - * NUTCH-825 Publish nutch artifacts to central maven repository
   (mattmann)*
   -  NUTCH-851 Port logging to slf4j (jnioche)
   -  NUTCH-861 Renamed HTMLParseFilter into ParseFilter
   - * NUTCH-872 Change the default fetcher.parse to FALSE (ab).*
   - * NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab)*
   -  NUTCH-880 REST API for Nutch (ab)
   - * NUTCH-883 Remove unused parameters from nutch-default.xml (jnioche)*
   - * NUTCH-884 FetcherJob should run more reduce tasks than default (ab)*
   - * NUTCH-886 A .gitignore file for Nutch (dogacan)*
   - * NUTCH-894 Move statistical language identification from indexing to
   parsing step*
   - * NUTCH-921 Reduce dependency of Nutch on config files (ab)*
   - * NUTCH-930 Remove remaining dependencies on Lucene API (ab)*
   -  NUTCH-931 Simple admin API to fetch status and stop the service (ab)
   -  NUTCH-932 Bulk REST API to retrieve crawl results as JSON (ab)

I've created a new issue at
https://issues.apache.org/jira/browse/NUTCH-951 to track this. I'd be
in favour of porting only the things that are not new
functionality; I've put those in bold above.

Any thoughts on this?

Julien

On 4 January 2011 21:44, Julien Nioche lists.digitalpeb...@gmail.com wrote:

 +1 from me. I've committed today a bunch of patches which were in 1.2 but
 not in 1.3 (just one last one to do) but haven't compared with 2.0

 Having a release based on 1.3 would be great as it would be a nice
 transition towards 2.0 (delegate indexing/search, dependency management with
 Ivy, separation between local and remote deployment, removal of redundant
 plugins etc...).

 Julien

 --
 *
 *Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com


 On 4 January 2011 20:27, Andrzej Bialecki a...@getopt.org wrote:

 Hi users & devs,

 As you probably know, there are currently two active lines of development
 for Nutch:

 * Nutch trunk, a.k.a. Nutch 2.0: this is based on a completely redesigned
 storage layer that uses Apache Gora, which in turn can use various storage
 implementations such as HBase, Cassandra, and MySQL. This branch is still
 largely experimental and unstable, but work is progressing, and at the
 current pace I think a release should be possible within the next ~6 months.
 Another important addition on this branch is a REST API that allows using
 Nutch as a black-box crawling service.

 * Nutch branch-1.3: this started as a snapshot of Nutch trunk just before
 merging with nutchbase (i.e. switching to Gora as a storage layer). This
 branch is still largely similar to the previous versions of Nutch, and uses
 Hadoop MapFile/SequenceFile and segments. As compared with release 1.2 it
 does NOT ship with any search infrastructure, because all search
 functionality has been delegated to Solr (via SolrIndexer). This is BTW also
 true about Nutch trunk.

 Regarding branch-1.2 (which is a maintenance branch after release 1.2)
 there have been pretty much no updates there, if any. Nutch committer resources
 are very limited (when it comes to active committers), so I don't expect any
 maintenance release from this branch to happen...

 I think that considering the relatively remote release date for Nutch 2.0
 it would make sense to roll out a 1.3 release based on branch-1.3, after
 making sure that all critical patches from trunk have been merged in there.

 What do you think?

 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com







-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


[jira] Updated: (NUTCH-951) Backport changes from 2.0 into 1.3

2011-01-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-951:


Description: 
I've compared the changes from 2.0 with 1.3 and found the following differences 
(excluding anything specific to 2.0/GORA)

*  NUTCH-564 External parser supports encoding attribute (Antony Bowesman, 
mattmann)
*  NUTCH-714 Need a SFTP and SCP Protocol Handler (Sanjoy Ghosh, mattmann)
*  NUTCH-825 Publish nutch artifacts to central maven repository (mattmann)
*  NUTCH-851 Port logging to slf4j (jnioche)
*  NUTCH-861 Renamed HTMLParseFilter into ParseFilter
*  NUTCH-872 Change the default fetcher.parse to FALSE (ab).
*  NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab)
*  NUTCH-880 REST API for Nutch (ab)
*  NUTCH-883 Remove unused parameters from nutch-default.xml (jnioche)
*  NUTCH-884 FetcherJob should run more reduce tasks than default (ab)
*  NUTCH-886 A .gitignore file for Nutch (dogacan)
*  NUTCH-894 Move statistical language identification from indexing to 
parsing step
*  NUTCH-921 Reduce dependency of Nutch on config files (ab)
*  NUTCH-930 Remove remaining dependencies on Lucene API (ab)
*  NUTCH-931 Simple admin API to fetch status and stop the service (ab)
*  NUTCH-932 Bulk REST API to retrieve crawl results as JSON (ab)

Let's go through this and decide what to port to 1.3

  was:
I've compared the changes from 2.0 with 1.3 and found the following differences 
(excluding anything specific to 2.0/GORA)

*  NUTCH-564 External parser supports encoding attribute (Antony Bowesman, 
mattmann)
*  NUTCH-714 Need a SFTP and SCP Protocol Handler (Sanjoy Ghosh, mattmann)
*  NUTCH-825 Publish nutch artifacts to central maven repository (mattmann)
*  NUTCH-851 Port logging to slf4j (jnioche)
*  NUTCH-861 Renamed HTMLParseFilter into ParseFilter
*  NUTCH-872 Change the default fetcher.parse to FALSE (ab).
*  NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab)
*  NUTCH-880 REST API for Nutch (ab)
*  NUTCH-883 Remove unused parameters from nutch-default.xml (jnioche)
*  NUTCH-884 FetcherJob should run more reduce tasks than default (ab)
*  NUTCH-886 A .gitignore file for Nutch (dogacan)
*  NUTCH-894 Move statistical language identification from indexing to 
parsing step
*  NUTCH-921 Reduce dependency of Nutch on config files (ab)
*  NUTCH-930 Remove remaining dependencies on Lucene API (ab)
*  NUTCH-931 Simple admin API to fetch status and stop the service (ab)
*  NUTCH-932 Bulk REST API to retrieve crawl results as JSON (ab)
*  NUTCH-936 LanguageIdentifier should not set empty lang field on 
NutchDocument (Markus Jelsma via jnioche)

Let's go through this and decide what to port to 1.3





[jira] Commented: (NUTCH-951) Backport changes from 2.0 into 1.3

2011-01-05 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977819#action_12977819
 ] 

Julien Nioche commented on NUTCH-951:
-

ported NUTCH-883 to 1.3 in rev 1055503




[jira] Commented: (NUTCH-951) Backport changes from 2.0 into 1.3

2011-01-05 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977824#action_12977824
 ] 

Julien Nioche commented on NUTCH-951:
-

ported NUTCH-886 to 1.3 in rev 1055512




[jira] Commented: (NUTCH-951) Backport changes from 2.0 into 1.3

2011-01-05 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977828#action_12977828
 ] 

Julien Nioche commented on NUTCH-951:
-

NUTCH-894 was written for 2.0 and would need some effort to backport to 1.3.
I suggest that we leave it there.

The things that IMHO are worth porting to 1.3 are now:

* NUTCH-564 External parser supports encoding attribute (Antony Bowesman, 
mattmann)
* NUTCH-825 Publish nutch artifacts to central maven repository (mattmann)
* NUTCH-872 Change the default fetcher.parse to FALSE (ab).
* NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab)
* NUTCH-884 FetcherJob should run more reduce tasks than default (ab)
* NUTCH-921 Reduce dependency of Nutch on config files (ab)

Any volunteers?





[jira] Resolved: (NUTCH-912) MoreIndexingFilter does not parse docx and xlsx date formats

2011-01-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-912.
-

Resolution: Fixed

branch 1.3 : NUTCH-912 added to CHANGES.txt in rev 105551
trunk : committed rev 1055518

 MoreIndexingFilter does not parse docx and xlsx date formats
 

 Key: NUTCH-912
 URL: https://issues.apache.org/jira/browse/NUTCH-912
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.2, 1.3, 2.0
Reporter: Erlend Garåsen
Assignee: Markus Jelsma
 Fix For: 1.3, 2.0

 Attachments: NUTCH-912-v12-1.patch, NUTCH-912-v12-1.patch, 
 NUTCH-912-v13-1.patch


 The following error occurs in hadoop.log when MoreIndexingFilter tries to 
 parse dates from MS Office formats:
 2010-10-08 13:56:32,555 WARN  more.MoreIndexingFilter - 
 http://ridder.uio.no/test1.xlsx: can't parse erroneous date: 
 2010-10-08T13:55:54Z
 This problem affects docx and xlsx formats, but probably the other XML-based 
 MS Office formats as well.
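
 For illustration only: the string in the warning above is an ISO 8601 UTC
 timestamp. The sketch below is written in Python rather than Nutch's Java and
 is not the committed patch. It shows one way a date parser can try several
 formats in turn so that this form is accepted. The function name `parse_date`
 and the `FORMATS` list are assumptions made for the example.

```python
from datetime import datetime, timezone

# Hypothetical list of accepted formats; the real filter's list is not shown in this thread.
FORMATS = [
    "%Y-%m-%dT%H:%M:%SZ",         # ISO 8601 UTC, the form seen in the warning
    "%a, %d %b %Y %H:%M:%S GMT",  # RFC 1123 style HTTP date
]


def parse_date(value):
    """Return a timezone-aware UTC datetime, or None if no format matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue  # try the next candidate format
    return None


parsed = parse_date("2010-10-08T13:55:54Z")
```

 Returning None instead of raising lets the caller log a warning and carry on
 with the rest of the document.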




Re: [jira] Commented: (NUTCH-951) Backport changes from 2.0 into 1.3

2011-01-05 Thread Mattmann, Chris A (388J)
Hey Julien,

I'll take care of the ones with my name on them below (NUTCH-564 and NUTCH-825).

Cheers,
Chris

On Jan 5, 2011, at 8:36 AM, Julien Nioche (JIRA) wrote:

 
[ 
 https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977828#action_12977828
  ] 
 
 Julien Nioche commented on NUTCH-951:
 -
 
 NUTCH-894 : has been written for 2.0 and would need some effort to backport 
 to 1.3 
 I suggest that we leave it there. 
 
 The list of things that IMHO are worth porting to 1.3 are now 
 
* NUTCH-564 External parser supports encoding attribute (Antony Bowesman, 
 mattmann)
* NUTCH-825 Publish nutch artifacts to central maven repository (mattmann)
* NUTCH-872 Change the default fetcher.parse to FALSE (ab).
* NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab)
* NUTCH-884 FetcherJob should run more reduce tasks than default (ab)
* NUTCH-921 Reduce dependency of Nutch on config files (ab)
 
 Any volunteers?
 
 
 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



[jira] Commented: (NUTCH-951) Backport changes from 2.0 into 1.3

2011-01-05 Thread Mattmann, Chris A (388J) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977854#action_12977854
 ] 

Mattmann, Chris A (388J) commented on NUTCH-951:


Hey Julien,

I'll take care of the ones with my name on them below (NUTCH-564 and NUTCH-825).

Cheers,
Chris











[jira] Commented: (NUTCH-950) Content-Length limit, URL filter and few minor issues

2011-01-05 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977934#action_12977934
 ] 

Julien Nioche commented on NUTCH-950:
-

Committed 1055604 in 1.3

 NUTCH-950 DomainURLFilter throws NPE on bogus urls (Alexis Detreglode via 
jnioche)

will commit for 2.0 later and review the other submissions



 Content-Length limit, URL filter and few minor issues
 -

 Key: NUTCH-950
 URL: https://issues.apache.org/jira/browse/NUTCH-950
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.0
Reporter: Alexis
 Attachments: nutch1.patch, nutch2.patch, nutch3.patch, nutch4.patch


 1. crawl command (nutch1.patch)
 The class was renamed to Crawler but the references to it were not updated.
 2. URL filter (nutch2.patch)
 This avoids an NPE on bogus URLs whose host does not have a suffix.
 3. Content-Length limit (nutch3.patch)
 This is related to NUTCH-899.
 The patch prevents the entire flush operation on the Gora datastore from
 crashing when the MySQL blob limit is exceeded by a few bytes. Both protocol-http
 and protocol-httpclient plugins were problematic.
 4. Ivy configuration (nutch4.patch)
 - Change the xercesImpl and restlet versions. These two version changes are
 required: the first one currently makes a JUnit test crash, the second one is
 missing from the default Maven repository.
 - Add gora-hbase and zookeeper (an HBase dependency). Add the MySQL
 connector. These jars are necessary to run Gora with the HBase or MySQL
 datastores (more a suggestion than a requirement here).
 - Add com.jcraft/jsch, which is a protocol-sftp plugin dependency.
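
 As a rough illustration of item 2 (the URL filter), the sketch below shows the
 general idea of guarding against a host that has no recognisable suffix before
 matching it against allowed domains. It is written in Python rather than
 Nutch's Java, it is not the attached patch, and `KNOWN_SUFFIXES`, `ALLOWED`,
 `get_suffix` and `filter_url` are hypothetical names invented for the example.

```python
from urllib.parse import urlsplit

# Hypothetical data for illustration; not Nutch's real suffix list or configuration.
KNOWN_SUFFIXES = {"com", "org", "no"}
ALLOWED = {"apache.org", "uio.no"}


def get_suffix(host):
    """Return the host's known suffix, or None when it has none (e.g. 'localhost')."""
    last = host.rsplit(".", 1)[-1].lower()
    return last if "." in host and last in KNOWN_SUFFIXES else None


def filter_url(url):
    """Return the URL if it is accepted, or None if it is rejected."""
    try:
        host = urlsplit(url).hostname
    except ValueError:
        return None  # malformed URL: reject instead of failing
    if not host:
        return None  # no host at all: reject
    suffix = get_suffix(host)
    if suffix is None:
        return None  # the guard: without a suffix there is nothing to match
    host = host.lower()
    domain = ".".join(host.split(".")[-2:])
    if suffix in ALLOWED or domain in ALLOWED or host in ALLOWED:
        return url
    return None
```

 With these assumed sets, `filter_url("http://localhost/x")` returns None rather
 than raising, while `filter_url("http://nutch.apache.org/")` returns the URL
 unchanged.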




[jira] Issue Comment Edited: (NUTCH-950) Content-Length limit, URL filter and few minor issues

2011-01-05 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977934#action_12977934
 ] 

Julien Nioche edited comment on NUTCH-950 at 1/5/11 2:56 PM:
-

Committed revision 1055604 in 1.3
Committed revision 1055608 for trunk

{panel}
 NUTCH-950 DomainURLFilter throws NPE on bogus urls (Alexis Detreglode via 
jnioche)
{panel}

will review the other submissions later



  was (Author: jnioche):
Committed 1055604 in 1.3

 NUTCH-950 DomainURLFilter throws NPE on bogus urls (Alexis Detreglode via 
jnioche)

will commit for 2.0 later and review the other submissions


  



[jira] Updated: (NUTCH-952) fix outlink which started with '?' in html parser

2011-01-05 Thread Stondet (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stondet updated NUTCH-952:
--

Attachment: NUTCH-952.patch

fix outlink which started with '?'

 fix outlink which started with '?' in html parser
 -

 Key: NUTCH-952
 URL: https://issues.apache.org/jira/browse/NUTCH-952
 Project: Nutch
  Issue Type: Bug
  Components: parser
Reporter: Stondet
 Attachments: NUTCH-952.patch


 <a href="?w=ruby%20on%20rails&ty=c&sd=0">ruby on rails</a> (a snippet from
 http://bbs.soso.com/search?ty=c&sd=0&w=rails)
 outlink parsed from above link:
 http://bbs.soso.com/?w=ruby%20on%20rails&ty=c&sd=0
 but expected is http://bbs.soso.com/search?w=ruby%20on%20rails&ty=c&sd=0
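
 For reference, RFC 3986 (section 5.2.2) says that a relative reference with an
 empty path and a query keeps the base URL's path and replaces only the query,
 which matches the expected outlink in this report. The sketch below checks
 that with Python's standard library rather than with Nutch's Java code. The
 `&` separators in the URLs are an assumption, since they appear to have been
 lost from the archived snippet.

```python
from urllib.parse import urljoin

# Assumed reconstruction of the reported example ('&' separators restored).
base = "http://bbs.soso.com/search?ty=c&sd=0&w=rails"
href = "?w=ruby%20on%20rails&ty=c&sd=0"

# A query-only reference keeps the base path ("/search") and swaps the query.
resolved = urljoin(base, href)
print(resolved)  # http://bbs.soso.com/search?w=ruby%20on%20rails&ty=c&sd=0
```

 A resolver that drops the last path segment before appending the query would
 produce the "/?w=..." form reported above.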




Build failed in Hudson: Nutch-trunk #1359

2011-01-05 Thread Apache Hudson Server
See https://hudson.apache.org/hudson/job/Nutch-trunk/1359/changes

Changes:

[jnioche] NUTCH-950 DomainURLFilter throws NPE on bogus urls

[jnioche] NUTCH-935 basicurlnormalizer removes unnecessary /./ in URLs

[jnioche] NUTCH-912 MoreIndexingFilter does not parse docx and xlsx date formats

--
[...truncated 1006 lines...]
A src/plugin/subcollection/src/java/org/apache/nutch
A src/plugin/subcollection/src/java/org/apache/nutch/collection
A src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A src/plugin/subcollection/src/java/org/apache/nutch/indexer
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A src/plugin/subcollection/README.txt
A src/plugin/subcollection/plugin.xml
A src/plugin/subcollection/build.xml
A src/plugin/index-more
A src/plugin/index-more/ivy.xml
A src/plugin/index-more/src
A src/plugin/index-more/src/test
A src/plugin/index-more/src/test/org
A src/plugin/index-more/src/test/org/apache
A src/plugin/index-more/src/test/org/apache/nutch
A src/plugin/index-more/src/test/org/apache/nutch/indexer
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A src/plugin/index-more/src/java
A src/plugin/index-more/src/java/org
A src/plugin/index-more/src/java/org/apache
A src/plugin/index-more/src/java/org/apache/nutch
A src/plugin/index-more/src/java/org/apache/nutch/indexer
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A src/plugin/index-more/plugin.xml
A src/plugin/index-more/build.xml
AU src/plugin/plugin.dtd
A src/plugin/parse-ext
A src/plugin/parse-ext/ivy.xml
A src/plugin/parse-ext/src
A src/plugin/parse-ext/src/test
A src/plugin/parse-ext/src/test/org
A src/plugin/parse-ext/src/test/org/apache
A src/plugin/parse-ext/src/test/org/apache/nutch
A src/plugin/parse-ext/src/test/org/apache/nutch/parse
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A src/plugin/parse-ext/src/java
A src/plugin/parse-ext/src/java/org
A src/plugin/parse-ext/src/java/org/apache
A src/plugin/parse-ext/src/java/org/apache/nutch
A src/plugin/parse-ext/src/java/org/apache/nutch/parse
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A src/plugin/parse-ext/plugin.xml
A src/plugin/parse-ext/build.xml
A src/plugin/parse-ext/command
A src/plugin/urlnormalizer-pass
A src/plugin/urlnormalizer-pass/ivy.xml
A src/plugin/urlnormalizer-pass/src
A src/plugin/urlnormalizer-pass/src/test
A src/plugin/urlnormalizer-pass/src/test/org
A src/plugin/urlnormalizer-pass/src/test/org/apache
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A src/plugin/urlnormalizer-pass/src/java
A src/plugin/urlnormalizer-pass/src/java/org
A src/plugin/urlnormalizer-pass/src/java/org/apache
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU src/plugin/urlnormalizer-pass/plugin.xml
AU src/plugin/urlnormalizer-pass/build.xml
A src/plugin/parse-html
A src/plugin/parse-html/ivy.xml
A src/plugin/parse-html/lib
A src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A