Re: [RESULT] [VOTE] Move 2.0 out of trunk

2011-09-22 Thread Julien Nioche
+1 thanks Chris

On 22 September 2011 04:12, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Guys,

 If no one objects, I will execute the move Friday by 12pm PDT.

 Will that work?

 Cheers,
 Chris

 On Sep 21, 2011, at 3:09 AM, Julien Nioche wrote:

  Hi Folks,
 
  Okey dok, this VOTE has passed with the following tallies:
 
  +1 PMC
  Markus Jelsma
  Sami Siren
  Chris Mattmann
  Lewis John McGibbney
  Dennis Kubes
  Julien Nioche
  Andrzej Bialecki
 
  -1 PMC
  Alexis de Tréglodé
 
  -1 Community
  Radim Kola
 
 
  Accordingly we will move the current Nutch trunk to a new branch
 nutchgora and then will move the current 1.4-development branch into trunk.
 
  I assume the two commands below would do the trick?
 
  svn mv https://svn.apache.org/repos/asf/nutch/trunk https://svn.apache.org/repos/asf/nutch/branches/nutchgora
  svn mv https://svn.apache.org/repos/asf/nutch/branches/branch-1.4/ https://svn.apache.org/repos/asf/nutch/trunk
 
 
  Thanks
 
  Julien
 
 
 
 
  On 18 September 2011 10:21, Julien Nioche lists.digitalpeb...@gmail.com
 wrote:
  Hi,
 
  Following the discussions [1] on the dev-list about the future of Nutch
 2.0, I would like to call for a vote on moving Nutch 2.0 from the trunk to a
 separate branch, promote 1.4 to trunk and consider 2.0 as unmaintained. The
 arguments for / against can be found in the thread I mentioned.
 
  The vote is open for the next 72 hours.
 
  [ ] +1 : Shelve 2.0 and move 1.4 to trunk
  [ ] 0 : No opinion
  [ ] -1 : Bad idea.  Please give justification.
 
  Thanks
 
  Julien
 
  [1]
 http://www.mail-archive.com/gora-dev@incubator.apache.org/msg00483.html
 
  --
 
  Open Source Solutions for Text Engineering
 
  http://digitalpebble.blogspot.com/
  http://www.digitalpebble.com
 
 
 


 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.a.mattm...@nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


[jira] [Commented] (NUTCH-1115) Option to disable fixing of embedded params in DomContentUtils

2011-09-22 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112576#comment-13112576
 ] 

Julien Nioche commented on NUTCH-1115:
--

+1 Don't forget to add the same logic to DomContentUtils in Parse-Html 

 Option to disable fixing of embedded params in DomContentUtils
 --

 Key: NUTCH-1115
 URL: https://issues.apache.org/jira/browse/NUTCH-1115
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.3
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.4

 Attachments: NUTCH-1115-1.4-1.patch


 Add an option to disable fixing of embedded params:
 http://lucene.472066.n3.nabble.com/Outlinks-with-embedded-params-td3332396.html
 When the fix is enabled, millions of crap URLs are output as outlinks. This results in 
 many 404s in the DB and many very long URLs that actually lead to the same 
 page.
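
To make the effect of such a switch concrete, here is a minimal Java sketch. It assumes the "fix" works by carrying the `;params` section of the base URL over to each relative outlink; the class name, the method name and the boolean flag are illustrative only and are not taken from the attached patch or from DomContentUtils.

```java
final class EmbeddedParamsSketch {

  /**
   * Hypothetical helper: copies the ";params" part of the base URL onto the
   * link target when the fix is switched on, and leaves the target alone otherwise.
   */
  static String adjustTarget(String base, String target, boolean fixEmbeddedParams) {
    int start = base.indexOf(';');
    // Nothing to do if the fix is off, the base has no params,
    // or the target already carries its own params.
    if (!fixEmbeddedParams || start < 0 || target.indexOf(';') >= 0) {
      return target;
    }
    // Take the params up to (but not including) any query string of the base.
    int end = base.indexOf('?', start);
    String params = end >= 0 ? base.substring(start, end) : base.substring(start);
    // Params go after the target's path but before its query string.
    int query = target.indexOf('?');
    return query >= 0
        ? target.substring(0, query) + params + target.substring(query)
        : target + params;
  }

  public static void main(String[] args) {
    String base = "http://example.com/a;jsessionid=123";
    System.out.println(adjustTarget(base, "b.html?x=1", true));
    System.out.println(adjustTarget(base, "b.html?x=1", false));
  }
}
```

With the flag on, every relative outlink found on a page with embedded params gets those params appended, which is how one page can turn into many distinct long URLs; with the flag off the outlink is left untouched.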

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1115) Option to disable fixing of embedded params in DomContentUtils

2011-09-22 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1115:
-

Attachment: NUTCH-1115-1.4-2.patch

Yes. Here's the complete patch for both parser implementations and the 
nutch-default section.

 Option to disable fixing of embedded params in DomContentUtils
 --

 Key: NUTCH-1115
 URL: https://issues.apache.org/jira/browse/NUTCH-1115
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.3
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.4

 Attachments: NUTCH-1115-1.4-1.patch, NUTCH-1115-1.4-2.patch






[jira] [Resolved] (NUTCH-1115) Option to disable fixing of embedded params in DomContentUtils

2011-09-22 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-1115.
--

Resolution: Fixed

Committed for 1.4 in rev. 1174147. Fixes a significant pollution of the crawldb.





[jira] [Commented] (NUTCH-1078) Upgrade all instances of commons logging to slf4j (with log4j backend)

2011-09-22 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112600#comment-13112600
 ] 

Markus Jelsma commented on NUTCH-1078:
--

Push it in Lewis! I'll fix whatever breaks here :)
cheers

 Upgrade all instances of commons logging to slf4j (with log4j backend)
 --

 Key: NUTCH-1078
 URL: https://issues.apache.org/jira/browse/NUTCH-1078
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.4

 Attachments: NUTCH-1078-branch-1.4-20110816.patch, 
 NUTCH-1078-branch-1.4-20110824-v2.patch, 
 NUTCH-1078-branch-1.4-20110911-v3.patch, 
 NUTCH-1078-branch-1.4-20110916-v4.patch


 Whilst working on another issue, I noticed that some classes still import and 
 use commons logging, for example HttpBase.java:
 {code}
 import java.util.*;
 // Commons Logging imports
 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;
 // Nutch imports
 import org.apache.nutch.crawl.CrawlDatum;
 {code}
 At this stage I am unsure how many others (if any) still import and rely 
 upon commons logging; however, they should be upgraded to slf4j for branch-1.4.





Re: [RESULT] [VOTE] Move 2.0 out of trunk

2011-09-22 Thread Markus Jelsma
Cheers!


-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


[jira] [Resolved] (NUTCH-1092) overhaul FAQ's and publish to Nutch site

2011-09-22 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1092.
-

Resolution: Fixed
  Assignee: Lewis John McGibbney

Completed; this will be fully functional once NUTCH-1093 is completed. 

 overhaul FAQ's and publish to Nutch site
 

 Key: NUTCH-1092
 URL: https://issues.apache.org/jira/browse/NUTCH-1092
 Project: Nutch
  Issue Type: Sub-task
  Components: documentation
Affects Versions: 1.4, 2.0
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.4, 2.0


 We require a complete overhaul of the FAQs on the Wiki. Once this is 
 accomplished, they need to be pushed to the Nutch site. 





[jira] [Resolved] (NUTCH-1078) Upgrade all instances of commons logging to slf4j (with log4j backend)

2011-09-22 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1078.
-

Resolution: Fixed

Committed @ revision 1174191.

Would like to say a big thanks to everyone for keeping me right on this one. Ta 





[jira] [Closed] (NUTCH-1078) Upgrade all instances of commons logging to slf4j (with log4j backend)

2011-09-22 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-1078.
---






[jira] [Commented] (NUTCH-1078) Upgrade all instances of commons logging to slf4j (with log4j backend)

2011-09-22 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112664#comment-13112664
 ] 

Markus Jelsma commented on NUTCH-1078:
--

Cheers! Only had to recommit the CHANGELOG entry for NUTCH-1115, which was 
committed a few hours ago.





Re: [RESULT] [VOTE] Move 2.0 out of trunk

2011-09-22 Thread Markus Jelsma
If you're on it anyway, are you going to push out a little news post on the 
site?



[jira] [Commented] (NUTCH-1074) topN is ignored with maxNumSegments

2011-09-22 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112847#comment-13112847
 ] 

Markus Jelsma commented on NUTCH-1074:
--

Yes! I overlooked generate.max.count and you're right. Could you attach your 
patch to the issue with a flag for approval of inclusion in Nutch? So we can 
test it more and include if all goes well.

 topN is ignored with maxNumSegments
 ---

 Key: NUTCH-1074
 URL: https://issues.apache.org/jira/browse/NUTCH-1074
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.3
Reporter: Markus Jelsma
 Fix For: 1.4


 When generating segments with topN and maxNumSegments, topN is not respected. 
 It looks like the first generated segment contains topN * maxNumSegments 
 URLs; at least the number of map input records roughly matches.
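
For illustration, the behaviour the report appears to expect (each segment capped at topN URLs, at most maxNumSegments segments per run) can be sketched in Java as below. This is only a sketch of that expectation under the assumption that topN is meant as a per-segment limit; the class and method names are hypothetical and this is not the Nutch generator code.

```java
import java.util.ArrayList;
import java.util.List;

final class SegmentCapSketch {

  /** Splits ranked URLs into at most maxNumSegments segments of at most topN URLs each. */
  static List<List<String>> partition(List<String> ranked, int topN, int maxNumSegments) {
    if (topN <= 0 || maxNumSegments <= 0) {
      throw new IllegalArgumentException("topN and maxNumSegments must be positive");
    }
    List<List<String>> segments = new ArrayList<>();
    // Walk the ranked list in steps of topN and stop once enough segments exist.
    for (int from = 0; from < ranked.size() && segments.size() < maxNumSegments; from += topN) {
      int to = Math.min(from + topN, ranked.size());
      segments.add(new ArrayList<>(ranked.subList(from, to)));
    }
    return segments;
  }

  public static void main(String[] args) {
    List<String> urls = List.of("u1", "u2", "u3", "u4", "u5", "u6", "u7");
    // topN = 3, maxNumSegments = 2: two segments of three URLs; u7 waits for a later run.
    System.out.println(partition(urls, 3, 2));
  }
}
```

The reported symptom corresponds to the first segment receiving all topN * maxNumSegments URLs instead of only the first topN.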





[jira] [Assigned] (NUTCH-1074) topN is ignored with maxNumSegments

2011-09-22 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma reassigned NUTCH-1074:


Assignee: Markus Jelsma

 topN is ignored with maxNumSegments
 ---

 Key: NUTCH-1074
 URL: https://issues.apache.org/jira/browse/NUTCH-1074
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.3
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.4






[jira] [Reopened] (NUTCH-1078) Upgrade all instances of commons logging to slf4j (with log4j backend)

2011-09-22 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma reopened NUTCH-1078:
--


Too bad, the fetcher's logging is now partially broken, something I didn't see 
happening in a controlled environment since there were no exceptions. This is 
what's happening in a production environment:

{code}
2011-09-22 19:36:02,046 ERROR org.apache.nutch.util.LogUtil: Cannot log with method [null]
java.lang.NullPointerException
    at org.apache.nutch.util.LogUtil$1.flush(LogUtil.java:103)
    at java.io.PrintStream.write(PrintStream.java:432)
    at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
    at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
    at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:85)
    at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:168)
    at java.io.PrintStream.newLine(PrintStream.java:496)
    at java.io.PrintStream.println(PrintStream.java:774)
    at java.lang.Throwable.printStackTrace(Throwable.java:461)
    at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:197)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:665)
{code}

Can you check it out?





[jira] [Commented] (NUTCH-1078) Upgrade all instances of commons logging to slf4j (with log4j backend)

2011-09-22 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112857#comment-13112857
 ] 

Markus Jelsma commented on NUTCH-1078:
--

This one also pops up in the fetcher:

{code}
2011-09-22 19:44:26,929 ERROR org.apache.nutch.util.LogUtil: Cannot init log methods
java.lang.NoSuchMethodException: org.slf4j.Logger.trace(java.lang.Object)
    at java.lang.Class.getMethod(Class.java:1605)
    at org.apache.nutch.util.LogUtil.<clinit>(LogUtil.java:48)
    at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:197)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:665)
{code}
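
The exception message suggests a reflective lookup of a `trace(Object)` method on the slf4j `Logger`, which only declares `trace` variants that take a `String` first. The Java sketch below reproduces that kind of failure with two stand-in interfaces defined locally (they are hypothetical and merely mimic the two method shapes); it is not the LogUtil code, only an illustration of why `Class.getMethod` would throw here and how a failed lookup could leave a null method behind for a later call to trip over.

```java
import java.lang.reflect.Method;

final class ReflectiveLogLookupSketch {

  /** Stand-in for a commons-logging style API: trace takes an Object. */
  interface ObjectLog { void trace(Object message); }

  /** Stand-in for an slf4j style API: trace takes a String. */
  interface StringLog { void trace(String message); }

  /** Looks up trace(Object) reflectively; returns null when no such method exists. */
  static Method findTraceObject(Class<?> loggerType) {
    try {
      // getMethod matches the declared parameter types exactly,
      // so trace(String) does not satisfy a request for trace(Object).
      return loggerType.getMethod("trace", Object.class);
    } catch (NoSuchMethodException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    System.out.println(findTraceObject(ObjectLog.class) != null);
    System.out.println(findTraceObject(StringLog.class) != null);
  }
}
```

If the calling code keeps the null result and later invokes it without a check, a NullPointerException like the one in the earlier trace would be a plausible follow-on symptom.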





Re: [RESULT] [VOTE] Move 2.0 out of trunk

2011-09-22 Thread Mattmann, Chris A (388J)
Sure, will do.

Cheers,
Chris

On Sep 22, 2011, at 10:12 AM, Markus Jelsma wrote:

 If you're on it anyway, are you going to push out a little news post on the 
 site?
 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



[jira] [Updated] (NUTCH-1074) topN is ignored with maxNumSegments

2011-09-22 Thread Robert Thomson (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Thomson updated NUTCH-1074:
--

Attachment: generator_fix.patch

Patch to make generate.max.count and topN work together

 topN is ignored with maxNumSegments
 ---

 Key: NUTCH-1074
 URL: https://issues.apache.org/jira/browse/NUTCH-1074
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.3
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.4

 Attachments: generator_fix.patch






[jira] [Commented] (NUTCH-1115) Option to disable fixing of embedded params in DomContentUtils

2011-09-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113148#comment-13113148
 ] 

Hudson commented on NUTCH-1115:
---

Integrated in Nutch-branch-1.4 #14 (See 
[https://builds.apache.org/job/Nutch-branch-1.4/14/])
Recommitted CHANGELOG entry for NUTCH-1115. Was overwritten by NUTCH-1078 
commit
NUTCH-1115 Option to disable fixing of URL embedded parameters in 
DomContentUtils

markus : 
http://svn.apache.org/viewvc/nutch/branches/branch-1.4/viewvc/?view=revroot=revision=1174222
Files : 
* /nutch/branches/branch-1.4/CHANGES.txt

markus : 
http://svn.apache.org/viewvc/nutch/branches/branch-1.4/viewvc/?view=revroot=revision=1174147
Files : 
* /nutch/branches/branch-1.4/conf/nutch-default.xml
* 
/nutch/branches/branch-1.4/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
* 
/nutch/branches/branch-1.4/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java

