Re: [RESULT] [VOTE] Move 2.0 out of trunk
+1 thanks Chris

On 22 September 2011 04:12, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote:

Guys,

If no one objects, I will execute the move Friday by 12pm PDT. Will that work?

Cheers,
Chris

On Sep 21, 2011, at 3:09 AM, Julien Nioche wrote:

Hi Folks,

Okey dok, this VOTE has passed with the following tallies:

+1 PMC
Markus Jelsma
Sami Siren
Chris Mattmann
Lewis John McGibbney
Dennis Kubes
Julien Nioche
Andrzej Bialecki

-1 PMC
Alexis de Tréglodé

-1 Community
Radim Kola

Accordingly we will move the current Nutch trunk to a new branch nutchgora and then will move the current 1.4-development branch into trunk. I assume the two commands below would do the trick?

svn mv https://svn.apache.org/repos/asf/nutch/trunk https://svn.apache.org/repos/asf/nutch/branches/nutchgora
svn mv https://svn.apache.org/repos/asf/nutch/branches/branch-1.4/ https://svn.apache.org/repos/asf/nutch/trunk

Thanks

Julien

On 18 September 2011 10:21, Julien Nioche lists.digitalpeb...@gmail.com wrote:

Hi,

Following the discussions [1] on the dev-list about the future of Nutch 2.0, I would like to call for a vote on moving Nutch 2.0 from the trunk to a separate branch, promote 1.4 to trunk and consider 2.0 as unmaintained. The arguments for / against can be found in the thread I mentioned.

The vote is open for the next 72 hours.

[ ] +1 : Shelve 2.0 and move 1.4 to trunk
[ ] 0 : No opinion
[ ] -1 : Bad idea. Please give justification.

Thanks

Julien

[1] http://www.mail-archive.com/gora-dev@incubator.apache.org/msg00483.html

Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
[jira] [Commented] (NUTCH-1115) Option to disable fixing of embedded params in DomContentUtils
[ https://issues.apache.org/jira/browse/NUTCH-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112576#comment-13112576 ]

Julien Nioche commented on NUTCH-1115:

+1 Don't forget to add the same logic to DomContentUtils in Parse-Html

Option to disable fixing of embedded params in DomContentUtils

Key: NUTCH-1115
URL: https://issues.apache.org/jira/browse/NUTCH-1115
Project: Nutch
Issue Type: Improvement
Components: parser
Affects Versions: 1.3
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.4
Attachments: NUTCH-1115-1.4-1.patch

Add option to disable fixing of embedded params: http://lucene.472066.n3.nabble.com/Outlinks-with-embedded-params-td3332396.html

When enabled, millions of crap URLs are output as outlinks. This results in many 404s in the DB and many very long URLs that actually lead to the same page.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1115) Option to disable fixing of embedded params in DomContentUtils
[ https://issues.apache.org/jira/browse/NUTCH-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1115:

Attachment: NUTCH-1115-1.4-2.patch

Yes. Here's the complete patch for both parser implementations and the nutch-default section.
[jira] [Resolved] (NUTCH-1115) Option to disable fixing of embedded params in DomContentUtils
[ https://issues.apache.org/jira/browse/NUTCH-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1115.

Resolution: Fixed

Committed for 1.4 in rev. 1174147. Fixes a significant pollution of the crawldb.
[jira] [Commented] (NUTCH-1078) Upgrade all instances of commons logging to slf4j (with log4j backend)
[ https://issues.apache.org/jira/browse/NUTCH-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112600#comment-13112600 ]

Markus Jelsma commented on NUTCH-1078:

Push it in Lewis! I'll fix whatever breaks here :) cheers

Upgrade all instances of commons logging to slf4j (with log4j backend)

Key: NUTCH-1078
URL: https://issues.apache.org/jira/browse/NUTCH-1078
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
Fix For: 1.4
Attachments: NUTCH-1078-branch-1.4-20110816.patch, NUTCH-1078-branch-1.4-20110824-v2.patch, NUTCH-1078-branch-1.4-20110911-v3.patch, NUTCH-1078-branch-1.4-20110916-v4.patch

Whilst working on another issue, I noticed that some classes still import and use commons logging, for example HttpBase.java:

{code}
import java.util.*;

// Commons Logging imports
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Nutch imports
import org.apache.nutch.crawl.CrawlDatum;
{code}

At this stage I am unsure how many (if any) others still import and rely upon commons logging; however, they should be upgraded to slf4j for branch-1.4.
Re: [RESULT] [VOTE] Move 2.0 out of trunk
Cheers!

On Thursday 22 September 2011 05:12:27 Mattmann, Chris A (388J) wrote:

Guys,

If no one objects, I will execute the move Friday by 12pm PDT. Will that work?

Cheers,
Chris

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
[jira] [Resolved] (NUTCH-1092) overhaul FAQ's and publish to Nutch site
[ https://issues.apache.org/jira/browse/NUTCH-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved NUTCH-1092.

Resolution: Fixed
Assignee: Lewis John McGibbney

Completed; this will be fully functional once NUTCH-1093 is completed.

overhaul FAQ's and publish to Nutch site

Key: NUTCH-1092
URL: https://issues.apache.org/jira/browse/NUTCH-1092
Project: Nutch
Issue Type: Sub-task
Components: documentation
Affects Versions: 1.4, 2.0
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Fix For: 1.4, 2.0

We require a complete overhaul of the FAQs on the Wiki. Once this is accomplished they need to be pushed into the Nutch site.
[jira] [Resolved] (NUTCH-1078) Upgrade all instances of commons logging to slf4j (with log4j backend)
[ https://issues.apache.org/jira/browse/NUTCH-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved NUTCH-1078.

Resolution: Fixed

Committed @ revision 1174191. Would like to say a big thanks to everyone for keeping me right on this one. Ta
[jira] [Closed] (NUTCH-1078) Upgrade all instances of commons logging to slf4j (with log4j backend)
[ https://issues.apache.org/jira/browse/NUTCH-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney closed NUTCH-1078.
[jira] [Commented] (NUTCH-1078) Upgrade all instances of commons logging to slf4j (with log4j backend)
[ https://issues.apache.org/jira/browse/NUTCH-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112664#comment-13112664 ]

Markus Jelsma commented on NUTCH-1078:

Cheers! Only had to recommit the CHANGELOG entry for NUTCH-1115, which was committed a few hours ago.
Re: [RESULT] [VOTE] Move 2.0 out of trunk
If you're on it anyway, are you going to push out a little news post on the site?

Guys,

If no one objects, I will execute the move Friday by 12pm PDT. Will that work?

Cheers,
Chris
[jira] [Commented] (NUTCH-1074) topN is ignored with maxNumSegments
[ https://issues.apache.org/jira/browse/NUTCH-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112847#comment-13112847 ]

Markus Jelsma commented on NUTCH-1074:

Yes! I overlooked generate.max.count and you're right. Could you attach your patch to the issue with a flag for approval of inclusion in Nutch? So we can test it more and include if all goes well.

topN is ignored with maxNumSegments

Key: NUTCH-1074
URL: https://issues.apache.org/jira/browse/NUTCH-1074
Project: Nutch
Issue Type: Bug
Components: generator
Affects Versions: 1.3
Reporter: Markus Jelsma
Fix For: 1.4

When generating segments with topN and maxNumSegments, topN is not respected. It looks like the first generated segment contains topN * maxNumSegments URLs; at least the number of map input records roughly matches.
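For readers following NUTCH-1074, the behaviour the report expects can be sketched as below. This is only an illustration of the limit described in the issue (each segment holds at most topN URLs, and at most maxNumSegments segments are produced); it is not the Nutch Generator, and the class `SegmentLimitSketch` and method `partition` are hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustration only: not the Nutch Generator, just the limit the report expects. */
final class SegmentLimitSketch {

    /**
     * Splits an already ranked list into at most maxNumSegments segments,
     * each holding at most topN entries. Entries beyond those limits are left out.
     */
    static <T> List<List<T>> partition(List<T> ranked, int topN, int maxNumSegments) {
        if (topN <= 0 || maxNumSegments <= 0) {
            throw new IllegalArgumentException("topN and maxNumSegments must be positive");
        }
        List<List<T>> segments = new ArrayList<>();
        int from = 0;
        while (from < ranked.size() && segments.size() < maxNumSegments) {
            // Use long arithmetic so a very large topN cannot overflow.
            int to = (int) Math.min((long) from + topN, ranked.size());
            segments.add(new ArrayList<>(ranked.subList(from, to)));
            from = to;
        }
        return segments;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("u1", "u2", "u3", "u4", "u5", "u6", "u7");
        // topN = 3, maxNumSegments = 2: two segments of three URLs each; u7 is left out.
        System.out.println(partition(urls, 3, 2));
    }
}
```

The bug as reported corresponds to the first segment receiving all topN * maxNumSegments entries instead of only topN.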
[jira] [Assigned] (NUTCH-1074) topN is ignored with maxNumSegments
[ https://issues.apache.org/jira/browse/NUTCH-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma reassigned NUTCH-1074:

Assignee: Markus Jelsma
[jira] [Reopened] (NUTCH-1078) Upgrade all instances of commons logging to slf4j (with log4j backend)
[ https://issues.apache.org/jira/browse/NUTCH-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma reopened NUTCH-1078:

Too bad, the fetcher's logging is now partially broken, something I didn't see happening in a controlled environment since there were no exceptions. This is what's happening in a production environment:

{code}
2011-09-22 19:36:02,046 ERROR org.apache.nutch.util.LogUtil: Cannot log with method [null]
java.lang.NullPointerException
    at org.apache.nutch.util.LogUtil$1.flush(LogUtil.java:103)
    at java.io.PrintStream.write(PrintStream.java:432)
    at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
    at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
    at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:85)
    at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:168)
    at java.io.PrintStream.newLine(PrintStream.java:496)
    at java.io.PrintStream.println(PrintStream.java:774)
    at java.lang.Throwable.printStackTrace(Throwable.java:461)
    at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:197)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:665)
{code}

Can you check it out?
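The trace in the reopened report passes through Throwable.printStackTrace and PrintStream.println before failing inside a flush() call, which suggests that the stack trace is printed into a stream whose flush forwards the text to the logger. A minimal sketch of that pattern follows, assuming nothing about the real LogUtil beyond what the trace shows; `StackTraceSinkSketch`, `streamTo` and `capture` are hypothetical names, and the sink is a plain `Consumer<String>` rather than a logger.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.function.Consumer;

/** Illustration only: a hypothetical stand-in, not Nutch's LogUtil. */
final class StackTraceSinkSketch {

    /** Returns an auto-flushing PrintStream that hands each flushed chunk of text to the sink. */
    static PrintStream streamTo(final Consumer<String> sink) {
        // Fail here, once, rather than on every flush if the sink is missing.
        Objects.requireNonNull(sink, "sink");
        ByteArrayOutputStream buffer = new ByteArrayOutputStream() {
            @Override
            public synchronized void flush() {
                String text = toString().trim();
                reset();
                if (!text.isEmpty()) {
                    sink.accept(text);
                }
            }
        };
        return new PrintStream(buffer, true); // true = flush on println
    }

    /** Prints the stack trace of t through the forwarding stream and returns the captured lines. */
    static List<String> capture(Throwable t) {
        List<String> lines = new ArrayList<>();
        t.printStackTrace(streamTo(lines::add));
        return lines;
    }

    public static void main(String[] args) {
        for (String line : capture(new IllegalStateException("boom"))) {
            System.out.println("WARN " + line);
        }
    }
}
```

Checking the sink when the stream is created is one way to avoid a NullPointerException surfacing on every printed line, as it does in the report.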
[jira] [Commented] (NUTCH-1078) Upgrade all instances of commons logging to slf4j (with log4j backend)
[ https://issues.apache.org/jira/browse/NUTCH-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112857#comment-13112857 ]

Markus Jelsma commented on NUTCH-1078:

This one also pops up in the fetcher:

{code}
2011-09-22 19:44:26,929 ERROR org.apache.nutch.util.LogUtil: Cannot init log methods
java.lang.NoSuchMethodException: org.slf4j.Logger.trace(java.lang.Object)
    at java.lang.Class.getMethod(Class.java:1605)
    at org.apache.nutch.util.LogUtil.<clinit>(LogUtil.java:48)
    at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:197)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:665)
{code}
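The NoSuchMethodException above is consistent with a reflective lookup that asks for trace(Object): commons logging's Log declares trace(Object), while slf4j's Logger declares trace(String), and Class.getMethod matches parameter types exactly. The sketch below reproduces that mismatch with two hypothetical stand-in interfaces so it runs without either library; `TraceLookupSketch`, `ObjectStyleLog`, `StringStyleLog` and `findTrace` are names invented for this illustration.

```java
import java.lang.reflect.Method;

/** Illustration only: stand-in interfaces, not the real logging APIs. */
final class TraceLookupSketch {

    /** Stand-in for a commons-logging style logger: trace takes an Object. */
    interface ObjectStyleLog {
        void trace(Object message);
    }

    /** Stand-in for an slf4j style logger: trace takes a String. */
    interface StringStyleLog {
        void trace(String message);
    }

    /** Looks up trace(paramType) on logType; returns null when no such method exists. */
    static Method findTrace(Class<?> logType, Class<?> paramType) {
        try {
            return logType.getMethod("trace", paramType);
        } catch (NoSuchMethodException e) {
            return null; // getMethod requires an exact parameter type match
        }
    }

    public static void main(String[] args) {
        System.out.println(findTrace(ObjectStyleLog.class, Object.class) != null);
        System.out.println(findTrace(StringStyleLog.class, Object.class) != null);
        System.out.println(findTrace(StringStyleLog.class, String.class) != null);
    }
}
```

A lookup that fails this way leaves no method to invoke later, which would fit the "Cannot log with method [null]" error in the earlier report, though that link is an inference from the two traces rather than something the thread confirms.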
Re: [RESULT] [VOTE] Move 2.0 out of trunk
Sure, will do.

Cheers,
Chris

On Sep 22, 2011, at 10:12 AM, Markus Jelsma wrote:

If you're on it anyway, are you going to push out a little news post on the site?

Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
[jira] [Updated] (NUTCH-1074) topN is ignored with maxNumSegments
[ https://issues.apache.org/jira/browse/NUTCH-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Thomson updated NUTCH-1074:

Attachment: generator_fix.patch

Patch to make generate.max.count and topN work together
[jira] [Commented] (NUTCH-1115) Option to disable fixing of embedded params in DomContentUtils
[ https://issues.apache.org/jira/browse/NUTCH-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113148#comment-13113148 ]

Hudson commented on NUTCH-1115:

Integrated in Nutch-branch-1.4 #14 (See [https://builds.apache.org/job/Nutch-branch-1.4/14/])

Recommitted CHANGELOG entry for NUTCH-1115. Was overwritten by NUTCH-1078 commit
NUTCH-1115 Option to disable fixing of URL embedded parameters in DomContentUtils

markus : http://svn.apache.org/viewvc/nutch/branches/branch-1.4/viewvc/?view=revroot=revision=1174222
Files :
* /nutch/branches/branch-1.4/CHANGES.txt

markus : http://svn.apache.org/viewvc/nutch/branches/branch-1.4/viewvc/?view=revroot=revision=1174147
Files :
* /nutch/branches/branch-1.4/conf/nutch-default.xml
* /nutch/branches/branch-1.4/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
* /nutch/branches/branch-1.4/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java