[jira] [Updated] (NUTCH-1304) GeneratorMapper.java dosen't return when skipping and already generated mark

2012-03-08 Thread Dan Rosher (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Rosher updated NUTCH-1304:
--

Attachment: NUTCH-1304.patch

 GeneratorMapper.java dosen't return when skipping and already generated mark
 

 Key: NUTCH-1304
 URL: https://issues.apache.org/jira/browse/NUTCH-1304
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: nutchgora
Reporter: Dan Rosher
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-1304.patch


 GeneratorMapper.java dosen't return when skipping and already generated mark

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (NUTCH-1304) GeneratorMapper.java dosen't return when skipping and already generated mark

2012-03-08 Thread Dan Rosher (Created) (JIRA)
GeneratorMapper.java dosen't return when skipping and already generated mark


 Key: NUTCH-1304
 URL: https://issues.apache.org/jira/browse/NUTCH-1304
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: nutchgora
Reporter: Dan Rosher
Priority: Minor
 Fix For: nutchgora
 Attachments: NUTCH-1304.patch

GeneratorMapper.java doesn't return when skipping an already generated mark
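
The report gives only the title, so the following is a minimal sketch of the kind of bug it describes, not the code from NUTCH-1304.patch. The class SkipSketch, its map method, and the alreadyGenerated flag are hypothetical stand-ins and are not part of Nutch's API. The point is that a "skip" branch has to return, otherwise control falls through and the row is emitted anyway.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a mapper; names are illustrative, not Nutch's API. */
public class SkipSketch {

  /** Emits the URL unless it already carries a generate mark. */
  public static void map(String url, boolean alreadyGenerated, List<String> out) {
    if (alreadyGenerated) {
      // Skip this row. Without the return, control would fall through
      // to the add() below and the URL would still be emitted.
      return;
    }
    out.add(url);
  }

  public static void main(String[] args) {
    List<String> out = new ArrayList<>();
    map("http://a.example/", true, out);   // skipped
    map("http://b.example/", false, out);  // emitted
    System.out.println(out);
  }
}
```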





[jira] [Commented] (NUTCH-1304) GeneratorMapper.java dosen't return when skipping and already generated mark

2012-03-08 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225159#comment-13225159
 ] 

Lewis John McGibbney commented on NUTCH-1304:
-

+1 for commit. I'll wait until this afternoon to hear back from anyone else 
before doing so. Thanks Dan.

 GeneratorMapper.java dosen't return when skipping and already generated mark
 

 Key: NUTCH-1304
 URL: https://issues.apache.org/jira/browse/NUTCH-1304
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: nutchgora
Reporter: Dan Rosher
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-1304.patch


 GeneratorMapper.java dosen't return when skipping and already generated mark





[jira] [Commented] (NUTCH-1300) Indexer to normalize URL's

2012-03-08 Thread Sebastian Nagel (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225174#comment-13225174
 ] 

Sebastian Nagel commented on NUTCH-1300:


+1
* effective fix for a serious problem: long running continuous crawls require 
adjustments of the normalization rules quite often
* tested (with 1.4): costs (time spent for extra normalization) are ok compared 
to the benefit

Two suggestions:
# Does a URLNormalizer scope index make sense? E.g., if only outlinks are 
normalized and default rules are empty, the scope index may use the same 
rules as scope outlink.
# Wouldn't commandline options for solrindex be nice? Most other tools 
(generate, updatedb, invertlinks) have options such as -filter / -norm / 
-noNorm.

 Indexer to normalize URL's
 --

 Key: NUTCH-1300
 URL: https://issues.apache.org/jira/browse/NUTCH-1300
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1300-1.5-1.patch


 Indexers should be able to normalize URL's. This is useful when a new 
 normalizer is applied to the entire CrawlDB. Without it, some or all records 
 in a segment cannot be indexed at all.





[jira] [Updated] (NUTCH-1305) Domain(blacklist)URLFilter to trim entries

2012-03-08 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1305:
-

Attachment: NUTCH-1305-1.5-1.patch

Patch for 1.5. Fixes the issue.

 Domain(blacklist)URLFilter to trim entries
 --

 Key: NUTCH-1305
 URL: https://issues.apache.org/jira/browse/NUTCH-1305
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1305-1.5-1.patch


 Both filters should handle entries with trailing whitespace.





[jira] [Created] (NUTCH-1305) Domain(blacklist)URLFilter to trim entries

2012-03-08 Thread Markus Jelsma (Created) (JIRA)
Domain(blacklist)URLFilter to trim entries
--

 Key: NUTCH-1305
 URL: https://issues.apache.org/jira/browse/NUTCH-1305
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5
 Attachments: NUTCH-1305-1.5-1.patch

Both filters should handle entries with trailing whitespace.
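
As a rough illustration of the requested behaviour (not the attached patch), entries can be trimmed as the list is read, so that "example.com" followed by spaces or tabs still matches. The class DomainListParser and its parse method are hypothetical names; skipping blank lines is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only; not the code from NUTCH-1305-1.5-1.patch. */
public class DomainListParser {

  /** Returns the entries with surrounding whitespace removed; blank lines are dropped. */
  public static Set<String> parse(List<String> lines) {
    Set<String> domains = new LinkedHashSet<>();
    for (String line : lines) {
      String entry = line.trim(); // removes leading and trailing whitespace
      if (!entry.isEmpty()) {
        domains.add(entry);
      }
    }
    return domains;
  }

  public static void main(String[] args) {
    Set<String> domains = parse(Arrays.asList("example.com  ", "\tapache.org\t", ""));
    System.out.println(domains.contains("example.com"));
  }
}
```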





[jira] [Commented] (NUTCH-1305) Domain(blacklist)URLFilter to trim entries

2012-03-08 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225206#comment-13225206
 ] 

Lewis John McGibbney commented on NUTCH-1305:
-

+1

 Domain(blacklist)URLFilter to trim entries
 --

 Key: NUTCH-1305
 URL: https://issues.apache.org/jira/browse/NUTCH-1305
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1305-1.5-1.patch


 Both filters should handle entries with trailing whitespace.





[jira] [Resolved] (NUTCH-1305) Domain(blacklist)URLFilter to trim entries

2012-03-08 Thread Markus Jelsma (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-1305.
--

Resolution: Fixed

Committed for 1.5 in rev. 1298394.

 Domain(blacklist)URLFilter to trim entries
 --

 Key: NUTCH-1305
 URL: https://issues.apache.org/jira/browse/NUTCH-1305
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1305-1.5-1.patch


 Both filters should handle entries with trailing whitespace.





[jira] [Commented] (NUTCH-1305) Domain(blacklist)URLFilter to trim entries

2012-03-08 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225209#comment-13225209
 ] 

Markus Jelsma commented on NUTCH-1305:
--

Thanks Lewis.

 Domain(blacklist)URLFilter to trim entries
 --

 Key: NUTCH-1305
 URL: https://issues.apache.org/jira/browse/NUTCH-1305
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1305-1.5-1.patch


 Both filters should handle entries with trailing whitespace.





[jira] [Created] (NUTCH-1306) Commit after finished writing to solr index

2012-03-08 Thread Dan Rosher (Created) (JIRA)
Commit after finished writing to solr index
---

 Key: NUTCH-1306
 URL: https://issues.apache.org/jira/browse/NUTCH-1306
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: nutchgora
Reporter: Dan Rosher
Priority: Trivial
 Fix For: nutchgora
 Attachments: NUTCH-1306.patch

Commit after finishing writing to the Solr index. Otherwise it is a bit
confusing not to see the number of docs we expect in Solr.
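
A minimal sketch of the idea, assuming a writer that buffers documents and a client whose additions only become visible after a commit. IndexClient, InMemoryClient, and Writer are hypothetical types written for this example; they are not SolrJ or Nutch classes, and the attached patch may be structured quite differently.

```java
import java.util.ArrayList;
import java.util.List;

public class CommitOnCloseSketch {

  /** Hypothetical minimal index client; a real one would talk to Solr. */
  public interface IndexClient {
    void add(List<String> docs);
    void commit();
  }

  /** In-memory client: added documents become visible only after commit(). */
  public static class InMemoryClient implements IndexClient {
    private final List<String> pending = new ArrayList<>();
    private final List<String> visible = new ArrayList<>();

    @Override public void add(List<String> docs) { pending.addAll(docs); }
    @Override public void commit() { visible.addAll(pending); pending.clear(); }
    public int visibleCount() { return visible.size(); }
  }

  /** Buffers documents and commits once, when writing is finished. */
  public static class Writer implements AutoCloseable {
    private final IndexClient client;
    private final List<String> buffer = new ArrayList<>();

    public Writer(IndexClient client) { this.client = client; }

    public void write(String doc) { buffer.add(doc); }

    @Override public void close() {
      if (!buffer.isEmpty()) {
        client.add(new ArrayList<>(buffer)); // flush whatever is still buffered
        buffer.clear();
      }
      client.commit(); // make the documents visible to searches
    }
  }

  public static void main(String[] args) {
    InMemoryClient client = new InMemoryClient();
    try (Writer writer = new Writer(client)) {
      writer.write("doc-1");
      writer.write("doc-2");
    }
    System.out.println(client.visibleCount());
  }
}
```

Committing once at close, rather than after every batch, keeps the number of commits low while still making the final document count match what the job reported.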





[jira] [Updated] (NUTCH-1306) Commit after finished writing to solr index

2012-03-08 Thread Dan Rosher (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Rosher updated NUTCH-1306:
--

Attachment: NUTCH-1306.patch

 Commit after finished writing to solr index
 ---

 Key: NUTCH-1306
 URL: https://issues.apache.org/jira/browse/NUTCH-1306
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: nutchgora
Reporter: Dan Rosher
Priority: Trivial
 Fix For: nutchgora

 Attachments: NUTCH-1306.patch


 Commit after finishing writing to the Solr index. Otherwise it is a bit
 confusing not to see the number of docs we expect in Solr.





NutchGora release, and Nutch 1.x trunk release

2012-03-08 Thread Mattmann, Chris A (388J)
Hey Guys,

I've got some cycles this weekend -- anyone up for a 1.5 release off trunk
(stable), and a NutchGora branch release? I suggested this before [1]
regarding NutchGora. I'm inclined to say let's do the following:

1. NutchGora: apache-nutch-2.0 - release 2.x series based on this branch
2. Nutch: apache-nutch-1.x - stable trunk branch

Then, when the time comes, we can try and create a:

3. Nutch: apache-nutch-3.x - merge of 1.x and 2.x feature branches

Would this make sense? Anyways we don't have to decide anything now that
we can't undo later, but are folks OK with me doing an RC for NutchGora and for
1.x this weekend?

Cheers,
Chris

[1] http://s.apache.org/GD2

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



[jira] [Commented] (NUTCH-1305) Domain(blacklist)URLFilter to trim entries

2012-03-08 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225232#comment-13225232
 ] 

Hudson commented on NUTCH-1305:
---

Integrated in nutch-trunk-maven #187 (See 
[https://builds.apache.org/job/nutch-trunk-maven/187/])
NUTCH-1305 Domain(blacklist)URLFilter to trim entries (Revision 1298394)

 Result = SUCCESS
markus : 
Files : 
* 
/nutch/trunk/src/plugin/urlfilter-domainblacklist/src/java/org/apache/nutch/urlfilter/domainblacklist/DomainBlacklistURLFilter.java


 Domain(blacklist)URLFilter to trim entries
 --

 Key: NUTCH-1305
 URL: https://issues.apache.org/jira/browse/NUTCH-1305
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1305-1.5-1.patch


 Both filters should handle entries with trailing whitespace.





Re: NutchGora release, and Nutch 1.x trunk release

2012-03-08 Thread Markus Jelsma
+1 

1.5 has, again, many fixes and improvements, just as 1.4 had over 1.3. But I'd
like to integrate Tika 1.1 after its pending release.

Cheers


-- 
Markus Jelsma - CTO - Openindex


Re: NutchGora release, and Nutch 1.x trunk release

2012-03-08 Thread Lewis John Mcgibbney
Yeah, I agree, Chris & Markus.

On the Nutchgora note, I would like to see Gora 0.2 released beforehand,
as we have a blocking issue, NUTCH-1205, with Ivy retrieving alien Gora
0.2-SNAPSHOT dependencies from repository.apache.org. We should be able to
overcome this issue by releasing Gora 0.2 to Maven Central and then just
pulling those dependencies with Ivy in Nutchgora, rather than messing about
with chain/multiple/snapshot resolvers in the Ivy configuration.

My 2 cents


-- 
*Lewis*


[jira] [Created] (NUTCH-1307) Improve formatting of ant targets for clearer project help

2012-03-08 Thread Lewis John McGibbney (Created) (JIRA)
Improve formatting of ant targets for clearer project help
--

 Key: NUTCH-1307
 URL: https://issues.apache.org/jira/browse/NUTCH-1307
 Project: Nutch
  Issue Type: New Feature
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Trivial
 Fix For: nutchgora, 1.5


This is a trivial formatting issue. I will submit a patch shortly and fix it.





Re: NutchGora release, and Nutch 1.x trunk release

2012-03-08 Thread Mattmann, Chris A (388J)
Hey Guys,

OK, sounds good. Looks like we need to wait for the Tika 1.1 release (seems to
be going well so far), and then try and push Gora 0.2 (which I know Lewis is
pushing, and which I'm happy to RM once we're ready there). So, maybe I'll
shoot for next weekend or the weekend after to push Nutch 1.5 and 2.0 RCs.

Cheers,
Chris




Re: NutchGora release, and Nutch 1.x trunk release

2012-03-08 Thread Ferdy Galema
+1 for pushing Gora 0.2 prior to the Nutchgora 2.0 RC.

For Nutchgora, besides Nutch-1205 the only thing I'm a bit concerned about
is Nutch-1253. This seems like a blocker to me, and I think it only affects
Nutch trunk. (Though I'm not sure).





[jira] [Updated] (NUTCH-1307) Improve formatting of ant targets for clearer project help

2012-03-08 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1307:


Attachment: NUTCH-1307-trunk.patch
NUTCH-1307-nutchgora.patch

trivial patches

When running 
{code}
$ ant -projecthelp
{code}
(from $NUTCH_HOME)

this gives nicer output.

 Improve formatting of ant targets for clearer project help
 --

 Key: NUTCH-1307
 URL: https://issues.apache.org/jira/browse/NUTCH-1307
 Project: Nutch
  Issue Type: New Feature
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Trivial
 Fix For: nutchgora, 1.5

 Attachments: NUTCH-1307-nutchgora.patch, NUTCH-1307-trunk.patch


 This is a trivial formatting issue. I will submit a patch shortly and fix it.





[jira] [Closed] (NUTCH-1307) Improve formatting of ant targets for clearer project help

2012-03-08 Thread Lewis John McGibbney (Closed) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-1307.
---


 Improve formatting of ant targets for clearer project help
 --

 Key: NUTCH-1307
 URL: https://issues.apache.org/jira/browse/NUTCH-1307
 Project: Nutch
  Issue Type: New Feature
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Trivial
 Fix For: nutchgora, 1.5

 Attachments: NUTCH-1307-nutchgora.patch, NUTCH-1307-trunk.patch


 This is a trivial formatting issue. I will submit a patch shortly and fix it.





[jira] [Resolved] (NUTCH-1307) Improve formatting of ant targets for clearer project help

2012-03-08 Thread Lewis John McGibbney (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1307.
-

Resolution: Fixed

Committed @ revision 1298437 in Nutchgora branch
Committed @ revision 1298438 in trunk




 Improve formatting of ant targets for clearer project help
 --

 Key: NUTCH-1307
 URL: https://issues.apache.org/jira/browse/NUTCH-1307
 Project: Nutch
  Issue Type: New Feature
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Trivial
 Fix For: nutchgora, 1.5

 Attachments: NUTCH-1307-nutchgora.patch, NUTCH-1307-trunk.patch


 This is a trivial formatting issue. I will submit a patch shortly and fix it.





[jira] [Resolved] (NUTCH-1304) GeneratorMapper.java dosen't return when skipping and already generated mark

2012-03-08 Thread Lewis John McGibbney (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1304.
-

Resolution: Fixed

Committed @ revision 1298444 in Nutchgora branch

Thank you again Dan for the patch.


 GeneratorMapper.java dosen't return when skipping and already generated mark
 

 Key: NUTCH-1304
 URL: https://issues.apache.org/jira/browse/NUTCH-1304
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: nutchgora
Reporter: Dan Rosher
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-1304.patch


 GeneratorMapper.java dosen't return when skipping and already generated mark





[jira] [Commented] (NUTCH-1304) GeneratorMapper.java dosen't return when skipping and already generated mark

2012-03-08 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225270#comment-13225270
 ] 

Lewis John McGibbney commented on NUTCH-1304:
-

Please close this one off when you have time, Dan. Thank you.

 GeneratorMapper.java dosen't return when skipping and already generated mark
 

 Key: NUTCH-1304
 URL: https://issues.apache.org/jira/browse/NUTCH-1304
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: nutchgora
Reporter: Dan Rosher
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-1304.patch


 GeneratorMapper.java dosen't return when skipping and already generated mark





[jira] [Commented] (NUTCH-1307) Improve formatting of ant targets for clearer project help

2012-03-08 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225299#comment-13225299
 ] 

Hudson commented on NUTCH-1307:
---

Integrated in nutch-trunk-maven #188 (See 
[https://builds.apache.org/job/nutch-trunk-maven/188/])
commit to address NUTCH-1307 and update to CHANGES.txt (Revision 1298438)

 Result = SUCCESS
lewismc : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/build.xml


 Improve formatting of ant targets for clearer project help
 --

 Key: NUTCH-1307
 URL: https://issues.apache.org/jira/browse/NUTCH-1307
 Project: Nutch
  Issue Type: New Feature
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Trivial
 Fix For: nutchgora, 1.5

 Attachments: NUTCH-1307-nutchgora.patch, NUTCH-1307-trunk.patch


 This is a trivial formatting issue. I will submit a patch shortly and fix it.





[jira] [Commented] (NUTCH-728) Improve nutch release packaging

2012-03-08 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225389#comment-13225389
 ] 

Lewis John McGibbney commented on NUTCH-728:


Looking at this, and then at what we have available on our mirrors, I don't
really see the need to include this code at the moment (unless it would make
the release process easier). Doesn't Chris already provide us with a
src.tar.gz with every release?
I suppose this one is really down to the release manager's opinion.

 Improve nutch release packaging
 ---

 Key: NUTCH-728
 URL: https://issues.apache.org/jira/browse/NUTCH-728
 Project: Nutch
  Issue Type: Improvement
Reporter: Sami Siren
 Attachments: NUTCH-728-nutchgora.patch, NUTCH-728-v2.patch, 
 NUTCH-728.patch


 see the discussion from 
 http://www.lucidimagination.com/search/document/aa4d52cbd9af026a/discuss_contents_of_nutch_release_artifact





[jira] [Commented] (NUTCH-882) Design a Host table in GORA

2012-03-08 Thread Mathijs Homminga (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225486#comment-13225486
 ] 

Mathijs Homminga commented on NUTCH-882:


Status:
I have updated the patches to match the current HEAD (nutchgora). I also added 
a HostDbUpdateJob which populates the host db from an existing web table (I 
needed to fix an issue in GORA for this: 
https://issues.apache.org/jira/browse/GORA-105). 

I'm currently finishing some work on the NutchContext and will post the patch 
sometime next week.



 Design a Host table in GORA
 ---

 Key: NUTCH-882
 URL: https://issues.apache.org/jira/browse/NUTCH-882
 Project: Nutch
  Issue Type: New Feature
Affects Versions: nutchgora
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: nutchgora

 Attachments: NUTCH-882-v1.patch, hostdb.patch


 Having a separate GORA table for storing information about hosts (and 
 domains?) would be very useful for: 
 * customising the behaviour of the fetching on a per-host basis, e.g. number 
 of threads, min time between threads etc.
 * storing stats
 * keeping metadata and possibly propagating it to the webpages 
 * keeping a copy of the robots.txt and possibly using it later to filter the 
 webtable
 * storing sitemap files and updating the webtable accordingly
 I'll try to come up with a GORA schema for such a host table, but any comments 
 are of course already welcome. 
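
The kinds of per-host information listed above could be grouped roughly as in the plain Java sketch below. This is only an illustration under stated assumptions: the class name HostRecord and all of its fields are hypothetical, it is not a GORA schema, and it is not taken from the attached patches.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-host record; field names are illustrative only. */
class HostRecord {
  final String host;

  // Fetch behaviour customised per host (assumed defaults).
  int maxThreads = 1;
  long minDelayMs = 0L;

  // Free-form counters, e.g. "fetched" or "failed".
  final Map<String, Long> stats = new HashMap<>();

  // Metadata that could later be propagated to the host's webpages.
  final Map<String, String> metadata = new HashMap<>();

  // Copy of the host's robots.txt; null until one has been stored.
  String robotsTxt;

  // Sitemap locations known for this host.
  final List<String> sitemapUrls = new ArrayList<>();

  HostRecord(String host) {
    this.host = host;
  }

  /** Adds one to the named counter, creating it if absent. */
  void incrementStat(String name) {
    stats.merge(name, 1L, Long::sum);
  }
}
```

A real host table would presumably be keyed by host name, with the maps above stored as map-typed fields; how that is expressed in GORA is left open here.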





[jira] [Commented] (NUTCH-1278) Fetch Improvement in threads per host

2012-03-08 Thread Ferdy Galema (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225545#comment-13225545
 ] 

Ferdy Galema commented on NUTCH-1278:
-

I noticed you used the diff command this time, but failed to include the new 
file in the patch. When you want the diff command to include new files, you 
simply add them to svn first. In the case of HostsUtil, this would be:

svn add src/java/org/apache/nutch/util/HostsUtil.java

When you execute the diff command afterwards, you will notice that it includes 
the new file. Now you can upload just this patch file instead of a zip.

Good luck.

 Fetch Improvement in threads per host
 -

 Key: NUTCH-1278
 URL: https://issues.apache.org/jira/browse/NUTCH-1278
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 1.4
Reporter: behnam nikbakht
 Attachments: NUTCH-1278-v.2.zip, NUTCH-1278.zip


 The value of maxThreads is equal to fetcher.threads.per.host and is constant 
 for every host.
 It would be possible to use a dynamic value for each host instead, influenced 
 by the number of blocked requests.
 This means that if the number of blocked requests for a host increases, we 
 must decrease this value and increase http.timeout.
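
The idea described above can be sketched in plain Java as follows. This is a minimal illustration, not the code from the attached zip files and not Nutch's fetcher: the class DynamicHostLimits, its method names, and the linear adjustment rule (one thread fewer and one extra second of timeout per blocked request) are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of per-host limits driven by blocked requests. */
class DynamicHostLimits {
  // Assumed step: each blocked request adds this much to the timeout.
  private static final int TIMEOUT_STEP_MS = 1000;

  private final int baseMaxThreads; // stands in for fetcher.threads.per.host
  private final int baseTimeoutMs;  // stands in for http.timeout
  private final Map<String, Integer> blocked = new HashMap<>();

  DynamicHostLimits(int baseMaxThreads, int baseTimeoutMs) {
    this.baseMaxThreads = baseMaxThreads;
    this.baseTimeoutMs = baseTimeoutMs;
  }

  /** Records one blocked request for the given host. */
  void recordBlocked(String host) {
    blocked.merge(host, 1, Integer::sum);
  }

  /** Thread limit shrinks by one per blocked request, never below 1. */
  int maxThreads(String host) {
    int b = blocked.getOrDefault(host, 0);
    return Math.max(1, baseMaxThreads - b);
  }

  /** Timeout grows linearly with the number of blocked requests. */
  int timeoutMs(String host) {
    int b = blocked.getOrDefault(host, 0);
    return baseTimeoutMs + b * TIMEOUT_STEP_MS;
  }
}
```

Hosts with no blocked requests keep the configured base values, so the behaviour for well-behaved hosts would be unchanged under this rule.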





[jira] [Updated] (NUTCH-841) Nutch 2.0 webapp

2012-03-08 Thread Ferdy Galema (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-841:
---

Priority: Major  (was: Blocker)

 Nutch 2.0 webapp
 

 Key: NUTCH-841
 URL: https://issues.apache.org/jira/browse/NUTCH-841
 Project: Nutch
  Issue Type: Improvement
  Components: web gui
 Environment: Nutch 2.0
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: nutchgora


 In light of the conversation on NUTCH-837, we are removing the old Nutch 
 webapp and will replace it with a 2.0 one that works with GORA + Solr. 





[jira] [Commented] (NUTCH-841) Nutch 2.0 webapp

2012-03-08 Thread Chris A. Mattmann (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225551#comment-13225551
 ] 

Chris A. Mattmann commented on NUTCH-841:
-

Yep not a blocker!

 Nutch 2.0 webapp
 

 Key: NUTCH-841
 URL: https://issues.apache.org/jira/browse/NUTCH-841
 Project: Nutch
  Issue Type: Improvement
  Components: web gui
 Environment: Nutch 2.0
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: nutchgora


 In light of the conversation on NUTCH-837, we are removing the old Nutch 
 webapp and will replace it with a 2.0 one that works with GORA + Solr. 
