from:"Ferdy Galema \(JIRA\)"

[jira] [Closed] (NUTCH-1340) Increase scalability by only removing markers when they actually exist for DbUpdaterReducer

2012-04-26 Thread Ferdy Galema (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Ferdy Galema closed NUTCH-1340.
---

Resolution: Fixed

Increase scalability by only removing markers when they actually exist for
DbUpdaterReducer
---

Key: NUTCH-1340
URL: https://issues.apache.org/jira/browse/NUTCH-1340
Project: Nutch
Issue Type: Improvement
Reporter: Ferdy Galema
Fix For: nutchgora

Attachments: NUTCH-1340-v1.txt, NUTCH-1340-v2.txt

After applying GORA-120 (which is already a huge performance boost by itself),
one of the major bottlenecks of the DbUpdaterReducer is the deletion of the
markers. The update reducer simply sets every row to delete its markers. A
lot of rows do not actually have the markers, but the deletes are fired off
in any case. Because the markers are always already on the input, a simple
check to see whether they exist greatly improves performance.
This is particularly expensive in HBase, because every single Delete
immediately triggers a connection to the regionservers. (They ignore the
autoflush=false directive.) Although deletes can be done in batch, this is
currently not supported by Gora. For one, it is very difficult to implement in
the current HBaseStore with regard to multithreading, and secondly, I noticed
that performance did not increase significantly.
Performance debugging on a real-life cluster shows that this currently is
the biggest bottleneck of the DbUpdaterReducer. (Remember: only after applying
GORA-120.)
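
A minimal sketch of the check described above, not the code from the attached patch. It assumes a row's markers can be modelled as a plain map; the class name MarkerCleanup, both method names and the marker name used in the usage note are hypothetical and are not part of the Nutch or Gora API. The point is only that a delete is issued when the marker is actually present, so rows without markers cost nothing.

```java
import java.util.Map;

/** Hypothetical stand-in for a row's marker storage; not the Nutch WebPage API. */
final class MarkerCleanup {

    /**
     * Removes the marker only if it is present.
     * Returns true when a delete was actually issued, false when it was skipped.
     */
    static boolean removeMarkerIfPresent(Map<String, String> markers, String name) {
        if (!markers.containsKey(name)) {
            return false; // nothing stored, so no delete has to reach the store
        }
        markers.remove(name);
        return true;
    }

    /** Cleans a batch of rows and counts how many deletes were really needed. */
    static int cleanAll(Iterable<Map<String, String>> rows, String name) {
        int deletes = 0;
        for (Map<String, String> row : rows) {
            if (removeMarkerIfPresent(row, name)) {
                deletes++;
            }
        }
        return deletes;
    }
}
```

Calling MarkerCleanup.cleanAll(rows, "fetch_mark") on a batch where only some rows carry the (made-up) "fetch_mark" entry returns the number of rows that had it, which is the number of deletes that would be sent to the store.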

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (NUTCH-1340) Increase scalability by only removing markers when they actually exist for DbUpdaterReducer

2012-04-26 Thread Ferdy Galema (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Ferdy Galema updated NUTCH-1340:

Attachment: NUTCH-1340-v2.txt

v2 of patch, including javadoc. This patch increases performance, but when
updating huge crawls it still can be a bit troublesome to process the huge
amounts of deletes. However this is something that needs to be solved in Gora.

Committed!

Thanks Lewis.

Increase scalability by only removing markers when they actually exist for
DbUpdaterReducer
---

Key: NUTCH-1340
URL: https://issues.apache.org/jira/browse/NUTCH-1340
Project: Nutch
Issue Type: Improvement
Reporter: Ferdy Galema
Fix For: nutchgora

Attachments: NUTCH-1340-v1.txt, NUTCH-1340-v2.txt

[jira] [Commented] (NUTCH-882) Design a Host table in GORA

2012-04-26 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13262488#comment-13262488
 ] 

Ferdy Galema commented on NUTCH-882:


Committed. I realize that the current state is far from finished, however I 
figured it is enough to close this longstanding issue off. This makes room for 
people to easily play around with it and make improvements where necessary. 
(Adding definitions for other stores, new features such as storing stats 
etcetera.)

I'll leave the final closing to Julien, since he is the original reporter.

Please let me know if any of you disagree.

 Design a Host table in GORA
 ---

 Key: NUTCH-882
 URL: https://issues.apache.org/jira/browse/NUTCH-882
 Project: Nutch
  Issue Type: New Feature
Affects Versions: nutchgora
Reporter: Julien Nioche
 Fix For: nutchgora

 Attachments: NUTCH-882-v1.patch, NUTCH-882-v3.txt, NUTCH-882-v3.txt, 
 hostdb.patch


 Having a separate GORA table for storing information about hosts (and 
 domains?) would be very useful for : 
 * customising the behaviour of the fetching on a host basis e.g. number of 
 threads, min time between threads etc...
 * storing stats
 * keeping metadata and possibly propagate them to the webpages 
 * keeping a copy of the robots.txt and possibly use that later to filter the 
 webtable
 * store sitemaps files and update the webtable accordingly
 I'll try to come up with a GORA schema for such a host table but any comments 
 are of course already welcome 


[jira] [Resolved] (NUTCH-882) Design a Host table in GORA

2012-04-26 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema resolved NUTCH-882.


Resolution: Fixed



[jira] [Commented] (NUTCH-902) Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box

2012-04-26 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13262496#comment-13262496
 ] 

Ferdy Galema commented on NUTCH-902:


I think nutch-default.xml does not correctly use the description field of the 
storage.data.store.class property. The description should describe what the 
property is about, not what the value is about. So instead of the various 
entries:

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.cassandra.store.CassandraStore</value>
  <description>Gora class for storing data in Apache Cassandra</description>
</property>
-->

<!--
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Gora class for storing data in Apache HBase</description>
</property>
-->

and so on,

I propose to add a single property entry with a description like 
this:

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
  <description>The Gora DataStore class for storing/retrieving data.
Currently the following stores are available:

org.apache.gora.sql.store.SqlStore
  A DataStore implementation for RDBMS with a SQL interface.
  SqlStore uses JDBC drivers to communicate with the DB.

org.apache.gora.hbase.store.HBaseStore
  DataStore implementation for Hadoop HBase.

etcetera

  </description>
</property>

This has the additional benefit to make the nutch-default.xml look cleaner, 
imho.

 Add all necessary files and configuration so that nutch can be used with 
 different backends out-of-the-box
 --

 Key: NUTCH-902
 URL: https://issues.apache.org/jira/browse/NUTCH-902
 Project: Nutch
  Issue Type: New Feature
  Components: documentation, storage
Affects Versions: nutchbase
Reporter: Enis Soztutar
Assignee: Lewis John McGibbney
 Fix For: nutchgora

 Attachments: NUTCH-902-v2.patch, NUTCH-902-v3.patch, NUTCH-902.patch


 As per the discussion in the mailing list and 
 http://wiki.apache.org/nutch/GORA_HBase, it will be good to include all the 
 necessary files and configuration. I propose that we maintain configuration 
 for at least SQL, HBase and Cassandra. 
 The following changes are needed:
 conf/gora-sql-mapping.xml
 conf/gora-hbase-mapping.xml
 conf/gora-cassandra-mapping.xml
 comments on nutch-default and ivy.xml 
 Shall we also include jars from gora-hbase, gora-cassandra and their 
 dependencies ? 


[jira] [Commented] (NUTCH-1189) add commented out default settings to gora.properties files

2012-04-26 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13262506#comment-13262506
 ] 

Ferdy Galema commented on NUTCH-1189:
-

FYI: I just committed a change to update the HBaseStore properties section.

 add commented out default settings to gora.properties files 
 

 Key: NUTCH-1189
 URL: https://issues.apache.org/jira/browse/NUTCH-1189
 Project: Nutch
  Issue Type: Sub-task
  Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: nutchgora

 Attachments: NUTCH-1189-v2.patch, NUTCH-1189-v3.patch, 
 NUTCH-1189-v4.patch, NUTCH-1189.patch


 This issue should have been dealt with as part of its parent issue; however, 
 I think that, as it is a fairly large task in itself, it needs to be done 
 independently. The gora.properties file should, amongst other settings, and 
 besides the extremely basic defaults for sqlstore, include defaults for 
 HBase, Cassandra, etc. servers on their default ports. Leaving this down 
 to individual interpretation puts a huge onus on the user, hence 
 constructing a barrier to entry for getting the configuration settings up and 
 running.


[jira] [Closed] (NUTCH-882) Design a Host table in GORA

2012-04-26 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema closed NUTCH-882.
--


Ok.

Thanks to anyone who was involved.



[jira] [Closed] (NUTCH-1290) crawlId not supported by all Tools

2012-04-26 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema closed NUTCH-1290.
---

Resolution: Fixed

 crawlId not supported by all Tools
 --

 Key: NUTCH-1290
 URL: https://issues.apache.org/jira/browse/NUTCH-1290
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: nutchgora
Reporter: Mathijs Homminga
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-1290.patch


 See also: https://issues.apache.org/jira/browse/NUTCH-907
 The StorageUtils class exposes a createDataStore method which uses the 
 default schema for a persistent class specified in the Gora configuration. 
 This method ignores Nutch's storage.schema property and the notion of a 
 crawlId.
 Two tools use this method instead of the createWebStore method (which does 
 support the storage.schema property and a crawlId):
 o.a.n.indexer.IndexerReducer (IndexerJob)
 o.a.n.util.domain.DomainStatistics
  
 I propose that these two start using the createWebStore method and that we 
 remove the createDataStore method from StorageUtils.
 Also, these two tools should support the crawlId command line parameter.
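
 To illustrate what "supporting a crawlId" could mean for schema selection, here is a small sketch. It is an assumption made for illustration, not the actual StorageUtils code: the class SchemaNames, the method resolve and the "crawlId_schema" naming convention are all hypothetical. The idea is that a tool which honours the crawlId works on a crawl-specific schema, while a tool that ignores it silently falls back to the default one.

```java
/** Hypothetical helper; the prefix convention shown here is an assumption. */
final class SchemaNames {

    /**
     * Returns the schema a tool should use: the configured schema as-is when
     * no crawlId is given, otherwise the schema prefixed with the crawlId.
     */
    static String resolve(String configuredSchema, String crawlId) {
        if (crawlId == null || crawlId.isEmpty()) {
            return configuredSchema; // no crawlId: behave like the default schema
        }
        return crawlId + "_" + configuredSchema;
    }
}
```

 With this sketch, a tool run with a crawlId of "crawl1" against a schema called "webpage" would operate on "crawl1_webpage"; a tool that never passes the crawlId would always read "webpage", which is the mismatch this issue describes.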


[jira] [Commented] (NUTCH-902) Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box

2012-04-26 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13262548#comment-13262548
 ] 

Ferdy Galema commented on NUTCH-902:


Alright I'll change and commit the storage.data.store.class property 
description.

Aside from that I think we can close this issue. Effort can be put into 
NUTCH-1205 and, after that, into actual testing of the stores to see if the 
current configuration is sufficient for out-of-the-box usage. If this is not 
the case for some stores, we can always create new issues for those (to 
prevent too much clutter in this issue).



[jira] [Commented] (NUTCH-902) Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box

2012-04-26 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13262556#comment-13262556
 ] 

Ferdy Galema commented on NUTCH-902:


Ok done. (Note that I did not actually check the stores, I simply merged the 
nutch-default.xml entries)



[jira] [Commented] (NUTCH-879) URL-s getting lost

2012-04-26 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13262558#comment-13262558
 ] 

Ferdy Galema commented on NUTCH-879:


This is a pretty old issue. Nevertheless, the bug might still be active. I'll 
look into it.

 URL-s getting lost
 --

 Key: NUTCH-879
 URL: https://issues.apache.org/jira/browse/NUTCH-879
 Project: Nutch
  Issue Type: Bug
Affects Versions: nutchgora
 Environment: * Ubuntu 10.4 x64, Sun JDK 1.6
 * using 1-node Hadoop + HDFS
 * trunk r983472, using MySQL store
 * branch-1.3
Reporter: Andrzej Bialecki 
 Fix For: nutchgora

 Attachments: branch-1.3-bench.txt, trunk-bench.txt


 I ran the Benchmark using branch-1.3 and trunk (formerly nutchbase). With the 
 same Benchmark parameters and the same plugins branch-1.3 collects ~1.5mln 
 urls, while trunk collects ~20,000 urls. Clearly something is wrong.


[jira] [Updated] (NUTCH-1205) Upgrade gora modules to 0.2 in ivy/ivy.xml

2012-04-27 Thread Ferdy Galema (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Ferdy Galema updated NUTCH-1205:

Attachment: NUTCH-1205-v7.patch

OK, I got the tests working now. The problem is that Properties objects are
not handled correctly throughout the tests. This is currently a problem in
Gora. In short, Properties are not loaded in GoraMapper from dynamic
properties but ALWAYS from the static gora.properties. (I will shortly open an
issue for that.) As a consequence, to make the tests work with Gora 0.2, the
default gora.properties now uses an hsqldb memstore instead of a standalone
hsqldb server.
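
For illustration, the precedence one would expect (dynamic properties winning over the static file) can be sketched with plain java.util.Properties. This is not Gora code: the class PropertyPrecedence and the method withOverrides are hypothetical, and the key and JDBC URLs mentioned in the usage note are only example values.

```java
import java.util.Properties;

/** Hypothetical sketch of the intended lookup order; not part of Gora. */
final class PropertyPrecedence {

    /**
     * Builds a Properties object in which every dynamically supplied value
     * overrides the value from the static file, while keys that were not
     * overridden still fall back to the static defaults.
     */
    static Properties withOverrides(Properties fromFile, Properties dynamic) {
        Properties merged = new Properties(fromFile); // fromFile acts as the fallback
        for (String key : dynamic.stringPropertyNames()) {
            merged.setProperty(key, dynamic.getProperty(key));
        }
        return merged;
    }
}
```

If the file sets gora.sqlstore.jdbc.url to a standalone server URL and a test supplies an in-memory URL dynamically, getProperty on the merged object returns the in-memory URL, and keys the test did not touch keep their file values.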

Lewis, I noticed that you excluded jdom in the ivy.xml. Why is that? I included
it again because the SqlStore needs it to read its mapping.

Upgrade gora modules to 0.2 in ivy/ivy.xml
--

Key: NUTCH-1205
URL: https://issues.apache.org/jira/browse/NUTCH-1205
Project: Nutch
Issue Type: Improvement
Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Blocker
Fix For: nutchgora

Attachments: NUTCH-1205-v2.patch, NUTCH-1205-v3.patch,
NUTCH-1205-v4.patch, NUTCH-1205-v5.patch, NUTCH-1205-v5.patch,
NUTCH-1205-v6.patch, NUTCH-1205-v7.patch, NUTCH-1205.patch

Although gora trunk is unstable, work is ongoing to get this fixed. For the
time being, I think Nutchgora should use gora trunk as this will identify
more vulnerabilities. I'll get the trivial patch submitted shortly.

[jira] [Commented] (NUTCH-1205) Upgrade gora modules to 0.2 in ivy/ivy.xml

2012-04-27 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13263596#comment-13263596
 ] 

Ferdy Galema commented on NUTCH-1205:
-

(Also, I reformatted the ivy.xml to use only spaces for indentation. This is 
the policy, right? If so, could anyone editing xml files double-check their 
editor settings?)

 Upgrade gora modules to 0.2 in ivy/ivy.xml
 --

 Key: NUTCH-1205
 URL: https://issues.apache.org/jira/browse/NUTCH-1205
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Blocker
 Fix For: nutchgora

 Attachments: NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, 
 NUTCH-1205-v4.patch, NUTCH-1205-v5.patch, NUTCH-1205-v5.patch, 
 NUTCH-1205-v6.patch, NUTCH-1205-v7.patch, NUTCH-1205.patch


 Although gora trunk is unstable, work is ongoing to get this fixed. For the 
 time being, I think Nutchgora should use gora trunk as this will identify 
 more vulnerabilities. I'll get the trivial patch submitted shortly.


[jira] [Updated] (NUTCH-1205) Upgrade gora modules to 0.2 in ivy/ivy.xml

2012-04-27 Thread Ferdy Galema (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Ferdy Galema updated NUTCH-1205:

Attachment: NUTCH-1205-v8.patch

Oops, there was still a failure in a later test (TestProtocolHttpClient).
This was because of multiple ant jars. I noticed that this was caused by
removing the global exclude and instead adding excludes to the hadoop deps.
(This was obviously not sufficient.)

The new version of the patch passes all tests.

Upgrade gora modules to 0.2 in ivy/ivy.xml
--

Attachments: NUTCH-1205-v2.patch, NUTCH-1205-v3.patch,
NUTCH-1205-v4.patch, NUTCH-1205-v5.patch, NUTCH-1205-v5.patch,
NUTCH-1205-v6.patch, NUTCH-1205-v7.patch, NUTCH-1205-v8.patch,
NUTCH-1205.patch

[jira] [Updated] (NUTCH-1205) Upgrade gora modules to 0.2 in ivy/ivy.xml

2012-05-02 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1205:


Attachment: (was: NUTCH-1205-v7.patch)

 Upgrade gora modules to 0.2 in ivy/ivy.xml
 --

 Key: NUTCH-1205
 URL: https://issues.apache.org/jira/browse/NUTCH-1205
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Blocker
 Fix For: nutchgora

 Attachments: NUTCH-1205-v10.patch, NUTCH-1205-v2.patch, 
 NUTCH-1205-v3.patch, NUTCH-1205-v4.patch, NUTCH-1205-v5.patch, 
 NUTCH-1205-v5.patch, NUTCH-1205-v6.patch, NUTCH-1205.patch


 Although gora trunk is unstable, work is ongoing to get this fixed. For the 
 time being, I think Nutchgora should use gora trunk as this will identify 
 more vulnerabilities. I'll get the trivial patch submitted shortly.


[jira] [Updated] (NUTCH-1205) Upgrade gora modules to 0.2 in ivy/ivy.xml

2012-05-02 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1205:


Attachment: NUTCH-1205-v10.patch

The tests now work and TestGoraStorage uses a proper standalone database. 
(Integrating issue NUTCH-902). I think it's good to do a final check of what's 
to be included as dependencies in the ivy.xml.

 Upgrade gora modules to 0.2 in ivy/ivy.xml
 --

 Key: NUTCH-1205
 URL: https://issues.apache.org/jira/browse/NUTCH-1205
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Blocker
 Fix For: nutchgora

 Attachments: NUTCH-1205-v10.patch, NUTCH-1205-v2.patch, 
 NUTCH-1205-v3.patch, NUTCH-1205-v4.patch, NUTCH-1205-v5.patch, 
 NUTCH-1205-v5.patch, NUTCH-1205-v6.patch, NUTCH-1205.patch


 Although gora trunk is unstable, work is ongoing to get this fixed. For the 
 time being, I think Nutchgora should use gora trunk as this will identify 
 more vulnerabilities. I'll get the trivial patch submitted shortly.


[jira] [Updated] (NUTCH-1205) Upgrade gora modules to 0.2 in ivy/ivy.xml

2012-05-02 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1205:


Attachment: (was: NUTCH-1205-v9.patch)

 Upgrade gora modules to 0.2 in ivy/ivy.xml
 --

 Key: NUTCH-1205
 URL: https://issues.apache.org/jira/browse/NUTCH-1205
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Blocker
 Fix For: nutchgora

 Attachments: NUTCH-1205-v10.patch, NUTCH-1205-v2.patch, 
 NUTCH-1205-v3.patch, NUTCH-1205-v4.patch, NUTCH-1205-v5.patch, 
 NUTCH-1205-v5.patch, NUTCH-1205-v6.patch, NUTCH-1205.patch


 Although gora trunk is unstable, work is ongoing to get this fixed. For the 
 time being, I think Nutchgora should use gora trunk as this will identify 
 more vulnerabilities. I'll get the trivial patch submitted shortly.


[jira] [Updated] (NUTCH-1205) Upgrade gora modules to 0.2 in ivy/ivy.xml

2012-05-02 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1205:


Attachment: (was: NUTCH-1205-v9.patch)

 Upgrade gora modules to 0.2 in ivy/ivy.xml
 --

 Key: NUTCH-1205
 URL: https://issues.apache.org/jira/browse/NUTCH-1205
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Blocker
 Fix For: nutchgora

 Attachments: NUTCH-1205-v10.patch, NUTCH-1205-v2.patch, 
 NUTCH-1205-v3.patch, NUTCH-1205-v4.patch, NUTCH-1205-v5.patch, 
 NUTCH-1205-v5.patch, NUTCH-1205-v6.patch, NUTCH-1205.patch


 Although gora trunk is unstable, work is ongoing to get this fixed. For the 
 time being, I think Nutchgora should use gora trunk as this will identify 
 more vulnerabilities. I'll get the trivial patch submitted shortly.


[jira] [Updated] (NUTCH-1205) Upgrade gora modules to 0.2 in ivy/ivy.xml

2012-05-03 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1205:


Attachment: NUTCH-1205-v11.patch

 Upgrade gora modules to 0.2 in ivy/ivy.xml
 --

 Key: NUTCH-1205
 URL: https://issues.apache.org/jira/browse/NUTCH-1205
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Blocker
 Fix For: nutchgora

 Attachments: NUTCH-1205-v10.patch, NUTCH-1205-v11.patch, 
 NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, NUTCH-1205-v4.patch, 
 NUTCH-1205-v5.patch, NUTCH-1205-v5.patch, NUTCH-1205-v6.patch, 
 NUTCH-1205.patch


 Although gora trunk is unstable, work is ongoing to get this fixed. For the 
 time being, I think Nutchgora should use gora trunk as this will identify 
 more vulnerabilities. I'll get the trivial patch submitted shortly.


[jira] [Resolved] (NUTCH-1205) Upgrade gora modules to 0.2 in ivy/ivy.xml

2012-05-03 Thread Ferdy Galema (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Ferdy Galema resolved NUTCH-1205.
-

Resolution: Fixed

Attached new patch v11. Committed.

-Fixed the jdom issue. (Added test dep again).
-Added a single global exclusion for hsqldb. (The deps can have the exclusion
removed).
-Tests succeed.
-Built a sqlstore runtime and played around doing some local crawls successfully.

I did not test a deployment for the other stores. (When there is something
wrong with one of them dependency-wise, we can always create new issues).

Upgrade gora modules to 0.2 in ivy/ivy.xml
--

Attachments: NUTCH-1205-v10.patch, NUTCH-1205-v11.patch,
NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, NUTCH-1205-v4.patch,
NUTCH-1205-v5.patch, NUTCH-1205-v5.patch, NUTCH-1205-v6.patch,
NUTCH-1205.patch

[jira] [Updated] (NUTCH-896) Gora-based tests need to have their own config files

2012-05-03 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-896:
---

Affects Version/s: (was: nutchgora)
Fix Version/s: (was: 2.1)
   nutchgora

 Gora-based tests need to have their own config files 
 -

 Key: NUTCH-896
 URL: https://issues.apache.org/jira/browse/NUTCH-896
 Project: Nutch
  Issue Type: Bug
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: nutchgora


 The tests extending AbstractNutchTest (Injector, Generator, Fetcher) have 
 hard-coded properties for GORA. It would be better to be able to rely on a 
 file gora.properties used only for the tests, just as we do with the 
 nutch-*.xml config files (see CrawlTestUtil). This way we wouldn't use the 
 configs set in the main /conf file as they could be specific to a given GORA 
 backend e.g. Mysql vs hsqldb. This would also help running the tests with a 
 non-default GORA backend. 
 We need to modify GORA and make the method DataStoreFactory.setProperties 
 public. 


[jira] [Commented] (NUTCH-1205) Upgrade gora modules to 0.2 in ivy/ivy.xml

2012-05-03 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267581#comment-13267581
 ] 

Ferdy Galema commented on NUTCH-1205:
-

I committed a minor addition that fixes the maven-plugins error that occurs 
when another store is uncommented.

 Upgrade gora modules to 0.2 in ivy/ivy.xml
 --

 Key: NUTCH-1205
 URL: https://issues.apache.org/jira/browse/NUTCH-1205
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Blocker
 Fix For: nutchgora

 Attachments: NUTCH-1205-v10.patch, NUTCH-1205-v11.patch, 
 NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, NUTCH-1205-v4.patch, 
 NUTCH-1205-v5.patch, NUTCH-1205-v5.patch, NUTCH-1205-v6.patch, 
 NUTCH-1205.patch


 Although gora trunk is unstable, work is ongoing to get this fixed. For the 
 time being, I think Nutchgora should use gora trunk as this will identify 
 more vulnerabilities. I'll get the trivial patch submitted shortly.


[jira] [Commented] (NUTCH-1349) Make batchId explcit within debug logging.

2012-05-03 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267758#comment-13267758
 ] 

Ferdy Galema commented on NUTCH-1349:
-

+1. This will also benefit other jobs that depend on a batchId.

 Make batchId explcit within debug logging.
 --

 Key: NUTCH-1349
 URL: https://issues.apache.org/jira/browse/NUTCH-1349
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora


 I find this a pain when trying to locate the batchId of some urls which are 
 skipped when going to the Solr index. My DEBUG log output gives me
 {code}
 2012-05-03 20:44:55,268 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - 
 Skipping http://www.glasgowwheelers.com/; different batch id
 2012-05-03 20:44:55,259 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - 
 Skipping http://www.heraldscotland.com/; different batch id
 {code}
 when I would actually like
 {code}
 2012-05-03 20:44:55,268 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - 
 Skipping http://www.glasgowwheelers.com/; different batch id (ACTUAL BATCH ID)
 2012-05-03 20:44:55,259 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - 
 Skipping http://www.heraldscotland.com/; different batch id (ACTUAL BATCH ID)
 {code} 
 patch coming up soon


[jira] [Created] (NUTCH-1350) remove unused dependancy because of access restriction

2012-05-04 Thread Ferdy Galema (JIRA)

Ferdy Galema created NUTCH-1350:
---

 Summary: remove unused dependancy because of access restriction
 Key: NUTCH-1350
 URL: https://issues.apache.org/jira/browse/NUTCH-1350
 Project: Nutch
  Issue Type: Bug
Reporter: Ferdy Galema
Priority: Trivial
 Fix For: nutchgora


CrawlTestUtil has an unused dependency on com.sun.net.httpserver.HttpContext 
that sometimes causes an access restriction error with certain JDKs. Since it 
isn't used anyway, I will just remove it.


[jira] [Commented] (NUTCH-1349) Make batchId explcit within debug logging and improve CLI

2012-05-04 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268397#comment-13268397
 ] 

Ferdy Galema commented on NUTCH-1349:
-

Good work on improving the CLI. Regarding the display of the mismatching 
batchId: your patch prints batchId, whereas it should print 'mark' instead.

What do you mean by matching TableUtil.unreverseUrl(key)?
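
The distinction in the first paragraph of this comment can be sketched as follows. This is a minimal illustration, not the code from the attached patch: the class SkipMessage, its method and the sample batch ids are all hypothetical. It assumes that batchId is the id the job was started with and that 'mark' is the batch id actually stored on the row, which is the value the reporter wants to see in the log.

```java
// Hypothetical sketch; SkipMessage is illustrative and not part of Nutch.
final class SkipMessage {
  private SkipMessage() {}

  /**
   * Builds the debug line for a row that is skipped by a batch-scoped job.
   *
   * @param url              the (unreversed) URL of the row
   * @param requestedBatchId the batch id the job was started with
   * @param mark             the batch id stored on the row, or null if absent
   */
  static String format(String url, String requestedBatchId, String mark) {
    String found = (mark == null) ? "none" : mark;
    return "Skipping " + url + "; different batch id (requested: "
        + requestedBatchId + ", found: " + found + ")";
  }

  public static void main(String[] args) {
    // Sample values are made up for the example.
    System.out.println(
        format("http://www.glasgowwheelers.com/", "batch-A", "batch-B"));
  }
}
```

Printing both values makes it obvious from a single log line why a URL was skipped, instead of only repeating the id the job already knows.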

 Make batchId explcit within debug logging and improve CLI
 -

 Key: NUTCH-1349
 URL: https://issues.apache.org/jira/browse/NUTCH-1349
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-1349.patch


 I find this a pain when trying to locate the batchId of some urls which are 
 skipped when going to the Solr index. My DEBUG log output gives me
 {code}
 2012-05-03 20:44:55,268 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - 
 Skipping http://www.glasgowwheelers.com/; different batch id
 2012-05-03 20:44:55,259 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - 
 Skipping http://www.heraldscotland.com/; different batch id
 {code}
 when I would actually like
 {code}
 2012-05-03 20:44:55,268 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - 
 Skipping http://www.glasgowwheelers.com/; different batch id (ACTUAL BATCH ID)
 2012-05-03 20:44:55,259 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - 
 Skipping http://www.heraldscotland.com/; different batch id (ACTUAL BATCH ID)
 {code} 
 patch coming up soon


[jira] [Created] (NUTCH-1352) Improve regex urlfilters/normalizers synchronization

2012-05-07 Thread Ferdy Galema (JIRA)

Ferdy Galema created NUTCH-1352:
---

 Summary: Improve regex urlfilters/normalizers synchronization
 Key: NUTCH-1352
 URL: https://issues.apache.org/jira/browse/NUTCH-1352
 Project: Nutch
  Issue Type: Improvement
Reporter: Ferdy Galema
 Fix For: nutchgora


I noticed that during fetching, the fetcher threads spend a lot of time 
blocked on a monitor because of outlink normalizing/filtering. The cause: 
some of the regex plugins synchronize on a single lock.

This patch improves throughput by removing the synchronization locks and 
replacing them with ThreadLocals where needed.

It has been extensively tested in production. I will commit this later today 
if there are no objections.
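
A minimal sketch of the technique described above, assuming a filter that holds one compiled regex. The class ThreadLocalRegexFilter is hypothetical and is not the code from the patch. A java.util.regex.Pattern is immutable and safe to share, but a Matcher is stateful, so instead of guarding one shared Matcher with a lock each thread gets its own instance through a ThreadLocal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the actual Nutch urlfilter/normalizer plugin code.
final class ThreadLocalRegexFilter {
  // One Matcher per thread: Matcher is not thread-safe, Pattern is.
  private final ThreadLocal<Matcher> matcher;

  ThreadLocalRegexFilter(String regex) {
    final Pattern pattern = Pattern.compile(regex);
    this.matcher = ThreadLocal.withInitial(() -> pattern.matcher(""));
  }

  /** Returns true if the regex is found in the URL. No lock is taken. */
  boolean matches(String url) {
    // reset(...) re-targets this thread's private Matcher at the new input.
    return matcher.get().reset(url).find();
  }

  public static void main(String[] args) {
    ThreadLocalRegexFilter images = new ThreadLocalRegexFilter("\\.(gif|jpg|png)$");
    System.out.println(images.matches("http://example.org/logo.png"));
    System.out.println(images.matches("http://example.org/index.html"));
  }
}
```

The trade-off is a little extra memory per fetcher thread in exchange for no contention on a shared monitor.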


[jira] [Updated] (NUTCH-1352) Improve regex urlfilters/normalizers synchronization

2012-05-07 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1352:


Attachment: NUTCH-1352.patch

 Improve regex urlfilters/normalizers synchronization
 

 Key: NUTCH-1352
 URL: https://issues.apache.org/jira/browse/NUTCH-1352
 Project: Nutch
  Issue Type: Improvement
Reporter: Ferdy Galema
 Fix For: nutchgora

 Attachments: NUTCH-1352.patch


 I noticed that during fetching, the fetcher threads spend a lot of time 
 blocked on a monitor because of outlink normalizing/filtering. The cause: 
 some of the regex plugins synchronize on a single lock.
 This patch improves throughput by removing the synchronization locks and 
 replacing them with ThreadLocals where needed.
 It has been extensively tested in production. I will commit this later today 
 if there are no objections.


[jira] [Updated] (NUTCH-1353) nutchgora DomainStatistics support crawlId, counter bug and reformatting

2012-05-07 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1353:


Attachment: NUTCH-1353.patch

 nutchgora DomainStatistics support crawlId, counter bug and reformatting
 

 Key: NUTCH-1353
 URL: https://issues.apache.org/jira/browse/NUTCH-1353
 Project: Nutch
  Issue Type: Bug
Reporter: Ferdy Galema
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-1353.patch


 This patch fixes three issues in nutchgora DomainStatistics:
 -crawlId support (note: I closed NUTCH-1290 because I thought DomainStatistics 
 was already fixed. This was not the case.)
 -A counter bug (NOT_FETCHED should be incremented instead of FETCHED)
 -reformatting (convert tabs to spaces and remove unused imports)


[jira] [Closed] (NUTCH-1353) nutchgora DomainStatistics support crawlId, counter bug and reformatting

2012-05-07 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema closed NUTCH-1353.
---

Resolution: Fixed

committed

 nutchgora DomainStatistics support crawlId, counter bug and reformatting
 

 Key: NUTCH-1353
 URL: https://issues.apache.org/jira/browse/NUTCH-1353
 Project: Nutch
  Issue Type: Bug
Reporter: Ferdy Galema
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-1353.patch


 This patch fixes three issues in nutchgora DomainStatistics:
 -crawlId support (note: I closed NUTCH-1290 because I thought DomainStatistics 
 was already fixed. This was not the case.)
 -A counter bug (NOT_FETCHED should be incremented instead of FETCHED)
 -reformatting (convert tabs to spaces and remove unused imports)


[jira] [Created] (NUTCH-1354) nutchgora support fetcher.queue.depth.multiplier property

2012-05-07 Thread Ferdy Galema (JIRA)

Ferdy Galema created NUTCH-1354:
---

 Summary: nutchgora support fetcher.queue.depth.multiplier property
 Key: NUTCH-1354
 URL: https://issues.apache.org/jira/browse/NUTCH-1354
 Project: Nutch
  Issue Type: New Feature
Reporter: Ferdy Galema
Priority: Minor
 Fix For: nutchgora


Like trunk, nutchgora should support the fetcher.queue.depth.multiplier 
property too.


[jira] [Updated] (NUTCH-1354) nutchgora support fetcher.queue.depth.multiplier property

2012-05-07 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1354:


Attachment: NUTCH-1354.patch

 nutchgora support fetcher.queue.depth.multiplier property
 -

 Key: NUTCH-1354
 URL: https://issues.apache.org/jira/browse/NUTCH-1354
 Project: Nutch
  Issue Type: New Feature
Reporter: Ferdy Galema
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-1354.patch


 Like trunk, nutchgora should support the fetcher.queue.depth.multiplier 
 property too.


[jira] [Closed] (NUTCH-1354) nutchgora support fetcher.queue.depth.multiplier property

2012-05-07 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema closed NUTCH-1354.
---

Resolution: Fixed

committed

 nutchgora support fetcher.queue.depth.multiplier property
 -

 Key: NUTCH-1354
 URL: https://issues.apache.org/jira/browse/NUTCH-1354
 Project: Nutch
  Issue Type: New Feature
Reporter: Ferdy Galema
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-1354.patch


 Like trunk, nutchgora should support the fetcher.queue.depth.multiplier 
 property too.


[jira] [Updated] (NUTCH-1352) Improve regex urlfilters/normalizers synchronization

2012-05-07 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1352:


Fix Version/s: 1.5

 Improve regex urlfilters/normalizers synchronization
 

 Key: NUTCH-1352
 URL: https://issues.apache.org/jira/browse/NUTCH-1352
 Project: Nutch
  Issue Type: Improvement
Reporter: Ferdy Galema
 Fix For: nutchgora, 1.5

 Attachments: NUTCH-1352.patch


 I noticed that during fetching, the fetcher threads spend a lot of time 
 blocked on a monitor because of outlink normalizing/filtering. The cause: 
 some of the regex plugins synchronize on a single lock.
 This patch improves throughput by removing the synchronization locks and 
 replacing them with ThreadLocals where needed.
 It has been extensively tested in production. I will commit this later today 
 if there are no objections.


[jira] [Commented] (NUTCH-1352) Improve regex urlfilters/normalizers synchronization

2012-05-07 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269497#comment-13269497
 ] 

Ferdy Galema commented on NUTCH-1352:
-

This indeed applies to trunk too. (Except for a minor patch segment about a 
logging statement... quite irrelevant).

I'll commit it to trunk too.

 Improve regex urlfilters/normalizers synchronization
 

 Key: NUTCH-1352
 URL: https://issues.apache.org/jira/browse/NUTCH-1352
 Project: Nutch
  Issue Type: Improvement
Reporter: Ferdy Galema
 Fix For: nutchgora, 1.5

 Attachments: NUTCH-1352.patch


 I noticed that during fetching, the fetcher threads spend a lot of time 
 blocked on a monitor because of outlink normalizing/filtering. The cause: 
 some of the regex plugins synchronize on a single lock.
 This patch improves throughput by removing the synchronization locks and 
 replacing them with ThreadLocals where needed.
 It has been extensively tested in production. I will commit this later today 
 if there are no objections.


[jira] [Updated] (NUTCH-1352) Improve regex urlfilters/normalizers synchronization

2012-05-07 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1352:


Fix Version/s: 1.6 (was: 1.5)

On second thought, I will hold off on committing to trunk for now. (Feature 
freeze, I guess?)

 Improve regex urlfilters/normalizers synchronization
 

 Key: NUTCH-1352
 URL: https://issues.apache.org/jira/browse/NUTCH-1352
 Project: Nutch
  Issue Type: Improvement
Reporter: Ferdy Galema
 Fix For: nutchgora, 1.6

 Attachments: NUTCH-1352.patch


 I noticed that during fetching, the fetcher threads spend a lot of time 
 blocked on a monitor because of outlink normalizing/filtering. The cause: 
 some of the regex plugins synchronize on a single lock.
 This patch improves throughput by removing the synchronization locks and 
 replacing them with ThreadLocals where needed.
 It has been extensively tested in production. I will commit this later today 
 if there are no objections.


[jira] [Created] (NUTCH-1355) nutchgora Configure minimum throughput for fetcher

2012-05-07 Thread Ferdy Galema (JIRA)

Ferdy Galema created NUTCH-1355:
---

 Summary: nutchgora Configure minimum throughput for fetcher
 Key: NUTCH-1355
 URL: https://issues.apache.org/jira/browse/NUTCH-1355
 Project: Nutch
  Issue Type: New Feature
Reporter: Ferdy Galema
 Fix For: nutchgora


Like trunk, nutchgora should also have a feature to configure the fetcher with 
a minimum throughput. (See NUTCH-1067 for the work done by Markus).

It's implemented in almost the same way, except that only consecutive drops 
of throughput below the threshold are counted. (The counter is reset when 
throughput is healthy again; this should cope even better with temporary 
dips.)

Defaults to disabled. Will commit later today if there is no objection.
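
The counting rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the patch: the class ThroughputGuard, its parameter names and the convention that a negative minimum means "disabled" are all hypothetical.

```java
// Hypothetical sketch; names and the "-1 disables" convention are assumptions.
final class ThroughputGuard {
  private final int minPagesPerSecond;   // negative value disables the check
  private final int maxConsecutiveDips;  // tolerated dips in a row
  private int consecutiveDips = 0;

  ThroughputGuard(int minPagesPerSecond, int maxConsecutiveDips) {
    this.minPagesPerSecond = minPagesPerSecond;
    this.maxConsecutiveDips = maxConsecutiveDips;
  }

  /**
   * Records one throughput measurement.
   *
   * @return true when throughput has stayed below the minimum for more than
   *         maxConsecutiveDips measurements in a row
   */
  boolean record(int pagesPerSecond) {
    if (minPagesPerSecond < 0) {
      return false;                      // feature disabled
    }
    if (pagesPerSecond < minPagesPerSecond) {
      consecutiveDips++;                 // another dip in the current streak
    } else {
      consecutiveDips = 0;               // healthy again: forget earlier dips
    }
    return consecutiveDips > maxConsecutiveDips;
  }

  public static void main(String[] args) {
    ThroughputGuard guard = new ThroughputGuard(10, 2);
    int[] samples = {5, 5, 20, 5, 5, 5};
    for (int s : samples) {
      System.out.println(s + " pages/s -> stop=" + guard.record(s));
    }
  }
}
```

Because the counter is reset on every healthy measurement, isolated dips spread over a long fetch never add up to a stop condition; only a sustained slowdown does.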


[jira] [Updated] (NUTCH-1355) nutchgora Configure minimum throughput for fetcher

2012-05-07 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1355:


Attachment: NUTCH-1355.patch

 nutchgora Configure minimum throughput for fetcher
 --

 Key: NUTCH-1355
 URL: https://issues.apache.org/jira/browse/NUTCH-1355
 Project: Nutch
  Issue Type: New Feature
Reporter: Ferdy Galema
 Fix For: nutchgora

 Attachments: NUTCH-1355.patch


 Like trunk, nutchgora should also have a feature to configure the fetcher 
 with a minimum throughput. (See NUTCH-1067 for the work done by Markus).
 It's implemented in almost the same way, except that only consecutive drops 
 of throughput below the threshold are counted. (The counter is reset when 
 throughput is healthy again; this should cope even better with temporary 
 dips.)
 Defaults to disabled. Will commit later today if there is no objection.


[jira] [Created] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

2012-05-07 Thread Ferdy Galema (JIRA)

Ferdy Galema created NUTCH-1356:
---

 Summary: ParseUtil use ExecutorService instead of manually thread 
handling.
 Key: NUTCH-1356
 URL: https://issues.apache.org/jira/browse/NUTCH-1356
 Project: Nutch
  Issue Type: Improvement
Reporter: Ferdy Galema
 Fix For: nutchgora
 Attachments: NUTCH-1356.patch

Because ParseUtil manages its own parser threads by creating a new thread for 
every parse, some parsers become very expensive. For example, parsers that 
have ThreadLocal fields will initialize them again for every item to be 
parsed.

By simply introducing a caching ExecutorService, ParseUtil can reuse threads, 
which makes parsing more efficient. See attached patch.
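
A minimal sketch of the idea, using only the JDK. The class PooledParser and its method are hypothetical and the attached patch may be structured differently. A cached thread pool keeps idle worker threads around for reuse, so ThreadLocal state inside a parser survives from one parse to the next instead of being rebuilt in a fresh thread each time.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; not the actual ParseUtil code.
final class PooledParser {
  // Threads are created on demand and reused while they stay idle in the pool.
  private final ExecutorService executor = Executors.newCachedThreadPool();

  /** Runs the parse task on a pooled thread; returns null on timeout or failure. */
  String parseWithTimeout(Callable<String> parseTask, long timeoutMillis) {
    Future<String> future = executor.submit(parseTask);
    try {
      return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      future.cancel(true);                  // interrupt the runaway parse
      return null;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();   // preserve the interrupt status
      future.cancel(true);
      return null;
    } catch (ExecutionException e) {
      return null;                          // the parser itself threw
    }
  }

  void shutdown() {
    executor.shutdownNow();
  }

  public static void main(String[] args) {
    PooledParser parser = new PooledParser();
    System.out.println(parser.parseWithTimeout(() -> "parsed", 1000));
    parser.shutdown();
  }
}
```

The pool must be shut down when the owning job finishes, otherwise its idle threads can delay JVM exit.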


[jira] [Updated] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

2012-05-07 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1356:


Attachment: NUTCH-1356.patch

 ParseUtil use ExecutorService instead of manually thread handling.
 --

 Key: NUTCH-1356
 URL: https://issues.apache.org/jira/browse/NUTCH-1356
 Project: Nutch
  Issue Type: Improvement
Reporter: Ferdy Galema
 Fix For: nutchgora

 Attachments: NUTCH-1356.patch


 Because ParseUtil manages its own parser threads by creating a new thread for 
 every parse, some parsers become very expensive. For example, parsers that 
 have ThreadLocal fields will initialize them again for every item to be 
 parsed.
 By simply introducing a caching ExecutorService, ParseUtil can reuse threads, 
 which makes parsing more efficient. See attached patch.


[jira] [Updated] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

2012-05-07 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1356:


Fix Version/s: 1.6

Sure, I will create a patch for 1.x too. (It does not seem that different.)

 ParseUtil use ExecutorService instead of manually thread handling.
 --

 Key: NUTCH-1356
 URL: https://issues.apache.org/jira/browse/NUTCH-1356
 Project: Nutch
  Issue Type: Improvement
Reporter: Ferdy Galema
 Fix For: nutchgora, 1.6

 Attachments: NUTCH-1356.patch


 Because ParseUtil manages its own parser threads by creating a new thread for 
 every parse, some parsers become very expensive. For example, parsers that 
 have ThreadLocal fields will initialize them again for every item to be 
 parsed.
 By simply introducing a caching ExecutorService, ParseUtil can reuse threads, 
 which makes parsing more efficient. See attached patch.


[jira] [Updated] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

2012-05-07 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1356:


Attachment: NUTCH-1356-trunk.patch

Patch for trunk.

 ParseUtil use ExecutorService instead of manually thread handling.
 --

 Key: NUTCH-1356
 URL: https://issues.apache.org/jira/browse/NUTCH-1356
 Project: Nutch
  Issue Type: Improvement
Reporter: Ferdy Galema
 Fix For: nutchgora, 1.6

 Attachments: NUTCH-1356-trunk.patch, NUTCH-1356.patch


 Because ParseUtil manages its own parser threads by creating a new thread for 
 every parse, some parsers become very expensive. For example, parsers that 
 have ThreadLocal fields will initialize them again for every item to be 
 parsed.
 By simply introducing a caching ExecutorService, ParseUtil can reuse threads, 
 which makes parsing more efficient. See attached patch.


[jira] [Commented] (NUTCH-1352) Improve regex urlfilters/normalizers synchronization

2012-05-07 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269619#comment-13269619
 ] 

Ferdy Galema commented on NUTCH-1352:
-

Thanks.

 Improve regex urlfilters/normalizers synchronization
 

 Key: NUTCH-1352
 URL: https://issues.apache.org/jira/browse/NUTCH-1352
 Project: Nutch
  Issue Type: Improvement
Reporter: Ferdy Galema
 Fix For: nutchgora, 1.6

 Attachments: NUTCH-1352-1.6-1.patch, NUTCH-1352.patch


 I noticed that during fetching, the fetcher threads spend a lot of time 
 blocked on a monitor because of outlink normalizing/filtering. The cause: 
 some of the regex plugins synchronize on a single lock.
 This patch improves throughput by removing the synchronization locks and 
 replacing them with ThreadLocals where needed.
 It has been extensively tested in production. I will commit this later today 
 if there are no objections.


[jira] [Updated] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

2012-05-07 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1356:


Attachment: NUTCH-1356-trunk-v2.patch

It was working, though I guess that is because of a transitive dependency. 
Anyway, it's best to declare it as a direct dependency too. Patch v2 does 
this. (Version 11.0.2, the same as the jar that is already present.)

 ParseUtil use ExecutorService instead of manually thread handling.
 --

 Key: NUTCH-1356
 URL: https://issues.apache.org/jira/browse/NUTCH-1356
 Project: Nutch
  Issue Type: Improvement
Reporter: Ferdy Galema
 Fix For: nutchgora, 1.6

 Attachments: NUTCH-1356-trunk-v2.patch, NUTCH-1356-trunk.patch, 
 NUTCH-1356.patch


 Because ParseUtil manages its own parser threads by creating a new thread for 
 every parse, some parsers become very expensive. For example, parsers that 
 have ThreadLocal fields will initialize them again for every item to be 
 parsed.
 By simply introducing a caching ExecutorService, ParseUtil can reuse threads, 
 which makes parsing more efficient. See attached patch.


[jira] [Closed] (NUTCH-1355) nutchgora Configure minimum throughput for fetcher

2012-05-07 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema closed NUTCH-1355.
---

Resolution: Fixed

committed

 nutchgora Configure minimum throughput for fetcher
 --

 Key: NUTCH-1355
 URL: https://issues.apache.org/jira/browse/NUTCH-1355
 Project: Nutch
  Issue Type: New Feature
Reporter: Ferdy Galema
 Fix For: nutchgora

 Attachments: NUTCH-1355.patch


 Like trunk, nutchgora should also have a feature to configure the fetcher 
 with a minimum throughput. (See NUTCH-1067 for the work done by Markus).
 It's implemented in almost the same way, except that only consecutive drops 
 of throughput below the threshold are counted. (The counter is reset when 
 throughput is healthy again; this should cope even better with temporary 
 dips.)
 Defaults to disabled. Will commit later today if there is no objection.


[jira] [Commented] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

2012-05-07 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269695#comment-13269695
 ] 

Ferdy Galema commented on NUTCH-1356:
-

Committed to nutchgora.

 ParseUtil use ExecutorService instead of manually thread handling.
 --

 Key: NUTCH-1356
 URL: https://issues.apache.org/jira/browse/NUTCH-1356
 Project: Nutch
  Issue Type: Improvement
Reporter: Ferdy Galema
 Fix For: nutchgora, 1.6

 Attachments: NUTCH-1356-trunk-v2.patch, NUTCH-1356-trunk.patch, 
 NUTCH-1356.patch


 Because ParseUtil manages its own parser threads by creating a new thread for 
 every parse, some parsers become very expensive. For example, parsers that 
 have ThreadLocal fields will initialize them again for every item to be 
 parsed.
 By simply introducing a caching ExecutorService, ParseUtil can reuse threads, 
 which makes parsing more efficient. See attached patch.


[jira] [Commented] (NUTCH-1352) Improve regex urlfilters/normalizers synchronization

2012-05-07 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269697#comment-13269697
 ] 

Ferdy Galema commented on NUTCH-1352:
-

Committed to nutchgora.

 Improve regex urlfilters/normalizers synchronization
 

 Key: NUTCH-1352
 URL: https://issues.apache.org/jira/browse/NUTCH-1352
 Project: Nutch
  Issue Type: Improvement
Reporter: Ferdy Galema
 Fix For: nutchgora, 1.6

 Attachments: NUTCH-1352-1.6-1.patch, NUTCH-1352.patch


 I noticed that during fetching, the fetcher threads spend a lot of time 
 blocked on a monitor because of outlink normalizing/filtering. The cause: 
 some of the regex plugins synchronize on a single lock.
 This patch improves throughput by removing the synchronization locks and 
 replacing them with ThreadLocals where needed.
 It has been extensively tested in production. I will commit this later today 
 if there are no objections.


[jira] [Created] (NUTCH-1357) All gora mapreduce functionality should go through StorageUtils

2012-05-09 Thread Ferdy Galema (JIRA)

Ferdy Galema created NUTCH-1357:
---

 Summary: All gora mapreduce functionality should go through 
StorageUtils
 Key: NUTCH-1357
 URL: https://issues.apache.org/jira/browse/NUTCH-1357
 Project: Nutch
  Issue Type: Improvement
Reporter: Ferdy Galema
 Fix For: nutchgora


I am trying to make the concept of crawlId work for ALL Nutch jobs. The biggest 
obstacle seems to be the various ways Gora mapreduce is used in Nutch.

Some jobs use StorageUtils, some use GoraMapper/GoraReducer, and some even use 
GoraInputFormat/GoraOutputFormat directly. But the only place where a crawlId 
is translated into a schema name is StorageUtils! I am currently converting 
all Gora* mapreduce initialization code to StorageUtils calls.


[jira] [Created] (NUTCH-1358) Do not accept bogus arguments

2012-05-09 Thread Ferdy Galema (JIRA)

Ferdy Galema created NUTCH-1358:
---

 Summary: Do not accept bogus arguments
 Key: NUTCH-1358
 URL: https://issues.apache.org/jira/browse/NUTCH-1358
 Project: Nutch
  Issue Type: Improvement
Reporter: Ferdy Galema
Priority: Minor
 Fix For: nutchgora


Some of the tools do not explicitly check every passed argument for validity. 
This can mask very frustrating issues, because when one passes wrong arguments 
the tool does not fail fast.
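
As an illustration of the fail-fast behaviour asked for here, a sketch under assumptions: the option names -threads, -crawlId and -resume are used only as examples, and ArgCheckSketch is not the code from the patch. Every argument is either recognized or rejected immediately.

```java
import java.util.HashMap;
import java.util.Map;

class ArgCheckSketch {
  // Parses the arguments and throws on anything that is not recognized,
  // instead of silently ignoring it.
  static Map<String, String> parse(String[] args) {
    Map<String, String> opts = new HashMap<>();
    for (int i = 0; i < args.length; i++) {
      String arg = args[i];
      if ("-threads".equals(arg) || "-crawlId".equals(arg)) {
        if (i + 1 >= args.length) {
          throw new IllegalArgumentException("Missing value for " + arg);
        }
        opts.put(arg, args[++i]);
      } else if ("-resume".equals(arg)) {
        opts.put(arg, "true");
      } else {
        throw new IllegalArgumentException("Unrecognized argument: " + arg);
      }
    }
    return opts;
  }

  public static void main(String[] args) {
    System.out.println(parse(new String[] {"-threads", "10", "-resume"}));
  }
}
```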


[jira] [Updated] (NUTCH-1358) Do not accept bogus arguments

2012-05-09 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1358:


Attachment: NUTCH-1358.patch

 Do not accept bogus arguments
 -

 Key: NUTCH-1358
 URL: https://issues.apache.org/jira/browse/NUTCH-1358
 Project: Nutch
  Issue Type: Improvement
Reporter: Ferdy Galema
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-1358.patch


 Some of the tools do not explicitly check every passed argument for 
 validity. This can mask very frustrating issues, because when one passes 
 wrong arguments the tool does not fail fast.


[jira] [Closed] (NUTCH-1358) Do not accept bogus arguments

2012-05-09 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema closed NUTCH-1358.
---

Resolution: Fixed

Committed.

 Do not accept bogus arguments
 -

 Key: NUTCH-1358
 URL: https://issues.apache.org/jira/browse/NUTCH-1358
 Project: Nutch
  Issue Type: Improvement
Reporter: Ferdy Galema
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-1358.patch


 Some of the tools do not explicitly check every passed argument for 
 validity. This can mask very frustrating issues, because when one passes 
 wrong arguments the tool does not fail fast.


[jira] [Commented] (NUTCH-1357) All gora mapreduce functionality should go through StorageUtils

2012-05-09 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271421#comment-13271421
 ] 

Ferdy Galema commented on NUTCH-1357:
-

Side note: It seems some tools do need to call Gora* code directly but this 
does not matter as long as they pass around the DataStore that is created by 
using StorageUtils.createWebStore(..).

 All gora mapreduce functionality should go through StorageUtils
 ---

 Key: NUTCH-1357
 URL: https://issues.apache.org/jira/browse/NUTCH-1357
 Project: Nutch
  Issue Type: Improvement
Reporter: Ferdy Galema
 Fix For: nutchgora


 I am trying to make the concept of crawlId work for ALL Nutch jobs. The 
 biggest obstacle seems to be the various ways Gora mapreduce is used in Nutch.
 Some jobs use StorageUtils, some use GoraMapper/GoraReducer, and some even use 
 GoraInputFormat/GoraOutputFormat directly. But the only place where a crawlId 
 is translated into a schema name is StorageUtils! I am currently converting 
 all Gora* mapreduce initialization code to StorageUtils calls.


[jira] [Commented] (NUTCH-1363) Make parsing in FetcherJob actually work.

2012-05-09 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271445#comment-13271445
 ] 

Ferdy Galema commented on NUTCH-1363:
-

Hey Lewis,

This does work, with the -Dfetcher.parse=true option. Note that the -parse is 
not supported anymore.

 Make parsing in FetcherJob actually work.
 -

 Key: NUTCH-1363
 URL: https://issues.apache.org/jira/browse/NUTCH-1363
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
 Fix For: nutchgora


 We know that parsing during fetching is not recommended, however for those 
 that wish to dive into the abyss the functionality should be available. This 
 issue will address this.


[jira] [Issue Comment Edited] (NUTCH-1363) Make parsing in FetcherJob actually work.

2012-05-09 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271445#comment-13271445
 ] 

Ferdy Galema edited comment on NUTCH-1363 at 5/9/12 2:27 PM:
-

Hey Lewis,

This does work, with the -Dfetcher.parse=true option. Note that the -parse 
option is not supported anymore. (But it did the same thing).

  was (Author: ferdy.g):
Hey Lewis,

This does work, with the -Dfetcher.parse=true option. Note that the -parse is 
not supported anymore.
  
 Make parsing in FetcherJob actually work.
 -

 Key: NUTCH-1363
 URL: https://issues.apache.org/jira/browse/NUTCH-1363
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
 Fix For: nutchgora


 We know that parsing during fetching is not recommended, however for those 
 that wish to dive into the abyss the functionality should be available. This 
 issue will address this.


[jira] [Updated] (NUTCH-1357) All gora mapreduce functionality should go through StorageUtils

2012-05-09 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1357:


Fix Version/s: (was: nutchgora)

On second thought, this can be solved later.

 All gora mapreduce functionality should go through StorageUtils
 ---

 Key: NUTCH-1357
 URL: https://issues.apache.org/jira/browse/NUTCH-1357
 Project: Nutch
  Issue Type: Improvement
Reporter: Ferdy Galema

 I am trying to make the concept of crawlId work for ALL Nutch jobs. The 
 biggest obstacle seems to be the various ways Gora mapreduce is used in Nutch.
 Some jobs use StorageUtils, some use GoraMapper/GoraReducer, and some even use 
 GoraInputFormat/GoraOutputFormat directly. But the only place where a crawlId 
 is translated into a schema name is StorageUtils! I am currently converting 
 all Gora* mapreduce initialization code to StorageUtils calls.


[jira] [Commented] (NUTCH-1363) Make parsing in FetcherJob actually work.

2012-05-10 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272150#comment-13272150
 ] 

Ferdy Galema commented on NUTCH-1363:
-

I'm not sure I follow. What makes this property different from all the other 
properties?

In general, properties defined in nutch-default can be overridden using 
nutch-site (in both distributed and local mode) and finally using generic 
Hadoop -Dkey=value command-line options. Additionally, tools are able to 
provide specific arguments. For example, -threads 10 with the fetcher sets 
fetcher.threads.fetch to 10 in the configuration.
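
The override order can be pictured with a small stand-alone sketch. It uses java.util.Properties rather than the Hadoop Configuration class, handles only the -Dkey=value form, and the values in main are illustrative, so treat it as a model of the layering and not as the actual Nutch or Hadoop code.

```java
import java.util.Properties;

class ConfigLayersSketch {
  // Later layers win: nutch-default, then nutch-site, then command-line
  // -Dkey=value options and tool-specific arguments such as -threads.
  static Properties resolve(Properties nutchDefault, Properties nutchSite, String[] args) {
    Properties conf = new Properties();
    conf.putAll(nutchDefault);
    conf.putAll(nutchSite);
    for (int i = 0; i < args.length; i++) {
      String arg = args[i];
      int eq = arg.indexOf('=');
      if (arg.startsWith("-D") && eq > 2) {
        conf.setProperty(arg.substring(2, eq), arg.substring(eq + 1));
      } else if ("-threads".equals(arg) && i + 1 < args.length) {
        conf.setProperty("fetcher.threads.fetch", args[++i]);
      }
    }
    return conf;
  }

  public static void main(String[] args) {
    Properties defaults = new Properties();
    defaults.setProperty("fetcher.parse", "false");
    defaults.setProperty("fetcher.threads.fetch", "10");
    Properties site = new Properties();
    site.setProperty("fetcher.threads.fetch", "20");
    Properties conf = resolve(defaults, site,
        new String[] {"-Dfetcher.parse=true", "-threads", "50"});
    System.out.println(conf.getProperty("fetcher.parse"));
    System.out.println(conf.getProperty("fetcher.threads.fetch"));
  }
}
```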

 Make parsing in FetcherJob actually work.
 -

 Key: NUTCH-1363
 URL: https://issues.apache.org/jira/browse/NUTCH-1363
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
 Fix For: nutchgora


 We know that parsing during fetching is not recommended, however for those 
 that wish to dive into the abyss the functionality should be available. This 
 issue will address this.


[jira] [Created] (NUTCH-1365) Fix crawlId functionalilty by making using of new gora configuration

2012-05-10 Thread Ferdy Galema (JIRA)

Ferdy Galema created NUTCH-1365:
---

 Summary: Fix crawlId functionalilty by making using of new gora 
configuration
 Key: NUTCH-1365
 URL: https://issues.apache.org/jira/browse/NUTCH-1365
 Project: Nutch
  Issue Type: Bug
Reporter: Ferdy Galema
 Fix For: 2.1


With GORA-126 it is finally possible to make correct use of crawlId 
throughout Nutch. This patch changes StorageUtils so that the preferred schema 
name (crawlId + "_" + schema) is correctly set on Gora.
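
The naming rule can be written down as a tiny helper. SchemaNameSketch is a hypothetical name, and falling back to the plain schema name when no crawlId is given is an assumption of this sketch, not something stated in the issue.

```java
class SchemaNameSketch {
  // Builds the preferred schema name as crawlId + "_" + schema.
  // Assumption: without a crawlId the plain schema name is used.
  static String preferredSchemaName(String crawlId, String schema) {
    if (crawlId == null || crawlId.isEmpty()) {
      return schema;
    }
    return crawlId + "_" + schema;
  }

  public static void main(String[] args) {
    System.out.println(preferredSchemaName("testcrawl", "webpage"));
    System.out.println(preferredSchemaName("", "webpage"));
  }
}
```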


[jira] [Updated] (NUTCH-1365) Fix crawlId functionalilty by making using of new gora configuration

2012-05-10 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1365:


Attachment: NUTCH-1365.patch

 Fix crawlId functionalilty by making using of new gora configuration
 

 Key: NUTCH-1365
 URL: https://issues.apache.org/jira/browse/NUTCH-1365
 Project: Nutch
  Issue Type: Bug
Reporter: Ferdy Galema
 Fix For: 2.1

 Attachments: NUTCH-1365.patch


 With GORA-126 it is finally possible to make correct use of crawlId 
 throughout Nutch. This patch changes StorageUtils so that the preferred 
 schema name (crawlId + "_" + schema) is correctly set on Gora.


[jira] [Commented] (NUTCH-1306) Commit after finished writing to solr index

2012-05-10 Thread Ferdy Galema (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272231#comment-13272231
]

Ferdy Galema commented on NUTCH-1306:
-

Lewis,

Are you suggesting that we add the commit as implemented by the fix, but make it
conditional? Something like this:

if (getConf().getBoolean("solr.commit", true)) {
  solr.commit();
}

This makes it enabled by default. I think it is a good idea.

Secondly, you say that Nutchgora does not commit at all. It looks like trunk
does not commit either. I think it's a bit confusing that the COMMIT_SIZE Nutch
property does not trigger a Solr commit but rather flushes data to Solr. Perhaps
we could clarify this a bit more. (Update the property description by mentioning
the fact that it does NOT trigger a Solr commit.) Agree?

Commit after finished writing to solr index
---

Key: NUTCH-1306
URL: https://issues.apache.org/jira/browse/NUTCH-1306
Project: Nutch
Issue Type: Improvement
Components: indexer
Affects Versions: nutchgora
Reporter: Dan Rosher
Priority: Trivial
Fix For: 2.1

Attachments: NUTCH-1306.patch

Commit after finished writing to solr index - otherwise a bit confusing not
seeing the number of docs we expect in solr

[jira] [Closed] (NUTCH-1026) Strip UTF-8 non-character codepoints

2012-05-10 Thread Ferdy Galema (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Ferdy Galema closed NUTCH-1026.
---

Resolution: Fixed
Fix Version/s: (was: 2.1)
nutchgora

When indexing a huge dataset I ran into this issue too. The patch in NUTCH-1016
works fine. (Thanks Markus!) I verified and tested this. Committed at nutchgora.

Minor note: The patch checks for invalid chars ONLY on the content field of
the NutchDocument. But since the problem is most likely to only occur on this
field, it is okay for now.

Strip UTF-8 non-character codepoints

Key: NUTCH-1026
URL: https://issues.apache.org/jira/browse/NUTCH-1026
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: nutchgora
Reporter: Markus Jelsma
Fix For: nutchgora

During a very large crawl I found a few documents producing non-character
codepoints. When indexing to Solr this yields the following exception:
{code}
SEVERE: java.lang.RuntimeException: [was class
java.io.CharConversionException] Invalid UTF-8 character 0x at char
#1142033, byte #1155068)
at
com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
at
com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
{code}
Quite annoying! Here's a quick fix for SolrWriter that passes the value of the
content field to a method that strips away non-characters. I'm not too sure
about this implementation, but the tests I've done locally with a huge dataset
now pass correctly. Here's a list of codepoints to strip away:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
Please comment!
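
For readers who want to see what such a stripping method could look like, here is a sketch. It is not the patch from NUTCH-1016; the class NonCharStripSketch is hypothetical. It relies on the Unicode definition of noncharacters: U+FDD0 to U+FDEF, plus every code point whose low 16 bits are FFFE or FFFF.

```java
class NonCharStripSketch {
  // True for Unicode noncharacters: U+FDD0..U+FDEF and any code point
  // ending in FFFE or FFFF (U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF).
  static boolean isNonCharacter(int codePoint) {
    return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
        || (codePoint & 0xFFFE) == 0xFFFE;
  }

  // Copies the input code point by code point, skipping noncharacters.
  static String strip(String input) {
    StringBuilder out = new StringBuilder(input.length());
    for (int i = 0; i < input.length(); ) {
      int cp = input.codePointAt(i);
      i += Character.charCount(cp);
      if (!isNonCharacter(cp)) {
        out.appendCodePoint(cp);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    String dirty = "a" + (char) 0xFFFF + "b" + (char) 0xFDD0 + "c";
    System.out.println(strip(dirty));
  }
}
```

Iterating by code point instead of by char keeps supplementary characters intact while still catching noncharacters outside the Basic Multilingual Plane.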

[jira] [Updated] (NUTCH-1306) Commit after finished writing to solr index

2012-05-10 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1306:


Attachment: NUTCH-1306-v2.patch
NUTCH-1306-trunk.patch

Agreed on trying to make both branches match each other.

By the way there is a commit done after the whole job completes. (I previously 
thought there was no commit at all, but I was wrong). But, if this is the case, 
then the commit after closing a single indexwriter is not needed. (So the 
reason Dan is not seeing updates must have been a different problem).

Anyway, I've uploaded patches that make the commit after job completion 
configurable (but enabled by default). Let me know if there are comments.

 Commit after finished writing to solr index
 ---

 Key: NUTCH-1306
 URL: https://issues.apache.org/jira/browse/NUTCH-1306
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: nutchgora
Reporter: Dan Rosher
Priority: Trivial
 Fix For: 2.1

 Attachments: NUTCH-1306-trunk.patch, NUTCH-1306-v2.patch, 
 NUTCH-1306.patch


 Commit after finished writing to solr index - otherwise a bit confusing not 
 seeing the number of docs we expect in solr


[jira] [Updated] (NUTCH-1306) Commit after finished writing to solr index

2012-05-11 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1306:


Attachment: NUTCH-1306-trunk-v2.patch

Heh indeed that's not ready for committing yet. Weird though that my workspace 
did not get a compile error at first, only after refreshing the ivy deps. 
(Somehow it fetched a Gora library).

Anyway I've uploaded an updated patch.

I was not aware of NUTCH-1025. Is it ok if we incorporate that issue and rename 
this issue to "Add option to not commit and clarify existing solr.commit.size"?

 Commit after finished writing to solr index
 ---

 Key: NUTCH-1306
 URL: https://issues.apache.org/jira/browse/NUTCH-1306
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: nutchgora
Reporter: Dan Rosher
Priority: Trivial
 Fix For: 2.1

 Attachments: NUTCH-1306-trunk-v2.patch, NUTCH-1306-trunk.patch, 
 NUTCH-1306-v2.patch, NUTCH-1306.patch


 Commit after finished writing to solr index - otherwise a bit confusing not 
 seeing the number of docs we expect in solr


[jira] [Updated] (NUTCH-1362) Fix error handling of urls with empty fields

2012-05-11 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1362:


Attachment: NUTCH-1362.patch

Hey Lewis,

This patch fixes the problem and makes the reversing a bit faster by using 
StringUtils.split instead of String.split. (The latter compiles a regular 
expression every time a split is done. That's a bit excessive for simple dot 
and colon splitting.)

Tested and verified.
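
To make the idea concrete, a sketch under assumptions: it uses a hand-written single-character split in place of StringUtils.split (to stay free of external libraries), and returning the input unchanged when there is nothing to reverse is a choice made for this sketch, not necessarily what the patch does. ReverseHostSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

class ReverseHostSketch {
  // Splits on a single character without any regular expression.
  // Empty tokens are skipped, so "." yields an empty list.
  static List<String> splitOn(String s, char sep) {
    List<String> parts = new ArrayList<>();
    int start = 0;
    for (int i = 0; i <= s.length(); i++) {
      if (i == s.length() || s.charAt(i) == sep) {
        if (i > start) {
          parts.add(s.substring(start, i));
        }
        start = i + 1;
      }
    }
    return parts;
  }

  // Reverses the dot-separated components of a host name.
  static String reverseHost(String host) {
    List<String> parts = splitOn(host, '.');
    if (parts.isEmpty()) {
      return host; // guard: nothing to reverse, avoid indexing an empty result
    }
    StringBuilder sb = new StringBuilder();
    for (int i = parts.size() - 1; i >= 0; i--) {
      sb.append(parts.get(i));
      if (i > 0) {
        sb.append('.');
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(reverseHost("bar.foo.com"));
  }
}
```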

 Fix error handling of urls with empty fields 
 -

 Key: NUTCH-1362
 URL: https://issues.apache.org/jira/browse/NUTCH-1362
 Project: Nutch
  Issue Type: Bug
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
 Fix For: nutchgora

 Attachments: NUTCH-1362.patch


 Within o.a.n.util.TableUtil.reverseAppendSplits(), a simple 
 if (split.length > 0) block enables us to address this issue.


[jira] [Commented] (NUTCH-1362) Fix error handling of urls with empty fields

2012-05-11 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273129#comment-13273129
 ] 

Ferdy Galema commented on NUTCH-1362:
-

Btw this is a duplicate of NUTCH-1077.

 Fix error handling of urls with empty fields 
 -

 Key: NUTCH-1362
 URL: https://issues.apache.org/jira/browse/NUTCH-1362
 Project: Nutch
  Issue Type: Bug
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
 Fix For: nutchgora

 Attachments: NUTCH-1362.patch


 Within o.a.n.util.TableUtil.reverseAppendSplits(), a simple 
 if (split.length > 0) block enables us to address this issue.


[jira] [Closed] (NUTCH-1077) Nutch 2 DbUpdateMapper throws ArrayOutOfBoundsException when running update

2012-05-11 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema closed NUTCH-1077.
---

   Resolution: Duplicate
Fix Version/s: (was: 2.1)

Will be fixed with NUTCH-1362. (Use attached patch or wait for commit.)

Thanks for reporting.

 Nutch 2 DbUpdateMapper throws ArrayOutOfBoundsException when running update
 ---

 Key: NUTCH-1077
 URL: https://issues.apache.org/jira/browse/NUTCH-1077
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: nutchgora
 Environment: CentOS 5 Linux with CDH3 Hadoop.
Reporter: Tom Davidson

 I got this error when running a simple nutch update after doing  a small 
 fetch and parse.
 java.lang.ArrayIndexOutOfBoundsException: 0
 at 
 org.apache.nutch.util.TableUtil.reverseAppendSplits(TableUtil.java:126)
 at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:66)
 at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
 at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:70)
 at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:36)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
 at org.apache.hadoop.mapred.Child.main(Child.java:264)


[jira] [Closed] (NUTCH-1362) Fix error handling of urls with empty fields

2012-05-11 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema closed NUTCH-1362.
---

Resolution: Fixed

Done! Thanks.

 Fix error handling of urls with empty fields 
 -

 Key: NUTCH-1362
 URL: https://issues.apache.org/jira/browse/NUTCH-1362
 Project: Nutch
  Issue Type: Bug
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
 Fix For: nutchgora

 Attachments: NUTCH-1362.patch


 Within o.a.n.util.TableUtil.reverseAppendSplits(), a simple 
 if (split.length > 0) block enables us to address this issue.


[jira] [Created] (NUTCH-1366) speed up indexing by eliminating the indexreducer

2012-05-11 Thread Ferdy Galema (JIRA)

Ferdy Galema created NUTCH-1366:
---

 Summary: speed up indexing by eliminating the indexreducer
 Key: NUTCH-1366
 URL: https://issues.apache.org/jira/browse/NUTCH-1366
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Ferdy Galema
 Fix For: nutchgora


Currently the indexer in Nutchgora consists of both mappers and reducers. But 
the reduce code does not actually iterate over any (grouped/sorted) values. It 
simply indexes individual key/value (String/WebPage) pairs. Therefore, by moving 
this indexing code to the mapper we can eliminate the reduce step, which makes 
the indexing job much faster. (No more unnecessary spilling to disk/network and 
no CPU wasted on sorting.)

Note this is not (directly) applicable to trunk because trunk uses a quite 
different approach. Different types of input are combined to a single value in 
the reducer. Although I think it is possible to implement a similar 
optimization I am not sure how to do this. So if anyone wants this for trunk 
too feel free to implement a similar patch.


[jira] [Updated] (NUTCH-1366) speed up indexing by eliminating the indexreducer

2012-05-11 Thread Ferdy Galema (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Ferdy Galema updated NUTCH-1366:

Attachment: NUTCH-1366.patch

speed up indexing by eliminating the indexreducer
-

Key: NUTCH-1366
URL: https://issues.apache.org/jira/browse/NUTCH-1366
Project: Nutch
Issue Type: Improvement
Components: indexer
Reporter: Ferdy Galema
Fix For: nutchgora

Attachments: NUTCH-1366.patch

Currently the indexer in Nutchgora consists of both mappers and reducers. But
the reduce code does not actually iterate over any (grouped/sorted) values.
It simply indexes individual key/value (String/WebPage) pairs. Therefore, by
moving this indexing code to the mapper we can eliminate the reduce step,
which makes the indexing job much faster. (No more unnecessary spilling
to disk/network and no CPU wasted on sorting.)
Note this is not (directly) applicable to trunk because trunk uses a quite
different approach. Different types of input are combined to a single value
in the reducer. Although I think it is possible to implement a similar
optimization I am not sure how to do this. So if anyone wants this for trunk
too feel free to implement a similar patch.

[jira] [Commented] (NUTCH-1366) speed up indexing by eliminating the indexreducer

2012-05-11 Thread Ferdy Galema (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273335#comment-13273335
]

Ferdy Galema commented on NUTCH-1366:
-

The cool part about Nutchgora is that inlinks are already populated for the row
that is fed into the indexer. The DbUpdateReducer does this outlink
inverting as part of updating the db.

Btw it's very simple to reinstate the reducer, if we need to have one again.

speed up indexing by eliminating the indexreducer
-

Key: NUTCH-1366
URL: https://issues.apache.org/jira/browse/NUTCH-1366
Project: Nutch
Issue Type: Improvement
Components: indexer
Reporter: Ferdy Galema
Fix For: nutchgora

Attachments: NUTCH-1366.patch

[jira] [Commented] (NUTCH-1367) Port ParserChecker to Nutchgora

2012-05-14 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274590#comment-13274590
 ] 

Ferdy Galema commented on NUTCH-1367:
-

Hey Lewis,

This tool is already present in Nutchgora.

 Port ParserChecker to Nutchgora
 ---

 Key: NUTCH-1367
 URL: https://issues.apache.org/jira/browse/NUTCH-1367
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
 Fix For: 2.1


 This is such a great tool. It has come in handy so many times I would go blue 
 in the face if I had to try and count. e.g. for (int i = 0; i < infinity; i++)
 I think you get the idea.  


[jira] [Closed] (NUTCH-1366) speed up indexing by eliminating the indexreducer

2012-05-14 Thread Ferdy Galema (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Ferdy Galema closed NUTCH-1366.
---

Resolution: Fixed

committed

speed up indexing by eliminating the indexreducer
-

Key: NUTCH-1366
URL: https://issues.apache.org/jira/browse/NUTCH-1366
Project: Nutch
Issue Type: Improvement
Components: indexer
Reporter: Ferdy Galema
Fix For: nutchgora

Attachments: NUTCH-1366.patch

[jira] [Commented] (NUTCH-879) URL-s getting lost

2012-05-23 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281459#comment-13281459
 ] 

Ferdy Galema commented on NUTCH-879:


Agree to fix this issue later. Although I could not yet get to the bottom of 
this, I'm pretty sure the issue is not as severe as originally reported. (Based 
on current experiences with running Nutchgora in production.)

 URL-s getting lost
 --

 Key: NUTCH-879
 URL: https://issues.apache.org/jira/browse/NUTCH-879
 Project: Nutch
  Issue Type: Bug
Affects Versions: nutchgora
 Environment: * Ubuntu 10.4 x64, Sun JDK 1.6
 * using 1-node Hadoop + HDFS
 * trunk r983472, using MySQL store
 * branch-1.3
Reporter: Andrzej Bialecki 
 Fix For: 2.1

 Attachments: branch-1.3-bench.txt, trunk-bench.txt


 I ran the Benchmark using branch-1.3 and trunk (formerly nutchbase). With the 
 same Benchmark parameters and the same plugins branch-1.3 collects ~1.5mln 
 urls, while trunk collects ~20,000 urls. Clearly something is wrong.


[jira] [Created] (NUTCH-1378) HostDb NullPointerException

2012-05-23 Thread Ferdy Galema (JIRA)

Ferdy Galema created NUTCH-1378:
---

 Summary: HostDb NullPointerException
 Key: NUTCH-1378
 URL: https://issues.apache.org/jira/browse/NUTCH-1378
 Project: Nutch
  Issue Type: Bug
Reporter: Ferdy Galema
 Fix For: nutchgora


This is a no-brainer to fix a NPE when using the HostDb functionality. Will 
attach patch and commit right away.


[jira] [Updated] (NUTCH-1378) HostDb NullPointerException

2012-05-23 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1378:


Attachment: NUTCH-1378.patch

 HostDb NullPointerException
 ---

 Key: NUTCH-1378
 URL: https://issues.apache.org/jira/browse/NUTCH-1378
 Project: Nutch
  Issue Type: Bug
Reporter: Ferdy Galema
 Fix For: nutchgora

 Attachments: NUTCH-1378.patch


 This is a no-brainer fix for an NPE when using the HostDb functionality. I will 
 attach a patch and commit right away.


[jira] [Closed] (NUTCH-1378) HostDb NullPointerException

2012-05-23 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema closed NUTCH-1378.
---

Resolution: Fixed

 HostDb NullPointerException
 ---

 Key: NUTCH-1378
 URL: https://issues.apache.org/jira/browse/NUTCH-1378
 Project: Nutch
  Issue Type: Bug
Reporter: Ferdy Galema
 Fix For: nutchgora

 Attachments: NUTCH-1378.patch


 This is a no-brainer fix for an NPE when using the HostDb functionality. I will 
 attach a patch and commit right away.


[jira] [Commented] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

2012-05-30 Thread Ferdy Galema (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285510#comment-13285510
]

Ferdy Galema commented on NUTCH-1356:
-

I find it difficult to believe those exceptions are caused by this patch. It
does not change the way exceptions/timeouts are handled; it only makes sure
parser threads are reused.

It seems you are suffering from two types of (unrelated) exceptions. The first
is ExecutionException. FutureTask.get() throws it whenever the task itself
throws an exception that is not caught anywhere inside the task. In your case
this seems to be an NPE during the parse of the HTML page. It might be a bug,
but then again it is strange that it is not reproducible with the
ParserChecker. (Are you sure about this?)

The second is TimeoutException, thrown whenever FutureTask.get() cannot
complete within the specified timeout. The tricky part is that single URLs
might be perfectly able to complete within the timeout, but under heavy
concurrent load (a lot of semi-expensive parses) the parser load might stack
up and cause many parses to time out. This can be the case with parsing during
fetch. But it can also happen when using a separate ParserJob, because Parser
implementations do not necessarily respond to a thread interrupt (which is
fired by the task.cancel(true) call). If a parser does not check the
Thread.interrupted() state at regular intervals, it will just continue to run
and eat up resources. I find it very helpful to debug stalling
fetchers/parsers with the lazy man's profiler: kill -QUIT process_id. This
will dump stack traces, sometimes exposing the fact that hundreds of parser
threads are still active in the background. (Of course, many of them already
timed out a long time ago.)
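
The two failure modes described above can be reproduced with plain
java.util.concurrent. This is only a minimal sketch, not the ParseUtil code:
the class name, the runWithTimeout helper and the returned status strings are
made up for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class ParseTimeoutSketch {

  /** Runs one (pretend) parse with a timeout and reports how it ended. */
  static String runWithTimeout(ExecutorService pool, Callable<String> parse,
                               long timeoutMillis) {
    Future<String> task = pool.submit(parse);
    try {
      return task.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (ExecutionException e) {
      // The parse itself threw; the original exception is the cause.
      return "failed: " + e.getCause().getClass().getSimpleName();
    } catch (TimeoutException e) {
      // cancel(true) only requests an interrupt. A parse that never checks
      // its interrupted state simply keeps running in the background.
      task.cancel(true);
      return "timed out";
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return "interrupted";
    }
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newCachedThreadPool();
    try {
      System.out.println(runWithTimeout(pool, () -> "ok", 1000));
      System.out.println(runWithTimeout(pool,
          () -> { throw new NullPointerException(); }, 1000));
      System.out.println(runWithTimeout(pool,
          () -> { Thread.sleep(5000); return "late"; }, 100));
    } finally {
      pool.shutdownNow();
    }
  }
}
```

The slow task here does stop, but only because Thread.sleep() reacts to the
interrupt. A CPU-bound parse that never checks the flag would not.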

ParseUtil use ExecutorService instead of manually thread handling.
--

Key: NUTCH-1356
URL: https://issues.apache.org/jira/browse/NUTCH-1356
Project: Nutch
Issue Type: Improvement
Reporter: Ferdy Galema
Fix For: nutchgora, 1.6

Attachments: NUTCH-1356-trunk-v2.patch, NUTCH-1356-trunk.patch,
NUTCH-1356.patch

Because ParseUtil manages its own parser threads by creating a new thread for
every parse, specific parsers can become very expensive.
For example, parsers that have thread-local fields will initialize them for
every item to be parsed.
By simply introducing a caching ExecutorService, ParseUtil will be able to
reuse threads, making parsing more efficient. See the attached patch.
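
To illustrate the effect described above, here is a small sketch (not the
attached patch) that counts how often a thread-local is initialized with one
thread per parse versus a cached thread pool. The class, the counter and the
StringBuilder stand-in for an expensive parser field are all hypothetical.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

final class ThreadReuseSketch {
  /** Counts how often the (pretend) expensive per-thread state is built. */
  static final AtomicInteger INITS = new AtomicInteger();

  /** Stand-in for a parser's thread-local field. */
  static final ThreadLocal<StringBuilder> BUFFER = ThreadLocal.withInitial(() -> {
    INITS.incrementAndGet();
    return new StringBuilder();
  });

  /** One new thread per parse: the thread-local is rebuilt every time. */
  static int initsWithThreadPerParse(int parses) {
    INITS.set(0);
    try {
      for (int i = 0; i < parses; i++) {
        Thread t = new Thread(() -> BUFFER.get());
        t.start();
        t.join();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return INITS.get();
  }

  /** Cached pool: an idle worker thread (and its thread-local) can be reused. */
  static int initsWithCachedPool(int parses) {
    INITS.set(0);
    ExecutorService pool = Executors.newCachedThreadPool();
    try {
      Runnable parse = () -> BUFFER.get();
      for (int i = 0; i < parses; i++) {
        pool.submit(parse).get(); // run the parses one after another
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    } finally {
      pool.shutdown();
    }
    return INITS.get();
  }

  public static void main(String[] args) {
    System.out.println("thread per parse: " + initsWithThreadPerParse(20));
    System.out.println("cached pool:      " + initsWithCachedPool(20));
  }
}
```

With one thread per parse the count always equals the number of parses. With
the cached pool it is usually much lower, although the exact number depends on
whether a worker is already idle when the next task is submitted.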

[jira] [Created] (NUTCH-1379) NPE when reprUrl is null in ParseUtil

2012-05-30 Thread Ferdy Galema (JIRA)

Ferdy Galema created NUTCH-1379:
---

 Summary: NPE when reprUrl is null in ParseUtil
 Key: NUTCH-1379
 URL: https://issues.apache.org/jira/browse/NUTCH-1379
 Project: Nutch
  Issue Type: Bug
Reporter: Ferdy Galema





[jira] [Updated] (NUTCH-1379) NPE when reprUrl is null in ParseUtil

2012-05-30 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1379:


Attachment: NUTCH-1379.patch

Committed.

 NPE when reprUrl is null in ParseUtil
 -

 Key: NUTCH-1379
 URL: https://issues.apache.org/jira/browse/NUTCH-1379
 Project: Nutch
  Issue Type: Bug
Reporter: Ferdy Galema
 Attachments: NUTCH-1379.patch





[jira] [Reopened] (NUTCH-1379) NPE when reprUrl is null in ParseUtil

2012-05-30 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema reopened NUTCH-1379:
-


 NPE when reprUrl is null in ParseUtil
 -

 Key: NUTCH-1379
 URL: https://issues.apache.org/jira/browse/NUTCH-1379
 Project: Nutch
  Issue Type: Bug
Reporter: Ferdy Galema
 Attachments: NUTCH-1379.patch





[jira] [Closed] (NUTCH-1379) NPE when reprUrl is null in ParseUtil

2012-05-30 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema closed NUTCH-1379.
---

Resolution: Fixed

 NPE when reprUrl is null in ParseUtil
 -

 Key: NUTCH-1379
 URL: https://issues.apache.org/jira/browse/NUTCH-1379
 Project: Nutch
  Issue Type: Bug
Reporter: Ferdy Galema
 Attachments: NUTCH-1379.patch





[jira] [Closed] (NUTCH-1379) NPE when reprUrl is null in ParseUtil

2012-05-30 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema closed NUTCH-1379.
---

Resolution: Fixed

 NPE when reprUrl is null in ParseUtil
 -

 Key: NUTCH-1379
 URL: https://issues.apache.org/jira/browse/NUTCH-1379
 Project: Nutch
  Issue Type: Bug
Reporter: Ferdy Galema
 Fix For: nutchgora

 Attachments: NUTCH-1379.patch


 Sometimes reprUrl is null in ParseUtil. Exact cause is still fuzzy but this 
 is a nice workaround for now.


[jira] [Updated] (NUTCH-1379) NPE when reprUrl is null in ParseUtil

2012-05-30 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1379:


  Description: Sometimes reprUrl is null in ParseUtil. Exact cause is still 
fuzzy but this is a nice workaround for now.
Fix Version/s: nutchgora

 NPE when reprUrl is null in ParseUtil
 -

 Key: NUTCH-1379
 URL: https://issues.apache.org/jira/browse/NUTCH-1379
 Project: Nutch
  Issue Type: Bug
Reporter: Ferdy Galema
 Fix For: nutchgora

 Attachments: NUTCH-1379.patch


 Sometimes reprUrl is null in ParseUtil. Exact cause is still fuzzy but this 
 is a nice workaround for now.


[jira] [Commented] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

2012-06-12 Thread Ferdy Galema (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293510#comment-13293510
]

Ferdy Galema commented on NUTCH-1356:
-

Thanks.

The parser threads you refer to, is that a known problem? Can we solve it?
To solve it correctly, every parser should check its interrupted state at
regular intervals. This is a pretty huge task considering the number of
parsers. For now it is something to keep in mind. I'll create an issue for
reference.

ParseUtil use ExecutorService instead of manually thread handling.
--

Key: NUTCH-1356
URL: https://issues.apache.org/jira/browse/NUTCH-1356
Project: Nutch
Issue Type: Improvement
Reporter: Ferdy Galema
Fix For: nutchgora, 1.6

Attachments: NUTCH-1356-trunk-v2.patch, NUTCH-1356-trunk.patch,
NUTCH-1356.patch

[jira] [Created] (NUTCH-1387) All parsers should respond to cancellation.

2012-06-12 Thread Ferdy Galema (JIRA)

Ferdy Galema created NUTCH-1387:
---

 Summary: All parsers should respond to cancellation.
 Key: NUTCH-1387
 URL: https://issues.apache.org/jira/browse/NUTCH-1387
 Project: Nutch
  Issue Type: Bug
Reporter: Ferdy Galema


During parsing a TimeoutException can occur. It is thrown whenever 
FutureTask.get() cannot complete within the specified timeout. The tricky 
part is that single URLs might be perfectly able to complete within the 
timeout, but under heavy concurrent load (a lot of semi-expensive parses) 
the parser load might stack up and cause many parses to time out. This can be 
the case with parsing during fetch. But it can also happen when using a 
separate ParserJob, because Parser implementations do not necessarily respond 
to a thread interrupt (which is fired by the task.cancel(true) call). If a 
parser does not check the Thread.interrupted() state at regular intervals, it 
will just continue to run and eat up resources. I find it very helpful to 
debug stalling fetchers/parsers with the lazy man's profiler: kill -QUIT 
process_id. This will dump stack traces, sometimes exposing the fact that 
hundreds of parser threads are still active in the background. (Of course, 
many of them already timed out a long time ago.)

To fix this, every parser should check its interrupted state at regular 
intervals. (For example, an HTML parse might be stuck walking the DOM tree, so 
checking after every Nth element would be an appropriate moment.)

This issue is for reference first. Fixing it all at once would be a huge task.
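
The "check after every Nth element" idea could look roughly like the sketch
below. It is not taken from any Nutch parser: the Node type, the CHECK_EVERY
interval and the choice to throw InterruptedException are assumptions made for
illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class InterruptibleWalkSketch {

  /** Hypothetical stand-in for a DOM node. */
  static final class Node {
    final List<Node> children = new ArrayList<>();
  }

  /** Assumed interval; how often to poll the interrupted flag. */
  static final int CHECK_EVERY = 1000;

  /** Walks the tree and gives up promptly once the thread is interrupted. */
  static int countNodes(Node root) throws InterruptedException {
    Deque<Node> stack = new ArrayDeque<>();
    stack.push(root);
    int visited = 0;
    while (!stack.isEmpty()) {
      Node node = stack.pop();
      visited++;
      // Thread.interrupted() also clears the flag, so report it by throwing.
      if (visited % CHECK_EVERY == 0 && Thread.interrupted()) {
        throw new InterruptedException("parse cancelled after " + visited + " nodes");
      }
      for (Node child : node.children) {
        stack.push(child);
      }
    }
    return visited;
  }

  public static void main(String[] args) throws InterruptedException {
    Node root = new Node();
    for (int i = 0; i < 4999; i++) {
      root.children.add(new Node());
    }
    System.out.println("visited " + countNodes(root) + " nodes");
  }
}
```

Polling only every Nth node keeps the overhead negligible while still bounding
how long a cancelled parse keeps running.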


[jira] [Updated] (NUTCH-1387) All parsers should respond to cancellation / interrupts.

2012-06-12 Thread Ferdy Galema (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Ferdy Galema updated NUTCH-1387:

Component/s: parser
Summary: All parsers should respond to cancellation / interrupts.
(was: All parsers should respond to cancellation.)

All parsers should respond to cancellation / interrupts.

Key: NUTCH-1387
URL: https://issues.apache.org/jira/browse/NUTCH-1387
Project: Nutch
Issue Type: Bug
Components: parser
Reporter: Ferdy Galema

During parsing a TimeoutException can occur. It is thrown whenever
FutureTask.get() cannot complete within the specified timeout. The tricky
part is that single URLs might be perfectly able to complete within the
timeout, but under heavy concurrent load (a lot of semi-expensive parses)
the parser load might stack up and cause many parses to time out. This can be
the case with parsing during fetch. But it can also happen when using a
separate ParserJob, because Parser implementations do not necessarily respond
to a thread interrupt (which is fired by the task.cancel(true) call). If a
parser does not check the Thread.interrupted() state at regular intervals, it
will just continue to run and eat up resources. I find it very helpful to
debug stalling fetchers/parsers with the lazy man's profiler: kill -QUIT
process_id. This will dump stack traces, sometimes exposing the fact that
hundreds of parser threads are still active in the background. (Of course,
many of them already timed out a long time ago.)
To fix this, every parser should check its interrupted state at regular
intervals. (For example, an HTML parse might be stuck walking the DOM tree, so
checking after every Nth element would be an appropriate moment.)
This issue is for reference first. Fixing it all at once would be a huge task.
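
When sending kill -QUIT is inconvenient, roughly the same information is
available from inside the JVM through Thread.getAllStackTraces(). The sketch
below is only an illustration; the "stuck-parser-" thread names are invented,
and real parser threads will be named differently.

```java
import java.util.Map;
import java.util.concurrent.CountDownLatch;

final class ThreadDumpSketch {

  /** Counts live threads whose name starts with the given prefix. */
  static long countLiveThreads(String namePrefix) {
    return Thread.getAllStackTraces().keySet().stream()
        .filter(t -> t.getName().startsWith(namePrefix))
        .count();
  }

  public static void main(String[] args) {
    // Simulate parser threads that are still hanging around.
    CountDownLatch release = new CountDownLatch(1);
    for (int i = 0; i < 3; i++) {
      Thread t = new Thread(() -> {
        try {
          release.await();
        } catch (InterruptedException e) {
          // fall through and exit
        }
      }, "stuck-parser-" + i);
      t.setDaemon(true);
      t.start();
    }

    System.out.println("lingering: " + countLiveThreads("stuck-parser-"));
    for (Map.Entry<Thread, StackTraceElement[]> e
        : Thread.getAllStackTraces().entrySet()) {
      if (e.getKey().getName().startsWith("stuck-parser-")) {
        System.out.println(e.getKey().getName() + ": " + e.getValue().length + " frames");
      }
    }
    release.countDown();
  }
}
```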

[jira] [Commented] (NUTCH-1342) Read time out protocol-http

2012-06-13 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13294429#comment-13294429
 ] 

Ferdy Galema commented on NUTCH-1342:
-

Do you have any clue as to why protocol-httpclient has a different behaviour?

Also, two suggestions for your patch:

Perhaps you could fine-tune the mechanism by allowing a configurable number of 
timeouts before definitively failing. Something like:
if (++timeoutRetries > this.allowedNumberOfTimeoutRetries) throw e; // rethrow

Secondly, could you specifically catch SocketTimeoutException? (I'm not sure 
whether there are other IOExceptions that shouldn't be caught in any case.)
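
Both suggestions together might look like the sketch below. It is not the
NUTCH-1342 patch: the Read interface and readWithRetries helper are
hypothetical, and it assumes the intended comparison in the one-liner above is
"greater than".

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

final class TimeoutRetrySketch {

  /** Hypothetical stand-in for one attempt at reading a response. */
  interface Read {
    String attempt() throws IOException;
  }

  /**
   * Retries only on SocketTimeoutException, at most
   * allowedNumberOfTimeoutRetries times. Any other IOException propagates
   * immediately because it is not caught here.
   */
  static String readWithRetries(Read read, int allowedNumberOfTimeoutRetries)
      throws IOException {
    int timeoutRetries = 0;
    while (true) {
      try {
        return read.attempt();
      } catch (SocketTimeoutException e) {
        if (++timeoutRetries > allowedNumberOfTimeoutRetries) {
          throw e; // rethrow
        }
      }
    }
  }

  public static void main(String[] args) throws IOException {
    int[] calls = {0};
    // Times out twice, then succeeds.
    Read flaky = () -> {
      if (calls[0]++ < 2) {
        throw new SocketTimeoutException("Read timed out");
      }
      return "body";
    };
    System.out.println(readWithRetries(flaky, 2) + " after " + calls[0] + " attempts");
  }
}
```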

 Read time out protocol-http
 ---

 Key: NUTCH-1342
 URL: https://issues.apache.org/jira/browse/NUTCH-1342
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.4, 1.5
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Critical
 Fix For: 1.6

 Attachments: NUTCH-1342-1.6-1.patch


 For some reason some URLs always time out with protocol-http but not 
 protocol-httpclient. The stack trace is always the same:
 {code}
 2012-04-20 11:25:44,275 ERROR http.Http - Failed to get protocol output
 java.net.SocketTimeoutException: Read timed out
 at java.net.SocketInputStream.socketRead0(Native Method)
 at java.net.SocketInputStream.read(SocketInputStream.java:129)
 at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
 at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
 at java.io.FilterInputStream.read(FilterInputStream.java:116)
 at java.io.PushbackInputStream.read(PushbackInputStream.java:169)
 at java.io.FilterInputStream.read(FilterInputStream.java:90)
 at org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:228)
 at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:157)
 at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
 at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:138)
 {code}
 Some example URLs:
 * 404 http://www.fcgroningen.nl/tribunenamen/stemmen/
 * 301 http://shop.fcgroningen.nl/aanbieding


[jira] [Updated] (NUTCH-1392) -force and -resume arguments being ignored in ParserJob

2012-06-14 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1392:


Attachment: NUTCH-1392.patch

 -force and -resume arguments being ignored in ParserJob
 ---

 Key: NUTCH-1392
 URL: https://issues.apache.org/jira/browse/NUTCH-1392
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
 Fix For: 2.1

 Attachments: NUTCH-1392.patch


 From the log below there is obviously something not right here as both 
 -resume and -force are passed to the CLI but blatantly ignored within the log 
 output.
 lewis@lewis:~/ASF/nutchgora/runtime/local$ ./bin/nutch parse
 Usage: ParserJob (batchId | -all) [-crawlId id] [-resume] [-force]
     batchId     - symbolic batch ID created by Generator
     -crawlId id - the id to prefix the schemas to operate on,
                   (default: storage.crawl.id)
     -all        - consider pages from all crawl jobs
     -resume     - resume a previous incomplete job
     -force      - force re-parsing even if a page is already parsed
 lewis@lewis:~/ASF/nutchgora/runtime/local$ ./bin/nutch parse -all -resume -force
 ParserJob: starting
 ParserJob: resuming: false
 ParserJob: forced reparse: false
 ParserJob: parsing all
 Parsing http://www.trancearoundtheworld.com/
 ParserJob: success


[jira] [Commented] (NUTCH-1081) ant tests fail

2012-06-15 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295676#comment-13295676
 ] 

Ferdy Galema commented on NUTCH-1081:
-

Yes this one should be closed.

 ant tests fail 
 ---

 Key: NUTCH-1081
 URL: https://issues.apache.org/jira/browse/NUTCH-1081
 Project: Nutch
  Issue Type: Bug
  Components: fetcher, generator, injector, storage
Affects Versions: nutchgora
 Environment: Ubuntu release 11.04 (natty)
 Kernel Linux 2.6.38-10-generic
 GNOME 2.32.1
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Critical
 Fix For: 2.1


 The following tests fail when running ant test on trunk 2.0
 {code}
 [junit] Running org.apache.nutch.api.TestAPI
 [junit] Tests run: 4, Failures: 1, Errors: 0, Time elapsed: 11.028 sec
 [junit] Test org.apache.nutch.api.TestAPI FAILED
 [junit] Running org.apache.nutch.crawl.TestGenerator
 [junit] Tests run: 4, Failures: 0, Errors: 4, Time elapsed: 0.478 sec
 [junit] Test org.apache.nutch.crawl.TestGenerator FAILED
 [junit] Running org.apache.nutch.crawl.TestInjector
 [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.474 sec
 [junit] Test org.apache.nutch.crawl.TestInjector FAILED
 [junit] Running org.apache.nutch.fetcher.TestFetcher
 [junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 0.526 sec
 [junit] Test org.apache.nutch.fetcher.TestFetcher FAILED
 [junit] Running org.apache.nutch.storage.TestGoraStorage
 [junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 0.468 sec
 [junit] Test org.apache.nutch.storage.TestGoraStorage FAILED
 {code}


[jira] [Created] (NUTCH-1411) nutchgora fetcher.store.content does not work

2012-06-27 Thread Ferdy Galema (JIRA)

Ferdy Galema created NUTCH-1411:
---

 Summary: nutchgora fetcher.store.content does not work
 Key: NUTCH-1411
 URL: https://issues.apache.org/jira/browse/NUTCH-1411
 Project: Nutch
  Issue Type: Bug
Affects Versions: nutchgora
Reporter: Ferdy Galema
Priority: Minor


http://lucene.472066.n3.nabble.com/parse-and-solrindex-in-nutch-2-0-td3991247.html

The property fetcher.store.content doesn't do anything. Content is always 
stored. Fix or remove property, what do you think?


[jira] [Updated] (NUTCH-1306) Add option to not commit and clarify existing solr.commit.size

2012-07-04 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1306:


Summary: Add option to not commit and clarify existing solr.commit.size  
(was: Commit after finished writing to solr index)

 Add option to not commit and clarify existing solr.commit.size
 --

 Key: NUTCH-1306
 URL: https://issues.apache.org/jira/browse/NUTCH-1306
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: nutchgora
Reporter: Dan Rosher
Priority: Trivial
 Fix For: 2.1

 Attachments: NUTCH-1306-trunk-v2.patch, NUTCH-1306-trunk.patch, 
 NUTCH-1306-v2.patch, NUTCH-1306.patch


 Commit after finished writing to solr index - otherwise a bit confusing not 
 seeing the number of docs we expect in solr


[jira] [Commented] (NUTCH-1306) Add option to not commit and clarify existing solr.commit.size

2012-07-04 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406363#comment-13406363
 ] 

Ferdy Galema commented on NUTCH-1306:
-

Added a new option, solr.commit.index.

It defaults to true (commit after indexing). I will commit this to trunk and 
nutchgora if there are no objections.
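
As a rough illustration of how such a switch behaves, the sketch below reads
solr.commit.index with a default of true and only runs the commit step when it
is enabled. It uses java.util.Properties as a stand-in for the real
configuration object, and the class and method names are hypothetical.

```java
import java.util.Properties;

final class CommitOptionSketch {

  /** Reads solr.commit.index; a missing value means "commit" (default true). */
  static boolean shouldCommit(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty("solr.commit.index", "true"));
  }

  /** Runs the commit step only when the option is enabled. */
  static boolean finishIndexing(Properties conf, Runnable commit) {
    boolean doCommit = shouldCommit(conf);
    if (doCommit) {
      commit.run();
    }
    return doCommit;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    finishIndexing(conf, () -> System.out.println("committing (default)"));

    conf.setProperty("solr.commit.index", "false");
    boolean committed = finishIndexing(conf, () -> System.out.println("committing"));
    System.out.println("committed with option off: " + committed);
  }
}
```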

 Add option to not commit and clarify existing solr.commit.size
 --

 Key: NUTCH-1306
 URL: https://issues.apache.org/jira/browse/NUTCH-1306
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: nutchgora
Reporter: Dan Rosher
Priority: Trivial
 Fix For: 2.1

 Attachments: NUTCH-1306-trunk-v2.patch, NUTCH-1306-trunk.patch, 
 NUTCH-1306-v2.patch, NUTCH-1306.patch


 Commit after finished writing to solr index - otherwise a bit confusing not 
 seeing the number of docs we expect in solr


[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2012-07-04 Thread Ferdy Galema (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406495#comment-13406495
]

Ferdy Galema commented on NUTCH-1360:
-

Sorry for the late response, but this issue is not properly implemented (for
both branch and trunk).

- The IP is always stored instead of depending on the property:
headers.set(_ip,... should be done only if http.getIP_Header() is true.

- http.store.ip.address appends the _ip:true or false property to the request
string? What is the purpose of that? If not intentional, we should simply
revert this. On top of that, it uses the property with a default of true, but
it should be false if adding it to the request string is intentional.

Thanks.
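
The first point amounts to a simple guard. The sketch below is not the Nutch
code: it uses a plain Map in place of the real headers container, and the
storeIp flag stands in for whatever http.getIP_Header() returns.

```java
import java.util.HashMap;
import java.util.Map;

final class StoreIpSketch {

  /**
   * Copies the response headers and adds the "_ip" entry only when storing
   * the IP address has been switched on.
   */
  static Map<String, String> buildHeaders(Map<String, String> responseHeaders,
                                          boolean storeIp, String ip) {
    Map<String, String> headers = new HashMap<>(responseHeaders);
    if (storeIp) {
      headers.put("_ip", ip);
    }
    return headers;
  }

  public static void main(String[] args) {
    Map<String, String> response = new HashMap<>();
    response.put("Content-Type", "text/html");
    System.out.println(buildHeaders(response, false, "192.0.2.1").containsKey("_ip"));
    System.out.println(buildHeaders(response, true, "192.0.2.1").get("_ip"));
  }
}
```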

Suport the storing of IP address connected to when web crawling
---

Key: NUTCH-1360
URL: https://issues.apache.org/jira/browse/NUTCH-1360
Project: Nutch
Issue Type: New Feature
Components: protocol
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
Fix For: nutchgora, 1.6

Attachments: NUTCH-1360-nutchgora-v2.patch,
NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch

Simple issue enabling us to capture the specific IP address of the host which
we connect to to fetch a page.

[jira] [Comment Edited] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2012-07-04 Thread Ferdy Galema (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406495#comment-13406495
]

Ferdy Galema edited comment on NUTCH-1360 at 7/4/12 1:41 PM:
-

Sorry for the late response, but this issue is not properly implemented (for
both branch and trunk).

- The IP is always stored instead of depending on the property:
headers.set(_ip,... should be done only if http.getIP_Header() is true.

- http.store.ip.address appends the _ip:true or false property to the request
string? What is the purpose of that? If not intentional, we should simply
revert this. On top of that, it rereads the property with a default of true,
but it should be false (or just use http.getIP_Header()) if adding it to the
request string is intentional.

Thanks.

was (Author: ferdy.g):
Sorry for the late response, but this issue is not properly implemented
(for both branch and trunk).

- IP is always stored instead of depending on property: headers.set(_ip,...
should be done only if http.getIP_Header() is true.

Thanks.

Suport the storing of IP address connected to when web crawling
---

Attachments: NUTCH-1360-nutchgora-v2.patch,
NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch

Simple issue enabling us to capture the specific IP address of the host which
we connect to to fetch a page.

[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2012-07-04 Thread Ferdy Galema (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406561#comment-13406561
 ] 

Ferdy Galema commented on NUTCH-1360:
-

Just one more thing:
Should the IP not be stored in the metadata instead of the headers field? It is 
technically not a response header. As far as I know currently the headers 
container is only used for the headers returned by the http server. (But 
correct me if I'm wrong)

 Suport the storing of IP address connected to when web crawling
 ---

 Key: NUTCH-1360
 URL: https://issues.apache.org/jira/browse/NUTCH-1360
 Project: Nutch
  Issue Type: New Feature
  Components: protocol
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora, 1.6

 Attachments: NUTCH-1360-nutchgora-v2.patch, 
 NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch


 Simple issue enabling us to capture the specific IP address of the host which 
 we connect to to fetch a page.


[jira] [Updated] (NUTCH-1306) Add option to not commit and clarify existing solr.commit.size

2012-07-09 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1306:


Attachment: NUTCH-1306-trunk-v3.patch

There was a minor bug in the previous patch. Uploaded v3 of the trunk patch.

 Add option to not commit and clarify existing solr.commit.size
 --

 Key: NUTCH-1306
 URL: https://issues.apache.org/jira/browse/NUTCH-1306
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: nutchgora
Reporter: Dan Rosher
Priority: Trivial
 Fix For: 2.1

 Attachments: NUTCH-1306-trunk-v2.patch, NUTCH-1306-trunk-v3.patch, 
 NUTCH-1306-trunk.patch, NUTCH-1306-v2.patch, NUTCH-1306.patch


 Commit after finished writing to solr index - otherwise a bit confusing not 
 seeing the number of docs we expect in solr


[jira] [Created] (NUTCH-1423) Remove unused fields in LanguageIndexingFilter

2012-07-09 Thread Ferdy Galema (JIRA)

Ferdy Galema created NUTCH-1423:
---

 Summary: Remove unused fields in LanguageIndexingFilter
 Key: NUTCH-1423
 URL: https://issues.apache.org/jira/browse/NUTCH-1423
 Project: Nutch
  Issue Type: Bug
Reporter: Ferdy Galema
Priority: Trivial
 Fix For: 2.1


The LanguageIndexingFilter declares fields on the input that are not used. 
These fields must be removed.


[jira] [Updated] (NUTCH-1423) Remove unused fields in LanguageIndexingFilter

2012-07-09 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1423:


Attachment: NUTCH-1423.patch

 Remove unused fields in LanguageIndexingFilter
 --

 Key: NUTCH-1423
 URL: https://issues.apache.org/jira/browse/NUTCH-1423
 Project: Nutch
  Issue Type: Bug
Reporter: Ferdy Galema
Priority: Trivial
 Fix For: 2.1

 Attachments: NUTCH-1423.patch


 The LanguageIndexingFilter declares fields on the input that are not used. 
 These fields must be removed.


[jira] [Updated] (NUTCH-1424) fix fetcher timelimit logging

2012-07-09 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1424:


Attachment: NUTCH-1424.patch

 fix fetcher timelimit logging 
 --

 Key: NUTCH-1424
 URL: https://issues.apache.org/jira/browse/NUTCH-1424
 Project: Nutch
  Issue Type: Bug
Reporter: Ferdy Galema
Priority: Trivial
 Fix For: 2.1

 Attachments: NUTCH-1424.patch


 When fetching with timelimit, the log does not correctly reflect this. 
 (Always shows -1).


[jira] [Closed] (NUTCH-1424) fix fetcher timelimit logging

2012-07-09 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema closed NUTCH-1424.
---

Resolution: Fixed

Committed.

 fix fetcher timelimit logging 
 --

 Key: NUTCH-1424
 URL: https://issues.apache.org/jira/browse/NUTCH-1424
 Project: Nutch
 Issue Type: Bug
 Reporter: Ferdy Galema
 Priority: Trivial
 Fix For: 2.1

 Attachments: NUTCH-1424.patch


 When fetching with a timelimit, the log does not reflect it correctly
 (it always shows -1).


[jira] [Created] (NUTCH-1425) DbUpdaterJob declares PREV_SIGNATURE on input twice

2012-07-09 Thread Ferdy Galema (JIRA)

Ferdy Galema created NUTCH-1425:
---

 Summary: DbUpdaterJob declares PREV_SIGNATURE on input twice
 Key: NUTCH-1425
 URL: https://issues.apache.org/jira/browse/NUTCH-1425
 Project: Nutch
 Issue Type: Bug
 Reporter: Ferdy Galema
 Priority: Trivial
 Fix For: 2.1
 Attachments: NUTCH-1425.patch

Although harmless, DbUpdaterJob should not declare input fields twice.


[jira] [Closed] (NUTCH-1425) DbUpdaterJob declares PREV_SIGNATURE on input twice

2012-07-09 Thread Ferdy Galema (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema closed NUTCH-1425.
---

Resolution: Fixed

Committed.

 DbUpdaterJob declares PREV_SIGNATURE on input twice
 ---

 Key: NUTCH-1425
 URL: https://issues.apache.org/jira/browse/NUTCH-1425
 Project: Nutch
 Issue Type: Bug
 Reporter: Ferdy Galema
 Priority: Trivial
 Fix For: 2.1

 Attachments: NUTCH-1425.patch


 Although harmless, DbUpdaterJob should not declare input fields twice.

