[jira] [Closed] (NUTCH-1340) Increase scalability by only removing markers when they actually exist for DbUpdaterReducer
[ https://issues.apache.org/jira/browse/NUTCH-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema closed NUTCH-1340.
---
Resolution: Fixed

Increase scalability by only removing markers when they actually exist for DbUpdaterReducer
---
Key: NUTCH-1340
URL: https://issues.apache.org/jira/browse/NUTCH-1340
Project: Nutch
Issue Type: Improvement
Reporter: Ferdy Galema
Fix For: nutchgora
Attachments: NUTCH-1340-v1.txt, NUTCH-1340-v2.txt

After applying GORA-120 (which is already a huge performance boost by itself), one of the major bottlenecks of the DbUpdaterReducer is the deletion of the markers. The update reducer simply sets every row to delete its markers. A lot of rows do not actually have the markers, but the deletes are fired off in any case. Because the markers are always already on the input, a simple check to see whether they exist greatly improves performance.

The deletes are particularly expensive in HBase, because every single Delete immediately triggers a connection to the regionservers. (They ignore the autoflush=false directive.) Although deletes can be done in batch, this is currently not supported by Gora. For one, it is very difficult to implement in the current HBaseStore with regard to multithreading; secondly, I noticed that performance did not increase significantly.

Performance debugging on a real-life cluster indicates that this is currently the biggest bottleneck of the DbUpdaterReducer. (Remember: only after applying GORA-120.)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1340) Increase scalability by only removing markers when they actually exist for DbUpdaterReducer
[ https://issues.apache.org/jira/browse/NUTCH-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-1340:
Attachment: NUTCH-1340-v2.txt

v2 of patch, including javadoc.

This patch increases performance, but when updating huge crawls it can still be a bit troublesome to process the huge number of deletes. However, this is something that needs to be solved in Gora.

Committed! Thanks Lewis.

Increase scalability by only removing markers when they actually exist for DbUpdaterReducer
---
Key: NUTCH-1340
URL: https://issues.apache.org/jira/browse/NUTCH-1340
Project: Nutch
Issue Type: Improvement
Reporter: Ferdy Galema
Fix For: nutchgora
Attachments: NUTCH-1340-v1.txt, NUTCH-1340-v2.txt
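The marker check described in NUTCH-1340 can be illustrated with a minimal, self-contained sketch. This is not the committed patch: `MarkerSketch`, `Row`, `removeAlways`, `removeIfExists` and the marker name `fetch` are all hypothetical names invented for illustration, and the delete counter merely stands in for the per-delete round trip to the store that the issue describes.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: none of these names come from Nutch or Gora. */
public class MarkerSketch {

    /** Hypothetical row that counts the delete operations it would send to the store. */
    public static final class Row {
        public final Map<String, String> markers = new HashMap<>();
        public int deletesIssued = 0;

        /** Removing a marker always costs one delete, whether or not the marker exists. */
        public String removeMarker(String name) {
            deletesIssued++;
            return markers.remove(name);
        }
    }

    /** Behaviour before the change, as the issue describes it: fire the delete regardless. */
    public static String removeAlways(Row row, String name) {
        return row.removeMarker(name);
    }

    /** Behaviour after the change: check the data already on the input, delete only if present. */
    public static String removeIfExists(Row row, String name) {
        return row.markers.containsKey(name) ? row.removeMarker(name) : null;
    }

    public static void main(String[] args) {
        Row withMarker = new Row();
        withMarker.markers.put("fetch", "batch-1");
        Row withoutMarker = new Row();

        removeIfExists(withMarker, "fetch");
        removeIfExists(withoutMarker, "fetch");
        // prints "1 0": only the row that had the marker issued a delete
        System.out.println(withMarker.deletesIssued + " " + withoutMarker.deletesIssued);

        Row naive = new Row();
        removeAlways(naive, "fetch");
        // prints "1": a delete was issued although no marker was present
        System.out.println(naive.deletesIssued);
    }
}
```

The check costs nothing extra in this sketch because the markers are already in memory, which mirrors the observation in the issue that the markers are always on the reducer input.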
[jira] [Commented] (NUTCH-882) Design a Host table in GORA
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262488#comment-13262488 ]

Ferdy Galema commented on NUTCH-882:

Committed. I realize that the current state is far from finished; however, I figured it is enough to close off this longstanding issue. This makes room for people to easily play around with it and make improvements where necessary. (Adding definitions for other stores, new features such as storing stats, etcetera.)

I'll leave the final closing to Julien, since he is the original reporter. Please let me know if any of you disagree.

Design a Host table in GORA
---
Key: NUTCH-882
URL: https://issues.apache.org/jira/browse/NUTCH-882
Project: Nutch
Issue Type: New Feature
Affects Versions: nutchgora
Reporter: Julien Nioche
Fix For: nutchgora
Attachments: NUTCH-882-v1.patch, NUTCH-882-v3.txt, NUTCH-882-v3.txt, hostdb.patch

Having a separate GORA table for storing information about hosts (and domains?) would be very useful for:
* customising the behaviour of the fetching on a host basis, e.g. number of threads, min time between threads, etc.
* storing stats
* keeping metadata and possibly propagating them to the webpages
* keeping a copy of the robots.txt and possibly using that later to filter the webtable
* storing sitemap files and updating the webtable accordingly

I'll try to come up with a GORA schema for such a host table, but any comments are of course already welcome.
[jira] [Resolved] (NUTCH-882) Design a Host table in GORA
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema resolved NUTCH-882.
Resolution: Fixed

Design a Host table in GORA
---
Key: NUTCH-882
URL: https://issues.apache.org/jira/browse/NUTCH-882
Project: Nutch
Issue Type: New Feature
Affects Versions: nutchgora
Reporter: Julien Nioche
Fix For: nutchgora
Attachments: NUTCH-882-v1.patch, NUTCH-882-v3.txt, NUTCH-882-v3.txt, hostdb.patch
[jira] [Commented] (NUTCH-902) Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box
[ https://issues.apache.org/jira/browse/NUTCH-902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262496#comment-13262496 ]

Ferdy Galema commented on NUTCH-902:

I think nutch-default.xml does not correctly use the description field of the storage.data.store.class property. The description should describe what the property is about, not what the value is about. So instead of the various entries:

    <property>
      <name>storage.data.store.class</name>
      <value>org.apache.gora.cassandra.store.CassandraStore</value>
      <description>Gora class for storing data in Apache Cassandra</description>
    </property>
    -->
    <!--
    <property>
      <name>storage.data.store.class</name>
      <value>org.apache.gora.hbase.store.HBaseStore</value>
      <description>Gora class for storing data in Apache HBase</description>
    </property>
    -->

and so on, I propose to add a single property entry with a description like this:

    <property>
      <name>storage.data.store.class</name>
      <value>org.apache.gora.sql.store.SqlStore</value>
      <description>The Gora DataStore class for storing/retrieving data.
      Currently the following stores are available:

      org.apache.gora.sql.store.SqlStore
      A DataStore implementation for RDBMS with a SQL interface. SqlStore
      uses JDBC drivers to communicate with the DB.

      org.apache.gora.hbase.store.HBaseStore
      DataStore implementation for Hadoop HBase.

      etcetera
      </description>
    </property>

This has the additional benefit of making the nutch-default.xml look cleaner, imho.

Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box
---
Key: NUTCH-902
URL: https://issues.apache.org/jira/browse/NUTCH-902
Project: Nutch
Issue Type: New Feature
Components: documentation, storage
Affects Versions: nutchbase
Reporter: Enis Soztutar
Assignee: Lewis John McGibbney
Fix For: nutchgora
Attachments: NUTCH-902-v2.patch, NUTCH-902-v3.patch, NUTCH-902.patch

As per the discussion in the mailing list and http://wiki.apache.org/nutch/GORA_HBase, it will be good to include all the necessary files and configuration. I propose that we maintain configuration for at least SQL, HBase and Cassandra. The following changes are needed:
* conf/gora-sql-mapping.xml
* conf/gora-hbase-mapping.xml
* conf/gora-cassandra-mapping.xml
* comments on nutch-default and ivy.xml

Shall we also include jars from gora-hbase, gora-cassandra and their dependencies?
[jira] [Commented] (NUTCH-1189) add commented out default settings to gora.properties files
[ https://issues.apache.org/jira/browse/NUTCH-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262506#comment-13262506 ]

Ferdy Galema commented on NUTCH-1189:

FYI: I just committed a change to update the HBaseStore properties section.

add commented out default settings to gora.properties files
---
Key: NUTCH-1189
URL: https://issues.apache.org/jira/browse/NUTCH-1189
Project: Nutch
Issue Type: Sub-task
Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Fix For: nutchgora
Attachments: NUTCH-1189-v2.patch, NUTCH-1189-v3.patch, NUTCH-1189-v4.patch, NUTCH-1189.patch

This issue should have been dealt with as part of its parent issue; however, as it is a fairly large task in itself, I think it needs to be done independently. The gora.properties file should, amongst other settings and besides the very basic defaults for sqlstore, include defaults for connecting to HBase, Cassandra, etc. servers on their default ports. Leaving this down to individual interpretation puts a huge onus on the user, and so creates a barrier to entry for getting the configuration settings up and running.
[jira] [Closed] (NUTCH-882) Design a Host table in GORA
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema closed NUTCH-882.

Ok. Thanks to everyone who was involved.

Design a Host table in GORA
---
Key: NUTCH-882
URL: https://issues.apache.org/jira/browse/NUTCH-882
Project: Nutch
Issue Type: New Feature
Affects Versions: nutchgora
Reporter: Julien Nioche
Fix For: nutchgora
Attachments: NUTCH-882-v1.patch, NUTCH-882-v3.txt, NUTCH-882-v3.txt, hostdb.patch
[jira] [Closed] (NUTCH-1290) crawlId not supported by all Tools
[ https://issues.apache.org/jira/browse/NUTCH-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema closed NUTCH-1290.
---
Resolution: Fixed

crawlId not supported by all Tools
---
Key: NUTCH-1290
URL: https://issues.apache.org/jira/browse/NUTCH-1290
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: nutchgora
Reporter: Mathijs Homminga
Priority: Minor
Fix For: nutchgora
Attachments: NUTCH-1290.patch

See also: https://issues.apache.org/jira/browse/NUTCH-907

The StorageUtils class exposes a createDataStore method which uses the default schema for a persistent class specified in the Gora configuration. This method ignores Nutch's storage.schema property and the notion of a crawlId.

Two tools use this method instead of the createWebStore method (which does support the storage.schema property and a crawlId):
* o.a.n.indexer.IndexerReducer (IndexerJob)
* o.a.n.util.domain.DomainStatistics

I propose that these two start using the createWebStore method and that we remove the createDataStore method from StorageUtils. Also, these two tools should support the crawlId command line parameter.
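To make the difference between the two methods concrete, here is a minimal sketch of what "supporting the storage.schema property and a crawlId" could look like when choosing a schema name. It is an assumption-laden illustration, not the StorageUtils code: `CrawlIdSketch` and `schemaFor` are hypothetical names, and both the `webpage` fallback and the `crawlId_` prefix format are assumptions that the issue text does not confirm. Only the `storage.schema` property name is taken from the issue.

```java
import java.util.Properties;

/** Illustrative only: not the Nutch StorageUtils implementation. */
public class CrawlIdSketch {

    /**
     * Picks a schema name from the configuration and an optional crawlId.
     * The "webpage" fallback and the "crawlId_" prefix are assumptions.
     */
    public static String schemaFor(Properties conf, String crawlId) {
        String base = conf.getProperty("storage.schema", "webpage");
        if (crawlId == null || crawlId.isEmpty()) {
            return base; // no crawlId given: use the configured schema as-is
        }
        return crawlId + "_" + base; // separate schema per crawl
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(schemaFor(conf, ""));      // prints "webpage"
        System.out.println(schemaFor(conf, "news"));  // prints "news_webpage"

        conf.setProperty("storage.schema", "pages");
        System.out.println(schemaFor(conf, "news"));  // prints "news_pages"
    }
}
```

A tool that bypasses this kind of lookup and always uses the store's default schema would read and write the same table for every crawl, which is the problem the issue reports for IndexerReducer and DomainStatistics.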
[jira] [Commented] (NUTCH-902) Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box
[ https://issues.apache.org/jira/browse/NUTCH-902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262548#comment-13262548 ]

Ferdy Galema commented on NUTCH-902:

Alright, I'll change and commit the storage.data.store.class property description. Aside from that, I think we can close this issue. Effort can be put into NUTCH-1205 and, after that, into actual testing of the stores to see if the current configuration is sufficient for out-of-the-box usage. If this is not the case for some stores, we can always create new issues for those. (To prevent too much clutter in this issue.)

Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box
---
Key: NUTCH-902
URL: https://issues.apache.org/jira/browse/NUTCH-902
Project: Nutch
Issue Type: New Feature
Components: documentation, storage
Affects Versions: nutchbase
Reporter: Enis Soztutar
Assignee: Lewis John McGibbney
Fix For: nutchgora
Attachments: NUTCH-902-v2.patch, NUTCH-902-v3.patch, NUTCH-902.patch
[jira] [Commented] (NUTCH-902) Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box
[ https://issues.apache.org/jira/browse/NUTCH-902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262556#comment-13262556 ]

Ferdy Galema commented on NUTCH-902:

Ok, done. (Note that I did not actually check the stores; I simply merged the nutch-default.xml entries.)

Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box
---
Key: NUTCH-902
URL: https://issues.apache.org/jira/browse/NUTCH-902
Project: Nutch
Issue Type: New Feature
Components: documentation, storage
Affects Versions: nutchbase
Reporter: Enis Soztutar
Assignee: Lewis John McGibbney
Fix For: nutchgora
Attachments: NUTCH-902-v2.patch, NUTCH-902-v3.patch, NUTCH-902.patch
[jira] [Commented] (NUTCH-879) URL-s getting lost
[ https://issues.apache.org/jira/browse/NUTCH-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262558#comment-13262558 ]

Ferdy Galema commented on NUTCH-879:

This is a pretty old issue. Nevertheless, the bug might still be active. I'll look into it.

URL-s getting lost
---
Key: NUTCH-879
URL: https://issues.apache.org/jira/browse/NUTCH-879
Project: Nutch
Issue Type: Bug
Affects Versions: nutchgora
Environment:
* Ubuntu 10.4 x64, Sun JDK 1.6
* using 1-node Hadoop + HDFS
* trunk r983472, using MySQL store
* branch-1.3
Reporter: Andrzej Bialecki
Fix For: nutchgora
Attachments: branch-1.3-bench.txt, trunk-bench.txt

I ran the Benchmark using branch-1.3 and trunk (formerly nutchbase). With the same Benchmark parameters and the same plugins, branch-1.3 collects ~1.5mln urls, while trunk collects ~20,000 urls. Clearly something is wrong.
[jira] [Updated] (NUTCH-1205) Upgrade gora modules to 0.2 in ivy/ivy.xml
[ https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-1205:
Attachment: NUTCH-1205-v7.patch

OK, I got the tests working now. The problem is that Properties objects are not correctly handled throughout the tests. This is currently a problem in Gora. In short, it means that Properties are not loaded in GoraMapper from dynamic properties but ALWAYS from the static gora.properties. (I will shortly open an issue for that.) As a consequence, to make the tests work with Gora 0.2, the default gora.properties now uses a hsqldb memstore instead of a standalone hsqldb server.

Lewis, I noticed that you excluded jdom in the ivy.xml. Why is that? I included it again because the SqlStore needs it to read its mapping.

Upgrade gora modules to 0.2 in ivy/ivy.xml
---
Key: NUTCH-1205
URL: https://issues.apache.org/jira/browse/NUTCH-1205
Project: Nutch
Issue Type: Improvement
Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Blocker
Fix For: nutchgora
Attachments: NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, NUTCH-1205-v4.patch, NUTCH-1205-v5.patch, NUTCH-1205-v5.patch, NUTCH-1205-v6.patch, NUTCH-1205-v7.patch, NUTCH-1205.patch

Although gora trunk is unstable, work is ongoing to get this fixed. For the time being, I think Nutchgora should use gora trunk, as this will identify more vulnerabilities. I'll get the trivial patch submitted shortly.
[jira] [Commented] (NUTCH-1205) Upgrade gora modules to 0.2 in ivy/ivy.xml
[ https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263596#comment-13263596 ]

Ferdy Galema commented on NUTCH-1205:

(Also, I reformatted the ivy.xml to use only spaces as indentation. This is the policy, right? If so, could anyone editing xml files double-check their editor settings?)

Upgrade gora modules to 0.2 in ivy/ivy.xml
---
Key: NUTCH-1205
URL: https://issues.apache.org/jira/browse/NUTCH-1205
Project: Nutch
Issue Type: Improvement
Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Blocker
Fix For: nutchgora
Attachments: NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, NUTCH-1205-v4.patch, NUTCH-1205-v5.patch, NUTCH-1205-v5.patch, NUTCH-1205-v6.patch, NUTCH-1205-v7.patch, NUTCH-1205.patch
[jira] [Updated] (NUTCH-1205) Upgrade gora modules to 0.2 in ivy/ivy.xml
[ https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-1205:
Attachment: NUTCH-1205-v8.patch

Oops, there was still a failure in a later test (TestProtocolHttpClient). This was because of multiple ant jars. I noticed that this was caused by removing the global exclude and adding excludes only to the hadoop deps. (This was obviously not sufficient.) The new version of the patch passes all tests.

Upgrade gora modules to 0.2 in ivy/ivy.xml
---
Key: NUTCH-1205
URL: https://issues.apache.org/jira/browse/NUTCH-1205
Project: Nutch
Issue Type: Improvement
Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Blocker
Fix For: nutchgora
Attachments: NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, NUTCH-1205-v4.patch, NUTCH-1205-v5.patch, NUTCH-1205-v5.patch, NUTCH-1205-v6.patch, NUTCH-1205-v7.patch, NUTCH-1205-v8.patch, NUTCH-1205.patch
[jira] [Updated] (NUTCH-1205) Upgrade gora modules to 0.2 in ivy/ivy.xml
[ https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-1205:
Attachment: (was: NUTCH-1205-v7.patch)

Upgrade gora modules to 0.2 in ivy/ivy.xml
---
Key: NUTCH-1205
URL: https://issues.apache.org/jira/browse/NUTCH-1205
Project: Nutch
Issue Type: Improvement
Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Blocker
Fix For: nutchgora
Attachments: NUTCH-1205-v10.patch, NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, NUTCH-1205-v4.patch, NUTCH-1205-v5.patch, NUTCH-1205-v5.patch, NUTCH-1205-v6.patch, NUTCH-1205.patch
[jira] [Updated] (NUTCH-1205) Upgrade gora modules to 0.2 in ivy/ivy.xml
[ https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-1205:
Attachment: NUTCH-1205-v10.patch

The tests now work and TestGoraStorage uses a proper standalone database. (Integrating issue NUTCH-902.)

I think it's good to do a final check of what's to be included as dependencies in the ivy.xml.

Upgrade gora modules to 0.2 in ivy/ivy.xml
---
Key: NUTCH-1205
URL: https://issues.apache.org/jira/browse/NUTCH-1205
Project: Nutch
Issue Type: Improvement
Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Blocker
Fix For: nutchgora
Attachments: NUTCH-1205-v10.patch, NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, NUTCH-1205-v4.patch, NUTCH-1205-v5.patch, NUTCH-1205-v5.patch, NUTCH-1205-v6.patch, NUTCH-1205.patch
[jira] [Updated] (NUTCH-1205) Upgrade gora modules to 0.2 in ivy/ivy.xml
[ https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-1205:
Attachment: (was: NUTCH-1205-v9.patch)

Upgrade gora modules to 0.2 in ivy/ivy.xml
---
Key: NUTCH-1205
URL: https://issues.apache.org/jira/browse/NUTCH-1205
Project: Nutch
Issue Type: Improvement
Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Blocker
Fix For: nutchgora
Attachments: NUTCH-1205-v10.patch, NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, NUTCH-1205-v4.patch, NUTCH-1205-v5.patch, NUTCH-1205-v5.patch, NUTCH-1205-v6.patch, NUTCH-1205.patch
[jira] [Updated] (NUTCH-1205) Upgrade gora modules to 0.2 in ivy/ivy.xml
[ https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-1205:
Attachment: (was: NUTCH-1205-v9.patch)

Upgrade gora modules to 0.2 in ivy/ivy.xml
---
Key: NUTCH-1205
URL: https://issues.apache.org/jira/browse/NUTCH-1205
Project: Nutch
Issue Type: Improvement
Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Blocker
Fix For: nutchgora
Attachments: NUTCH-1205-v10.patch, NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, NUTCH-1205-v4.patch, NUTCH-1205-v5.patch, NUTCH-1205-v5.patch, NUTCH-1205-v6.patch, NUTCH-1205.patch
[jira] [Updated] (NUTCH-1205) Upgrade gora modules to 0.2 in ivy/ivy.xml
[ https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-1205:
Attachment: NUTCH-1205-v11.patch

Upgrade gora modules to 0.2 in ivy/ivy.xml
---
Key: NUTCH-1205
URL: https://issues.apache.org/jira/browse/NUTCH-1205
Project: Nutch
Issue Type: Improvement
Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Blocker
Fix For: nutchgora
Attachments: NUTCH-1205-v10.patch, NUTCH-1205-v11.patch, NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, NUTCH-1205-v4.patch, NUTCH-1205-v5.patch, NUTCH-1205-v5.patch, NUTCH-1205-v6.patch, NUTCH-1205.patch
[jira] [Resolved] (NUTCH-1205) Upgrade gora modules to 0.2 in ivy/ivy.xml
[ https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema resolved NUTCH-1205.
Resolution: Fixed

Attached new patch v11. Committed.
- Fixed the jdom issue. (Added test dep again.)
- Added a single global exclusion for hsqldb. (The deps can have the exclusion removed.)
- Tests succeed.
- Built a sqlstore runtime and played around doing some local crawls successfully.

I did not test a deployment for the other stores. (When there is something wrong with one of them dependency-wise, we can always create new issues.)

Upgrade gora modules to 0.2 in ivy/ivy.xml
---
Key: NUTCH-1205
URL: https://issues.apache.org/jira/browse/NUTCH-1205
Project: Nutch
Issue Type: Improvement
Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Blocker
Fix For: nutchgora
Attachments: NUTCH-1205-v10.patch, NUTCH-1205-v11.patch, NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, NUTCH-1205-v4.patch, NUTCH-1205-v5.patch, NUTCH-1205-v5.patch, NUTCH-1205-v6.patch, NUTCH-1205.patch
[jira] [Updated] (NUTCH-896) Gora-based tests need to have their own config files
[ https://issues.apache.org/jira/browse/NUTCH-896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-896: --- Affects Version/s: (was: nutchgora) Fix Version/s: (was: 2.1) nutchgora Gora-based tests need to have their own config files - Key: NUTCH-896 URL: https://issues.apache.org/jira/browse/NUTCH-896 Project: Nutch Issue Type: Bug Reporter: Julien Nioche Assignee: Julien Nioche Fix For: nutchgora The tests extending AbstractNutchTest (Injector, Generator, Fetcher) have hard-coded properties for GORA. It would be better to be able to rely on a file gora.properties used only for the tests, just as we do with the nutch-*.xml config files (see CrawlTestUtil). This way we wouldn't use the configs set in the main /conf file as they could be specific to a given GORA backend e.g. Mysql vs hsqldb. This would also help running the tests with a non-default GORA backend. We need to modify GORA and make the method DataStoreFactory.setProperties public.
[jira] [Commented] (NUTCH-1205) Upgrade gora modules to 0.2 in ivy/ivy.xml
[ https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267581#comment-13267581 ] Ferdy Galema commented on NUTCH-1205: - I committed a minor addition, that fixes the maven-plugins error when uncommenting another store. Upgrade gora modules to 0.2 in ivy/ivy.xml -- Key: NUTCH-1205 URL: https://issues.apache.org/jira/browse/NUTCH-1205 Project: Nutch Issue Type: Improvement Components: storage Affects Versions: nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Blocker Fix For: nutchgora Attachments: NUTCH-1205-v10.patch, NUTCH-1205-v11.patch, NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, NUTCH-1205-v4.patch, NUTCH-1205-v5.patch, NUTCH-1205-v5.patch, NUTCH-1205-v6.patch, NUTCH-1205.patch Although gora trunk is unstable, work is ongoing to get this fixed. For the time being, I think Nutchgora should use gora trunk as this will identify more vulnerabilities. I'll get the trivial patch submitted shortly.
[jira] [Commented] (NUTCH-1349) Make batchId explcit within debug logging.
[ https://issues.apache.org/jira/browse/NUTCH-1349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267758#comment-13267758 ] Ferdy Galema commented on NUTCH-1349: - +1 This will also benefit other jobs depending on a batchId. Make batchId explcit within debug logging. -- Key: NUTCH-1349 URL: https://issues.apache.org/jira/browse/NUTCH-1349 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: nutchgora Reporter: Lewis John McGibbney Priority: Minor Fix For: nutchgora I find this a pain when trying to locate the batchId of some urls which are skipped when going to the Solr index. My DEBUG log output gives me {code} 2012-05-03 20:44:55,268 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - Skipping http://www.glasgowwheelers.com/; different batch id 2012-05-03 20:44:55,259 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - Skipping http://www.heraldscotland.com/; different batch id {code} when I would actually like {code} 2012-05-03 20:44:55,268 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - Skipping http://www.glasgowwheelers.com/; different batch id (ACTUAL BATCH ID) 2012-05-03 20:44:55,259 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - Skipping http://www.heraldscotland.com/; different batch id (ACTUAL BATCH ID) {code} patch coming up soon
[jira] [Created] (NUTCH-1350) remove unused dependancy because of access restriction
Ferdy Galema created NUTCH-1350: --- Summary: remove unused dependancy because of access restriction Key: NUTCH-1350 URL: https://issues.apache.org/jira/browse/NUTCH-1350 Project: Nutch Issue Type: Bug Reporter: Ferdy Galema Priority: Trivial Fix For: nutchgora CrawlTestUtil has an unused dependency on com.sun.net.httpserver.HttpContext that sometimes causes an access restriction error when used with certain JDKs. I figured since it isn't used anyway I can just remove it.
[jira] [Commented] (NUTCH-1349) Make batchId explcit within debug logging and improve CLI
[ https://issues.apache.org/jira/browse/NUTCH-1349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268397#comment-13268397 ] Ferdy Galema commented on NUTCH-1349: - Good work on improving the CLI. About displaying the mismatching batchId: your patch prints batchId, while you should use 'mark' instead. What do you mean by matching TableUtil.unreverseUrl(key)? Make batchId explcit within debug logging and improve CLI - Key: NUTCH-1349 URL: https://issues.apache.org/jira/browse/NUTCH-1349 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: nutchgora Reporter: Lewis John McGibbney Priority: Minor Fix For: nutchgora Attachments: NUTCH-1349.patch I find this a pain when trying to locate the batchId of some urls which are skipped when going to the Solr index. My DEBUG log output gives me {code} 2012-05-03 20:44:55,268 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - Skipping http://www.glasgowwheelers.com/; different batch id 2012-05-03 20:44:55,259 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - Skipping http://www.heraldscotland.com/; different batch id {code} when I would actually like {code} 2012-05-03 20:44:55,268 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - Skipping http://www.glasgowwheelers.com/; different batch id (ACTUAL BATCH ID) 2012-05-03 20:44:55,259 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - Skipping http://www.heraldscotland.com/; different batch id (ACTUAL BATCH ID) {code} patch coming up soon
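As a rough illustration of the log line requested in NUTCH-1349, the sketch below builds the "Skipping" message with the row's own mark appended in parentheses. The class and method names (SkipLogMessage, skipMessage) and the sample mark value are hypothetical; the actual patch is not shown in this thread and may differ.

```java
// Hypothetical helper, not taken from the NUTCH-1349 patch.
final class SkipLogMessage {

    // Builds the debug message for a skipped URL and appends the mark
    // found on the row, so a mismatching batch can be identified from the log.
    static String skipMessage(String url, String rowMark) {
        return "Skipping " + url + "; different batch id (" + rowMark + ")";
    }
}
```

A call such as SkipLogMessage.skipMessage(url, mark) would then be passed to the logger's debug method in place of the fixed string.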
[jira] [Created] (NUTCH-1352) Improve regex urlfilters/normalizers synchronization
Ferdy Galema created NUTCH-1352: --- Summary: Improve regex urlfilters/normalizers synchronization Key: NUTCH-1352 URL: https://issues.apache.org/jira/browse/NUTCH-1352 Project: Nutch Issue Type: Improvement Reporter: Ferdy Galema Fix For: nutchgora I noticed that during fetching the fetcher threads spend a lot of time blocked on a monitor because of outlink normalizing/filtering. The cause: some of the regex plugins use single-lock synchronization. This patch improves throughput by removing the synchronization locks and replacing them with threadlocals where needed. It has been extensively tested in production. I will commit this later today if there is no objection.
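The general technique described in NUTCH-1352 can be sketched as follows. This is a minimal illustration, not the attached patch: the class name PerThreadRegexFilter, the jsessionid pattern, and the use of ThreadLocal.withInitial (Java 8+) are all assumptions made for the example. The idea is that a compiled Pattern can be shared freely, while the mutable Matcher is kept per thread instead of being guarded by one lock.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical normalizer, not taken from the NUTCH-1352 patch.
final class PerThreadRegexFilter {

    // Pattern is immutable, so sharing it between fetcher threads is safe.
    private static final Pattern SESSION_ID = Pattern.compile(";jsessionid=[^?#]*");

    // Matcher is mutable. Giving each thread its own instance removes the
    // need for a synchronized block that all threads would queue on.
    private static final ThreadLocal<Matcher> MATCHER =
        ThreadLocal.withInitial(() -> SESSION_ID.matcher(""));

    // Strips the session id from a URL using the calling thread's Matcher.
    static String normalize(String url) {
        Matcher matcher = MATCHER.get();
        matcher.reset(url);
        return matcher.replaceAll("");
    }
}
```

The trade-off is one Matcher per thread in memory in exchange for no contention on the hot path.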
[jira] [Updated] (NUTCH-1352) Improve regex urlfilters/normalizers synchronization
[ https://issues.apache.org/jira/browse/NUTCH-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1352: Attachment: NUTCH-1352.patch Improve regex urlfilters/normalizers synchronization Key: NUTCH-1352 URL: https://issues.apache.org/jira/browse/NUTCH-1352 Project: Nutch Issue Type: Improvement Reporter: Ferdy Galema Fix For: nutchgora Attachments: NUTCH-1352.patch I noticed that during fetching the fetcher threads spend a lot of time blocked on a monitor because of outlink normalizing/filtering. The cause: some of the regex plugins use single-lock synchronization. This patch improves throughput by removing the synchronization locks and replacing them with threadlocals where needed. It has been extensively tested in production. I will commit this later today if there is no objection.
[jira] [Updated] (NUTCH-1353) nutchgora DomainStatistics support crawlId, counter bug and reformatting
[ https://issues.apache.org/jira/browse/NUTCH-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1353: Attachment: NUTCH-1353.patch nutchgora DomainStatistics support crawlId, counter bug and reformatting Key: NUTCH-1353 URL: https://issues.apache.org/jira/browse/NUTCH-1353 Project: Nutch Issue Type: Bug Reporter: Ferdy Galema Priority: Minor Fix For: nutchgora Attachments: NUTCH-1353.patch This patch fixes three issues about nutchgora DomainStatistics: -crawlId support (note I closed NUTCH-1290 because I thought DomainStatistics was already fixed. This was not the case.) -A counter bug (NOT_FETCHED should be increased instead of FETCHED) -reformatting (convert tabs to spaces and clear unused imports)
[jira] [Closed] (NUTCH-1353) nutchgora DomainStatistics support crawlId, counter bug and reformatting
[ https://issues.apache.org/jira/browse/NUTCH-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1353. --- Resolution: Fixed committed nutchgora DomainStatistics support crawlId, counter bug and reformatting Key: NUTCH-1353 URL: https://issues.apache.org/jira/browse/NUTCH-1353 Project: Nutch Issue Type: Bug Reporter: Ferdy Galema Priority: Minor Fix For: nutchgora Attachments: NUTCH-1353.patch This patch fixes three issues about nutchgora DomainStatistics: -crawlId support (note I closed NUTCH-1290 because I thought DomainStatistics was already fixed. This was not the case.) -A counter bug (NOT_FETCHED should be increased instead of FETCHED) -reformatting (convert tabs to spaces and clear unused imports)
[jira] [Created] (NUTCH-1354) nutchgora support fetcher.queue.depth.multiplier property
Ferdy Galema created NUTCH-1354: --- Summary: nutchgora support fetcher.queue.depth.multiplier property Key: NUTCH-1354 URL: https://issues.apache.org/jira/browse/NUTCH-1354 Project: Nutch Issue Type: New Feature Reporter: Ferdy Galema Priority: Minor Fix For: nutchgora Like trunk, nutchgora should support fetcher.queue.depth.multiplier property too.
[jira] [Updated] (NUTCH-1354) nutchgora support fetcher.queue.depth.multiplier property
[ https://issues.apache.org/jira/browse/NUTCH-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1354: Attachment: NUTCH-1354.patch nutchgora support fetcher.queue.depth.multiplier property - Key: NUTCH-1354 URL: https://issues.apache.org/jira/browse/NUTCH-1354 Project: Nutch Issue Type: New Feature Reporter: Ferdy Galema Priority: Minor Fix For: nutchgora Attachments: NUTCH-1354.patch Like trunk, nutchgora should support fetcher.queue.depth.multiplier property too.
[jira] [Closed] (NUTCH-1354) nutchgora support fetcher.queue.depth.multiplier property
[ https://issues.apache.org/jira/browse/NUTCH-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1354. --- Resolution: Fixed committed nutchgora support fetcher.queue.depth.multiplier property - Key: NUTCH-1354 URL: https://issues.apache.org/jira/browse/NUTCH-1354 Project: Nutch Issue Type: New Feature Reporter: Ferdy Galema Priority: Minor Fix For: nutchgora Attachments: NUTCH-1354.patch Like trunk, nutchgora should support fetcher.queue.depth.multiplier property too.
[jira] [Updated] (NUTCH-1352) Improve regex urlfilters/normalizers synchronization
[ https://issues.apache.org/jira/browse/NUTCH-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1352: Fix Version/s: 1.5 Improve regex urlfilters/normalizers synchronization Key: NUTCH-1352 URL: https://issues.apache.org/jira/browse/NUTCH-1352 Project: Nutch Issue Type: Improvement Reporter: Ferdy Galema Fix For: nutchgora, 1.5 Attachments: NUTCH-1352.patch I noticed that during fetching the fetcher threads spend a lot of time blocked on a monitor because of outlink normalizing/filtering. The cause: some of the regex plugins use single-lock synchronization. This patch improves throughput by removing the synchronization locks and replacing them with threadlocals where needed. It has been extensively tested in production. I will commit this later today if there is no objection.
[jira] [Commented] (NUTCH-1352) Improve regex urlfilters/normalizers synchronization
[ https://issues.apache.org/jira/browse/NUTCH-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269497#comment-13269497 ] Ferdy Galema commented on NUTCH-1352: - This indeed applies to trunk too. (Except for a minor patch segment about a logging statement... quite irrelevant). I'll commit it to trunk too. Improve regex urlfilters/normalizers synchronization Key: NUTCH-1352 URL: https://issues.apache.org/jira/browse/NUTCH-1352 Project: Nutch Issue Type: Improvement Reporter: Ferdy Galema Fix For: nutchgora, 1.5 Attachments: NUTCH-1352.patch I noticed that during fetching the fetcher threads spend a lot of time blocked on a monitor because of outlink normalizing/filtering. The cause: some of the regex plugins use single-lock synchronization. This patch improves throughput by removing the synchronization locks and replacing them with threadlocals where needed. It has been extensively tested in production. I will commit this later today if there is no objection.
[jira] [Updated] (NUTCH-1352) Improve regex urlfilters/normalizers synchronization
[ https://issues.apache.org/jira/browse/NUTCH-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1352: Fix Version/s: (was: 1.5) 1.6 On second thought, I will hold the commit for trunk for now. (Feature freeze I guess?) Improve regex urlfilters/normalizers synchronization Key: NUTCH-1352 URL: https://issues.apache.org/jira/browse/NUTCH-1352 Project: Nutch Issue Type: Improvement Reporter: Ferdy Galema Fix For: nutchgora, 1.6 Attachments: NUTCH-1352.patch I noticed that during fetching the fetcher threads spend a lot of time blocked on a monitor because of outlink normalizing/filtering. The cause: some of the regex plugins use single-lock synchronization. This patch improves throughput by removing the synchronization locks and replacing them with threadlocals where needed. It has been extensively tested in production. I will commit this later today if there is no objection.
[jira] [Created] (NUTCH-1355) nutchgora Configure minimum throughput for fetcher
Ferdy Galema created NUTCH-1355: --- Summary: nutchgora Configure minimum throughput for fetcher Key: NUTCH-1355 URL: https://issues.apache.org/jira/browse/NUTCH-1355 Project: Nutch Issue Type: New Feature Reporter: Ferdy Galema Fix For: nutchgora Like trunk, nutchgora should also have a feature to configure the fetcher with a minimum throughput. (See NUTCH-1067 for the work done by Markus). It's implemented in almost the same way, except that the number of times throughput falls below threshold is measured sequentially. (The counter is reset when throughput is healthy again; this should work even better against temporary dips). Defaults to disabled. Will commit later today if there is no objection.
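The sequential counting described in NUTCH-1355 can be sketched roughly as below. This is only an illustration under stated assumptions: the class name ThroughputGuard, the pages-per-second unit, and the convention that a negative threshold means "disabled" are all made up for the example, and the real property names and checks in the patch are not shown in this thread.

```java
// Hypothetical guard, not taken from the NUTCH-1355 patch.
final class ThroughputGuard {

    private final double minPagesPerSecond;  // assumption: negative means disabled
    private final int maxConsecutiveLowChecks;
    private int consecutiveLowChecks = 0;

    ThroughputGuard(double minPagesPerSecond, int maxConsecutiveLowChecks) {
        this.minPagesPerSecond = minPagesPerSecond;
        this.maxConsecutiveLowChecks = maxConsecutiveLowChecks;
    }

    // Called once per measurement interval. Returns true when throughput
    // has stayed below the threshold for enough checks in a row.
    boolean shouldStop(double pagesPerSecond) {
        if (minPagesPerSecond < 0) {
            return false; // feature disabled
        }
        if (pagesPerSecond < minPagesPerSecond) {
            consecutiveLowChecks++;
        } else {
            consecutiveLowChecks = 0; // healthy again, so a temporary dip is forgiven
        }
        return consecutiveLowChecks >= maxConsecutiveLowChecks;
    }
}
```

Resetting the counter on a healthy measurement is what makes the count "sequential": only an unbroken run of low readings triggers the stop.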
[jira] [Updated] (NUTCH-1355) nutchgora Configure minimum throughput for fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1355: Attachment: NUTCH-1355.patch nutchgora Configure minimum throughput for fetcher -- Key: NUTCH-1355 URL: https://issues.apache.org/jira/browse/NUTCH-1355 Project: Nutch Issue Type: New Feature Reporter: Ferdy Galema Fix For: nutchgora Attachments: NUTCH-1355.patch Like trunk, nutchgora should also have a feature to configure the fetcher with a minimum throughput. (See NUTCH-1067 for the work done by Markus). It's implemented in almost the same way, except that the number of times throughput falls below threshold is measured sequentially. (The counter is reset when throughput is healthy again; this should work even better against temporary dips). Defaults to disabled. Will commit later today if there is no objection.
[jira] [Created] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.
Ferdy Galema created NUTCH-1356: --- Summary: ParseUtil use ExecutorService instead of manually thread handling. Key: NUTCH-1356 URL: https://issues.apache.org/jira/browse/NUTCH-1356 Project: Nutch Issue Type: Improvement Reporter: Ferdy Galema Fix For: nutchgora Attachments: NUTCH-1356.patch Because ParseUtil manages its own parser threads by creating a thread for every parse, specific parsers can become very expensive. For example, parsers that have threadlocal fields will initialize them for every item to be parsed. By simply introducing a caching ExecutorService, ParseUtil will be able to reuse threads, making parsing more efficient. See attached patch.
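To illustrate the effect described in NUTCH-1356, the sketch below runs parses on Executors.newCachedThreadPool() and counts how often a thread-local field gets initialized. It is not the attached patch: CachedParseThreads, parse, parseAll, and the StringBuilder stand-in for expensive parser state are hypothetical names chosen for the example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo, not taken from the NUTCH-1356 patch.
final class CachedParseThreads {

    // Counts how often the (pretend) expensive per-thread state is built.
    static final AtomicInteger INITIALIZATIONS = new AtomicInteger();

    // Stand-in for a parser's thread-local field.
    static final ThreadLocal<StringBuilder> PARSER_STATE = ThreadLocal.withInitial(() -> {
        INITIALIZATIONS.incrementAndGet();
        return new StringBuilder();
    });

    // Pretend parse step: touches the thread-local state and returns a result.
    static String parse(String content) {
        StringBuilder buffer = PARSER_STATE.get();
        buffer.setLength(0);
        return buffer.append(content.trim().toUpperCase()).toString();
    }

    // Parses the pages one after another on a cached pool and returns the
    // total number of thread-local initializations seen so far.
    static int parseAll(String[] pages) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (String page : pages) {
                Callable<String> task = () -> parse(page);
                Future<String> result = pool.submit(task);
                result.get(); // wait for this parse before starting the next
            }
        } finally {
            pool.shutdown();
        }
        return INITIALIZATIONS.get();
    }
}
```

With one new thread per parse, the count would always equal the number of pages. With the cached pool, an idle worker is usually reused, so the count is typically lower, although exact reuse depends on thread timing and is not guaranteed.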
[jira] [Updated] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.
[ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1356: Attachment: NUTCH-1356.patch ParseUtil use ExecutorService instead of manually thread handling. -- Key: NUTCH-1356 URL: https://issues.apache.org/jira/browse/NUTCH-1356 Project: Nutch Issue Type: Improvement Reporter: Ferdy Galema Fix For: nutchgora Attachments: NUTCH-1356.patch Because ParseUtil manages its own parser threads by creating a thread for every parse, specific parsers can become very expensive. For example, parsers that have threadlocal fields will initialize them for every item to be parsed. By simply introducing a caching ExecutorService, ParseUtil will be able to reuse threads, making parsing more efficient. See attached patch.
[jira] [Updated] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.
[ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1356: Fix Version/s: 1.6 Sure, I will create a patch for 1.x too. (Seems not that different). ParseUtil use ExecutorService instead of manually thread handling. -- Key: NUTCH-1356 URL: https://issues.apache.org/jira/browse/NUTCH-1356 Project: Nutch Issue Type: Improvement Reporter: Ferdy Galema Fix For: nutchgora, 1.6 Attachments: NUTCH-1356.patch Because ParseUtil manages its own parser threads by creating a thread for every parse, specific parsers can become very expensive. For example, parsers that have threadlocal fields will initialize them for every item to be parsed. By simply introducing a caching ExecutorService, ParseUtil will be able to reuse threads, making parsing more efficient. See attached patch.
[jira] [Updated] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.
[ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1356: Attachment: NUTCH-1356-trunk.patch Patch for trunk. ParseUtil use ExecutorService instead of manually thread handling. -- Key: NUTCH-1356 URL: https://issues.apache.org/jira/browse/NUTCH-1356 Project: Nutch Issue Type: Improvement Reporter: Ferdy Galema Fix For: nutchgora, 1.6 Attachments: NUTCH-1356-trunk.patch, NUTCH-1356.patch Because ParseUtil manages its own parser threads by creating a thread for every parse, specific parsers can become very expensive. For example, parsers that have threadlocal fields will initialize them for every item to be parsed. By simply introducing a caching ExecutorService, ParseUtil will be able to reuse threads, making parsing more efficient. See attached patch.
[jira] [Commented] (NUTCH-1352) Improve regex urlfilters/normalizers synchronization
[ https://issues.apache.org/jira/browse/NUTCH-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269619#comment-13269619 ] Ferdy Galema commented on NUTCH-1352: - Thanks. Improve regex urlfilters/normalizers synchronization Key: NUTCH-1352 URL: https://issues.apache.org/jira/browse/NUTCH-1352 Project: Nutch Issue Type: Improvement Reporter: Ferdy Galema Fix For: nutchgora, 1.6 Attachments: NUTCH-1352-1.6-1.patch, NUTCH-1352.patch I noticed that during fetching the fetcher threads spend a lot of time blocked on a monitor because of outlink normalizing/filtering. The cause: some of the regex plugins use single-lock synchronization. This patch improves throughput by removing the synchronization locks and replacing them with threadlocals where needed. It has been extensively tested in production. I will commit this later today if there is no objection.
[jira] [Updated] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.
[ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1356: Attachment: NUTCH-1356-trunk-v2.patch It was working, though; I guess that is because of a transitive dependency. Anyway, it's best to declare it as a direct dependency too. Patch v2 does this. (11.0.2, the same as the already present jar). ParseUtil use ExecutorService instead of manually thread handling. -- Key: NUTCH-1356 URL: https://issues.apache.org/jira/browse/NUTCH-1356 Project: Nutch Issue Type: Improvement Reporter: Ferdy Galema Fix For: nutchgora, 1.6 Attachments: NUTCH-1356-trunk-v2.patch, NUTCH-1356-trunk.patch, NUTCH-1356.patch Because ParseUtil manages its own parser threads by creating a thread for every parse, specific parsers can become very expensive. For example, parsers that have threadlocal fields will initialize them for every item to be parsed. By simply introducing a caching ExecutorService, ParseUtil will be able to reuse threads, making parsing more efficient. See attached patch.
[jira] [Closed] (NUTCH-1355) nutchgora Configure minimum throughput for fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1355. --- Resolution: Fixed committed nutchgora Configure minimum throughput for fetcher -- Key: NUTCH-1355 URL: https://issues.apache.org/jira/browse/NUTCH-1355 Project: Nutch Issue Type: New Feature Reporter: Ferdy Galema Fix For: nutchgora Attachments: NUTCH-1355.patch Like trunk, nutchgora should also have a feature to configure the fetcher with a minimum throughput. (See NUTCH-1067 for the work done by Markus). It's implemented in almost the same way, except that the number of times throughput falls below threshold is measured sequentially. (The counter is reset when throughput is healthy again; this should work even better against temporary dips). Defaults to disabled. Will commit later today if there is no objection.
[jira] [Commented] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.
[ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269695#comment-13269695 ] Ferdy Galema commented on NUTCH-1356: - committed at nutchgora ParseUtil use ExecutorService instead of manually thread handling. -- Key: NUTCH-1356 URL: https://issues.apache.org/jira/browse/NUTCH-1356 Project: Nutch Issue Type: Improvement Reporter: Ferdy Galema Fix For: nutchgora, 1.6 Attachments: NUTCH-1356-trunk-v2.patch, NUTCH-1356-trunk.patch, NUTCH-1356.patch Because ParseUtil manages its own parser threads by creating a thread for every parse, specific parsers can become very expensive. For example, parsers that have threadlocal fields will initialize them for every item to be parsed. By simply introducing a caching ExecutorService, ParseUtil will be able to reuse threads, making parsing more efficient. See attached patch.
[jira] [Commented] (NUTCH-1352) Improve regex urlfilters/normalizers synchronization
[ https://issues.apache.org/jira/browse/NUTCH-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269697#comment-13269697 ] Ferdy Galema commented on NUTCH-1352: - committed at nutchgora Improve regex urlfilters/normalizers synchronization Key: NUTCH-1352 URL: https://issues.apache.org/jira/browse/NUTCH-1352 Project: Nutch Issue Type: Improvement Reporter: Ferdy Galema Fix For: nutchgora, 1.6 Attachments: NUTCH-1352-1.6-1.patch, NUTCH-1352.patch I noticed that during fetching the fetcher threads spend a lot of time blocked on a monitor because of outlink normalizing/filtering. The cause: some of the regex plugins use single-lock synchronization. This patch improves throughput by removing the synchronization locks and replacing them with threadlocals where needed. It has been extensively tested in production. I will commit this later today if there is no objection.
[jira] [Created] (NUTCH-1357) All gora mapreduce functionality should go through StorageUtils
Ferdy Galema created NUTCH-1357: --- Summary: All gora mapreduce functionality should go through StorageUtils Key: NUTCH-1357 URL: https://issues.apache.org/jira/browse/NUTCH-1357 Project: Nutch Issue Type: Improvement Reporter: Ferdy Galema Fix For: nutchgora I am trying to make the concept of crawlId work for ALL nutch jobs. The biggest reason it does not work as expected seems to be the various ways gora mapreduce is used in nutch. Some jobs use StorageUtils, some use GoraMapper/GoraReducer, and some even use GoraInputFormat/GoraOutputFormat directly. But the only place where a crawlId is translated into a schema name is StorageUtils! I am currently converting all calls to Gora* mapreduce initialization code into StorageUtils calls.
[jira] [Created] (NUTCH-1358) Do not accept bogus arguments
Ferdy Galema created NUTCH-1358: --- Summary: Do not accept bogus arguments Key: NUTCH-1358 URL: https://issues.apache.org/jira/browse/NUTCH-1358 Project: Nutch Issue Type: Improvement Reporter: Ferdy Galema Priority: Minor Fix For: nutchgora Some of the tools do not explicitly check every passed argument for validity. This can mask very frustrating issues: when one passes wrong arguments, the tool silently ignores them instead of failing fast.
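The fail-fast behaviour asked for here can be sketched as follows. This is only an illustration under assumptions: the class StrictArgs and the specific flag names are hypothetical and are not taken from the NUTCH-1358 patch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical tool front-end; the flag names below are illustrative only.
final class StrictArgs {
  // Flags that must be followed by a value.
  private static final Set<String> VALUE_FLAGS =
      new HashSet<>(Arrays.asList("-crawlId", "-threads"));
  // Flags that stand alone.
  private static final Set<String> SWITCHES =
      new HashSet<>(Arrays.asList("-resume"));

  /** Throws IllegalArgumentException on the first argument that is not recognized. */
  static void validate(String[] args) {
    for (int i = 0; i < args.length; i++) {
      String arg = args[i];
      if (VALUE_FLAGS.contains(arg)) {
        if (i + 1 >= args.length) {
          throw new IllegalArgumentException("Missing value for " + arg);
        }
        i++; // skip the value that belongs to this flag
      } else if (!SWITCHES.contains(arg)) {
        // Fail fast instead of silently ignoring a typo such as "-thread".
        throw new IllegalArgumentException("Unrecognized argument: " + arg);
      }
    }
  }
}
```

With a check like this, a mistyped flag stops the tool before any job is submitted rather than running with defaults.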
[jira] [Updated] (NUTCH-1358) Do not accept bogus arguments
[ https://issues.apache.org/jira/browse/NUTCH-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1358: Attachment: NUTCH-1358.patch Do not accept bogus arguments - Key: NUTCH-1358 URL: https://issues.apache.org/jira/browse/NUTCH-1358 Project: Nutch Issue Type: Improvement Reporter: Ferdy Galema Priority: Minor Fix For: nutchgora Attachments: NUTCH-1358.patch Some of the tools do not explicitly check every passed argument for validity. This can mask very frustrating issues: when one passes wrong arguments, the tool silently ignores them instead of failing fast.
[jira] [Closed] (NUTCH-1358) Do not accept bogus arguments
[ https://issues.apache.org/jira/browse/NUTCH-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1358. --- Resolution: Fixed Committed. Do not accept bogus arguments - Key: NUTCH-1358 URL: https://issues.apache.org/jira/browse/NUTCH-1358 Project: Nutch Issue Type: Improvement Reporter: Ferdy Galema Priority: Minor Fix For: nutchgora Attachments: NUTCH-1358.patch Some of the tools do not explicitly check every passed argument for validity. This can mask very frustrating issues: when one passes wrong arguments, the tool silently ignores them instead of failing fast.
[jira] [Commented] (NUTCH-1357) All gora mapreduce functionality should go through StorageUtils
[ https://issues.apache.org/jira/browse/NUTCH-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271421#comment-13271421 ] Ferdy Galema commented on NUTCH-1357: - Side note: It seems some tools do need to call Gora* code directly, but this does not matter as long as they pass around the DataStore that is created by StorageUtils.createWebStore(..). All gora mapreduce functionality should go through StorageUtils --- Key: NUTCH-1357 URL: https://issues.apache.org/jira/browse/NUTCH-1357 Project: Nutch Issue Type: Improvement Reporter: Ferdy Galema Fix For: nutchgora I am trying to make the concept of crawlId work for ALL Nutch jobs. The biggest reason it does not work as expected seems to be the various ways Gora mapreduce is used in Nutch: some jobs use StorageUtils, some use GoraMapper/GoraReducer, and some even use GoraInputFormat/GoraOutputFormat directly. But the only place where a crawlId is translated into a schema name is StorageUtils! I am currently converting all calls to Gora* mapreduce initialization code into StorageUtils calls.
[jira] [Commented] (NUTCH-1363) Make parsing in FetcherJob actually work.
[ https://issues.apache.org/jira/browse/NUTCH-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271445#comment-13271445 ] Ferdy Galema commented on NUTCH-1363: - Hey Lewis, This does work, with the -Dfetcher.parse=true option. Note that the -parse is not supported anymore. Make parsing in FetcherJob actually work. - Key: NUTCH-1363 URL: https://issues.apache.org/jira/browse/NUTCH-1363 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: nutchgora Reporter: Lewis John McGibbney Fix For: nutchgora We know that parsing during fetching is not recommended, however for those that wish to dive into the abyss the functionality should be available. This issue will address this.
[jira] [Issue Comment Edited] (NUTCH-1363) Make parsing in FetcherJob actually work.
[ https://issues.apache.org/jira/browse/NUTCH-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271445#comment-13271445 ] Ferdy Galema edited comment on NUTCH-1363 at 5/9/12 2:27 PM: - Hey Lewis, This does work, with the -Dfetcher.parse=true option. Note that the -parse option is not supported anymore. (But it did the same thing). was (Author: ferdy.g): Hey Lewis, This does work, with the -Dfetcher.parse=true option. Note that the -parse is not supported anymore. Make parsing in FetcherJob actually work. - Key: NUTCH-1363 URL: https://issues.apache.org/jira/browse/NUTCH-1363 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: nutchgora Reporter: Lewis John McGibbney Fix For: nutchgora We know that parsing during fetching is not recommended, however for those that wish to dive into the abyss the functionality should be available. This issue will address this.
[jira] [Updated] (NUTCH-1357) All gora mapreduce functionality should go through StorageUtils
[ https://issues.apache.org/jira/browse/NUTCH-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1357: Fix Version/s: (was: nutchgora) On second thought, this can be solved later. All gora mapreduce functionality should go through StorageUtils --- Key: NUTCH-1357 URL: https://issues.apache.org/jira/browse/NUTCH-1357 Project: Nutch Issue Type: Improvement Reporter: Ferdy Galema I am trying to make the concept of crawlId work for ALL Nutch jobs. The biggest reason it does not work as expected seems to be the various ways Gora mapreduce is used in Nutch: some jobs use StorageUtils, some use GoraMapper/GoraReducer, and some even use GoraInputFormat/GoraOutputFormat directly. But the only place where a crawlId is translated into a schema name is StorageUtils! I am currently converting all calls to Gora* mapreduce initialization code into StorageUtils calls.
[jira] [Commented] (NUTCH-1363) Make parsing in FetcherJob actually work.
[ https://issues.apache.org/jira/browse/NUTCH-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272150#comment-13272150 ] Ferdy Galema commented on NUTCH-1363: - I'm not sure I follow. What makes this property different from all the other properties? In general, properties defined in nutch-default can be overridden using nutch-site (in both distributed and local mode) and finally using generic Hadoop -Dkey=value command-line options. Additionally, tools are able to provide specific arguments. For example, -threads 10 with the fetcher sets fetcher.threads.fetch to 10 in the configuration. Make parsing in FetcherJob actually work. - Key: NUTCH-1363 URL: https://issues.apache.org/jira/browse/NUTCH-1363 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: nutchgora Reporter: Lewis John McGibbney Fix For: nutchgora We know that parsing during fetching is not recommended, however for those that wish to dive into the abyss the functionality should be available. This issue will address this.
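The override order described in this comment can be modelled with plain java.util.Properties. This is only a toy model under assumptions: the class LayeredConfig is hypothetical, real Nutch uses Hadoop's Configuration and option parsing rather than this code, and the default values in the usage note are illustrative.

```java
import java.util.Properties;

// Toy model of the layering: nutch-default < nutch-site < command line.
// NOT Hadoop's Configuration; names and parsing rules here are assumptions.
final class LayeredConfig {
  static Properties resolve(Properties nutchDefault, Properties nutchSite, String[] args) {
    // Lookups that miss in "site" fall through to the defaults.
    Properties site = new Properties(nutchDefault);
    site.putAll(nutchSite);

    // Command-line settings sit on top of both files.
    Properties effective = new Properties(site);
    for (int i = 0; i < args.length; i++) {
      String arg = args[i];
      int eq = arg.indexOf('=');
      if (arg.startsWith("-D") && eq > 2) {
        // Generic -Dkey=value option.
        effective.setProperty(arg.substring(2, eq), arg.substring(eq + 1));
      } else if (arg.equals("-threads") && i + 1 < args.length) {
        // Tool-specific argument mapped onto a configuration key.
        effective.setProperty("fetcher.threads.fetch", args[++i]);
      }
    }
    return effective;
  }
}
```

For example, with fetcher.parse=false in the defaults, passing -Dfetcher.parse=true yields true from getProperty("fetcher.parse"), while keys that are not overridden still resolve to their default values.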
[jira] [Created] (NUTCH-1365) Fix crawlId functionality by making use of new gora configuration
Ferdy Galema created NUTCH-1365: --- Summary: Fix crawlId functionality by making use of new gora configuration Key: NUTCH-1365 URL: https://issues.apache.org/jira/browse/NUTCH-1365 Project: Nutch Issue Type: Bug Reporter: Ferdy Galema Fix For: 2.1 With GORA-126 it is finally possible to make correct use of crawlId throughout Nutch. This patch changes StorageUtils so that the preferred schema name (crawlId + _ + schema) is correctly set on Gora.
[jira] [Updated] (NUTCH-1365) Fix crawlId functionality by making use of new gora configuration
[ https://issues.apache.org/jira/browse/NUTCH-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1365: Attachment: NUTCH-1365.patch Fix crawlId functionality by making use of new gora configuration Key: NUTCH-1365 URL: https://issues.apache.org/jira/browse/NUTCH-1365 Project: Nutch Issue Type: Bug Reporter: Ferdy Galema Fix For: 2.1 Attachments: NUTCH-1365.patch With GORA-126 it is finally possible to make correct use of crawlId throughout Nutch. This patch changes StorageUtils so that the preferred schema name (crawlId + _ + schema) is correctly set on Gora.
[jira] [Commented] (NUTCH-1306) Commit after finished writing to solr index
[ https://issues.apache.org/jira/browse/NUTCH-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272231#comment-13272231 ] Ferdy Galema commented on NUTCH-1306: - Lewis, do you suggest adding the commit as implemented by the fix, but making it conditional? Something like this: if (getConf().getBoolean("solr.commit", true)) { solr.commit(); } This makes it enabled by default. I think it is a good idea. Secondly, you say that Nutchgora does not commit at all. It looks like trunk does not commit either. I think it's a bit confusing that the COMMIT_SIZE nutch property does not trigger a solr commit but rather 'flushes' data to solr. Perhaps we could clarify this a bit more. (Update the property description by mentioning the fact that it does NOT trigger a solr commit.) Agree? Commit after finished writing to solr index --- Key: NUTCH-1306 URL: https://issues.apache.org/jira/browse/NUTCH-1306 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: nutchgora Reporter: Dan Rosher Priority: Trivial Fix For: 2.1 Attachments: NUTCH-1306.patch Commit after finished writing to solr index - otherwise a bit confusing not seeing the number of docs we expect in solr
[jira] [Closed] (NUTCH-1026) Strip UTF-8 non-character codepoints
[ https://issues.apache.org/jira/browse/NUTCH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1026. --- Resolution: Fixed Fix Version/s: (was: 2.1) nutchgora When indexing a huge dataset I ran into this issue too. The patch in NUTCH-1016 works fine. (Thanks Markus!) I verified and tested this. Committed at nutchgora. Minor note: The patch checks for invalid chars ONLY on the content field of the NutchDocument. But since the problem is most likely to occur only on this field, it is okay for now. Strip UTF-8 non-character codepoints Key: NUTCH-1026 URL: https://issues.apache.org/jira/browse/NUTCH-1026 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: nutchgora Reporter: Markus Jelsma Fix For: nutchgora During a very large crawl I found a few documents producing non-character codepoints. When indexing to Solr this will yield the following exception: {code} SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0x at char #1142033, byte #1155068) at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18) at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731) {code} Quite annoying! Here's a quick fix for SolrWriter that passes the value of the content field to a method that strips away non-characters. I'm not too sure about this implementation, but the tests I've done locally with a huge dataset now pass correctly. Here's a list of codepoints to strip away: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:] Please comment!
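As a rough illustration of what stripping non-character codepoints involves, here is a minimal sketch. It is not the committed patch: the class NonCharacterStripper is hypothetical, and it uses the Unicode definition of noncharacters (U+FDD0..U+FDEF plus every code point whose low 16 bits are FFFE or FFFF) rather than whatever list the patch checks.

```java
// Hypothetical helper, NOT the code committed for NUTCH-1026.
final class NonCharacterStripper {
  /** True for the 66 Unicode noncharacters. */
  static boolean isNonCharacter(int codePoint) {
    return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
        || (codePoint & 0xFFFE) == 0xFFFE; // U+FFFE, U+FFFF, U+1FFFE, ... U+10FFFF
  }

  /** Returns a copy of the input with all noncharacters removed. */
  static String strip(String input) {
    StringBuilder out = new StringBuilder(input.length());
    for (int i = 0; i < input.length(); ) {
      // Iterate by code point so supplementary characters are handled as a unit.
      int cp = input.codePointAt(i);
      i += Character.charCount(cp);
      if (!isNonCharacter(cp)) {
        out.appendCodePoint(cp);
      }
    }
    return out.toString();
  }
}
```

A field value would be passed through strip(..) before the document is handed to the indexer; applying it to every string field, not only content, would address the minor note above.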
[jira] [Updated] (NUTCH-1306) Commit after finished writing to solr index
[ https://issues.apache.org/jira/browse/NUTCH-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1306: Attachment: NUTCH-1306-v2.patch NUTCH-1306-trunk.patch Agree with trying to make both branches match each other. By the way, there is a commit done after the whole job completes. (I previously thought there was no commit at all, but I was wrong.) But if this is the case, then the commit after closing a single indexwriter is not needed. (So the reason Dan is not seeing updates must have been a different problem.) Anyway, I've uploaded patches that make the commit after job completion configurable (but enabled by default). Let me know if there are comments. Commit after finished writing to solr index --- Key: NUTCH-1306 URL: https://issues.apache.org/jira/browse/NUTCH-1306 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: nutchgora Reporter: Dan Rosher Priority: Trivial Fix For: 2.1 Attachments: NUTCH-1306-trunk.patch, NUTCH-1306-v2.patch, NUTCH-1306.patch Commit after finished writing to solr index - otherwise a bit confusing not seeing the number of docs we expect in solr
[jira] [Updated] (NUTCH-1306) Commit after finished writing to solr index
[ https://issues.apache.org/jira/browse/NUTCH-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1306: Attachment: NUTCH-1306-trunk-v2.patch Heh, indeed that's not ready for committing yet. Weird though that my workspace did not get a compile error at first, only after refreshing the ivy deps. (Somehow it fetched a Gora library.) Anyway, I've uploaded an updated patch. I was not aware of NUTCH-1025. Is it ok if we incorporate that issue and rename this issue to "Add option to not commit and clarify existing solr.commit.size"? Commit after finished writing to solr index --- Key: NUTCH-1306 URL: https://issues.apache.org/jira/browse/NUTCH-1306 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: nutchgora Reporter: Dan Rosher Priority: Trivial Fix For: 2.1 Attachments: NUTCH-1306-trunk-v2.patch, NUTCH-1306-trunk.patch, NUTCH-1306-v2.patch, NUTCH-1306.patch Commit after finished writing to solr index - otherwise a bit confusing not seeing the number of docs we expect in solr
[jira] [Updated] (NUTCH-1362) Fix error handling of urls with empty fields
[ https://issues.apache.org/jira/browse/NUTCH-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1362: Attachment: NUTCH-1362.patch Hey Lewis, This patch fixes the problem and makes the reversing a bit faster by using StringUtils.split instead of String.split. (The latter compiles a regular expression every time a split is done. That's a bit excessive for simple dot and colon splitting.) Tested and verified. Fix error handling of urls with empty fields - Key: NUTCH-1362 URL: https://issues.apache.org/jira/browse/NUTCH-1362 Project: Nutch Issue Type: Bug Affects Versions: nutchgora Reporter: Lewis John McGibbney Fix For: nutchgora Attachments: NUTCH-1362.patch Within o.a.n.util.TableUtil.reverseAppendSplits() a simple if (split.length > 0) block enables us to address this issue.
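The two ideas in this comment, splitting on a single character without a regular expression and guarding against an empty split result, can be sketched in plain Java. This is an illustration only: HostReverser and its methods are hypothetical stand-ins, not the TableUtil code, and splitOn is a hand-written substitute for the commons-lang call so the example has no dependencies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the host-reversing logic; NOT the TableUtil source.
final class HostReverser {
  /** Splits on a single character without a regex; empty tokens are dropped. */
  static List<String> splitOn(String s, char sep) {
    List<String> parts = new ArrayList<>();
    int start = 0;
    for (int i = 0; i <= s.length(); i++) {
      if (i == s.length() || s.charAt(i) == sep) {
        if (i > start) {
          parts.add(s.substring(start, i));
        }
        start = i + 1;
      }
    }
    return parts;
  }

  /** "www.example.com" becomes "com.example.www"; degenerate hosts yield "". */
  static String reverseHost(String host) {
    List<String> parts = splitOn(host, '.');
    if (parts.isEmpty()) {
      // Guard: without this, indexing element 0 of an empty split result would throw.
      return "";
    }
    StringBuilder sb = new StringBuilder(host.length());
    for (int i = parts.size() - 1; i > 0; i--) {
      sb.append(parts.get(i)).append('.');
    }
    return sb.append(parts.get(0)).toString();
  }
}
```

A host such as "." produces no tokens at all, which is the kind of input that triggers an index-out-of-bounds error when the first element is read unconditionally.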
[jira] [Commented] (NUTCH-1362) Fix error handling of urls with empty fields
[ https://issues.apache.org/jira/browse/NUTCH-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273129#comment-13273129 ] Ferdy Galema commented on NUTCH-1362: - Btw this is a duplicate of NUTCH-1077. Fix error handling of urls with empty fields - Key: NUTCH-1362 URL: https://issues.apache.org/jira/browse/NUTCH-1362 Project: Nutch Issue Type: Bug Affects Versions: nutchgora Reporter: Lewis John McGibbney Fix For: nutchgora Attachments: NUTCH-1362.patch Within o.a.n.util.TableUtil.reverseAppendSplits() a simple if (split.length > 0) block enables us to address this issue.
[jira] [Closed] (NUTCH-1077) Nutch 2 DbUpdateMapper throws ArrayOutOfBoundsException when running update
[ https://issues.apache.org/jira/browse/NUTCH-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1077. --- Resolution: Duplicate Fix Version/s: (was: 2.1) Will be fixed with NUTCH-1362. (Use attached patch or wait for commit.) Thanks for reporting. Nutch 2 DbUpdateMapper throws ArrayOutOfBoundsException when running update --- Key: NUTCH-1077 URL: https://issues.apache.org/jira/browse/NUTCH-1077 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: nutchgora Environment: CentOS 5 Linux with CDH3 Hadoop. Reporter: Tom Davidson I got this error when running a simple nutch update after doing a small fetch and parse. java.lang.ArrayIndexOutOfBoundsException: 0 at org.apache.nutch.util.TableUtil.reverseAppendSplits(TableUtil.java:126) at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:66) at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43) at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:70) at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:36) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.Child$4.run(Child.java:270) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127) at org.apache.hadoop.mapred.Child.main(Child.java:264)
[jira] [Closed] (NUTCH-1362) Fix error handling of urls with empty fields
[ https://issues.apache.org/jira/browse/NUTCH-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1362. --- Resolution: Fixed Done! Thanks. Fix error handling of urls with empty fields - Key: NUTCH-1362 URL: https://issues.apache.org/jira/browse/NUTCH-1362 Project: Nutch Issue Type: Bug Affects Versions: nutchgora Reporter: Lewis John McGibbney Fix For: nutchgora Attachments: NUTCH-1362.patch Within o.a.n.util.TableUtil.reverseAppendSplits() a simple if (split.length > 0) block enables us to address this issue.
[jira] [Created] (NUTCH-1366) speed up indexing by eliminating the indexreducer
Ferdy Galema created NUTCH-1366: --- Summary: speed up indexing by eliminating the indexreducer Key: NUTCH-1366 URL: https://issues.apache.org/jira/browse/NUTCH-1366 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Ferdy Galema Fix For: nutchgora Currently the indexer in Nutchgora consists of both mappers and reducers. But the reduce code does not actually iterate over any (grouped/sorted) values; it simply indexes individual key/value (String/WebPage) pairs. By moving this indexing code to the mapper we can eliminate the reduce step, making the indexing job much faster. (No more unnecessary spilling to disk/network and no CPU wasted on sorting.) Note this is not (directly) applicable to trunk because trunk uses a quite different approach: different types of input are combined into a single value in the reducer. Although I think it is possible to implement a similar optimization there, I am not sure how to do it. So if anyone wants this for trunk too, feel free to implement a similar patch.
[jira] [Updated] (NUTCH-1366) speed up indexing by eliminating the indexreducer
[ https://issues.apache.org/jira/browse/NUTCH-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1366: Attachment: NUTCH-1366.patch speed up indexing by eliminating the indexreducer - Key: NUTCH-1366 URL: https://issues.apache.org/jira/browse/NUTCH-1366 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Ferdy Galema Fix For: nutchgora Attachments: NUTCH-1366.patch Currently the indexer in Nutchgora consists of both mappers and reducers. But the reduce code does not actually iterate over any (grouped/sorted) values; it simply indexes individual key/value (String/WebPage) pairs. By moving this indexing code to the mapper we can eliminate the reduce step, making the indexing job much faster. (No more unnecessary spilling to disk/network and no CPU wasted on sorting.) Note this is not (directly) applicable to trunk because trunk uses a quite different approach: different types of input are combined into a single value in the reducer. Although I think it is possible to implement a similar optimization there, I am not sure how to do it. So if anyone wants this for trunk too, feel free to implement a similar patch.
[jira] [Commented] (NUTCH-1366) speed up indexing by eliminating the indexreducer
[ https://issues.apache.org/jira/browse/NUTCH-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273335#comment-13273335 ] Ferdy Galema commented on NUTCH-1366: - The cool part about Nutchgora is that inlinks are already populated for the row that is input to the indexer. The DbUpdateReducer does this outlink inversion as part of updating the db. Btw it's very simple to reinstate the reducer if we ever need one again. speed up indexing by eliminating the indexreducer - Key: NUTCH-1366 URL: https://issues.apache.org/jira/browse/NUTCH-1366 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Ferdy Galema Fix For: nutchgora Attachments: NUTCH-1366.patch Currently the indexer in Nutchgora consists of both mappers and reducers. But the reduce code does not actually iterate over any (grouped/sorted) values; it simply indexes individual key/value (String/WebPage) pairs. By moving this indexing code to the mapper we can eliminate the reduce step, making the indexing job much faster. (No more unnecessary spilling to disk/network and no CPU wasted on sorting.) Note this is not (directly) applicable to trunk because trunk uses a quite different approach: different types of input are combined into a single value in the reducer. Although I think it is possible to implement a similar optimization there, I am not sure how to do it. So if anyone wants this for trunk too, feel free to implement a similar patch.
[jira] [Commented] (NUTCH-1367) Port ParserChecker to Nutchgora
[ https://issues.apache.org/jira/browse/NUTCH-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274590#comment-13274590 ] Ferdy Galema commented on NUTCH-1367: - Hey Lewis, This tool is already present in Nutchgora. Port ParserChecker to Nutchgora --- Key: NUTCH-1367 URL: https://issues.apache.org/jira/browse/NUTCH-1367 Project: Nutch Issue Type: New Feature Components: parser Affects Versions: nutchgora Reporter: Lewis John McGibbney Fix For: 2.1 This is such a great tool. It has come in handy so many times I would go blue in the face if I had to try and count. e.g. for (int i = 0; i < infinity; i++) I think you get the idea.
[jira] [Closed] (NUTCH-1366) speed up indexing by eliminating the indexreducer
[ https://issues.apache.org/jira/browse/NUTCH-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1366. --- Resolution: Fixed Committed. speed up indexing by eliminating the indexreducer - Key: NUTCH-1366 URL: https://issues.apache.org/jira/browse/NUTCH-1366 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Ferdy Galema Fix For: nutchgora Attachments: NUTCH-1366.patch Currently the indexer in Nutchgora consists of both mappers and reducers. But the reduce code does not actually iterate over any (grouped/sorted) values; it simply indexes individual key/value (String/WebPage) pairs. By moving this indexing code to the mapper we can eliminate the reduce step, making the indexing job much faster. (No more unnecessary spilling to disk/network and no CPU wasted on sorting.) Note this is not (directly) applicable to trunk because trunk uses a quite different approach: different types of input are combined into a single value in the reducer. Although I think it is possible to implement a similar optimization there, I am not sure how to do it. So if anyone wants this for trunk too, feel free to implement a similar patch.
[jira] [Commented] (NUTCH-879) URL-s getting lost
[ https://issues.apache.org/jira/browse/NUTCH-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281459#comment-13281459 ] Ferdy Galema commented on NUTCH-879: Agree to fix this issue later. Although I have not yet been able to get to the bottom of this, I'm pretty sure the issue is not as severe as originally reported. (Based on current experiences with running Nutchgora in production.) URL-s getting lost -- Key: NUTCH-879 URL: https://issues.apache.org/jira/browse/NUTCH-879 Project: Nutch Issue Type: Bug Affects Versions: nutchgora Environment: * Ubuntu 10.4 x64, Sun JDK 1.6 * using 1-node Hadoop + HDFS * trunk r983472, using MySQL store * branch-1.3 Reporter: Andrzej Bialecki Fix For: 2.1 Attachments: branch-1.3-bench.txt, trunk-bench.txt I ran the Benchmark using branch-1.3 and trunk (formerly nutchbase). With the same Benchmark parameters and the same plugins branch-1.3 collects ~1.5mln urls, while trunk collects ~20,000 urls. Clearly something is wrong.
[jira] [Created] (NUTCH-1378) HostDb NullPointerException
Ferdy Galema created NUTCH-1378: --- Summary: HostDb NullPointerException Key: NUTCH-1378 URL: https://issues.apache.org/jira/browse/NUTCH-1378 Project: Nutch Issue Type: Bug Reporter: Ferdy Galema Fix For: nutchgora This is a no-brainer fix for an NPE when using the HostDb functionality. Will attach patch and commit right away.
[jira] [Updated] (NUTCH-1378) HostDb NullPointerException
[ https://issues.apache.org/jira/browse/NUTCH-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1378: Attachment: NUTCH-1378.patch HostDb NullPointerException --- Key: NUTCH-1378 URL: https://issues.apache.org/jira/browse/NUTCH-1378 Project: Nutch Issue Type: Bug Reporter: Ferdy Galema Fix For: nutchgora Attachments: NUTCH-1378.patch This is a no-brainer to fix a NPE when using the HostDb functionality. Will attach patch and commit right away.
[jira] [Closed] (NUTCH-1378) HostDb NullPointerException
[ https://issues.apache.org/jira/browse/NUTCH-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1378. --- Resolution: Fixed HostDb NullPointerException --- Key: NUTCH-1378 URL: https://issues.apache.org/jira/browse/NUTCH-1378 Project: Nutch Issue Type: Bug Reporter: Ferdy Galema Fix For: nutchgora Attachments: NUTCH-1378.patch This is a no-brainer to fix a NPE when using the HostDb functionality. Will attach patch and commit right away.
[jira] [Commented] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.
[ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285510#comment-13285510 ] Ferdy Galema commented on NUTCH-1356: - I find it difficult to believe those exceptions are caused by this patch. It does not change the way exceptions/timeouts are handled; it only makes sure parser threads are reused. It seems you are suffering from two types of (unrelated) exceptions. The first is ExecutionException. This is thrown by FutureTask.get() whenever the execution inside the task throws an exception that is not caught anywhere but in FutureTask.get() itself. In your case this seems to be an NPE during the parse of the html page. Might be a bug but then again it is strange that it is not reproducible with the ParserChecker. (You sure about this?) The second is TimeoutException, thrown whenever the FutureTask.get() cannot be completed within the specified timeout. The tricky part is that single urls might be perfectly able to complete within the timeout, but when there is a heavy concurrent load (a lot of semi-expensive parses) the parser load might stack up and cause many parses to time out. This can be the case with parsing during fetch. But when using a separate parserjob this can also happen because Parser implementations do not necessarily have to respond to a thread interrupt. (Which is fired away with the task.cancel(true) call). If a parser does not check the Thread.interrupted state at regular intervals, it will just continue to run and eat up resources. I find it very helpful to debug stalling fetchers/parsers with the lazy man's profiler: kill -QUIT process_id. This will dump stacktraces, sometimes exposing the fact that hundreds of parser threads are still active in the background. (Of course many of them already timed out a long time ago). ParseUtil use ExecutorService instead of manually thread handling. -- Key: NUTCH-1356 URL: https://issues.apache.org/jira/browse/NUTCH-1356 Project: Nutch Issue Type: Improvement Reporter: Ferdy Galema Fix For: nutchgora, 1.6 Attachments: NUTCH-1356-trunk-v2.patch, NUTCH-1356-trunk.patch, NUTCH-1356.patch Because ParseUtil manages its own parser threads by creating a thread for every parse, it sometimes happens that specific parsers are very expensive. For example, parsers that have threadlocal fields will initialize them for every item to be parsed. By simply introducing a caching ExecutorService the ParseUtil will be able to cache threads, therefore making parsing more efficient. See attached patch.
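As a rough illustration of the two failure modes described in the comment above, the sketch below submits a parse to a cached ExecutorService and waits on the Future with a timeout. It is not the NUTCH-1356 patch: the class TimedParseSketch, the method runWithTimeout and the status strings it returns are hypothetical names chosen for this example, and only standard java.util.concurrent calls are used.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration only; not the actual Nutch ParseUtil.
class TimedParseSketch {
  // A cached pool reuses idle worker threads instead of creating one per parse,
  // so thread-local parser state can survive between parses.
  private final ExecutorService pool = Executors.newCachedThreadPool();

  /** Returns the task's result, or a short status string describing why it failed. */
  String runWithTimeout(Callable<String> parse, long timeoutMillis) {
    Future<String> task = pool.submit(parse);
    try {
      return task.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // get() gave up waiting. cancel(true) interrupts the worker thread,
      // but the work only stops if the task reacts to the interrupt.
      task.cancel(true);
      return "timeout";
    } catch (ExecutionException e) {
      // The task itself threw; the original exception is the cause.
      return "failed: " + e.getCause().getClass().getSimpleName();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // restore the caller's interrupt flag
      return "interrupted";
    }
  }

  void shutdown() {
    pool.shutdownNow();
  }
}
```

A task that ignores the interrupt keeps its worker thread busy after the timeout, which is the build-up of background parser threads that the thread dump exposes.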
[jira] [Created] (NUTCH-1379) NPE when reprUrl is null in ParseUtil
Ferdy Galema created NUTCH-1379: --- Summary: NPE when reprUrl is null in ParseUtil Key: NUTCH-1379 URL: https://issues.apache.org/jira/browse/NUTCH-1379 Project: Nutch Issue Type: Bug Reporter: Ferdy Galema
[jira] [Updated] (NUTCH-1379) NPE when reprUrl is null in ParseUtil
[ https://issues.apache.org/jira/browse/NUTCH-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1379: Attachment: NUTCH-1379.patch committed NPE when reprUrl is null in ParseUtil - Key: NUTCH-1379 URL: https://issues.apache.org/jira/browse/NUTCH-1379 Project: Nutch Issue Type: Bug Reporter: Ferdy Galema Attachments: NUTCH-1379.patch
[jira] [Reopened] (NUTCH-1379) NPE when reprUrl is null in ParseUtil
[ https://issues.apache.org/jira/browse/NUTCH-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema reopened NUTCH-1379: - NPE when reprUrl is null in ParseUtil - Key: NUTCH-1379 URL: https://issues.apache.org/jira/browse/NUTCH-1379 Project: Nutch Issue Type: Bug Reporter: Ferdy Galema Attachments: NUTCH-1379.patch
[jira] [Closed] (NUTCH-1379) NPE when reprUrl is null in ParseUtil
[ https://issues.apache.org/jira/browse/NUTCH-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1379. --- Resolution: Fixed NPE when reprUrl is null in ParseUtil - Key: NUTCH-1379 URL: https://issues.apache.org/jira/browse/NUTCH-1379 Project: Nutch Issue Type: Bug Reporter: Ferdy Galema Attachments: NUTCH-1379.patch
[jira] [Closed] (NUTCH-1379) NPE when reprUrl is null in ParseUtil
[ https://issues.apache.org/jira/browse/NUTCH-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1379. --- Resolution: Fixed NPE when reprUrl is null in ParseUtil - Key: NUTCH-1379 URL: https://issues.apache.org/jira/browse/NUTCH-1379 Project: Nutch Issue Type: Bug Reporter: Ferdy Galema Fix For: nutchgora Attachments: NUTCH-1379.patch Sometimes reprUrl is null in ParseUtil. Exact cause is still fuzzy but this is a nice workaround for now.
[jira] [Updated] (NUTCH-1379) NPE when reprUrl is null in ParseUtil
[ https://issues.apache.org/jira/browse/NUTCH-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1379: Description: Sometimes reprUrl is null in ParseUtil. Exact cause is still fuzzy but this is a nice workaround for now. Fix Version/s: nutchgora NPE when reprUrl is null in ParseUtil - Key: NUTCH-1379 URL: https://issues.apache.org/jira/browse/NUTCH-1379 Project: Nutch Issue Type: Bug Reporter: Ferdy Galema Fix For: nutchgora Attachments: NUTCH-1379.patch Sometimes reprUrl is null in ParseUtil. Exact cause is still fuzzy but this is a nice workaround for now.
[jira] [Commented] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.
[ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293510#comment-13293510 ] Ferdy Galema commented on NUTCH-1356: - Thanks. The parser threads you refer to, is that a known problem? Can we solve it? To solve it correctly every parser should check the interrupted state at regular intervals. This is a pretty huge task considering the number of parsers. For now it is something to keep in mind. I'll create an issue for reference. ParseUtil use ExecutorService instead of manually thread handling. -- Key: NUTCH-1356 URL: https://issues.apache.org/jira/browse/NUTCH-1356 Project: Nutch Issue Type: Improvement Reporter: Ferdy Galema Fix For: nutchgora, 1.6 Attachments: NUTCH-1356-trunk-v2.patch, NUTCH-1356-trunk.patch, NUTCH-1356.patch Because ParseUtil manages its own parser threads by creating a thread for every parse, it sometimes happens that specific parsers are very expensive. For example, parsers that have threadlocal fields will initialize them for every item to be parsed. By simply introducing a caching ExecutorService the ParseUtil will be able to cache threads, therefore making parsing more efficient. See attached patch.
[jira] [Created] (NUTCH-1387) All parsers should respond to cancellation.
Ferdy Galema created NUTCH-1387: --- Summary: All parsers should respond to cancellation. Key: NUTCH-1387 URL: https://issues.apache.org/jira/browse/NUTCH-1387 Project: Nutch Issue Type: Bug Reporter: Ferdy Galema During parsing a TimeoutException can occur. This happens whenever the FutureTask.get() cannot be completed within the specified timeout. The tricky part is that single urls might be perfectly able to complete within the timeout, but when there is a heavy concurrent load (a lot of semi-expensive parses) the parser load might stack up and cause many parses to time out. This can be the case with parsing during fetch. But when using a separate parserjob this can also happen because Parser implementations do not necessarily have to respond to a thread interrupt. (Which is fired away with the task.cancel(true) call). If a parser does not check the Thread.interrupted state at regular intervals, it will just continue to run and eat up resources. I find it very helpful to debug stalling fetchers/parsers with the lazy man's profiler: kill -QUIT process_id. This will dump stacktraces, sometimes exposing the fact that hundreds of parser threads are still active in the background. (Of course many of them already timed out a long time ago). To fix this, every parser should check its interrupted state at regular intervals. (For example an html parse might be stuck walking the DOM tree, so checking after every Nth element would be an appropriate moment.) This issue is for reference first. Fixing it all at once would be a huge task.
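The periodic check proposed in the description above might look roughly like the following sketch. It assumes a parser that walks a flat list of nodes; the class InterruptAwareWalk, the method walk and the interval CHECK_EVERY are hypothetical and not part of any Nutch parser. The only real API relied on is Thread.interrupted(), which reports and clears the current thread's interrupt flag.

```java
import java.util.List;

// Hypothetical sketch of a cancellable parser loop; names are illustrative, not Nutch API.
class InterruptAwareWalk {
  // Assumed interval: check the interrupt flag once per 100 visited nodes.
  static final int CHECK_EVERY = 100;

  /** Visits every node and returns the count; aborts if the thread was interrupted. */
  static int walk(List<String> nodes) throws InterruptedException {
    int visited = 0;
    for (String node : nodes) {
      // Thread.interrupted() returns true (and clears the flag) if someone
      // called interrupt() on this thread, e.g. via Future.cancel(true).
      if (visited % CHECK_EVERY == 0 && Thread.interrupted()) {
        throw new InterruptedException("parse cancelled after " + visited + " nodes");
      }
      // ... real per-node parsing work would go here ...
      visited++;
    }
    return visited;
  }
}
```

Checking only every Nth element keeps the overhead of the check small while still bounding how long a cancelled parse can keep running.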
[jira] [Updated] (NUTCH-1387) All parsers should respond to cancellation / interrupts.
[ https://issues.apache.org/jira/browse/NUTCH-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1387: Component/s: parser Summary: All parsers should respond to cancellation / interrupts. (was: All parsers should respond to cancellation.) All parsers should respond to cancellation / interrupts. Key: NUTCH-1387 URL: https://issues.apache.org/jira/browse/NUTCH-1387 Project: Nutch Issue Type: Bug Components: parser Reporter: Ferdy Galema During parsing a TimeoutException can occur. This happens whenever the FutureTask.get() cannot be completed within the specified timeout. The tricky part is that single urls might be perfectly able to complete within the timeout, but when there is a heavy concurrent load (a lot of semi-expensive parses) the parser load might stack up and cause many parses to time out. This can be the case with parsing during fetch. But when using a separate parserjob this can also happen because Parser implementations do not necessarily have to respond to a thread interrupt. (Which is fired away with the task.cancel(true) call). If a parser does not check the Thread.interrupted state at regular intervals, it will just continue to run and eat up resources. I find it very helpful to debug stalling fetchers/parsers with the lazy man's profiler: kill -QUIT process_id. This will dump stacktraces, sometimes exposing the fact that hundreds of parser threads are still active in the background. (Of course many of them already timed out a long time ago). To fix this, every parser should check its interrupted state at regular intervals. (For example an html parse might be stuck walking the DOM tree, so checking after every Nth element would be an appropriate moment.) This issue is for reference first. Fixing it all at once would be a huge task.
[jira] [Commented] (NUTCH-1342) Read time out protocol-http
[ https://issues.apache.org/jira/browse/NUTCH-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13294429#comment-13294429 ] Ferdy Galema commented on NUTCH-1342: - Do you have any clue as to why protocol-httpclient has a different behaviour? Also, two suggestions for your patch: Perhaps you could fine-tune the mechanism by allowing a configurable number of timeouts before definitively failing. Something like: if (++timeoutRetries > this.allowedNumberOfTimeoutRetries) throw e; //rethrow Secondly, could you specifically catch SocketTimeoutException? (I'm not sure if there are other IOExceptions that shouldn't be caught in any case.) Read time out protocol-http --- Key: NUTCH-1342 URL: https://issues.apache.org/jira/browse/NUTCH-1342 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.4, 1.5 Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Critical Fix For: 1.6 Attachments: NUTCH-1342-1.6-1.patch For some reason some URLs always time out with protocol-http but not protocol-httpclient.
The stack trace is always the same:
{code}
2012-04-20 11:25:44,275 ERROR http.Http - Failed to get protocol output
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    at java.io.FilterInputStream.read(FilterInputStream.java:116)
    at java.io.PushbackInputStream.read(PushbackInputStream.java:169)
    at java.io.FilterInputStream.read(FilterInputStream.java:90)
    at org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:228)
    at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:157)
    at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
    at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:138)
{code}
Some example URL's:
* 404 http://www.fcgroningen.nl/tribunenamen/stemmen/
* 301 http://shop.fcgroningen.nl/aanbieding
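A minimal sketch of the retry suggestion made in the comment on this issue, assuming the read can be wrapped in a Callable. This is not the attached NUTCH-1342 patch: the class TimeoutRetrySketch and the method readWithRetries are hypothetical, and the field name allowedNumberOfTimeoutRetries is borrowed from the comment. Only SocketTimeoutException is retried; any other exception propagates immediately.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical sketch; class and method names are not taken from the NUTCH-1342 patch.
class TimeoutRetrySketch {
  private final int allowedNumberOfTimeoutRetries;

  TimeoutRetrySketch(int allowedNumberOfTimeoutRetries) {
    this.allowedNumberOfTimeoutRetries = allowedNumberOfTimeoutRetries;
  }

  /** Runs the read, retrying only on SocketTimeoutException up to the configured limit. */
  <T> T readWithRetries(Callable<T> read) throws Exception {
    int timeoutRetries = 0;
    while (true) {
      try {
        return read.call();
      } catch (SocketTimeoutException e) {
        // Give up once the number of timeouts exceeds the allowed retries.
        if (++timeoutRetries > this.allowedNumberOfTimeoutRetries) throw e; // rethrow
      }
    }
  }
}
```

With a limit of 2, the read is attempted at most three times before the timeout is rethrown to the caller.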
[jira] [Updated] (NUTCH-1392) -force and -resume arguments being ignored in ParserJob
[ https://issues.apache.org/jira/browse/NUTCH-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1392: Attachment: NUTCH-1392.patch -force and -resume arguments being ignored in ParserJob --- Key: NUTCH-1392 URL: https://issues.apache.org/jira/browse/NUTCH-1392 Project: Nutch Issue Type: Bug Components: parser Affects Versions: nutchgora Reporter: Lewis John McGibbney Fix For: 2.1 Attachments: NUTCH-1392.patch From the log below there is obviously something not right here as both -resume and -force are passed to the CLI but blatantly ignored within the log output.
lewis@lewis:~/ASF/nutchgora/runtime/local$ ./bin/nutch parse
Usage: ParserJob (batchId | -all) [-crawlId id] [-resume] [-force]
    batchId - symbolic batch ID created by Generator
    -crawlId id - the id to prefix the schemas to operate on, (default: storage.crawl.id)
    -all - consider pages from all crawl jobs
    -resume - resume a previous incomplete job
    -force - force re-parsing even if a page is already parsed
lewis@lewis:~/ASF/nutchgora/runtime/local$ ./bin/nutch parse -all -resume -force
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse:false
ParserJob: parsing all
Parsing http://www.trancearoundtheworld.com/
ParserJob: success
[jira] [Commented] (NUTCH-1081) ant tests fail
[ https://issues.apache.org/jira/browse/NUTCH-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295676#comment-13295676 ] Ferdy Galema commented on NUTCH-1081: - Yes this one should be closed. ant tests fail --- Key: NUTCH-1081 URL: https://issues.apache.org/jira/browse/NUTCH-1081 Project: Nutch Issue Type: Bug Components: fetcher, generator, injector, storage Affects Versions: nutchgora Environment: Ubuntu release 11.04 (natty) Kernel Linux 2.6.38-10-generic GNOME 2.32.1 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Critical Fix For: 2.1 The following tests fail when running ant test on trunk 2.0
{code}
[junit] Running org.apache.nutch.api.TestAPI
[junit] Tests run: 4, Failures: 1, Errors: 0, Time elapsed: 11.028 sec
[junit] Test org.apache.nutch.api.TestAPI FAILED
[junit] Running org.apache.nutch.crawl.TestGenerator
[junit] Tests run: 4, Failures: 0, Errors: 4, Time elapsed: 0.478 sec
[junit] Test org.apache.nutch.crawl.TestGenerator FAILED
[junit] Running org.apache.nutch.crawl.TestInjector
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.474 sec
[junit] Test org.apache.nutch.crawl.TestInjector FAILED
[junit] Running org.apache.nutch.fetcher.TestFetcher
[junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 0.526 sec
[junit] Test org.apache.nutch.fetcher.TestFetcher FAILED
[junit] Running org.apache.nutch.storage.TestGoraStorage
[junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 0.468 sec
[junit] Test org.apache.nutch.storage.TestGoraStorage FAILED
{code}
[jira] [Created] (NUTCH-1411) nutchgora fetcher.store.content does not work
Ferdy Galema created NUTCH-1411: --- Summary: nutchgora fetcher.store.content does not work Key: NUTCH-1411 URL: https://issues.apache.org/jira/browse/NUTCH-1411 Project: Nutch Issue Type: Bug Affects Versions: nutchgora Reporter: Ferdy Galema Priority: Minor http://lucene.472066.n3.nabble.com/parse-and-solrindex-in-nutch-2-0-td3991247.html The property fetcher.store.content doesn't do anything. Content is always stored. Fix or remove property, what do you think?
[jira] [Updated] (NUTCH-1306) Add option to not commit and clarify existing solr.commit.size
[ https://issues.apache.org/jira/browse/NUTCH-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1306: Summary: Add option to not commit and clarify existing solr.commit.size (was: Commit after finished writing to solr index) Add option to not commit and clarify existing solr.commit.size -- Key: NUTCH-1306 URL: https://issues.apache.org/jira/browse/NUTCH-1306 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: nutchgora Reporter: Dan Rosher Priority: Trivial Fix For: 2.1 Attachments: NUTCH-1306-trunk-v2.patch, NUTCH-1306-trunk.patch, NUTCH-1306-v2.patch, NUTCH-1306.patch Commit after finished writing to solr index - otherwise a bit confusing not seeing the number of docs we expect in solr
[jira] [Commented] (NUTCH-1306) Add option to not commit and clarify existing solr.commit.size
[ https://issues.apache.org/jira/browse/NUTCH-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406363#comment-13406363 ] Ferdy Galema commented on NUTCH-1306: - New option added: solr.commit.index. Defaults to true (commit after index). Will commit to trunk and nutchgora on no objection. Add option to not commit and clarify existing solr.commit.size -- Key: NUTCH-1306 URL: https://issues.apache.org/jira/browse/NUTCH-1306 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: nutchgora Reporter: Dan Rosher Priority: Trivial Fix For: 2.1 Attachments: NUTCH-1306-trunk-v2.patch, NUTCH-1306-trunk.patch, NUTCH-1306-v2.patch, NUTCH-1306.patch Commit after finished writing to solr index - otherwise a bit confusing not seeing the number of docs we expect in solr
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406495#comment-13406495 ] Ferdy Galema commented on NUTCH-1360: - Sorry for the late response, but this issue is not properly implemented (for both branch and trunk). - IP is always stored instead of depending on property: headers.set(_ip,... should be done only if http.getIP_Header() is true. - http.store.ip.address appends the _ip:true or false property to the request string? What is the purpose of that? If not intentional, we should simply revert this. On top of that it uses the property with a default of true, but it should be false if the adding to request string is intentional. Thanks. Suport the storing of IP address connected to when web crawling --- Key: NUTCH-1360 URL: https://issues.apache.org/jira/browse/NUTCH-1360 Project: Nutch Issue Type: New Feature Components: protocol Affects Versions: nutchgora, 1.5 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: nutchgora, 1.6 Attachments: NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.
[jira] [Comment Edited] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406495#comment-13406495 ] Ferdy Galema edited comment on NUTCH-1360 at 7/4/12 1:41 PM: - Sorry for the late response, but this issue is not properly implemented (for both branch and trunk). - IP is always stored instead of depending on property: headers.set(_ip,... should be done only if http.getIP_Header() is true. - http.store.ip.address appends the _ip:true or false property to the request string? What is the purpose of that? If not intentional, we should simply revert this. On top of that it rereads the property with a default of true, but it should be false (or just use http.getIP_Header()) if the adding to request string is intentional. Thanks. was (Author: ferdy.g): Sorry for the late response, but this issue is not properly implemented (for both branch and trunk). - IP is always stored instead of depending on property: headers.set(_ip,... should be done only if http.getIP_Header() is true. - http.store.ip.address appends the _ip:true or false property to the request string? What is the purpose of that? If not intentional, we should simply revert this. On top of that it uses the property with a default of true, but it should be false if the adding to request string is intentional. Thanks. Suport the storing of IP address connected to when web crawling --- Key: NUTCH-1360 URL: https://issues.apache.org/jira/browse/NUTCH-1360 Project: Nutch Issue Type: New Feature Components: protocol Affects Versions: nutchgora, 1.5 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: nutchgora, 1.6 Attachments: NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406561#comment-13406561 ] Ferdy Galema commented on NUTCH-1360: - Just one more thing: Should the IP not be stored in the metadata instead of the headers field? It is technically not a response header. As far as I know currently the headers container is only used for the headers returned by the http server. (But correct me if I'm wrong) Suport the storing of IP address connected to when web crawling --- Key: NUTCH-1360 URL: https://issues.apache.org/jira/browse/NUTCH-1360 Project: Nutch Issue Type: New Feature Components: protocol Affects Versions: nutchgora, 1.5 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: nutchgora, 1.6 Attachments: NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.
[jira] [Updated] (NUTCH-1306) Add option to not commit and clarify existing solr.commit.size
[ https://issues.apache.org/jira/browse/NUTCH-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1306: Attachment: NUTCH-1306-trunk-v3.patch minor bug in prev. patch. uploaded v3 of trunk patch. Add option to not commit and clarify existing solr.commit.size -- Key: NUTCH-1306 URL: https://issues.apache.org/jira/browse/NUTCH-1306 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: nutchgora Reporter: Dan Rosher Priority: Trivial Fix For: 2.1 Attachments: NUTCH-1306-trunk-v2.patch, NUTCH-1306-trunk-v3.patch, NUTCH-1306-trunk.patch, NUTCH-1306-v2.patch, NUTCH-1306.patch Commit after finished writing to solr index - otherwise a bit confusing not seeing the number of docs we expect in solr
[jira] [Created] (NUTCH-1423) Remove unused fields in LanguageIndexingFilter
Ferdy Galema created NUTCH-1423: --- Summary: Remove unused fields in LanguageIndexingFilter Key: NUTCH-1423 URL: https://issues.apache.org/jira/browse/NUTCH-1423 Project: Nutch Issue Type: Bug Reporter: Ferdy Galema Priority: Trivial Fix For: 2.1 The LanguageIndexingFilter declares fields on the input that are not used. These fields must be removed.
[jira] [Updated] (NUTCH-1423) Remove unused fields in LanguageIndexingFilter
[ https://issues.apache.org/jira/browse/NUTCH-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1423: Attachment: NUTCH-1423.patch Remove unused fields in LanguageIndexingFilter -- Key: NUTCH-1423 URL: https://issues.apache.org/jira/browse/NUTCH-1423 Project: Nutch Issue Type: Bug Reporter: Ferdy Galema Priority: Trivial Fix For: 2.1 Attachments: NUTCH-1423.patch The LanguageIndexingFilter declares fields on the input that are not used. These fields must be removed.
[jira] [Updated] (NUTCH-1424) fix fetcher timelimit logging
[ https://issues.apache.org/jira/browse/NUTCH-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1424: Attachment: NUTCH-1424.patch fix fetcher timelimit logging -- Key: NUTCH-1424 URL: https://issues.apache.org/jira/browse/NUTCH-1424 Project: Nutch Issue Type: Bug Reporter: Ferdy Galema Priority: Trivial Fix For: 2.1 Attachments: NUTCH-1424.patch When fetching with timelimit, the log does not correctly reflect this. (Always shows -1).
[jira] [Closed] (NUTCH-1424) fix fetcher timelimit logging
[ https://issues.apache.org/jira/browse/NUTCH-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1424. --- Resolution: Fixed Committed. fix fetcher timelimit logging -- Key: NUTCH-1424 URL: https://issues.apache.org/jira/browse/NUTCH-1424 Project: Nutch Issue Type: Bug Reporter: Ferdy Galema Priority: Trivial Fix For: 2.1 Attachments: NUTCH-1424.patch When fetching with timelimit, the log does not correctly reflect this. (Always shows -1).
[jira] [Created] (NUTCH-1425) DbUpdaterJob declares PREV_SIGNATURE on input twice
Ferdy Galema created NUTCH-1425: --- Summary: DbUpdaterJob declares PREV_SIGNATURE on input twice Key: NUTCH-1425 URL: https://issues.apache.org/jira/browse/NUTCH-1425 Project: Nutch Issue Type: Bug Reporter: Ferdy Galema Priority: Trivial Fix For: 2.1 Attachments: NUTCH-1425.patch Although harmless, DbUpdaterJob should not declare input fields twice.
[jira] [Closed] (NUTCH-1425) DbUpdaterJob declares PREV_SIGNATURE on input twice
[ https://issues.apache.org/jira/browse/NUTCH-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1425. --- Resolution: Fixed Committed. DbUpdaterJob declares PREV_SIGNATURE on input twice --- Key: NUTCH-1425 URL: https://issues.apache.org/jira/browse/NUTCH-1425 Project: Nutch Issue Type: Bug Reporter: Ferdy Galema Priority: Trivial Fix For: 2.1 Attachments: NUTCH-1425.patch Although harmless, DbUpdaterJob should not declare input fields twice.