[Nutch-dev] [jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2007-06-28 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508820 ] Sami Siren commented on NUTCH-392: -- But why is parse_text_block's size so close to parse_text data of parse_text

[Nutch-dev] [jira] Commented: (NUTCH-499) Refactor LinkDb and LinkDbMerger to reuse code

2007-06-27 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508449 ] Sami Siren commented on NUTCH-499: -- +1, seems good to me Refactor LinkDb and LinkDbMerger to reuse code

[Nutch-dev] [jira] Commented: (NUTCH-434) Replace usage of ObjectWritable with something based on GenericWritable

2007-06-26 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508222 ] Sami Siren commented on NUTCH-434: -- You missed one ObjectWritable in Indexer (the one that hit my head too hard

[Nutch-dev] [jira] Commented: (NUTCH-434) Replace usage of ObjectWritable with something based on GenericWritable

2007-06-26 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508239 ] Sami Siren commented on NUTCH-434: -- Now there is a good chance that you knew all this :). If your point was that

[Nutch-dev] [jira] Updated: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.

2007-06-05 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-496: - Attachment: nutch-496.txt This patch changes LanguageIdentifier to have NGramProfile per thread instead

[Nutch-dev] [jira] Commented: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.

2007-06-04 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501266 ] Sami Siren commented on NUTCH-496: -- I believe the problem is even more severe. Now several threads share the

[Nutch-dev] [jira] Updated: (NUTCH-161) Change Plain text parser to use parser.character.encoding.default property for fall back encoding

2007-05-15 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-161: - Fix Version/s: 1.0.0 Assignee: Sami Siren Summary: Change Plain text parser to use

[Nutch-dev] [jira] Resolved: (NUTCH-161) Change Plain text parser to use parser.character.encoding.default property for fall back encoding

2007-05-15 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-161. -- Resolution: Fixed I just committed a fix for this, thanks KuroSaka! Change Plain text parser to use

[Nutch-dev] [jira] Resolved: (NUTCH-482) Remove redundant plugin lib-log4j

2007-05-14 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-482. -- Resolution: Fixed Fix Version/s: 1.0.0 Remove redundant plugin lib-log4j

[Nutch-dev] [jira] Resolved: (NUTCH-483) remove redundant commons-logging jar from ontology plugin

2007-05-14 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-483. -- Resolution: Fixed Fix Version/s: 1.0.0 Assignee: Sami Siren remove redundant

[Nutch-dev] [jira] Resolved: (NUTCH-457) Create top level dist directory and checkin KEYS file to subversion be standard with Lucene Java and Hadoop

2007-05-14 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-457. -- Resolution: Fixed Create top level dist directory and checkin KEYS file to subversion be standard

[Nutch-dev] [jira] Resolved: (NUTCH-484) Nutch Nightly API link is broken in site

2007-05-13 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-484. -- Resolution: Fixed committed and updated site, thanks Gal Nutch Nightly API link is broken in site

[Nutch-dev] [jira] Created: (NUTCH-482) Remove redundant plugin lib-log4j

2007-05-12 Thread Sami Siren (JIRA)
Remove redundant plugin lib-log4j - Key: NUTCH-482 URL: https://issues.apache.org/jira/browse/NUTCH-482 Project: Nutch Issue Type: Bug Affects Versions: 0.9.0 Reporter: Sami Siren

[Nutch-dev] [jira] Created: (NUTCH-483) remove redundant commons-logging jar from ontology plugin

2007-05-12 Thread Sami Siren (JIRA)
remove redundant commons-logging jar from ontology plugin - Key: NUTCH-483 URL: https://issues.apache.org/jira/browse/NUTCH-483 Project: Nutch Issue Type: Bug Affects Versions:

[Nutch-dev] [jira] Commented: (NUTCH-472) NullPointerException in ZipTextExtractor if no MIME type for zipped file

2007-05-11 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495229 ] Sami Siren commented on NUTCH-472: -- Not sure how to turn source code in description into a patch file, but the

[Nutch-dev] [jira] Resolved: (NUTCH-456) parse msexcel plugin speedup

2007-05-10 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-456. -- Resolution: Fixed committed with minor modifications (used StringBuilder instead of StringBuffer,

[Nutch-dev] [jira] Assigned: (NUTCH-446) RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt

2007-05-10 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren reassigned NUTCH-446: Assignee: Sami Siren RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt

[Nutch-dev] [jira] Resolved: (NUTCH-446) RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt

2007-05-10 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-446. -- Resolution: Fixed I just committed this, keep the patches coming Doğacan! RobotRulesParser should

[Nutch-dev] [jira] Updated: (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9

2007-05-09 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-469: - Attachment: NUTCH-469-2007-05-09.txt.gz Thanks for putting this together, I briefly checked through the

[Nutch-dev] [jira] Commented: (NUTCH-477) Extend URLFilters to support different filtering chains

2007-05-09 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494531 ] Sami Siren commented on NUTCH-477: -- I don't feel strongly about this but could enums be used instead of static

[Nutch-dev] [jira] Commented: (NUTCH-472) NullPointerException in ZipTextExtractor if no MIME type for zipped file

2007-05-09 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494534 ] Sami Siren commented on NUTCH-472: -- have a patch? NullPointerException in ZipTextExtractor if no MIME type for

[Nutch-dev] [jira] Commented: (NUTCH-476) Would like to add a field to the document class for its MD5 signature

2007-05-09 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494537 ] Sami Siren commented on NUTCH-476: -- md5 sum (or any other configurable digest) is already calculated in fetcher or

[Nutch-dev] [jira] Commented: (NUTCH-446) RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt

2007-05-01 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492850 ] Sami Siren commented on NUTCH-446: -- +1 RobotRulesParser should ignore Crawl-delay values of other bots in

[Nutch-dev] [jira] Commented: (NUTCH-471) Fix synchronization in NutchBean creation

2007-04-24 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491305 ] Sami Siren commented on NUTCH-471: -- Isn't the DCL declared to be broken? We could perhaps instead instantiate

[Nutch-dev] [jira] Resolved: (NUTCH-473) ExcelExtractor performance bad due to String concatenation

2007-04-24 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-473. -- Resolution: Duplicate duplicate of NUTCH-456 ExcelExtractor performance bad due to String

[Nutch-dev] [jira] Resolved: (NUTCH-432) JAVA_PLATFORM with spaces (i.e. Mac OS X-ppc-32) breaks bin/nutch script

2007-03-27 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-432. -- Resolution: Fixed Fix Version/s: 0.9.0 The problem above has been fixed by ab. JAVA_PLATFORM

[Nutch-dev] [jira] Reopened: (NUTCH-432) JAVA_PLATFORM with spaces (i.e. Mac OS X-ppc-32) breaks bin/nutch script

2007-03-11 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren reopened NUTCH-432: -- After this got applied there's this error printed on console when run on FC5: bin/nutch: line 152:

[Nutch-dev] [jira] Commented: (NUTCH-457) Create top level dist directory and checkin KEYS file to subversion be standard with Lucene Java and Hadoop

2007-03-08 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479419 ] Sami Siren commented on NUTCH-457: -- +1 Create top level dist directory and checkin KEYS file to subversion be

[Nutch-dev] [jira] Commented: (NUTCH-247) robot parser to restrict.

2007-02-20 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474466 ] Sami Siren commented on NUTCH-247: -- I am not seeing how this would grow into multiple sets of checking rules.

[Nutch-dev] [jira] Commented: (NUTCH-247) robot parser to restrict.

2007-02-19 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474257 ] Sami Siren commented on NUTCH-247: -- Setting even a bogus agent name is an insignificant effort compared to the

[Nutch-dev] [jira] Commented: (NUTCH-247) robot parser to restrict.

2007-02-19 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474269 ] Sami Siren commented on NUTCH-247: -- I am OK with the efforts making things more user friendly but still doing

[Nutch-dev] [jira] Commented: (NUTCH-247) robot parser to restrict.

2007-02-18 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473990 ] Sami Siren commented on NUTCH-247: -- Agent name has actually only relevance in http. IMO not setting agent name

[Nutch-dev] [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-14 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473148 ] Sami Siren commented on NUTCH-443: -- Didn't know this, will change this too. (Why is Nutch not using this class in

[Nutch-dev] [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2007-01-26 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467916 ] Sami Siren commented on NUTCH-258: -- I haven't noticed this being a problem for me, so no objections from here.

[Nutch-dev] [jira] Commented: (NUTCH-433) java.io.EOFException in newer nightlies in mergesegs or indexing from hadoop.io.DataOutputBuffer

2007-01-25 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467491 ] Sami Siren commented on NUTCH-433: -- ok, now it is committed, sorry. java.io.EOFException in newer nightlies in

[Nutch-dev] [jira] Assigned: (NUTCH-433) java.io.EOFException in newer nightlies in mergesegs or indexing from hadoop.io.DataOutputBuffer

2007-01-24 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren reassigned NUTCH-433: Assignee: Sami Siren java.io.EOFException in newer nightlies in mergesegs or indexing from

[Nutch-dev] [jira] Created: (NUTCH-434) Replace usage of ObjectWritable with something based on GenericWritable

2007-01-24 Thread Sami Siren (JIRA)
Replace usage of ObjectWritable with something based on GenericWritable --- Key: NUTCH-434 URL: https://issues.apache.org/jira/browse/NUTCH-434 Project: Nutch Issue Type:

[Nutch-dev] [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

2007-01-17 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465493 ] Sami Siren commented on NUTCH-61: - Haven't looked at the patch (tm). How would one manage segments after something like

[Nutch-dev] [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

2007-01-17 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465540 ] Sami Siren commented on NUTCH-61: - ok, so in my usual use case where there are far more urls than I can fetch this

[Nutch-dev] [jira] Resolved: (NUTCH-430) integer overflow in HashComparator.compare

2007-01-15 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-430. -- Resolution: Fixed Fix Version/s: 0.9.0 committed in revision 495732 with additional whitespace

[Nutch-dev] [jira] Created: (NUTCH-430) integer overflow in HashComparator.compare

2007-01-15 Thread Sami Siren (JIRA)
integer overflow in HashComparator.compare -- Key: NUTCH-430 URL: https://issues.apache.org/jira/browse/NUTCH-430 Project: Nutch Issue Type: Bug Components: generator Affects Versions:

[Nutch-dev] [jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

2007-01-12 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464347 ] Sami Siren commented on NUTCH-422: -- Is there a reason for the two jakarta-regexp jars (v 1.2 and 1.3) in source

[Nutch-dev] [jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

2007-01-12 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464351 ] Sami Siren commented on NUTCH-422: -- couple of more points: -source files use tabs for indentation -headers of files

[Nutch-dev] [jira] Resolved: (NUTCH-428) NullPointerException

2007-01-12 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-428. -- Resolution: Fixed Fix Version/s: 0.9.0 Most probably you don't have agent name configured in

[Nutch-dev] [jira] Commented: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

2007-01-08 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463059 ] Sami Siren commented on NUTCH-420: -- The feather 'Licensed for inclusion in ASF works' is missing from 2nd patch.

[Nutch-dev] [jira] Resolved: (NUTCH-325) UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes

2007-01-06 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-325. -- Resolution: Fixed just committed this with additional junit testcase. Thanks Stefan! UrlFilters.java

[Nutch-dev] [jira] Assigned: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

2007-01-06 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren reassigned NUTCH-422: Assignee: Sami Siren index-extra plugin creates additional fields in the index, based on

[Nutch-dev] [jira] Assigned: (NUTCH-421) Allow predeterminate running order of index filters

2007-01-06 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren reassigned NUTCH-421: Assignee: Sami Siren Allow predeterminate running order of index filters

[Nutch-dev] [jira] Resolved: (NUTCH-421) Allow predeterminate running order of index filters

2007-01-06 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-421. -- Resolution: Fixed Fix Version/s: 0.9.0 Thanks Alan, I just committed this with additionali

[Nutch-dev] [jira] Commented: (NUTCH-418) Fixes parsing of XHTML (e.g. title)

2006-12-21 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-418?page=comments#action_12460282 ] Sami Siren commented on NUTCH-418: -- We should perhaps include the rest of changes made in NUTCH-362. Fixes parsing of XHTML (e.g. title)

[Nutch-dev] [jira] Updated: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

2006-12-20 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-272?page=all ] Sami Siren updated NUTCH-272: - Thanks Doug, that makes more sense now. Running URLFilters.filter() during Generate seems very handy, albeit costly for large crawls. (Should have an option to turn

[Nutch-dev] [jira] Commented: (NUTCH-415) Generate should mark selected records in crawlDB

2006-12-15 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-415?page=comments#action_12458814 ] Sami Siren commented on NUTCH-415: -- Please also consider the performance implications. If this marking will add significant performance overhead then it would be

[Nutch-dev] [jira] Commented: (NUTCH-248) add support for internationalized domain names

2006-12-11 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-248?page=comments#action_12457437 ] Sami Siren commented on NUTCH-248: -- Seems like the latest Java has built-in support http://java.sun.com/javase/6/docs/api/java/net/IDN.html add support for

[Nutch-dev] [jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-11-28 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12453975 ] Sami Siren commented on NUTCH-339: -- perhaps that exception is just a consequence of something else, like this: 2006-11-27 07:35:09,434 INFO fetcher.Fetcher2 -

[Nutch-dev] [jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-11-28 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12454045 ] Sami Siren commented on NUTCH-339: -- I am running with 300 threads, and in parsing mode thread dump shows: 191 threads waiting on condition at

[Nutch-dev] [jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-11-27 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12453798 ] Sami Siren commented on NUTCH-339: -- When running a test fetch with Fetcher2 I encountered this error after fetching a few thousand pages (of 1 million segment):

[Nutch-dev] [jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-11-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12452522 ] Sami Siren commented on NUTCH-339: -- patch applies ok, but there's this error when I try to compile: compile: [echo] Compiling plugin: lib-http [javac]

[Nutch-dev] [jira] Commented: (NUTCH-251) Administration GUI

2006-11-23 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-251?page=comments#action_12452321 ] Sami Siren commented on NUTCH-251: -- Are you thinking of something like UI extension point like in contrib/web2 ? not necessarily, that was also a quick hack I

[Nutch-dev] [jira] Created: (NUTCH-404) Fix LinkDB Usage - implementation mismatch

2006-11-19 Thread Sami Siren (JIRA)
Fix LinkDB Usage - implementation mismatch -- Key: NUTCH-404 URL: http://issues.apache.org/jira/browse/NUTCH-404 Project: Nutch Issue Type: Bug Components: linkdb Reporter: Sami

[Nutch-dev] [jira] Resolved: (NUTCH-404) Fix LinkDB Usage - implementation mismatch

2006-11-19 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-404?page=all ] Sami Siren resolved NUTCH-404. -- Fix Version/s: 0.9.0 Resolution: Fixed fixed Fix LinkDB Usage - implementation mismatch -- Key:

[Nutch-dev] [jira] Resolved: (NUTCH-403) Make URL filtering optional in Generator

2006-11-19 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-403?page=all ] Sami Siren resolved NUTCH-403. -- Fix Version/s: 0.9.0 Resolution: Fixed Committed to trunk with change to name of conf parameter. Make URL filtering optional in Generator

[Nutch-dev] [jira] Created: (NUTCH-403) Make URL filtering optional in Generator

2006-11-18 Thread Sami Siren (JIRA)
Make URL filtering optional in Generator Key: NUTCH-403 URL: http://issues.apache.org/jira/browse/NUTCH-403 Project: Nutch Issue Type: Improvement Components: generator

[Nutch-dev] [jira] Updated: (NUTCH-403) Make URL filtering optional in Generator

2006-11-18 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-403?page=all ] Sami Siren updated NUTCH-403: - Attachment: nutch-generate-optional-filtering.patch Attached patch adds option -noFilter to crawl command (and additional parameter to java api) to control if

[Nutch-dev] [jira] Updated: (NUTCH-403) Make URL filtering optional in Generator

2006-11-18 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-403?page=all ] Sami Siren updated NUTCH-403: - The command that is altered is generate (Generator) not crawl. Make URL filtering optional in Generator Key:

[Nutch-dev] [jira] Resolved: (NUTCH-388) nutch-default.xml has outdated example for urlfilter.order

2006-11-18 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-388?page=all ] Sami Siren resolved NUTCH-388. -- Fix Version/s: 0.9.0 Resolution: Fixed This is now fixed (rev 476617). Thanks for reporting it! nutch-default.xml has outdated example for urlfilter.order

[Nutch-dev] [jira] Commented: (NUTCH-400) Update add missing license headers

2006-11-13 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-400?page=comments#action_12449440 ] Sami Siren commented on NUTCH-400: -- I updated headers and added missing headers to .java files in trunk. There are still plenty of (.xml, .jsp, html, properties)

[Nutch-dev] [jira] Resolved: (NUTCH-395) Increase fetching speed

2006-11-13 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-395?page=all ] Sami Siren resolved NUTCH-395. -- Fix Version/s: 0.9.0 Resolution: Fixed applied to trunk with some additional whitespace changes. Increase fetching speed ---

[Nutch-dev] [jira] Commented: (NUTCH-401) Hardcoded /tmp directory in SegmentReader

2006-11-13 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-401?page=comments#action_12449485 ] Sami Siren commented on NUTCH-401: -- Shouldn't this directory be configurable? I found it because of permission issues (/tmp isn't globally writable to catch stuff

[Nutch-dev] [jira] Updated: (NUTCH-395) Increase fetching speed

2006-11-12 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-395?page=all ] Sami Siren updated NUTCH-395: - Attachment: NUTCH-395-trunk-metadata-only-2.patch Additional change to Content cuts down time needed in effective fetching. Now seeing speeds like 45 pages/sec also

[Nutch-dev] [jira] Updated: (NUTCH-395) Increase fetching speed

2006-11-11 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-395?page=all ] Sami Siren updated NUTCH-395: - Affects Version/s: 0.9.0 Increase fetching speed --- Key: NUTCH-395 URL:

[Nutch-dev] [jira] Updated: (NUTCH-395) Increase fetching speed

2006-11-11 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-395?page=all ] Sami Siren updated NUTCH-395: - Attachment: NUTCH-395-trunk-metadata-only.patch Here's a first stab at svn trunk version of nutch that just optimizes the use of metadata and splits it into two

[Nutch-dev] [jira] Commented: (NUTCH-398) map-reduce very slow when crawling on single server

2006-11-11 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-398?page=comments#action_12448949 ] Sami Siren commented on NUTCH-398: -- Did anyone try to use single machine but not with local mode but with nutch acting like one node? Maybe this is workaround

[Nutch-dev] [jira] Created: (NUTCH-399) Change CommandRunner to use concurrent api from jdk

2006-11-11 Thread Sami Siren (JIRA)
Change CommandRunner to use concurrent api from jdk --- Key: NUTCH-399 URL: http://issues.apache.org/jira/browse/NUTCH-399 Project: Nutch Issue Type: Task Reporter: Sami Siren

[Nutch-dev] [jira] Resolved: (NUTCH-399) Change CommandRunner to use concurrent api from jdk

2006-11-11 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-399?page=all ] Sami Siren resolved NUTCH-399. -- Fix Version/s: 0.9.0 Resolution: Fixed Change CommandRunner to use concurrent api from jdk ---

[Nutch-dev] [jira] Created: (NUTCH-400) Update add missing license headers

2006-11-11 Thread Sami Siren (JIRA)
Update add missing license headers Key: NUTCH-400 URL: http://issues.apache.org/jira/browse/NUTCH-400 Project: Nutch Issue Type: Task Affects Versions: 0.8.2, 0.9.0 Reporter: Sami Siren

[Nutch-dev] [jira] Commented: (NUTCH-395) Increase fetching speed

2006-11-10 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12448795 ] Sami Siren commented on NUTCH-395: -- have you measured what made the biggest impact on performance - changes to Metadata, or changes to IO in FetcherOutput? did

[Nutch-dev] [jira] Commented: (NUTCH-395) Increase fetching speed

2006-10-31 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12445956 ] Sami Siren commented on NUTCH-395: -- have you measured what made the biggest impact on performance - changes to Metadata, or changes to IO in FetcherOutput? did

[Nutch-dev] [jira] Commented: (NUTCH-395) Increase fetching speed

2006-10-31 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12445999 ] Sami Siren commented on NUTCH-395: -- settings. I.e. if someone created a segment with high max # of outlinks, you should still be able to read it and process

[Nutch-dev] [jira] Created: (NUTCH-395) Increase fetching speed

2006-10-29 Thread Sami Siren (JIRA)
Increase fetching speed --- Key: NUTCH-395 URL: http://issues.apache.org/jira/browse/NUTCH-395 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 0.8.1 Reporter: Sami

[Nutch-dev] [jira] Updated: (NUTCH-379) ParseUtil does not pass through the content's URL to the ParserFactory

2006-10-13 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-379?page=all ] Sami Siren updated NUTCH-379: - Fix Version/s: (was: 0.8.1) (was: 0.8) cannot fix released versions ParseUtil does not pass through the content's URL to the

[Nutch-dev] [jira] Created: (NUTCH-376) Add methods to control runtime behaviour of NutchBean

2006-09-30 Thread Sami Siren (JIRA)
Add methods to control runtime behaviour of NutchBean - Key: NUTCH-376 URL: http://issues.apache.org/jira/browse/NUTCH-376 Project: Nutch Issue Type: Improvement Affects Versions:

[Nutch-dev] [jira] Created: (NUTCH-375) Link to 0.8.x apidocs broken on website

2006-09-28 Thread Sami Siren (JIRA)
Link to 0.8.x apidocs broken on website --- Key: NUTCH-375 URL: http://issues.apache.org/jira/browse/NUTCH-375 Project: Nutch Issue Type: Bug Components: documentation Reporter: Sami

[Nutch-dev] [jira] Resolved: (NUTCH-375) Link to 0.8.x apidocs broken on website

2006-09-28 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-375?page=all ] Sami Siren resolved NUTCH-375. -- Resolution: Fixed this was fixed by copying apidocs from 0.8.1 to /www/lucene.apache.org/nutch/apidocs-0.8.x/ as soon as next rsync occurs it should be fine,

[Nutch-dev] [jira] Commented: (NUTCH-351) Protocol forward proxy

2006-09-26 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-351?page=comments#action_12438013 ] Sami Siren commented on NUTCH-351: -- As the plugin name says it by using a protocol-forwardproxy acts as a protocol plugin and does not need additional protocol

[Nutch-dev] [jira] Closed: (NUTCH-266) hadoop bug when doing updatedb

2006-09-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-266?page=all ] Sami Siren closed NUTCH-266. hadoop bug when doing updatedb -- Key: NUTCH-266 URL: http://issues.apache.org/jira/browse/NUTCH-266

[Nutch-dev] [jira] Closed: (NUTCH-105) Network error during robots.txt fetch causes file to be ignored

2006-09-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-105?page=all ] Sami Siren closed NUTCH-105. Network error during robots.txt fetch causes file to be ignored --- Key: NUTCH-105

[Nutch-dev] [jira] Closed: (NUTCH-338) Remove the text parser as an option for parsing PDF files in parse-plugins.xml

2006-09-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-338?page=all ] Sami Siren closed NUTCH-338. Remove the text parser as an option for parsing PDF files in parse-plugins.xml --

[Nutch-dev] [jira] Closed: (NUTCH-344) Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks

2006-09-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-344?page=all ] Sami Siren closed NUTCH-344. Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks - Key:

[Nutch-dev] [jira] Closed: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-09-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-318?page=all ] Sami Siren closed NUTCH-318. log4j not proper configured, readdb doesnt give any information --- Key: NUTCH-318

[Nutch-dev] [jira] Closed: (NUTCH-370) Generator looses urls when run with LocalJobRunner

2006-09-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-370?page=all ] Sami Siren closed NUTCH-370. Resolution: Duplicate actually this is a duplicate of #361 Generator looses urls when run with LocalJobRunner --

[Nutch-dev] [jira] Updated: (NUTCH-370) Generator looses urls when run with LocalJobRunner

2006-09-22 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-370?page=all ] Sami Siren updated NUTCH-370: - Summary: Generator looses urls when run with LocalJobRunner (was: Generator loosed urls when run with LocalJobRunner) Generator looses urls when run with

[Nutch-dev] [jira] Resolved: (NUTCH-105) Network error during robots.txt fetch causes file to be ignored

2006-09-19 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-105?page=all ] Sami Siren resolved NUTCH-105. -- Resolution: Fixed This is now committed, thanks! Network error during robots.txt fetch causes file to be ignored

[Nutch-dev] [jira] Resolved: (NUTCH-367) DistributedSearch thown ClassCastException

2006-09-19 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-367?page=all ] Sami Siren resolved NUTCH-367. -- Fix Version/s: 0.9.0 Resolution: Fixed Assignee: Sami Siren I just committed a fix for this together with testcase, thanks for reporting it.

[Nutch-dev] [jira] Commented: (NUTCH-368) Message queueing system

2006-09-18 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-368?page=comments#action_1243 ] Sami Siren commented on NUTCH-368: -- IMO a place for stuff like this is in hadoop more than nutch and i would like to see this implemented there. Mainly because i

[Nutch-dev] [jira] Commented: (NUTCH-365) Flexible URL normalization

2006-09-15 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12435175 ] Sami Siren commented on NUTCH-365: -- looks ok to me, the ugly (with &amp;) regexps could perhaps be put inside <![CDATA[ ]]> elements in generator there's + try { +

[Nutch-dev] [jira] Created: (NUTCH-362) Remove parse-text from unsupported filetypes in parse-plugins.xml

2006-09-07 Thread Sami Siren (JIRA)
Remove parse-text from unsupported filetypes in parse-plugins.xml - Key: NUTCH-362 URL: http://issues.apache.org/jira/browse/NUTCH-362 Project: Nutch Issue Type: Bug

[Nutch-dev] [jira] Commented: (NUTCH-361) generator create fetchlist randomly

2006-09-07 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-361?page=comments#action_12433169 ] Sami Siren commented on NUTCH-361: -- The / by 0 was due to bug in testcase. Now the testcase fails about 50% of time. I also noticed that the number of reduce

[Nutch-dev] [jira] Commented: (NUTCH-208) http: proxy exception list:

2006-09-07 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-208?page=comments#action_12433175 ] Sami Siren commented on NUTCH-208: -- This looks like a good addition to Nutch, couple of comments: -The added comments in HttpResponse should be removed. -Any

[Nutch-dev] [jira] Commented: (NUTCH-273) When a page is redirected, the original url is NOT updated.

2006-09-07 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-273?page=comments#action_12433183 ] Sami Siren commented on NUTCH-273: -- +1 for not following redirects immediately - simplify fetcher logic. I would also like to see a flexible (configurable?)

[Nutch-dev] [jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-09-07 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12433185 ] Sami Siren commented on NUTCH-339: -- Andrzej, are you still working with this or should I proceed as I originally planned ;) Refactor nutch to allow fetcher
