[jira] Commented: (NUTCH-565) Arc File to Nutch Segments Converter

2007-10-10 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533787 ] Sami Siren commented on NUTCH-565: -- bq. Both jars are LGPL. I think that prohibits direct inclusion then. Take a

[jira] Commented: (NUTCH-565) Arc File to Nutch Segments Converter

2007-10-09 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533458 ] Sami Siren commented on NUTCH-565: -- What are the licenses for those jars? Arc File to Nutch Segments Converter

[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2007-06-28 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508820 ] Sami Siren commented on NUTCH-392: -- But why is parse_text_block's size so close to parse_text data of parse_text

[jira] Commented: (NUTCH-499) Refactor LinkDb and LinkDbMerger to reuse code

2007-06-27 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508449 ] Sami Siren commented on NUTCH-499: -- +1, seems good to me Refactor LinkDb and LinkDbMerger to reuse code

[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation

2007-06-27 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508508 ] Sami Siren commented on NUTCH-498: -- +1 Use Combiner in LinkDb to increase speed of linkdb generation

[jira] Commented: (NUTCH-434) Replace usage of ObjectWritable with something based on GenericWritable

2007-06-26 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508222 ] Sami Siren commented on NUTCH-434: -- You missed one ObjectWritable in Indexer (the one that hit my head too hard

[jira] Commented: (NUTCH-434) Replace usage of ObjectWritable with something based on GenericWritable

2007-06-26 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508239 ] Sami Siren commented on NUTCH-434: -- Now there is a good chance that you knew all this :). If your point was that

[jira] Updated: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.

2007-06-05 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-496: - Attachment: nutch-496.txt This patch changes LanguageIdentifier to have NGramProfile per thread instead

[jira] Commented: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.

2007-06-04 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501266 ] Sami Siren commented on NUTCH-496: -- I believe the problem is even more severe. Now several threads share the

[jira] Updated: (NUTCH-161) Change Plain text parser to use parser.character.encoding.default property for fall back encoding

2007-05-15 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-161: - Fix Version/s: 1.0.0 Assignee: Sami Siren Summary: Change Plain text parser to use

[jira] Resolved: (NUTCH-482) Remove redundant plugin lib-log4j

2007-05-14 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-482. -- Resolution: Fixed Fix Version/s: 1.0.0 Remove redundant plugin lib-log4j

[jira] Resolved: (NUTCH-483) remove redundant commons-logging jar from ontology plugin

2007-05-14 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-483. -- Resolution: Fixed Fix Version/s: 1.0.0 Assignee: Sami Siren remove redundant

[jira] Resolved: (NUTCH-484) Nutch Nightly API link is broken in site

2007-05-13 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-484. -- Resolution: Fixed committed and updated site, thanks Gal Nutch Nightly API link is broken in site

[jira] Created: (NUTCH-482) Remove redundant plugin lib-log4j

2007-05-12 Thread Sami Siren (JIRA)
Remove redundant plugin lib-log4j - Key: NUTCH-482 URL: https://issues.apache.org/jira/browse/NUTCH-482 Project: Nutch Issue Type: Bug Affects Versions: 0.9.0 Reporter: Sami Siren

[jira] Created: (NUTCH-483) remove redundant commons-logging jar from ontology plugin

2007-05-12 Thread Sami Siren (JIRA)
remove redundant commons-logging jar from ontology plugin - Key: NUTCH-483 URL: https://issues.apache.org/jira/browse/NUTCH-483 Project: Nutch Issue Type: Bug Affects Versions:

[jira] Resolved: (NUTCH-456) parse msexcel plugin speedup

2007-05-10 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-456. -- Resolution: Fixed committed with minor modifications (used StringBuilder instead of StringBuffer,

[jira] Resolved: (NUTCH-446) RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt

2007-05-10 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-446. -- Resolution: Fixed I just committed this, keep the patches coming Doğacan! RobotRulesParser should

[jira] Updated: (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9

2007-05-09 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-469: - Attachment: NUTCH-469-2007-05-09.txt.gz thanks for putting this together, I briefly checked through the

[jira] Updated: (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9

2007-05-09 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-469: - Fix Version/s: (was: 0.7.3) 1.0.0 changes to geoPosition plugin to make it work

[jira] Commented: (NUTCH-477) Extend URLFilters to support different filtering chains

2007-05-09 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494531 ] Sami Siren commented on NUTCH-477: -- I don't feel strongly about this but could enums be used instead of static

[jira] Commented: (NUTCH-472) NullPointerException in ZipTextExtractor if no MIME type for zipped file

2007-05-09 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494534 ] Sami Siren commented on NUTCH-472: -- have a patch? NullPointerException in ZipTextExtractor if no MIME type for

[jira] Commented: (NUTCH-476) Would like to add a field to the document class for its MD5 signature

2007-05-09 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494537 ] Sami Siren commented on NUTCH-476: -- md5 sum (or any other configurable digest) is already calculated in fetcher or

[jira] Commented: (NUTCH-446) RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt

2007-05-01 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492850 ] Sami Siren commented on NUTCH-446: -- +1 RobotRulesParser should ignore Crawl-delay values of other bots in

[jira] Commented: (NUTCH-471) Fix synchronization in NutchBean creation

2007-04-24 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491305 ] Sami Siren commented on NUTCH-471: -- Isn't the DCL declared to be broken? We could perhaps instead instantiate

[jira] Resolved: (NUTCH-473) ExcelExtractor performance bad due to String concatenation

2007-04-24 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-473. -- Resolution: Duplicate duplicate of NUTCH-456 ExcelExtractor performance bad due to String

[jira] Resolved: (NUTCH-432) JAVA_PLATFORM with spaces (i.e. Mac OS X-ppc-32) breaks bin/nutch script

2007-03-27 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-432. -- Resolution: Fixed Fix Version/s: 0.9.0 The problem above has been fixed by ab. JAVA_PLATFORM

[jira] Reopened: (NUTCH-432) JAVA_PLATFORM with spaces (i.e. Mac OS X-ppc-32) breaks bin/nutch script

2007-03-11 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren reopened NUTCH-432: -- After this got applied there's this error printed on console when run on FC5: bin/nutch: line 152:

[jira] Commented: (NUTCH-457) Create top level dist directory and checkin KEYS file to subversion be standard with Lucene Java and Hadoop

2007-03-08 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479419 ] Sami Siren commented on NUTCH-457: -- +1 Create top level dist directory and checkin KEYS file to subversion be

[jira] Resolved: (NUTCH-400) Update add missing license headers

2007-03-03 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-400. -- Resolution: Fixed Fix Version/s: (was: 0.8.2) I think this is pretty much done. Update

[jira] Commented: (NUTCH-247) robot parser to restrict.

2007-02-19 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474257 ] Sami Siren commented on NUTCH-247: -- Setting even a bogus agent name is an insignificant effort compared to the

[jira] Commented: (NUTCH-247) robot parser to restrict.

2007-02-18 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473990 ] Sami Siren commented on NUTCH-247: -- Agent name has actually only relevance in http. IMO not setting agent name

[jira] Commented: (NUTCH-434) Replace usage of ObjectWritable with something based on GenericWritable

2007-01-26 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467912 ] Sami Siren commented on NUTCH-434: -- It's only half way if we get the Configuration into our subclass, there's no

[jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2007-01-26 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467916 ] Sami Siren commented on NUTCH-258: -- I haven't noticed this being a problem for me, so no objections from here.

[jira] Commented: (NUTCH-434) Replace usage of ObjectWritable with something based on GenericWritable

2007-01-26 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467927 ] Sami Siren commented on NUTCH-434: -- I can see the light, overriding readFields is sufficient. Replace usage of

[jira] Commented: (NUTCH-433) java.io.EOFException in newer nightlies in mergesegs or indexing from hadoop.io.DataOutputBuffer

2007-01-25 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467491 ] Sami Siren commented on NUTCH-433: -- ok, now it is committed, sorry. java.io.EOFException in newer nightlies in

[jira] Commented: (NUTCH-433) java.io.EOFException in newer nightlies in mergesegs or indexing from hadoop.io.DataOutputBuffer

2007-01-24 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467101 ] Sami Siren commented on NUTCH-433: -- I am working on this and will probably submit a patch today.

[jira] Assigned: (NUTCH-433) java.io.EOFException in newer nightlies in mergesegs or indexing from hadoop.io.DataOutputBuffer

2007-01-24 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren reassigned NUTCH-433: Assignee: Sami Siren java.io.EOFException in newer nightlies in mergesegs or indexing from

[jira] Resolved: (NUTCH-433) java.io.EOFException in newer nightlies in mergesegs or indexing from hadoop.io.DataOutputBuffer

2007-01-24 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-433. -- Resolution: Fixed Fix Version/s: 0.9.0 I just committed a fix for this, however at least I am

[jira] Created: (NUTCH-434) Replace usage of ObjectWritable with something based on GenericWritable

2007-01-24 Thread Sami Siren (JIRA)
Replace usage of ObjectWritable with something based on GenericWritable --- Key: NUTCH-434 URL: https://issues.apache.org/jira/browse/NUTCH-434 Project: Nutch Issue Type:

[jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

2007-01-17 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465493 ] Sami Siren commented on NUTCH-61: - Haven't looked at the patch (tm) How would one manage segments after something linke

[jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

2007-01-17 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465540 ] Sami Siren commented on NUTCH-61: - ok, so in my usual use case where there are far more urls than I can fetch this

[jira] Resolved: (NUTCH-430) integer overflow in HashComparator.compare

2007-01-15 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-430. -- Resolution: Fixed Fix Version/s: 0.9.0 committed in revision 495732 with additional whitespace

[jira] Updated: (NUTCH-430) integer overflow in HashComparator.compare

2007-01-13 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-430: - Attachment: NUTCH-430.patch integer overflow in HashComparator.compare

[jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

2007-01-12 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464347 ] Sami Siren commented on NUTCH-422: -- Is there a reason for the two jakarta-regexp-jars (v 1.2 and 1.3) in source

[jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

2007-01-12 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464351 ] Sami Siren commented on NUTCH-422: -- couple of more points: -source files use tabs for indentation -headers of files

[jira] Resolved: (NUTCH-428) NullPointerException

2007-01-12 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-428. -- Resolution: Fixed Fix Version/s: 0.9.0 Most probably you don't have agent name configured in

[jira] Commented: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

2007-01-08 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463059 ] Sami Siren commented on NUTCH-420: -- The feather 'Licensed for inclusion in ASF works' is missing from 2nd patch.

[jira] Resolved: (NUTCH-325) UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes

2007-01-06 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-325. -- Resolution: Fixed just committed this with additional junit testcase. Thanks Stefan! UrlFilters.java

[jira] Assigned: (NUTCH-421) Allow predeterminate running order of index filters

2007-01-06 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren reassigned NUTCH-421: Assignee: Sami Siren Allow predeterminate running order of index filters

[jira] Assigned: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

2007-01-06 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren reassigned NUTCH-422: Assignee: Sami Siren index-extra plugin creates additional fields in the index, based on

[jira] Resolved: (NUTCH-421) Allow predeterminate running order of index filters

2007-01-06 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-421. -- Resolution: Fixed Fix Version/s: 0.9.0 Thanks Alan, I just committed this with additionali

[jira] Commented: (NUTCH-418) Fixes parsing of XHTML (e.g. title)

2006-12-21 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-418?page=comments#action_12460282 ] Sami Siren commented on NUTCH-418: -- We should perhaps include the rest of changes made in NUTCH-362. Fixes parsing of XHTML (e.g. title)

[jira] Commented: (NUTCH-415) Generate should mark selected records in crawlDB

2006-12-15 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-415?page=comments#action_12458814 ] Sami Siren commented on NUTCH-415: -- Please also consider the performance implications. If this marking will add significant performance overhead then it would be

[jira] Commented: (NUTCH-248) add support for internationalized domain names

2006-12-11 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-248?page=comments#action_12457437 ] Sami Siren commented on NUTCH-248: -- Seems like the latest java has built-in support http://java.sun.com/javase/6/docs/api/java/net/IDN.html add support for

[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-11-28 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12453975 ] Sami Siren commented on NUTCH-339: -- perhaps that exception is just a consequence of something other like this: 2006-11-27 07:35:09,434 INFO fetcher.Fetcher2 -

[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-11-28 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12454045 ] Sami Siren commented on NUTCH-339: -- I am running with 300 threads, and in parsing mode thread dump shows: 191 threads waiting on condition at

[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-11-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12452522 ] Sami Siren commented on NUTCH-339: -- patch applies ok, but there's this error when I try to compile: compile: [echo] Compiling plugin: lib-http [javac]

[jira] Commented: (NUTCH-251) Administration GUI

2006-11-23 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-251?page=comments#action_12452321 ] Sami Siren commented on NUTCH-251: -- Are you thinking of something like UI extension point like in contrib/web2 ? not necessarily, that was also a quick hack I

[jira] Resolved: (NUTCH-362) Remove parse-text from unsupported filetypes in parse-plugins.xml

2006-11-21 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-362?page=all ] Sami Siren resolved NUTCH-362. -- Resolution: Fixed Remove parse-text from unsupported filetypes in parse-plugins.xml -

[jira] Commented: (NUTCH-251) Administration GUI

2006-11-21 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-251?page=comments#action_12451527 ] Sami Siren commented on NUTCH-251: -- I am a strong supporter of XML. Can we not re-think about this like SOLR-58 or plain/jsp like the way hadoop does it? I

[jira] Created: (NUTCH-404) Fix LinkDB Usage - implementation mismatch

2006-11-19 Thread Sami Siren (JIRA)
Fix LinkDB Usage - implementation mismatch -- Key: NUTCH-404 URL: http://issues.apache.org/jira/browse/NUTCH-404 Project: Nutch Issue Type: Bug Components: linkdb Reporter: Sami

[jira] Resolved: (NUTCH-404) Fix LinkDB Usage - implementation mismatch

2006-11-19 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-404?page=all ] Sami Siren resolved NUTCH-404. -- Fix Version/s: 0.9.0 Resolution: Fixed fixed Fix LinkDB Usage - implementation mismatch -- Key:

[jira] Resolved: (NUTCH-403) Make URL filtering optional in Generator

2006-11-19 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-403?page=all ] Sami Siren resolved NUTCH-403. -- Fix Version/s: 0.9.0 Resolution: Fixed Committed to trunk with change to name of conf parameter. Make URL filtering optional in Generator

[jira] Updated: (NUTCH-403) Make URL filtering optional in Generator

2006-11-18 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-403?page=all ] Sami Siren updated NUTCH-403: - The command that is altered is generate (Generator) not crawl. Make URL filtering optional in Generator Key:

[jira] Resolved: (NUTCH-388) nutch-default.xml has outdated example for urlfilter.order

2006-11-18 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-388?page=all ] Sami Siren resolved NUTCH-388. -- Fix Version/s: 0.9.0 Resolution: Fixed This is now fixed (rev 476617). Thanks for reporting it! nutch-default.xml has outdated example for urlfilter.order

[jira] Resolved: (NUTCH-395) Increase fetching speed

2006-11-13 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-395?page=all ] Sami Siren resolved NUTCH-395. -- Fix Version/s: 0.9.0 Resolution: Fixed applied to trunk with some additional whitespace changes. Increase fetching speed ---

[jira] Updated: (NUTCH-395) Increase fetching speed

2006-11-12 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-395?page=all ] Sami Siren updated NUTCH-395: - Attachment: NUTCH-395-trunk-metadata-only-2.patch Additional change to Content cuts down time needed in effective fetching. Now seeing speeds like 45 pages/sec also

[jira] Updated: (NUTCH-395) Increase fetching speed

2006-11-11 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-395?page=all ] Sami Siren updated NUTCH-395: - Attachment: NUTCH-395-trunk-metadata-only.patch Here's a first stab at svn trunk version of nutch that just optimizes the use of metadata and splits it into two

[jira] Updated: (NUTCH-395) Increase fetching speed

2006-11-11 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-395?page=all ] Sami Siren updated NUTCH-395: - Affects Version/s: 0.9.0 Increase fetching speed --- Key: NUTCH-395 URL:

[jira] Commented: (NUTCH-398) map-reduce very slow when crawling on single server

2006-11-11 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-398?page=comments#action_12448949 ] Sami Siren commented on NUTCH-398: -- Did anyone try to use single machine but not with local mode but with nutch acting like one node? Maybe this is workaround

[jira] Created: (NUTCH-399) Change CommandRunner to use concurrent api from jdk

2006-11-11 Thread Sami Siren (JIRA)
Change CommandRunner to use concurrent api from jdk --- Key: NUTCH-399 URL: http://issues.apache.org/jira/browse/NUTCH-399 Project: Nutch Issue Type: Task Reporter: Sami Siren

[jira] Resolved: (NUTCH-399) Change CommandRunner to use concurrent api from jdk

2006-11-11 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-399?page=all ] Sami Siren resolved NUTCH-399. -- Fix Version/s: 0.9.0 Resolution: Fixed Change CommandRunner to use concurrent api from jdk ---

[jira] Created: (NUTCH-400) Update add missing license headers

2006-11-11 Thread Sami Siren (JIRA)
Update add missing license headers Key: NUTCH-400 URL: http://issues.apache.org/jira/browse/NUTCH-400 Project: Nutch Issue Type: Task Affects Versions: 0.8.2, 0.9.0 Reporter: Sami Siren

[jira] Updated: (NUTCH-400) Update add missing license headers

2006-11-11 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-400?page=all ] Sami Siren updated NUTCH-400: - Fix Version/s: 0.8.2 0.9.0 Update add missing license headers Key: NUTCH-400

[jira] Commented: (NUTCH-395) Increase fetching speed

2006-11-10 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12448795 ] Sami Siren commented on NUTCH-395: -- have you measured what made the biggest impact on performance - changes to Metadata, or changes to IO in FetcherOutput? did

[jira] Commented: (NUTCH-395) Increase fetching speed

2006-10-31 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12445956 ] Sami Siren commented on NUTCH-395: -- have you measured what made the biggest impact on performance - changes to Metadata, or changes to IO in FetcherOutput? did

[jira] Created: (NUTCH-395) Increase fetching speed

2006-10-29 Thread Sami Siren (JIRA)
Increase fetching speed --- Key: NUTCH-395 URL: http://issues.apache.org/jira/browse/NUTCH-395 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 0.8.1 Reporter: Sami

[jira] Updated: (NUTCH-395) Increase fetching speed

2006-10-29 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-395?page=all ] Sami Siren updated NUTCH-395: - Attachment: nutch-0.8-performance.txt a rough patch for testing purposes Increase fetching speed --- Key: NUTCH-395

[jira] Created: (NUTCH-391) ParseUtil logs file contents to log file when it cannot find parser

2006-10-24 Thread Sami Siren (JIRA)
ParseUtil logs file contents to log file when it cannot find parser --- Key: NUTCH-391 URL: http://issues.apache.org/jira/browse/NUTCH-391 Project: Nutch Issue Type: Bug

[jira] Resolved: (NUTCH-391) ParseUtil logs file contents to log file when it cannot find parser

2006-10-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-391?page=all ] Sami Siren resolved NUTCH-391. -- Resolution: Fixed ParseUtil logs file contents to log file when it cannot find parser ---

[jira] Resolved: (NUTCH-379) ParseUtil does not pass through the content's URL to the ParserFactory

2006-10-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-379?page=all ] Sami Siren resolved NUTCH-379. -- Resolution: Fixed Committed this to 0.8(.x) branch and trunk. Thanks Chris. ParseUtil does not pass through the content's URL to the ParserFactory

[jira] Closed: (NUTCH-52) Parser plugin for MS Excel files

2006-10-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-52?page=all ] Sami Siren closed NUTCH-52. --- Parser plugin for MS Excel files Key: NUTCH-52 URL: http://issues.apache.org/jira/browse/NUTCH-52

[jira] Closed: (NUTCH-53) Parser plugin for Zip files

2006-10-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-53?page=all ] Sami Siren closed NUTCH-53. --- Parser plugin for Zip files --- Key: NUTCH-53 URL: http://issues.apache.org/jira/browse/NUTCH-53

[jira] Closed: (NUTCH-81) Webapp only works when deployed in root

2006-10-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-81?page=all ] Sami Siren closed NUTCH-81. --- Webapp only works when deployed in root --- Key: NUTCH-81 URL:

[jira] Closed: (NUTCH-88) Enhance ParserFactory plugin selection policy

2006-10-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-88?page=all ] Sami Siren closed NUTCH-88. --- Enhance ParserFactory plugin selection policy - Key: NUTCH-88 URL:

[jira] Closed: (NUTCH-102) jobtracker does not start when webapps is in src

2006-10-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-102?page=all ] Sami Siren closed NUTCH-102. jobtracker does not start when webapps is in src Key: NUTCH-102 URL:

[jira] Closed: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2006-10-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ] Sami Siren closed NUTCH-110. OpenSearchServlet outputs illegal xml characters Key: NUTCH-110 URL:

[jira] Closed: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS

2006-10-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ] Sami Siren closed NUTCH-116. TestNDFS a JUnit test specifically for NDFS --- Key: NUTCH-116 URL:

[jira] Closed: (NUTCH-114) getting number of urls and links from crawldb

2006-10-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-114?page=all ] Sami Siren closed NUTCH-114. getting number of urls and links from crawldb - Key: NUTCH-114 URL:

[jira] Closed: (NUTCH-108) tasktracker crashs when reconnecting to a new jobtracker.

2006-10-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-108?page=all ] Sami Siren closed NUTCH-108. tasktracker crashs when reconnecting to a new jobtracker. - Key: NUTCH-108 URL:

[jira] Closed: (NUTCH-130) Be explicit about target JVM when building (1.4.x?)

2006-10-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-130?page=all ] Sami Siren closed NUTCH-130. Be explicit about target JVM when building (1.4.x?) --- Key: NUTCH-130 URL:

[jira] Closed: (NUTCH-131) Non-documented variable: mapred.child.heap.size

2006-10-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-131?page=all ] Sami Siren closed NUTCH-131. Non-documented variable: mapred.child.heap.size --- Key: NUTCH-131 URL:

[jira] Closed: (NUTCH-118) FAQ link points to invalid URL

2006-10-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-118?page=all ] Sami Siren closed NUTCH-118. FAQ link points to invalid URL -- Key: NUTCH-118 URL: http://issues.apache.org/jira/browse/NUTCH-118

[jira] Closed: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt

2006-10-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-124?page=all ] Sami Siren closed NUTCH-124. protocol-httpclient does not follow redirects when fetching robots.txt -- Key:

[jira] Closed: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)

2006-10-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-135?page=all ] Sami Siren closed NUTCH-135. http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)

[jira] Closed: (NUTCH-134) Summarizer doesn't select the best snippets

2006-10-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-134?page=all ] Sami Siren closed NUTCH-134. Summarizer doesn't select the best snippets --- Key: NUTCH-134 URL:

[jira] Closed: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-10-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Sami Siren closed NUTCH-139. Standard metadata property names in the ParseData metadata -- Key: NUTCH-139

[jira] Closed: (NUTCH-146) mapred.job.tracker.info.port is defined 2 times in the nutch-default.xml

2006-10-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-146?page=all ] Sami Siren closed NUTCH-146. mapred.job.tracker.info.port is defined 2 times in the nutch-default.xml Key:

[jira] Closed: (NUTCH-145) build of war file fails on Chinese (zh) .xml files due to UTF-8 BOM

2006-10-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-145?page=all ] Sami Siren closed NUTCH-145. build of war file fails on Chinese (zh) .xml files due to UTF-8 BOM --- Key: NUTCH-145

[jira] Closed: (NUTCH-166) secure jobtracker info pages with a password

2006-10-24 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-166?page=all ] Sami Siren closed NUTCH-166. secure jobtracker info pages with a password Key: NUTCH-166 URL:
