[Nutch-dev] [jira] Commented: (NUTCH-65) index-more plugin can't parse large set of modification-date

2005-07-04 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-65?page=comments#action_12315010 ] Lutischán Ferenc commented on NUTCH-65: --- Dear Developers, I have a final solution (I am behind a firewall, so I can't make a patch with svn). I suggest, please commit

[Nutch-dev] [jira] Updated: (NUTCH-76) NDFS DataNode advertises localhost as it's address

2005-07-24 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-76?page=all ] Peter Sandström updated NUTCH-76: - Attachment: ndfs-datanode-fix.patch fixes the problem by connecting to the NameNode and using the address that the local socket is bound to instead of calling
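The approach described in this snippet (connect to the remote side, then read the address the local socket ended up bound to) can be sketched roughly as follows. This is not the attached patch; the class and method names are hypothetical, and a loopback ServerSocket stands in for the NameNode so the example is self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical helper, not part of Nutch/NDFS.
final class LocalAddressProbe {

    // Connects to the remote endpoint and returns the local address the OS
    // chose for that connection, i.e. the address the peer can route back to.
    public static InetAddress localAddressFor(InetSocketAddress remote, int timeoutMillis)
            throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(remote, timeoutMillis);
            return socket.getLocalAddress();
        }
    }

    // Self-contained demo: a loopback listener plays the role of the NameNode.
    public static InetAddress probeLoopback() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            InetSocketAddress remote =
                    new InetSocketAddress(server.getInetAddress(), server.getLocalPort());
            return localAddressFor(remote, 1000);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("local address used: " + probeLoopback());
    }
}
```

Against a real remote host the returned address would be that of the outgoing interface rather than localhost, which is the property the issue title is concerned with.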

[Nutch-dev] [jira] Created: (NUTCH-76) NDFS DataNode advertises localhost as it's address

2005-07-24 Thread JIRA
NDFS DataNode advertises localhost as it's address -- Key: NUTCH-76 URL: http://issues.apache.org/jira/browse/NUTCH-76 Project: Nutch Type: Bug Environment: Linux Reporter: Peter Sandström Attachments: ndfs

[Nutch-dev] [jira] Created: (NUTCH-119) Regexp to extract outlinks incorrect

2005-10-20 Thread JIRA
Regexp to extract outlinks incorrect Key: NUTCH-119 URL: http://issues.apache.org/jira/browse/NUTCH-119 Project: Nutch Type: Bug Components: fetcher Versions: 0.7.1, 0.7.2-dev, 0.8-dev Reporter: Sébastien Le

[Nutch-dev] [jira] Updated: (NUTCH-119) Regexp to extract outlinks incorrect

2005-10-20 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-119?page=all ] Sébastien Le Callonnec updated NUTCH-119: - Attachment: TestPattern.java JUnit Test file recreating the issue. Regexp to extract outlinks incorrect

[Nutch-dev] [jira] Updated: (NUTCH-119) Regexp to extract outlinks incorrect

2005-10-20 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-119?page=all ] Sébastien Le Callonnec updated NUTCH-119: - Attachment: TestPattern.java Please ignore previous file, which was incorrect. Regexp to extract outlinks incorrect

[Nutch-dev] [jira] Created: (NUTCH-123) Cache.jsp some times generate NullPointerException

2005-11-04 Thread JIRA
Cache.jsp some times generate NullPointerException -- Key: NUTCH-123 URL: http://issues.apache.org/jira/browse/NUTCH-123 Project: Nutch Type: Bug Components: web gui Environment: All systems Reporter

[Nutch-dev] [jira] Commented: (NUTCH-133) ParserFactory does not work as expected

2005-12-07 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359564 ] Lutischán Ferenc commented on NUTCH-133: Dear Stephan, Please see http://issues.apache.org/jira/browse/NUTCH-123. The same problem also occurs in cached.jsp. Regards

[Nutch-dev] [jira] Created: (NUTCH-174) Problem encountered with ant during compilation

2006-01-14 Thread JIRA
Problem encountered with ant during compilation --- Key: NUTCH-174 URL: http://issues.apache.org/jira/browse/NUTCH-174 Project: Nutch Type: Bug Versions: 0.7.1 Environment: Suse LInux 9.3 Reporter: Matthias

[Nutch-dev] [jira] Created: (NUTCH-175) No input directories specified in: while crawing in nightly build from the 14.1.2006: sh ./nutch crawl urllist.txt -dir tmpdir

2006-01-14 Thread JIRA
://issues.apache.org/jira/browse/NUTCH-175 Project: Nutch Type: Bug Environment: SUSE Linux 9.3 Reporter: Matthias Günter Priority: Trivial [EMAIL PROTECTED]:~/workspace/lucene/nutch-nightly/bin sh ./nutch crawl urllist.txt -dir tmpdir 060114 205612 parsing file:/home

[Nutch-dev] [jira] Created: (NUTCH-176) Using -dir: creates an error, when the directory already exists

2006-01-15 Thread JIRA
Using -dir: creates an error, when the directory already exists --- Key: NUTCH-176 URL: http://issues.apache.org/jira/browse/NUTCH-176 Project: Nutch Type: Bug Versions: 0.7.1 Environment: SUSE

[Nutch-dev] [jira] Created: (NUTCH-177) Default installation seems to produce working entity of nutch

2006-01-15 Thread JIRA
Default installation seems to produce working entity of nutch - Key: NUTCH-177 URL: http://issues.apache.org/jira/browse/NUTCH-177 Project: Nutch Type: Bug Versions: 0.7.1 Environment: Linux SUSE

[Nutch-dev] [jira] Updated: (NUTCH-177) Default installation seems to produce working entity of nutch

2006-01-15 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-177?page=all ] Matthias Günter updated NUTCH-177: -- Attachment: crawl-urlfilter.txt The crawl-filter with a change for apache.org Default installation seems to produce working entity of nutch

[Nutch-dev] [jira] Updated: (NUTCH-177) Default installation seems to produce working entity of nutch

2006-01-15 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-177?page=all ] Matthias Günter updated NUTCH-177: -- Attachment: urllist.txt URL-List used.. Default installation seems to produce working entity of nutch

[Nutch-dev] [jira] Created: (NUTCH-208) http: proxy exception list:

2006-02-08 Thread JIRA
http: proxy exception list: Key: NUTCH-208 URL: http://issues.apache.org/jira/browse/NUTCH-208 Project: Nutch Type: New Feature Components: fetcher Versions: 0.8-dev Reporter: Matthias Günter Priority: Minor I

[Nutch-dev] [jira] Updated: (NUTCH-208) http: proxy exception list:

2006-02-08 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-208?page=all ] Matthias Günter updated NUTCH-208: -- Attachment: patch.txt A preliminary patch!! http: proxy exception list: --- Key: NUTCH-208 URL: http

[Nutch-dev] [jira] Updated: (NUTCH-208) http: proxy exception list:

2006-02-08 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-208?page=all ] Matthias Günter updated NUTCH-208: -- Attachment: patch.txt A preliminary patch!! http: proxy exception list: --- Key: NUTCH-208 URL: http

[Nutch-dev] [jira] Updated: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-09-08 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-339?page=all ] Doğacan Güney updated NUTCH-339: Attachment: patch3.txt Refactor nutch to allow fetcher improvements Key: NUTCH-339

[Nutch-dev] [jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-09-08 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12433354 ] Doğacan Güney commented on NUTCH-339: - I have made a few changes to Andrzej's latest patch. The biggest change is that BLOCKED_ADDR_QUEUE is now a priority

[Nutch-dev] [jira] Created: (NUTCH-397) porting clustering-carrot2 plugin to carrot2 v2.0

2006-11-07 Thread JIRA
porting clustering-carrot2 plugin to carrot2 v2.0 - Key: NUTCH-397 URL: http://issues.apache.org/jira/browse/NUTCH-397 Project: Nutch Issue Type: Improvement Reporter: Doğacan

[Nutch-dev] [jira] Updated: (NUTCH-397) porting clustering-carrot2 plugin to carrot2 v2.0

2006-11-07 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-397?page=all ] Doğacan Güney updated NUTCH-397: Attachment: clustering-carrot2-lib.tar.gz carrot2-nutch-plugin.patch clustering.patch porting clustering-carrot2 plugin

[Nutch-dev] [jira] Commented: (NUTCH-331) Fetcher incorrectly reports task progress to tasktracker resulting in skipped URLs

2006-11-23 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-331?page=comments#action_12452194 ] Doğacan Güney commented on NUTCH-331: - You obviously know about this a lot more than I do, but looking at fetcher code I can't see how this is possible

[Nutch-dev] [jira] Created: (NUTCH-406) Metadata tries to write null values

2006-11-23 Thread JIRA
Metadata tries to write null values --- Key: NUTCH-406 URL: http://issues.apache.org/jira/browse/NUTCH-406 Project: Nutch Issue Type: Bug Affects Versions: 0.9.0 Reporter: Doğacan Güney

[Nutch-dev] [jira] Updated: (NUTCH-406) Metadata tries to write null values

2006-11-23 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-406?page=all ] Doğacan Güney updated NUTCH-406: Attachment: NUTCH-406.patch A simple patch that writes nulls as empty strings. Metadata tries to write null values
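The idea of writing nulls as empty strings can be illustrated with a minimal sketch. It assumes values are serialized with java.io.DataOutput.writeUTF, which throws a NullPointerException when handed null; the class and method names below are hypothetical and are not taken from NUTCH-406.patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper, not the Nutch Metadata class.
final class NullSafeWrite {

    // Substitutes "" for null so that writeUTF never sees a null value.
    public static void writeValue(DataOutput out, String value) throws IOException {
        out.writeUTF(value == null ? "" : value);
    }

    // Writes one value to an in-memory buffer and reads it back.
    public static String roundTrip(String value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            writeValue(new DataOutputStream(buffer), value);
            DataInputStream in =
                    new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("[" + roundTrip(null) + "]");
    }
}
```

One consequence of this choice is that a reader can no longer tell a null value from an empty one, which may be why a different patch follows in the next message.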

[Nutch-dev] [jira] Updated: (NUTCH-406) Metadata tries to write null values

2006-11-23 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-406?page=all ] Doğacan Güney updated NUTCH-406: Attachment: NUTCH-406.patch How about something like this then? Metadata tries to write null values --- Key

[Nutch-dev] [jira] Commented: (NUTCH-92) DistributedSearch incorrectly scores results

2006-11-27 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-92?page=comments#action_12453682 ] Dogacan Güney commented on NUTCH-92: Here is my second attempt at this. Now DistributedSearch$Client keeps a mapping from addresses to numDocs, and in search

[Nutch-dev] [jira] Created: (NUTCH-411) Parse ignores meta refresh redirection

2006-11-30 Thread JIRA
Parse ignores meta refresh redirection -- Key: NUTCH-411 URL: http://issues.apache.org/jira/browse/NUTCH-411 Project: Nutch Issue Type: Bug Affects Versions: 0.9.0 Reporter: Dogacan Güney

[Nutch-dev] [jira] Commented: (NUTCH-411) Parse ignores meta refresh redirection

2006-11-30 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-411?page=comments#action_12454649 ] Dogacan Güney commented on NUTCH-411: - My not-necessarily-correct patch for this. We add the new url as a newly discovered url (so it gets initialScore), which

[Nutch-dev] [jira] Updated: (NUTCH-411) Parse ignores meta refresh redirection

2006-11-30 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-411?page=all ] Dogacan Güney updated NUTCH-411: Attachment: parse-redirect.patch Parse ignores meta refresh redirection -- Key: NUTCH-411

[Nutch-dev] [jira] Commented: (NUTCH-413) Fetcher ignores -noParsing command line option

2006-12-08 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-413?page=comments#action_12456832 ] Dogacan Güney commented on NUTCH-413: - Are you sure about this? Running the fetcher (latest trunk) with -noParsing option does not create any parse segments

[Nutch-dev] [jira] Commented: (NUTCH-413) Fetcher ignores -noParsing command line option

2006-12-08 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-413?page=comments#action_12456967 ] Dogacan Güney commented on NUTCH-413: - About command-line options: that is not what I meant (I am not a native speaker). I meant that I also set fetcher.parse

[Nutch-dev] [jira] Created: (NUTCH-417) After upgrade to hadoop-0.9.1, parsing and indexing doesn't work.

2006-12-15 Thread JIRA
After upgrade to hadoop-0.9.1, parsing and indexing doesn't work. - Key: NUTCH-417 URL: http://issues.apache.org/jira/browse/NUTCH-417 Project: Nutch Issue Type: Bug

[Nutch-dev] [jira] Commented: (NUTCH-417) After upgrade to hadoop-0.9.1, parsing and indexing doesn't work.

2006-12-15 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-417?page=comments#action_12458794 ] Dogacan Güney commented on NUTCH-417: - Patch for indexer. Instead of using the FileSystem coming from getRecordWriter, use FileSystem.get(job) to get the file

[Nutch-dev] [jira] Updated: (NUTCH-417) After upgrade to hadoop-0.9.1, parsing and indexing doesn't work.

2006-12-15 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-417?page=all ] Dogacan Güney updated NUTCH-417: Attachment: index.patch After upgrade to hadoop-0.9.1, parsing and indexing doesn't work

[Nutch-dev] [jira] Commented: (NUTCH-417) After upgrade to hadoop-0.9.1, parsing and indexing doesn't work.

2006-12-15 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-417?page=comments#action_12458811 ] Dogacan Güney commented on NUTCH-417: - Setting speculative execution to false also fixes my problem with parser. Thank you for the quick answer. I guess you

[Nutch-dev] [jira] Created: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

2006-12-26 Thread JIRA
DeleteDuplicates.HashPartitioner depends on the order of IndexDocs -- Key: NUTCH-420 URL: http://issues.apache.org/jira/browse/NUTCH-420 Project: Nutch Issue Type: Bug

[Nutch-dev] [jira] Updated: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

2006-12-26 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-420?page=all ] Dogacan Güney updated NUTCH-420: Attachment: dedup.patch Patch for the problem. This patch also slightly refactors the code. DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

[Nutch-dev] [jira] Updated: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

2007-01-04 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-420: Attachment: dedup-v2.patch DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

[Nutch-dev] [jira] Commented: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

2007-01-08 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463056 ] Dogacan Güney commented on NUTCH-420: - I thought I would attach an index which exhibits this bug. If you run

[Nutch-dev] [jira] Updated: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

2007-01-08 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-420: Attachment: index.tar.gz DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

[Nutch-dev] [jira] Commented: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

2007-01-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463214 ] Dogacan Güney commented on NUTCH-420: - Attaching the patch with a testcase (I hope that I got it right, but I am

[Nutch-dev] [jira] Updated: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

2007-01-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-420: Attachment: dedup-v3.patch DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

[Nutch-dev] [jira] Created: (NUTCH-438) Add -noAdditions to updatedb

2007-02-02 Thread JIRA
Add -noAdditions to updatedb Key: NUTCH-438 URL: https://issues.apache.org/jira/browse/NUTCH-438 Project: Nutch Issue Type: Improvement Affects Versions: 0.8.1, 0.8 Reporter: Nicolás Lichtmaier

[Nutch-dev] [jira] Updated: (NUTCH-438) Add -noAdditions to updatedb

2007-02-02 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolás Lichtmaier updated NUTCH-438: - Attachment: noAdditions-backport.diff I've backported revision 450799 to the 0.8.x branch

[Nutch-dev] [jira] Created: (NUTCH-440) Command line utilities should exit with an error message when given wrong arguments

2007-02-06 Thread JIRA
Command line utilities should exit with an error message when given wrong arguments --- Key: NUTCH-440 URL: https://issues.apache.org/jira/browse/NUTCH-440 Project

[Nutch-dev] [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-08 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471231 ] Dogacan Güney commented on NUTCH-443: - Here is a very initial patch. It is entirely untested and only changes

[Nutch-dev] [jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-08 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-443: Attachment: parse-map-core-untested.patch allow parsers to return multiple Parse object

[Nutch-dev] [jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-08 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-443: Attachment: parse-map-core-draft-v1.patch allow parsers to return multiple Parse object

[Nutch-dev] [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471620 ] Dogacan Güney commented on NUTCH-443: - This is pretty much the merge of our work (except parse-rss, it kept

[Nutch-dev] [jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-443: Attachment: NUTCH-443-draft-v1.patch allow parsers to return multiple Parse object

[Nutch-dev] [jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-443: Attachment: NUTCH-443-draft-v2.patch Small update to the patch. Now all core junit tests pass. Now

[Nutch-dev] [jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-443: Attachment: NUTCH-443-draft-v3.patch new patch, contains a possible fix for CrawlDbReducer problem

[Nutch-dev] [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471857 ] Dogacan Güney commented on NUTCH-443: - nutch.newbie: I fail to see what the problem is. If feedparser doesn't

[Nutch-dev] [jira] Updated: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-02-10 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-444: Attachment: parse-feed.tar.bz2 OK, here is my feedparsing plugin using rome. Note that this plugin

[Nutch-dev] [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-11 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472079 ] Dogacan Güney commented on NUTCH-443: - nutch.newbie, I will take a look at these issues, but parse-rss

[Nutch-dev] [jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-11 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-443: Attachment: NUTCH-443-draft-v5.patch New version. Now indexing also works but has a catch. Many

[Nutch-dev] [jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-11 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-443: Attachment: NUTCH-443-draft-v6.patch Oops... I forgot to merge Renaud Richardet's work

[Nutch-dev] [jira] Updated: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-02-11 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-444: Attachment: parse-feed-v2.tar.bz2 Updated parse-feed plugin. Still not ready for any serious use

[Nutch-dev] [jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-02-13 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472581 ] Doğacan Güney commented on NUTCH-444: - Hi nutch.newbie, Can you mail me a list of the failing atom urls

[Nutch-dev] [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-14 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473129 ] Doğacan Güney commented on NUTCH-443: - Andrzej: Thanks for taking the time to review this. The contract

[Nutch-dev] [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-14 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473147 ] Doğacan Güney commented on NUTCH-443: - Hmm, actually this is an important question. I don't think FetcherOutput

[Nutch-dev] [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-14 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473184 ] Doğacan Güney commented on NUTCH-443: - Andrzej: Why does fetcher need to synchronize? Why does the order fetcher

[Nutch-dev] [jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-14 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-443: Attachment: NUTCH-443-draft-v7.patch allow parsers to return multiple Parse object

[Nutch-dev] [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-15 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473383 ] Doğacan Güney commented on NUTCH-443: - Regarding the ObjectWritable: since in this case all data is composed

[Nutch-dev] [jira] Created: (NUTCH-446) RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt

2007-02-15 Thread JIRA
RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt - Key: NUTCH-446 URL: https://issues.apache.org/jira/browse/NUTCH-446 Project: Nutch
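The behavior the issue title asks for can be sketched as below: a Crawl-delay line counts only if it sits in a robots.txt group whose User-agent matches our own agent name, or `*` as a fallback. This is a simplified stand-alone illustration, not Nutch's RobotRulesParser; the class name, the substring matching rule, and the -1 "no delay" return value are all assumptions made for the example.

```java
import java.util.Locale;

// Hypothetical sketch, not Nutch's RobotRulesParser.
final class CrawlDelaySketch {

    // Returns the crawl delay in milliseconds that applies to agentName,
    // or -1 if no matching group declares one.
    public static long crawlDelayMillis(String robotsTxt, String agentName) {
        String agent = agentName.toLowerCase(Locale.ROOT);
        long specific = -1;   // delay from a group naming our agent
        long wildcard = -1;   // delay from a "User-agent: *" group
        boolean matchesAgent = false;
        boolean matchesWildcard = false;
        boolean inAgentLines = false; // true while reading consecutive User-agent lines

        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw;
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // strip comments
            }
            line = line.trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                if (!inAgentLines) {
                    // A new group starts: forget the matches of the previous one.
                    matchesAgent = false;
                    matchesWildcard = false;
                    inAgentLines = true;
                }
                String ua = value.toLowerCase(Locale.ROOT);
                if (ua.equals("*")) {
                    matchesWildcard = true;
                } else if (!ua.isEmpty() && agent.contains(ua)) {
                    matchesAgent = true;
                }
            } else {
                inAgentLines = false;
                if (field.equals("crawl-delay")) {
                    try {
                        long millis = (long) (Double.parseDouble(value) * 1000);
                        if (matchesAgent && specific < 0) {
                            specific = millis;
                        } else if (matchesWildcard && wildcard < 0) {
                            wildcard = millis;
                        }
                        // Delays in groups for other bots are deliberately ignored.
                    } catch (NumberFormatException e) {
                        // Malformed value: skip the line.
                    }
                }
            }
        }
        return specific >= 0 ? specific : wildcard;
    }

    public static void main(String[] args) {
        String robots = "User-agent: otherbot\nCrawl-delay: 100\n\nUser-agent: *\nDisallow: /private\n";
        System.out.println(crawlDelayMillis(robots, "nutch"));
    }
}
```

With the sample file in `main`, the 100-second delay belongs to `otherbot` only, so a crawler named `nutch` gets no delay from it.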

[Nutch-dev] [jira] Updated: (NUTCH-446) RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt

2007-02-15 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-446: Attachment: crawl-delay.patch RobotRulesParser should ignore Crawl-delay values of other bots

[Nutch-dev] [jira] Commented: (NUTCH-247) robot parser to restrict.

2007-02-16 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473885 ] Doğacan Güney commented on NUTCH-247: - +1 for this approach. Fetcher should check if agent-name is set

[Nutch-dev] [jira] Updated: (NUTCH-434) Replace usage of ObjectWritable with something based on GenericWritable

2007-02-24 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-434: Attachment: NUTCH-434.patch This patch adds two new classes: GenericWritableConfigurable which

[Nutch-dev] [jira] Commented: (NUTCH-445) Domain İndexing / Query Filter

2007-02-27 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476212 ] Doğacan Güney commented on NUTCH-445: - Has anyone looked at this? Google seems to do site: searches like this too

[Nutch-dev] [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-27 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476357 ] Doğacan Güney commented on NUTCH-443: - Hi Andrzej, * in my opinion it's easier to add missing CrawlDatum's

[Nutch-dev] [jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-28 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-443: Attachment: NUTCH-443.02282007.patch Hi everyone, Here is the updated patch. Andrzej, I believe

[Nutch-dev] [jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-28 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-443: Attachment: NUTCH-443.02282007-v2.patch Yet another patch. ParseResult.filter is out and Nutch

[Nutch-dev] [jira] Created: (NUTCH-460) RDF parser plugin

2007-03-16 Thread JIRA
RDF parser plugin - Key: NUTCH-460 URL: https://issues.apache.org/jira/browse/NUTCH-460 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 0.9.0 Reporter: Ricardo J. Méndez

[Nutch-dev] [jira] Updated: (NUTCH-460) RDF parser plugin

2007-03-16 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ricardo J. Méndez updated NUTCH-460: Attachment: rubyspider-rdf.zip Code for the aforementioned plugins, to be included under

[Nutch-dev] [jira] Commented: (NUTCH-460) RDF parser plugin

2007-03-21 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482793 ] Ricardo J. Méndez commented on NUTCH-460: - Two requirements I hadn't added explicitly: Apache Jena: http

[Nutch-dev] [jira] Updated: (NUTCH-438) Add -noAdditions to updatedb

2007-03-27 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolás Lichtmaier updated NUTCH-438: - Description: It would be great for me to have -noAdditions support (which is implemented

[Nutch-dev] [jira] Created: (NUTCH-468) Scoring filter should distribute score to all outlinks at once

2007-04-09 Thread JIRA
Scoring filter should distribute score to all outlinks at once -- Key: NUTCH-468 URL: https://issues.apache.org/jira/browse/NUTCH-468 Project: Nutch Issue Type: Improvement

[Nutch-dev] [jira] Updated: (NUTCH-468) Scoring filter should distribute score to all outlinks at once

2007-04-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-468: Attachment: scoring.patch Patch for the issue. It doesn't change the way scoring-opic works

[Nutch-dev] [jira] Commented: (NUTCH-468) Scoring filter should distribute score to all outlinks at once

2007-04-23 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491051 ] Nicolás Lichtmaier commented on NUTCH-468: -- This patch would be useful to me. Just one very minor thing

[Nutch-dev] [jira] Updated: (NUTCH-468) Scoring filter should distribute score to all outlinks at once

2007-04-23 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-468: Attachment: scoring-v2.patch That makes sense, patch with the suggested change. Scoring filter

[Nutch-dev] [jira] Created: (NUTCH-474) Fetcher2 sets server-delay and blocking checks incorrectly

2007-04-24 Thread JIRA
Fetcher2 sets server-delay and blocking checks incorrectly -- Key: NUTCH-474 URL: https://issues.apache.org/jira/browse/NUTCH-474 Project: Nutch Issue Type: Bug Components

[Nutch-dev] [jira] Updated: (NUTCH-474) Fetcher2 sets server-delay and blocking checks incorrectly

2007-04-24 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-474: Attachment: fetcher2.patch Fetcher2 sets server-delay and blocking checks incorrectly

[Nutch-dev] [jira] Created: (NUTCH-475) Adaptive crawl delay

2007-04-25 Thread JIRA
Adaptive crawl delay Key: NUTCH-475 URL: https://issues.apache.org/jira/browse/NUTCH-475 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: Doğacan Güney Fix

[Nutch-dev] [jira] Updated: (NUTCH-475) Adaptive crawl delay

2007-04-25 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-475: Attachment: adaptive-delay_draft.patch Patch with a simple adaptive algorithm. It measures the last

[Nutch-dev] [jira] Updated: (NUTCH-446) RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt

2007-05-01 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-446: Attachment: crawl-delay_test.patch Test case for crawl delay rules. Nutch fails the test case

[Nutch-dev] [jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-05-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-443: Attachment: NUTCH-443.08052007.patch Patch updated to latest trunk. allow parsers to return

[Nutch-dev] [jira] Commented: (NUTCH-470) Adding optional terms to a query

2007-05-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494496 ] Ronny Næss commented on NUTCH-470: -- Hi, Trond. Optional meaning does that mean? I would like more Lucene based

[Nutch-dev] [jira] Updated: (NUTCH-479) Support for OR queries

2007-05-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolás Lichtmaier updated NUTCH-479: - This patch doesn't seem to add support for nested clauses like this: greenhouse effect

[Nutch-dev] [jira] Commented: (NUTCH-446) RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt

2007-05-10 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494734 ] Doğacan Güney commented on NUTCH-446: - So, does anyone have objections to this? It fixes an annoying (albeit rare

[Nutch-dev] [jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-05-11 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494987 ] Doğacan Güney commented on NUTCH-444: - Hi Chris, Well I must say, with all the discussion that's gone on w.r.t

[Nutch-dev] [jira] Commented: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-05-13 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495350 ] Doğacan Güney commented on NUTCH-485: - You probably should not add put(String/Text key, Parse parse) methods

[Nutch-dev] [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-05-13 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495357 ] Doğacan Güney commented on NUTCH-443: - Well... That's embarrassing. It seems I forgot to include the necessary

[Nutch-dev] [jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-05-13 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-443: Attachment: redirect_and_index.patch Patch for the problem. Now, if Fetcher gets a null content

[Nutch-dev] [jira] Updated: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-05-13 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-444: Attachment: NUTCH-444.patch feed.tar.bz2 First version of feed plugin featuring

[Nutch-dev] [jira] Commented: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-05-13 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495410 ] Doğacan Güney commented on NUTCH-485: - I have two more minor nits: 1) ParseResult.isSuccess returns true only

[Nutch-dev] [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-05-14 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495696 ] Doğacan Güney commented on NUTCH-443: - I am not sure I follow you Andrzej. My patch already does a very similar

[Nutch-dev] [jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-05-14 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-443: Attachment: redirect_and_index_v2.patch New version. Moves parsing code into (content != null

[Nutch-dev] [jira] Updated: (NUTCH-25) needs 'character encoding' detector

2007-05-21 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-25: --- Attachment: NUTCH-25_draft.patch Well, something like this should work... + Adds a new configurable

[Nutch-dev] [jira] Commented: (NUTCH-489) URLFilter-suffix management of the url path when the url contains some query parameters

2007-05-22 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497770 ] Doğacan Güney commented on NUTCH-489: - This is obviously useful but: * Your patches both in this issue

[Nutch-dev] [jira] Commented: (NUTCH-489) URLFilter-suffix management of the url path when the url contains some query parameters

2007-05-23 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498113 ] Doğacan Güney commented on NUTCH-489: - Hmm.. Won't it now cause Nutch to filter on path on a line like
