[jira] Updated: (NUTCH-563) Include custom fields in BasicQueryFilter

2007-10-01 Thread julien nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] julien nioche updated NUTCH-563: Attachment: diff.BasicQueryFilter.dynamicFields.txt Include custom fields in BasicQueryFilter

[jira] Created: (NUTCH-563) Include custom fields in BasicQueryFilter

2007-10-01 Thread julien nioche (JIRA)
Include custom fields in BasicQueryFilter - Key: NUTCH-563 URL: https://issues.apache.org/jira/browse/NUTCH-563 Project: Nutch Issue Type: New Feature Components: searcher

[jira] Commented: (NUTCH-563) Include custom fields in BasicQueryFilter

2007-10-03 Thread julien nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532034 ] julien nioche commented on NUTCH-563: - As I explained in my message to the dev-list, having a separate plugin for

[jira] Created: (NUTCH-655) Injecting Crawl metadata

2008-10-01 Thread julien nioche (JIRA)
Injecting Crawl metadata Key: NUTCH-655 URL: https://issues.apache.org/jira/browse/NUTCH-655 Project: Nutch Issue Type: Improvement Components: injector Reporter: julien nioche

[jira] Created: (NUTCH-656) DeleteDuplicates based on crawlDB only

2008-10-09 Thread julien nioche (JIRA)
DeleteDuplicates based on crawlDB only --- Key: NUTCH-656 URL: https://issues.apache.org/jira/browse/NUTCH-656 Project: Nutch Issue Type: Wish Components: indexer Reporter: julien

[jira] Reopened: (NUTCH-656) DeleteDuplicates based on crawlDB only

2008-10-09 Thread julien nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] julien nioche reopened NUTCH-656: - I suppose that the SOLR dedup mechanism is valid on a single instance. If the documents are

[jira] Commented: (NUTCH-658) Add Counter for # of doc fetched in Reporter

2008-11-27 Thread julien nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651406#action_12651406 ] julien nioche commented on NUTCH-658: - Hi Dogacan, I am off work for several weeks and

[jira] Updated: (NUTCH-658) Add Counter for # of doc fetched in Reporter

2008-12-08 Thread julien nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] julien nioche updated NUTCH-658: Attachment: ReporterCounter.patch Hi, I eventually managed to make the change. The new patch

[jira] Created: (NUTCH-678) Hadoop 0.19 requires an update of jets3t

2009-01-14 Thread julien nioche (JIRA)
Hadoop 0.19 requires an update of jets3t Key: NUTCH-678 URL: https://issues.apache.org/jira/browse/NUTCH-678 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: julien

[jira] Created: (NUTCH-679) Fetcher2 implementing Tool

2009-01-15 Thread julien nioche (JIRA)
Fetcher2 implementing Tool -- Key: NUTCH-679 URL: https://issues.apache.org/jira/browse/NUTCH-679 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: julien nioche

[jira] Updated: (NUTCH-679) Fetcher2 implementing Tool

2009-01-15 Thread julien nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] julien nioche updated NUTCH-679: Attachment: Fetcher2.Tool.patch Patch which makes Fetcher2 implement Tool interface Fetcher2

[jira] Commented: (NUTCH-678) Hadoop 0.19 requires an update of jets3t

2009-01-19 Thread julien nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665126#action_12665126 ] julien nioche commented on NUTCH-678: - I confirm. Upgrading to 0.6.1 fixed the problem

[jira] Commented: (NUTCH-679) Fetcher2 implementing Tool

2009-01-21 Thread julien nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665791#action_12665791 ] julien nioche commented on NUTCH-679: - I can send a modified version of it once Todd has

[jira] Created: (NUTCH-682) SOLR indexer does not set boost on the document

2009-01-29 Thread julien nioche (JIRA)
SOLR indexer does not set boost on the document --- Key: NUTCH-682 URL: https://issues.apache.org/jira/browse/NUTCH-682 Project: Nutch Issue Type: Bug Components: injector Affects

[jira] Closed: (NUTCH-656) DeleteDuplicates based on crawlDB only

2009-02-03 Thread julien nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] julien nioche closed NUTCH-656. --- Resolution: Duplicate DeleteDuplicates based on crawlDB only

[jira] Updated: (NUTCH-563) Include custom fields in BasicQueryFilter

2009-02-10 Thread julien nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] julien nioche updated NUTCH-563: Attachment: NUTCH-563.patch Updated the original patch + added class level javadoc comment +

[jira] Commented: (NUTCH-668) Domain URL Filter

2009-02-12 Thread julien nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672994#action_12672994 ] julien nioche commented on NUTCH-668: - at line 173 - shouldn't we return 'url' instead

[jira] Created: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2009-02-18 Thread julien nioche (JIRA)
AlreadyBeingCreatedException with Hadoop 0.19 - Key: NUTCH-692 URL: https://issues.apache.org/jira/browse/NUTCH-692 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter:

[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2009-02-18 Thread julien nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674607#action_12674607 ] julien nioche commented on NUTCH-692: - I have seen this only in multinode setup and on

[jira] Created: (NUTCH-696) Timeout for Parser

2009-02-19 Thread julien nioche (JIRA)
Timeout for Parser -- Key: NUTCH-696 URL: https://issues.apache.org/jira/browse/NUTCH-696 Project: Nutch Issue Type: Wish Components: fetcher Reporter: julien nioche Priority: Minor I

[jira] Created: (NUTCH-700) Neko1.9.11 goes into a loop

2009-02-20 Thread julien nioche (JIRA)
Neko1.9.11 goes into a loop --- Key: NUTCH-700 URL: https://issues.apache.org/jira/browse/NUTCH-700 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: julien nioche Priority:

[jira] Commented: (NUTCH-700) Neko1.9.11 goes into a loop

2009-02-20 Thread julien nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675335#action_12675335 ] julien nioche commented on NUTCH-700: - Reported to CyberNeko

[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2009-02-20 Thread julien nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675518#action_12675518 ] julien nioche commented on NUTCH-692: - I have been investigating this a bit more. Same

[jira] Created: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum

2009-02-25 Thread julien nioche (JIRA)
Lazy Instanciation of Metadata in CrawlDatum Key: NUTCH-702 URL: https://issues.apache.org/jira/browse/NUTCH-702 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0

[jira] Updated: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum

2009-02-25 Thread julien nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] julien nioche updated NUTCH-702: Attachment: lazyMetadataInstanciation.patch patch for lazy instanciation of metadata in crawldatum

[jira] Commented: (NUTCH-696) Timeout for Parser

2009-02-25 Thread julien nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676658#action_12676658 ] julien nioche commented on NUTCH-696: - I was thinking along the lines of your first

[jira] Updated: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum

2009-02-25 Thread julien nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] julien nioche updated NUTCH-702: Attachment: (was: lazyMetadataInstanciation.patch) Lazy Instanciation of Metadata in

[jira] Updated: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum

2009-02-25 Thread julien nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] julien nioche updated NUTCH-702: Attachment: NUTCH-702.patch patch for lazy instanciation of metadata in crawldatum (replaces

[jira] Commented: (NUTCH-709) JSParseFilter gets into an infinate loop and ets all the stack

2009-03-03 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678314#action_12678314 ] Julien Nioche commented on NUTCH-709: - do you know the URL of the document causing this

[jira] Updated: (NUTCH-709) JSParseFilter gets into an infinate loop and ets all the stack

2009-03-03 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-709: Attachment: JSParseFilter.error.patch This patch catches errors in the walk method of JSParser and

[jira] Commented: (NUTCH-709) JSParseFilter gets into an infinate loop and ets all the stack

2009-03-03 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678325#action_12678325 ] Julien Nioche commented on NUTCH-709: - the patch above does not fix the issue but

[jira] Commented: (NUTCH-709) JSParseFilter gets into an infinate loop and ets all the stack

2009-03-06 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12679549#action_12679549 ] Julien Nioche commented on NUTCH-709: - Hi Tim, did you have a look at the logs to see

[jira] Created: (NUTCH-712) ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers

2009-03-06 Thread Julien Nioche (JIRA)
ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers - Key: NUTCH-712 URL: https://issues.apache.org/jira/browse/NUTCH-712 Project:

[jira] Updated: (NUTCH-712) ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers

2009-03-06 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-712: Attachment: ParseOutputFormat-NUTCH712.patch ParseOutputFormat should catch

[jira] Updated: (NUTCH-712) ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers

2009-03-06 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-712: Attachment: (was: ParseOutputFormat-NUTCH712.patch) ParseOutputFormat should catch

[jira] Updated: (NUTCH-712) ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers

2009-03-06 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-712: Attachment: ParseOutputFormat-NUTCH712v2.patch Modified version of the patch : if normalizers

[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2009-04-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694942#action_12694942 ] Julien Nioche commented on NUTCH-692: - As I pointed out in my previous message the root

[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2009-04-03 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695346#action_12695346 ] Julien Nioche commented on NUTCH-692: - setting mapred.task.timeout to a small value

[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-04-03 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695394#action_12695394 ] Julien Nioche commented on NUTCH-721: - The message about the Aborted hung threads looks

[jira] Updated: (NUTCH-731) Redirection of robots.txt in RobotRulesParser

2009-04-03 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-731: Attachment: NUTCH-731.patch Redirection of robots.txt in RobotRulesParser

[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2009-04-08 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696958#action_12696958 ] Julien Nioche commented on NUTCH-692: - I haven't had the time to try it on the SVN

[jira] Commented: (NUTCH-477) Extend URLFilters to support different filtering chains

2009-04-24 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702326#action_12702326 ] Julien Nioche commented on NUTCH-477: - Having a scope for the URL filters could be

[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2009-04-24 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702412#action_12702412 ] Julien Nioche commented on NUTCH-692: - OK I had the same problem again on my main

[jira] Commented: (NUTCH-731) Redirection of robots.txt in RobotRulesParser

2009-05-26 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712907#action_12712907 ] Julien Nioche commented on NUTCH-731: - I don't have a specific example now, in all the

[jira] Updated: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum

2009-05-27 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-702: Attachment: NUTCH-702.patch.v2 Fixed bug reported by Dmitry Lihachev Lazy Instanciation of

[jira] Commented: (NUTCH-731) Redirection of robots.txt in RobotRulesParser

2009-06-20 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722214#action_12722214 ] Julien Nioche commented on NUTCH-731: - Here is an example which the patch helps

[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-08-09 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741082#action_12741082 ] Julien Nioche commented on NUTCH-721: - I had another look at this issue after applying

[jira] Updated: (NUTCH-721) Fetcher2 Slow

2009-08-10 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-721: Attachment: NUTCH-721.patch Sets the default value for fetcher.threads.per.host.by.ip to false

[jira] Updated: (NUTCH-679) Fetcher2 implementing Tool

2009-08-13 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-679: Attachment: NUTCH-679.patch Updated version of the patch Fetcher2 implementing Tool

[jira] Closed: (NUTCH-696) Timeout for Parser

2009-08-28 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-696. --- Resolution: Later Timeout for Parser -- Key: NUTCH-696

[jira] Commented: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum

2009-08-28 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748840#action_12748840 ] Julien Nioche commented on NUTCH-702: - There have been quite a few related questions on

[jira] Commented: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum

2009-08-28 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748841#action_12748841 ] Julien Nioche commented on NUTCH-702: - of course it was meant to be stats: original :

[jira] Created: (NUTCH-751) Upgrade version of HttpClient

2009-09-04 Thread Julien Nioche (JIRA)
Upgrade version of HttpClient -- Key: NUTCH-751 URL: https://issues.apache.org/jira/browse/NUTCH-751 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: Julien Nioche The

[jira] Created: (NUTCH-753) Prevent new Fetcher to retrieve the robots twice

2009-09-07 Thread Julien Nioche (JIRA)
Prevent new Fetcher to retrieve the robots twice Key: NUTCH-753 URL: https://issues.apache.org/jira/browse/NUTCH-753 Project: Nutch Issue Type: Improvement Components: fetcher

[jira] Updated: (NUTCH-753) Prevent new Fetcher to retrieve the robots twice

2009-09-07 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-753: Attachment: NUTCH-753.patch Patch which prevents fetching the robots file twice with the new

[jira] Commented: (NUTCH-751) Upgrade version of HttpClient

2009-09-09 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753175#action_12753175 ] Julien Nioche commented on NUTCH-751: - Thanks for the pointer Ken, what will be very

[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2009-09-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755956#action_12755956 ] Julien Nioche commented on NUTCH-692: - I've been using this patch for a while now and

[jira] Created: (NUTCH-754) Use GenericOptionsParser instead of FileSystem.parseArgs()

2009-09-16 Thread Julien Nioche (JIRA)
Use GenericOptionsParser instead of FileSystem.parseArgs() -- Key: NUTCH-754 URL: https://issues.apache.org/jira/browse/NUTCH-754 Project: Nutch Issue Type: Improvement

[jira] Created: (NUTCH-756) CrawlDatum.set() does not resets Metadata if it is null

2009-09-29 Thread Julien Nioche (JIRA)
CrawlDatum.set() does not resets Metadata if it is null --- Key: NUTCH-756 URL: https://issues.apache.org/jira/browse/NUTCH-756 Project: Nutch Issue Type: Bug Reporter: Julien

[jira] Updated: (NUTCH-756) CrawlDatum.set() does not resets Metadata if it is null

2009-09-29 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-756: Attachment: NUTCH-756.patch Fixes issue with metadata not being properly overridden for CrawlDatum

[jira] Updated: (NUTCH-756) CrawlDatum.set() does not reset Metadata if it is null

2009-09-29 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-756: Summary: CrawlDatum.set() does not reset Metadata if it is null (was: CrawlDatum.set() does not

[jira] Created: (NUTCH-761) Avoid cloningCrawlDatum in CrawlDbReducer

2009-11-03 Thread Julien Nioche (JIRA)
Avoid cloningCrawlDatum in CrawlDbReducer -- Key: NUTCH-761 URL: https://issues.apache.org/jira/browse/NUTCH-761 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche

[jira] Updated: (NUTCH-761) Avoid cloningCrawlDatum in CrawlDbReducer

2009-11-03 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-761: Attachment: optiCrawlReducer.patch Avoid cloningCrawlDatum in CrawlDbReducer

[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2009-11-03 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-762: Attachment: NUTCH-762-MultiGenerator.patch Patch for the MultiGenerator Alternative Generator

[jira] Created: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2009-11-03 Thread Julien Nioche (JIRA)
Alternative Generator which can generate several segments in one parse of the crawlDB - Key: NUTCH-762 URL: https://issues.apache.org/jira/browse/NUTCH-762 Project:

[jira] Updated: (NUTCH-767) Update version of Tika for the MimeType detection

2009-11-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-767: Attachment: NUTCH-767.patch Update version of Tika for the MimeType detection

[jira] Created: (NUTCH-769) Fetcher to skip queues for URLS getting repeated exceptions

2009-11-23 Thread Julien Nioche (JIRA)
Fetcher to skip queues for URLS getting repeated exceptions - Key: NUTCH-769 URL: https://issues.apache.org/jira/browse/NUTCH-769 Project: Nutch Issue Type: Improvement

[jira] Updated: (NUTCH-769) Fetcher to skip queues for URLS getting repeated exceptions

2009-11-23 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-769: Attachment: NUTCH-769.patch Fetcher to skip queues for URLS getting repeated exceptions

[jira] Created: (NUTCH-770) Timebomb for Fetcher

2009-11-23 Thread Julien Nioche (JIRA)
Timebomb for Fetcher Key: NUTCH-770 URL: https://issues.apache.org/jira/browse/NUTCH-770 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche This patch provides the Fetcher with a timebomb

[jira] Updated: (NUTCH-770) Timebomb for Fetcher

2009-11-23 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-770: Attachment: NUTCH-770.patch Timebomb for Fetcher Key:

[jira] Updated: (NUTCH-769) Fetcher to skip queues for URLS getting repeated exceptions

2009-11-28 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-769: Attachment: NUTCH-769-2.patch Fetcher to skip queues for URLS getting repeated exceptions

[jira] Commented: (NUTCH-769) Fetcher to skip queues for URLS getting repeated exceptions

2009-11-28 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783247#action_12783247 ] Julien Nioche commented on NUTCH-769: - Missed a couple of lines indeed when I was trying

[jira] Commented: (NUTCH-770) Timebomb for Fetcher

2009-11-28 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783248#action_12783248 ] Julien Nioche commented on NUTCH-770: - The log simply shows that the patch has not been

[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2009-11-30 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783612#action_12783612 ] Julien Nioche commented on NUTCH-692: - Ok let's leave it open for now

[jira] Updated: (NUTCH-770) Timebomb for Fetcher

2009-11-30 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-770: Attachment: NUTCH-770-v3.patch the v2 applied the Lucene code formatting to the whole java file

[jira] Updated: (NUTCH-767) Update Tika to v5.0 for the MimeType detection

2009-12-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-767: Description: The version 5 of Tika requires a few changes to the MimeType implementation. Tika is

[jira] Updated: (NUTCH-767) Update Tika to v0.5 for the MimeType detection

2009-12-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-767: Description: The version 0.5 of Tika requires a few changes to the MimeType implementation. Tika

[jira] Updated: (NUTCH-767) Update Tika to v0.5 for the MimeType detection

2009-12-03 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-767: Attachment: NUTCH-767-part2.patch Fixes compilation issues for test class

[jira] Reopened: (NUTCH-767) Update Tika to v0.5 for the MimeType detection

2009-12-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reopened NUTCH-767: - the problem with the test class has been investigated. am reopening the issue so that we can mark it

[jira] Updated: (NUTCH-767) Update Tika to v0.5 for the MimeType detection

2009-12-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-767: Attachment: NUTCH-767-part3.patch the problems with the test come from the fact that tika's

[jira] Commented: (NUTCH-658) Add Counter for # of doc fetched in Reporter

2010-01-04 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796133#action_12796133 ] Julien Nioche commented on NUTCH-658: - If no one objects I'll commit this one in the

[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2010-01-04 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796221#action_12796221 ] Julien Nioche commented on NUTCH-666: - I agree with Sami that this should be contributed

[jira] Resolved: (NUTCH-658) Add Counter for # of doc fetched in Reporter

2010-01-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-658. - Resolution: Fixed Fix Version/s: 1.1 Committed revision 895972 Add Counter for # of doc

[jira] Assigned: (NUTCH-655) Injecting Crawl metadata

2010-01-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-655: --- Assignee: Julien Nioche Injecting Crawl metadata

[jira] Assigned: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2

2010-01-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-719: --- Assignee: Julien Nioche fetchQueues.totalSize incorrect in Fetcher2

[jira] Assigned: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-01-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-762: --- Assignee: Julien Nioche Alternative Generator which can generate several segments in one

[jira] Assigned: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2010-01-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-692: --- Assignee: Julien Nioche AlreadyBeingCreatedException with Hadoop 0.19

[jira] Resolved: (NUTCH-655) Injecting Crawl metadata

2010-01-06 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-655. - Resolution: Fixed Committed revision 896539 Injecting Crawl metadata

[jira] Commented: (NUTCH-776) Configurable queue depth

2010-01-07 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797653#action_12797653 ] Julien Nioche commented on NUTCH-776: - Did you notice any improvement in the fetch rate

[jira] Assigned: (NUTCH-269) CrawlDbReducer: OOME because no upper-bound on inlinks count

2010-01-08 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-269: --- Assignee: Julien Nioche CrawlDbReducer: OOME because no upper-bound on inlinks count

[jira] Commented: (NUTCH-269) CrawlDbReducer: OOME because no upper-bound on inlinks count

2010-01-08 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797990#action_12797990 ] Julien Nioche commented on NUTCH-269: - I will shortly commit a variant of this approach

[jira] Resolved: (NUTCH-269) CrawlDbReducer: OOME because no upper-bound on inlinks count

2010-01-08 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-269. - Resolution: Fixed Fix Version/s: 1.1 Committed revision 897180 CrawlDbReducer: OOME

[jira] Closed: (NUTCH-767) Update Tika to v0.5 for the MimeType detection

2010-01-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-767. --- Resolution: Fixed Committed revision 897825 Update Tika to v0.5 for the MimeType detection

[jira] Resolved: (NUTCH-751) Upgrade version of HttpClient

2010-01-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-751. - Resolution: Later The changes in the underlying API are quite substantial and this would need a

[jira] Commented: (NUTCH-766) Tika parser

2010-01-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798727#action_12798727 ] Julien Nioche commented on NUTCH-766: - Hi Chris, No worries, I'd rather wait for you

[jira] Created: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-01-18 Thread Julien Nioche (JIRA)
Mechanism for passing metadata from parse to crawldb Key: NUTCH-779 URL: https://issues.apache.org/jira/browse/NUTCH-779 Project: Nutch Issue Type: New Feature Reporter:

[jira] Updated: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-01-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-779: Attachment: NUTCH-779 Mechanism for passing metadata from parse to crawldb

[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-01-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802172#action_12802172 ] Julien Nioche commented on NUTCH-779: - The property needs some documentation in

[jira] Resolved: (NUTCH-778) Running Nutch On linux having whoami exception?

2010-01-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-778. - Resolution: Invalid Fix Version/s: (was: 1.0.0) This is likely to be a problem with

[jira] Commented: (NUTCH-766) Tika parser

2010-01-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803670#action_12803670 ] Julien Nioche commented on NUTCH-766: - I think the end result of this plugin should be
