[jira] Assigned: (NUTCH-817) parse-(html)does follow links of full html page, parse-(tika) does follow any links and stops at level 1

2010-05-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-817: --- Assignee: Julien Nioche > parse-(html)does follow links of full html page, parse-(tika) does f

[jira] Commented: (NUTCH-710) Support for rel="canonical" attribute

2010-04-21 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859286#action_12859286 ] Julien Nioche commented on NUTCH-710: - As suggested previously we could either treat can

[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-13 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856349#action_12856349 ] Julien Nioche commented on NUTCH-808: - Hi Enis, {quote} On the other hand, current impl

[jira] Updated: (NUTCH-650) Hbase Integration

2010-04-07 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-650: Affects Version/s: (was: 1.0.0) Fix Version/s: 2.0 > Hbase Integration > ---

[jira] Updated: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-07 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-808: Fix Version/s: 2.0 > Evaluate ORM Frameworks which support non-relational column-oriented > datasto

[jira] Closed: (NUTCH-810) Upgrade to Tika 0.7

2010-04-06 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-810. --- Resolution: Fixed Committed in rev 931098. http://issues.apache.org/jira/browse/TIKA-317 changed the

[jira] Updated: (NUTCH-789) Improvements to Tika parser

2010-04-06 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-789: Component/s: (was: fetcher) parser Fix Version/s: (was: 1.1) Have c

[jira] Created: (NUTCH-810) Upgrade to Tika 0.7

2010-04-06 Thread Julien Nioche (JIRA)
Upgrade to Tika 0.7 --- Key: NUTCH-810 URL: https://issues.apache.org/jira/browse/NUTCH-810 Project: Nutch Issue Type: Improvement Components: parser Affects Versions: 1.0.0 Reporter: Julien Nioche

[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-809: Description: h2. Parse-metatags plugin The parse-metatags plugin consists of a HTMLParserFilter whi

[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-04-04 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853251#action_12853251 ] Julien Nioche commented on NUTCH-789: - Will upgrade as soon as 0.7 is available from ht

[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-809: Description: h2. Parse-metatags plugin The parse-metatags plugin consists of a HTMLParserFilter whi

[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-809: Attachment: NUTCH-809.patch Modified version of the plugin which is compatible with parse-tika > Pa

[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-809: Attachment: (was: NUTCH-809.patch) > Parse-metatags plugin > - > >

[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-809: Attachment: NUTCH-809.patch > Parse-metatags plugin > - > > Key:

[jira] Created: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)
Parse-metatags plugin - Key: NUTCH-809 URL: https://issues.apache.org/jira/browse/NUTCH-809 Project: Nutch Issue Type: New Feature Components: parser Reporter: Julien Nioche Assignee: Jul

[jira] Commented: (NUTCH-794) Language Identification must use check the parse metadata for language values

2010-03-31 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852095#action_12852095 ] Julien Nioche commented on NUTCH-794: - The issue has not been fixed in Tika. Will refile

[jira] Updated: (NUTCH-706) Url regex normalizer

2010-03-31 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-706: Fix Version/s: (was: 1.1) Both variants of the substitution rule above break existing tests. Mor

[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

2010-03-30 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851545#action_12851545 ] Julien Nioche commented on NUTCH-570: - {quote}Julien, want to take this?{quote} Not par

[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-03-30 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851316#action_12851316 ] Julien Nioche commented on NUTCH-789: - Shall we postpone the work on this issue to after

[jira] Updated: (NUTCH-714) Need a SFTP and SCP Protocol Handler

2010-03-30 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-714: Affects Version/s: (was: 0.9.0) 1.0.0 Fix Version/s: (was: 0.8

[jira] Closed: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL

2010-03-30 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-785. --- Resolution: Fixed Committed revision 929039 Thanks Andrzej for reviewing it > Fetcher : copy metadat

[jira] Resolved: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-03-30 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-779. - Resolution: Fixed Fix Version/s: 1.1 Committed revision 929038. Thanks Andrzej for your fe

[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-03-29 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850915#action_12850915 ] Julien Nioche commented on NUTCH-779: - Could anyone please review this issue? I would li

[jira] Commented: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL

2010-03-29 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850912#action_12850912 ] Julien Nioche commented on NUTCH-785: - Could anyone please review this issue? I would li

[jira] Updated: (NUTCH-783) IndexerChecker Utilty

2010-03-29 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-783: Fix Version/s: (was: 1.1) Removed tag 1.1 Will rename to IndexingPluginsChecker later > Indexer

[jira] Created: (NUTCH-806) Merge CrawlDBScanner with CrawlDBReader

2010-03-29 Thread Julien Nioche (JIRA)
Merge CrawlDBScanner with CrawlDBReader --- Key: NUTCH-806 URL: https://issues.apache.org/jira/browse/NUTCH-806 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assign

[jira] Updated: (NUTCH-784) CrawlDBScanner

2010-03-29 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-784: Fix Version/s: 1.1 > CrawlDBScanner > --- > > Key: NUTCH-784 >

[jira] Closed: (NUTCH-784) CrawlDBScanner

2010-03-29 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-784. --- Resolution: Fixed Committed revision 928746 > CrawlDBScanner > --- > > K

[jira] Updated: (NUTCH-776) Configurable queue depth

2010-03-23 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-776: Fix Version/s: (was: 1.1) Moving this issue post 1.1 Needs a patch file, some description of the

[jira] Closed: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-762. --- Resolution: Fixed Committed revision 926155 Have reverted the prefix for params to 'generate.' + adde

[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848140#action_12848140 ] Julien Nioche commented on NUTCH-762: - The change of prefix also reflected that we now u

[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848095#action_12848095 ] Julien Nioche commented on NUTCH-762: - {quote} I just noticed that the new Generator use

[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-762: Attachment: NUTCH-762-v3.patch new patch which reintroduces the 'generator.update.crawldb' functiona

[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-762: Fix Version/s: 1.1 > Alternative Generator which can generate several segments in one parse of the

[jira] Closed: (NUTCH-740) Configuration option to override default language for fetched pages.

2010-03-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-740. --- Resolution: Fixed Assignee: Julien Nioche Committed in rev 926003 Thanks Marcin for contributing

[jira] Updated: (NUTCH-740) Configuration option to override default language for fetched pages.

2010-03-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-740: Attachment: NUTCH-740.patch Slightly modified version of the patch with modifs for protocol-http. wi

[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846930#action_12846930 ] Julien Nioche commented on NUTCH-762: - Yes, I came across that situation too on a large

[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846910#action_12846910 ] Julien Nioche commented on NUTCH-762: - OK, there was indeed an assumption that the gener

[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846141#action_12846141 ] Julien Nioche commented on NUTCH-762: - If I am not mistaken the point of having _genera

[jira] Commented: (NUTCH-740) Configuration option to override default language for fetched pages.

2010-03-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845886#action_12845886 ] Julien Nioche commented on NUTCH-740: - A nice contribution but should not this be applie

[jira] Updated: (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9

2010-03-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-469: Fix Version/s: (was: 1.1) There has not been any changes to this issue since February 09 and it

[jira] Resolved: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2010-03-15 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-692. - Resolution: Cannot Reproduce Fix Version/s: 1.1 I cannot reproduce the issue since we moved

[jira] Updated: (NUTCH-710) Support for rel="canonical" attribute

2010-03-15 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-710: Fix Version/s: (was: 1.1) Great idea. Won't be included in 1.1 though so moving to *fix : unknow

[jira] Resolved: (NUTCH-801) Remove RTF and MP3 parse plugins

2010-03-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-801. - Resolution: Fixed Committed revision 921840. > Remove RTF and MP3 parse plugins > --

[jira] Resolved: (NUTCH-798) Upgrade to SOLR1.4

2010-03-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-798. - Resolution: Fixed Updated SOLRJ's dependencies at the same time : Deleting lib/apache-solr

[jira] Created: (NUTCH-801) Remove RTF and MP3 parse plugins

2010-03-10 Thread Julien Nioche (JIRA)
Remove RTF and MP3 parse plugins Key: NUTCH-801 URL: https://issues.apache.org/jira/browse/NUTCH-801 Project: Nutch Issue Type: Improvement Components: parser Affects Versions: 1.0.0

[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-06 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-762: Attachment: NUTCH-762-v2.patch Improved version of the patch : - fixed a few minor bugs - renamed

[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-06 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-762: Attachment: (was: NUTCH-762-MultiGenerator.patch) > Alternative Generator which can generate sev

[jira] Closed: (NUTCH-799) SOLRIndexer to commit once all reducers have finished

2010-03-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-799. --- Resolution: Fixed Assignee: Julien Nioche Thanks for your feedback Andrzej Committed revision 9

[jira] Closed: (NUTCH-782) Ability to order htmlparsefilters

2010-03-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-782. --- Resolution: Fixed Committed revision 917557 > Ability to order htmlparsefilters > ---

[jira] Updated: (NUTCH-799) SOLRIndexer to commit once all reducers have finished

2010-03-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-799: Attachment: NUTCH-799.patch > SOLRIndexer to commit once all reducers have finished > --

[jira] Created: (NUTCH-799) SOLRIndexer to commit once all reducers have finished

2010-03-01 Thread Julien Nioche (JIRA)
SOLRIndexer to commit once all reducers have finished - Key: NUTCH-799 URL: https://issues.apache.org/jira/browse/NUTCH-799 Project: Nutch Issue Type: Improvement Components: inde

[jira] Created: (NUTCH-798) Upgrade to SOLR1.4

2010-02-26 Thread Julien Nioche (JIRA)
Upgrade to SOLR1.4 -- Key: NUTCH-798 URL: https://issues.apache.org/jira/browse/NUTCH-798 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Julien Nioche Fix For: 1.1 in

[jira] Commented: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2

2010-02-23 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837147#action_12837147 ] Julien Nioche commented on NUTCH-719: - the other addFetchItem method of FetchItemQueues

[jira] Closed: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2

2010-02-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-719. --- > fetchQueues.totalSize incorrect in Fetcher2 > --- > >

[jira] Resolved: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2

2010-02-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-719. - Resolution: Fixed Fix Version/s: 1.1 Committed revision 911905. Thanks to S. Dennis for inv

[jira] Resolved: (NUTCH-644) RTF parser doesn't compile anymore

2010-02-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-644. - Resolution: Fixed RTF parsing is now handled by the TikaPlugin (NUTCH-766) which solves the issue

[jira] Resolved: (NUTCH-705) parse-rtf plugin

2010-02-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-705. - Resolution: Fixed RTF parsing is now handled by the TikaPlugin (NUTCH-766). Please open an issue

[jira] Updated: (NUTCH-750) HtmlParser plugin - page title extraction

2010-02-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-750: Component/s: parser > HtmlParser plugin - page title extraction > --

[jira] Updated: (NUTCH-782) Ability to order htmlparsefilters

2010-02-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-782: Component/s: parser > Ability to order htmlparsefilters > - > >

[jira] Updated: (NUTCH-794) Language Identification must use check the parse metadata for language values

2010-02-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-794: Component/s: parser > Language Identification must use check the parse metadata for language values

[jira] Work started: (NUTCH-794) Language Identification must use check the parse metadata for language values

2010-02-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-794 started by Julien Nioche. > Language Identification must use check the parse metadata for language values >

[jira] Updated: (NUTCH-794) Language Identification must use check the parse metadata for language values

2010-02-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-794: Summary: Language Identification must use check the parse metadata for language values (was: Tika

[jira] Commented: (NUTCH-794) Language Identification must use check the parse metadata for language values

2010-02-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834147#action_12834147 ] Julien Nioche commented on NUTCH-794: - Committed patch in revision 910454 Waiting for i

[jira] Updated: (NUTCH-794) Tika parser does identify lang attributes on html tag

2010-02-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-794: Attachment: NUTCH-794.patch > Tika parser does identify lang attributes on html tag > --

[jira] Commented: (NUTCH-794) Tika parser does identify lang attributes on html tag

2010-02-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834143#action_12834143 ] Julien Nioche commented on NUTCH-794: - Apart from the html attribute being lost (see abo

[jira] Updated: (NUTCH-794) Tika parser does identify lang attributes on html tag

2010-02-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-794: Description: The following HTML document : document 1 titlejotain suomeksi is rendered as the fol

[jira] Created: (NUTCH-794) Tika parser does not keep attributes on html tag

2010-02-16 Thread Julien Nioche (JIRA)
Tika parser does not keep attributes on html tag Key: NUTCH-794 URL: https://issues.apache.org/jira/browse/NUTCH-794 Project: Nutch Issue Type: Bug Reporter: Julien Nioche

[jira] Closed: (NUTCH-766) Tika parser

2010-02-15 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-766. --- Have added small improvement in revision 910187 (Prioritise default Tika parser when discovering plugins

[jira] Commented: (NUTCH-766) Tika parser

2010-02-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832583#action_12832583 ] Julien Nioche commented on NUTCH-766: - @Chris : did you do ant -f src/plugin/parse-tik

[jira] Issue Comment Edited: (NUTCH-766) Tika parser

2010-02-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832564#action_12832564 ] Julien Nioche edited comment on NUTCH-766 at 2/11/10 5:22 PM: --

[jira] Commented: (NUTCH-766) Tika parser

2010-02-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832564#action_12832564 ] Julien Nioche commented on NUTCH-766: - I had a closer look at the HTML parsing issue. Wh

[jira] Commented: (NUTCH-766) Tika parser

2010-02-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832454#action_12832454 ] Julien Nioche commented on NUTCH-766: - @Chris : I just did a fresh co from svn, applied

[jira] Updated: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-02-10 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-787: Fix Version/s: 1.1 > Upgrade Lucene to 3.0.0. > > > Key: NU

[jira] Closed: (NUTCH-786) Better list of suffix domains

2010-02-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-786. --- Resolution: Fixed Committed revision 906907 > Better list of suffix domains > ---

[jira] Updated: (NUTCH-786) Better list of suffix domains

2010-02-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-786: Attachment: NUTCH-786.patch Small improvement to the content of domain-suffixes.xml : added compound

[jira] Created: (NUTCH-786) Better list of suffix domains

2010-02-05 Thread Julien Nioche (JIRA)
Better list of suffix domains - Key: NUTCH-786 URL: https://issues.apache.org/jira/browse/NUTCH-786 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Julien Nioche

[jira] Commented: (NUTCH-781) Update Tika to v0.6 for the MimeType detection

2010-02-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828548#action_12828548 ] Julien Nioche commented on NUTCH-781: - > did you forgot to update conf/tika-mimetypes.xm

[jira] Updated: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-785: Attachment: NUTCH-785.patch > Fetcher : copy metadata from origin URL when redirecting + call > scf

[jira] Created: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL

2010-02-01 Thread Julien Nioche (JIRA)
Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL --- Key: NUTCH-785 URL: https://issues.apache.org/j

[jira] Updated: (NUTCH-784) CrawlDBScanner

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-784: Attachment: NUTCH-784.patch > CrawlDBScanner > --- > > Key: NUTCH-784 >

[jira] Created: (NUTCH-784) CrawlDBScanner

2010-02-01 Thread Julien Nioche (JIRA)
CrawlDBScanner --- Key: NUTCH-784 URL: https://issues.apache.org/jira/browse/NUTCH-784 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Julien Nioche Attachments: NUTCH-78

[jira] Updated: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-779: Attachment: NUTCH-779-v2.patch Improved version of the patch. Followed AB's recommendations and rena

[jira] Assigned: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-779: --- Assignee: Julien Nioche > Mechanism for passing metadata from parse to crawldb > -

[jira] Updated: (NUTCH-783) IndexerChecker Utilty

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-783: Attachment: NUTCH-783.patch > IndexerChecker Utilty > - > > Key:

[jira] Assigned: (NUTCH-783) IndexerChecker Utilty

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-783: --- Assignee: Julien Nioche > IndexerChecker Utilty > - > > Ke

[jira] Created: (NUTCH-783) IndexerChecker Utilty

2010-02-01 Thread Julien Nioche (JIRA)
IndexerChecker Utilty - Key: NUTCH-783 URL: https://issues.apache.org/jira/browse/NUTCH-783 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Julien Nioche Fix For: 1.

[jira] Updated: (NUTCH-782) Ability to order htmlparsefilters

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-782: Attachment: NUTCH-782.patch > Ability to order htmlparsefilters > -

[jira] Created: (NUTCH-782) Ability to order htmlparsefilters

2010-02-01 Thread Julien Nioche (JIRA)
Ability to order htmlparsefilters - Key: NUTCH-782 URL: https://issues.apache.org/jira/browse/NUTCH-782 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Julien N

[jira] Updated: (NUTCH-766) Tika parser

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-766: Attachment: (was: Nutch-766.ParserFactory.patch) > Tika parser > --- > >

[jira] Updated: (NUTCH-766) Tika parser

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-766: Attachment: (was: NUTCH-766.tika.patch) > Tika parser > --- > > Key: NUT

[jira] Updated: (NUTCH-766) Tika parser

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-766: Attachment: NUTCH-766-v3.patch Updated version of the plugin : uses Tika 0.6 > Tika parser > --

[jira] Closed: (NUTCH-781) Update Tika to v0.6 for the MimeType detection

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-781. --- > Update Tika to v0.6 for the MimeType detection > --- > >

[jira] Resolved: (NUTCH-781) Update Tika to v0.6 for the MimeType detection

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-781. - Resolution: Fixed Committed revision 905228 > Update Tika to v0.6 for the MimeType detection > -

[jira] Created: (NUTCH-781) Update Tika to v0.6 for the MimeType detection

2010-02-01 Thread Julien Nioche (JIRA)
Update Tika to v0.6 for the MimeType detection --- Key: NUTCH-781 URL: https://issues.apache.org/jira/browse/NUTCH-781 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche

[jira] Updated: (NUTCH-766) Tika parser

2010-01-28 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-766: Attachment: NUTCH-766.v2 sample.tar.gz new version of the patch + archive containing

[jira] Commented: (NUTCH-766) Tika parser

2010-01-28 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805892#action_12805892 ] Julien Nioche commented on NUTCH-766: - Here is a slightly better version of the patch wh

[jira] Commented: (NUTCH-766) Tika parser

2010-01-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803670#action_12803670 ] Julien Nioche commented on NUTCH-766: - > I think the end result of this plugin should be

[jira] Resolved: (NUTCH-778) Running Nutch On linux having whoami exception?

2010-01-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-778. - Resolution: Invalid Fix Version/s: (was: 1.0.0) This is likely to be a problem with the

[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-01-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802172#action_12802172 ] Julien Nioche commented on NUTCH-779: - > The property needs some documentation in nutch-
