[jira] Created: (NUTCH-900) Confusion in nutch-default between http.content.limit and file.content.limit
Confusion in nutch-default between http.content.limit and file.content.limit

                 Key: NUTCH-900
                 URL: https://issues.apache.org/jira/browse/NUTCH-900
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 1.2
            Reporter: Markus Jelsma
            Priority: Trivial
             Fix For: 1.2

The http.content.limit and file.content.limit settings can be confusing and have misled several users. The description element for these settings should be changed to state the difference between them, so that users are not misled so easily.

See also http://lucene.472066.n3.nabble.com/ERROR-tika-TikaParser-org-apache-pdfbox-io-PushBackInputStream-td964353.html for a discussion.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
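To make the distinction concrete, here is a minimal sketch of how two per-protocol limits could behave. It assumes that each limit applies only to content fetched through its own protocol and that a negative value disables truncation; the class name, method name, and the values used are illustrative and are not taken from the Nutch source.

```java
import java.util.Arrays;

class ContentLimitSketch {

    /**
     * Returns the content cut to at most 'limit' bytes.
     * Assumption: a negative limit means "do not truncate".
     */
    static byte[] applyLimit(byte[] content, int limit) {
        if (limit < 0 || content.length <= limit) {
            return content;
        }
        return Arrays.copyOf(content, limit);
    }

    public static void main(String[] args) {
        byte[] document = new byte[100_000];

        int httpLimit = 65536; // illustrative value for http.content.limit
        int fileLimit = -1;    // illustrative value for file.content.limit

        // A document fetched over http:// is governed by the http limit only,
        // so raising file.content.limit alone would not stop it being cut.
        System.out.println(applyLimit(document, httpLimit).length); // 65536
        System.out.println(applyLimit(document, fileLimit).length); // 100000
    }
}
```

A truncated binary document (a PDF, for instance) may then fail to parse, which is one way the confusion between the two settings can surface.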
[jira] Updated: (NUTCH-900) Confusion in nutch-default between http.content.limit and file.content.limit
[ https://issues.apache.org/jira/browse/NUTCH-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-900:

    Attachment: NUTCH-900.MarkusJelsma.100908.patch.txt
[jira] Updated: (NUTCH-900) Confusion in nutch-default between http.content.limit and file.content.limit
[ https://issues.apache.org/jira/browse/NUTCH-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-900:

    Patch Info: [Patch Available]
[jira] Created: (NUTCH-901) Make index-more plug-in configurable
Make index-more plug-in configurable

                 Key: NUTCH-901
                 URL: https://issues.apache.org/jira/browse/NUTCH-901
             Project: Nutch
          Issue Type: Improvement
          Components: indexer
            Reporter: Markus Jelsma
             Fix For: 1.2

In my case, I don't want the index-more plug-in to split content-types on the slash. Tokenization is something a Solr instance should take care of. Instead of removing the code (which would break compatibility for users who rely on it), we need a way to configure the plug-in not to split the content-type.
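The requested behaviour can be sketched as a boolean switch around the splitting step. This is only an illustration of the idea: the class, the method, and the splitParts flag are hypothetical names, and the exact values the real plug-in emits are assumed here to be the full type plus its two parts.

```java
import java.util.ArrayList;
import java.util.List;

class MimeTypeFieldSketch {

    /**
     * Builds the values indexed for a content-type.
     * With splitParts=true the type is also split on the slash (assumed
     * current behaviour); with splitParts=false only the full type is kept.
     */
    static List<String> typeFieldValues(String contentType, boolean splitParts) {
        List<String> values = new ArrayList<>();
        values.add(contentType);
        if (splitParts) {
            for (String part : contentType.split("/")) {
                values.add(part);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(typeFieldValues("text/html", true));  // [text/html, text, html]
        System.out.println(typeFieldValues("text/html", false)); // [text/html]
    }
}
```

Defaulting the flag to true would keep existing indexes unchanged for users who rely on the split values.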
[Nutch Wiki] Update of GORA_HBase by JulienNioche
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The GORA_HBase page has been changed by JulienNioche.
http://wiki.apache.org/nutch/GORA_HBase

--

New page:
This document describes how to get Nutch 2.0 to use HBase as a backend for GORA and is based on revision 993857 of the Nutch trunk.

 * Install and configure HBase 0.20.6
 * Pull the GORA code and compile it
 * Copy the jars from gora/gora-hbase/lib-ext to nutch/lib
 * Add the following to nutch/ivy/ivy.xml

{{{
<dependency org="org.gora" name="gora-hbase" rev="0.1" conf="*->compile">
  <exclude org="com.sun.jdmk"/>
  <exclude org="com.sun.jmx"/>
  <exclude org="javax.jms"/>
</dependency>
}}}

 * Specify the GORA backend in nutch-site.xml

{{{
<property>
  <name>storage.data.store.class</name>
  <value>org.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
}}}

 * Add mapping file for hbase in conf/gora-hbase-mapping.xml

{{{
<?xml version="1.0" encoding="UTF-8"?>
<gora-orm>
  <table name="webtable">
    <family name="p"/> <!-- This can also have params like compression, bloom filters -->
    <family name="f"/>
    <family name="s"/>
    <family name="il"/>
    <family name="ol"/>
    <family name="h"/>
    <family name="mtdt"/>
    <family name="mk"/>
  </table>
  <class table="webtable" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
    <!-- fetch fields -->
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="prevFetchTime" family="f" qualifier="pts"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
    <field name="reprUrl" family="f" qualifier="rpr"/>
    <field name="content" family="f" qualifier="cnt"/>
    <field name="contentType" family="f" qualifier="typ"/>
    <field name="protocolStatus" family="f" qualifier="prot"/>
    <field name="modifiedTime" family="f" qualifier="mod"/>
    <!-- parse fields -->
    <field name="title" family="p" qualifier="t"/>
    <field name="text" family="p" qualifier="c"/>
    <field name="parseStatus" family="p" qualifier="st"/>
    <field name="signature" family="p" qualifier="sig"/>
    <field name="prevSignature" family="p" qualifier="psig"/>
    <!-- score fields -->
    <field name="score" family="s" qualifier="s"/>
    <field name="headers" family="h"/>
    <field name="inlinks" family="il"/>
    <field name="outlinks" family="ol"/>
    <field name="metadata" family="mtdt"/>
    <field name="markers" family="mk"/>
  </class>
</gora-orm>
}}}

 * Compile Nutch - ant runtime
 * Make sure HBase is started and working properly

You should then be able to use it. Try going to ''$NUTCH_HOME/runtime/local/bin'' and do:

{{{
nutch inject /someseedDir
nutch readdb
}}}

You should find more details in the logs in ''$NUTCH_HOME/runtime/local/logs/hadoop.log''
[Nutch Wiki] Update of FrontPage by JulienNioche
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The FrontPage page has been changed by JulienNioche.
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=137&rev2=138

--

  Please contribute your knowledge about Nutch here!

  == Looking for the Version 1.1 release ==
- Find it at http://www.apache.org/dyn/closer.cgi/nutch/
-
  == General Information ==
   * [[http://nutch.apache.org|Nutch Website]]
@@ -99, +97 @@
   * Nutch2Architecture -- Discussions on the Nutch 2.0 architecture (old)
   * NewScoring -- New stable pagerank like webgraph and link-analysis jobs.
   * NewScoringIndexingExample -- Two full fetch cycles of commands using new scoring and indexing systems.
+  * [[GORA_HBase]] -- Configuring Nutch 2.0 with GORA and HBASE

  == Other Resources ==
   * [[http://nutch.sourceforge.net/blog/cutting.html|Doug's Weblog]] -- He's the one who originally wrote Lucene and Nutch.
Re: Nutch 2.0 Help
Hi guys,

I've summarized the steps to follow for having GORA+HBase with Nutch 2.0 on http://wiki.apache.org/nutch/GORA_HBase

Feel free to amend and improve as you see fit. Please bear in mind that Nutch 2.0 is at a very early stage and is far from being bug-proof, see in particular [1].

HTH

Julien

[1] https://issues.apache.org/jira/browse/NUTCH-893

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

On 6 September 2010 13:35, Andrzej Bialecki a...@getopt.org wrote:

On 2010-09-05 14:56, David Stuart wrote:

Hi All,

I have done as per below and can create a table from within the hbase shell. I found the appropriate create table method

    bin/nutch org.apache.nutch.storage.WebTableCreator webtable

but it only returns null. Any help would be great.

You don't have to create a table manually - this should happen automatically when you first run any Nutch tool. Just make sure you have hbase-site.xml on your classpath in Nutch - best if you put it in your conf/ and rebuild, so that it's packed into a job jar.
Here's for example my config files that work with HBase (I don't use any non-standard settings for HBase, so my hbase-site.xml has no properties, but still it needs to be included in Nutch job jar):

gora-hbase-mapping.xml:
-
    <gora-orm>
      <table name="webtable">
        <family name="p"/> <!-- This can also have params like compression, bloom filters -->
        <family name="f"/>
        <family name="s"/>
        <family name="il"/>
        <family name="ol"/>
        <family name="h"/>
        <family name="mtdt"/>
        <family name="mk"/>
      </table>
      <class table="webtable" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
        <!-- fetch fields -->
        <field name="baseUrl" family="f" qualifier="bas"/>
        <field name="status" family="f" qualifier="st"/>
        <field name="prevFetchTime" family="f" qualifier="pts"/>
        <field name="fetchTime" family="f" qualifier="ts"/>
        <field name="fetchInterval" family="f" qualifier="fi"/>
        <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
        <field name="reprUrl" family="f" qualifier="rpr"/>
        <field name="content" family="f" qualifier="cnt"/>
        <field name="contentType" family="f" qualifier="typ"/>
        <field name="protocolStatus" family="f" qualifier="prot"/>
        <field name="modifiedTime" family="f" qualifier="mod"/>
        <!-- parse fields -->
        <field name="title" family="p" qualifier="t"/>
        <field name="text" family="p" qualifier="c"/>
        <field name="parseStatus" family="p" qualifier="st"/>
        <field name="signature" family="p" qualifier="sig"/>
        <field name="prevSignature" family="p" qualifier="psig"/>
        <!-- score fields -->
        <field name="score" family="s" qualifier="s"/>
        <field name="headers" family="h"/>
        <field name="inlinks" family="il"/>
        <field name="outlinks" family="ol"/>
        <field name="metadata" family="mtdt"/>
        <field name="markers" family="mk"/>
      </class>
    </gora-orm>
-

nutch-site.xml:
-
    ... blah blah, a lot of unrelated stuff...
    <property>
      <name>storage.data.store.class</name>
      <value>org.gora.hbase.store.HBaseStore</value>
      <description>Default class for storing data</description>
    </property>
-

Of course you need also to use the same hadoop files (hdfs-site and mapred-site) as the ones that HBase uses.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
[jira] Updated: (NUTCH-901) Make index-more plug-in configurable
[ https://issues.apache.org/jira/browse/NUTCH-901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-901:

              Summary: Make index-more plug-in configurable  (was: Make index-more plug-in configurable )
        Fix Version/s: 2.0
    Affects Version/s: 1.2
                       2.0

Needs fixing in the trunk as well (v2.0)
[jira] Updated: (NUTCH-900) Confusion in nutch-default between http.content.limit and file.content.limit
[ https://issues.apache.org/jira/browse/NUTCH-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-900:

        Fix Version/s: 2.0
    Affects Version/s: 2.0

To be fixed in the trunk as well
[jira] Assigned: (NUTCH-900) Confusion in nutch-default between http.content.limit and file.content.limit
[ https://issues.apache.org/jira/browse/NUTCH-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche reassigned NUTCH-900:

    Assignee: Julien Nioche
[jira] Closed: (NUTCH-900) Confusion in nutch-default between http.content.limit and file.content.limit
[ https://issues.apache.org/jira/browse/NUTCH-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche closed NUTCH-900.

    Resolution: Fixed

Committed revision 994984 (trunk)
Committed revision 994985 (1.2)

Thanks!
[jira] Commented: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable
[ https://issues.apache.org/jira/browse/NUTCH-407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907182#action_12907182 ]

Andrey Sapegin commented on NUTCH-407:

Please accept the original patch or find a better solution. +1

Make Nutch crawling parent directories for file protocol configurable

                 Key: NUTCH-407
                 URL: https://issues.apache.org/jira/browse/NUTCH-407
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 0.8
            Reporter: Thorsten Scherler
            Assignee: Andrzej Bialecki
         Attachments: 407.fix.diff

http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg06698.html

I am looking into fixing some very weird behavior of the file protocol. I am using 0.8.

Researching this topic I found http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg06536.html and http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch

I am on Ubuntu but I have the same problem that nutch is going down the tree (including parents) and not up (including children from the root url).

Further I would vote to make the fetch-parents optional and defined per a property whether I would like this not very intuitive feature.
Re: Nutch 2.0 Help
Hi,

I think we need to commit all the necessary files to nutch so that it can work out of the box for SQL, HBase and Cassandra. We can even write commented-out entries in gora.properties, nutch-site.xml, etc. so that using nutch with different backends becomes a configuration change.

I will open an issue to track this.

Cheers,
Enis

On Wed, Sep 8, 2010 at 1:53 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote:

Hi guys,

I've summarized the steps to follow for having GORA+Hbase with Nutch 2.0 on http://wiki.apache.org/nutch/GORA_HBase

Feel free to amend and improve as you see fit. Please bear in mind that Nutch 2.0 is at a very early stage and is far from being bug-proof, see in particular [1].

HTH

Julien

[1] https://issues.apache.org/jira/browse/NUTCH-893

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

On 6 September 2010 13:35, Andrzej Bialecki a...@getopt.org wrote:

On 2010-09-05 14:56, David Stuart wrote:

Hi All,

I have done as per below and can create a table from within the hbase shell. I found the appropriate create table method

    bin/nutch org.apache.nutch.storage.WebTableCreator webtable

but it only returns null. Any help would be great.

You don't have to create a table manually - this should happen automatically when you first run any Nutch tool. Just make sure you have hbase-site.xml on your classpath in Nutch - best if you put it in your conf/ and rebuild, so that it's packed into a job jar.
[jira] Created: (NUTCH-902) Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box
Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box

                 Key: NUTCH-902
                 URL: https://issues.apache.org/jira/browse/NUTCH-902
             Project: Nutch
          Issue Type: New Feature
          Components: documentation, storage
    Affects Versions: nutchbase
            Reporter: Enis Soztutar
            Assignee: Enis Soztutar

As per the discussion in the mailing list and http://wiki.apache.org/nutch/GORA_HBase, it will be good to include all the necessary files and configuration. I propose that we maintain configuration for at least SQL, HBase and Cassandra.

The following changes are needed:
  conf/gora-sql-mapping.xml
  conf/gora-hbase-mapping.xml
  conf/gora-cassandra-mapping.xml
  comments on nutch-default and ivy.xml

Shall we also include jars from gora-hbase, gora-cassandra and their dependencies?
[jira] Created: (NUTCH-903) RESUME_KEY field in FetcherJob.Java has not been get correctly
RESUME_KEY field in FetcherJob.Java has not been get correctly

                 Key: NUTCH-903
                 URL: https://issues.apache.org/jira/browse/NUTCH-903
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 2.0
         Environment: nutch 2.0
            Reporter: faruk berksöz
            Priority: Minor
             Fix For: 2.0

Source modification request for nutch 2.0.

FetcherJob.java

    ...
    // in FetcherMapper
    protected void setup(Context context) {
      Configuration conf = context.getConfiguration();
      shouldContinue = conf.getBoolean("job.continue", false);
      // job.continue has not been set anywhere;
      // job.continue should be RESUME_KEY, which is set earlier for this purpose
      crawlId = new Utf8(conf.get(GeneratorJob.CRAWL_ID, Nutch.ALL_CRAWL_ID_STR));
    }
    ...
[jira] Updated: (NUTCH-903) RESUME_KEY field in FetcherJob.Java has not been get correctly
[ https://issues.apache.org/jira/browse/NUTCH-903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

faruk berksöz updated NUTCH-903:

Description:
Source modification request for nutch 2.0 .xx

FetcherJob.Java

    ...
    FetcherMapper
    protected void setup(Context context) {
      Configuration conf = context.getConfiguration();
      shouldContinue = conf.getBoolean("job.continue", false);
      job.continue has not been set anywhere
      job.continue should be RESUME_KEY which is set before for this purpose
      crawlId = new Utf8(conf.get(GeneratorJob.CRAWL_ID, Nutch.ALL_CRAWL_ID_STR));
    }
    ...

was:
Source modification request for nutch 2.0 .xx

FetcherJob.Java

    ...
    FetcherMapper
    protected void setup(Context context) {
      Configuration conf = context.getConfiguration();
      shouldContinue = conf.getBoolean("job.continue", false);
      job.continue has not been set anywhere
      job.continue should be RESUME_KEY which is set before for this purpose
      crawlId = new Utf8(conf.get(GeneratorJob.CRAWL_ID, Nutch.ALL_CRAWL_ID_STR));
    }
    ...
[jira] Updated: (NUTCH-903) RESUME_KEY field in FetcherJob.Java has not been get correctly
[ https://issues.apache.org/jira/browse/NUTCH-903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

faruk berksöz updated NUTCH-903:

Description:
Source modification request for nutch 2.0 .

FetcherJob.Java

    ...
    FetcherMapper
    protected void setup(Context context) {
      Configuration conf = context.getConfiguration();
      shouldContinue = conf.getBoolean("job.continue", false);
      job.continue has not been set anywhere
      job.continue should be RESUME_KEY which is set before for this purpose
      crawlId = new Utf8(conf.get(GeneratorJob.CRAWL_ID, Nutch.ALL_CRAWL_ID_STR));
    }
    ...

was:
Source modification request for nutch 2.0 .xx

FetcherJob.Java

    ...
    FetcherMapper
    protected void setup(Context context) {
      Configuration conf = context.getConfiguration();
      shouldContinue = conf.getBoolean("job.continue", false);
      job.continue has not been set anywhere
      job.continue should be RESUME_KEY which is set before for this purpose
      crawlId = new Utf8(conf.get(GeneratorJob.CRAWL_ID, Nutch.ALL_CRAWL_ID_STR));
    }
    ...
[jira] Closed: (NUTCH-903) RESUME_KEY field in FetcherJob.Java has not been get correctly
[ https://issues.apache.org/jira/browse/NUTCH-903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

faruk berksöz closed NUTCH-903.

    Resolution: Fixed

I'm so sorry... The description is not readable, and I don't know why. I am closing this issue and opening a new one.
[jira] Created: (NUTCH-904) -resume option is always processed as false in FetcherJob.
-resume option is always processed as false in FetcherJob.

                 Key: NUTCH-904
                 URL: https://issues.apache.org/jira/browse/NUTCH-904
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 2.0
         Environment: Nutch 2.0
            Reporter: faruk berksöz
             Fix For: 2.0

job.continue is not set anywhere, so the lookup below always falls back to false. The key read here should be RESUME_KEY, which is set earlier for this purpose.

{code:title=FetcherJob.java|borderStyle=solid}
...
// in FetcherMapper
protected void setup(Context context) {
  Configuration conf = context.getConfiguration();
  shouldContinue = conf.getBoolean("job.continue", false);
  crawlId = new Utf8(conf.get(GeneratorJob.CRAWL_ID, Nutch.ALL_CRAWL_ID_STR));
}
...
{code}
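The effect of reading a different key from the one that was written can be reproduced without Hadoop. The following is a minimal sketch that uses java.util.Properties as a stand-in for the job Configuration; the value assigned to RESUME_KEY and the helper names are hypothetical and are not taken from FetcherJob.

```java
import java.util.Properties;

class ResumeKeySketch {

    // Hypothetical value; the real constant is defined in FetcherJob.
    static final String RESUME_KEY = "fetcher.job.resume";

    /** Minimal stand-in for Configuration.getBoolean(name, defaultValue). */
    static boolean getBoolean(Properties conf, String key, boolean defaultValue) {
        String value = conf.getProperty(key);
        return value == null ? defaultValue : Boolean.parseBoolean(value);
    }

    /** Reads a key that nothing ever sets, so the default always wins. */
    static boolean buggySetup(Properties conf) {
        return getBoolean(conf, "job.continue", false);
    }

    /** Reads the same key that the job driver wrote. */
    static boolean fixedSetup(Properties conf) {
        return getBoolean(conf, RESUME_KEY, false);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(RESUME_KEY, "true"); // what passing -resume would store

        System.out.println(buggySetup(conf)); // false
        System.out.println(fixedSetup(conf)); // true
    }
}
```

Sharing one constant between the code that writes the option and the code that reads it avoids this class of mismatch.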
[jira] Commented: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable
[ https://issues.apache.org/jira/browse/NUTCH-407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907275#action_12907275 ]

Chris A. Mattmann commented on NUTCH-407:

Hmmm: I agree here. If no one objects in the next 48 hours, I would like to:

1. open a new issue and link it to this one
2. commit the fix for NUTCH-407.

My rationale here is that:

a. the behavior is configurable (can be turned on or off)
b. it supports at least a few users and their specific use cases
c. this issue hasn't really been intensely debated or looked at for a few years now
d. the patch is backwards compatible, because the default behavior is the current behavior (but can be overridden per a.)
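Points a. and d. can be sketched as a flag that defaults to the current behavior. This is an illustration under assumptions, not the attached patch: the class, the method names, and the crawlParents flag are hypothetical, and the sketch treats paths as plain slash-separated strings.

```java
import java.util.ArrayList;
import java.util.List;

class FileParentSketch {

    /** Returns the parent of a slash-separated path, or null at the root. */
    static String parentOf(String dir) {
        int slash = dir.lastIndexOf('/');
        if (slash < 0 || dir.equals("/")) {
            return null;
        }
        return slash == 0 ? "/" : dir.substring(0, slash);
    }

    /**
     * Builds the links emitted for a directory listing. The parent link is
     * added only when crawlParents is true (assumed to be the default, which
     * matches the behavior described in the issue).
     */
    static List<String> listingLinks(String dir, List<String> children, boolean crawlParents) {
        List<String> links = new ArrayList<>();
        String parent = parentOf(dir);
        if (crawlParents && parent != null) {
            links.add(parent);
        }
        for (String child : children) {
            links.add(dir + "/" + child);
        }
        return links;
    }

    public static void main(String[] args) {
        List<String> children = new ArrayList<>();
        children.add("a.txt");
        System.out.println(listingLinks("/data/docs", children, true));  // [/data, /data/docs/a.txt]
        System.out.println(listingLinks("/data/docs", children, false)); // [/data/docs/a.txt]
    }
}
```

With the flag off, a crawl seeded at a directory stays inside that directory's subtree.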
[jira] Updated: (NUTCH-904) -resume option is always processed as false in FetcherJob.
[ https://issues.apache.org/jira/browse/NUTCH-904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

faruk berksöz updated NUTCH-904:

    Attachment: NUTCH-904.patch

patch
[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes
[ https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907297#action_12907297 ]

Andrzej Bialecki commented on NUTCH-893:

Very good catch - yes, the test now passes for me too. This is actually good news for Gora :)

I'll continue digging regarding NUTCH-879 ... don't hesitate if you have any ideas how to solve that. I suspect we may be losing keys in Generator or Fetcher, due to partitioning collisions, but this hypothesis needs to be tested.

DataStore.put() silently loses records when executed from multiple processes

                 Key: NUTCH-893
                 URL: https://issues.apache.org/jira/browse/NUTCH-893
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 2.0
         Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 1.6
            Reporter: Andrzej Bialecki
            Priority: Blocker
             Fix For: 2.0
         Attachments: NUTCH-893.patch, NUTCH-893_v2.patch

In order to debug the issue described in NUTCH-879 I created a test to simulate multiple clients appending to webtable (please see the patch), which is the situation that we have in distributed map-reduce jobs.

There are two tests there: one that uses multiple threads within the same JVM, and another that uses a single thread in multiple JVMs. Each test first clears webtable (be careful!), then puts a bunch of pages, and finally checks that all are present and that their values correspond to their keys. To make things more interesting, each execution context (thread or process) closes and reopens its instance of DataStore a few times.

The multithreaded test passes just fine. However, the multi-process test fails with missing keys, as many as 30%.
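The shape of the multithreaded variant of such a test can be sketched as follows. This is not the attached patch: a ConcurrentHashMap stands in for the webtable, so the sketch only shows the put-then-verify structure and cannot reproduce the multi-process loss; all names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class PutTestSketch {

    /** Key written by a given writer for a given page index. */
    static String keyFor(int writer, int page) {
        return "writer" + writer + "/page" + page;
    }

    /** Runs the writers concurrently and returns the number of missing or wrong records. */
    static int runTest(int writers, int pagesPerWriter) {
        Map<String, String> store = new ConcurrentHashMap<>(); // stand-in for webtable
        ExecutorService pool = Executors.newFixedThreadPool(writers);
        for (int w = 0; w < writers; w++) {
            final int writer = w;
            pool.submit(() -> {
                for (int p = 0; p < pagesPerWriter; p++) {
                    String key = keyFor(writer, p);
                    store.put(key, "value-of-" + key);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("writers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        // Verify that every key is present and that its value corresponds to the key.
        int missing = 0;
        for (int w = 0; w < writers; w++) {
            for (int p = 0; p < pagesPerWriter; p++) {
                String key = keyFor(w, p);
                if (!("value-of-" + key).equals(store.get(key))) {
                    missing++;
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        System.out.println("missing records: " + runTest(4, 1000));
    }
}
```

Deriving each value from its key is what lets the verification pass detect both lost records and records overwritten with another writer's data.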