[jira] Created: (NUTCH-900) Confusion in nutch-default between http.content.limit and file.content.limit

2010-09-08 Thread Markus Jelsma (JIRA)
Confusion in nutch-default between http.content.limit and file.content.limit


 Key: NUTCH-900
 URL: https://issues.apache.org/jira/browse/NUTCH-900
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.2
Reporter: Markus Jelsma
Priority: Trivial
 Fix For: 1.2


The http.content.limit and file.content.limit settings are easy to confuse and 
have misled at least several users. The description element for each setting 
should be changed to spell out the difference between them, so that users are 
not misled so easily.
See also: 
http://lucene.472066.n3.nabble.com/ERROR-tika-TikaParser-org-apache-pdfbox-io-PushBackInputStream-td964353.html
 for a discussion.
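
To illustrate why the two settings are easy to mix up, here is a minimal sketch. It is not Nutch's actual code: the class and method names are hypothetical, 65536 is only an example value, and it assumes that each protocol reads only its own limit (in bytes) and that a negative limit disables truncation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class ContentLimitSketch {
    // Hypothetical stand-in for the values a crawl would read from nutch-site.xml.
    static final Map<String, Integer> LIMITS = new HashMap<>();
    static {
        LIMITS.put("http.content.limit", 65536);
        LIMITS.put("file.content.limit", 65536);
    }

    /** Picks the limit for the URL's protocol (assumption: one key per protocol). */
    static int limitFor(String url) {
        String key = url.startsWith("file:") ? "file.content.limit" : "http.content.limit";
        return LIMITS.get(key);
    }

    /** Truncates content to the limit; a negative limit means no truncation. */
    static byte[] truncate(byte[] content, int limit) {
        if (limit < 0 || content.length <= limit) {
            return content;
        }
        return Arrays.copyOf(content, limit);
    }

    public static void main(String[] args) {
        byte[] pdf = new byte[100_000];
        // Lifting only the HTTP limit leaves file: URLs truncated.
        LIMITS.put("http.content.limit", -1);
        System.out.println(truncate(pdf, limitFor("file:///docs/a.pdf")).length);
        System.out.println(truncate(pdf, limitFor("http://example.com/a.pdf")).length);
    }
}
```

In this sketch a user who changes only http.content.limit still gets truncated content when crawling local files, which is the kind of mix-up the improved descriptions are meant to prevent.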

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-900) Confusion in nutch-default between http.content.limit and file.content.limit

2010-09-08 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-900:


Attachment: NUTCH-900.MarkusJelsma.100908.patch.txt




[jira] Updated: (NUTCH-900) Confusion in nutch-default between http.content.limit and file.content.limit

2010-09-08 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-900:


Patch Info: [Patch Available]




[jira] Created: (NUTCH-901) Make index-more plug-in configurable

2010-09-08 Thread Markus Jelsma (JIRA)
Make index-more plug-in configurable

--

 Key: NUTCH-901
 URL: https://issues.apache.org/jira/browse/NUTCH-901
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Markus Jelsma
 Fix For: 1.2


In my case, I don't want the index-more plug-in to split content types on the 
slash. Tokenization is something a Solr instance should take care of. Instead 
of removing the code (which would break compatibility for users who rely on 
it), we need a way to configure the plug-in not to split the content type.
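
A minimal sketch of what such a switch could look like, not the plug-in's real code. The class name and the splitOnSlash flag are hypothetical, and the sketch assumes the plug-in indexes the full content type plus its slash-separated parts.

```java
import java.util.ArrayList;
import java.util.List;

final class ContentTypeSplitSketch {
    /**
     * Returns the values to index for a content type.
     * splitOnSlash stands in for a hypothetical configuration flag;
     * true mimics the assumed current behavior.
     */
    static List<String> typeValues(String contentType, boolean splitOnSlash) {
        List<String> values = new ArrayList<>();
        values.add(contentType);
        if (splitOnSlash && contentType.contains("/")) {
            for (String part : contentType.split("/")) {
                values.add(part);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(typeValues("text/html", true));
        System.out.println(typeValues("text/html", false));
    }
}
```

Defaulting the flag to true would keep existing indexes unchanged, while false would leave tokenization entirely to Solr.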




[Nutch Wiki] Update of GORA_HBase by JulienNioche

2010-09-08 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The GORA_HBase page has been changed by JulienNioche.
http://wiki.apache.org/nutch/GORA_HBase

--

New page:
This document describes how to get Nutch 2.0 to use HBase as a backend for GORA. 
It is based on revision 993857 of the Nutch trunk.

 * Install and configure HBase 0.20.6
 * Pull the GORA code and compile it
 * Copy the jars from gora/gora-hbase/lib-ext to nutch/lib
 * Add the following to nutch/ivy/ivy.xml

{{{
<dependency org="org.gora" name="gora-hbase" rev="0.1" conf="*->compile">
 <exclude org="com.sun.jdmk"/>
 <exclude org="com.sun.jmx"/>
 <exclude org="javax.jms"/>
</dependency>
}}}

 * Specify the GORA backend in nutch-site.xml

{{{
<property>
 <name>storage.data.store.class</name>
 <value>org.gora.hbase.store.HBaseStore</value>
 <description>Default class for storing data</description>
</property>
}}}

 * Add mapping file for hbase in conf/gora-hbase-mapping.xml

{{{
<?xml version="1.0" encoding="UTF-8"?>
<gora-orm>
<table name="webtable">
  <family name="p"/> <!-- This can also have params like compression, bloom filters -->
  <family name="f"/>
  <family name="s"/>
  <family name="il"/>
  <family name="ol"/>
  <family name="h"/>
  <family name="mtdt"/>
  <family name="mk"/>
</table>
<class table="webtable" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
  <!-- fetch fields -->
  <field name="baseUrl" family="f" qualifier="bas"/>
  <field name="status" family="f" qualifier="st"/>
  <field name="prevFetchTime" family="f" qualifier="pts"/>
  <field name="fetchTime" family="f" qualifier="ts"/>
  <field name="fetchInterval" family="f" qualifier="fi"/>
  <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
  <field name="reprUrl" family="f" qualifier="rpr"/>
  <field name="content" family="f" qualifier="cnt"/>
  <field name="contentType" family="f" qualifier="typ"/>
  <field name="protocolStatus" family="f" qualifier="prot"/>
  <field name="modifiedTime" family="f" qualifier="mod"/>
  <!-- parse fields -->
  <field name="title" family="p" qualifier="t"/>
  <field name="text" family="p" qualifier="c"/>
  <field name="parseStatus" family="p" qualifier="st"/>
  <field name="signature" family="p" qualifier="sig"/>
  <field name="prevSignature" family="p" qualifier="psig"/>
  <!-- score fields -->
  <field name="score" family="s" qualifier="s"/>
  <field name="headers" family="h"/>
  <field name="inlinks" family="il"/>
  <field name="outlinks" family="ol"/>
  <field name="metadata" family="mtdt"/>
  <field name="markers" family="mk"/>
</class>
</gora-orm>
}}}
 * Compile Nutch - ant runtime
 * Make sure HBase is started and working properly

You should then be able to use it. Try going to 
''$NUTCH_HOME/runtime/local/bin'' and running:

{{{
  nutch inject /someseedDir
  nutch readdb
}}}

You should find more details in the logs on 
''$NUTCH_HOME/runtime/local/logs/hadoop.log''
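
Because a mistyped family name in the mapping file is easy to miss, a quick sanity check may help before starting a crawl. The following is a minimal sketch using only the JDK's XML parser; the class and method names are hypothetical and it is not part of Nutch or GORA. It reports every field whose family is not declared under a table element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class MappingCheckSketch {
    /** Returns the names of fields whose family attribute matches no declared family. */
    static List<String> undeclaredFamilies(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Mapping is not well-formed XML", e);
        }
        // Collect every declared column family.
        Set<String> families = new HashSet<>();
        NodeList declared = doc.getElementsByTagName("family");
        for (int i = 0; i < declared.getLength(); i++) {
            families.add(((Element) declared.item(i)).getAttribute("name"));
        }
        // A field with a missing or unknown family attribute is reported.
        List<String> bad = new ArrayList<>();
        NodeList fields = doc.getElementsByTagName("field");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            if (!families.contains(field.getAttribute("family"))) {
                bad.add(field.getAttribute("name"));
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        String xml = "<gora-orm><table name='webtable'><family name='f'/></table>"
                + "<class table='webtable' keyClass='java.lang.String'"
                + " name='org.apache.nutch.storage.WebPage'>"
                + "<field name='status' family='f' qualifier='st'/>"
                + "<field name='title' family='p' qualifier='t'/>"
                + "</class></gora-orm>";
        System.out.println(undeclaredFamilies(xml));
    }
}
```

In practice the XML string would be read from conf/gora-hbase-mapping.xml; an empty result means every field refers to a declared family.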


[Nutch Wiki] Update of FrontPage by JulienNioche

2010-09-08 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The FrontPage page has been changed by JulienNioche.
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=137&rev2=138

--

  Please contribute your knowledge about Nutch here!
  
  == Looking for the Version 1.1 release ==
- 
  Find it at http://www.apache.org/dyn/closer.cgi/nutch/
- 
  
  == General Information ==
   * [[http://nutch.apache.org|Nutch Website]]
@@ -99, +97 @@

   * Nutch2Architecture -- Discussions on the Nutch 2.0 architecture (old)
   * NewScoring -- New stable pagerank like webgraph and link-analysis jobs.
   * NewScoringIndexingExample -- Two full fetch cycles of commands using new 
scoring and indexing systems.
+  * [[GORA_HBase]] -- Configuring Nutch 2.0 with GORA and HBASE
  
  == Other Resources ==
   * [[http://nutch.sourceforge.net/blog/cutting.html|Doug's Weblog]] -- He's 
the one who originally wrote Lucene and Nutch.


Re: Nutch 2.0 Help

2010-09-08 Thread Julien Nioche
Hi guys,

I've summarized the steps to follow for having GORA+Hbase with Nutch 2.0 on
http://wiki.apache.org/nutch/GORA_HBase

Feel free to amend and improve as you see fit.

Please bear in mind that Nutch 2.0 is at a very early stage and is far from
being bug-proof, see in particular [1].

HTH

Julien

[1] https://issues.apache.org/jira/browse/NUTCH-893

-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


On 6 September 2010 13:35, Andrzej Bialecki a...@getopt.org wrote:

 On 2010-09-05 14:56, David Stuart wrote:

 Hi All,

 I have done as per below and can create a table from within the hbase
 shell. I found the appropriate create table method
 bin/nutch org.apache.nutch.storage.WebTableCreator webtable but it only
 returns null

 Any help would be great


 You don't have to create a table manually - this should happen
 automatically when you first run any Nutch tool. Just make sure you have
 hbase-site.xml on your classpath in Nutch - best if you put it in your conf/
 and rebuild, so that it's packed into a job jar.

 Here's for example my config files that work with HBase (I don't use any
 non-standard settings for HBase, so my hbase-site.xml has no properties, but
 still it needs to be included in Nutch job jar):

 gora-hbase-mapping.xml:
 -

 <gora-orm>

 <table name="webtable">
  <family name="p"/> <!-- This can also have params like compression, bloom filters -->
  <family name="f"/>
  <family name="s"/>
  <family name="il"/>
  <family name="ol"/>
  <family name="h"/>
  <family name="mtdt"/>
  <family name="mk"/>
 </table>

 <class table="webtable" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
  <!-- fetch fields -->
  <field name="baseUrl" family="f" qualifier="bas"/>
  <field name="status" family="f" qualifier="st"/>
  <field name="prevFetchTime" family="f" qualifier="pts"/>
  <field name="fetchTime" family="f" qualifier="ts"/>
  <field name="fetchInterval" family="f" qualifier="fi"/>
  <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
  <field name="reprUrl" family="f" qualifier="rpr"/>
  <field name="content" family="f" qualifier="cnt"/>
  <field name="contentType" family="f" qualifier="typ"/>
  <field name="protocolStatus" family="f" qualifier="prot"/>
  <field name="modifiedTime" family="f" qualifier="mod"/>

  <!-- parse fields -->
  <field name="title" family="p" qualifier="t"/>
  <field name="text" family="p" qualifier="c"/>
  <field name="parseStatus" family="p" qualifier="st"/>
  <field name="signature" family="p" qualifier="sig"/>
  <field name="prevSignature" family="p" qualifier="psig"/>

  <!-- score fields -->
  <field name="score" family="s" qualifier="s"/>
  <field name="headers" family="h"/>
  <field name="inlinks" family="il"/>
  <field name="outlinks" family="ol"/>
  <field name="metadata" family="mtdt"/>
  <field name="markers" family="mk"/>
 </class>

 </gora-orm>
 -

 nutch-site.xml:
 -
 ... blah blah, a lot of unrelated stuff...

 <property>
  <name>storage.data.store.class</name>
  <value>org.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
 </property>
 -

 Of course you need also to use the same hadoop files (hdfs-site and
 mapred-site) as the ones that HBase uses.


 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com



[jira] Updated: (NUTCH-901) Make index-more plug-in configurable

2010-09-08 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-901:


  Summary: Make index-more plug-in configurable  (was: Make 
index-more plug-in configurable
)
Fix Version/s: 2.0
Affects Version/s: 1.2
   2.0

Needs fixing in the trunk as well (v2.0)




[jira] Updated: (NUTCH-900) Confusion in nutch-default between http.content.limit and file.content.limit

2010-09-08 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-900:


Fix Version/s: 2.0
Affects Version/s: 2.0

To be fixed in the trunk as well




[jira] Assigned: (NUTCH-900) Confusion in nutch-default between http.content.limit and file.content.limit

2010-09-08 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reassigned NUTCH-900:
---

Assignee: Julien Nioche




[jira] Closed: (NUTCH-900) Confusion in nutch-default between http.content.limit and file.content.limit

2010-09-08 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-900.
---

Resolution: Fixed

Committed revision 994984 (trunk)
Committed revision 994985 (1.2)

Thanks!




[jira] Commented: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable

2010-09-08 Thread Andrey Sapegin (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907182#action_12907182
 ] 

Andrey Sapegin commented on NUTCH-407:
--

Please accept the original patch or find a better solution.
+1

 Make Nutch crawling parent directories for file protocol configurable
 -

 Key: NUTCH-407
 URL: https://issues.apache.org/jira/browse/NUTCH-407
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Thorsten Scherler
Assignee: Andrzej Bialecki 
 Attachments: 407.fix.diff


 http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg06698.html
 I am looking into fixing some very weird behavior of the file protocol.
 I am using 0.8.
 Researching this topic I found 
 http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg06536.html
 and
 http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
 I am on Ubuntu but I have the same problem that nutch is going down the
 tree (including parents) and not up (including children from the root
 url).
 Further, I would vote to make fetching parents optional and controlled by
 a property, so that I can choose whether I want this not very intuitive feature.




Re: Nutch 2.0 Help

2010-09-08 Thread Enis Soztutar
Hi,

I think we need to commit all the necessary files to Nutch so that it can
work out of the box for SQL, HBase and Cassandra. We can even write
commented-out entries in gora.properties, nutch-site.xml, etc. so that using
Nutch with a different backend becomes a configuration change. I will open an
issue to track this.

Cheers,
Enis




[jira] Created: (NUTCH-902) Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box

2010-09-08 Thread Enis Soztutar (JIRA)
Add all necessary files and configuration so that nutch can be used with 
different backends out-of-the-box
--

 Key: NUTCH-902
 URL: https://issues.apache.org/jira/browse/NUTCH-902
 Project: Nutch
  Issue Type: New Feature
  Components: documentation, storage
Affects Versions: nutchbase
Reporter: Enis Soztutar
Assignee: Enis Soztutar


As per the discussion in the mailing list and 
http://wiki.apache.org/nutch/GORA_HBase, it will be good to include all the 
necessary files and configuration. I propose that we maintain configuration for 
at least SQL, HBase and Cassandra. 

The following changes are needed:
conf/gora-sql-mapping.xml
conf/gora-hbase-mapping.xml
conf/gora-cassandra-mapping.xml
comments on nutch-default and ivy.xml 

Shall we also include the jars from gora-hbase and gora-cassandra, and their 
dependencies?
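
As a rough illustration of "switching backends is a configuration change", the sketch below maps a backend name to a store class and a mapping file and prints the matching nutch-site.xml property. Only the HBase class name and the three mapping file names come from this thread; the SQL and Cassandra class names, the helper class, and its methods are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BackendConfigSketch {
    /** Store class and mapping file for one backend. */
    static final class Backend {
        final String storeClass;
        final String mappingFile;

        Backend(String storeClass, String mappingFile) {
            this.storeClass = storeClass;
            this.mappingFile = mappingFile;
        }
    }

    static final Map<String, Backend> BACKENDS = new LinkedHashMap<>();
    static {
        BACKENDS.put("hbase", new Backend(
                "org.gora.hbase.store.HBaseStore", "conf/gora-hbase-mapping.xml"));
        // The next two class names are assumed, not confirmed in this thread.
        BACKENDS.put("sql", new Backend(
                "org.gora.sql.store.SqlStore", "conf/gora-sql-mapping.xml"));
        BACKENDS.put("cassandra", new Backend(
                "org.gora.cassandra.store.CassandraStore", "conf/gora-cassandra-mapping.xml"));
    }

    /** Renders the nutch-site.xml property that selects the given backend. */
    static String propertyFor(String backend) {
        Backend b = BACKENDS.get(backend);
        if (b == null) {
            throw new IllegalArgumentException("Unknown backend: " + backend);
        }
        return "<property>\n"
                + "  <name>storage.data.store.class</name>\n"
                + "  <value>" + b.storeClass + "</value>\n"
                + "</property>";
    }

    public static void main(String[] args) {
        System.out.println(propertyFor("hbase"));
        System.out.println("mapping: " + BACKENDS.get("hbase").mappingFile);
    }
}
```

Shipping all three mapping files plus commented-out property entries would give users the same effect without any code.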




[jira] Created: (NUTCH-903) RESUME_KEY field in FetcherJob.Java has not been get correctly

2010-09-08 Thread JIRA
RESUME_KEY field in FetcherJob.Java has not been get correctly
--

 Key: NUTCH-903
 URL: https://issues.apache.org/jira/browse/NUTCH-903
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 2.0
 Environment: nutch 2.0
Reporter: faruk berksöz
Priority: Minor
 Fix For: 2.0


Source modification request for nutch 2.0 .

FetcherJob.java
...
FetcherMapper

protected void setup(Context context) {
  Configuration conf = context.getConfiguration();
  shouldContinue = conf.getBoolean("job.continue", false);
  // "job.continue" has not been set anywhere.
  // "job.continue" should be RESUME_KEY, which is set earlier for this purpose.
  crawlId = new Utf8(conf.get(GeneratorJob.CRAWL_ID, Nutch.ALL_CRAWL_ID_STR));
}
...





[jira] Updated: (NUTCH-903) RESUME_KEY field in FetcherJob.Java has not been get correctly

2010-09-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

faruk berksöz updated NUTCH-903:


Description: 
Source modification request for nutch 2.0 .xx

FetcherJob.java
...
FetcherMapper

protected void setup(Context context) {
  Configuration conf = context.getConfiguration();
  shouldContinue = conf.getBoolean("job.continue", false);
  // "job.continue" has not been set anywhere.
  // "job.continue" should be RESUME_KEY, which is set earlier for this purpose.
  crawlId = new Utf8(conf.get(GeneratorJob.CRAWL_ID, Nutch.ALL_CRAWL_ID_STR));
}
...





[jira] Updated: (NUTCH-903) RESUME_KEY field in FetcherJob.Java has not been get correctly

2010-09-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

faruk berksöz updated NUTCH-903:


Description: 
Source modification request for nutch 2.0 .

FetcherJob.java
...
FetcherMapper

protected void setup(Context context) {
  Configuration conf = context.getConfiguration();
  shouldContinue = conf.getBoolean("job.continue", false);
  // "job.continue" has not been set anywhere.
  // "job.continue" should be RESUME_KEY, which is set earlier for this purpose.
  crawlId = new Utf8(conf.get(GeneratorJob.CRAWL_ID, Nutch.ALL_CRAWL_ID_STR));
}
...






[jira] Closed: (NUTCH-903) RESUME_KEY field in FetcherJob.Java has not been get correctly

2010-09-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

faruk berksöz closed NUTCH-903.
---

Resolution: Fixed

I'm so sorry... 
The description is not readable, and I don't know why. I am closing this issue and opening a new one.





[jira] Created: (NUTCH-904) -resume option is always processed as false in FetcherJob.

2010-09-08 Thread JIRA
-resume option is always processed  as false in FetcherJob.
---

 Key: NUTCH-904
 URL: https://issues.apache.org/jira/browse/NUTCH-904
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 2.0
 Environment: Nutch 2.0
Reporter: faruk berksöz
 Fix For: 2.0


The "job.continue" key has not been set anywhere.
The lookup should use RESUME_KEY, which is set earlier for this purpose.

{code:title=FetcherJob.java|borderStyle=solid}
   ...
   FetcherMapper

   protected void setup(Context context) {
     Configuration conf = context.getConfiguration();
     shouldContinue = conf.getBoolean("job.continue", false);
     crawlId = new Utf8(conf.get(GeneratorJob.CRAWL_ID, Nutch.ALL_CRAWL_ID_STR));
   }
   ...
{code}
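
The effect of the mismatch can be reproduced without Hadoop. The sketch below is only an illustration: a plain map stands in for Hadoop's Configuration, the helper mirrors the getBoolean(name, defaultValue) lookup, and the string assigned to RESUME_KEY is an assumed placeholder rather than a value confirmed in this report.

```java
import java.util.HashMap;
import java.util.Map;

final class ResumeKeySketch {
    // Assumed placeholder value; the real constant is defined in FetcherJob.
    static final String RESUME_KEY = "fetcher.job.resume";

    /** Minimal stand-in for a getBoolean(name, defaultValue) configuration lookup. */
    static boolean getBoolean(Map<String, String> conf, String key, boolean defaultValue) {
        String value = conf.get(key);
        return value == null ? defaultValue : Boolean.parseBoolean(value);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // What passing -resume is expected to do: store the flag under RESUME_KEY.
        conf.put(RESUME_KEY, "true");

        // Reading a key that nobody writes always yields the default.
        System.out.println(getBoolean(conf, "job.continue", false));
        // Reading the key that was written sees the flag.
        System.out.println(getBoolean(conf, RESUME_KEY, false));
    }
}
```

Under these assumptions the fix is a one-line change in setup: read RESUME_KEY instead of the literal "job.continue".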




[jira] Commented: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable

2010-09-08 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907275#action_12907275
 ] 

Chris A. Mattmann commented on NUTCH-407:
-

Hmmm: I agree here. If no one objects in the next 48 hours, I would like to:

1. open a new issue and link it to this one
2. commit the fix for NUTCH-407. 

My rationale here is that:

a. the behavior is configurable (it can be turned on or off)
b. it supports at least a few users and their specific use cases
c. this issue hasn't really been intensely debated or looked at for a few years 
now
d. the patch is backwards compatible, because the default behavior is the current 
behavior (but can be overridden per a.)
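
Points a. and d. can be pictured with a small sketch. It is not the attached patch: the class, the method, and the crawlParents flag are hypothetical, and it assumes the file protocol turns a directory listing into links that include a link to the parent directory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ParentLinkSketch {
    /**
     * Builds the links emitted for a directory listing.
     * crawlParents stands in for a hypothetical boolean property;
     * true keeps the assumed current behavior of linking to the parent.
     */
    static List<String> outlinks(String dirUrl, List<String> children, boolean crawlParents) {
        List<String> links = new ArrayList<>();
        if (crawlParents) {
            links.add(dirUrl + "../");
        }
        for (String child : children) {
            links.add(dirUrl + child);
        }
        return links;
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList("a.txt", "sub/");
        System.out.println(outlinks("file:/data/docs/", children, true));
        System.out.println(outlinks("file:/data/docs/", children, false));
    }
}
```

With the flag defaulting to true nothing changes for existing users, while false keeps a crawl inside the directory tree given in the seed URL.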






[jira] Updated: (NUTCH-904) -resume option is always processed as false in FetcherJob.

2010-09-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

faruk berksöz updated NUTCH-904:


Attachment: NUTCH-904.patch

patch




[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes

2010-09-08 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907297#action_12907297
 ] 

Andrzej Bialecki  commented on NUTCH-893:
-

Very good catch - yes, the test now passes for me too. This is actually good 
news for Gora :) I'll continue digging regarding NUTCH-879 ... don't hesitate 
if you have any ideas how to solve that. I suspect we may be losing keys in 
Generator or Fetcher, due to partitioning collisions but this hypothesis needs 
to be tested.

 DataStore.put() silently loses records when executed from multiple processes
 

 Key: NUTCH-893
 URL: https://issues.apache.org/jira/browse/NUTCH-893
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.0
 Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 
 1.6
Reporter: Andrzej Bialecki 
Priority: Blocker
 Fix For: 2.0

 Attachments: NUTCH-893.patch, NUTCH-893_v2.patch


 In order to debug the issue described in NUTCH-879 I created a test to 
 simulate multiple clients appending to webtable (please see the patch), which 
 is the situation that we have in distributed map-reduce jobs.
 There are two tests there: one that uses multiple threads within the same 
 JVM, and another that uses a single thread in each of multiple JVMs. Each test 
 first clears webtable (be careful!), then puts a bunch of pages, and finally 
 checks that all are present and that their values correspond to their keys. To 
 make things more interesting, each execution context (thread or process) closes 
 and reopens its instance of DataStore a few times.
 The multithreaded test passes just fine. However, the multi-process test 
 fails with missing keys, as many as 30%.
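
The shape of the multithreaded variant can be sketched as follows. This is not the test from the attached patch: a ConcurrentHashMap stands in for webtable, the class and method names are hypothetical, and the close-and-reopen step of the real test is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class MultiWriterSketch {
    /** Writes keys from several threads, then counts keys that are missing or hold a wrong value. */
    static int missingAfterConcurrentPuts(int threads, int perThread) {
        Map<String, String> store = new ConcurrentHashMap<>(); // stand-in for webtable
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            Thread worker = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    String key = "key-" + id + "-" + i;
                    store.put(key, "value-of-" + key);
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        // Verify that every key is present and that its value corresponds to the key.
        int missing = 0;
        for (int t = 0; t < threads; t++) {
            for (int i = 0; i < perThread; i++) {
                String key = "key-" + t + "-" + i;
                if (!("value-of-" + key).equals(store.get(key))) {
                    missing++;
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        System.out.println("missing keys: " + missingAfterConcurrentPuts(4, 1000));
    }
}
```

The multi-process variant would run the same write loop in separate JVMs against a shared store and perform the verification once all of them have exited.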
