run Tika GUI via Nutch

2011-11-01 Thread Ahmad Ajiloo
Hello
We can run the Tika GUI by running the Nutch source in Eclipse, because we can
select the org.apache.tika.gui.TikaGUI class in a Run Configuration.
But is there any way to run the Tika GUI via the application mode of Nutch?
I changed some of the Tika code and want to test whether Nutch still works
correctly with these changes.


Re: run Tika GUI via Nutch

2011-11-01 Thread Ahmad Ajiloo
I solved the problem.


On Tue, Nov 1, 2011 at 3:24 PM, Ahmad Ajiloo ahmad.aji...@gmail.com wrote:

 Hello
 We can run Tika GUI by running Nutch source in Eclipse, because we are
 allowed to run org.apache.tika.gui.TikaGUI class in Run Configuration.
 But Is there any solution to run Tika GUI via the application mode of Nutch?
 I changed some codes of Tika and want to test whether Nutch is true with
 this changes or no.



[jira] [Created] (NUTCH-1187) Port NUTCH-1028 to nutchgora - log parser keys

2011-11-01 Thread Ferdy (Created) (JIRA)
Port NUTCH-1028 to nutchgora - log parser keys
--

 Key: NUTCH-1187
 URL: https://issues.apache.org/jira/browse/NUTCH-1187
 Project: Nutch
  Issue Type: Sub-task
  Components: parser
Reporter: Ferdy
Priority: Trivial


This task is to port NUTCH-1028 to nutchgora - log parser keys. Very trivial, 
will attach patch and commit right away.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1104) Port issues from 1.x to trunk

2011-11-01 Thread Ferdy (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141101#comment-13141101
 ] 

Ferdy commented on NUTCH-1104:
--

Ok. Btw could you rename this issue to reflect the recent trunk to nutchgora 
branch move?

 Port issues from 1.x to trunk
 -

 Key: NUTCH-1104
 URL: https://issues.apache.org/jira/browse/NUTCH-1104
 Project: Nutch
  Issue Type: Task
Affects Versions: nutchgora
Reporter: Markus Jelsma
 Fix For: nutchgora


 A new issue to track issues that have not yet been ported from 1.x to trunk:
 NUTCH-987
 NUTCH-1028
 NUTCH-1036
 NUTCH-1057
 NUTCH-1067
 NUTCH-1101
 NUTCH-1102
 NUTCH-1105
 NUTCH-940
 NUTCH-1094 create comprehensive documentation for Nutch 2.0 trunk





[jira] [Updated] (NUTCH-1104) Port issues from trunk NutchGora branch

2011-11-01 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1104:
-

Description: 
Umbrella issue for tracking issues that should be ported from 1.x trunk to the 
NutchGora branch. Please mark ported issues by modifying this description.

NOT YET PORTED:

* NUTCH-987 Support HTTP auth for Solr communication
* NUTCH-1028 Log parser keys
* NUTCH-1036 Solr jobs should increment counters in Reporter
* NUTCH-1057 Make fetcher thread time out configurable
* NUTCH-1067 Configure minimum throughput for fetcher
* NUTCH-1101 Options to purge db_gone records in updatedb
* NUTCH-1102 Fetcher, rely on fetcher.parse directive only
* NUTCH-1105 MaxContentLength option for index-basic
* NUTCH-940 Static field plugin
* NUTCH-1094 create comprehensive documentation for Nutch 2.0 trunk


PORTED:
* No issues yet


NOT GOING TO BE PORTED:
* No issues, explain why it should not be ported




Summary: Port issues from trunk NutchGora branch  (was: Port issues 
from 1.x to trunk)





[jira] [Updated] (NUTCH-1187) Port NUTCH-1028 to nutchgora - log parser keys

2011-11-01 Thread Ferdy (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy updated NUTCH-1187:
-

Attachment: NUTCH-1187.patch

This patch logs the key for the parser. It uses INFO level and changes the 
surrounding DEBUG logs to INFO. This ensures that every parse is logged exactly 
once, in whichever of the four scenarios applies:
-Skipped because of different id.
-Skipped because already parsed.
-Forced parse of already parsed.
-Regular parsing.
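
The four scenarios above can be sketched as one decision function that emits exactly one INFO message per key. This is only an illustration and not the code in NUTCH-1187.patch: the class ParseLogSketch, the method decide, its parameters and the message texts are all hypothetical.

```java
import java.util.logging.Logger;

/** Hypothetical sketch: one INFO message per key, whichever branch is taken. */
class ParseLogSketch {
  private static final Logger LOG = Logger.getLogger("ParseLogSketch");

  /** Logs and returns the single message chosen for this key. */
  static String decide(String key, boolean sameId, boolean alreadyParsed, boolean force) {
    String msg;
    if (!sameId) {
      msg = "Skipping " + key + "; different id";
    } else if (alreadyParsed && !force) {
      msg = "Skipping " + key + "; already parsed";
    } else if (alreadyParsed) {
      msg = "Forced parsing " + key + "; already parsed";
    } else {
      msg = "Parsing " + key;
    }
    // Single logging call, so no scenario is reported twice or not at all.
    LOG.info(msg);
    return msg;
  }

  public static void main(String[] args) {
    decide("http://example.com/", true, false, false);
  }
}
```

Keeping the logging call outside the branches is one way to guarantee the "logged only once" property described above.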

 Port NUTCH-1028 to nutchgora - log parser keys
 --

 Key: NUTCH-1187
 URL: https://issues.apache.org/jira/browse/NUTCH-1187
 Project: Nutch
  Issue Type: Sub-task
  Components: parser
Reporter: Ferdy
Priority: Trivial
 Fix For: nutchgora

 Attachments: NUTCH-1187.patch


 This task is to port NUTCH-1028 to nutchgora - log parser keys. Very trivial, 
 will attach patch and commit right away.





[jira] [Updated] (NUTCH-1187) Port NUTCH-1028 to nutchgora - log parser keys

2011-11-01 Thread Ferdy (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy updated NUTCH-1187:
-

Patch Info: Patch Available





Setting properties in gora.properties

2011-11-01 Thread Lewis John Mcgibbney
Hi,

I'm currently trying to complete NUTCH-902 and GORA-39 and kill two birds
with the one stone, however I've uprooted some more nasties which I'm now
trying to address. When configuring Nutchgora with Cassandra I'm getting
the following

lewis@lewis-01:~/ASF/nutchgora/runtime/local$ bin/nutch inject urls crawldb
InjectorJob: starting
InjectorJob: urlDir: urls
InjectorJob: org.apache.gora.util.GoraException: java.io.IOException: java.io.IOException: Property with base name servers could not be found, make sure to include this property in gora.properties file
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
    at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:59)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
    at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
Caused by: java.io.IOException: java.io.IOException: Property with base name servers could not be found, make sure to include this property in gora.properties file
    at org.apache.gora.cassandra.store.CassandraStore.readMapping(CassandraStore.java:462)
    at org.apache.gora.cassandra.store.CassandraStore.initialize(CassandraStore.java:91)
    at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:81)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:104)
    ... 7 more
Caused by: java.io.IOException: Property with base name servers could not be found, make sure to include this property in gora.properties file
    at org.apache.gora.store.DataStoreFactory.findPropertyOrDie(DataStoreFactory.java:254)
    at org.apache.gora.cassandra.store.CassandraStore.createClient(CassandraStore.java:394)
    at org.apache.gora.cassandra.store.CassandraStore.readMapping(CassandraStore.java:425)
    ... 10 more

Can someone please explain a bit about what kind of properties we
can/should add to gora.properties for cassandra setup. I've tried editing
gora.properties as follows with no luck

#gora.sqlstore.jdbc.driver=org.hsqldb.jdbcDriver
#gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
servers=localhost/127.0.0.1:9160

If there are any resources people are aware of on the net then I'll begin
getting my head around them.

Thanks in advance

Lewis


-- 
*Lewis*
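
The error text speaks of a property with the base name "servers", which suggests that the bare key "servers" is not what gets looked up. Below is a minimal sketch of a prefixed lookup, assuming the key is tried first as gora.<store name>.<base name> and then as gora.datastore.<base name>. The class GoraPropertySketch, the method findProperty and both key prefixes are assumptions made for illustration; none of them is confirmed in this thread.

```java
import java.util.Properties;

/** Hypothetical model of a prefixed property lookup; not Gora's actual code. */
class GoraPropertySketch {

  /**
   * Tries "gora.<storeName>.<baseName>" first, then "gora.datastore.<baseName>".
   * Both prefixes are assumptions. Returns null when neither key is present.
   */
  static String findProperty(Properties props, String storeName, String baseName) {
    String value = props.getProperty("gora." + storeName + "." + baseName);
    if (value == null) {
      value = props.getProperty("gora.datastore." + baseName);
    }
    return value;
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    // A bare key, as tried in the message above, is never consulted by this lookup.
    props.setProperty("servers", "localhost:9160");
    System.out.println(findProperty(props, "cassandrastore", "servers"));

    // A prefixed key (hypothetical name) is found.
    props.setProperty("gora.cassandrastore.servers", "localhost:9160");
    System.out.println(findProperty(props, "cassandrastore", "servers"));
  }
}
```

If the real lookup works along these lines, the fix would be to prefix the key in gora.properties rather than to change its value.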


[jira] [Created] (NUTCH-1188) ERROR util.LogUtil - Cannot log with method [null]

2011-11-01 Thread Zhang JinYan (Created) (JIRA)
ERROR util.LogUtil - Cannot log with method [null]
--

 Key: NUTCH-1188
 URL: https://issues.apache.org/jira/browse/NUTCH-1188
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.4
 Environment: no special environment
Reporter: Zhang JinYan


LogUtil has static fields, which are initialized like this:
FATAL = Logger.class.getMethod("error", new Class[] { Object.class });
but Logger has no such method; the correct method is:
void org.slf4j.Logger.error(String msg)
So LogUtil's static fields are not initialized correctly (they are null).
Run a crawl and you will find this message in hadoop.log:

2011-11-01 22:38:14,442 ERROR util.LogUtil - Cannot log with method [null]
java.lang.NullPointerException
at org.apache.nutch.util.LogUtil$1.flush(LogUtil.java:103)
at java.io.PrintStream.write(PrintStream.java:432)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:85)
at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:168)
at java.io.PrintStream.newLine(PrintStream.java:496)
at java.io.PrintStream.println(PrintStream.java:757)
at java.lang.Throwable.printStackTraceAsCause(Throwable.java:492)
at java.lang.Throwable.printStackTrace(Throwable.java:468)
at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:197)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:665)







[jira] [Updated] (NUTCH-1188) ERROR util.LogUtil - Cannot log with method [null]

2011-11-01 Thread Zhang JinYan (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang JinYan updated NUTCH-1188:


Description: 
LogUtil has static fields, which are initialized like this:
FATAL = Logger.class.getMethod("error", new Class[] { Object.class });
but Logger has no such method; the correct method is:
void org.slf4j.Logger.error(String msg)
So LogUtil's static fields are not initialized correctly (they are null).
---
Run a crawl and you will find this message in hadoop.log:
2011-11-01 22:38:14,442 ERROR util.LogUtil - Cannot log with method [null]
java.lang.NullPointerException
at org.apache.nutch.util.LogUtil$1.flush(LogUtil.java:103)
at java.io.PrintStream.write(PrintStream.java:432)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:85)
at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:168)
at java.io.PrintStream.newLine(PrintStream.java:496)
at java.io.PrintStream.println(PrintStream.java:757)
at java.lang.Throwable.printStackTraceAsCause(Throwable.java:492)
at java.lang.Throwable.printStackTrace(Throwable.java:468)
at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:197)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:665)


Patch:
FATAL = Logger.class.getMethod("error", new Class[] { String.class });
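
The reason the parameter type matters is that Class.getMethod matches parameter types exactly, so asking for error(Object) on a type that only declares error(String) throws NoSuchMethodException. A minimal sketch follows, using a hypothetical StringLogger interface as a stand-in for the real logger so that it runs without slf4j. Returning null from the catch block is an assumption that mirrors the null fields described above.

```java
import java.lang.reflect.Method;

class GetMethodSketch {
  /** Hypothetical stand-in for a logger that only declares error(String). */
  interface StringLogger {
    void error(String msg);
  }

  /** Returns the Method, or null when no method has exactly this parameter type. */
  static Method lookup(Class<?> paramType) {
    try {
      return StringLogger.class.getMethod("error", paramType);
    } catch (NoSuchMethodException e) {
      // Swallowing the exception leaves the caller holding null.
      return null;
    }
  }

  public static void main(String[] args) {
    System.out.println(lookup(Object.class)); // no error(Object) is declared
    System.out.println(lookup(String.class)); // matches error(String)
  }
}
```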


[jira] [Updated] (NUTCH-1188) ERROR util.LogUtil - Cannot log with method [null]

2011-11-01 Thread Zhang JinYan (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang JinYan updated NUTCH-1188:


Patch Info: Patch Available





[jira] [Updated] (NUTCH-1188) ERROR util.LogUtil - Cannot log with method [null]

2011-11-01 Thread Zhang JinYan (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang JinYan updated NUTCH-1188:


Attachment: LogUtil.patch

patch for the bug





[jira] [Reopened] (NUTCH-902) Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box

2011-11-01 Thread Lewis John McGibbney (Reopened) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reopened NUTCH-902:



Reopened as Cassandra configurations in ivy/ivy.xml are not complete.

 Add all necessary files and configuration so that nutch can be used with 
 different backends out-of-the-box
 --

 Key: NUTCH-902
 URL: https://issues.apache.org/jira/browse/NUTCH-902
 Project: Nutch
  Issue Type: New Feature
  Components: documentation, storage
Affects Versions: nutchbase
Reporter: Enis Soztutar
Assignee: Lewis John McGibbney
 Fix For: nutchgora

 Attachments: NUTCH-902-v2.patch, NUTCH-902.patch


 As per the discussion in the mailing list and 
 http://wiki.apache.org/nutch/GORA_HBase, it will be good to include all the 
 necessary files and configuration. I propose that we maintain configuration 
 for at least SQL, HBase and Cassandra. 
 The following changes are needed:
 conf/gora-sql-mapping.xml
 conf/gora-hbase-mapping.xml
 conf/gora-cassandra-mapping.xml
 comments on nutch-default and ivy.xml 
 Shall we also include jars from gora-hbase, gora-cassandra and their 
 dependencies ? 





[jira] [Commented] (NUTCH-1189) add commented out default settings to gora.properties files

2011-11-01 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141238#comment-13141238
 ] 

Lewis John McGibbney commented on NUTCH-1189:
-

Ferdy, would it be possible for you to attach a patch for HBase (if required)? 
I will work on the Cassandra stuff, then hopefully we can knock our heads 
together with some others to get the remaining back ends included within the 
gora.properties file.

 add commented out default settings to gora.properties files 
 

 Key: NUTCH-1189
 URL: https://issues.apache.org/jira/browse/NUTCH-1189
 Project: Nutch
  Issue Type: Sub-task
  Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: nutchgora


 This issue should have been dealt with as part of its parent issue; however, 
 I think that as it is a fairly large task in itself, it needs to be done 
 independently. The gora.properties file should, among other settings and 
 besides the very basic defaults for sqlstore, include defaults for connecting 
 to HBase, Cassandra, etc. servers on their default ports. Leaving this down 
 to individual interpretation puts a huge onus on the user, and so 
 creates a barrier to entry for getting the configuration settings up and 
 running.





[jira] [Commented] (NUTCH-1188) ERROR util.LogUtil - Cannot log with method [null]

2011-11-01 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141240#comment-13141240
 ] 

Lewis John McGibbney commented on NUTCH-1188:
-

Thank you for this patch. In the short term, when we get one other +1, I would 
like to commit. Can I ask you to have a look @ NUTCH-1138 and comment on 
whether the patch is any use for your activities. It is our vision to remove 
LogUtil and use the Slf4j/Log4j framework for all logging.
Thank you very much for this patch.





[jira] [Created] (NUTCH-1190) MoreIndexingFilter refactor: move data formats used to parse lastModified to a config file.

2011-11-01 Thread Zhang JinYan (Created) (JIRA)
MoreIndexingFilter refactor: move data formats used to parse lastModified to 
a config file.
-

 Key: NUTCH-1190
 URL: https://issues.apache.org/jira/browse/NUTCH-1190
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.4
 Environment: jdk6
Reporter: Zhang JinYan


There are many issues about missing date formats:
[NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
[NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
[NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]

The date formats can be diverse, so why not move them to an extra 
config file?
I moved all the date formats from MoreIndexingFilter.java to a file named 
date-styles.txt, which is loaded on startup.
{code}
  public void setConf(Configuration conf) {
    this.conf = conf;
    MIME = new MimeUtil(conf);

    // Load the date formats from date-styles.txt.
    URL res = conf.getResource("date-styles.txt");
    if (res == null) {
      LOG.error("Can't find resource: date-styles.txt");
    } else {
      try {
        List lines = FileUtils.readLines(new File(res.getFile()));
        for (int i = 0; i < lines.size(); i++) {
          String dateStyle = (String) lines.get(i);
          // Drop blank lines.
          if (StringUtils.isBlank(dateStyle)) {
            lines.remove(i);
            i--;
            continue;
          }
          dateStyle = StringUtils.trim(dateStyle);
          // Drop comment lines.
          if (dateStyle.startsWith("#")) {
            lines.remove(i);
            i--;
            continue;
          }
          lines.set(i, dateStyle);
        }
        dateStyles = new String[lines.size()];
        lines.toArray(dateStyles);
      } catch (IOException e) {
        LOG.error("Failed to load resource: date-styles.txt");
      }
    }
  }
{code}
Then parse lastModified like this (sample):
{code}
  private long getTime(String date, String url) {
    ..
    Date parsedDate = DateUtils.parseDate(date, dateStyles);
    time = parsedDate.getTime();
    ..
    return time;
  }
{code}
This patch also contains the patch for 
[NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140].
Find more details in the patch file.
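
The approach described above (filter the lines of the config file, then try each pattern in turn) can be sketched with the JDK alone. This is not the code from MoreIndexingFilter.patch: it swaps in java.text.SimpleDateFormat for the DateUtils.parseDate call, and the class DateStylesSketch, its method names, and the fixed GMT time zone and US locale are assumptions made for the example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical sketch of loading date patterns and trying each one in order. */
class DateStylesSketch {

  /** Keeps trimmed lines that are neither blank nor comments starting with '#'. */
  static List<String> loadStyles(List<String> lines) {
    List<String> styles = new ArrayList<String>();
    for (String line : lines) {
      String style = line.trim();
      if (style.isEmpty() || style.startsWith("#")) {
        continue;
      }
      styles.add(style);
    }
    return styles;
  }

  /** Returns epoch milliseconds from the first pattern that parses, or -1 if none does. */
  static long parse(String date, List<String> styles) {
    for (String style : styles) {
      SimpleDateFormat format = new SimpleDateFormat(style, Locale.US);
      format.setTimeZone(TimeZone.getTimeZone("GMT")); // assumption: no zone means GMT
      format.setLenient(false);
      try {
        return format.parse(date).getTime();
      } catch (ParseException e) {
        // This pattern did not match; fall through to the next one.
      }
    }
    return -1L;
  }

  public static void main(String[] args) {
    List<String> styles = loadStyles(Arrays.asList(
        "# date formats, one per line", "", "  yyyy-MM-dd HH:mm:ss  "));
    System.out.println(parse("2011-11-01 22:38:14", styles));
  }
}
```

Building a fresh list instead of removing entries while iterating avoids the index bookkeeping (i--) needed in the loop above.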






[jira] [Updated] (NUTCH-1190) MoreIndexingFilter refactor: move data formats used to parse lastModified to a config file.

2011-11-01 Thread Zhang JinYan (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang JinYan updated NUTCH-1190:


Attachment: date-styles.txt
MoreIndexingFilter.patch





[jira] [Updated] (NUTCH-1190) MoreIndexingFilter refactor: move data formats used to parse lastModified to a config file.

2011-11-01 Thread Zhang JinYan (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang JinYan updated NUTCH-1190:


Description: 
There are many issues about missing date formats:
[NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
[NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
[NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]

The date formats can be diverse, so why not move them to a separate 
config file?
I moved all the date formats from MoreIndexingFilter.java to a file named 
date-styles.txt (placed in conf), which is loaded on startup.
{code}
  public void setConf(Configuration conf) {
    this.conf = conf;
    MIME = new MimeUtil(conf);

    URL res = conf.getResource("date-styles.txt");
    if (res == null) {
      LOG.error("Can't find resource: date-styles.txt");
    } else {
      try {
        List lines = FileUtils.readLines(new File(res.getFile()));
        for (int i = 0; i < lines.size(); i++) {
          String dateStyle = (String) lines.get(i);
          // Drop blank lines.
          if (StringUtils.isBlank(dateStyle)) {
            lines.remove(i);
            i--;
            continue;
          }
          dateStyle = StringUtils.trim(dateStyle);
          // Drop comment lines, which start with '#'.
          if (dateStyle.startsWith("#")) {
            lines.remove(i);
            i--;
            continue;
          }
          lines.set(i, dateStyle);
        }
        dateStyles = new String[lines.size()];
        lines.toArray(dateStyles);
      } catch (IOException e) {
        LOG.error("Failed to load resource: date-styles.txt");
      }
    }
  }
{code}
Then parse lastModified like this(sample):
{code}
  private long getTime(String date, String url) {
..
Date parsedDate = DateUtils.parseDate(date, dateStyles);
time = parsedDate.getTime();
..
return time;
  }
{code}
This patch also contains the patch for 
[NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140].
Find more details in the patch file.


  was:
There many issues about missing date format:
[NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
[NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
[NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]

The data formats can be diverse, so why not move those data formats to a extra 
config file?
I move all the data formats from MoreIndexingFilter.java to a file named 
date-styles.txt, which will be load on startup.
{code}
  public void setConf(Configuration conf) {
    this.conf = conf;
    MIME = new MimeUtil(conf);

    URL res = conf.getResource("date-styles.txt");
    if (res == null) {
      LOG.error("Can't find resource: date-styles.txt");
    } else {
      try {
        List lines = FileUtils.readLines(new File(res.getFile()));
        for (int i = 0; i < lines.size(); i++) {
          String dateStyle = (String) lines.get(i);
          if (StringUtils.isBlank(dateStyle)) {
            lines.remove(i);
            i--;
            continue;
          }
          dateStyle = StringUtils.trim(dateStyle);
          if (dateStyle.startsWith("#")) {
            lines.remove(i);
            i--;
            continue;
          }
          lines.set(i, dateStyle);
        }
        dateStyles = new String[lines.size()];
        lines.toArray(dateStyles);
      } catch (IOException e) {
        LOG.error("Failed to load resource: date-styles.txt");
      }
    }
  }
{code}
Then parse lastModified like this(sample):
{code}
  private long getTime(String date, String url) {
..
Date parsedDate = DateUtils.parseDate(date, dateStyles);
time = parsedDate.getTime();
..
return time;
  }
{code}
This path also contains the path of 
[NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140].
Find more details in the patch file.



 MoreIndexingFilter refactor: move data formats used to parse lastModified 
 to a config file.
 -

 Key: NUTCH-1190
 URL: https://issues.apache.org/jira/browse/NUTCH-1190
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.4
 Environment: jdk6
Reporter: Zhang JinYan
 Attachments: MoreIndexingFilter.patch, date-styles.txt



[jira] [Commented] (NUTCH-1188) ERROR util.LogUtil - Cannot log with method [null]

2011-11-01 Thread Julien Nioche (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141267#comment-13141267
 ] 

Julien Nioche commented on NUTCH-1188:
--

+1 to commit. See corresponding class in branch nutchgora
Thanks

 ERROR util.LogUtil - Cannot log with method [null]
 --

 Key: NUTCH-1188
 URL: https://issues.apache.org/jira/browse/NUTCH-1188
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.4
 Environment: no special environment
Reporter: Zhang JinYan
 Attachments: LogUtil.patch


 LogUtil has static fields, which are initialized like this:
 FATAL = Logger.class.getMethod("error", new Class[] { Object.class });
 but Logger has no such method; the correct method is:
 void org.slf4j.Logger.error(String msg)
 So LogUtil's static fields are not initialized correctly (they are null).
 ---
 Run a crawl and you will find this message in hadoop.log:
 2011-11-01 22:38:14,442 ERROR util.LogUtil - Cannot log with method [null]
 java.lang.NullPointerException
   at org.apache.nutch.util.LogUtil$1.flush(LogUtil.java:103)
   at java.io.PrintStream.write(PrintStream.java:432)
   at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
   at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
   at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:85)
   at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:168)
   at java.io.PrintStream.newLine(PrintStream.java:496)
   at java.io.PrintStream.println(PrintStream.java:757)
   at java.lang.Throwable.printStackTraceAsCause(Throwable.java:492)
   at java.lang.Throwable.printStackTrace(Throwable.java:468)
   at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:197)
   at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:665)
 
 Patch:
 FATAL = Logger.class.getMethod("error", new Class[] { String.class });
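
The lookup failure described above can be reproduced with a small sketch that uses only the JDK. MiniLogger is a hypothetical stand-in for org.slf4j.Logger, reduced to the single error(String) overload named in the report, and the lookup helper that returns null on failure is an assumption about how the static fields presumably end up null; it is not the actual LogUtil code.

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for org.slf4j.Logger with only error(String).
interface MiniLogger {
  void error(String msg);
}

class ReflectionLookupSketch {

  /** Returns the public method with exactly these parameter types, or null. */
  static Method lookup(Class<?> type, String name, Class<?>... parameterTypes) {
    try {
      // getMethod matches parameter types exactly: Object.class does not
      // match a method that is declared as error(String).
      return type.getMethod(name, parameterTypes);
    } catch (NoSuchMethodException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    // Lookup as in the report: no error(Object) exists on the interface.
    System.out.println(lookup(MiniLogger.class, "error", Object.class));
    // Lookup as in the proposed patch: error(String) exists.
    System.out.println(lookup(MiniLogger.class, "error", String.class));
  }
}
```

Invoking a Method reference that is null would then fail with a NullPointerException, which is consistent with the stack trace quoted above.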





[jira] [Created] (NUTCH-1191) Port NUTCH-1102 to nutchgora - consistent use of fetcher.parse

2011-11-01 Thread Ferdy Galema (Created) (JIRA)
Port NUTCH-1102 to nutchgora - consistent use of fetcher.parse
--

 Key: NUTCH-1191
 URL: https://issues.apache.org/jira/browse/NUTCH-1191
 Project: Nutch
  Issue Type: Sub-task
Reporter: Ferdy Galema
 Fix For: nutchgora








[jira] [Updated] (NUTCH-902) Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box

2011-11-01 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-902:
---

Attachment: NUTCH-902-v3.patch

patch to include previous config changes to NUTCHGORA/ivy/ivy.xml

 Add all necessary files and configuration so that nutch can be used with 
 different backends out-of-the-box
 --

 Key: NUTCH-902
 URL: https://issues.apache.org/jira/browse/NUTCH-902
 Project: Nutch
  Issue Type: New Feature
  Components: documentation, storage
Affects Versions: nutchbase
Reporter: Enis Soztutar
Assignee: Lewis John McGibbney
 Fix For: nutchgora

 Attachments: NUTCH-902-v2.patch, NUTCH-902-v3.patch, NUTCH-902.patch


 As per the discussion in the mailing list and 
 http://wiki.apache.org/nutch/GORA_HBase, it will be good to include all the 
 necessary files and configuration. I propose that we maintain configuration 
 for at least SQL, HBase and Cassandra. 
 The following changes are needed:
 conf/gora-sql-mapping.xml
 conf/gora-hbase-mapping.xml
 conf/gora-cassandra-mapping.xml
 comments on nutch-default and ivy.xml 
 Shall we also include jars from gora-hbase, gora-cassandra and their 
 dependencies ? 





[jira] [Updated] (NUTCH-1191) Port NUTCH-1102 to nutchgora - consistent use of fetcher.parse

2011-11-01 Thread Ferdy Galema (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1191:


Attachment: NUTCH-1191.patch

The patch replaces all uses of the 'parse' argument with the 'fetcher.parse' 
property and sets its default to FALSE throughout the code (there was still a 
reference that used TRUE).

Tested with both TRUE and FALSE and it works like a charm. Will commit when 
there are no objections.
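
A minimal sketch of what "consistent use" could look like: one property key and one default, read through a single helper. java.util.Properties stands in for the Hadoop Configuration object here, and the class and method names are hypothetical; this is not the code from NUTCH-1191.patch.

```java
import java.util.Properties;

// Hypothetical helper; Properties stands in for the Hadoop Configuration.
class FetcherParseSketch {
  static final String FETCHER_PARSE_KEY = "fetcher.parse";
  static final boolean FETCHER_PARSE_DEFAULT = false;

  /** Single place that decides whether the fetcher also parses. */
  static boolean isParsing(Properties conf) {
    String value = conf.getProperty(
        FETCHER_PARSE_KEY, String.valueOf(FETCHER_PARSE_DEFAULT));
    return Boolean.parseBoolean(value);
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    System.out.println(isParsing(conf)); // property unset: default applies
    conf.setProperty(FETCHER_PARSE_KEY, "true");
    System.out.println(isParsing(conf));
  }
}
```

Keeping the default in one constant avoids the situation mentioned above, where one call site still fell back to TRUE.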

 Port NUTCH-1102 to nutchgora - consistent use of fetcher.parse
 --

 Key: NUTCH-1191
 URL: https://issues.apache.org/jira/browse/NUTCH-1191
 Project: Nutch
  Issue Type: Sub-task
Reporter: Ferdy Galema
 Fix For: nutchgora

 Attachments: NUTCH-1191.patch








[jira] [Updated] (NUTCH-1191) Port NUTCH-1102 to nutchgora - consistent use of fetcher.parse

2011-11-01 Thread Ferdy Galema (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1191:


Component/s: fetcher

 Port NUTCH-1102 to nutchgora - consistent use of fetcher.parse
 --

 Key: NUTCH-1191
 URL: https://issues.apache.org/jira/browse/NUTCH-1191
 Project: Nutch
  Issue Type: Sub-task
  Components: fetcher
Reporter: Ferdy Galema
 Fix For: nutchgora

 Attachments: NUTCH-1191.patch








[jira] [Commented] (NUTCH-1188) ERROR util.LogUtil - Cannot log with method [null]

2011-11-01 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141300#comment-13141300
 ] 

Lewis John McGibbney commented on NUTCH-1188:
-

Is it just me, or has this already been committed along with NUTCH-1078 in 
trunk [1], when Julien fixed it in the nutchgora branch [2]?

[1] 
http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/util/LogUtil.java?r1=1175075&r2=1177290&diff_format=h
[2] 
http://svn.apache.org/viewvc/nutch/branches/nutchgora/src/java/org/apache/nutch/util/LogUtil.java?r1=983885&r2=988544&diff_format=h

 ERROR util.LogUtil - Cannot log with method [null]
 --

 Key: NUTCH-1188
 URL: https://issues.apache.org/jira/browse/NUTCH-1188
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.4
 Environment: no special environment
Reporter: Zhang JinYan
 Attachments: LogUtil.patch


 LogUtil has static fields, which are initialized like this:
 FATAL = Logger.class.getMethod("error", new Class[] { Object.class });
 but Logger has no such method; the correct method is:
 void org.slf4j.Logger.error(String msg)
 So LogUtil's static fields are not initialized correctly (they are null).
 ---
 Run a crawl and you will find this message in hadoop.log:
 2011-11-01 22:38:14,442 ERROR util.LogUtil - Cannot log with method [null]
 java.lang.NullPointerException
   at org.apache.nutch.util.LogUtil$1.flush(LogUtil.java:103)
   at java.io.PrintStream.write(PrintStream.java:432)
   at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
   at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
   at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:85)
   at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:168)
   at java.io.PrintStream.newLine(PrintStream.java:496)
   at java.io.PrintStream.println(PrintStream.java:757)
   at java.lang.Throwable.printStackTraceAsCause(Throwable.java:492)
   at java.lang.Throwable.printStackTrace(Throwable.java:468)
   at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:197)
   at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:665)
 
 Patch:
 FATAL = Logger.class.getMethod("error", new Class[] { String.class });





[jira] [Issue Comment Edited] (NUTCH-1138) remove LogUtil from trunk and nutch gora

2011-11-01 Thread Zhang JinYan (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141334#comment-13141334
 ] 

Zhang JinYan edited comment on NUTCH-1138 at 11/1/11 5:13 PM:
--

Applied the patch to branch-1.4 and rebuilt with: ant clean build.
Configured the crawl with these seed URLs:
{quote}
http://172.16.123.123/bbs/viewthread.php?tid=12345
http://172.16.123.123/bbs/attachment.php?aid=12345
http://www.jettycn.com/
{quote}

The first two sites are not available.
Ran the crawl with this command (platform: Windows):
{quote}
sh.exe ./bin/nutch crawl seedurl -dir crawldev -solr http://localhost:8983/solr/
{quote}

The crawl completed successfully. A query in the Solr admin returns:
{code:xml}
<result name="response" numFound="320" start="0"></result>
{code}

Searching hadoop.log for the word ERROR finds 3 results caused by:
{code}
java.net.ConnectException: Connection timed out: connect
{code}

Searching hadoop.log for the word Exception finds results like this:
{quote}
2011-11-02 00:39:01,821 INFO  httpclient.HttpMethodDirector - I/O exception 
(org.apache.commons.httpclient.NoHttpResponseException) caught when processing 
request: The server www.jettycn.com failed to respond
2011-11-02 00:39:01,821 INFO  httpclient.HttpMethodDirector - Retrying request
{quote}

So there is no exception related to your patch in hadoop.log.
The patch works fine with branch-1.4 for me.

  was (Author: yearn20m):
Apply the path to branch-1.4, rebuild with cmd: ant clean build.
Config to crawl websites:
{quote}
http://172.16.123.123/bbs/viewthread.php?tid=12345
http://172.16.123.123/bbs/attachment.php?aid=12345
http://www.jettycn.com/
{quote}

The previous two sites are not available.
Run crawl with cmd(platform windows):
{quote}
sh.exe ./bin/nutch crawl seedurl -dir crawldev -solr http://localhost:8983/solr/
{quote}

Complete the crawl successfully. Query in solr admin return:
{code:xml}
<result name="response" numFound="320" start="0"></result>
{code}

Check the hadoop.log, search word ERROR,find 3 results caused by:
{code}
java.net.ConnectException: Connection timed out: connect
{code}

Search word Exception, find results like this:
{quote}
2011-11-02 00:39:01,821 INFO  httpclient.HttpMethodDirector - I/O exception 
(org.apache.commons.httpclient.NoHttpResponseException) caught when processing 
request: The server www.jettycn.com failed to respond
2011-11-02 00:39:01,821 INFO  httpclient.HttpMethodDirector - Retrying request
{quote}

So there is no exception related your path in the hadoop.log.
The path work fine with branch-1.4 for me.
  
 remove LogUtil from trunk and nutch gora
 

 Key: NUTCH-1138
 URL: https://issues.apache.org/jira/browse/NUTCH-1138
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora, 1.5

 Attachments: Document1.txt, NUTCH-1138-trunk-20111023.patch


 This should move towards the removal of the LogUtil class from both codebases 
 as per comments in NUTCH-1078.





[jira] [Issue Comment Edited] (NUTCH-1138) remove LogUtil from trunk and nutch gora

2011-11-01 Thread Zhang JinYan (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141334#comment-13141334
 ] 

Zhang JinYan edited comment on NUTCH-1138 at 11/1/11 5:12 PM:
--

Applied the patch to branch-1.4 and rebuilt with: ant clean build.
Configured the crawl with these seed URLs:
{quote}
http://172.16.123.123/bbs/viewthread.php?tid=12345
http://172.16.123.123/bbs/attachment.php?aid=12345
http://www.jettycn.com/
{quote}

The first two sites are not available.
Ran the crawl with this command (platform: Windows):
{quote}
sh.exe ./bin/nutch crawl seedurl -dir crawldev -solr http://localhost:8983/solr/
{quote}

The crawl completed successfully. A query in the Solr admin returns:
{code:xml}
<result name="response" numFound="320" start="0"></result>
{code}

Checking hadoop.log for the word ERROR finds 3 results caused by:
{code}
java.net.ConnectException: Connection timed out: connect
{code}

Searching for the word Exception finds results like this:
{quote}
2011-11-02 00:39:01,821 INFO  httpclient.HttpMethodDirector - I/O exception 
(org.apache.commons.httpclient.NoHttpResponseException) caught when processing 
request: The server www.jettycn.com failed to respond
2011-11-02 00:39:01,821 INFO  httpclient.HttpMethodDirector - Retrying request
{quote}

So there is no exception related to your patch in hadoop.log.
The patch works fine with branch-1.4 for me.

  was (Author: yearn20m):
Apply the path to branch-1.4, rebuild with cmd: ant clean build.
Config to crawl websites:
{quote}
http://172.16.123.123/bbs/viewthread.php?tid=12345
http://172.16.123.123/bbs/attachment.php?aid=12345
http://www.jettycn.com/
{quote}

The previous two sites are not available.
Run crawl with cmd(platform windows):
{quote}
sh.exe ./bin/nutch crawl seedurl -dir crawldev -solr http://localhost:8983/solr/
{quote}

Complete the crawl successfully.Query int solr admin return:
{code:xml}
<result name="response" numFound="320" start="0"></result>
{code}

Check the hadoop.log, search word ERROR,find 3 results caused by:
{code}
java.net.ConnectException: Connection timed out: connect
{code}

Search word Exception, find results like this:
{quote}
2011-11-02 00:39:01,821 INFO  httpclient.HttpMethodDirector - I/O exception 
(org.apache.commons.httpclient.NoHttpResponseException) caught when processing 
request: The server www.jettycn.com failed to respond
2011-11-02 00:39:01,821 INFO  httpclient.HttpMethodDirector - Retrying request
{quote}

So there is no exception related your path in the hadoop.log.
The path work fine with branch-1.4 for me.
  
 remove LogUtil from trunk and nutch gora
 

 Key: NUTCH-1138
 URL: https://issues.apache.org/jira/browse/NUTCH-1138
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora, 1.5

 Attachments: Document1.txt, NUTCH-1138-trunk-20111023.patch


 This should move towards the removal of the LogUtil class from both codebases 
 as per comments in NUTCH-1078.





recrawl sites in nutch 1.3

2011-11-01 Thread mina
Hi, I want to re-crawl my sites every hour, and I wrote a script for this. I
edited some properties in nutch-site.xml, but my re-crawler fetches URLs only
3 times and after that it stops fetching. This means that my Nutch does not
update after 3 hours. These are my changes in nutch-site.xml:

<property>
  <name>db.fetch.interval.default</name>
  <value>30</value>
  <description>The default number of seconds between re-fetches of a page
(30 days).</description>
</property>
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  <description>The implementation of fetch schedule. DefaultFetchSchedule
simply adds the original fetchInterval to the last fetch time, regardless of
page changes.</description>
</property>
<property>
  <name>solr.commit.size</name>
  <value>10</value>
  <description>Defines the number of documents to send to Solr in a single
update batch. Decrease when handling very large documents to prevent Nutch
from running out of memory.</description>
</property>
<property>
  <name>db.fetch.interval.max</name>
  <value>36000</value>
  <description>The maximum number of seconds between re-fetches of a page
(90 days). After this period every page in the db will be re-tried, no
matter what is its status.</description>
</property>
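
Both interval properties are expressed in seconds, so the configured values do not match the "30 days" and "90 days" wording left in the descriptions. The sketch below only does the unit conversion, to show what the values above amount to; the class and method names are hypothetical, and it does not attempt to explain why fetching stops after three rounds.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper for converting fetch intervals (illustration only).
class FetchIntervalSketch {

  static long daysToSeconds(long days) {
    return TimeUnit.DAYS.toSeconds(days);
  }

  static double secondsToHours(long seconds) {
    return seconds / 3600.0;
  }

  public static void main(String[] args) {
    // What the descriptions mention: 30 days and 90 days, in seconds.
    System.out.println(daysToSeconds(30));
    System.out.println(daysToSeconds(90));
    // What the configured values amount to: 30 s and 36000 s, in hours.
    System.out.println(secondsToHours(30));
    System.out.println(secondsToHours(36000));
    // An hourly re-crawl expressed in seconds.
    System.out.println(TimeUnit.HOURS.toSeconds(1));
  }
}
```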


--
View this message in context: 
http://lucene.472066.n3.nabble.com/recrawl-sites-in-nutch-1-3-tp3470457p3470457.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.


[jira] [Commented] (NUTCH-1138) remove LogUtil from trunk and nutch gora

2011-11-01 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141376#comment-13141376
 ] 

Lewis John McGibbney commented on NUTCH-1138:
-

Hi. Current 1.4 development is located at the trunk area of the SVN area. Is 
this where the confusion is possibly stemming from?
When we make code commits, we are committing to the trunk 1.4 development, 
rather than the branch-1.4 development. The reasoning behind this can be seen 
on the latest announcement on the Nutch home page.

 remove LogUtil from trunk and nutch gora
 

 Key: NUTCH-1138
 URL: https://issues.apache.org/jira/browse/NUTCH-1138
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora, 1.5

 Attachments: Document1.txt, NUTCH-1138-trunk-20111023.patch


 This should move towards the removal of the LogUtil class from both codebases 
 as per comments in NUTCH-1078.
