Hi Sujen,

strange: other Kafka classes are found (ConfigDef, ProducerConfig,
ConfigException), but DefaultPartitioner, which is contained in the same jar,
is not. I need to check in the Kafka code how the partitioner classes are
loaded. It could be that Kafka loads classes dynamically in a way that is
incompatible with Nutch's plugin class loader design. We had a similar problem
with the Rome library, cf. NUTCH-1494 and NUTCH-1893, and
https://github.com/rometools/rome/issues/130
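
If that hypothesis holds, one possible workaround could look like the minimal
sketch below. It assumes (not yet verified against the Kafka code) that Kafka
resolves partitioner.class through the thread context class loader, so the
plugin temporarily installs its own class loader while the producer is
constructed. The names ContextClassLoaderSketch and callWith are hypothetical
and not part of Nutch or Kafka.

```java
import java.util.function.Supplier;

/** Hypothetical helper for illustration; not part of Nutch or Kafka. */
final class ContextClassLoaderSketch {

  /**
   * Runs the action with the given loader installed as the thread context
   * class loader and restores the previous loader afterwards, even if the
   * action throws.
   */
  static <T> T callWith(ClassLoader loader, Supplier<T> action) {
    Thread thread = Thread.currentThread();
    ClassLoader previous = thread.getContextClassLoader();
    thread.setContextClassLoader(loader);
    try {
      return action.get();
    } finally {
      thread.setContextClassLoader(previous);
    }
  }

  public static void main(String[] args) {
    // Assumption: inside a plugin this would be the loader of the plugin
    // class itself, e.g. KafkaPublisherImpl.class.getClassLoader().
    ClassLoader pluginLoader = ContextClassLoaderSketch.class.getClassLoader();
    String seen = callWith(pluginLoader,
        () -> String.valueOf(Thread.currentThread().getContextClassLoader()));
    System.out.println("context class loader during call: " + seen);
  }
}
```

In the plugin, the producer construction in setConfig would be wrapped in such
a call. Whether this actually resolves the error depends on how the Kafka
version in use looks up the class.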

> Thanks for your help and comments on https://github.com/apache/nutch/pull/152

Thanks for your patience. Keeping the base class path (without plugins) as lean
as possible is important. Dependency conflicts are ugly to resolve; see
NUTCH-2316 for a recent problem.

Thanks,
Sebastian

On 09/26/2016 04:17 AM, Sujen Shah wrote:
> Hi Sebastian, 
> 
> Here is the complete log trace from the hadoop.log file:
> 
> 2016-09-25 19:14:08,455 INFO  fetcher.FetchItemQueues - Using queue mode : byHost
> 2016-09-25 19:14:08,455 INFO  fetcher.Fetcher - Fetcher: threads: 50
> 2016-09-25 19:14:08,455 INFO  fetcher.Fetcher - Fetcher: time-out divisor: 2
> 2016-09-25 19:14:08,459 INFO  fetcher.QueueFeeder - QueueFeeder finished: total 3 records + hit by time limit :0
> 2016-09-25 19:14:08,559 INFO  net.URLExemptionFilters - Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> 2016-09-25 19:14:08,570 INFO  fetcher.FetcherThreadPublisher - Setting up publishers
> 2016-09-25 19:14:08,587 WARN  mapred.LocalJobRunner - job_local1447446310_0001
> java.lang.Exception: java.lang.ExceptionInInitializerError
> at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
> Caused by: java.lang.ExceptionInInitializerError
> at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:188)
> at org.apache.nutch.publisher.kafka.KafkaPublisherImpl.setConfig(KafkaPublisherImpl.java:70)
> at org.apache.nutch.publisher.NutchPublishers.setConfig(NutchPublishers.java:44)
> at org.apache.nutch.fetcher.FetcherThreadPublisher.<init>(FetcherThreadPublisher.java:40)
> at org.apache.nutch.fetcher.FetcherThread.<init>(FetcherThread.java:174)
> at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:213)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
> at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.kafka.common.config.ConfigException: Invalid value org.apache.kafka.clients.producer.internals.DefaultPartitioner for configuration partitioner.class: Class org.apache.kafka.clients.producer.internals.DefaultPartitioner could not be found.
> at org.apache.kafka.common.config.ConfigDef.parseType(ConfigDef.java:672)
> at org.apache.kafka.common.config.ConfigDef.define(ConfigDef.java:110)
> at org.apache.kafka.common.config.ConfigDef.define(ConfigDef.java:132)
> at org.apache.kafka.common.config.ConfigDef.define(ConfigDef.java:171)
> at org.apache.kafka.common.config.ConfigDef.define(ConfigDef.java:333)
> at org.apache.kafka.common.config.ConfigDef.define(ConfigDef.java:346)
> at org.apache.kafka.clients.producer.ProducerConfig.<clinit>(ProducerConfig.java:222)
> ... 14 more
> 2016-09-25 19:14:09,346 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
> at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:484)
> at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:519)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:493)
> 
> Thanks for your help and comments on https://github.com/apache/nutch/pull/152.
> 
> On Sun, Sep 25, 2016 at 2:54 AM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> 
>     Hi Sujen,
> 
>     could you send the complete stack trace? Just to be sure where the error stems from.
> 
>     > I looked at the code here
>     > https://github.com/apache/nutch/blob/master/src/bin/nutch#L155-L164
>     > and cannot understand the use of lines 161-163: if the plugins folder is
>     > found, add the home directory to the classpath?
> 
>     In a local installation $NUTCH_HOME ("runtime/local") is added to the
>     classpath because the folder "plugins" defined in the property
>     "plugin.folders" is located here ("runtime/local/plugins"), see:
> 
>     <property>
>       <name>plugin.folders</name>
>       <value>plugins</value>
>       <description>Directories where nutch plugins are located.  Each
>       element may be a relative or absolute path.  If absolute, it is used
>       as is.  If relative, it is searched for on the classpath.</description>
>     </property>
> 
>     See also my comments on https://github.com/apache/nutch/pull/152
> 
>     Sebastian
> 
> 
>     On 09/23/2016 12:06 AM, Sujen Shah wrote:
>     > Thank you Sebastian for your response.
>     >
>     > I followed the steps as per your suggestion and added the required jars
>     > under runtime in plugin.xml. My code is at
>     > https://github.com/sujen1412/nutch/blob/kafka/src/plugin/publish-kafka/plugin.xml
>     >
>     > Now, after compiling and running ./bin/crawl in local mode, the fetch job fails due to
>     >
>     > Caused by: org.apache.kafka.common.config.ConfigException: Invalid value
>     > org.apache.kafka.clients.producer.internals.DefaultPartitioner for configuration partitioner.class:
>     > Class org.apache.kafka.clients.producer.internals.DefaultPartitioner could not be found.
>     >
>     > Am I missing something ?
>     >
>     > To find out the cause of this, I copied the jars from
>     > runtime/local/plugin/<some-plugin>/*.jar to the runtime/local/lib
>     > directory. The code then seems to work perfectly fine, which may imply
>     > that the jars listed under the runtime tag in plugin.xml are not getting
>     > added to the classpath at runtime.
>     >
>     > I looked at the code here
>     > https://github.com/apache/nutch/blob/master/src/bin/nutch#L155-L164
>     > and cannot understand the use of lines 161-163: if the plugins folder is
>     > found, add the home directory to the classpath?
>     > Looking into the various ways to set a classpath
>     > (https://docs.oracle.com/javase/8/docs/technotes/tools/windows/classpath.html#A1100762),
>     > it says that subdirectories are not searched recursively.
>     >
>     > Thanks once again for your help.
>     >
>     >
>     > On Wed, Sep 14, 2016 at 12:10 AM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
>     >
>     >     Hi Sujen,
>     >
>     >     are the jars also listed in the plugin.xml?
>     >
>     >     That's required. The plugin-specific ivy.xml is only used at compile time
>     >     to fetch the library and its dependencies and get the plugin compiled.
>     >
>     >     At runtime all required libs have to be listed in the plugin.xml, e.g.,
>     >     https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/plugin.xml
>     >
>     >     This double work is not ideal and a frequent cause of errors, but that's
>     >     how it works right now.
>     >
>     >     Cheers,
>     >     Sebastian
>     >
>     >
>     >     On 09/12/2016 11:56 PM, Sujen Shah wrote:
>     >     > Hi Devs,
>     >     >
>     >     > I am facing issues in loading jars required for plugins while running Nutch in local mode.
>     >     >
>     >     > I am doing the following :
>     >     > 1. add a dependency in <some-plugin>/ivy.xml
>     >     > 2. ant clean runtime
>     >     >
>     >     > Now, when I print the classpath before running, the /bin/nutch script does not seem to be
>     >     > adding those jars to the classpath and throws runtime exceptions. To mitigate this, I added
>     >     > the dependency in the root ivy.xml.
>     >     >
>     >     > I don't know if I am missing something here or if anyone else has faced the same issue and
>     >     > found a solution.
>     >     > For example, for https://github.com/apache/nutch/tree/master/src/plugin/publish-rabbitmq,
>     >     > the dependency for amqp-client had to be added in the root ivy.xml as well for it to not
>     >     > throw runtime exceptions (e.g., ClassNotFound).
>     >     >
>     >     > I have created a patch which modifies the ./bin/nutch script to load the plugin jars onto
>     >     > the classpath; it is attached below. This patch eliminates the need to modify the root
>     >     > ivy.xml for plugin-specific dependencies.
>     >     >
>     >     > I wanted to ask the devs first if there was already a solution before filing a JIRA
>     >     > issue. If not, I'll submit it through JIRA.
>     >     >
>     >     > Thank you for your help.
>     >     >
>     >     >
>     >     > Regards,
>     >     > Sujen Shah
>     >
>     >
> 
> 
