[ 
https://issues.apache.org/jira/browse/NUTCH-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216604#comment-16216604
 ] 

ASF GitHub Bot commented on NUTCH-2429:
---------------------------------------

sebastian-nagel commented on issue #222: NUTCH-2429 Fix Plugin System to allow 
protocol plugins to bundle their URLStreamHandlers
URL: https://github.com/apache/nutch/pull/222#issuecomment-338931121
 
 
   Hi @HiranChaudhuri, sorry for the late review. First, some trivial points:
   - could you also add license headers to Foo.java and Handler.java?
   - Foo.java has \r\n line breaks; it should consistently use only \n
   - maybe also make `foo://example.com/` (ending with a slash) work, not only 
`foo://example.com`?
   - I would second Lewis: debug output which just reflects the call stack 
shouldn't be there. You can use a debugger for that. Thanks!
   
   I took the time to run a test crawl and was able to index foo://example.com 
into Solr! That's promising. Other things failed: for example, without the 
PluginRepository, the readdb tool fails when run with the option `-stats`:
   ```
   java.lang.Exception: java.net.MalformedURLException: unknown protocol: foo
           at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
           at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
   Caused by: java.net.MalformedURLException: unknown protocol: foo
           at java.net.URL.<init>(URL.java:600)
           at java.net.URL.<init>(URL.java:490)
           at java.net.URL.<init>(URL.java:439)
           at org.apache.nutch.crawl.CrawlDbReader$CrawlDbStatMapper.map(CrawlDbReader.java:210)
           at org.apache.nutch.crawl.CrawlDbReader$CrawlDbStatMapper.map(CrawlDbReader.java:182)
           at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
   ```
   I fixed this error and a few more in 
[/sebastian-nagel/nutch/tree/NUTCH-2429](/sebastian-nagel/nutch/tree/NUTCH-2429).
 Could you cherry-pick the fix from there? Thanks! Also, parsechecker works now:
   `bin/nutch parsechecker -Dplugin.includes='protocol-foo|parse-html' foo://example.com`
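
   For context on the `unknown protocol: foo` error above, here is a minimal 
sketch of the underlying JDK behavior, independent of Nutch: `java.net.URL` 
rejects a scheme for which no `URLStreamHandler` can be found, and accepts it 
once a handler is registered. The names `FooProtocolSketch` and `FooHandler` 
are hypothetical and are not the classes from this pull request; how the plugin 
system actually registers its handlers is not shown here.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

public class FooProtocolSketch {

    /** Returns true if java.net.URL accepts the given string. */
    static boolean parses(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            // e.g. "unknown protocol: foo" when no handler is available
            return false;
        }
    }

    /** Hypothetical stand-in handler: enough to parse foo: URLs, not to fetch them. */
    static class FooHandler extends URLStreamHandler {
        @Override
        protected URLConnection openConnection(URL u) throws IOException {
            throw new IOException("foo: connections are not implemented in this sketch");
        }
    }

    /** Registers a JVM-wide factory; the JDK allows this call only once per JVM. */
    static void registerFooHandler() {
        URL.setURLStreamHandlerFactory(new URLStreamHandlerFactory() {
            @Override
            public URLStreamHandler createURLStreamHandler(String protocol) {
                // Returning null lets the JDK fall back to its built-in handlers.
                return "foo".equals(protocol) ? new FooHandler() : null;
            }
        });
    }

    public static void main(String[] args) {
        System.out.println(parses("foo://example.com"));   // before registration
        registerFooHandler();
        System.out.println(parses("foo://example.com"));   // after registration
        System.out.println(parses("foo://example.com/"));  // trailing-slash variant
    }
}
```

   Because the factory is a JVM-wide singleton, any code path that constructs a 
`java.net.URL` before registration has happened would still see the exception, 
which matches the symptom in the readdb stack trace.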

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers
> -----------------------------------------------------------------------------
>
>                 Key: NUTCH-2429
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2429
>             Project: Nutch
>          Issue Type: Improvement
>          Components: commoncrawl
>    Affects Versions: 1.14
>         Environment: Tested on both Nutch 1.13 and 1.14 in Ubuntu Linux with 
> OpenJDK 1.8.
>            Reporter: Hiran Chaudhuri
>             Fix For: 1.14
>
>
> While trying to use the protocol-smb plugin (which is not part of the Nutch 
> distribution) I realized there are four steps to successfully make use of a 
> protocol plugin:
> 1 - put the artifact into the plugins directory
> 2 - modify Nutch configuration files to allow smb:// URLs and add the 
> plugin to the list of loaded plugins
> 3 - extract jcifs.jar and place it on the system classpath
> 4 - run nutch with the correct system property
> While steps 1 and 2 seem obvious, 3 and 4 require knowledge of plugin 
> internals which does not feel right for nutch and plugin users. Even more, 
> the jcifs.jar would exist twice on the classpath and could even cause further 
> problems during runtime.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
