[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182029#comment-14182029 ]

Mengying Wang edited comment on NUTCH-1483 at 10/23/14 11:20 PM:
-----------------------------------------------------------------

Hey [~wastl-nagel], I am following this tutorial 
https://wiki.apache.org/nutch/IntranetDocumentSearch to crawl local XML files. 
I have also commented out the duplicate-slashes rule in regex-normalize.xml, 
and my seed URLs contain only a single slash (file:/...). However, still no 
document is indexed in my Solr. Have I missed something? Thank you for your 
help!
Below is my configuration:
(1) File: conf/regex-urlfilter.txt
a. Comment out the line that excludes file: (and ftp:, mailto:) URLs:
#-^(file|ftp|mailto):

b. Add a line that skips http:, https:, ftp: and mailto: URLs instead:
-^(http|ftp|mailto|https):

c. Add an accept rule for the path I want to index:
+^file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
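For reference, the resulting rule order in my regex-urlfilter.txt looks like 
this (as far as I understand, Nutch applies the first rule whose pattern 
matches a URL, so the deny rule for http: has to come before any catch-all 
accept such as the default +. at the end of the file):

# commented out so that file: URLs are no longer excluded
#-^(file|ftp|mailto):
# skip http:, https:, ftp: and mailto: URLs instead
-^(http|ftp|mailto|https):
# accept the local path to index (first matching rule wins)
+^file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/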

(2) File: conf/nutch-site.xml
Enable the protocol-file plugin in plugin.includes:
<property>
 <name>plugin.includes</name>
 
<value>protocol-file|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata|more)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
 <description>......</description>
</property>
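If it helps to narrow this down, I assume a single file can be fetched and 
parsed outside a full crawl with the parsechecker tool of the Nutch 1.x 
script, run from the same bin directory as ./crawl (my own debugging idea, 
not part of the tutorial):

./nutch parsechecker file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/monitor.xml

If that prints the parsed text and metadata, protocol-file and the parser are 
working, and the problem should be in filtering, normalization, or indexing.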

(3) File: conf/regex-normalize.xml 
Comment out the rule that removes duplicate slashes:
<!-- removes duplicate slashes
<regex>
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>
-->
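To check what the normalizers now do to the seed URL, I believe the 
URLNormalizerChecker class (it ships with Nutch 1.x and reads URLs from 
stdin) can be run through the same nutch script:

echo "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/monitor.xml" | ./nutch org.apache.nutch.net.URLNormalizerChecker

With the duplicate-slashes rule commented out, the URL should come back 
unchanged.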
 
(4) File: urls/local-fs
file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/monitor.xml
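Similarly, to confirm the seed URL actually passes the configured URL 
filters, I assume the URLFilterChecker class can be fed the seed file on 
stdin (a leading + in the output should mean the URL is accepted, a leading - 
that it is rejected):

./nutch org.apache.nutch.net.URLFilterChecker -allCombined < urls/local-fs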

Then I start the crawler with the command: 
./crawl urls crawlId http://localhost:8983/solr/collection1 2

There are no errors in the terminal console or in the Nutch log file 
(hadoop.log), but still no document is indexed in Solr.
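To see where the documents get lost, I assume the crawldb statistics can be 
dumped after the run (the second argument of the crawl script, crawlId here, 
should be the crawl directory, so the crawldb ends up in crawlId/crawldb):

./nutch readdb crawlId/crawldb -stats

If db_fetched is 0 there, the seed never got past injection/filtering; if it 
is non-zero, the problem is more likely on the parsing or indexing side.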

Below is my console log: 
https://docs.google.com/document/d/1Io90JhnRCIZ-2KhGChCxz7qi69Zeq-Iot4yPbvzPR44/edit?usp=sharing

By the way, I have successfully followed the tutorial 
http://wiki.apache.org/nutch/NutchTutorial to crawl a website and index its 
metadata in Solr, so my basic setup works. My Nutch version is 1.9 and my 
Solr version is 4.10. Any ideas so far? Thank you!
  


> Can't crawl filesystem with protocol-file plugin
> ------------------------------------------------
>
>                 Key: NUTCH-1483
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1483
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 1.6, 2.1
>         Environment: OpenSUSE 12.1, OpenJDK 1.6.0, HBase 0.90.4
>            Reporter: Rogério Pereira Araújo
>            Priority: Critical
>             Fix For: 2.3, 1.10
>
>         Attachments: TestProtocolFileUrlUri.java
>
>
> I tried to follow the same steps described in this wiki page:
> http://wiki.apache.org/nutch/IntranetDocumentSearch
> I made all required changes on regex-urlfilter.txt and added the following 
> entry in my seed file:
> file:///home/rogerio/Documents/
> The permissions are OK: I'm running Nutch as the same user that owns the 
> folder, so Nutch has all the required permissions. Unfortunately, I'm 
> getting the following error:
> org.apache.nutch.protocol.file.FileError: File Error: 404
>         at 
> org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
>         at 
> org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
> fetch of file://home/rogerio/Documents/ failed with: 
> org.apache.nutch.protocol.file.FileError: File Error: 404
> Why are the logs showing file://home/rogerio/Documents/ instead of 
> file:///home/rogerio/Documents/ ?
> Note: The regex-urlfilter entry only works as expected if I add the entry 
> +^file://home/rogerio/Documents/ instead of +^file:///home/rogerio/Documents/ 
> as the wiki says.


