[
https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182029#comment-14182029
]
Mengying Wang commented on NUTCH-1483:
--------------------------------------
Hey everyone, I am following this tutorial
https://wiki.apache.org/nutch/IntranetDocumentSearch to crawl local XML files.
However, no document is indexed in my Solr. Have I missed something? Thank you
for your help!
My configuration is as follows:
(1) File: conf/regex-urlfilter.txt
a. Comment out the line preventing file:// URLs
#-^(file|ftp|mailto):
b. Add the line skipping http: URLs.
-^(http|ftp|mailto|https):
c. Use the regular expression for the paths I would like to index.
+^file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
(2) File: conf/nutch-site.xml
Enable the protocol-file plugin.
<property>
<name>plugin.includes</name>
<value>protocol-file|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata|more)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>......</description>
</property>
(3) File: conf/regex-normalize.xml
Comment out the rule that removes duplicate slashes.
<!-- removes duplicate slashes
<regex>
<pattern>(?&lt;!:)/{2,}</pattern>
<substitution>/</substitution>
</regex>
-->
(4) File: urls/local-fs
file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/monitor.xml
Then I start the crawler with the command:
./crawl urls crawlId http://localhost:8983/solr/collection1 2
There is no error in my terminal console or in the Nutch log file (hadoop.log).
However, no document is indexed in my Solr.
Below is my console log:
https://docs.google.com/document/d/1Io90JhnRCIZ-2KhGChCxz7qi69Zeq-Iot4yPbvzPR44/edit?usp=sharing
By the way, I have successfully followed this tutorial
http://wiki.apache.org/nutch/NutchTutorial to crawl a website and index the
metadata in Solr. My Nutch version is 1.9, and my Solr version is 4.10. Any
ideas so far? Thank you!
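
The filter rules from step (1) can be checked offline before running a crawl. The sketch below is only an approximation: it assumes the regex URL filter applies rules top to bottom, lets the first matching rule decide, and drops a URL that matches no rule. It uses Python's `re` module in place of the Java regex engine that Nutch uses, and it models only the two rules quoted in the comment, not the other default rules in conf/regex-urlfilter.txt. The names `RULES` and `accepts` are hypothetical.

```python
import re

# Rules in the order given in step (1): "-" rejects, "+" accepts.
RULES = [
    ("-", r"^(http|ftp|mailto|https):"),
    ("+", r"^file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/"),
]

def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is dropped

seed = ("file:/Users/AngelaWang/Documents/programs/oodt/"
        "cas-curator/staging/products/xml/monitor.xml")

print(accepts(seed))                                      # prints True
print(accepts("http://localhost:8983/solr/collection1"))  # prints False
# The same path written with three slashes does not match the "+" rule.
print(accepts(seed.replace("file:/", "file:///", 1)))     # prints False
```

Under these assumptions the seed URL from step (4) passes the two quoted rules, so the filter lines shown in the comment would not by themselves explain the empty index.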
> Can't crawl filesystem with protocol-file plugin
> ------------------------------------------------
>
> Key: NUTCH-1483
> URL: https://issues.apache.org/jira/browse/NUTCH-1483
> Project: Nutch
> Issue Type: Bug
> Components: protocol
> Affects Versions: 1.6, 2.1
> Environment: OpenSUSE 12.1, OpenJDK 1.6.0, HBase 0.90.4
> Reporter: Rogério Pereira Araújo
> Priority: Critical
> Fix For: 2.3, 1.10
>
> Attachments: TestProtocolFileUrlUri.java
>
>
> I tried to follow the same steps described in this wiki page:
> http://wiki.apache.org/nutch/IntranetDocumentSearch
> I made all required changes on regex-urlfilter.txt and added the following
> entry in my seed file:
> file:///home/rogerio/Documents/
> The permissions are OK: I'm running Nutch as the user that owns the folder,
> so Nutch has all the required permissions. Unfortunately, I'm getting the
> following error:
> org.apache.nutch.protocol.file.FileError: File Error: 404
>     at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
>     at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
> fetch of file://home/rogerio/Documents/ failed with:
> org.apache.nutch.protocol.file.FileError: File Error: 404
> Why are the logs showing file://home/rogerio/Documents/ instead of
> file:///home/rogerio/Documents/?
> Note: The regex-urlfilter entry only works as expected if I add the entry
> +^file://home/rogerio/Documents/ instead of +^file:///home/rogerio/Documents/
> as the wiki says.
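
One possible explanation for file:///home/rogerio/Documents/ showing up as file://home/rogerio/Documents/ is the duplicate-slash rule in conf/regex-normalize.xml, the same rule that step (3) of the comment above disables. The sketch below applies that rule's pattern, (?<!:)/{2,} once the XML entity is unescaped, with substitution "/". It uses Python's `re` module as a stand-in for the Java regex engine and does not run Nutch itself, so it illustrates the regex only and does not confirm the cause of this issue. The name `normalize` and the example.com URL are hypothetical.

```python
import re

# Pattern from the "removes duplicate slashes" rule: two or more
# slashes that are not immediately preceded by a colon.
DUPLICATE_SLASHES = re.compile(r"(?<!:)/{2,}")

def normalize(url):
    """Collapse each run matched by the rule into a single slash."""
    return DUPLICATE_SLASHES.sub("/", url)

# The first slash follows ":", so the lookbehind protects only that one;
# the second and third slashes are matched and collapsed.
print(normalize("file:///home/rogerio/Documents/"))
# prints file://home/rogerio/Documents/

# The effect the rule is meant to have on an http URL (hypothetical host).
print(normalize("http://example.com//a"))
# prints http://example.com/a

# A single-slash file: URL contains no duplicate slashes and is unchanged.
seed = ("file:/Users/AngelaWang/Documents/programs/oodt/"
        "cas-curator/staging/products/xml/monitor.xml")
print(normalize(seed) == seed)
# prints True
```

If this rule is active, the regex-urlfilter entry would see the two-slash form, which would be consistent with the note that only +^file://home/rogerio/Documents/ matches.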