[
https://issues.apache.org/jira/browse/NUTCH-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395120#comment-16395120
]
ASF GitHub Bot commented on NUTCH-2524:
---------------------------------------
sebastian-nagel closed pull request #291: NUTCH-2524
URL: https://github.com/apache/nutch/pull/291
This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:
As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):
diff --git a/src/bin/crawl b/src/bin/crawl
index 7a32be207..aeb4a144b 100755
--- a/src/bin/crawl
+++ b/src/bin/crawl
@@ -219,8 +219,19 @@ function __bin_nutch {
fi
}
+#check if directory exists on locally or on hdfs
+function __directory_exists {
+ if [[ "$mode" == local && -d "$1" ]]; then
+ return 0
+ elif [[ "$mode" == distributed ]] && hadoop fs -test -d "$1"; then
+ return 0
+ else
+ return 1
+ fi
+}
+
function __update_hostdb {
- if [[ -d "$CRAWL_PATH"/crawldb ]]; then
+ if __directory_exists "$CRAWL_PATH"/crawldb; then
echo "Updating HostDB"
__bin_nutch updatehostdb -crawldb "$CRAWL_PATH"/crawldb -hostdb "$CRAWL_PATH"/hostdb
fi
@@ -261,7 +272,7 @@ do
[[ $a -eq 1 ]] && __update_hostdb
# sitemap processing based on HostDB
- if [[ -d "$CRAWL_PATH"/hostdb ]]; then
+ if __directory_exists "$CRAWL_PATH"/hostdb; then
echo "Processing sitemaps based on hosts in HostDB"
__bin_nutch sitemap "$CRAWL_PATH"/crawldb -hostdb "$CRAWL_PATH"/hostdb -threads $NUM_THREADS
fi
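
For illustration, the helper added in the diff above can be tried outside the crawl script. The following is a minimal standalone sketch, assuming `mode` is set to either `local` or `distributed` as in bin/crawl. The default value for `mode`, the temporary directory, and the echo messages are illustrative assumptions and not part of the patch; the distributed branch additionally assumes a `hadoop` binary on the PATH.

```shell
#!/usr/bin/env bash
# Standalone sketch of the directory check from the diff above.
# Assumption: "mode" is "local" or "distributed", as in bin/crawl;
# defaulting to "local" here is only for this demo.
mode="${mode:-local}"

# Return 0 if the directory exists locally (local mode)
# or on HDFS (distributed mode), 1 otherwise.
function __directory_exists {
  if [[ "$mode" == local && -d "$1" ]]; then
    return 0
  elif [[ "$mode" == distributed ]] && hadoop fs -test -d "$1"; then
    return 0
  else
    return 1
  fi
}

# Hypothetical demo path, not taken from the original script.
demo_dir="$(mktemp -d)"
if __directory_exists "$demo_dir"; then
  echo "directory found"
fi
if ! __directory_exists "$demo_dir/missing"; then
  echo "directory not found"
fi
rmdir "$demo_dir"
```

Because the function only signals its result through the exit status, it can be used directly as the condition of an `if`, which is how the patched `__update_hostdb` and the sitemap step call it.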
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> bin/crawl: fix check for HostDb in distributed mode
> ---------------------------------------------------
>
> Key: NUTCH-2524
> URL: https://issues.apache.org/jira/browse/NUTCH-2524
> Project: Nutch
> Issue Type: Bug
> Components: bin
> Affects Versions: 1.15
> Reporter: Semyon Semyonov
> Priority: Major
> Fix For: 1.15
>
>
> In the crawl script you can find something like:
> if [[ -d "$CRAWL_PATH"/hostdb ]]; then
> echo "Processing sitemaps based on hosts in HostDB"
> __bin_nutch sitemap "$CRAWL_PATH"/crawldb -hostdb "$CRAWL_PATH"/hostdb -threads $NUM_THREADS
> fi
> The test [[ -d "$CRAWL_PATH"/hostdb ]] checks the local filesystem, so it works only in local mode, not for HDFS.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)