Dear Sebastian,

1: Actually, /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/ is not a "real" path: cas-curator is a symbolic link to the real folder cas-curator-0.6.
$ greadlink -f /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml

Oh my god! When I change my original symbolic-link path to the "real" path, Nutch can now crawl the local files into my Solr. Many thanks! Sebastian, you helped a lot! Thank you!

2: Also, I have applied your new patch file, and the java.lang.NullPointerException error has disappeared completely. Amazing! Thank you!

$ ./nutch parsechecker "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/"
fetching: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
parsing: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
contentType: text/html
signature: 17bdb44990391c96bb8d48d1802ff11c
---------
Url
---------------

file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
---------
ParseData
---------

Version: 5
Status: success(1,0)
Title: Index of /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
Outlinks: 2
  outlink: toUrl: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/ anchor: ../
  outlink: toUrl: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/monitor.xml anchor: monitor.xml
Content Metadata: Content-Length=352 nutch.crawl.score=0.0 Last-Modified=Tue, 14 Oct 2014 20:05:50 GMT Content-Type=text/html
Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252

$ ./nutch indexchecker "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/"
fetching: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
parsing: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
contentType: text/html
content : Index of /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml Index of /Us
id : file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
title : Index of /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
host :
digest : 17bdb44990391c96bb8d48d1802ff11c
tstamp : Wed Oct 29 21:54:00 PDT 2014
url : file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/

3: Something is wrong with this tutorial: https://wiki.apache.org/nutch/IntranetDocumentSearch. To index local files into Solr, we also need to enable the "indexer-solr" plugin in conf/nutch-site.xml, which is not mentioned there. Please add it too, so future users can easily follow it step by step.

Best,
Mengying (Angela) Wang

On Mon, Oct 27, 2014 at 4:29 PM, Sebastian Nagel wrote:

> Hi,
>
> thanks for testing!
>
> 1. is
> /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> the "real" path. I.e., are there no symbolic links in the path?
> The command
> readlink -f /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> should show you whether this is the case or not.
> Because Parse objects results are stored by "real" path in the ParseResult
> this may cause a NPE, when there is no ParseResult available per original path.
>
> 2. unhappily, the log output is ambiguous. there are two places in ParserChecker where
> exceptions are catched with the same log message.
> Can you apply the attached patch and test again? Just to get more verbose log messages.
> If you have time, please, open a Jira to improve the logging in this case.
>
> Thanks,
> Sebastian
>
> On 10/26/2014 02:24 AM, Mengying Wang wrote:
> > Hi Sebastian,
> >
> > I have downloaded the Nutch source code from github (https://github.com/apache/nutch), applied the
> > patches (NUTCH-1879 and NUTCH-1880), and then reinstalled the Nutch.
> > Now the good news is that all urls contain only 1 slash. But unfortunately,
> > java.lang.NullPointerException warning/error occurs
> > for both of the parsechecker and indexchecker commands.
> >
> > Below is the running log:
> >
> > $ ./nutch parsechecker "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/"
> > fetching: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > parsing: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > contentType: text/html
> > signature: 17bdb44990391c96bb8d48d1802ff11c
> > Couldn't pass score, url file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/ (java.lang.NullPointerException)
> > ---------
> > Url
> > ---------------
> >
> > file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> > ---------
> > ParseData
> > ---------
> >
> > Version: 5
> > Status: success(1,0)
> > Title: Index of /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
> > Outlinks: 2
> >   outlink: toUrl: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/ anchor: ../
> >   outlink: toUrl: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/monitor.xml anchor: monitor.xml
> > Content Metadata: Content-Length=352 nutch.crawl.score=0.0 Last-Modified=Tue, 14 Oct 2014 20:05:50 GMT Content-Type=text/html
> > Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252
> >
> >
> > $ ./nutch indexchecker "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/"
> > fetching: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > parsing: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > contentType: text/html
> > Exception in thread "main" java.lang.NullPointerException
> >   at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:139)
> >   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >   at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:177)
> >
> > Thanks.
> > Mengying (Angela) Wang

--
Best,
Mengying (Angela) Wang
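
P.S. On point 1 (a symbolic link in the crawled path): a minimal sketch of resolving the real directory before handing the URL to Nutch, assuming a POSIX shell. The name real_file_url is hypothetical and not part of Nutch. It relies on `cd -P` and `pwd -P` rather than GNU readlink, so it does not need greadlink, and it only handles directories, not single files.

```shell
# Hypothetical helper (not part of Nutch): print a single-slash file: URL
# for the physical, symlink-free location of a directory.
real_file_url() {
  # cd -P follows symbolic links to the physical directory;
  # pwd -P prints that directory without any symlink components.
  # The subshell keeps the caller's working directory unchanged.
  dir=$(cd -P -- "$1" 2>/dev/null && pwd -P) || return 1
  case $dir in
    /) printf 'file:/\n' ;;            # avoid emitting "file://" for the root
    *) printf 'file:%s/\n' "$dir" ;;
  esac
}
```

With the symlink described above, `./nutch parsechecker "$(real_file_url /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml)"` should pass the cas-curator-0.6 form of the URL. The function returns a non-zero status if the directory does not exist.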

