Dear Sebastian,

Done. I added a comment to the Jira issue https://issues.apache.org/jira/browse/NUTCH-1483 explaining why I could not crawl the filesystem using protocol-file, and how to solve it. I also mentioned the new Jira issue https://issues.apache.org/jira/browse/NUTCH-1884, which is actually not a "real" bug.

Thanks. :)
Best,
Mengying (Angela) Wang

On Thu, Oct 30, 2014 at 11:32 AM, Sebastian Nagel <[email protected]> wrote:

> Hi Mengying,
>
> great!
>
> > When I change my original "symbolic virtual" path to the "real" path, the Nutch could
> > crawl the local files
>
> In fact, path normalization is good here, otherwise you could end up with many
> duplicates. But the protocol-file plugin could make this more explicit.
> Could think about treating such paths as redirects: that's conceptually close.
>
> > 2: Also I have applied your new patch file, and the java.lang.NullPointerException error totally
> > disappears. Amazing! Thank you!
>
> Perfect!
>
> If you have the time, please, open Jiras for the two problems.
> If not, let me know, and I'll do this.
>
> Thanks for testing!
>
> Best,
> Sebastian
>
> On 10/30/2014 06:15 AM, MengYing Wang wrote:
> > Dear Sebastian,
> >
> > 1: Actually, /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/ is not
> > a "real" path; cas-curator is a symbolic link to the real folder cas-curator-0.6.
> >
> > $ greadlink -f /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
> >
> > Oh my god! When I change my original "symbolic virtual" path to the "real" path, the Nutch could
> > crawl the local files into my Solr now. Many thanks! Sebastian, you helped a lot! Thank you!
> >
> > 2: Also I have applied your new patch file, and the java.lang.NullPointerException error totally
> > disappears. Amazing! Thank you!
> >
> > $ ./nutch parsechecker "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/"
> > fetching: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> > parsing: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> > contentType: text/html
> > signature: 17bdb44990391c96bb8d48d1802ff11c
> > ---------
> > Url
> > ---------------
> >
> > file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> > ---------
> > ParseData
> > ---------
> >
> > Version: 5
> > Status: success(1,0)
> > Title: Index of /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
> > Outlinks: 2
> >   outlink: toUrl: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/ anchor: ../
> >   outlink: toUrl: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/monitor.xml anchor: monitor.xml
> > Content Metadata: Content-Length=352 nutch.crawl.score=0.0 Last-Modified=Tue, 14 Oct 2014 20:05:50 GMT Content-Type=text/html
> > Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252
> >
> > $ ./nutch indexchecker "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/"
> > fetching: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> > parsing: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> > contentType: text/html
> > content :Index of /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
> > Index of /Us
> > id :file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> > title :Index of /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
> > host :
> > digest :17bdb44990391c96bb8d48d1802ff11c
> > tstamp :Wed Oct 29 21:54:00 PDT 2014
> > url :file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> >
> > 3: Something is wrong with this tutorial: https://wiki.apache.org/nutch/IntranetDocumentSearch.
> > To index the local files in Solr, we also need to enable the "indexer-solr" plugin in the file
> > conf/nutch-site.xml, which is not mentioned there. Please add it too, so future users can easily
> > follow it step by step.
> >
> > Best,
> > Mengying (Angela) Wang
> >
> > On Mon, Oct 27, 2014 at 4:29 PM, Sebastian Nagel <[email protected]> wrote:
> > > Hi,
> > >
> > > thanks for testing!
> > >
> > > 1. is /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > > the "real" path? I.e., are there no symbolic links in the path?
> > > The command
> > >   readlink -f /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > > should show you whether this is the case or not.
> > > Because Parse objects results are stored by "real" path in the ParseResult
> > > this may cause a NPE, when there is no ParseResult available per original path.
> > >
> > > 2. unhappily, the log output is ambiguous. there are two places in ParserChecker where
> > > exceptions are caught with the same log message.
> > > Can you apply the attached patch and test again? Just to get more verbose log messages.
> > > If you have time, please, open a Jira to improve the logging in this case.
> > >
> > > Thanks,
> > > Sebastian
> > >
> > > On 10/26/2014 02:24 AM, Mengying Wang wrote:
> > > > Hi Sebastian,
> > > >
> > > > I have downloaded the Nutch source code from github (https://github.com/apache/nutch), applied the
> > > > patches (NUTCH-1879 and NUTCH-1880), and then reinstalled the Nutch. Now the good news is that all
> > > > urls contain only 1 slash. But unfortunately, java.lang.NullPointerException warning/error occurs
> > > > for both of the parsechecker and indexchecker commands.
> > > >
> > > > Below is the running log:
> > > >
> > > > $ ./nutch parsechecker "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/"
> > > > fetching: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > > > parsing: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > > > contentType: text/html
> > > > signature: 17bdb44990391c96bb8d48d1802ff11c
> > > > Couldn't pass score, url file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/ (java.lang.NullPointerException)
> > > > ---------
> > > > Url
> > > > ---------------
> > > >
> > > > file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> > > > ---------
> > > > ParseData
> > > > ---------
> > > >
> > > > Version: 5
> > > > Status: success(1,0)
> > > > Title: Index of /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
> > > > Outlinks: 2
> > > >   outlink: toUrl: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/ anchor: ../
> > > >   outlink: toUrl: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/monitor.xml anchor: monitor.xml
> > > > Content Metadata: Content-Length=352 nutch.crawl.score=0.0 Last-Modified=Tue, 14 Oct 2014 20:05:50 GMT Content-Type=text/html
> > > > Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252
> > > >
> > > > $ ./nutch indexchecker "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/"
> > > > fetching: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > > > parsing: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> > > > contentType: text/html
> > > > Exception in thread "main" java.lang.NullPointerException
> > > >     at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:139)
> > > >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > > >     at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:177)
> > > >
> > > > Thanks.
> > > > Mengying (Angela) Wang
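
Note on item 1 of the thread: the crawl started working once the seed URL used the physical path rather than the symlinked one. The check that readlink -f / greadlink -f performs there can be sketched in plain POSIX shell, which may help on systems where readlink -f is missing. This is only a sketch under stated assumptions: real_dir, is_real_dir and seed_url are hypothetical helper names (not part of Nutch), the seed is assumed to be an existing directory, and no URL escaping is attempted.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Nutch) for checking a file: seed path.

# Print the physical, symlink-free path of an existing directory.
# Uses only POSIX `cd` and `pwd -P`, so no readlink -f / greadlink is needed.
real_dir() {
  ( CDPATH= cd -- "$1" 2>/dev/null && pwd -P )
}

# Succeed only if the given absolute directory path is already the physical
# path, i.e. no component of it is a symbolic link. One trailing slash is ignored.
is_real_dir() {
  case "$1" in
    /) return 0 ;;
  esac
  [ "$(real_dir "$1")" = "${1%/}" ]
}

# Print a file: URL (single slash, trailing slash) for the physical path.
# Assumption: the path contains no characters that would need URL escaping.
seed_url() {
  real=$(real_dir "$1") || return 1
  case "$real" in
    /) printf 'file:/\n' ;;
    *) printf 'file:%s/\n' "$real" ;;
  esac
}
```

On the machine described in the thread, calling seed_url with the cas-curator path would be expected to print the cas-curator-0.6 form of the URL, matching the greadlink output quoted above; is_real_dir can be used beforehand to warn when a seed path goes through a symlink.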

