Hi Mengying,

great!

> When I changed my original "symbolic link" path to the "real" path,
> Nutch could crawl the local files

In fact, path normalization is good here, otherwise you could end up with many
duplicates. But the protocol-file plugin could make this more explicit.
We could think about treating such paths as redirects: that's conceptually
close.
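
To make the redirect idea a bit more concrete, here is a rough sketch in
plain Java. It is not the protocol-file code: the class RealPathCheck and the
method redirectTarget are made-up names, and the only assumption is that
java.nio's Path.toRealPath() resolves symbolic links much like readlink -f
does for existing files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class RealPathCheck {

    /**
     * Hypothetical helper: returns the symlink-free ("real") path when it
     * differs from the requested one, i.e. the path a crawler could treat
     * as a redirect target. Returns null when no redirect is needed.
     */
    static Path redirectTarget(Path requested) throws IOException {
        Path real = requested.toRealPath();                     // follows symbolic links
        Path absolute = requested.toAbsolutePath().normalize(); // purely lexical clean-up
        return real.equals(absolute) ? null : real;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory with a symlink, mirroring
        // cas-curator -> cas-curator-0.6 from this thread.
        // (Creating symbolic links may need extra privileges on Windows.)
        Path dir = Files.createTempDirectory("realpath-demo").toRealPath();
        Path target = Files.createDirectory(dir.resolve("cas-curator-0.6"));
        Path link = Files.createSymbolicLink(dir.resolve("cas-curator"), target);

        System.out.println(redirectTarget(link));   // the cas-curator-0.6 path
        System.out.println(redirectTarget(target)); // null: already the real path
    }
}
```

With something like this, a fetch of the symlinked path would be reported as
a redirect to the real path instead of silently returning content under a
different URL.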

> 2: I have also applied your new patch file, and the
> java.lang.NullPointerException error has completely disappeared.
> Amazing! Thank you!

Perfect!

If you have time, please open Jira issues for the two problems.
If not, let me know, and I'll do it.

Thanks for testing!

Best,
Sebastian

On 10/30/2014 06:15 AM, MengYing Wang wrote:
> Dear Sebastian,
> 
> 1: Actually,
> /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> is not a "real" path: cas-curator is a symbolic link to the real folder
> cas-curator-0.6.
> 
> $ greadlink -f /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
> /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
> 
> Oh my god! When I changed my original "symbolic link" path to the "real"
> path, Nutch could crawl the local files into my Solr. Many thanks!
> Sebastian, you helped a lot! Thank you!
> 
> 2: I have also applied your new patch file, and the
> java.lang.NullPointerException error has completely disappeared.
> Amazing! Thank you!
> 
> $ ./nutch parsechecker "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/"
> fetching: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> parsing: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> contentType: text/html
> signature: 17bdb44990391c96bb8d48d1802ff11c
> ---------
> Url
> ---------------
>
> file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> ---------
> ParseData
> ---------
>
> Version: 5
> Status: success(1,0)
> Title: Index of /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
> Outlinks: 2
>   outlink: toUrl: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/ anchor: ../
>   outlink: toUrl: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/monitor.xml anchor: monitor.xml
> Content Metadata: Content-Length=352 nutch.crawl.score=0.0 Last-Modified=Tue, 14 Oct 2014 20:05:50 GMT Content-Type=text/html
> Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252
> 
> 
> $ ./nutch indexchecker "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/"
> fetching: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> parsing: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> contentType: text/html
> content :Index of /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
> Index of /Us
> id :file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> title :Index of /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
> host :
> digest :17bdb44990391c96bb8d48d1802ff11c
> tstamp :Wed Oct 29 21:54:00 PDT 2014
> url :file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
> 
> 3: Something is missing from this tutorial:
> https://wiki.apache.org/nutch/IntranetDocumentSearch. To index the local
> files in Solr, we also need to enable the "indexer-solr" plugin in
> conf/nutch-site.xml, which is not mentioned there. Please add it, so that
> future users can easily follow the tutorial step by step.
> 
> 
> Best,
> 
> Mengying (Angela) Wang
> 
> On Mon, Oct 27, 2014 at 4:29 PM, Sebastian Nagel wrote:
> 
>     Hi,
> 
>     thanks for testing!
> 
>     1. Is /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
>        the "real" path, i.e., are there no symbolic links in the path?
>        The command
>          readlink -f /Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
>        should show you whether this is the case or not.
>        Because Parse objects are stored by "real" path in the ParseResult,
>        this may cause an NPE when there is no ParseResult available for the
>        original path.
> 
>     2. Unfortunately, the log output is ambiguous: there are two places in
>        ParserChecker where exceptions are caught with the same log message.
>        Can you apply the attached patch and test again, just to get more
>        verbose log messages?
>        If you have time, please open a Jira to improve the logging in this case.
> 
>     Thanks,
>     Sebastian
> 
>     On 10/26/2014 02:24 AM, Mengying Wang wrote:
>     > Hi Sebastian,
>     >
>     > I have downloaded the Nutch source code from GitHub
>     > (https://github.com/apache/nutch), applied the patches (NUTCH-1879 and
>     > NUTCH-1880), and then reinstalled Nutch. The good news is that all URLs
>     > now contain only one slash. Unfortunately, a java.lang.NullPointerException
>     > warning/error occurs for both the parsechecker and indexchecker commands.
>     >
>     > Below is the running log:
>     >
>     > $ ./nutch parsechecker "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/"
>     > fetching: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
>     > parsing: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
>     > contentType: text/html
>     > signature: 17bdb44990391c96bb8d48d1802ff11c
>     > Couldn't pass score, url file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/ (java.lang.NullPointerException)
>     > ---------
>     > Url
>     > ---------------
>     >
>     > file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/
>     > ---------
>     > ParseData
>     > ---------
>     >
>     > Version: 5
>     > Status: success(1,0)
>     > Title: Index of /Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml
>     > Outlinks: 2
>     >   outlink: toUrl: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/ anchor: ../
>     >   outlink: toUrl: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator-0.6/staging/products/xml/monitor.xml anchor: monitor.xml
>     > Content Metadata: Content-Length=352 nutch.crawl.score=0.0 Last-Modified=Tue, 14 Oct 2014 20:05:50 GMT Content-Type=text/html
>     > Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252
>     >
>     >
>     > $ ./nutch indexchecker "file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/"
>     > fetching: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
>     > parsing: file:/Users/AngelaWang/Documents/programs/oodt/cas-curator/staging/products/xml/
>     > contentType: text/html
>     > Exception in thread "main" java.lang.NullPointerException
>     > at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:139)
>     > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     > at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:177)
>     >
>     > Thanks.
>     > Mengying (Angela) Wang
> 
> 
> 
> 
> -- 
> Best,
> Mengying (Angela) Wang
