[ https://issues.apache.org/jira/browse/NUTCH-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497115#comment-14497115 ]

Clement Mai commented on NUTCH-1930:
------------------------------------

I'm new to Nutch and I'm currently trying to use it to crawl some PDFs and 
index them into Elasticsearch.  I see the same problem when fetching a PDF 
larger than 2 MB: the markers are erased after the fetch job completes.  
However, it works fine with PDFs smaller than 2 MB.

Setting http.content.limit to >= 2147483648 gives an exception; you can't set 
a value above Integer.MAX_VALUE.
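
For illustration, a minimal sketch of why that fails, assuming the property is 
read with Hadoop's Configuration.getInt() (as HttpBase appears to do), which 
parses the value as a signed 32-bit int:

import org.apache.hadoop.conf.Configuration;

public class ContentLimitCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Integer.MAX_VALUE (2147483647) still parses fine ...
    conf.set("http.content.limit", String.valueOf(Integer.MAX_VALUE));
    System.out.println(conf.getInt("http.content.limit", 64 * 1024));

    // ... but 2147483648 no longer fits in an int, so getInt() throws
    // NumberFormatException -- the exception mentioned above.
    conf.set("http.content.limit", "2147483648");
    try {
      conf.getInt("http.content.limit", 64 * 1024);
    } catch (NumberFormatException e) {
      System.out.println("caught: " + e);
    }
  }
}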

I printed the markers in FetcherReducer output() and verified that they are 
correctly put into the WebPage.  I'm not clear how a missing Content-Length 
header could cause the markers to be erased.
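
For reference, the logging I mean is roughly of this shape (only a sketch, 
assuming the Gora-generated WebPage accessors in Nutch 2.3; the exact generic 
types of getMarkers() may differ between Gora versions):

import java.util.Map;

import org.apache.nutch.storage.WebPage;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MarkerDump {
  private static final Logger LOG = LoggerFactory.getLogger(MarkerDump.class);

  /** Log every marker currently set on the page, e.g. from FetcherReducer.output(). */
  public static void logMarkers(String url, WebPage page) {
    Map<CharSequence, CharSequence> markers = page.getMarkers();
    if (markers == null || markers.isEmpty()) {
      LOG.info("{}: no markers set", url);
      return;
    }
    // For the >2MB PDF this still shows _gnmrk_, _injmrk_, _ftcmrk_ and dist,
    // yet the mk column family ends up empty in HBase afterwards.
    for (Map.Entry<CharSequence, CharSequence> e : markers.entrySet()) {
      LOG.info("{}: {} = {}", url, e.getKey(), e.getValue());
    }
  }
}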

Thanks.

=============================
PDF size > 2MB, no markers in HBase
=============================
org.apache.nutch.fetcher.FetcherJob: fetching 
https://dev-web/fdnycfa/htmls/test8.pdf (queue crawl delay=100ms)
org.apache.nutch.protocol.httpclient.Http: http.content.limit = 20971520
org.apache.nutch.protocol.httpclient.Http: http.accept = 
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
org.apache.commons.httpclient.HttpMethodBase: Response content length is not 
known
org.apache.nutch.fetcher.FetcherJob: output content length: 8340405
org.apache.nutch.fetcher.FetcherJob: _gnmrk_ = 1429127685-28325
org.apache.nutch.fetcher.FetcherJob: _injmrk_ = y
org.apache.nutch.fetcher.FetcherJob: _ftcmrk_ = 1429127685-28325
org.apache.nutch.fetcher.FetcherJob: dist = 0
org.apache.nutch.fetcher.FetcherJob: -finishing thread FetcherThread10, 
activeThreads=5

hbase(main):069:0> get 
'TestCrawl_webpage','dev-web:https/fdnycfa/htmls/test8.pdf',{COLUMN => 'mk'}
COLUMN                CELL
0 row(s) in 0.0070 seconds


=============================
PDF size < 2MB
=============================
org.apache.nutch.fetcher.FetcherJob: fetching 
https://dev-web/fdnycfa/htmls/test2_006.pdf (queue crawl delay=100ms)
org.apache.nutch.protocol.httpclient.Http: http.content.limit = 20971520
org.apache.nutch.protocol.httpclient.Http: http.accept = 
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
org.apache.commons.httpclient.HttpMethodBase: Response content length is not 
known
org.apache.nutch.fetcher.FetcherJob: output content length: 2006860
org.apache.nutch.fetcher.FetcherJob: _gnmrk_ = 1429127685-28325
org.apache.nutch.fetcher.FetcherJob: _injmrk_ = y
org.apache.nutch.fetcher.FetcherJob: _ftcmrk_ = 1429127685-28325
org.apache.nutch.fetcher.FetcherJob: dist = 0
org.apache.nutch.fetcher.FetcherJob: -finishing thread FetcherThread9, 
activeThreads=7

hbase(main):007:0> get 
'TestCrawl_webpage','dev-web:https/fdnycfa/htmls/test2_006.pdf',{COLUMN => 'mk'}
COLUMN                CELL
 mk:_ftcmrk_          timestamp=1429127801312, value=1429127685-28325
 mk:_gnmrk_           timestamp=1429127801312, value=1429127685-28325
 mk:_injmrk_          timestamp=1429127801312, value=y
 mk:dist              timestamp=1429127801312, value=0
4 row(s) in 0.0480 seconds


> Fetcher erases Markers for certain URLs / documents
> ---------------------------------------------------
>
>                 Key: NUTCH-1930
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1930
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 2.3
>            Reporter: Michiel
>             Fix For: 2.4
>
>
> During an active crawling project, I noticed what appears to be a bug in the 
> fetcher: the markers for certain pages (PDFs especially) are either not 
> saved, or erased altogether. The pages are thus not parsed, nor updated in 
> the DB. They keep appearing in the generate lists and fetch lists. Note that 
> this is a separate issue from NUTCH-1922. That one involves correctly parsed 
> pages. This bug prevents certain pages from getting correct markers set.
> Although I'm still new to Nutch and no Java expert, I'm currently trying to 
> debug this. Since the error seems rather easy to replicate, it seemed 
> sensible to share my findings so far. If I find out more myself, I'll update 
> this issue.
> For this test, I injected two test URLs which never seemed to get parsed, 
> even though they are valid documents which are not excluded by any filters. I 
> use an http.content.limit of 64 MB, and Tika is used for parsing documents. 
> Note that these are just two examples; I can provide more if needed.
> - 
> http://www.aanvalopschooluitval.nl/userfiles/file/projectenbank/Flex%20Lectoraat.pdf
> - 
> http://www.prettywoman-utrecht.nl/wp-content/uploads/PrettyWoman-methodiek_web.pdf
> Steps:
> 1) Whenever a batch gets generated, the GENERATE_MARK is set. So far so good.
> 2) During fetch, map() inside FetcherJob checks if this GENERATE_MARK is set. 
> If so, it continues. Still, so far so good.
> 3) After fetch, output() inside FetcherReducer sets the FETCH_MARK. I've 
> logged the marker, and it gets set with the correct batchId. It gets a value.
> 4) However, when another Nutch command is run, all the markers for these 
> example URLs appear to have been erased. Not only is FETCH_MARK suddenly not 
> set, but GENERATE_MARK is erased as well. Thus, the parser will think the URL 
> hasn't been fetched yet. The fetchStatus, however, is nicely set to "2 
> (status_fetched)". It's just the markers that are not correctly set.
> My first assumption was that FETCH_MARK was not saved. However, as noted in 
> step 3), it gets the correct value. Also, GENERATE_MARK is erased after the 
> process is complete, so something else goes wrong. Somewhere before the end 
> of FetcherJob, the markers for certain pages are erased. Note that all other 
> values, like content, baseUrl, fetch times and fetchStatus, are saved 
> correctly for these URLs.
> Finally, for testing purposes, here is an example URL that DOES work: 
> http://www.aanvalopschooluitval.nl/userfiles/file/2011/Plusvoorzieningenkrant.pdf
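
For anyone tracing steps 1)-4) of the quoted description, here is a minimal 
sketch of how the marks travel, using the raw marker names visible in the logs 
above (_gnmrk_, _ftcmrk_). Nutch itself goes through a helper class for this, 
so treat the snippet only as an illustration of the flow, not as the actual 
fetcher code:

import org.apache.avro.util.Utf8;

import org.apache.nutch.storage.WebPage;

public class MarkRoundTrip {
  private static final Utf8 GENERATE_MARK = new Utf8("_gnmrk_");
  private static final Utf8 FETCH_MARK = new Utf8("_ftcmrk_");

  /** Step 2: the fetcher only keeps rows whose generate mark matches the batch id. */
  public static boolean shouldFetch(WebPage page, String batchId) {
    CharSequence mark =
        page.getMarkers() == null ? null : page.getMarkers().get(GENERATE_MARK);
    return mark != null && batchId.equals(mark.toString());
  }

  /** Step 3: after a successful fetch, the batch id is copied into the fetch mark. */
  public static void markFetched(WebPage page, String batchId) {
    page.getMarkers().put(FETCH_MARK, new Utf8(batchId));
  }

  /** Step 4: the parser later expects this mark; for the problem PDFs it has vanished. */
  public static boolean wasFetched(WebPage page) {
    return page.getMarkers() != null && page.getMarkers().get(FETCH_MARK) != null;
  }
}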



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
