[
https://issues.apache.org/jira/browse/NUTCH-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497115#comment-14497115
]
Clement Mai commented on NUTCH-1930:
------------------------------------
I'm new to Nutch and I'm currently trying to use it to crawl some PDFs and
index them into Elasticsearch. I see the same problem when fetching a PDF
larger than 2 MB: the markers are erased after the fetch job completes.
However, it works fine with PDFs smaller than 2 MB.
Setting http.content.limit to 2147483648 or higher throws an exception; you
can't set a value above Integer.MAX_VALUE (2147483647).
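For reference, here is a minimal sketch of why the limit tops out there. The
ContentLimitCheck class is mine, not Nutch code; it assumes the protocol
plugins read the value as a Java int through Hadoop's Configuration.getInt,
roughly like HttpBase's conf.getInt("http.content.limit", 64 * 1024):

import org.apache.hadoop.conf.Configuration;

public class ContentLimitCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Largest value that still parses as an int (about 2 GiB).
    conf.set("http.content.limit", "2147483647");
    System.out.println(conf.getInt("http.content.limit", 64 * 1024));

    // One past Integer.MAX_VALUE: Configuration.getInt() calls Integer.parseInt(),
    // which throws NumberFormatException (the exception mentioned above).
    conf.set("http.content.limit", "2147483648");
    try {
      conf.getInt("http.content.limit", 64 * 1024);
    } catch (NumberFormatException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}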
I printed the markers in FetcherReducer output() and verified that they are
correctly put into the WebPage (see the logging sketch after the log excerpts
below). I don't understand how a missing Content-Length header could cause
the markers to be erased.
Thanks.
=============================
PDF size > 2MB, no markers in HBase
=============================
org.apache.nutch.fetcher.FetcherJob: fetching
https://dev-web/fdnycfa/htmls/test8.pdf (queue crawl delay=100ms)
org.apache.nutch.protocol.httpclient.Http: http.content.limit = 20971520
org.apache.nutch.protocol.httpclient.Http: http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
org.apache.commons.httpclient.HttpMethodBase: Response content length is not
known
org.apache.nutch.fetcher.FetcherJob: output content length: 8340405
org.apache.nutch.fetcher.FetcherJob: _gnmrk_ = 1429127685-28325
org.apache.nutch.fetcher.FetcherJob: _injmrk_ = y
org.apache.nutch.fetcher.FetcherJob: _ftcmrk_ = 1429127685-28325
org.apache.nutch.fetcher.FetcherJob: dist = 0
org.apache.nutch.fetcher.FetcherJob: -finishing thread FetcherThread10,
activeThreads=5
hbase(main):069:0> get
'TestCrawl_webpage','dev-web:https/fdnycfa/htmls/test8.pdf',{COLUMN => 'mk'}
COLUMN CELL
0 row(s) in 0.0070 seconds
=============================
PDF size < 2MB, markers present in HBase
=============================
org.apache.nutch.fetcher.FetcherJob: fetching
https://dev-web/fdnycfa/htmls/test2_006.pdf (queue crawl delay=100ms)
org.apache.nutch.protocol.httpclient.Http: http.content.limit = 20971520
org.apache.nutch.protocol.httpclient.Http: http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
org.apache.commons.httpclient.HttpMethodBase: Response content length is not
known
org.apache.nutch.fetcher.FetcherJob: output content length: 2006860
org.apache.nutch.fetcher.FetcherJob: _gnmrk_ = 1429127685-28325
org.apache.nutch.fetcher.FetcherJob: _injmrk_ = y
org.apache.nutch.fetcher.FetcherJob: _ftcmrk_ = 1429127685-28325
org.apache.nutch.fetcher.FetcherJob: dist = 0
org.apache.nutch.fetcher.FetcherJob: -finishing thread FetcherThread9,
activeThreads=7
hbase(main):007:0> get
'TestCrawl_webpage','dev-web:https/fdnycfa/htmls/test2_006.pdf',{COLUMN => 'mk'}
COLUMN CELL
mk:_ftcmrk_ timestamp=1429127801312, value=1429127685-28325
mk:_gnmrk_ timestamp=1429127801312, value=1429127685-28325
mk:_injmrk_ timestamp=1429127801312, value=y
mk:dist timestamp=1429127801312, value=0
4 row(s) in 0.0480 seconds
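For reference, this is roughly the helper I used to dump the markers from
output(). It is only an illustrative sketch: the MarkerDump class name is
mine, and it assumes the Nutch 2.3 storage API (org.apache.nutch.storage.Mark
and WebPage.getMarkers()):

import java.util.Map;

import org.apache.nutch.storage.Mark;
import org.apache.nutch.storage.WebPage;

public class MarkerDump {

  // Dump every marker currently set on a WebPage, plus the fetch mark explicitly,
  // so the state written in FetcherReducer.output() can be compared with what
  // later shows up (or not) in the HBase 'mk' column family.
  public static void dumpMarkers(String url, WebPage page) {
    for (Map.Entry<?, ?> e : page.getMarkers().entrySet()) {
      System.out.println(url + " marker " + e.getKey() + " = " + e.getValue());
    }
    // checkMark() returns null when the marker is absent, e.g. after it was erased.
    System.out.println(url + " fetch mark: " + Mark.FETCH_MARK.checkMark(page));
  }
}

Called at the end of output(), it shows the markers with the correct batch id
for both the large and the small PDF (as in the logs above), so whatever
erases them must happen later.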
> Fetcher erases Markers for certain URLs / documents
> ---------------------------------------------------
>
> Key: NUTCH-1930
> URL: https://issues.apache.org/jira/browse/NUTCH-1930
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 2.3
> Reporter: Michiel
> Fix For: 2.4
>
>
> During an active crawling project, I noticed what appears to be a bug in the
> fetcher: the markers for certain pages (PDFs especially) are either not
> saved, or erased altogether. The pages are thus not parsed, nor updated in
> the DB. They keep appearing in the generate lists and fetch lists. Note that
> this is a separate issue from NUTCH-1922: that one involves correctly parsed
> pages, whereas this bug prevents certain pages from getting their markers set
> correctly.
> Although I'm still new to Nutch and no Java expert, I'm currently trying to
> debug this. Since the error seems rather easy to replicate, it seemed sensible
> to share my findings so far. If I find out more myself, I'll update this issue.
> For this test, I injected two test URLs which never seemed to get parsed,
> even though they are valid documents which are not excluded by any filters. I
> use an http.content.limit of 64 MB, and Tika is used for parsing documents.
> Note that these are just two examples; I can provide more if needed.
> -
> http://www.aanvalopschooluitval.nl/userfiles/file/projectenbank/Flex%20Lectoraat.pdf
> -
> http://www.prettywoman-utrecht.nl/wp-content/uploads/PrettyWoman-methodiek_web.pdf
> Steps:
> 1) Whenever a batch gets generated, the GENERATE_MARK is set. So far so good.
> 2) During fetch, map() inside FetcherJob checks if this GENERATE_MARK is set.
> If so, it continues. Still, so far so good.
> 3) After fetch, output() inside FetcherReducer sets the FETCH_MARK. I've
> logged this marker, and it is set with the correct batchId as its value.
> 4) However, when another nutch command is run, all the markers from these
> example URLs appear to have been erased. Not only is FETCH_MARK suddenly not
> set, GENERATE_MARK is also erased. Thus, the parser will think the URL hasn't
> been fetched yet. The fetchStatus, however, is nicely set to "2
> (status_fetched)". It's just the markers that are not correctly set.
> My first assumption was that FETCH_MARK was not saved. However, as noted in
> step 3), it gets the correct value. Also, GENERATE_MARK is erased after the
> process is complete, so something else goes wrong. Somewhere before the end
> of FetcherJob, the markers for certain pages are erased. Note that all other
> values, like content, baseUrl, fetchtimes and fetchStatus, are saved
> correctly for these URLs.
> Finally, for testing purposes, here is an example URL that DOES work:
> http://www.aanvalopschooluitval.nl/userfiles/file/2011/Plusvoorzieningenkrant.pdf
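Regarding step 4 of the description above, a minimal sketch of the kind of
check the parse step performs may help explain the symptom: if FETCH_MARK is
missing or does not match the current batch, the page is treated as not
fetched and skipped, even though fetchStatus is already 2 (status_fetched).
This is not the actual ParserJob code; the ParseEligibility class name is
mine, and it assumes the Nutch 2.3 storage API:

import org.apache.nutch.storage.Mark;
import org.apache.nutch.storage.WebPage;

public class ParseEligibility {

  // A page whose FETCH_MARK was erased (or set for a different batch) looks
  // un-fetched to the parse step and is skipped, which matches the symptom of
  // these URLs reappearing in generate/fetch lists without ever being parsed.
  public static boolean shouldParse(WebPage page, String batchId) {
    CharSequence fetchMark = Mark.FETCH_MARK.checkMark(page);
    return fetchMark != null && fetchMark.toString().equals(batchId);
  }
}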