[
https://issues.apache.org/jira/browse/NUTCH-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15731376#comment-15731376
]
Rohith commented on NUTCH-1930:
-------------------------------
After debugging the gora-hbase jar, I found a solution for this issue.
Apparently the column families (headers, inlinks, outlinks, metadata and
markers) for some URLs were getting deleted at some point.
The gora-hbase connector was deleting the whole column family whenever no
qualifier was found for the field in gora-hbase-mapping.xml.
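The behaviour described here can be illustrated with a toy model. This is
only a sketch and not Gora's actual code: the row is a plain dict of
family -> {qualifier: value}, and `delete_field` and `sample_row` are
hypothetical helpers invented for the illustration.

```python
# Toy model only: this is NOT Gora's code, just an illustration of the
# behaviour described in the comment. All names here are hypothetical.

def delete_field(row, family, qualifier=None):
    """Delete one column, or the whole family when no qualifier is mapped."""
    if qualifier is None:
        # No qualifier in the mapping: the delete targets the entire family.
        row.pop(family, None)
    else:
        # Qualifier present: only that single column is removed.
        row.get(family, {}).pop(qualifier, None)
    return row


def sample_row():
    # "mk" is the markers family; the keys stand in for marker entries.
    return {"mk": {"GENERATE_MARK": "batch-1", "FETCH_MARK": "batch-1"}}


# Without a qualifier, every marker in the family disappears.
without_qualifier = delete_field(sample_row(), "mk")

# With a qualifier, only the named column is removed.
with_qualifier = delete_field(sample_row(), "mk", "FETCH_MARK")
```

In this model the first call leaves the row empty, which matches the symptom
in the report where both GENERATE_MARK and FETCH_MARK vanish together.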
All I had to do was add a qualifier for each of these column families:
<!-- score fields -->
<field name="score" family="s" qualifier="s"/>
<field name="headers" family="h" qualifier="h"/>
<field name="inlinks" family="il" qualifier="il"/>
<field name="outlinks" family="ol" qualifier="ol"/>
<field name="metadata" family="mtdt" qualifier="mtdt"/>
<field name="markers" family="mk" qualifier="mk"/>
After this change, none of the markers were deleted and all files were indexed.
Note: With Nutch 2.3.1 I have faced issues using HBase versions other than 0.98.8.
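
To check a mapping file for fields that still lack a qualifier, something
like the following sketch may help. It assumes the fields are plain
`<field>` elements with `name`, `family` and `qualifier` attributes, as in
the snippet from this comment. The `<mapping>` wrapper and the function
name `fields_missing_qualifier` are stand-ins made up for the example, not
part of Gora or Nutch.

```python
import xml.etree.ElementTree as ET


def fields_missing_qualifier(mapping_xml):
    """Return the names of <field> elements with no (or empty) qualifier."""
    root = ET.fromstring(mapping_xml.strip())
    return [
        field.get("name")
        for field in root.iter("field")
        if not field.get("qualifier")
    ]


# Hypothetical fragment: <mapping> is only a wrapper so the text parses.
SAMPLE = """
<mapping>
  <field name="score" family="s" qualifier="s"/>
  <field name="headers" family="h"/>
  <field name="markers" family="mk" qualifier="mk"/>
</mapping>
"""

missing = fields_missing_qualifier(SAMPLE)
print(missing)
```

For a real file, read its contents and pass the text to the same function.
Any name it reports is a field whose column family would be at risk under
the behaviour described in this comment.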
> Fetcher erases Markers for certain URLs / documents
> ---------------------------------------------------
>
> Key: NUTCH-1930
> URL: https://issues.apache.org/jira/browse/NUTCH-1930
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 2.3
> Reporter: Michiel
> Fix For: 2.5
>
>
> During an active crawling project, I noticed what appears to be a bug in the
> fetcher: the markers for certain pages (PDFs especially) are either not
> saved, or erased altogether. The pages are thus not parsed, nor updated in
> the DB. They keep appearing in the generate lists and fetch lists. Note that
> this is a separate issue from NUTCH-1922. That one involves correctly parsed
> pages. This bug prevents certain pages from getting correct markers set.
> Although I'm still new to Nutch and no Java expert, I'm currently trying to
> debug this. Because the error seems rather easy to replicate, it seemed
> sensible to share my findings so far. If I find out more myself, I'll
> update this issue.
> For this test, I injected two test URLs which never seemed to get parsed,
> even though they are valid documents which are not excluded by any filters. I
> use a http.content.limit of 64 MB, and Tika is used for parsing documents.
> Note that these are just two examples, I can provide more if needed.
> - http://www.aanvalopschooluitval.nl/userfiles/file/projectenbank/Flex%20Lectoraat.pdf
> - http://www.prettywoman-utrecht.nl/wp-content/uploads/PrettyWoman-methodiek_web.pdf
> Steps:
> 1) Whenever a batch gets generated, the GENERATE_MARK is set. So far so good.
> 2) During fetch, map() inside FetcherJob checks if this GENERATE_MARK is set.
> If so, it continues. Still, so far so good.
> 3) After fetch, output() inside FetcherReducer sets the FETCH_MARK. I've
> logged the marker, and it gets set with the correct batchId. It gets a value.
> 4) However, when another Nutch command is run, all the markers from these
> example URLs appear to have been erased. Not only is FETCH_MARK suddenly not
> set, GENERATE_MARK is also erased. Thus, the parser will think the URL hasn't
> been fetched yet. The fetchStatus, however, is nicely set to "2
> (status_fetched)". It's just the markers that are not correctly set.
> My first assumption was that FETCH_MARK was not saved. However, as noted in
> step 3), it gets the correct value. Also, GENERATE_MARK is erased after the
> process is complete, so something else goes wrong. Somewhere before the end
> of FetcherJob, the markers for certain pages are erased. Note that all other
> values, like content, baseUrl, fetchtimes and fetchStatus, are saved
> correctly for these URLs.
> Finally, for testing purposes, here is an example URL that DOES work:
> http://www.aanvalopschooluitval.nl/userfiles/file/2011/Plusvoorzieningenkrant.pdf
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)