[jira] [Commented] (NUTCH-1930) Fetcher erases Markers for certain URLs / documents

Rohith (JIRA) Wed, 07 Dec 2016 23:14:39 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15731376#comment-15731376
 ]


Rohith commented on NUTCH-1930:
-------------------------------

After debugging the gora-hbase jar, I found a solution for this issue.
The column families headers, inlinks, outlinks, metadata and markers were 
being deleted for some URLs at some point.
The gora-hbase connector deletes the whole column family if no qualifier is 
found for the field in gora-hbase-mapping.xml.
All I had to do was add a qualifier for each of these column families:
<!-- score fields                                       -->
        <field name="score" family="s" qualifier="s"/>
        <field name="headers" family="h" qualifier="h"/>
        <field name="inlinks" family="il" qualifier="il"/>
        <field name="outlinks" family="ol" qualifier="ol"/>
        <field name="metadata" family="mtdt" qualifier="mtdt"/>
        <field name="markers" family="mk" qualifier="mk"/>

With this change, none of the markers were deleted and all files were indexed.
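
To check an existing mapping file for the same problem, a small script can list 
every field that has no qualifier. This is only a sketch: it assumes the 
mapping uses <field name=... family=... qualifier=.../> elements as quoted 
above, and the sample document, its wrapper elements and the function name 
fields_without_qualifier are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only, modelled on the fields quoted above. The wrapper
# elements are an assumption about the file layout. "markers" deliberately
# has no qualifier so that the check has something to report.
SAMPLE_MAPPING = """
<gora-otd>
  <class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String" table="webpage">
    <field name="score" family="s" qualifier="s"/>
    <field name="headers" family="h" qualifier="h"/>
    <field name="markers" family="mk"/>
  </class>
</gora-otd>
"""


def fields_without_qualifier(mapping_xml):
    """Return (name, family) pairs for <field> elements lacking a qualifier."""
    root = ET.fromstring(mapping_xml)
    missing = []
    for field in root.iter("field"):
        # A missing or empty qualifier attribute is treated as "no qualifier".
        if not field.get("qualifier"):
            missing.append((field.get("name"), field.get("family")))
    return missing


if __name__ == "__main__":
    for name, family in fields_without_qualifier(SAMPLE_MAPPING):
        print(f"field {name!r} (family {family!r}) has no qualifier")
```

To run it against a real file, read the contents of gora-hbase-mapping.xml 
into a string and pass that to the function instead of the sample.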

Note: With Nutch 2.3.1, I have faced issues using HBase versions other than 0.98.8.

> Fetcher erases Markers for certain URLs / documents
> ---------------------------------------------------
>
>                 Key: NUTCH-1930
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1930
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 2.3
>            Reporter: Michiel
>             Fix For: 2.5
>
>
> During an active crawling project, I noticed what appears to be a bug in the 
> fetcher: the markers for certain pages (PDFs especially) are either not 
> saved, or erased altogether. The pages are thus not parsed, nor updated in 
> the DB. They keep appearing in the generate lists and fetch lists. Note that 
> this is a separate issue from NUTCH-1922. That one involves correctly parsed 
> pages. This bug prevents certain pages from getting correct markers set.
> Although I'm still new to Nutch and no Java expert, I'm currently trying to 
> debug this. Because the error seems rather easy to replicate, it seemed 
> sensible to share my findings so far. If I find out more myself, I'll 
> update this issue.
> For this test, I injected two test URLs which never seemed to get parsed, 
> even though they are valid documents which are not excluded by any filters. I 
> use an http.content.limit of 64 MB, and Tika is used for parsing documents. 
> Note that these are just two examples; I can provide more if needed.
> - 
> http://www.aanvalopschooluitval.nl/userfiles/file/projectenbank/Flex%20Lectoraat.pdf
> - 
> http://www.prettywoman-utrecht.nl/wp-content/uploads/PrettyWoman-methodiek_web.pdf
> Steps:
> 1) Whenever a batch gets generated, the GENERATE_MARK is set. So far so good.
> 2) During fetch, map() inside FetcherJob checks if this GENERATE_MARK is set. 
> If so, it continues. Still, so far so good.
> 3) After fetch, output() inside FetcherReducer sets the FETCH_MARK. I've 
> logged the marker, and it gets set with the correct batchId. It gets a value.
> 4) However, when another nutch command is run, all the markers from these 
> example URLs appear to have been erased. Not only is FETCH_MARK suddenly not 
> set, GENERATE_MARK is also erased. Thus, the parser will think the URL hasn't 
> been fetched yet. The fetchStatus, however, is nicely set to "2 
> (status_fetched)". It's just the markers that are not correctly set.
> My first assumption was that FETCH_MARK was not saved. However, as noted in 
> step 3), it gets the correct value. Also, GENERATE_MARK is erased after the 
> process is complete, so something else goes wrong. Somewhere before the end 
> of FetcherJob, the markers for certain pages are erased. Note that all other 
> values, like content, baseUrl, fetchtimes and fetchStatus, are saved 
> correctly for these URLs.
> Finally, for testing purposes, here is an example URL that DOES work: 
> http://www.aanvalopschooluitval.nl/userfiles/file/2011/Plusvoorzieningenkrant.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
